From: verec
Subject: reduced size symbols/keywords
Date: 
Message-ID: <48b78b82$0$623$5a6aecb4@news.aaisp.net.uk>
Toying, toying ... rather than doing useful work ... (Kenny must be
right after all :-)

Out of sheer curiosity (rather than need) I'm wondering if there
is a way to reduce the size (memory footprint) of symbols.

CL-USER 11 > (describe 'keyword)

keyword is a symbol
name          "KEYWORD"
value         #<unbound value>
function      #<unbound function>
plist         (type::direct-type-predicate keywordp)
package       #<The COMMON-LISP package, 3/4 internal, 978/1024 external>

Here I'm counting five slots, for a (32 bit machine) total size
of 20 bytes, just for the structure, when the actual "payload"
is only 12 bytes, or about 60% "waste"

So, I'm wondering if there's an (entirely portable) way to define
"trimmed down" symbols that would exhibit much less overhead.

Obviously, I thought that keywords must be that trimmed down version
of symbols ... except they are not:

CL-USER 12 > (describe :keyword)

:keyword is a symbol
name          "KEYWORD"
value         :keyword
function      #<unbound function>
plist         nil
package       #<The KEYWORD package, 0/4 internal, 5196/8192 external>

This question is not (entirely) rhetorical, as I'm thinking of some
application reading 1,000,000s of words out of text files and turning them
into symbols/keywords for processing. Cutting down the memory
footprint by half or more would be extremely significant, while still
preserving "most" of the properties of symbols and sacrificing a few,
i.e., while

CL-USER 15 > (setf (symbol-function :cons) #'cons)

Error: Defining function :cons visible from package KEYWORD.
  1 (continue) Redefine it anyway.
  2 (abort) Return to level 0.
  3 Return to top loop level 0.

Type :b for backtrace, :c <option number> to proceed,  or :? for other options

CL-USER 16 : 1 > :c 1
#<Function ((compiler::def-nil-function cons) . cons) 204AD802>

CL-USER 17 > (:cons 1 2)
(1 . 2)

is cute, I'm OK with those new keywords/symbols forgoing everything
(function, plist, package) but the value slot.

How would I go about

(defun/defmacro word (..) ... )

such that

(word 'some-word)

and

CL-USER 19 > (describe 'some-word)

some-word is a symbol
name          "SOME-WORD"

And that's it?

Many Thanks
--
JFB

From: Kenny
Subject: Re: reduced size symbols/keywords
Date: 
Message-ID: <48b7c792$0$29518$607ed4bc@cv.net>
verec wrote:
> Toying, toying ... rather than doing useful work ... (Kenny must be
> right after all :-)

No, I was wrong: I pegged you as a troll, you are still Lisping.

What is wrong with defstruct? Your problem description includes the 
solution: symbols are too heavyweight for your requirement. (Solution: 
don't use symbols. Or buy some RAM.)

hth,kenny
From: Tim Bradshaw
Subject: Re: reduced size symbols/keywords
Date: 
Message-ID: <16301560-381c-4e7b-8e58-6a5a976ef103@c58g2000hsc.googlegroups.com>
On Aug 29, 6:39 am, verec <·····@mac.com> wrote:
>
> This question is not (entirely) rhetoric, as I'm thinking of some
> application reading 1,000,000s words out of text files and turning them
> into symbols/keywords for processing. Cutting down the memory
> footprint by half or more would be extremely significant, while still
> preserving "most" of the properties of symbols, while sacrifing a few,
> ie, while

The thing you are looking for is a string. You want to have only one
instance of each string: that's what a hashtable does.

If you want your lexer to avoid consing even temporary instances of
the strings it reads, then you probably want to use a trie (unlike the
above two you will have to write your own of these).  Generally it is
probably not worth the effort though, unless you have other reasons to
want to use a trie - the ephemeral string consing will be pretty
cheap.
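
A minimal sketch of the trie idea, for illustration only. The names
TRIE-NODE, *WORD-TRIE*, *WORD-BUFFER* and INTERN-WORD-FROM-STREAM are
hypothetical, a "word" is assumed to be a run of alphabetic characters, and
a reused buffer collects the characters so that a fresh string is allocated
only the first time a word is seen:

```lisp
;; Sketch only: all names here are made up for this example.
(defstruct trie-node
  (children '())   ; alist of (character . trie-node)
  (word nil))      ; canonical string, set the first time the word is seen

(defvar *word-trie* (make-trie-node))

;; Reused across calls, so reading an already-known word conses no string.
(defvar *word-buffer*
  (make-array 32 :element-type 'character :adjustable t :fill-pointer 0))

(defun intern-word-from-stream (stream)
  "Read a run of alphabetic characters from STREAM and return the unique
string for it, or NIL if the next character does not start a word."
  (setf (fill-pointer *word-buffer*) 0)
  (let ((node *word-trie*))
    (loop for char = (peek-char nil stream nil nil)
          while (and char (alpha-char-p char))
          do (read-char stream)
             (vector-push-extend char *word-buffer*)
             ;; Walk down one level, creating the child node if needed.
             (let ((entry (assoc char (trie-node-children node))))
               (unless entry
                 (setf entry (cons char (make-trie-node)))
                 (push entry (trie-node-children node)))
               (setf node (cdr entry))))
    (cond ((zerop (fill-pointer *word-buffer*)) nil)
          ((trie-node-word node))
          (t (setf (trie-node-word node) (copy-seq *word-buffer*))))))
```

Two reads of the same word return the same (EQ) string. The alist of
children keeps the sketch short; a real lexer might prefer a vector or
hash table per node.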

It is true that Neanderthal Lisp programs did use symbols for strings
a lot, but I don't think people would generally do that now (outside
of AI depts, where most of the Lisp programmers are still pretty
neanderthal).
From: Marco Antoniotti
Subject: Keyword abuse (Re: reduced size symbols/keywords)
Date: 
Message-ID: <1b93f63a-623a-4daa-8b2b-8a31bd41b607@m73g2000hsh.googlegroups.com>
On Aug 29, 4:15 pm, Tim Bradshaw <··········@tfeb.org> wrote:
> On Aug 29, 6:39 am, verec <·····@mac.com> wrote:
>
>
>
> > This question is not (entirely) rhetoric, as I'm thinking of some
> > application reading 1,000,000s words out of text files and turning them
> > into symbols/keywords for processing. Cutting down the memory
> > footprint by half or more would be extremely significant, while still
> > preserving "most" of the properties of symbols, while sacrifing a few,
> > ie, while
>
> The thing you are looking for is a string. You want to have only one
> instance of each string: that's what a hashtable does.
>
> If you want your lexer to avoid consing even temporary instances of
> the strings it reads, then you probably want to use a trie (unlike the
> above two you will have to write your own of these).  Generally it is
> probably not worth the effort though, unless you have other reasons to
> want to use a trie - the ephemeral string consing will be pretty
> cheap.
>
> It is true that Neanderthal Lisp programs did use symbols for strings
> a lot, but I don't think people would generally do that now (outside
> of AI depts, where most of the Lisp programmers are still pretty
> neanderthal).

Apart from all the previous comments which I agree with, there is an
issue that has been nagging me for some time.  I see several libraries
(I won't name names!) that "abuse" keywords.  The typical example is
the home brew HTML replacement that does the following

(:html (:head ...) (:body (:h1 ... (:ul ...))))

you get the idea.

I would advocate against this (and similar) use of keywords.  Packaged
symbols are perfect for this kind of job.

Cheers
--
Marco
From: John Thingstad
Subject: Re: Keyword abuse (Re: reduced size symbols/keywords)
Date: 
Message-ID: <op.ugnjdq1aut4oq5@pandora.alfanett.no>
On Fri, 29 Aug 2008 16:01:09 +0200, Marco Antoniotti
<·······@gmail.com> wrote:

>
> Apart from all the previous comments which I agree with, there is an
> issue that has been nagging me for some time.  I see several libraries
> (I won't name names!) that "abuse" keywords.  The typical example is
> the home brew HTML replacement that does the following
>
> (:html (:head ...) (:body (:h1 ... (:ul ...))))
>
> you get the idea.
>
> I would advocate against this (and similar) use of keywords.  Packaged
> symbols are perfect for this kind of job.
>

Why is this abuse?
The idea of keywords is that they are visible in all packages, which in this
case is exactly what you want.

(HTML:HTML (HTML:HEAD ...) (HTML:BODY ...))
would be a real pain. After all, HTML is typically in use in several
packages and it is important that they refer to the same symbol.
If something unrelated to HTML uses a key like :body, no harm is done.
If you have a 'body local to the package and a 'body in a different
package, they are not the same symbol.
This could cause errors.
It is far better to have it all in one place: the KEYWORD package.

--------------
John Thingstad
From: Marco Antoniotti
Subject: Re: Keyword abuse (Re: reduced size symbols/keywords)
Date: 
Message-ID: <8461f16e-f363-4339-b0c7-8d94732796f7@k30g2000hse.googlegroups.com>
On Aug 29, 5:18 pm, "John Thingstad" <·······@online.no> wrote:
> On Fri, 29 Aug 2008 16:01:09 +0200, Marco Antoniotti
> <·······@gmail.com> wrote:
>
>
>
> > Apart from all the previous comments which I agree with, there is an
> > issue that has been nagging me for some time.  I see several libraries
> > (I won't name names!) that "abuse" keywords.  The typical example is
> > the home brew HTML replacement that does the following
>
> > (:html (:head ...) (:body (:h1 ... (:ul ...))))
>
> > you get the idea.
>
> > I would advocate against this (and similar) use of keywords.  Packaged
> > symbols are perfect for this kind of job.
>
> Why is this abuse?

If not abuse it is certainly over-use.

> The idea of keywords is they are visible in all packages. Which in this  
> case is exactly what you want.
>
> (HTML:HTML (HTML:HEAD ...) (HTML:BODY...)
> would be a real pain. After all HTML is typically in use on several  
> packages and it is important that the refer to the same symbol.
> If something unrelated to HTML uses the key like :body no harm is done.
> If you have a 'body local to the package and a 'body in a different  
> package they are not the same symbol.
> This could cause errors.
> It is far better to have it all in one place the KEYWORD package.

The HTML case is only an example.  And yes.  I find it better to use
(MY-HTML:HTML (MY-HTML:HEAD ....)); this way I would get welcome
errors when I tried to overuse the structure.

Another example comes from libraries that define various kinds of "low
level" types.  Take unsigned "int"s.  The problem is that :UINT in
library A may mean something different from :UINT in library B.

All in all I know that this overuse of keywords does not cause much
pain to many people.  I just advocate to tone it down, as proper
packaged symbols in place of keywords are IMHO safer. YMMV and I do
not have any data to support this hunch of mine.

Cheers
--
Marco




>
> --------------
> John Thingstad
From: Tim Bradshaw
Subject: Re: Keyword abuse (Re: reduced size symbols/keywords)
Date: 
Message-ID: <43402016-c753-4ac2-8676-4d896c727cde@25g2000hsx.googlegroups.com>
On Aug 29, 3:18 pm, "John Thingstad" <·······@online.no> wrote:

> (HTML:HTML (HTML:HEAD ...) (HTML:BODY...)
> would be a real pain. After all HTML is typically in use on several  
> packages and it is important that the refer to the same symbol.
> If something unrelated to HTML uses the key like :body no harm is done.
> If you have a 'body local to the package and a 'body in a different  
> package they are not the same symbol.
> This could cause errors.
> It is far better to have it all in one place the KEYWORD package.

And in fact it is the only realistic option if you want to remain
agnostic about what tags can exist in an HTML/XML document, unless you
enjoy spending a huge amount of time declaring all the tags in
advance.

--tim
From: Scott Burson
Subject: Re: Keyword abuse (Re: reduced size symbols/keywords)
Date: 
Message-ID: <677257e7-74ec-4e15-9fc2-d3c8583359f7@a8g2000prf.googlegroups.com>
On Aug 31, 3:13 am, Tim Bradshaw <··········@tfeb.org> wrote:
> On Aug 29, 3:18 pm, "John Thingstad" <·······@online.no> wrote:
> > It is far better to have it all in one place the KEYWORD package.
>
> And in fact it is the only realistic option if you want to remain
> agnostic about what tags can exist in an HTML/XML document, unless you
> enjoy spending a huge amount of time declaring all the tags in
> advance.

And as long as you're not setting the symbol-value or symbol-function,
what does it matter?  Even the plist can easily be used without
collision -- just make sure your property indicators are NOT keywords.
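
A small sketch of what that looks like in practice. LIB-A and LIB-B are
hypothetical packages standing in for two unrelated libraries, and the KIND
indicator is made up for the example:

```lisp
;; Two hypothetical libraries, each with its own package.
(defpackage :lib-a (:use :cl))
(defpackage :lib-b (:use :cl))

;; Both annotate the same keyword, each under its own non-keyword indicator.
(setf (get :body 'lib-a::kind) :html-element)
(setf (get :body 'lib-b::kind) :mail-part)

(get :body 'lib-a::kind)   ; => :HTML-ELEMENT
(get :body 'lib-b::kind)   ; => :MAIL-PART
```

Because LIB-A::KIND and LIB-B::KIND are different symbols, neither library
sees or overwrites the other's property on :BODY.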

-- Scott
From: Tim Bradshaw
Subject: Re: Keyword abuse (Re: reduced size symbols/keywords)
Date: 
Message-ID: <5b8669ce-92d1-40b3-a319-069e37f275ed@2g2000hsn.googlegroups.com>
On Aug 31, 6:12 pm, Scott Burson <········@gmail.com> wrote:

> And as long as you're not setting the symbol-value or symbol-function,
> what does it matter?  Even the plist can easily be used without
> collision -- just make sure your property indicators are NOT keywords.

In the implementation I'm familiar with the only thing the symbols
were used for was their names.  So really the only problem is that it
interned a bunch of symbols in the KEYWORD package.  In fact it could
happily work with any other package, or with any other object (there
is a hook in the code which lets you define what it counts as an HTML
element).  The advantages of using keywords are: easy to type, always
available, do not name functions etc. (though maybe they can, I forget
what the story is here).
From: Michael Weber
Subject: Re: Keyword abuse (Re: reduced size symbols/keywords)
Date: 
Message-ID: <e61be166-5ea1-4b3d-8003-6fb948a8ad57@x35g2000hsb.googlegroups.com>
On Aug 31, 12:13 pm, Tim Bradshaw <··········@tfeb.org> wrote:
> On Aug 29, 3:18 pm, "John Thingstad" <·······@online.no> wrote:
> > (HTML:HTML (HTML:HEAD ...) (HTML:BODY...)
> > would be a real pain. After all HTML is typically in use on several  
> > packages and it is important that the refer to the same symbol.

I can import symbols if I don't like to use them qualified.  Thereby I
waive certain rights over what I can do with the symbol, but that is the
price of convenience.

> > If something unrelated to HTML uses the key like :body no harm is done.

Not true. If your idea of body disagrees with mine (for example,
because I want to use XHTML and you use HTML), there is no way to keep
them apart.  With symbols from my own package, there is.

See also CLHS 11.1.2.3.2: http://www.lispworks.com/documentation/HyperSpec/Body/11_abcb.htm
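
To make the distinction concrete, a short sketch with two hypothetical
packages, MY-HTML and MY-XHTML, that each export a BODY symbol:

```lisp
;; Hypothetical packages; neither uses CL, so they own their BODY symbols.
(defpackage :my-html  (:use) (:export #:body))
(defpackage :my-xhtml (:use) (:export #:body))

(eq 'my-html:body 'my-xhtml:body)       ; => NIL, two distinct symbols
(string= 'my-html:body 'my-xhtml:body)  ; => T, same name "BODY"
(eq :body :body)                        ; => T, one keyword shared by everyone
```

Anything attached to MY-HTML:BODY (a property, a method specializer, a table
entry keyed with EQ) stays separate from MY-XHTML:BODY, which is not possible
when both sides use :BODY.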

> > If you have a 'body local to the package and a 'body in a different  
> > package they are not the same symbol.
> > This could cause errors.

Or it could be exactly what was wanted.

> > It is far better to have it all in one place the KEYWORD package.
>
> And in fact it is the only realistic option if you want to remain
> agnostic about what tags can exist in an HTML/XML document, unless you
> enjoy spending a huge amount of time declaring all the tags in
> advance.

I don't see it:

* Why not use a default for undeclared tags?  That way, you do not
have to declare all tags.  However...

* How do you deal with empty elements like <HR>, <BR>, etc.?  They
need to be distinguished from non-empty elements, hence there will be
some declarations, at least.  Also, elements like <PRE> will likely
need some special-casing for pretty-printing.

* Given the above, why not generate the necessary declarations from a
DTD or XSD?  That way, I could get validation almost for free, too.

--
M/
From: Tim Bradshaw
Subject: Re: Keyword abuse (Re: reduced size symbols/keywords)
Date: 
Message-ID: <0d461308-09b8-4658-aa8d-803713d26bde@f63g2000hsf.googlegroups.com>
On Sep 8, 11:35 am, Michael Weber <·········@foldr.org> wrote:
>
> * Why not use a default for undeclared tags?  That way, you do not
> have to declare all tags.  However...
>

How do you know a tag *is* a tag, not a function call or something
else (remember that the whole point is to interweave code and
markup)?  You need some distinguishing feature.  Being in a specific
package is one obvious one.  The keyword package is the only package
which (portably) does not require a great mass of declarative junk to
support a reasonable syntax.  The idea was to avoid declarative junk
and get on with something useful.

> * How do you deal with empty elements like <HR>, <BR>, etc.?  They
> need to be distinguished from non-empty elements, hence there will be
> some declarations, at least.  Also, elements like <PRE> will likely
> need some special-casing for pretty-printing.

You need special cases for empty elements in SGML-based markup
languages.  You don't for XML (which is half the point of XML).  I
don't remember if HTOUT was clever enough to deal with that.  In any
case there are about 2 declarations needed to support that.

You don't need anything clever for PRE.

>
> * Given the above, why not generate the necessary declarations from a
> DTD or XSD?  That way, I could get validation almost for free, too.
>

The whole point of the exercise was to avoid the enormous pain of
going anywhere near anything that was even slightly like a DTD, or
anything that would have to validate, or any of that crap.
From: John Thingstad
Subject: Re: Keyword abuse (Re: reduced size symbols/keywords)
Date: 
Message-ID: <op.ug5uc7fcut4oq5@pandora.alfanett.no>
On Mon, 08 Sep 2008 12:35:34 +0200, Michael Weber
<·········@foldr.org> wrote:

> On Aug 31, 12:13 pm, Tim Bradshaw <··········@tfeb.org> wrote:
>> On Aug 29, 3:18 pm, "John Thingstad" <·······@online.no> wrote:
>> > (HTML:HTML (HTML:HEAD ...) (HTML:BODY...)
>> > would be a real pain. After all HTML is typically in use on several
>> > packages and it is important that the refer to the same symbol.
>
> I can import symbols if I don't like to use them qualified.  Thereby I
> waive certain rights what I can do with the symbol but that is the
> price of convenience.
>
>> > If something unrelated to HTML uses the key like :body no harm is done.
>
> Not true. If your idea of body disagrees with mine (for example,
> because I want to use XHTML and you use HTML), there is no way to keep
> them apart.  With symbols fom my own package, there is.

And no need to either. A keyword is just a stand-in for a string.
Why should "html:body" be different from "xhtml:body" when the strings are
identical?
You should never assign a function, macro or value to a keyword. That
WOULD be keyword abuse.

>>
>> And in fact it is the only realistic option if you want to remain
>> agnostic about what tags can exist in an HTML/XML document, unless you
>> enjoy spending a huge amount of time declaring all the tags in
>> advance.
>
> I don't see it:
>
> * Why not use a default for undeclared tags?  That way, you do not
> have to declare all tags.  However...


You just use symbol-name to get the string. ':attrib val' gets treated
similarly, except that you would substitute attrib="val".
This is much simpler to do with a simple pattern matcher than with heaps
of macros or functions.

'((?n (* : ?ai ?vi) ?b) --> (<?n (* ?ai = " ?vi ")> ?b </?n>))
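
As a rough illustration of the same idea without a pattern matcher, here is
a sketch of a recursive renderer. RENDER is a hypothetical name, the form
layout (:tag [:attr "val"]... body...) is an assumption, and no escaping or
empty-element handling is attempted:

```lisp
;; Sketch only: RENDER and the form layout are assumptions for this example.
(defun render (form &optional (stream *standard-output*))
  "Write FORM as markup. FORM is an atom (printed as text) or a list
of the shape (:tag [:attr value]... child...)."
  (if (atom form)
      (princ form stream)
      (destructuring-bind (tag &rest rest) form
        (let ((name (string-downcase (symbol-name tag))))
          (format stream "<~A" name)
          ;; Leading keyword/value pairs are treated as attributes.
          (loop while (keywordp (first rest))
                do (format stream " ~A=\"~A\""
                           (string-downcase (symbol-name (pop rest)))
                           (pop rest)))
          (format stream ">")
          (dolist (child rest)
            (render child stream))
          (format stream "</~A>" name))))
  (values))
```

For example:

```lisp
(with-output-to-string (s)
  (render '(:ul :class "nav" (:li "one") (:li "two")) s))
;; => "<ul class=\"nav\"><li>one</li><li>two</li></ul>"
```

Only SYMBOL-NAME of each keyword is used, so any tag or attribute name works
without being declared first.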

> * How do you deal with empty elements like <HR>, <BR>, etc.?  They
> need to be distinguished from non-empty elements, hence there will be
> some declarations, at least.  Also, elements like <PRE> will likely
> need some special-casing for pretty-printing.
>
> * Given the above, why not generate the necessary declarations from a
> DTD or XSD?  That way, I could get validation almost for free, too.

Lots of work..

>
> --
> M/



-- 
--------------
John Thingstad
From: Tim Bradshaw
Subject: Re: Keyword abuse (Re: reduced size symbols/keywords)
Date: 
Message-ID: <aef55b0d-be33-4655-b772-c1909d886d61@59g2000hsb.googlegroups.com>
On Sep 8, 12:32 pm, "John Thingstad" <·······@online.no> wrote:

> Lot's of work..
>

And just to reinforce my previous article: that's exactly the point.
The problem at hand is to generate some pretty looking output.  You
have a couple of choices:
1. Do it "properly". 2 years later you will have something which is 9
million lines of code, has a manual you can't lift, and is deeply
painful to use.
2. Just do something cheap and cheerful.  In the 2 years it takes the
"proper" solution to become merely extremely laborious to use, you
write the system, write a documentation system that uses it, and
several multi-hundred-page documents which use this system, for which
you get paid.

I know which I did, and I'd advise anyone to do the same.
From: John Thingstad
Subject: Re: reduced size symbols/keywords
Date: 
Message-ID: <op.ugm09rgiut4oq5@pandora.alfanett.no>
On Fri, 29 Aug 2008 07:39:14 +0200, verec <·····@mac.com> wrote:

> Toying, toying ... rather than doing useful work ... (Kenny must be
> right after all :-)
>

As far as I know there is no way to make symbols any smaller.
A more serious problem than the size is that it makes a package a leaky
abstraction.
If you export a function, you automatically export the variable with the same
name as well as the plist, class etc.

>
> This question is not (entirely) rhetoric, as I'm thinking of some
> application reading 1,000,000s words out of text files and turning them
> into symbols/keywords for processing. Cutting down the memory
> footprint by half or more would be extremely significant, while still
> preserving "most" of the properties of symbols, while sacrifing a few,
> ie, while
>

But not 1,000,000 DIFFERENT words, I trust.
With that volume, wouldn't a hash table be a better choice anyhow?
I find that plists are best for 100 elements or fewer.


The standard way of dealing with this is to Goedelize it yourself.
Your program reads the words from the file as strings.
They are stored in a hash table and assigned a number.
Each time you see a string, look the number up in the hash table. If it is
not there, generate a new number and store (string - value) there.
This is more compact.
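
A minimal sketch of that numbering scheme. The names *WORD-IDS*, *NEXT-ID*
and WORD-ID are hypothetical, and numbering from 1 is an arbitrary choice:

```lisp
;; Sketch only: names and the starting number are assumptions.
(defvar *word-ids* (make-hash-table :test #'equal))  ; string -> integer
(defvar *next-id* 0)

(defun word-id (string)
  "Return the integer assigned to STRING, assigning a new one if needed."
  (or (gethash string *word-ids*)
      (setf (gethash string *word-ids*) (incf *next-id*))))

(word-id "the")   ; => 1
(word-id "cat")   ; => 2
(word-id "the")   ; => 1
```

The rest of the program can then store and compare small integers instead of
strings or symbols.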

--------------
John Thingstad
From: Pascal J. Bourguignon
Subject: Re: reduced size symbols/keywords
Date: 
Message-ID: <7ck5e0qkol.fsf@pbourguignon.anevia.com>
"John Thingstad" <·······@online.no> writes:

> On Fri, 29 Aug 2008 07:39:14 +0200, verec <·····@mac.com> wrote:
>
>> Toying, toying ... rather than doing useful work ... (Kenny must be
>> right after all :-)
>>
>
> As far as I know there is no way to make symbols any smaller.
> A more serious problem that the size is that it makes a package a
> leaky  abstraction.
> You export a function you automatically export the variable with the
> same  name as well as the plist, class etc.
>
>>
>> This question is not (entirely) rhetoric, as I'm thinking of some
>> application reading 1,000,000s words out of text files and turning them
>> into symbols/keywords for processing. Cutting down the memory
>> footprint by half or more would be extremely significant, while still
>> preserving "most" of the properties of symbols, while sacrifing a few,
>> ie, while
>>
>
> But not 1,000,000 DIFFERENT words I trust.

The great languages have more than three million words.  Most of them
are technical and jargon, but nonetheless you can read one million
different words.

Of course it depends on what you mean by "word", if you mean the
roots, or if you mean the various forms a word can take.  But when
reading words, I guess that the various forms is what is read.


> With that voulume wouldn't a hash table be a better choice anyhow?
> I find that plist's are best for 100 elements or less.
>
>
> The standard way of dealing with this is to Goedelize it yourself.
> Your file reads the words as strings.
> They are stored in a hash table and assigned a number.
> Each time you see a string look the number up in the hash table. If it
> is  not there generate a new number and store (string - value) there.
> This is more compact.
>
> --------------
> John Thingstad

-- 
__Pascal Bourguignon__
From: Rainer Joswig
Subject: Re: reduced size symbols/keywords
Date: 
Message-ID: <joswig-5ABD9F.10231029082008@news-europe.giganews.com>
In article <·······················@news.aaisp.net.uk>,
 verec <·····@mac.com> wrote:

> Toying, toying ... rather than doing useful work ... (Kenny must be
> right after all :-)
> 
> Out of sheer curiosity (rather than need) I'm wondering if there
> is a way to reduce the size (memory footprint) of symbols.
> 
> CL-USER 11 > (describe 'keyword)
> 
> keyword is a symbol
> name          "KEYWORD"
> value         #<unbound value>
> function      #<unbound function>
> plist         (type::direct-type-predicate keywordp)
> package       #<The COMMON-LISP package, 3/4 internal, 978/1024 external>
> 
> Here I'm counting five slots, for a (32 bit machine) total size
> of 20 bytes, just for the structure, when the actual "payload"
> is only 12 bytes, or about 60% "waste"
> 
> So, I'm wondering if there's an (entirely portable) way to define
> "trimmed down" symbols that would exhibit much less overhead.
> 
> Obviously, I thought that keywords must be that trimmed down version
> of symbols ... except they are not:
> 
> CL-USER 12 > (describe :keyword)
> 
> :keyword is a symbol
> name          "KEYWORD"
> value         :keyword
> function      #<unbound function>
> plist         nil
> package       #<The KEYWORD package, 0/4 internal, 5196/8192 external>
> 
> This question is not (entirely) rhetoric, as I'm thinking of some
> application reading 1,000,000s words out of text files and turning them
> into symbols/keywords for processing. Cutting down the memory
> footprint by half or more would be extremely significant, while still
> preserving "most" of the properties of symbols, while sacrifing a few,
> ie, while
> 
> CL-USER 15 > (setf (symbol-function :cons) #'cons)
> 
> Error: Defining function :cons visible from package KEYWORD.
>   1 (continue) Redefine it anyway.
>   2 (abort) Return to level 0.
>   3 Return to top loop level 0.
> 
> Type :b for backtrace, :c <option number> to proceed,  or :? for other options
> 
> CL-USER 16 : 1 > :c 1
> #<Function ((compiler::def-nil-function cons) . cons) 204AD802>
> 
> CL-USER 17 > (:cons 1 2)
> (1 . 2)
> 
> is cute, I'm ok with those new keywords/symbols to forego everything 
> (function, plist, package) but the value slot.
> 
> How would I go about
> 
> (defun/defmacro word (..) ... )
> 
> such that
> 
> (word 'some-word)
> 
> and
> 
> CL-USER 19 > (describe 'some-word)
> 
> some-word is a symbol
> name          "SOME-WORD"
> 
> And that's it?
> 
> Many Thanks
> --
> JFB

You need to write your own symbols then. Use a simple structure
with a name and a value and a hashtable for these structure objects.
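
Something along these lines, as a sketch. WORD, %MAKE-WORD, *WORDS* and
INTERN-WORD are hypothetical names chosen for the example:

```lisp
;; Sketch only: a two-slot stand-in for a symbol.
(defstruct (word (:constructor %make-word (name)))
  (name "" :type string :read-only t)
  (value nil))

(defvar *words* (make-hash-table :test #'equal))  ; name -> word object

(defun intern-word (name)
  "Return the unique WORD object for NAME, creating it on first use."
  (or (gethash name *words*)
      (setf (gethash name *words*) (%make-word name))))
```

Interning the same name twice returns the same object, so words can be
compared with EQ, and (setf (word-value (intern-word "foo")) 42) plays the
role of the value slot.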

Or just live with the 'overhead' of normal symbols.

-- 
http://lispm.dyndns.org/
From: Pascal J. Bourguignon
Subject: Re: reduced size symbols/keywords
Date: 
Message-ID: <7cfxooqjqt.fsf@pbourguignon.anevia.com>
verec <·····@mac.com> writes:

> Toying, toying ... rather than doing useful work ... (Kenny must be
> right after all :-)
>
> Out of sheer curiosity (rather than need) I'm wondering if there
> is a way to reduce the size (memory footprint) of symbols.

This is a question for an implementation.  Who said the implementation
used a structure of five slots for a symbol?  Early implementations
used conses, and put the various slots in a property list (which is
the origin of the symbol-plist).  In this kind of implementation a
symbol without value or function takes less space.

For example, you don't need to keep the home package with the symbol,
if each package keeps its interned symbols in a different table than
the imported symbols.  You can then find the symbol-package by
scanning the packages and finding the one that has the symbol in its
homed symbol table.


> So, I'm wondering if there's an (entirely portable) way to define
> "trimmed down" symbols that would exhibit much less overhead.

That wouldn't be a symbol anymore.  As indicated by John, use a
hashtable to unify your strings.   You don't necessarily need to cons
integers:

(defparameter *unique-strings* (make-hash-table :test (function equal)))

(defun intern-string (s) 
  (or (gethash s *unique-strings*)  
      (setf (gethash s *unique-strings*) s)))
   

(let ((us1 (intern-string (read-line)))
      (us2 (intern-string (read-line))))
  (when (eq us1 us2)
     (princ "Two identical successive strings.")))
   


> is cute, I'm ok with those new keywords/symbols to forego everything
> (function, plist, package) but the value slot.

What do you need the value slot for?

-- 
__Pascal Bourguignon__
From: John Thingstad
Subject: Re: reduced size symbols/keywords
Date: 
Message-ID: <op.ugm63jk3ut4oq5@pandora.alfanett.no>
On Fri, 29 Aug 2008 10:59:54 +0200, Pascal J. Bourguignon
<···@informatimago.com> wrote:

>
> That wouldn't be a symbol anymore.  As indicated by John, use a
> hashtable to unify your strings.   You don't necessarily need to cons
> integers:
>
> (defparameter *unique-strings* (make-hash-table :test (function equal)))
>
> (defun intern-string (s)
>   (or (gethash s *unique-strings*)
>       (setf (gethash s *unique-strings*) s)))
>
> (let ((us1 (intern-string (read-line)))
>       (us2 (intern-string (read-line))))
>   (when (eq us1 us2)
>      (princ "Two identical successive strings.")))
>
>
>> is cute, I'm ok with those new keywords/symbols to forego everything
>> (function, plist, package) but the value slot.
>
> What do you need the value slot for?
>

Fair enough. I was thinking more in terms of a lexer.
If you have more control over the value returned you can use this.
consider

(get-symval "reserved1")
...
(get-symval "reservedn")

(setf *reserved-word* (get-symcounter))

(defun reserved-wordp (sym)
   (< sym *reserved-word*))

So you use the fact that reserved words are inserted before the ones  
introduced by the user.

--------------
John Thingstad
From: Rob Warnock
Subject: Re: reduced size symbols/keywords
Date: 
Message-ID: <r5adnfP2ood6MCXVnZ2dnUVZ_vudnZ2d@speakeasy.net>
John Thingstad <·······@online.no> wrote:
+---------------
| Pascal J. Bourguignon <···@informatimago.com>:
| > That wouldn't be a symbol anymore.  As indicated by John, use a
| > hashtable to unify your strings.  ...
| > (defparameter *unique-strings* (make-hash-table :test (function equal)))
| > (defun intern-string (s)
| >   (or (gethash s *unique-strings*)
| >       (setf (gethash s *unique-strings*) s)))
...
| > What do you need the value slot for?
| 
| Fair enough. I was thinking more in terms of a lexer.
| If you have more control over the value returned you can use this.
| consider
| (get-symval "reserved1")
| ...
| (get-symval "reservedn")
| (setf *reserved-word* (get-symcounter))
| (defun reserved-wordp (sym)
|    (< sym *reserved-word*))
| 
| So you use the fact that reserved words are inserted before the ones  
| introduced by the user.
+---------------

You can still do this with a hash table, by using the optional
default of GETHASH. Just change Pascal's suggestion to:

    (defparameter *unique-strings* (make-hash-table :test (function equal)))
    (defparameter *unique-count* 0)
    (defstruct interned-string count string other-data)
    (defun intern-string (s &optional other-data)
      (or (gethash s *unique-strings*)
          (setf (gethash s *unique-strings*)
                (make-interned-string :count (incf *unique-count*)
                                      :string s
                                      :other-data other-data))))

Then your desired code becomes:

    (intern-string "reserved1" {magic data for "reserved1"})
    ...
    (intern-string "reservedn" {magic data for "reservedn"})
    (setf *reserved-word* *unique-count*)
    (defun reserved-word-p (sym)		; See rules for when "-P".
       (<= (interned-string-count sym) *reserved-word*)) ; Note "<=".


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: Alex Mizrahi
Subject: Re: reduced size symbols/keywords
Date: 
Message-ID: <48b80000$0$90271$14726298@news.sunsite.dk>
 v> Here I'm counting five slots, for a (32 bit machine) total size
 v> of 20 bytes, just for the structure, when the actual "payload"
 v> is only 12 bytes, or about 60% "waste"

yep, _just for structure_.  for each symbol there is a string allocated
on heap. two or four bytes per char on unicode-supporting implementation.
plus overhead for type information, memory allocation etc. plus entry in
package's hash table.

it would be not 60% "waste", but more like 5-10% -- it's very unlikely
it can affect you. 
From: Pascal J. Bourguignon
Subject: Re: reduced size symbols/keywords
Date: 
Message-ID: <7cy72fq4ki.fsf@pbourguignon.anevia.com>
"Alex Mizrahi" <········@users.sourceforge.net> writes:

>  v> Here I'm counting five slots, for a (32 bit machine) total size
>  v> of 20 bytes, just for the structure, when the actual "payload"
>  v> is only 12 bytes, or about 60% "waste"
>
> yep, _just for structure_.  for each symbol there is a string allocated
> on heap. two or four bytes per char on unicode-supporting implementation.
> plus overhead for type information, memory allocation etc. plus entry in
> package's hash table.
>
> it would be not 60% "waste", but more like 5-10% -- it's very unlikely
> it can affect you. 


Here is what I have in my image:

C/USER[37]> (let ((n 0) (b 0) (f 0) (h (make-hash-table)))
              (dolist (p (list-all-packages))
                (do-symbols (s p)
                  (setf (gethash s h) s)))
              (maphash (lambda (s v)
                         (declare (ignore v))
                         (incf n)
                         (when (boundp s) (incf b))
                         (when (fboundp s) (incf f)))
                       h)
              (values n b (coerce (/ b n) 'float) f (coerce (/ f n) 'float)))
23912                                   ; total number of symbols
5216                                    ; bound symbols
0.21813315                              ; bound/total
10460                                   ; fbound symbols
0.43743727                              ; fbound/total
C/USER[38]> 


That is many more bound symbols than I expected.  It looks like
implementations wouldn't save much space (about 60KB here) by
eliding the symbol value until needed, and even less for the function
slot.


-- 
__Pascal Bourguignon__
From: Kaz Kylheku
Subject: Re: reduced size symbols/keywords
Date: 
Message-ID: <20080829114459.547@gmail.com>
On 2008-08-29, verec <·····@mac.com> wrote:
> This question is not (entirely) rhetorical, as I'm thinking of some
> application reading 1,000,000s words out of text files and turning them
> into symbols/keywords for processing.

Millions of /unique/ words? Come on. :)

> Cutting down the memory
> footprint by half or more would be extremely significant, while still
> preserving "most" of the properties of symbols, while sacrificing a few,
> ie, while

Maybe you should simply be treating the words as character strings.  The
interning property of symbols can be simulated by entering the words into a
hash table.

> is cute, I'm ok with those new keywords/symbols to forego everything 
> (function, plist, package) but the value slot.

If you have a hash table keyed on strings, the values associated with the
keys can be your value slot.
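
A minimal sketch of that suggestion, in portable Common Lisp.  The names
*WORDS*, INTERN-WORD and WORD-VALUE are hypothetical, chosen only for
illustration.  Each distinct word maps to one canonical cons (NAME . VALUE),
so two lookups of EQUAL strings return the same (EQ) object, and the CDR
plays the role of the value slot.

```lisp
;; Sketch only: *WORDS*, INTERN-WORD and WORD-VALUE are hypothetical names.
(defvar *words* (make-hash-table :test #'equal)
  "Maps each distinct word (a string) to its canonical entry.")

(defun intern-word (string)
  "Return the canonical entry for STRING, creating it on first use.
The entry is a cons (NAME . VALUE); calls with EQUAL strings return
the same (EQ) cons, much as INTERN does for symbols."
  (or (gethash string *words*)
      ;; Copy the key so later mutation of the caller's buffer
      ;; cannot corrupt the table.
      (let ((name (copy-seq string)))
        (setf (gethash name *words*) (cons name nil)))))

(defun word-value (word)
  "Read the value cell of an interned word."
  (cdr word))

(defun (setf word-value) (new-value word)
  "Write the value cell of an interned word."
  (setf (cdr word) new-value))
```

Usage might look like (setf (word-value (intern-word "lisp")) 42), after
which (word-value (intern-word "lisp")) returns 42.  Per word this costs one
cons, one string and one table entry, with no function, plist or package
cells.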
From: Dan Weinreb
Subject: Re: reduced size symbols/keywords
Date: 
Message-ID: <b139eb83-bbb1-484f-94ab-1aa3bf31f418@34g2000hsh.googlegroups.com>
On Aug 29, 1:39 am, verec <·····@mac.com> wrote:
> Toying, toying ... rather than doing useful work ... (Kenny must be
> right after all :-)
>
> Out of sheer curiosity (rather than need) I'm wondering if there
> is a way to reduce the size (memory footprint) of symbols.
>
> CL-USER 11 > (describe 'keyword)
>
> keyword is a symbol
> name          "KEYWORD"
> value         #<unbound value>
> function      #<unbound function>
> plist         (type::direct-type-predicate keywordp)
> package       #<The COMMON-LISP package, 3/4 internal, 978/1024 external>
>
> Here I'm counting five slots, for a (32 bit machine) total size
> of 20 bytes, just for the structure, when the actual "payload"
> is only 12 bytes, or about 60% "waste"
>
> So, I'm wondering if there's an (entirely portable) way to define
> "trimmed down" symbols that would exhibit much less overhead.
>
> Obviously, I thought that keywords must be that trimmed down version
> of symbols ... except they are not:
>
> CL-USER 12 > (describe :keyword)
>
> :keyword is a symbol
> name          "KEYWORD"
> value         :keyword
> function      #<unbound function>
> plist         nil
> package       #<The KEYWORD package, 0/4 internal, 5196/8192 external>
>
> This question is not (entirely) rhetorical, as I'm thinking of some
> application reading 1,000,000s words out of text files and turning them
> into symbols/keywords for processing. Cutting down the memory
> footprint by half or more would be extremely significant, while still
> preserving "most" of the properties of symbols, while sacrificing a few,
> ie, while
>
> CL-USER 15 > (setf (symbol-function :cons) #'cons)
>
> Error: Defining function :cons visible from package KEYWORD.
>   1 (continue) Redefine it anyway.
>   2 (abort) Return to level 0.
>   3 Return to top loop level 0.
>
> Type :b for backtrace, :c <option number> to proceed,  or :? for other options
>
> CL-USER 16 : 1 > :c 1
> #<Function ((compiler::def-nil-function cons) . cons) 204AD802>
>
> CL-USER 17 > (:cons 1 2)
> (1 . 2)
>
> is cute, I'm ok with those new keywords/symbols to forego everything
> (function, plist, package) but the value slot.
>
> How would I go about
>
> (defun/defmacro word (..) ... )
>
> such that
>
> (word 'some-word)
>
> and
>
> CL-USER 19 > (describe 'some-word)
>
> some-word is a symbol
> name          "SOME-WORD"
>
> And that's it?
>
> Many Thanks
> --
> JFB

It would certainly be possible to implement Common Lisp in such a way
as to greatly reduce the size of symbols, at the cost of making it
more expensive to get and set things like the value cell and function
cell.  You'd use some kind of compression, analogously to the way
UTF-8 is used to represent Unicode.

But in your case, I'm not sure why you want to use symbols at all,
rather than just using strings.
From: Tim Bradshaw
Subject: Re: reduced size symbols/keywords
Date: 
Message-ID: <f8a5cd27-eca0-4f5a-aae6-f6b37eb8f6c3@x35g2000hsb.googlegroups.com>
On Sep 2, 9:25 am, Dan Weinreb <····@alum.mit.edu> wrote:

> It would certainly be possible to implement Common Lisp in such a way
> as to greatly reduce the size of symbols, at the cost of making it
> more expensive to get and set things like the value cell and function
> cell.  You'd use some kind of compression, analogously to the way
> UTF-8 is used to represent Unicode.

I think it would be pretty straightforward to implement a conforming
CL in which symbols were just interned strings.  All the other
attributes of symbols can be stored, when needed, in hashtables keyed
on the symbol.  Obviously you would pay some performance penalty, but
I don't think you need "compression" (unless you count this as
compression).
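
The idea can be modelled in user code, as a rough sketch rather than as how
any real implementation does it.  All names here (THIN-INTERN, THIN-VALUE
and so on) are hypothetical.  The "symbol" is nothing but an interned
string, and the value cell lives in a side table that only gets an entry
once something is actually bound.

```lisp
;; Sketch only: every name below is hypothetical.
(defvar *thin-symbols* (make-hash-table :test #'equal)
  "Name -> the one canonical string that stands for the symbol.")

(defvar *thin-values* (make-hash-table :test #'eq)
  "Side table holding value cells, keyed on the canonical string.")

(defun thin-intern (name)
  "Return the canonical string for NAME, creating it on first use."
  (or (gethash name *thin-symbols*)
      (let ((copy (copy-seq name)))
        (setf (gethash copy *thin-symbols*) copy))))

(defun thin-boundp (sym)
  "True when SYM has an entry in the value side table."
  (nth-value 1 (gethash sym *thin-values*)))

(defun thin-value (sym)
  "Return the value of SYM, signalling an error if it is unbound."
  (multiple-value-bind (value foundp) (gethash sym *thin-values*)
    (if foundp
        value
        (error "Thin symbol ~S is unbound." sym))))

(defun (setf thin-value) (new-value sym)
  (setf (gethash sym *thin-values*) new-value))

(defun thin-makunbound (sym)
  "Drop the value cell of SYM, freeing its side-table entry."
  (remhash sym *thin-values*)
  sym)
```

An unbound thin symbol costs only its string and one table entry; the
price is a hash lookup on every value access.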

--tim
From: John Thingstad
Subject: Re: reduced size symbols/keywords
Date: 
Message-ID: <op.ugus4rpsut4oq5@pandora.alfanett.no>
On Tue, 02 Sep 2008 11:49:01 +0200, Tim Bradshaw
<··········@tfeb.org> wrote:

> On Sep 2, 9:25 am, Dan Weinreb <····@alum.mit.edu> wrote:
>
>> It would certainly be possible to implement Common Lisp in such a way
>> as to greatly reduce the size of symbols, at the cost of making it
>> more expensive to get and set things like the value cell and function
>> cell.  You'd use some kind of compression, analogously to the way
>> UTF-8 is used to represent Unicode.
>
> I think it would be pretty straightforward to implement a conforming
> CL in which symbols were just interned strings.  All the other
> attributes of symbols can be stored, when needed, in hashtables keyed
> on the symbol.  Obviously you would pay some performance penalty, but
> I don't think you need "compression" (unless you count this as
> compression).
>
> --tim

I don't think that is what he meant.
On my system I have 57000 symbols in the image.
Of those, about 40% are functions and about 10% are variables.
But a symbol needs 5 slots, most of which are never used.
For instance, none of the symbols in KEYWORD use any of the slots except
slot-name.
So what you want is a bitmask to tell you which slots are bound,
followed by just the slots that are bound.
Thus there is some computation involved to find a slot's position from
the bitmask, and as such it is a bit like UTF-8.
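
A small sketch of the bitmask layout, under assumptions of my own: the
record is a simple vector whose element 0 is the mask, followed only by
the slots that are present.  *SLOT-ORDER*, MAKE-COMPACT and COMPACT-SLOT
are hypothetical names.  The position computation is a population count
of the mask bits below the slot's own bit.

```lisp
;; Sketch only: the layout and all names are hypothetical.
(defparameter *slot-order* '(:value :function :plist :package)
  "Bit I of the mask says whether the Ith slot in this list is stored.")

(defun make-compact (&rest plist)
  "Build a compact record from alternating slot names and values."
  (let ((mask 0)
        (cells '())
        (missing (list nil)))           ; unique marker for absent slots
    (loop for slot in *slot-order*
          for bit from 0
          for value = (getf plist slot missing)
          unless (eq value missing)
            do (setf mask (logior mask (ash 1 bit)))
               (push value cells))
    (coerce (cons mask (nreverse cells)) 'simple-vector)))

(defun compact-slot (record slot)
  "Return the value of SLOT and T, or NIL and NIL when SLOT is absent.
SLOT must be one of the names in *SLOT-ORDER*."
  (let ((bit (position slot *slot-order*))
        (mask (svref record 0)))
    (if (logbitp bit mask)
        ;; Count the present slots below BIT to find the storage index.
        (values (svref record (1+ (logcount (ldb (byte bit 0) mask)))) t)
        (values nil nil))))
```

For example, (make-compact :function 'car :package "KEYWORD") yields a
three-element vector: the mask, then the function and package cells, with
no storage spent on the absent value and plist.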


--------------
John Thingstad
From: Tim Bradshaw
Subject: Re: reduced size symbols/keywords
Date: 
Message-ID: <a3998d80-fdb0-478c-9679-3678fec09f48@c65g2000hsa.googlegroups.com>
On Sep 2, 1:32 pm, "John Thingstad" <·······@online.no> wrote:

> I don't think that is what he meant.

I don't either.


> So what you want is a bitmask to tell you which ones are bound.
> Then following the bitmask follow the slots that are bound.
> Thus there is some computation involved to find the position from the  
> bitmask and as such is a bit like utf8.

My point was that no, you don't.  You just don't have any slots at all
but do everything via tables keyed on the symbols.

You could be really extreme with this approach and not even have
names: a symbol would then just be raw identity (of course most
would have names, but gensyms would not need them).  Symbols could then
even be immediate objects (I have not thought this through: how
would it work with GC?).

--tim
From: Barry Margolin
Subject: Re: reduced size symbols/keywords
Date: 
Message-ID: <barmar-353B4F.08590702092008@newsgroups.comcast.net>
In article 
<····································@c65g2000hsa.googlegroups.com>,
 Tim Bradshaw <··········@tfeb.org> wrote:

> On Sep 2, 1:32 pm, "John Thingstad" <·······@online.no> wrote:
> 
> > I don't think that is what he meant.
> 
> I don't either.
> 
> 
> > So what you want is a bitmask to tell you which ones are bound.
> > Then following the bitmask follow the slots that are bound.
> > Thus there is some computation involved to find the position from the
> > bitmask and as such is a bit like utf8.
> 
> My point was that no, you don't.  You just don't have any slots at all
> but do everything via tables keyed on the symbols.

Some early Lisps put everything in the property list.  MACLISP had a 
special value cell, but function and macro bindings were still in the 
plist.

-- 
Barry Margolin, ······@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***
*** PLEASE don't copy me on replies, I'll read them in the group ***
From: Pascal J. Bourguignon
Subject: Re: reduced size symbols/keywords
Date: 
Message-ID: <7cr682ogqs.fsf@pbourguignon.anevia.com>
"John Thingstad" <·······@online.no> writes:

> On Tue, 02 Sep 2008 11:49:01 +0200, Tim Bradshaw
> <··········@tfeb.org> wrote:
>
>> On Sep 2, 9:25 am, Dan Weinreb <····@alum.mit.edu> wrote:
>>
>>> It would certainly be possible to implement Common Lisp in such a way
>>> as to greatly reduce the size of symbols, at the cost of making it
>>> more expensive to get and set things like the value cell and function
>>> cell.  You'd use some kind of compression, analogously to the way
>>> UTF-8 is used to represent Unicode.
>>
>> I think it would be pretty straightforward to implement a conforming
>> CL in which symbols were just interned strings.  All the other
>> attributes of symbols can be stored, when needed, in hashtables keyed
>> on the symbol.  Obviously you would pay some performance penalty, but
>> I don't think you need "compression" (unless you count this as
>> compression).
>>
>> --tim
>
> I don't think that is what he meant.
> On my system I have 57000 symbols in the image.
> Of those about 40% are functions and about 10% variables.
> But a symbol needs 5 slots of which most are never used.
> For instance none of the symbols in KEYWORD use any of the slots except
> slot-name.
> So what you want is a bitmask to tell you which ones are bound.
> Then following the bitmask follow the slots that are bound.
> Thus there is some computation involved to find the position from the
> bitmask and as such is a bit like utf8.

Yes, but the memory saved by this is also minimal: 60000 is a very
small number for today's memories.

In the case of SYMBOL-PACKAGE, AFAIK there's no other way to retrieve
the home package.  Keeping a vector or hash of "homed" symbols with
the package would be more space-costly.

That leaves the plist, the function and the name. (* 3 4 60000) -> 720,000
bytes, which is insignificant when you have at least 1 GB of RAM.
The savings are not worth the complexity.  On an 8- or 16-bit processor
perhaps, but not on a 32- or 64-bit processor.

That's why, when the need is to manage one million strings, we steer
the discussion toward a hash table of strings rather than symbols.

-- 
__Pascal Bourguignon__
From: ···············@gmail.com
Subject: Re: reduced size symbols/keywords
Date: 
Message-ID: <115ad743-4dcf-45d5-a684-2dbba7cd2886@e53g2000hsa.googlegroups.com>
On Sep 2, 8:48 am, ····@informatimago.com (Pascal J. Bourguignon)
wrote:
> "John Thingstad" <·······@online.no> writes:
> > On Tue, 02 Sep 2008 11:49:01 +0200, Tim Bradshaw
> > <··········@tfeb.org> wrote:
>
> >> On Sep 2, 9:25 am, Dan Weinreb <····@alum.mit.edu> wrote:
>
> >>> It would certainly be possible to implement Common Lisp in such a way
> >>> as to greatly reduce the size of symbols, at the cost of making it
> >>> more expensive to get and set things like the value cell and function
> >>> cell.  You'd use some kind of compression, analogously to the way
> >>> UTF-8 is used to represent Unicode.

[PJB]
>
> That's why, when the need is to manage one million strings, we orient
> the discussion over a hash of strings rather than over symbols.
>

I think this is an uncomfortable position to take. If the user needs
most of the characteristics of the symbol data-type (e.g. interning
for quick comparison), asking him to roll his own just because he
wants a million of them seems hostile.

Lisp was intended to make this kind of symbolic processing easy by
building powerful facilities into the language.

Lisp implementers don't tell folks to use a separate bignum library or
roll their own when they need integers greater than 28 or 32 bits,
they build (hopefully reasonably efficient) bignum support into their
implementations. Lousy bignum performance isn't a reason to roll your
own, it's a reason to submit bug reports to your implementors.

If symbol tables with a million symbols are useful to actual
applications, we ought to encourage implementors to consider
strategies which make these applications efficient without having to
recode the applications.
From: Pascal J. Bourguignon
Subject: Re: reduced size symbols/keywords
Date: 
Message-ID: <7ciqteobjr.fsf@pbourguignon.anevia.com>
················@gmail.com" <············@gmail.com> writes:

> On Sep 2, 8:48 am, ····@informatimago.com (Pascal J. Bourguignon)
> wrote:
>> "John Thingstad" <·······@online.no> writes:
>> > On Tue, 02 Sep 2008 11:49:01 +0200, Tim Bradshaw
>> > <··········@tfeb.org> wrote:
>>
>> >> On Sep 2, 9:25 am, Dan Weinreb <····@alum.mit.edu> wrote:
>>
>> >>> It would certainly be possible to implement Common Lisp in such a way
>> >>> as to greatly reduce the size of symbols, at the cost of making it
>> >>> more expensive to get and set things like the value cell and function
>> >>> cell.  You'd use some kind of compression, analogously to the way
>> >>> UTF-8 is used to represent Unicode.
>
> [PJB]
>>
>> That's why, when the need is to manage one million strings, we orient
>> the discussion over a hash of strings rather than over symbols.
>>
>
> I think this is an uncomfortable position to take. If the user needs
> most of the characteristics of the symbol data-type (e.g. interning
> for quick comparison), asking him to roll his own just because he
> wants a million of them seems hostile.
>
> Lisp was intended to make this kind of symbolic processing easy by
> building powerful facilities into the language.
>
> Lisp implementers don't tell folks to use a separate bignum library or
> roll their own when they need integers greater than 28 or 32 bits,
> they build (hopefully reasonably efficient) bignum support into their
> implementations. Lousy bignum performance isn't a reason to roll your
> own, it's a reason to submit bug reports to your implementors.
>
> If symbol tables with a million symbols are useful to actual
> applications, we ought to encourage implementors to consider
> strategies which make these applications efficient without having to
> recode the applications.

Agreed.  That said, even with symbols that are not space-optimized, one
million of them take only 5 million pointers, i.e. 20 MB on a 32-bit
system (plus the names).  This is still very manageable.
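
The arithmetic behind these figures, as a tiny sketch.  The function name
SYMBOL-CELL-BYTES is hypothetical, and the model counts only the pointer
cells themselves, not object headers, name strings or package table
entries.

```lisp
;; Sketch only: counts pointer cells, ignoring headers and name strings.
(defun symbol-cell-bytes (symbols &key (slots 5) (word-bytes 4))
  "Bytes taken by SLOTS pointer cells for each of SYMBOLS symbols."
  (* symbols slots word-bytes))

(symbol-cell-bytes 1000000)                ; => 20000000 (20 MB, 32-bit)
(symbol-cell-bytes 60000 :slots 3)         ; => 720000, as in (* 3 4 60000)
(symbol-cell-bytes 1000000 :word-bytes 8)  ; => 40000000 (64-bit words)
```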

-- 
__Pascal Bourguignon__
From: Tim Bradshaw
Subject: Re: reduced size symbols/keywords
Date: 
Message-ID: <555b932e-75bd-4e42-abf2-7bc7770c225f@l64g2000hse.googlegroups.com>
On Sep 2, 1:48 pm, ····@informatimago.com (Pascal J. Bourguignon)
wrote:

> Remains the plist, the function and the name. (* 3 4 60000) -> 720,000
> bytes, which is insignificant when  you have at least 1 GB of RAM.
> The savings are not worth the complexity.  On a 8- or 16-bit processor
> perhaps, but not on a 32- or 64-bit processor.
>

That's actually not clear (apart from it being a matter of physical
memory, not word size of the processor).  Most modern systems actually
have fairly small amounts of memory attached to them, and tricks which
are friendly to this memory are often very useful.  Of course, they
call this memory "first-level cache".

(I would not want anyone to think I am seriously suggesting some make-
symbols-smaller trickery here: obviously if you care you'd be using
some more appropriate data type).
From: Rob Warnock
Subject: Re: reduced size symbols/keywords
Date: 
Message-ID: <eNOdnVIIisrYpiDVnZ2dnUVZ_tjinZ2d@speakeasy.net>
Pascal J. Bourguignon <···@informatimago.com> wrote:
+---------------
| "John Thingstad" <·······@online.no> writes:
| > Tim Bradshaw <··········@tfeb.org> wrote:
| >> I think it would be pretty straightforward to implement a conforming
| >> CL in which symbols were just interned strings.  All the other
| >> attributes of symbols can be stored, when needed, in hashtables keyed
| >> on the symbol.  Obviously you would pay some performance penalty...
...
| > But a symbol needs 5 slots of which most are never used.
...
| Remains the plist, the function and the name. (* 3 4 60000) -> 720,000
| bytes, which is insignificant when  you have at least 1 GB of RAM.
| The savings are not worth the complexity. ...
+---------------

Note that CMUCL actually works a bit like Dan/Tim suggest. In CMUCL,
symbols have slots only for name, package, value, and plist... but
*NOT* for function!! The symbol-function information is stored in
CMUCL's "info" mechanism [accessed either through RAW-DEFINITION
a.k.a. %COERCE-TO-FUNCTION or FDEFINITION, all of which bottom out
in (EXTENSIONS:INFO FUNCTION DEFINITION name)]. This causes no
significant performance problem since the lookup is typically done only
once per occurrence [*not* once per call!] -- even for interpreted code,
during minimal compilation.


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: verec
Subject: Re: reduced size symbols/keywords
Date: 
Message-ID: <48bda265$0$524$5a6aecb4@news.aaisp.net.uk>
On 2008-09-02 13:48:59 +0100, ···@informatimago.com (Pascal J. 
Bourguignon) said:

> In the case of SYMBOL-PACKAGE, AFAIK there's no other way to retrieve
> the home package.  Keeping a vector or hash of "homed" symbols with
> the package would be more space-costly.

This is not so. There are other ways that do not consume *any* space.
For example, have each package hold an address range in which the
symbols it owns are to be found. You can always resize/reallocate
at GC time, but with this scheme, a binary search on the package
"symbol address range" is enough to answer SYMBOL-PACKAGE
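
A rough model of that lookup, with assumptions stated up front: addresses
are plain integers, and RANGES is a vector of (START END PACKAGE-NAME)
entries sorted by START, covering non-overlapping half-open intervals
[START, END).  FIND-HOME-PACKAGE is a hypothetical name; a real
implementation would work on heap addresses and keep the table in step
with the GC.

```lisp
;; Sketch only: integers stand in for heap addresses.
(defun find-home-package (address ranges)
  "Binary-search RANGES for the interval containing ADDRESS.
Return the package name stored with that interval, or NIL if none."
  (let ((low 0)
        (high (1- (length ranges))))
    (loop while (<= low high)
          do (let ((mid (floor (+ low high) 2)))
               (destructuring-bind (start end package) (aref ranges mid)
                 (cond ((< address start) (setf high (1- mid)))
                       ((>= address end) (setf low (1+ mid)))
                       (t (return package))))))))
```

With ranges #((0 100 "A") (100 250 "B") (400 500 "C")), address 249 maps
to "B" and address 300 maps to NIL.  No per-symbol cell is needed; the
cost is one table entry per range plus a logarithmic search on each
SYMBOL-PACKAGE call.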
--
JFB
From: Madhu
Subject: Re: reduced size symbols/keywords
Date: 
Message-ID: <m3vdxerbt0.fsf@moon.robolove.meer.net>
* ·······@tfeb.org Wrote on Tue, 2 Sep 2008 02:49:01 -0700 (PDT):

| On Sep 2, 9:25 am, Dan Weinreb wrote
|
|> It would certainly be possible to implement Common Lisp in such a way
|> as to greatly reduce the size of symbols, at the cost of making it
|> more expensive to get and set things like the value cell and function
|> cell.  You'd use some kind of compression, analogously to the way
|> UTF-8 is used to represent Unicode.
|
| I think it would be pretty straightforward to implement a conforming
| CL in which symbols were just interned strings.  All the other
| attributes of symbols can be stored, when needed, in hashtables keyed
| on the symbol.  Obviously you would pay some performance penalty, but
| I don't think you need "compression" (unless you count this as
| compression).

I did not follow what scheme Dan Weinreb was hinting at; the UTF-8 clue
did not help.  Maybe Dan could elaborate (I'm sure it's something
concrete).

Compressing strings (symbol names) in a trie offered little or no
advantage over interning them in hash tables when I benchmarked
two alternative implementations (for an application dealing with English
text a few years ago).  The Lisp implementations I used handled large
hashtables well.  Also, as you noted upthread, the cost of ephemeral
string consing is "not much".

I'm including some trie code in case the OP would care to benchmark it:
The code is for interning strings in a trie, (which can then be compared
with EQ). Use INTERN-STRING to avoid the consing up of a new string for
a call to GETHASH if the string was already available --- say in an
array read by READ-SEQUENCE or READ-LINE.


;;; Trie representation:
;;;
;;; TRIE = (VAL . ALIST)         |  CDR TRIE == (SUB-KEY . SUB-TRIE)+
;;; ALIST = (SUB-KEY . TRIE)+    |
;;; KEY = SUB-KEY+               |

(defun trie-descend (trie sub-key)
  "Internal. Return sub-trie associated with SUB-KEY."
  (let ((cons (assoc sub-key (cdr trie))))
    (when cons (cdr cons))))

(defvar *strings-trie* (cons nil nil))

(defun intern-string (string &key (start 0) end (strings-trie *strings-trie*)
		      (copy-p t))
  ;; docstring elided to keep kt happy
  (declare (type string string) (type (integer 0) start)
	   (type (or null (integer 1)) end) (type list strings-trie))
  (loop	for i of-type fixnum from start below (or end (length string))
	for sub-key = (char string i)
	for trie = strings-trie then sub-trie
	for sub-trie = (trie-descend trie sub-key)
	if (endp sub-trie)
	do (push (cons sub-key (setq sub-trie (cons nil nil))) (cdr trie))
	finally	(return (cond ((null (car sub-trie))
			       (setf (car sub-trie)
				     (if (and (zerop start)
					      (or (null end)
						  (= end (length string)))
					      (not copy-p))
					 string
					 (subseq string start end))))
			      (t (car sub-trie))))))