Strings or Symbols?

From: proton
Subject: Strings or Symbols?
Date: Fri, 25 Jan 2008 07:35:56 +0000
Message-ID: <7fc547c5-4c72-46da-a9d0-1e4e906a2f77@j78g2000hsd.googlegroups.com>

I am writing a program for natural language processing and I need to
handle thousands of words in English and other languages. I have a
basic decision to make: should I represent the words as strings or
symbols?
The advantage of symbols is that they have built-in features which
could be handy (property lists, etc..). However, I think it would be
very inefficient if you have to add prefixes and suffixes to a word.
Besides, I have the feeling that for memory usage, strings take less
space than symbols. Also, I guess that when the evaluator searches for
a symbol it will take longer if there are thousands of them defined,
am I right? Would this be different with strings?
Can anyone give some advice? Thanks very much.

From: Tim Bradshaw
Subject: Re: Strings or Symbols?
Date: Fri, 25 Jan 2008 10:46:46 +0000
Message-ID: <f9bff959-888f-4552-80df-f74045020608@f47g2000hsd.googlegroups.com>

On Jan 25, 7:35 am, proton <··········@gmail.com> wrote:
> I am writing a program for natural language processing and I need to
> handle thousands of words in English and other languages. I have a
> basic decision to make: should I represent the words as strings or
> symbols?

I'm with Pascal: I don't think that is the decision you have at all.
Instead you need to think about what you want your "word" object to be
like and then implement something which does that.

However, it is worth noting that if you go and look at the kind of
neanderthal AI where Lisp had its roots, symbols were commonly used to
represent things like words &c.  To some extent that's where property
lists and so on come from.  I bet plenty of more recent (I hesitate to
use the word "modern") AI/computational linguistics programs still do
this.

It's not the case that lookup slows substantially depending on the
number of symbols - hashing (or other techniques) ensures that.  Of
course, you can construct hash tables whose keys are strings as well,
and if you use some more complicated object to represent a word, you
can hash it based on the string which represents the written form &c.
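
A minimal sketch of a string-keyed hash table, as described above.  The
names *LEXICON*, ADD-WORD and LOOKUP-WORD and the property keys are
illustrative assumptions, not part of any library.  The EQUAL test
matters: the default EQL test would not treat two separate strings with
the same characters as the same key.

```lisp
;; Hypothetical lexicon: written form (a string) -> property list.
;; EQUAL compares strings by content (case-sensitively).
(defvar *lexicon* (make-hash-table :test #'equal))

(defun add-word (written-form &rest properties)
  "Store PROPERTIES under WRITTEN-FORM and return them."
  (setf (gethash written-form *lexicon*) properties))

(defun lookup-word (written-form)
  "Return the properties stored for WRITTEN-FORM, or NIL if unknown."
  (values (gethash written-form *lexicon*)))

(add-word "wise"   :category :adjective :language :english)
(add-word "wisdom" :category :noun      :language :english)

(getf (lookup-word "wise") :category)   ; => :ADJECTIVE
```

If a richer object replaces the property list later, only the stored
value changes; the lookup by written form stays the same.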

From: Pascal J. Bourguignon
Subject: Re: Strings or Symbols?
Date: Fri, 25 Jan 2008 09:34:05 +0000
Message-ID: <7cir1i9rle.fsf@pbourguignon.anevia.com>

proton <··········@gmail.com> writes:

> I am writing a program for natural language processing and I need to
> handle thousands of words in English and other languages. I have a
> basic decision to make: should I represent the words as strings or
> symbols?

Words are neither strings nor symbols.  Words are words.

> The advantage of symbols is that they have built-in features which
> could be handy (property lists, etc..). 

Hash-tables, structures, objects too have built-in features which
could be handy...

> However, I think it would be
> very inefficient if you have to add prefixes and suffixes to a word.

Words are not pieces of iron either, that you can forge (let's
heat and hit 'won' a little and make it into a 'win') or solder to
some other prefix iron piece.

Take words as other mathematical objects, and process them
functionally.  If you want to make 'unwise' from 'wise', do it
functionally:

(defun prepend-prefix (prefix word) 
   (intern (format nil "~A~A" prefix word)))
(prepend-prefix 'un 'wise)

> Besides, I have the feeling that for memory usage, strings take less
> space than symbols. Also, I guess that when the evaluator searches for
> a symbol it will take longer if there are thousands of them defined,
> am I right? Would this be different with strings?
> Can anyone give some advice? Thanks very much.

If you want to write a quick-and-dirty trick, a little toy program,
then you may use Lisp symbols to represent words; after all, they were
invented just for that, so they're not too bad a match.  But if you
have serious needs, then you should probably use some more complex
data structure.  I'd use plain CLOS objects.

(defmethod prepend-prefix ((prefix prefix) (word word))
  (make-instance 'word
       :letters (combine-letters prefix (letters word))
       :meaning (combine-meaning prefix (meaning word))
       :language (language word)
       :category (category word)))

so you can get:

(prepend-prefix *prefix-un* (find-word "wise" :english))
--> #<WORD :ENGLISH :ADJECTIVE "unwise" (contrary-of "wise")>
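
One way the pieces assumed by the method above could be filled in, as a
rough sketch only.  The class layouts, the OPERATOR slot, ADD-WORD,
*DICTIONARY* and the bodies of COMBINE-LETTERS, COMBINE-MEANING and
FIND-WORD are all assumptions made for illustration; the method itself
is repeated so that the block loads on its own.  A separate package is
used so that WORD cannot clash with a symbol the implementation already
exports.

```lisp
(defpackage :clos-word-sketch (:use :common-lisp))
(in-package :clos-word-sketch)

(defclass word ()
  ((letters  :initarg :letters  :reader letters)
   (meaning  :initarg :meaning  :reader meaning)
   (language :initarg :language :reader language)
   (category :initarg :category :reader category)))

(defclass prefix ()
  ((letters  :initarg :letters  :reader letters)
   (operator :initarg :operator :reader operator))) ; e.g. CONTRARY-OF

(defmethod print-object ((word word) stream)
  (print-unreadable-object (word stream :type t)
    (format stream "~S ~S ~S ~S"
            (language word) (category word) (letters word) (meaning word))))

(defun combine-letters (prefix letters)
  "Assumed behaviour: plain concatenation of the written forms."
  (concatenate 'string (letters prefix) letters))

(defun combine-meaning (prefix meaning)
  "Assumed behaviour: wrap MEANING in the prefix's operator."
  (list (operator prefix) meaning))

;; Hypothetical dictionary keyed on (letters . language).
(defvar *dictionary* (make-hash-table :test #'equal))

(defun add-word (word)
  (setf (gethash (cons (letters word) (language word)) *dictionary*) word))

(defun find-word (letters language)
  (values (gethash (cons letters language) *dictionary*)))

(defgeneric prepend-prefix (prefix word))

(defmethod prepend-prefix ((prefix prefix) (word word))
  (make-instance 'word
       :letters (combine-letters prefix (letters word))
       :meaning (combine-meaning prefix (meaning word))
       :language (language word)
       :category (category word)))

(defvar *prefix-un*
  (make-instance 'prefix :letters "un" :operator 'contrary-of))

(add-word (make-instance 'word :letters "wise" :meaning "wise"
                               :language :english :category :adjective))

(prepend-prefix *prefix-un* (find-word "wise" :english))
```

The original word object is left untouched; the call returns a new WORD
whose letters are "unwise", which is the functional style recommended
above.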

-- 
__Pascal Bourguignon__

From: Øyvin Halfdan Thuv
Subject: Re: Strings or Symbols?
Date: Fri, 25 Jan 2008 11:13:39 +0000
Message-ID: <slrnfpjh34.aho.oyvinht@decibel.pvv.ntnu.no>

On 2008-01-25, proton <··········@gmail.com> wrote:
> I am writing a program for natural language processing and I need to
> handle thousands of words in English and other languages. I have a
> basic decision to make: should I represent the words as strings or
> symbols?
> The advantage of symbols is that they have built-in features which
> could be handy (property lists, etc..). However, I think it would be
> very inefficient if you have to add prefixes and suffixes to a word.
> Besides, I have the feeling that for memory usage, strings take less
> space than symbols. Also, I guess that when the evaluator searches for
> a symbol it will take longer if there are thousands of them defined,
> am I right? Would this be different with strings?
> Can anyone give some advice? Thanks very much.

Hi. This is an interesting question.

I have tried to write code using symbols to represent words. I eventually ended
up having to do a lot of symbol->string conversion to do all kinds of 
simple things (like concatenating words, which is /very/ common in a synthetic
language (like Norwegian)).

It is probably a better idea to represent words with some data structure that
you can specify yourself. Structs, for example, are both efficient and
flexible (and give you a lot of code for free). Then you can do things like:

(defstruct word
 (gender)
 (stem)
 ...)

and perhaps put it in a hash table indexed on the string representation of
the word ... You could also include things like subclassing (then CLOS might be
a better choice) if you want to represent semantic relations etc. It all
depends on what you want to do.
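
A small sketch of that idea, under some loud assumptions: the slot
layout, the function names REGISTER-WORD and COMPOUND, the use of the
stem as the "string representation", and the rule that a compound takes
the gender of its last part are all simplifications chosen for
illustration.  The point is that concatenation works on the strings in
the structs directly, with no symbol->string round trips.

```lisp
(defpackage :struct-word-sketch (:use :common-lisp))
(in-package :struct-word-sketch)

(defstruct word
  stem      ; a string, used here as the written form
  gender)   ; e.g. :masculine, :feminine, :neuter

;; Words indexed on their string representation.
(defvar *words* (make-hash-table :test #'equal))

(defun register-word (word)
  "Store WORD under its stem and return it."
  (setf (gethash (word-stem word) *words*) word))

(defun compound (first second)
  "Return a new registered WORD made by concatenating FIRST and SECOND.
Simplifying assumption: the compound inherits the gender of SECOND."
  (register-word
   (make-word :stem (concatenate 'string (word-stem first) (word-stem second))
              :gender (word-gender second))))

(register-word (make-word :stem "sommer" :gender :masculine))
(register-word (make-word :stem "hus"    :gender :neuter))

(compound (gethash "sommer" *words*) (gethash "hus" *words*))
(word-gender (gethash "sommerhus" *words*))   ; => :NEUTER
```

DEFSTRUCT supplies MAKE-WORD, WORD-STEM, WORD-GENDER, WORD-P, COPY-WORD
and a printed representation, which is the "code for free" mentioned
above.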

What kind of program is it you are creating? What Lisp do you use?

-- 
Oyvin

From: vanekl
Subject: Re: Strings or Symbols?
Date: Fri, 25 Jan 2008 12:19:33 +0000
Message-ID: <fnck14$9h3$1@aioe.org>

proton wrote:
> I am writing a program for natural language processing and I need to
> handle thousands of words in English and other languages. I have a
> basic decision to make: should I represent the words as strings or
> symbols?
> The advantage of symbols is that they have built-in features which
> could be handy (property lists, etc..). However, I think it would be
> very inefficient if you have to add prefixes and suffixes to a word.
> Besides, I have the feeling that for memory usage, strings take less
> space than symbols. Also, I guess that when the evaluator searches for
> a symbol it will take longer if there are thousands of them defined,
> am I right? Would this be different with strings?
> Can anyone give some advice? Thanks very much.

You want to use Prolog for this. Seriously.
It was built for this type of work.
And this course of action precludes you from having to make all
these basic decisions.

You can have a basic NLP parser running in a day if you go that route.
Bratko has written an excellent Prolog book that even has a
chapter or two on this. It's in its second edition,
last I checked. I found a copy in a local library.

Oh, by the way. It's not the word lookup that's going to kill you.
It's the exponential time search when parsing sentences.
This is definitely not a polynomial algorithm, so if you are
going to prematurely optimize your program now, you may as well
spend the time on the real bottleneck. Space considerations
are not your major concern unless you plan on porting this
to a Nokia, in which case you still have more pressing problems.

Do some research before jumping into the coding phase.
Build on past work.
Pick the right tool,
and think before do,
because if you don't, the next step you take may take you backwards.
--
Tries have nice properties.

From: Rainer Joswig
Subject: Re: Strings or Symbols?
Date: Fri, 25 Jan 2008 12:37:05 +0000
Message-ID: <joswig-ED8960.13370525012008@news-europe.giganews.com>

In article <············@aioe.org>, vanekl <·····@acd.net> wrote:

> proton wrote:
> > I am writing a program for natural language processing and I need to
> > handle thousands of words in English and other languages. I have a
> > basic decision to make: should I represent the words as strings or
> > symbols?
> > The advantage of symbols is that they have built-in features which
> > could be handy (property lists, etc..). However, I think it would be
> > very inefficient if you have to add prefixes and suffixes to a word.
> > Besides, I have the feeling that for memory usage, strings take less
> > space than symbols. Also, I guess that when the evaluator searches for
> > a symbol it will take longer if there are thousands of them defined,
> > am I right? Would this be different with strings?
> > Can anyone give some advice? Thanks very much.
> 
> You want to use Prolog for this. Seriously.
> It was built for this type of work.

Lisp, too.

> And this course of action precludes you from having to make all
> these basic decisions.
> 
> You can have a basic NLP parser running in a day if you go that route.
> Bratko has written an excellent Prolog book that even has a
> chapter or two on this. It's in its second edition,
> last I checked. I found a copy in a local library.
> 
> Oh, by the way. It's not the word lookup that's going to kill you.
> It's the exponential time search when parsing sentences.
> This is definitely not a polynomial algorithm, so if you are
> going to prematurely optimize your program now, you may as well
> spend the time on the real bottleneck. Space considerations
> are not your major concern unless you plan on porting this
> to a Nokia, in which case you still have more pressing problems.
> 
> Do some research before jumping into the coding phase.
> Build on past work.
> Pick the right tool,
> and think before do,
> because if you don't, the next step you take may take you backwards.
> --
> Tries have nice properties.

From: Rainer Joswig
Subject: Re: Strings or Symbols?
Date: Fri, 25 Jan 2008 12:18:55 +0000
Message-ID: <joswig-F35A19.13185525012008@news-europe.giganews.com>

In article 
<····································@j78g2000hsd.googlegroups.com>,
 proton <··········@gmail.com> wrote:

> I am writing a program for natural language processing and I need to
> handle thousands of words in English and other languages.

'Handling' thousands of words is kind of vague. Thousands of words
as such are not much of a problem.

> I have a
> basic decision to make: should I represent the words as strings or
> symbols?

> The advantage of symbols is that they have built-in features which
> could be handy (property lists, etc..).

Symbols offer several services:

1) Symbols have property lists, so you can attach information to them.
2) Simple symbols print and read easily and don't need to be quoted.
3) Symbols are made unique when they are INTERNed in a package.
4) Symbols can be efficiently compared with EQ.
5) Symbols have a name, which is a string.
6) The lookup of symbols by name in a package is fast.
...

Common Lisp has built-in hash-tables which you can
use for looking up strings.
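
The services listed above can be tried out in a few lines.  This is
only a sketch: the package name NLP-WORDS, the helper WORD-SYMBOL and
the CATEGORY property are made up for the example.  Keeping word
symbols in a package of their own avoids mixing them with the symbols
of the program.

```lisp
;; Hypothetical package that holds nothing but word symbols.
(defpackage :nlp-words (:use))

(defun word-symbol (name)
  "Return the unique symbol named NAME in the NLP-WORDS package."
  (values (intern name :nlp-words)))

(let ((w1 (word-symbol "wise"))
      (w2 (word-symbol "wise")))
  (setf (get w1 'category) 'adjective)    ; 1) property list
  (list (eq w1 w2)                        ; 3) and 4) one symbol, compared with EQ
        (symbol-name w1)                  ; 5) the name is a string
        (get w2 'category)                ; same property list via either variable
        (nth-value 1 (find-symbol "wise" :nlp-words)))) ; 6) lookup by name
;; => (T "wise" ADJECTIVE :INTERNAL)
```

The lookup by name happens in INTERN and FIND-SYMBOL (or in the reader
when it reads source text), not while code is being evaluated.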

> However, I think it would be
> very inefficient if you have to add prefixes and suffixes to a word.
> Besides, I have the feeling that for memory usage, strings take less
> space than symbols. Also, I guess that when the evaluator searches for
> a symbol it will take longer if there are thousands of them defined,

The evaluator usually does not search for symbols.

> am I right? Would this be different with strings?
> Can anyone give some advice? Thanks very much.

Symbols were used in early AI times as data-structures in programs.
Nowadays symbols are often used as elements in the programming
language (for identifiers, for example). For many purposes
(say, experimenting) you still might use just symbols and be fine.
For more ambitious projects you might need special purpose
data-structures to represent words, sentences, dictionaries.
With strings alone you also would not get very far.

Here is a book about old-style Lisp programming for
Natural Language Processing:

http://www.informatics.sussex.ac.uk/research/groups/nlp/gazdar/nlp-in-lisp/
Natural Language Processing in Lisp
Gerald Gazdar, Chris Mellish

I would not really recommend the coding style in that book, because it
is very old-fashioned.

For a MUCH better introduction to GOFAI (Good Old-Fashioned Artificial Intelligence)
and natural language processing with Lisp:

  Paradigms of Artificial Intelligence Programming
  Case Studies in Common Lisp
  Peter Norvig, 1992
  http://norvig.com/paip.html

It has several chapters devoted to natural language problems.
Fortunately Peter Norvig has a really good programming style.

I'd recommend Peter Norvig's book for anybody with interest in Lisp
programming. It covers a lot of material very nicely.
(Though you can see that Peter Norvig was not much of a CLOS user).

From: Luigi Panzeri
Subject: Re: Strings or Symbols?
Date: Fri, 25 Jan 2008 15:25:14 +0000
Message-ID: <m2abmudj1h.fsf@matley.muppetslab.org>

You should also consider a trie [1] data structure. It is not
difficult to build your own implementation. Otherwise you can try a
package I developed some time ago implementing a trie structure [2].
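
To give an idea of how little is needed, here is a minimal trie sketch.
It is not the code from [2]; the struct NODE and the functions
TRIE-INSERT, TRIE-LOOKUP and TRIE-PREFIX-P are names invented for this
example, and each node simply keeps its children in a hash table keyed
on characters.

```lisp
(defpackage :trie-sketch (:use :common-lisp))
(in-package :trie-sketch)

(defstruct node
  (children (make-hash-table :test #'eql))  ; character -> child node
  (terminal-p nil)                          ; true if a word ends here
  (value nil))                              ; data attached to that word

(defun trie-insert (trie word &optional (value t))
  "Add WORD (a string) to TRIE, attaching VALUE.  Returns TRIE."
  (let ((node trie))
    (loop for ch across word
          do (setf node
                   (or (gethash ch (node-children node))
                       (setf (gethash ch (node-children node))
                             (make-node)))))
    (setf (node-terminal-p node) t
          (node-value node) value)
    trie))

(defun trie-find-node (trie string)
  "Return the node reached by following STRING, or NIL if there is none."
  (let ((node trie))
    (loop for ch across string
          while node
          do (setf node (gethash ch (node-children node))))
    node))

(defun trie-lookup (trie word)
  "Return the value stored for WORD and T, or NIL and NIL if absent."
  (let ((node (trie-find-node trie word)))
    (if (and node (node-terminal-p node))
        (values (node-value node) t)
        (values nil nil))))

(defun trie-prefix-p (trie prefix)
  "True if at least one stored word starts with PREFIX."
  (not (null (trie-find-node trie prefix))))

(defvar *trie* (make-node))
(trie-insert *trie* "wise" :adjective)
(trie-insert *trie* "wisdom" :noun)

(trie-lookup *trie* "wise")     ; => :ADJECTIVE, T
(trie-lookup *trie* "wis")      ; => NIL, NIL
(trie-prefix-p *trie* "wis")    ; => T
```

Words that share a beginning share nodes, so prefix queries come almost
for free; that is the main thing a trie offers over a plain hash table
of strings.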

proton <··········@gmail.com> writes:

> I am writing a program for natural language processing and I need to
> handle thousands of words in English and other languages. I have a
> basic decision to make: should I represent the words as strings or
> symbols?
> The advantage of symbols is that they have built-in features which
> could be handy (property lists, etc..). However, I think it would be
> very inefficient if you have to add prefixes and suffixes to a word.
> Besides, I have the feeling that for memory usage, strings take less
> space than symbols. Also, I guess that when the evaluator searches for
> a symbol it will take longer if there are thousands of them defined,
> am I right? Would this be different with strings?
> Can anyone give some advice? Thanks very much.


Footnotes: 
[1]  http://en.wikipedia.org/wiki/Trie

[2]  http://www.innerloop.it/~matley/trie.lisp

-- 
Luigi Panzeri aka Matley

Why Lisp? http://alu.cliki.net/RtL%20Highlight%20Film
Quotes on Lisp: http://lispers.org/