help with NLP library

From: abhi
Subject: help with NLP library
Date: Tue, 03 Feb 2009 07:55:23 +0000
Message-ID: <b05516b6-3647-4c60-8bc0-d566344bbd8b@q9g2000yqc.googlegroups.com>

Hi,
   I am looking for a Common Lisp library which can check if a given
sentence is syntactically correct, i.e., according to the rules of
English grammar.
  I have taken a look at the Basic English Grammar library but it is not
sufficient since it tags only the individual words and does not have
syntactic checking as part of the functionality. I also tried to
compile the code from PAIP (on SBCL) but it threw a lot of errors
which I was not able to debug with my limited knowledge of CL.
   I would appreciate it if someone could point me to a library (if
there is one at all) which does what I am looking for.

Thanks
Abhijith

From: maximinus
Subject: Re: help with NLP library
Date: Tue, 03 Feb 2009 14:39:04 +0000
Message-ID: <607e5d64-e4e0-4c3a-bc18-34320596663a@r41g2000prr.googlegroups.com>

On Feb 3, 3:55 pm, abhi <·········@gmail.com> wrote:
> Hi,
>    I am looking for a common lisp library which can check if a given
> sentence is syntactically correct ie., according to the rules of
> English grammar.
>   I have taken a look at Basic English Grammar library but it is not
> sufficient since it tags only the individual words and does not have
> syntactic checking as part of the functionality. I also tried to
> compile the code from PAIP (on sbcl) but it threw a lot of errors
> which I was not able to debug with my limited knowledge of CL.
>    I would appreciate it if someone could point me to a library (if
> there is one at all) which does what I am looking for.
>
> Thanks
> Abhijith

Speaking as a full-time English teacher, I'd doubt that this is even
possible - the language is manifestly not logical in some situations.

From: Andrew Philpot
Subject: Re: help with NLP library
Date: Tue, 03 Feb 2009 15:25:59 +0000
Message-ID: <slrngogogb.1pa.philpot@ubirr.isi.edu>

On 2009-02-03, maximinus <·········@gmail.com> wrote:
> On Feb 3, 3:55 pm, abhi <·········@gmail.com> wrote:
>> Hi,
>>    I am looking for a common lisp library which can check if a given
>> sentence is syntactically correct ie., according to the rules of
>> English grammar.
>>  I have taken a look at Basic English Grammar library [...]. I also tried to
>> compile the code from PAIP [...]
>> I would appreciate it if someone could point me to a library (if
>> there is one at all) which does what I am looking for.
>>
>> Thanks
>> Abhijith
>

The state-of-the-art parsers these days are statistical, using lots of
precompiled data about words and word classes.  See the Collins and
Charniak parsers for two examples -- they are not written in Lisp, but
their output format is easily processed in Lisp.  Norvig's PAIP is a
good starting point, but it's much less sophisticated than those two,
quite naturally, as its purpose is largely to illustrate implementation
methods.

-- 
Andrew Philpot
USC Information Sciences Institute
·······@isi.edu

From: abhi
Subject: Re: help with NLP library
Date: Tue, 03 Feb 2009 17:11:00 +0000
Message-ID: <af38d830-e70a-40cb-a947-724ca180f83e@v39g2000pro.googlegroups.com>

On Feb 3, 8:25 pm, Andrew Philpot <·······@isi.edu> wrote:
> On 2009-02-03, maximinus <·········@gmail.com> wrote:
>
> > On Feb 3, 3:55 pm, abhi <·········@gmail.com> wrote:
> >> Hi,
> >>    I am looking for a common lisp library which can check if a given
> >> sentence is syntactically correct ie., according to the rules of
> >> English grammar.
> >>  I have taken a look at Basic English Grammar library [...]. I also tried to
> >> compile the code from PAIP [...]
> >> I would appreciate it if someone could point me to a library (if
> >> there is one at all) which does what I am looking for.
>
> >> Thanks
> >> Abhijith
>
> The state of the art parsers these days are statistical, using lots of
> precompiled data about words and word classes.  See Collins, Charniak
> for two -- they are not Lisp but the output format is easily processed
> in Lisp.  Norvig's PAIP is a good starting point, but it's much less
> sophisticated than those two, quite naturally as its purpose is
> largely to illustrate implementation methods.
>
> --
> Andrew Philpot
> USC Information Sciences Institute
> ·······@isi.edu




> the language is manifestly not logical in some situations.
Could you please give me an instance?

@Andrew Philpot
Thank you for your suggestion. I will take a look at the resource that
you pointed out.

Thanks
Abhijith

From: D Herring
Subject: Re: help with NLP library
Date: Tue, 03 Feb 2009 17:57:16 +0000
Message-ID: <49888572$0$3339$6e1ede2f@read.cnntp.org>

abhi wrote:
> On Feb 3, 8:25 pm, Andrew Philpot <·······@isi.edu> wrote:
>> the language is manifestly not logical in some situations.
> Could you please give me an instance ?

It's not illogical; there are just several rule sets to choose from
(e.g. Old English, French, German, Latin) and nuances to apply.  This
lets us verb nouns and noun verbs but never elephant words.

:)

Much of English grammar is mixed up with the actual words being used.
Noun, pronoun, preposition, adjective, verb, and the rest are broad
categories that do not adequately capture what is legal and what is
not.  In school, I remember several texts directly contradicting each
other with respect to proper grammar; much of our grammar is stylistic
in nature.  "Eat beef" is generally acceptable while "eat cow" sounds
wrong to a native speaker.

The best grammar checker I've had the misfortune of using was bundled 
with MS Word.  It was clearly pattern matching key words and phrases 
(via punctuation, etc.) and frequently offered nonsensical improvements.

- Daniel

From: Rob Warnock
Subject: Re: help with NLP library
Date: Wed, 04 Feb 2009 05:45:37 +0000
Message-ID: <vMSdnUHsRpocthTUnZ2dnUVZ_oDinZ2d@speakeasy.net>

D Herring <········@at.tentpost.dot.com> wrote:
+---------------
| In school, I remember several texts directly contradicting each 
| other with respect to proper grammar; much of our grammar is stylistic 
| in nature.  "Eat beef" is generally acceptable while "eat cow" sounds 
| wrong to a native speaker.
+---------------

Except in such usages as, "I'm so hungry I could eat a cow!"
[Or sometimes "...a whole cow!"]

The cow/beef (boeuf), pig/pork (porc), sheep/mutton (mouton) thingy
is a relic of English history -- the Anglo-Saxon commoners versus
Norman (French) nobility situation, q.v.


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607

From: abhi
Subject: Re: help with NLP library
Date: Wed, 04 Feb 2009 10:27:42 +0000
Message-ID: <b920d7a1-1388-47e3-aad9-5bedd3cef47a@o40g2000yqb.googlegroups.com>

>"Eat beef" is generally acceptable while "eat cow" sounds wrong to a native speaker.

The sentence "Eat Beef" is, I presume, derived from one of the rules in
the grammar, which could look something like:

=> VERB optional-adjective NOUN.
VERB -> Eat
NOUN -> Beef

So, going by the above rule, wouldn't I get the sentence "Eat cow",
which is syntactically right even though it does not make any sense
semantically?

All I am concerned about is whether a sentence is *syntactically*
correct and not the semantics/meaning of the sentence. I am fine with
a sentence like 'Eat cow' or 'Ball saw the man' as long as the parts
of speech are in their right places.
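
To make that concrete, here is a minimal Python sketch of such a check
(Python only for brevity; the same idea ports directly to Common Lisp).
The tiny lexicon and the function names are assumptions made for
illustration, not part of any library mentioned in this thread.  It tags
each word and accepts a sentence only if the tag sequence matches the
VERB optional-adjective NOUN rule, ignoring meaning entirely.

```python
import re

# Hypothetical toy lexicon: each word maps to exactly one part-of-speech tag.
LEXICON = {
    "eat": "VERB", "saw": "VERB",
    "fresh": "ADJ", "red": "ADJ",
    "beef": "NOUN", "cow": "NOUN", "ball": "NOUN", "man": "NOUN",
}

# The rule from the post, VERB optional-adjective NOUN, written as a
# regular expression over space-separated tags.
RULE = re.compile(r"VERB( ADJ)? NOUN")


def tag(sentence):
    """Return the list of tags, or None if any word is not in the lexicon."""
    words = sentence.lower().rstrip(".!?").split()
    tags = [LEXICON.get(word) for word in words]
    return None if None in tags else tags


def is_syntactic(sentence):
    """True if the tag sequence matches RULE exactly; meaning is ignored."""
    tags = tag(sentence)
    return tags is not None and RULE.fullmatch(" ".join(tags)) is not None
```

With this lexicon, "Eat beef", "Eat cow" and "Eat fresh beef" are all
accepted, while "Beef eat" is rejected.  A real checker would need many
more rules and some way to handle words that can carry several tags.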


> If you're ambitious, Princeton's Wordnet library
> (http://wordnet.princeton.edu/) provides a large lexical database with
> C bindings ... you might be able to use it as a starting point for
> your own grammar system.

Yes. WordNet is an option. Thanks for the link.

--
Thanks
Abhijith

From: Pascal J. Bourguignon
Subject: Re: help with NLP library
Date: Wed, 04 Feb 2009 13:20:23 +0000
Message-ID: <7c1vue2vag.fsf@pbourguignon.anevia.com>

abhi <·········@gmail.com> writes:

>>"Eat beef" is generally acceptable while "eat cow" sounds wrong to a native speaker.
>
> The sentence "Eat Beef" is I presume derived from one of the rules in
> the grammar, which could look something like:
>
> => VERB optional-adjective NOUN.
> VERB -> Eat
> NOUN -> Beef
>
> So, going by the above rule, wouldn't I get the sentence "Eat cow"
> which is syntactically right, even though it does not make any sense
> semantically.

Well that's your problem, because around here, we do eat cows, at
least when they don't produce milk anymore.  And even bulls!

-- 
__Pascal Bourguignon__

From: abhi
Subject: Re: help with NLP library
Date: Wed, 04 Feb 2009 17:14:51 +0000
Message-ID: <a26937e1-bfd0-4e4e-9f49-396a7fd71d47@r36g2000prf.googlegroups.com>

On Feb 4, 6:20 pm, ····@informatimago.com (Pascal J. Bourguignon)
wrote:
> abhi <·········@gmail.com> writes:
> >>"Eat beef" is generally acceptable while "eat cow" sounds wrong to a native speaker.
>
> > The sentence "Eat Beef" is I presume derived from one of the rules in
> > the grammar, which could look something like:
>
> > => VERB optional-adjective NOUN.
> > VERB -> Eat
> > NOUN -> Beef
>
> > So, going by the above rule, wouldn't I get the sentence "Eat cow"
> > which is syntactically right, even though it does not make any sense
> > semantically.
>
> Well that's your problem, because around here, we do eat cows, at
> least when they don't produce milk anymore.  And even bulls!
>
> --
> __Pascal Bourguignon__

Like I mentioned before, I am fine with a sentence like 'Eat cow' or
'Ball saw the man' as long as the parts of speech are in their right
places. :)

From: K Livingston
Subject: Re: help with NLP library
Date: Thu, 05 Feb 2009 00:26:42 +0000
Message-ID: <757b32a4-25a6-46e1-897c-4f20af4f0fd0@w39g2000prb.googlegroups.com>

On Feb 4, 11:14 am, abhi <·········@gmail.com> wrote:
> Like I mentioned before, I am fine with a sentence like 'Eat cow' or
> 'Ball saw the man' as long as the parts of speech are in their right
> places. :)

If you are only concerned with identifying *a* possibly syntactically
valid parse, there are probably some tools to help you.  If you want
*the* correct/intended parse, things get a lot more hairy for all the
reasons mentioned above, and I'm surprised that no one has even
brought up those pesky "time flies"; I really hate when they eat my
arrows.  ("Time flies like an arrow" has at least 7 different ways to
be interpreted.)  Andrew mentioned a few names to look at; another
thing to look for is projects that operate on the Penn Treebank, since
there is a lot of work on statistical language parsing / sentence
diagramming.  There's a whole ton of research and ideas surrounding
PCFGs (probabilistic context-free grammars) and related work.  In
their evaluations they do pretty well when they have a good corpus to
train on, but nothing is perfect.
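
The ambiguity is easy to reproduce on a small scale.  The following is
a minimal Python sketch of a CKY-style chart that counts parse trees
under a hand-written toy grammar in Chomsky normal form.  The grammar,
the lexicon and the function name are assumptions for illustration; they
are far smaller than anything a real PCFG parser uses, and there are no
probabilities here, only counts.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (an assumption made for this sketch).
# Each rule is (parent, left child, right child).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("NP", "Det", "N"),
    ("NP", "N", "N"),    # noun compound, e.g. "time flies" as a kind of fly
    ("VP", "V", "PP"),
    ("VP", "V", "NP"),
    ("PP", "P", "NP"),
]

# Words may carry several categories; a bare noun can also be a full NP.
LEXICON = {
    "time": ["N", "NP", "V"],
    "flies": ["N", "NP", "V"],
    "like": ["V", "P"],
    "an": ["Det"],
    "arrow": ["N"],
}


def count_parses(words, start="S"):
    """Count the distinct parse trees that derive `words` from `start`."""
    n = len(words)
    # chart[(i, j)][symbol] = number of trees for words[i:j] rooted at symbol
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for symbol in LEXICON.get(word, []):
            chart[(i, i + 1)][symbol] += 1
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):          # split point
                for parent, left, right in BINARY_RULES:
                    ways = chart[(i, k)][left] * chart[(k, j)][right]
                    if ways:
                        chart[(i, j)][parent] += ways
    return chart[(0, n)][start]
```

With this grammar, count_parses("time flies like an arrow".split())
finds two trees: one where "time" is the subject and "flies" the verb,
and one where "time flies" is a noun compound and "like" the verb.  A
richer grammar would find more.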

All that said, Abhi, out of curiosity, what are you actually trying to
accomplish?  (It's hard to see sentence diagramming as a goal in and
of itself, unless I go back to my 4th grade "language arts" class.)

good luck,
Kevin

From: Rob Warnock
Subject: Re: help with NLP library
Date: Thu, 05 Feb 2009 01:19:37 +0000
Message-ID: <1NidnazcsbM0oxfUnZ2dnUVZ_sninZ2d@speakeasy.net>

K Livingston  <······················@gmail.com> wrote:
+---------------
| If you want *the* correct/intended parse, things get a lot more
| hairy for all the problems mentioned above, and I'm surprised that
| no one has even brought up those pesky "time flies," I really hate
| when they eat my arrows. ("Time flies like an arrow," has at least
| 7 different ways to be interpreted.)
+---------------

Ah, yezz, and since no one else has mentioned it yet, let's
see a program parse *this* grammatically correct sentence:  ;-}

  Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.


-Rob

p.s. Wikipedia actually has an article with that exact title!
Includes traditional sentence diagram of the parse.

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607

From: GP lisper
Subject: Re: help with NLP library
Date: Thu, 05 Feb 2009 07:24:50 +0000
Message-ID: <slrngol522.hjh.spambait@phoenix.clouddancer.com>

On Wed, 04 Feb 2009 19:19:37 -0600, <····@rpw3.org> wrote:
> K Livingston  <······················@gmail.com> wrote:
> +---------------
>| If you want *the* correct/intended parse, things get a lot more
>| hairy for all the problems mentioned above, and I'm surprised that
>| no one has even brought up those pesky "time flies," I really hate
>| when they eat my arrows. ("Time flies like an arrow," has at least
>| 7 different ways to be interpreted.)
> +---------------
>
> Ah, yezz, and since no one else has mentioned it yet, let's
> see a program parse *this* grammatically correct sentence:  ;-}
>
>   Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.

I'd settle for someone being able to grok it without outside
assistance.  It's clearly an outlier to humans.

________

Since the subject has been raised, I want to parse (??  I'm not sure
what the correct terminology is for the desired outcome) the following:

Hillenbrand grounded out (shortstop to first);
Finley flied to right;
Bautista singled to right;
Cintron doubled to center [Bautista scored, Cintron to third (error by Estrada)];
Hammock was called out on strikes;
DeRosa flied to right;
Green grounded out (shortstop to first);
Hampton struck out; 
Gonzalez grounded out (pitcher to first);
Hillenbrand grounded out (third to first);
Finley flied to right;

Each line is an event in a baseball game.
This sentence structure seems to be
Player -> action -> location

The possible locations and actions are a very limited set and could be
looked up in some dictionary.  A complication is that the 'location' can
be a small phrase at times, so how is it identified?

The action tends to be a single word, maybe two words (I'm still
collecting data and haven't sorted it yet).  Some player names
potentially match actions... or at least some part of baseball.  I've
forgotten the examples of collisions.

The player name can be two words at times, and comes from a set of a
few thousand names.  For a particular game, other information sources
can limit that to two dozen.

The goal is to simplify the English into a state machine description
of the game.  The possible states are limited to the eight
possibilities for runners on the 3 bases, combined with the number of
outs (3 outs to a side).  Player names can be replaced with the index
to a player database.
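
As a rough illustration of how small that state space is, here is a
Python sketch (the class and constant names are hypothetical) that
encodes the runners as a 3-bit mask and the outs as a counter, giving
8 x 3 = 24 states per half-inning.  It does not model how any
particular play moves the runners.

```python
from dataclasses import dataclass

# One bit per base, so the eight runner configurations are 0..7.
FIRST, SECOND, THIRD = 1, 2, 4


@dataclass(frozen=True)
class HalfInning:
    bases: int = 0   # bitmask of occupied bases, 0..7
    outs: int = 0    # 0..2; the third out ends the half-inning

    def with_out(self):
        """Record an out; return None when the side is retired."""
        if self.outs == 2:
            return None
        return HalfInning(self.bases, self.outs + 1)

    def with_runner_on(self, base):
        """Return a new state with `base` (FIRST/SECOND/THIRD) occupied."""
        return HalfInning(self.bases | base, self.outs)

    def describe(self):
        bases = ((FIRST, "first"), (SECOND, "second"), (THIRD, "third"))
        names = [name for bit, name in bases if self.bases & bit]
        return f"{self.outs} out, runners on: {', '.join(names) or 'none'}"


# Every reachable (bases, outs) combination within one half-inning.
ALL_STATES = [HalfInning(b, o) for o in range(3) for b in range(8)]
```

Because the states are frozen dataclasses they are hashable, so they can
be used directly as keys in a transition table.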

So far, my divide and conquer scheme is to attack both ends, recognize
player names and locations and work into the middle.  While writing
this, I think that I should be able to get player and action easily,
and the rest must be location.  It should require nothing beyond
matching.
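
A minimal sketch of that matching scheme in Python, assuming a per-game
roster and a fixed list of action phrases (both lists below are
illustrative, and "Joe Example" is a made-up two-word name): find the
longest roster name at the start of the line, then the longest action
phrase, and treat whatever remains as the location.

```python
# Illustrative data only; a real roster would come from the box score.
ROSTER = ["Hillenbrand", "Finley", "Bautista", "Cintron", "Hammock",
          "Hampton", "Joe Example"]
ACTIONS = ["grounded out", "flied", "singled", "doubled", "struck out",
           "was called out on strikes"]


def longest_prefix(text, candidates):
    """Return the longest candidate that starts `text` at a word boundary."""
    best = None
    for candidate in candidates:
        if text == candidate or text.startswith(candidate + " "):
            if best is None or len(candidate) > len(best):
                best = candidate
    return best


def split_event(line):
    """Split one event line into player, action and location (or None)."""
    text = line.strip().rstrip(";").strip()
    player = longest_prefix(text, ROSTER)
    if player is None:
        return None
    rest = text[len(player):].strip()
    action = longest_prefix(rest, ACTIONS)
    if action is None:
        return None
    location = rest[len(action):].strip()
    return {"player": player, "action": action, "location": location or None}
```

For "Finley flied to right;" this yields player "Finley", action "flied"
and location "to right".  Bracketed annotations such as the one on the
Cintron line simply end up inside the location string and would need a
second pass.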

The above samples are the easiest from the world of descriptive
baseball; I'd like to apply some technique that will expand to the
more difficult sentences (they apply to a state machine with a richer
set of states).

Is there (cross fingers) some magic lisp package just sitting around
to apply to this problem?  Someone mentioned langutils, but getting
that running is like getting cl-persistance running.

-- 
Lisp : 'My God, it's full of cars!'

From: K Livingston
Subject: Re: help with NLP library
Date: Thu, 05 Feb 2009 20:30:10 +0000
Message-ID: <408275f1-8867-49b3-997f-8636da14018b@r37g2000prr.googlegroups.com>

On Feb 5, 1:24 am, GP lisper <········@CloudDancer.com> wrote:
> On Wed, 04 Feb 2009 19:19:37 -0600, <····@rpw3.org> wrote:
> > K Livingston  <······················@gmail.com> wrote:
> > +---------------
> >| If you want *the* correct/intended parse, things get a lot more
> >| hairy for all the problems mentioned above, and I'm surprised that
> >| no one has even brought up those pesky "time flies," I really hate
> >| when they eat my arrows. ("Time flies like an arrow," has at least
> >| 7 different ways to be interpreted.)
> > +---------------
>
> > Ah, yezz, and since no one else has mentioned it yet, let's
> > see a program parse *this* grammatically correct sentence:  ;-}
>
> >   Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.
>
> I'd settle for someone being able to grok it without outside
> assistance.  It's clearly an outlier to humans.
>
> ________
>
> Since the subject has been raised, I want to parse (??  I'm not sure
> what the correct terminology is for the desired outcome) the following:

Your handle is different from the OP's... are there two projects here,
or two of you in there?

> Hillenbrand grounded out (shortstop to first);
> Finley flied to right;
> Bautista singled to right;
> Cintron doubled to center [Bautista scored, Cintron to third (error by Estrada)];
> Hammock was called out on strikes;
> DeRosa flied to right;
> Green grounded out (shortstop to first);
> Hampton struck out;
> Gonzalez grounded out (pitcher to first);
> Hillenbrand grounded out (third to first);
> Finley flied to right;
>
> Each line is an event in a baseball game.
> This sentence structure seems to be
> Player -> action -> location
>
> The possible locations and actions are a very limited set and could be
> looked up in some dictionary.  A complication is that the 'location' can
> be a small phrase at times, so how is it identified?
>
> The action tends to be a single word, maybe two words (I'm still
> collecting data and haven't sorted it yet).  Some player names
> potentially match actions... or at least some part of baseball.  I've
> forgotten the examples of collisions.
>
> The player name can be two words at times, and comes from a set of a
> few thousand names.  For a particular game, other information sources
> can limit that to two dozen.
>
> The goal is to simplify the English into a state machine description
> of the game.  The possible states are limited to the eight
> possibilities for runners on the 3 bases, with 3 outs in a side.
> Player names can be replaced with the index to a player database.

Look at all that great extra domain information you have!  You
probably do not want a general-form solution.  The statistical parsers
trained on newspaper stories are probably only going to fare so well
here (my prediction).  You could retrain them on your domain (you
might have to annotate a lot of examples), and then they might do
really well.  But this domain is actually really small.  How many
things actually happen in a baseball game?  A few events cover 80% of
the cases.

I would look instead at the type of work done by the IE (Information
Extraction) community.  They do things like "<x> singled to <y>", and
they would also do things like entity recognition and normalization on
the matched parts.  The KnowItAll system is a good reference (although
I don't know what they have available to give you).  Also, a kind of
hierarchical pattern matching might work well, as you indicated you
are exploring.  There you have patterns like "<Player> singled to
<Location>", and then you have other patterns or pieces recognizing and
annotating parts of the text as a player or a location, etc.  I do this
kind of pattern matching in my NLU work.  Systems like DMAP match
language this way.  (I am not affiliated with the OpenDMAP work, but
they have a system you can download that might help you bootstrap,
although it won't have any knowledge/patterns relating to baseball.
You may just want to start with a toy pattern matcher first and see
how far you get.)
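
For what it's worth, a toy matcher along those lines fits in a few
lines of Python.  This is only a sketch under stated assumptions: the
slot vocabularies, the template list and the function names are made up
for illustration, and it has nothing to do with how KnowItAll or
OpenDMAP are actually implemented.  Each template is compiled into a
regular expression whose slots are filled by lower-level patterns.

```python
import re

# Lower level: what may fill each kind of slot (illustrative vocabularies).
SLOT_PATTERNS = {
    "Player": r"[A-Z][A-Za-z]+",
    "Location": r"left|center|right|first|second|third",
    "Fielder": r"pitcher|catcher|shortstop|first|second|third",
}

# Upper level: event templates written in terms of those slots.
TEMPLATES = [
    ("single", "<Player> singled to <Location>"),
    ("double", "<Player> doubled to <Location>"),
    ("fly_out", "<Player> flied to <Location>"),
    ("ground_out", "<Player> grounded out (<Fielder> to <Fielder>)"),
]


def compile_template(template):
    """Turn '<Slot>' markers into named groups; escape the literal text."""
    parts, seen = [], {}
    for piece in re.split(r"(<\w+>)", template):
        if piece.startswith("<") and piece.endswith(">"):
            slot = piece[1:-1]
            seen[slot] = seen.get(slot, 0) + 1
            # A repeated slot gets a numbered group name: Fielder, Fielder2.
            name = slot if seen[slot] == 1 else f"{slot}{seen[slot]}"
            parts.append(f"(?P<{name}>{SLOT_PATTERNS[slot]})")
        else:
            parts.append(re.escape(piece))
    return re.compile("".join(parts))


COMPILED = [(event, compile_template(t)) for event, t in TEMPLATES]


def match_event(line):
    """Return (event name, slot fillers) for the first matching template."""
    text = line.strip().rstrip(";").strip()
    for event, regex in COMPILED:
        m = regex.fullmatch(text)
        if m:
            return event, m.groupdict()
    return None
```

Lines that match no template come back as None, which gives you a
ready-made list of the cases that still need a pattern.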

Leveraging domain knowledge and writing domain specific patterns might
actually make you more noise tolerant.  For example, speech
recognition systems that can operate in a domain or in a task context
where there are only so many options for what should come next can
recognize speech with a significantly higher degree of accuracy than
open/free-form speech recognizers.  Because they can leverage domain
knowledge, and predict what's next, if they have to guess at what
happened they have a more targeted list of options.  Domain knowledge
is also needed to help disambiguate pronouns or other anaphora (which
you may or may not have).  E.g. "he" or saying something like "the
mound" to refer to the pitcher.  (although, in-domain those references
might be very clear and just be more names, "blue" is an unambiguous
reference in baseball, although, I guess in MLB they wear black
now?...)

Also, using a larger context could help you in the long run with
recognition errors.  For example, say you know there are men on first
and second, and then something happens that you can't recognize.  Then
you hear that the bases are loaded.  You can reasonably infer that the
something that happened involved the batter reaching first.  You are
still stuck disambiguating walk, hit by pitch, single, etc., but you at
least know who's on first.  (OK, that just came out; honest, I wasn't
doing all that setup for a joke.)  Likewise, if in the same situation
you hear that a guy from the other team is coming up to bat, and there
were 2 outs before, well, then that unknown action was an out of some
kind.
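
Those two inferences can be written down directly.  The following
Python sketch is only an illustration of the idea; the function name,
the (bases, outs) encoding and the returned labels are all assumptions,
and it covers just the two situations described above.

```python
# One bit per base; a state is a (bases, outs) pair.
FIRST, SECOND, THIRD = 1, 2, 4
LOADED = FIRST | SECOND | THIRD


def infer_unknown_event(before, after):
    """Guess what an unrecognized event was from the states around it.

    `before` and `after` are (bases, outs) pairs.  `after` is None when
    the next thing we hear is the other team coming up to bat.
    """
    bases_before, outs_before = before
    if after is None:
        # The side was retired; with 2 outs already, one more out did it.
        return "out of some kind" if outs_before == 2 else "unknown"
    bases_after, outs_after = after
    if outs_after == outs_before + 1:
        return "out of some kind"
    if (outs_after == outs_before
            and bases_before == (FIRST | SECOND)
            and bases_after == LOADED):
        return "batter reached first"
    return "unknown"
```

Anything the function cannot explain stays "unknown", which is the
point at which other sources would have to settle the question.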

> So far, my divide and conquer scheme is to attack both ends, recognize
> player names and locations and work into the middle.  While writing
> this, I think that I should be able to get player and action easily,
> and the rest must be location.  It should require nothing beyond
> matching.

I agree.  So why try to diagram the sentence?  Just start matching ;)

> The above samples are the easiest from the world of descriptive
> baseball, I'd like to apply some technique that will expand to the
> more difficult sentences (they apply to a state machine with a richer
> set of states)

Do you think those systems will look significantly different?  Also, if
you are trying to actively track the state of the machine, I believe
that all that added knowledge can help you understand the language
better (and more easily) than a general/traditional NLP approach.

> Is there (cross fingers) some magic lisp package just sitting around
> to apply to this problem?  Someone mentioned langutils, but getting
> that running is like getting cl-persistance running.
>
> --
> Lisp : 'My God, it's full of cars!'

Good luck,
Kevin

From: GP lisper
Subject: Re: help with NLP library
Date: Fri, 06 Feb 2009 03:44:55 +0000
Message-ID: <slrngonchn.i2n.spambait@phoenix.clouddancer.com>

On Thu, 5 Feb 2009 12:30:10 -0800 (PST), <······················@gmail.com> wrote:
> On Feb 5, 1:24 am, GP lisper <········@CloudDancer.com> wrote:
>> On Wed, 04 Feb 2009 19:19:37 -0600, <····@rpw3.org> wrote:
>> > K Livingston  <······················@gmail.com> wrote:
>> > +---------------
>> >| If you want *the* correct/intended parse, things get a lot more
>> >| hairy for all the problems mentioned above, and I'm surprised that
>> >| no one has even brought up those pesky "time flies," I really hate
>> >| when they eat my arrows. ("Time flies like an arrow," has at least
>> >| 7 different ways to be interpreted.)
>> > +---------------
>>
>> > Ah, yezz, and since no one else has mentioned it yet, let's
>> > see a program parse *this* grammatically correct sentence:  ;-}
>>
>> >   Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.
>>
>> I'd settle for someone being able to grok it without outside
>> assistance.  It's clearly an outlier to humans.
>>
>> ________
>>
>> Since the subject has been raised, I want to parse (??  I'm not sure
>> what the correct terminology is for the desired outcome) the following:
>
>
> your handle is different than the OP,.. are there two projects here,
> or two of you in there?

I was reading the thread since it had 'NLP' in the subject, and I
generally read anything Rob Warnock writes.  Completely different
project and poster.

Looking over the suggestions.  Thanks all!

From: abhi
Subject: Re: help with NLP library
Date: Fri, 06 Feb 2009 04:44:02 +0000
Message-ID: <14cc053a-31a2-4fe2-8ce2-21d31aa7366e@v39g2000pro.googlegroups.com>

I have got enough information to keep me busy and thinking for a long
time :).
Thanks everyone!

--
Regards
Abhijith

From: GP lisper
Subject: Re: help with NLP library
Date: Fri, 06 Feb 2009 05:35:58 +0000
Message-ID: <slrngonj1u.i2n.spambait@phoenix.clouddancer.com>

On Thu, 5 Feb 2009 12:30:10 -0800 (PST), <······················@gmail.com> wrote:
> On Feb 5, 1:24 am, GP lisper <········@CloudDancer.com> wrote:
>
>> Hillenbrand grounded out (shortstop to first);
>>
>> Each line is an event in a baseball game.
>> This sentence structure seems to be
>> Player -> action -> location
>>
>> The goal is to simplify the English into a state machine description
>> of the game.  The possible states are limited to the eight
>> possibilities for runners on the 3 bases, with 3 outs in a side.
>> Player names can be replaced with the index to a player database.
>
> How many things actually happen in a baseball game?  A few events
> cover 80% of the cases.

It depends on the level of detail one is after, but that is strongly
limited by the descriptions available.  An out is the most common
event, but for evaluating a pitcher, knowing that it came from a
swinging third strike is different from knowing that the ball was
caught on the centerfield warning track.  Well, a single event is not
much use, but 50 such events are.


> I would look instead at the type of work done by the IE (Information
> Extraction) community.  They do things like "<x> singled to <y>".  And
> they would also do things like entity recognition and normalization on
> the matched parts.  The system KnowItAll is a good reference.
> (although I don't know what they have available to give you)  Also a
> kind of hierarchical pattern matching might work well as you indicated
> you are exploring.  There you have patterns like "<Player> singled to
> <Location>" and then you have other patterns or pieces recognizing and
> annotating parts of the text as a player or a location etc.  I do this

I'll google KnowItAll and DMAP to see what's up.  But it occurs to
me that if I have a pattern matcher for <Player>, another for <action>
and a third for <location>, there are bound to be some conflicts where
two matchers claim the same word.
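
Such conflicts can at least be listed ahead of time.  A small Python
sketch, assuming each matcher is backed by a plain word list (the lists
and the surname "Right" below are made up for illustration): it reports
every word that more than one matcher would claim.

```python
def find_conflicts(lexicons):
    """Map each word claimed by several matchers to the sorted claimants."""
    claims = {}
    for label, words in lexicons.items():
        for word in words:
            # Compare case-insensitively, so "Right" collides with "right".
            claims.setdefault(word.lower(), set()).add(label)
    return {w: sorted(labels) for w, labels in claims.items()
            if len(labels) > 1}


# Illustrative vocabularies; "Right" is a made-up surname.
LEXICONS = {
    "player": {"Green", "Hampton", "Right"},
    "action": {"singled", "doubled", "flied"},
    "location": {"right", "center", "first", "third"},
    "fielder": {"shortstop", "pitcher", "first", "third"},
}
```

Here find_conflicts(LEXICONS) flags "right" (player and location) and
"first" and "third" (fielder and location).  Position in the line or
capitalization could then break the tie, since the comparison above
deliberately ignores case.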


> kind of pattern matching in my NLU work.  Systems like DMAP match
> language this way.  (I am not affiliated with the OpenDMAP work, but
> they have a system you can download, that might help you bootstrap.
> Although it won't have any knowledge/patterns relating to baseball.
> You may just want to start with a toy pattern matcher first and see
> how far you get.)

OK, I see what that can do, I'm just about finished gathering 'input'.

> Leveraging domain knowledge and writing domain specific patterns might
> actually make you more noise tolerant.  For example, speech
> recognition systems that can operate in a domain or in a task context
> where there are only so many options for what should come next can
> recognize speech with a significantly higher degree of accuracy than
> open/free-form speech recognizers.  Because they can leverage domain
> knowledge, and predict what's next, if they have to guess at what
> happened they have a more targeted list of options.  Domain knowledge
> is also needed to help disambiguate pronouns or other anaphora (which
> you may or may not have).  E.g. "he" or saying something like "the
> mound" to refer to the pitcher.  (although, in-domain those references
> might be very clear and just be more names, "blue" is an unambiguous
> reference in baseball, although, I guess in MLB they wear black
> now?...)

Umpires?  You cannot tell the color of their uniform in the old black
and white archives (but you can see that they were slim).  There is a
factor in the game that does depend on umpires, but if you specialize
into that level of detail, you end up with too few cases to be able to
make a statistical judgement.

> Also using a larger context could help you in the long run with
> recognition errors.  For example, if you know there is a man on first
> and second, and then something happens you can't recognize.  Then you
> hear that that bases are loaded.  You can reasonably infer that the
> something that happened involved the batter reaching first.  You are
> still stuck disambiguating walk, hit by pitch, balk, etc. but you at
> least know who's on first.  (ok that just came out, honest I wasn't
> doing all that set up for a joke.)  Likewise in the same situation you
> hear that a guy from the other team is coming up to bat, and there
> were 2 outs before, oh well then that unknown action was an out of
> some kind.

Well, as you and others have pointed out, the domain is limited but it
is also rich in auxiliary information.  By referring to the box score,
you can cut down on the possible reasons for a change of state when
encountering ambiguity.  Having to manually solve a handful of games
out of a couple thousand wouldn't be a hardship either.

> Good luck,
> Kevin

Thank you for your comments Kevin.


-- 
Lisp : 'My God, it's full of cars!'

From: Pascal J. Bourguignon
Subject: Re: help with NLP library
Date: Thu, 05 Feb 2009 12:28:12 +0000
Message-ID: <87hc39hxur.fsf@galatea.local>

GP lisper <········@CloudDancer.com> writes:
> Is there (cross fingers) some magic lisp package just sitting around
> to apply to this problem?  Someone mentioned langutils, but getting
> that running is like getting cl-persistance running.

There is an NLP Linux distribution.  I don't remember if they have Lisp
packages in it.  But in any case, you could use FFI to use them from
Lisp.  http://morphix-nlp.berlios.de/

-- 
__Pascal Bourguignon__

From: GP lisper
Subject: Re: help with NLP library
Date: Fri, 06 Feb 2009 04:36:29 +0000
Message-ID: <slrngonfid.i2n.spambait@phoenix.clouddancer.com>

On Thu, 05 Feb 2009 13:28:12 +0100, <···@informatimago.com> wrote:
>
>
> GP lisper <········@CloudDancer.com> writes:
>> Is there (cross fingers) some magic lisp package just sitting around
>> to apply to this problem?  Someone mentioned langutils, but getting
>> that running is like getting cl-persistance running.
>
> There is a NLP linux distribution.  I don't remember if they have lisp
> packages in it.  But in any case, you could use FFI to use them from
> lisp. http://morphix-nlp.berlios.de/

A bootable CD is a nice touch.  I can pop that on a laptop and see
what can be done.  Thank you.

-- 
Lisp : 'My God, it's full of cars!'

From: Thomas A. Russ
Subject: Re: help with NLP library
Date: Thu, 05 Feb 2009 23:09:01 +0000
Message-ID: <ymiljsk8os2.fsf@blackcat.isi.edu>

GP lisper <········@CloudDancer.com> writes:

> Hillenbrand grounded out (shortstop to first);
> Finley flied to right;
> Bautista singled to right;
> Cintron doubled to center [Bautista scored, Cintron to third (error by Estrada)];

I doubt that an off-the-shelf parser will help too much with this.  Most
parsers operate on the assumption that the text is grammatical, written
English.  Perhaps looking at parsers used in speech recognition systems
would be more fruitful, since they would be more tolerant of phrases,
short text bits and ungrammatical input.

It also appears that there are a number of shorthand phrases that are
very domain-specific.  For example, the phrase "Cintron to third" is
really hard to parse into anything meaningful unless you know that the
domain is baseball, since there is an implied verb "moved" or "advanced"
that doesn't appear in the input.  It also helps to know that "third" is
not really an ordinal number but rather the (shorthand) name of a
location.
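
As a rough illustration of the point about domain knowledge, here is a
minimal Python sketch that expands one such shorthand by hand.  The
pattern, the BASES table and the expand() helper are all hypothetical
names made up for this example; they are not part of any parser
mentioned in this thread.

```python
import re

# Assumption: in this domain "first", "second", "third" and "home" name
# locations on the field, not ordinal numbers.
BASES = {
    "first": "first base",
    "second": "second base",
    "third": "third base",
    "home": "home plate",
}

# Shorthand of the form "<Player> to <base>", with no verb in the input.
SHORTHAND = re.compile(
    r"^(?P<player>[A-Z][a-z]+) to (?P<base>first|second|third|home)$"
)

def expand(fragment):
    """Return a structured reading of the shorthand, or None if it does not match."""
    match = SHORTHAND.match(fragment.strip())
    if match is None:
        return None
    return {
        "verb": "advance",  # the implied verb is supplied by the domain, not the text
        "player": match.group("player"),
        "base": BASES[match.group("base")],
    }

print(expand("Cintron to third"))
print(expand("Finley flied to right"))
```

The second call returns None because that fragment already has a verb
and "right" is not a base; a real system would need many more such
patterns.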

-- 
Thomas A. Russ,  USC/Information Sciences Institute

From: GP lisper
Subject: Re: help with NLP library
Date: Fri, 06 Feb 2009 04:22:02 +0000
Message-ID: <slrngonena.i2n.spambait@phoenix.clouddancer.com>

On 05 Feb 2009 15:09:01 -0800, <···@sevak.isi.edu> wrote:
> GP lisper <········@CloudDancer.com> writes:
>
>> Hillenbrand grounded out (shortstop to first);
>> Finley flied to right;
>> Bautista singled to right;
>> Cintron doubled to center [Bautista scored, Cintron to third (error by Estrada)];
>
> I doubt that an off-the-shelf parser will help too much with this.  Most
> parsers operate on the assumption that the text is grammatical, written
> English.  Perhaps looking at parsers used in speech recognition systems
> would be more fruitful, since they would be more tolerant of phrases,
> short text bits and ungrammatical input.

That is a useful insight for me.  I didn't relish learning enough NLP
to be able to use the tools; I'm still slumming on the XML stuff.
A custom solution is fine.  I suppose I'll learn enough from that
effort to see the next steps myself.

> It also appears that there are a number of shorthand phrases that are
> very domain-specific.  For example, the phrase "Cintron to third" is
> really hard to parse into anything meaningful unless you know that the
> domain is baseball, since there is an implied verb "moved" or "advanced"
> that doesn't appear in the input.  It also helps to know that "third" is
> not really an ordinal number but rather the (shorthand) name of a
> location.

I was considering that the play descriptions should be considered as a
synthetic language, which would account for the features you point out.

There are a few sources of play-by-play, depending on the year of
interest.  In some cases, the events are described in this fashion.

-------------------------------------------------------
Matt Kemp grounds out, second baseman Anderson Hernandez to first
baseman Ronnie Belliard.

Andre Ethier singles on a line drive to left fielder Willie Harris.

Manny Ramirez homers (27) on a fly ball to left field.  Andre Ethier
scores.

Casey Blake strikes out on foul tip.

Danny Ardoin pops out to first baseman Ronnie Belliard in foul
territory. 
-------------------------------------------------------

The sentences are fuller; does that change your opinion any?  I guess
I can use the various levels of detail (the top examples vs. these
bottom examples) to validate a decoding into an expanded state machine
description of the game.

-- 
Lisp : 'My God, it's full of cars!'

From: Thomas A. Russ
Subject: Re: help with NLP library
Date: Fri, 06 Feb 2009 19:36:34 +0000
Message-ID: <ymiwsc3l5ml.fsf@blackcat.isi.edu>

GP lisper <········@CloudDancer.com> writes:

> The sentences are fuller; does that change your opinion any?  I guess
> I can use the various levels of detail (the top examples vs. these
> bottom examples) to validate a decoding into an expanded state machine
> description of the game.

Well, I'm not well-versed enough in parsers to venture a guess,
although I did try a couple of these in Stanford's link parser (see below).

This page is a nice redirect to several parsers, if you want to
investigate.  Some of them also have web-interfaces that you can use to
try out the results.  I'm not sure if any of them are actually
implemented in lisp, though.

 http://ai.stanford.edu/~rion/parsing/index.html

Some examples from Stanford's link parser:

** Andre Ethier singles on a line drive to left fielder Willie Harris.

++++Time                                          0.01 seconds (38.24 total)
Found 16 linkages (16 with no P.P. violations)
  Linkage 1, cost vector = (UNUSED=0 DIS=3 AND=0 LEN=25)

                 +------------MVp-----------+                             
                 +---------Osn--------+     +-------------Js-------------+
                 |        +-----Ds----+     |            +-------GN------+
  +---G--+---Ss--+---K--+ |    +--AN--+     |    +---A---+        +---G--+
  |      |       |      | |    |      |     |    |       |        |      |
Andre Ethier singles.v on a line.n drive.n to left.a fielder.n Willie Harris 

Constituent tree:

(S (NP Andre Ethier)
   (VP singles
       (PRT on)
       (NP a line drive)
       (PP to
           (NP (NP left fielder)
               Willie Harris))))


** Casey Blake strikes out on foul tip.

No complete linkages found.
++++Time                                          0.01 seconds (38.23 total)
Found 4 linkages (4 with no P.P. violations) at null count 1
  Linkage 1, cost vector = (UNUSED=1 DIS=0 AND=0 LEN=9)

    +-----------------------Xp----------------------+
    +------Wd-----+       +----MVp---+              |
    |       +--G--+---Ss--+---K--+   +-Jp-+         |
    |       |     |       |      |   |    |         |
LEFT-WALL Casey Blake strikes.v out on foul.n [tip] . 

Constituent tree:

(S (S (NP Casey Blake)
      (VP strikes
          (PRT out)
          (PP on
              (NP foul))))
   tip .)




-- 
Thomas A. Russ,  USC/Information Sciences Institute

From: GP lisper
Subject: Re: help with NLP library
Date: Sun, 08 Feb 2009 11:07:03 +0000
Message-ID: <slrngotf6n.dbs.spambait@phoenix.clouddancer.com>

On 06 Feb 2009 11:36:34 -0800, <···@sevak.isi.edu> wrote:
> GP lisper <········@CloudDancer.com> writes:
>
> Some examples from Stanford's link parser:
>
> ** Andre Ethier singles on a line drive to left fielder Willie Harris.
>
> ++++Time                                          0.01 seconds (38.24 total)
> Found 16 linkages (16 with no P.P. violations)
>   Linkage 1, cost vector = (UNUSED=0 DIS=3 AND=0 LEN=25)
>
>                  +------------MVp-----------+                             
>                  +---------Osn--------+     +-------------Js-------------+
>                  |        +-----Ds----+     |            +-------GN------+
>   +---G--+---Ss--+---K--+ |    +--AN--+     |    +---A---+        +---G--+
>   |      |       |      | |    |      |     |    |       |        |      |
> Andre Ethier singles.v on a line.n drive.n to left.a fielder.n Willie Harris 
>
> Constituent tree:
>
> (S (NP Andre Ethier)
>    (VP singles
>        (PRT on)
>        (NP a line drive)
>        (PP to
>            (NP (NP left fielder)
>                Willie Harris))))

When I saw the above, I smiled since the "problem" of recognizing
player names looked solved.  Then the light bulb finally lit up and I
remembered that names are proper nouns, hence capitalized.  I did a
little coding, and some other phrasing is also capitalized in my test
case of 130k events (that's about half a season), but mostly this
problem looks solved to me.  Once names can be removed, the events
look like

P singles on a line drive to left fielder P.

and that can be resolved with a lookup table.

Thanks to
 K Livingston
 Thomas A. Russ

for your help.  :-)
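
A minimal Python sketch of the name-stripping idea described above,
under loud assumptions: a player name is taken to be any run of two or
more capitalized words, and EVENT_TABLE with its event codes is a
hypothetical lookup table invented for this example.  Other
capitalized phrasing and names such as "McDonald" would need extra
handling.

```python
import re

# Hypothetical lookup table: normalized event text -> event code.
EVENT_TABLE = {
    "P singles on a line drive to left fielder P.": "SINGLE",
    "P strikes out on foul tip.": "STRIKEOUT",
    "P pops out to first baseman P in foul territory.": "POPOUT",
}

# Assumption: a player name is two or more consecutive capitalized words.
NAME_RUN = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)+")

def normalize(event):
    """Replace every assumed player name with the placeholder P."""
    return NAME_RUN.sub("P", event)

def decode(event):
    """Return (event code or None, list of names found in the text)."""
    names = NAME_RUN.findall(event)
    return EVENT_TABLE.get(normalize(event)), names

print(decode("Andre Ethier singles on a line drive to left fielder Willie Harris."))
print(decode("Casey Blake strikes out on foul tip."))
```

Events whose normalized text is not in the table come back with a code
of None, which makes it easy to list the templates still missing.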

From: George Neuner
Subject: Re: help with NLP library
Date: Fri, 06 Feb 2009 07:55:25 +0000
Message-ID: <86rno4d30vdpbnunanarpuh99so5qifgp1@4ax.com>

On Wed, 4 Feb 2009 23:24:50 -0800, GP lisper
<········@CloudDancer.com> wrote:

>On Wed, 04 Feb 2009 19:19:37 -0600, <····@rpw3.org> wrote:
>> K Livingston  <······················@gmail.com> wrote:
>>
>>   Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.
>
>I'd settle for someone being able to grok it without outside
>assistance.  It's clearly an outlier to humans.

http://www.cse.buffalo.edu/~rapaport/BuffaloBuffalo/buffalobuffalo.html

George

From: George Neuner
Subject: Re: help with NLP library
Date: Tue, 03 Feb 2009 20:37:28 +0000
Message-ID: <vj9ho4drl067bj7gkh54ucr78edaug6nga@4ax.com>

On Mon, 2 Feb 2009 23:55:23 -0800 (PST), abhi <·········@gmail.com>
wrote:

>Hi,
>   I am looking for a common lisp library which can check if a given
>sentence is syntactically correct ie., according to the rules of
>English grammar.
>  I have taken a look at Basic English Grammar library but it is not
>sufficient since it tags only the individual words and does not have
>syntactic checking as part of the functionality. I also tried to
>compile the code from PAIP (on sbcl) but it threw a lot of errors
>which I was not able to debug with my limited knowledge of CL.
>   I would appreciate it if someone could point me to a library (if
>there is one at all) which does what I am looking for.
>
>Thanks
>Abhijith

Don't know offhand of any library for doing that.  Besides which,
English grammar has several dozen exceptions to its handful of basic
rules.  Most grammar checkers I have encountered are easily confused
by correct complex sentences.  I have yet to see one correctly handle
3rd person passive voice (the way technical writing should be done).

Thinking about it, though, does OpenOffice have a grammar checker?  If
so, perhaps you could use it.

If you're ambitious, Princeton's WordNet library
(http://wordnet.princeton.edu/) provides a large lexical database with
C bindings ... you might be able to use it as a starting point for
your own grammar system.

George