Recently I've been trying to extract facts from an English language book
and I've been amazed at just how difficult this is. Of course, being a
programmer by trade, I've tried developing a program to do the work for
me. Haven't had a lot of success.
I was thinking that probably somebody has already started and/or
finished such code, and that it may be in the public domain. Can anyone
point me to such?
TIA, Steve Graham
Steve, what you propose to do is not entirely clear, but it sounds like
the holy grail of Natural Language Processing. We have devised some
pretty good techniques for classifying text content and retrieving
broadly-defined categories of information, but you aren't going to find
any simple code that processes large amounts of text with anything
approaching the accuracy of humans. It would probably be much more
cost-effective just to read the book and write down the facts that seem
relevant to you. Unfortunately, natural language does not by itself
contain all of the information necessary to recover its meaning. When
you read a book (or any text), the author is only helping you to
annotate information that largely exists inside your head. Unlike
computer code, natural language is extremely ambiguous, and we are only
able to understand text and conversations because our vast store of
knowledge helps us to disambiguate. Unless you can construct a program
that already knows a lot about the subject of the text, it will become
hopelessly confused by all the possible different interpretations that
words and phrases can have.
Could you be a little more specific about the types of "facts" you
propose to extract? What is it that you want to do with them?
Steve Graham wrote:
> Recently I've been trying to extract facts from an English language book
> and I've been amazed at just how difficult this is. Of course, being a
> programmer by trade, I've tried developing a program to do the work for
> me. Haven't had a lot of success.
>
> I was thinking that probably somebody has already started and/or
> finished such code, and that it may be in the public domain. Can anyone
> point me to such?
>
> TIA, Steve Graham
Steve Graham <·········@comcast.net> writes:
> I was thinking that probably somebody has already started and/or
> finished such code, and that it may be in the public domain. Can
> anyone point me to such?
Welcome to the world of Computational Linguistics! I used to be a
CL/semantics Ph.D. student/researcher/teacher before I left academia
and started hacking. The short answer is that this is incredibly
difficult. In fact, my personal view is that real high-quality CL is
tightly tied to solving the general problem of Artificial
Intelligence, i.e. something I don't expect to see in my own
lifetime.
However, there's a lot of fun to do without solving all the problems.
If you want to look at natural language parsing from a programmer's
viewpoint, have a look at Norvig's Paradigms of Artificial Intelligence
Programming. It's really fun and easy to understand for anyone who has
some Lisp experience. If you'd like to understand the mathematical
underpinnings of parsers and automata, I recommend Hopcroft and
Ullman's classic, now in a revised edition:
http://www.aw-bc.com/catalog/academic/product/0,4096,0201441241,00.html
Have fun!
--
(espen)
There are several ways to go about parsing English.
One is a Petri net; go to www.norvig.com and check out the example code
there.
Another site you might want to check out is alicebot.org.
Lots of ideas there. I don't think a version of Alice for Lisp has been
released yet.
The creators want to clean up the code and remove the web server code
first.
On Fri, 12 Mar 2004 04:26:14 GMT, Steve Graham <·········@comcast.net>
wrote:
> Recently I've been trying to extract facts from an English language book
> and I've been amazed at just how difficult this is. Of course, being a
> programmer by trade, I've tried developing a program to do the work for
> me. Haven't had a lot of success.
>
> I was thinking that probably somebody has already started and/or
> finished such code, and that it may be in the public domain. Can anyone
> point me to such?
>
> TIA, Steve Graham
--
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Well my example was fairly vague, wasn't it? If I was to parse the
following verse
I, NEPHI, having been born of goodly parents, therefore I was taught
somewhat in all the learning of my father; and having seen many
afflictions in the course of my days, nevertheless, having been highly
favored of the Lord in all my days; yea, having had a great knowledge of
the goodness and the mysteries of God, therefore I make a record of my
proceedings in my days.
I could glean the following:
1. Nephi had goodly parents
2. He was taught in the learning of his father
3. He saw many afflictions
4. He was highly favored of the Lord
5. He had a great knowledge of the goodness of God
6. He had a great knowledge of the mysteries of God
7. He mad a record of his proceedings
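For what it's worth, here is a crude sketch (in Python here, though the idea transfers directly to Lisp) of the kind of shallow pattern matching that can glean a few of these facts. The patterns are hand-written for this one verse and would fail on almost any other text, which illustrates just how shallow the approach is:

```python
import re

verse = ("I, NEPHI, having been born of goodly parents, therefore I was "
         "taught somewhat in all the learning of my father; and having seen "
         "many afflictions in the course of my days, nevertheless, having "
         "been highly favored of the Lord in all my days; yea, having had a "
         "great knowledge of the goodness and the mysteries of God, "
         "therefore I make a record of my proceedings in my days.")

# Hand-written patterns keyed to this verse's "having X-ed" constructions.
# Each maps a surface pattern to a crude subject-predicate "fact".
patterns = [
    (r"having been born of (\w+) parents",       "had {0} parents"),
    (r"having seen many (\w+)",                  "saw many {0}"),
    (r"having been highly favored of the (\w+)", "was highly favored of the {0}"),
    (r"I make a record of my (\w+)",             "made a record of his {0}"),
]

def extract_facts(text):
    facts = []
    for pattern, template in patterns:
        m = re.search(pattern, text)
        if m:
            facts.append("Nephi " + template.format(*m.groups()))
    return facts

for fact in extract_facts(verse):
    print(fact)
```

Even here, the program only finds what the patterns anticipate; the conjunction in "the goodness and the mysteries of God" is exactly the kind of attachment ambiguity that no regular expression can sensibly resolve.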
Steve
---
Steve Graham wrote:
> Recently I've been trying to extract facts from an English language book
> and I've been amazed at just how difficult this is. Of course, being a
> programmer by trade, I've tried developing a program to do the work for
> me. Haven't had a lot of success.
>
> I was thinking that probably somebody has already started and/or
> finished such code, and that it may be in the public domain. Can anyone
> point me to such?
>
> TIA, Steve Graham
Steve Graham wrote:
> Well my example was fairly vague, wasn't it? If I was to parse the
> following verse
>
> I, NEPHI, having been born of goodly parents, therefore I was taught
> somewhat in all the learning of my father; and having seen many
> afflictions in the course of my days, nevertheless, having been highly
> favored of the Lord in all my days; yea, having had a great knowledge of
> the goodness and the mysteries of God, therefore I make a record of my
> proceedings in my days.
>
> I could glean the following:
> 1. Nephi had goodly parents
> 2. He was taught in the learning of his father
Would that be his father, or our Father?
> 3. He saw many afflictions
> 4. He was highly favored of the Lord
> 5. He had a great knowledge of the goodness of God
> 6. He had a great knowledge of the mysteries of God
He said he had a great knowledge of "the goodness", and of the mysteries
of God. Parsing that to mean "the goodness of God" could be tricky.
> 7. He mad[e] a record of his proceedings
No, he makes a record of his proceedings. How did you infer the past tense?
You could also glean that all people born of handsome and/or unusually
large parents are taught somewhat in the learnings of their/our
father/Father, and that people who know (or think they know) the
mysteries of God tend to wax autobiographical.
There's also the issue of what students of fiction call the "unreliable
narrator" problem -- If not all statements are (or can be) true, how
does the computer decide which ones are false?
Seriously, it is my understanding that the current state of the art in
computational linguistics has enough difficulty parsing series of
simple, well formed sentences whose words are being used in one of their
mainstream senses. Parsing run-on sentences full of esoteric usage and
dodgy syntax would be a mighty challenge indeed.
--
Cameron MacKinnon
Toronto, Canada
Cameron MacKinnon <··········@clearspot.net> writes:
> Seriously, it is my understanding that the current state of the art in
> computational linguistics has enough difficulty parsing series of
> simple, well formed sentences whose words are being used in one of
> their mainstream senses. Parsing run-on sentences full of esoteric
> usage and dodgy syntax would be a mighty challenge indeed.
I have a problem doing that myself, never mind writing a program to
do it. The programming language would not even be an issue for me.
The problem itself is just too hard for me to solve.
--
Those who do not remember the history of Lisp are doomed to repeat it,
badly.
> (dwim x)
NIL
You might want to start browsing from
http://www.cl.cam.ac.uk/~asa28/useful_semiotics_research_links.htm
There is a link grammar parser at
http://www.link.cs.cmu.edu/link/
which I've heard good things about, but never personally used.
Pete
> From: Steve Graham <·········@comcast.net>
> Recently I've been trying to extract facts from an English language
> book and I've been amazed at just how difficult this is. Of course,
> being a programmer by trade, I've tried developing a program to do
> the work for me. Haven't had a lot of success.
Because it is ***EXTREMELY*** difficult to figure out how to do this
task, either the way we do it or some new way. Many people have spent
their entire professional lives working on the problem without much
success. The algorithm probably has to figure out what frame
(environment) it's in, in order to understand what meanings of words
are most likely, and the program needs to know most of the standard
scripts (joseki) in that frame in order to fill in the gaps that aren't
stated explicitly. For example, "She went in, found an empty table, sat
down, and started looking at the menu." would make sense in a
restaurant. Left unsaid is the fact she was looking at the menu in
order to decide what food to order, and that next a waiter would
probably come over and ask her what she'd like, and she'd pick some
appetizer or drink from the menu, order it (tell the waiter what she
had selected), waiter would walk away and come back a few minutes later
with the requested item, put it on the table, walk away, and she'd
start eating that delivered item. But there are so many variations on
the script within that single frame: She might be waiting for her date
to show up. She might need to use the toilet before ordering. She might
not like the prices and decide to leave without ordering. The program
would need to have a way to turn all those sequences of words into
precise meaning.
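This frame-and-script idea (essentially a Schank-style script) can be sketched as a data structure: a script is an ordered list of expected events in a frame, and locating a sentence's events in the script lets the program fill in the unstated earlier steps. The sketch below (Python, with an invented, hard-coded restaurant script) covers only the single example and none of the variations:

```python
# A Schank-style "script": an ordered list of expected events in a frame.
# Understanding a sentence = locating its events in the script; everything
# before the last matched step is assumed to have happened implicitly.
RESTAURANT_SCRIPT = [
    "enter",
    "find table",
    "sit down",
    "read menu",
    "order food",
    "waiter brings food",
    "eat",
    "pay",
    "leave",
]

def implied_steps(mentioned):
    """Return the script steps implied but not stated, given the
    steps a text explicitly mentions."""
    last = max(RESTAURANT_SCRIPT.index(m) for m in mentioned)
    return [s for s in RESTAURANT_SCRIPT[:last + 1] if s not in mentioned]

# "She went in, found an empty table, sat down, and started looking at
# the menu." mentions four consecutive steps, so nothing is implied yet;
# but if the text jumps ahead to "eat", ordering and delivery are implied.
print(implied_steps(["enter", "find table", "sit down", "read menu", "eat"]))
# -> ['order food', 'waiter brings food']
```

The hard part, of course, is everything this sketch assumes away: recognizing which frame applies, mapping free text onto script steps, and handling the endless variations within the frame.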
If any natural-language-understanding researchers feel the inclination,
I would like them to set up WebServer applications that demonstrate the
innards of their particular methodology. For example, if the Web user
types in "She asked the waiter if there were any specials of the day."
the n.l.u. server might tag each ambiguous word with a specific
meaning, either in natural language or in program-innards data
structures, to illustrate how the program understands the frame and the
special meanings of that word in that frame. It might also generate a
syntactic parse (like a "sentence diagram" you might have learned in
high-school English class, but more precise in structure). With lots of
such n.l.u.-fragment demo-servers available, we might be able to browse
them to get an idea of which such fragments represent the state of the art, and
some good researcher might figure out how to put some of the pieces
together to yield tomorrow's state-of-the-art n.l.u. software.
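The tagged output such a demo server might produce could look something like the sketch below; the frame name and the sense inventory are entirely invented for illustration:

```python
# Invented sense inventory: each ambiguous word maps frame -> sense gloss.
SENSES = {
    "specials": {
        "restaurant": "dishes offered today at a featured price",
        "television": "one-off broadcast programs",
    },
    "waiter": {
        "restaurant": "person who serves food to customers",
    },
}

def tag(sentence, frame):
    """Tag each known ambiguous word with its sense in the given frame."""
    tags = {}
    for word in sentence.lower().replace(".", "").split():
        if word in SENSES and frame in SENSES[word]:
            tags[word] = SENSES[word][frame]
    return tags

tagged = tag("She asked the waiter if there were any specials of the day.",
             frame="restaurant")
for word, sense in tagged.items():
    print(f"{word}: {sense}")
```

A real n.l.u. server would of course have to infer the frame itself rather than be handed it as a parameter, which is where the hard research lives.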
Appx. ten years ago some research center was seeking employees to code
massive quantities of common knowledge into databases that might
eventually be used by n.l.u. software. I haven't heard of such efforts
recently. I wonder whatever happened to those projects (other than the
fact that the recession probably ended their funding).
Robert Maas wrote:
> Appx. ten years ago some research center was seeking employees to code
> massive quantities of common knowledge into databases that might
> eventually be used by n.l.u. software. I haven't heard of such efforts
> recently. I wonder whatever happened to those projects (other than the
> fact that the recession probably ended their funding).
There's a company called Cycorp that's been doing this for some time;
to the best of my knowledge they're still at it. I think their
database of facts is intended for more than "just" NL applications.
--
Gareth McCaughan
.sig under construc
On Sun, 14 Mar 2004 23:33:42 -0800, <··········@YahooGroups.Com> wrote:
Have you heard of AIML, an XML format for expressing English responses?
Try www.alicebot.org and follow references from there.
The version I use here is a Java version, but a Lisp version can be
tried at the above address.
They expect to come out with a public source release when they have
separated the HTTP server code from the AI code and cleaned it up.
Some time this year, perhaps.
The nice thing is that there are a lot of precoded rules and general
knowledge, so you don't have to work from scratch. You can bind
sentences to shell commands and have access to a JavaScript-like
language.
I have just played with it a little, but it seems workable.
It is a generalization of the same methodology used by Eliza.
Hope this is useful.
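Since AIML generalizes Eliza's method, the core of that method fits in a few lines: wildcard pattern rules paired with templated responses, plus a fallback. The rules below are made up for illustration (shown in Python; AIML expresses the same idea in XML):

```python
import re

# Eliza/AIML-style rules: a wildcard pattern and a response template.
# First matching rule wins; captured text is spliced into the template.
RULES = [
    (r"I need (.*)",       "Why do you need {0}?"),
    (r"I am (.*)",         "How long have you been {0}?"),
    (r"(.*) mother (.*)",  "Tell me more about your family."),
]

def respond(line):
    for pattern, template in RULES:
        m = re.match(pattern, line, re.IGNORECASE)
        if m:
            return template.format(*m.groups())
    return "Please go on."  # fallback when no rule matches

print(respond("I need a natural language parser"))
# -> Why do you need a natural language parser?
```

This is keyword matching, not understanding; it produces plausible-looking conversation precisely because the templates never commit to any meaning.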
Good luck
>> From: Steve Graham <·········@comcast.net>
>> Recently I've been trying to extract facts from an English language
>> book and I've been amazed at just how difficult this is. Of course,
>> being a programmer by trade, I've tried developing a program to do
>> the work for me. Haven't had a lot of success.
--
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Thanks to all those who shared their knowledge/experience. Guess I
shouldn't feel so bad that a database programmer had a few problems with
this domain.
Steve Graham
---
Steve Graham wrote:
> Recently I've been trying to extract facts from an English language book
> and I've been amazed at just how difficult this is. Of course, being a
> programmer by trade, I've tried developing a program to do the work for
> me. Haven't had a lot of success.
>
> I was thinking that probably somebody has already started and/or
> finished such code, and that it may be in the public domain. Can anyone
> point me to such?
>
> TIA, Steve Graham