From: ····@my-dejanews.com
Subject: Weird problem
Date: 
Message-ID: <78ig03$2p7$1@nnrp1.dejanews.com>
   I have been trying to code a lisp routine that is able
to split a word according to some "rules", with no result.

   The next is an example of what the routine should do:

         >(break-apart "astronaut")
         ("ast" "ro" "na" "ut")

   The break-apart routine uses 4 diferent lists to separate
the original word. Each of this lists is composed of a number
of small words. If the original word, in this case "astronaut"
can be broken apart using one small word from each of the
lists, the routine should return such result.

   Plus, not only does the word has to be "breakable", that
the small words composing it exist on the list, but they also
have to be in order. This means that "ast" belongs to LIST_ONE,
"ro" belongs to LIST_TWO, "na" belongs to LIST_THREE and "ut"
to LIST_FOUR.

   This are the lists

        (setq list_one '("co" "iyu" "ast" "ml"))

        (setq list_two '("ro" "mp" "awo" "mjwl"))

        (setq list_three '("ute" "kjd" "ast" "ml" "mnw" "na"))

        (setq list_four '("qwd" "ut" "r"))


   So, if i try:

        >(break-apart "computer")
        ("co" "mp" "ute" "r")

   But if I try:

        >(break-apart "conmuter")
        NIL

   There is a "co" in LIST_ONE, so far we are ok. Since we already
found "co" we only have to search with the rest of the word, that
is "nmuter". We continue with the words from LIST_TWO but we find
that there is no word that matches "nmuter" in anyway. If LIST_TWO
had had "nmute" or "nm" or any word that matched the begining of
"nmuter" we would be ok. Since this didnt happend, there is
no break-apart, thus we return NIL.

   I hope someone can help me. I am trying to a parser to be used
with prehispanic languages, but this stops me from continuing.

   This routine will help me continue some of my work, well, at
least I think it will.

   Thanks for your help, I appreciate it.


-----------== Posted via Deja News, The Discussion Network ==----------
http://www.dejanews.com/       Search, Read, Discuss, or Start Your Own    

From: Pierre Mai
Subject: Re: Weird problem
Date: 
Message-ID: <874spe6he1.fsf@orion.dent.isdn.cs.tu-berlin.de>
····@my-dejanews.com writes:

>    I have been trying to code a lisp routine that is able
> to split a word according to some "rules", with no result.

Well, despite all the examples and description you have given, I'm
afraid your problem is still not well-defined.  At least the following 
two points should be considered:

a)  Will there ever be only one possible way to break apart a word?
    Your current specification seems to imply that you expect either
    no way, or exactly one way exists, but I wonder if this is really 
    always the case (natural language processing is IMHO often fraught 
    with ambiguities).

b)  Even if there is only one way to break apart a word, is it
    possible, that there are more than one way to split at any given
    point?  E.g. "as-tro-naut" vs. "ast-ro-naut", which might indeed
    fail, if "ast" is in the first list, but "ro" isn't in the
    second.  If cases like this don't exist, then breaking apart words 
    is an easy, linear operation.  If they do exist (which again I'd
    expect in NLP), you'll need to perform some kind of backtracking.

Here is a simple solution, which assumes that only one way exists to
split a word, and that a simple heuristic suffices to disambiguate
between possible matches at a given point (implemented are first-match 
and longest-first-match, via the operations SIMPLE-FIND-PART and
GREEDY-FIND-PART).

If this isn't sufficient, you can either try to change the code below
to support some form of backtracking/non-determinism, or you can check
out Screamer, which is an extension to Common Lisp for non-determinstic
programming, which makes this task much easier IMHO.  Screamer is by
Jeffrey Mark Siskind (http://www.neci.nj.nec.com/homepages/qobi).

I'd recommend using Screamer, since I'd imagine you will want to
process your word fragments further, and most things NLP will imply
some non-determinism.

Regs, Pierre.

;;; Utility function

(defun starts-with (string start-string &key (start 0))
  (let ((start-length (+ start (length start-string))))
    (and (>= (length string) start-length)
	 (string-equal string start-string :start1 start
		       :end1 start-length))))

;;; The different part-finders, which implement first-match and
;;; longest-first-match heuristics respectively.

(defun simple-find-part (string part-list &key (start 0))
  (dolist (part part-list)
    (when (starts-with string part :start start)
      (return part))))

(defun greedy-find-part (string part-list &key (start 0))
  (loop with result = nil
	with length = 0
	for part in part-list
	do
	(when (and (starts-with string part :start start)
		   (> (length part) length))
	  (setq result part
		length (length part)))
	finally
	(return result)))

;;; The main function.

(defun break-apart (word part-finder &rest part-lists)
  (loop with word-length = (length word)
	for index = 0 then (+ index (length part))
	for part-list in part-lists
	for part = (funcall part-finder word part-list :start index)
	never (or (not part) (>= index word-length))
	collect part into result
	finally
	(return (and (= index word-length)
		     result))))

;;; Examples

#|
* (break-apart "astronaut" #'simple-find-part
	     '("as" "co" "ast") '("tro" "ro" "mp")
	     '("na" "ut") '("ut" "er"))
("as" "tro" "na" "ut")
* (break-apart "astronaut" #'greedy-find-part
	     '("as" "co" "ast") '("tro" "ro" "mp")
	     '("na" "ut") '("ut" "er"))
("ast" "ro" "na" "ut")
* (break-apart "astronaut" #'simple-find-part
	     '("as" "co" "ast") '("tro" "mp")
	     '("na" "ut") '("ut" "er"))
("as" "tro" "na" "ut")
* (break-apart "astronaut" #'greedy-find-part
	     '("as" "co" "ast") '("tro" "mp")
	     '("na" "ut") '("ut" "er"))
NIL
|#
-- 
Pierre Mai <····@acm.org>               http://home.pages.de/~trillian/
  "One smaller motivation which, in part, stems from altruism is Microsoft-
   bashing." [Microsoft memo, see http://www.opensource.org/halloween1.html]
From: Stig Hemmer
Subject: Re: Weird problem
Date: 
Message-ID: <ekvu2x4or5v.fsf@ra.pvv.ntnu.no>
····@my-dejanews.com writes:
>    I have been trying to code a lisp routine that is able
> to split a word according to some "rules", with no result.

Show us your work!  You'll probably learn more by having us comment
your code than by getting a finished solution.

Stig Hemmer,
Jack of a Few Trades.