From: hifi25nl
Subject: Clean punctuation problem
Date: 
Message-ID: <7tKSh.24550$tx6.12577@tornado.fastwebnet.it>
I am using CMUCL on Linux and I want to strip the punctuation from a file
(the file can contain Italian accented letters). I have tried using

(defun read-words (file)
  (with-open-file (input file :direction :input)
    (let ((buffer (make-sequence 'simple-vector (file-length input)
                                 :initial-element #\Space)))
      (read-sequence buffer input)  ; read the file into BUFFER
      (read-from-string             ; convert the buffer to a list
       (concatenate 'simple-string
             ;; replace punctuation with a space in place
             ;; i.e. do not create a new vector for BUFFER
         "(" (nsubstitute-if #\Space
                   (complement #'alphanumericp) buffer)
         ")")))))

However, this function also deletes the accented letters,
so I have modified it as follows:

(defun read-words (file)
  (with-open-file (input file :direction :input)
    (let ((buffer (make-sequence 'simple-vector (file-length input)
                                 :initial-element #\Space)))
      (read-sequence buffer input)  ; read the file into BUFFER
      (read-from-string             ; convert the buffer to a list
       (concatenate 'simple-string
             ;; replace punctuation with a space in place
             ;; i.e. do not create a new vector for BUFFER
         "(" (clean buffer) ")")))))

(defun clean (string)
  (delete #\\ string)
  (delete #\" string)
  (delete #\( string)
  (delete #\) string)
  (delete #\- string)
  (delete #\` string)
  (delete #\: string)
  (delete #\; string)
  (delete #\, string)
  (delete #\. string)
  (delete #\? string)
  (delete #\! string)
  (substitute #\Newline #\SPACE string)) 

In this case the result is (I hope you can see the same character in this
newsgroup):

(PROVA SECONDO |PERCHé|)

when the original file is:
----------
prova
secondo perché?
----------
How can I have "perché" instead of "|PERCHé|"?


Piero

From: Szymon
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <evfvb3$lud$1@nemesis.news.tpi.pl>
Hi. Use SPLIT-SEQUENCE

[ http://www.cliki.net/SPLIT-SEQUENCE ]

instead of read-from-string, or ;) my MY-SPLIT-STRING

[ http://groups.google.pl/group/comp.lang.lisp/msg/31412a3538a3dcbe?dmode=source ].

MY-SPLIT-STRING is ugly (blush).

Btw, the function clean should look like this:

(defun clean (string)
   (substitute #\Newline #\SPACE
               (delete-if (lambda (x) (find x "\\\"()-`:;,.?!")) string)))
;; end fun clean

Important:

(delete ...)
(delete ...)
...

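might give unexpected results: DELETE is allowed to modify its argument,
but only its return value is guaranteed to hold the result. You should use
SETQ or nest the calls, or better use delete-if.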
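
A minimal sketch of the SETQ approach, assuming a hypothetical helper named
CLEAN-WITH-SETQ (it is not code from the original post). Each DELETE result
is stored back, because only the returned sequence is reliable:

```lisp
;; Hypothetical helper illustrating the SETQ idiom; not code from the thread.
(defun clean-with-setq (string)
  ;; Work on a copy so literal strings passed by the caller are never modified.
  (let ((s (copy-seq string)))
    (dolist (ch '(#\\ #\" #\( #\) #\- #\` #\: #\; #\, #\. #\? #\!) s)
      ;; DELETE may destroy its argument; always keep the returned sequence.
      (setq s (delete ch s)))))
```

The DELETE-IF version above does the same work in a single pass, which is
why it is probably the better choice.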

Regards, Szymon.
From: Szymon
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <evg0nm$ds1$1@atlantis.news.tpi.pl>
hifi25nl wrote:

> when the original file is:
> ----------
> prova
> secondo perché?
> ----------

Is it unicode ?

Afaik CMUCL (but check it, I haven't used CMUCL for years) does not support it.

> How can I have "perché" instead of "|PERCHé|"??

Read file as stream of bytes and operate on
bytes instead of characters ;)
or (better) use CMUCL fork named SBCL.
SBCL supports unicode.

If you do not want symbols (|PERCHé|) use appropriate string or
sequence splitting utility, or regular expressions.

Regular expressions for CL:

[ http://weitz.de/cl-ppcre/ ]

Regards, Szymon.
From: Szymon
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <evg1gb$gde$1@atlantis.news.tpi.pl>
Szymon wrote:
> 
> If you do not want symbols (|PERCHé|) use appropriate string or
> sequence splitting utility, or regular expressions.
> 

or use something trivial like this:

(defun split-string-on-space (string)
   (labels ((--> (start result &aux pos)
                 (if (setq pos (position #\Space string :start start))
                     (--> (1+ pos) (cons (subseq string start pos) result))
                   (cons (subseq string start) result))))
     (nreverse (--> 0 nil))))

CL-USER> (split-string-on-space "foo bar baz")

==> ("foo" "bar" "baz")

CL-USER> (split-string-on-space "foo  bar  baz")

==> ("foo" "" "bar" "" "baz")
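
If the empty strings produced by repeated spaces are unwanted, one possible
variant skips them. This is only a sketch; SPLIT-WORDS is a hypothetical
name, not a function from this thread or from any library:

```lisp
;; Hypothetical variant of SPLIT-STRING-ON-SPACE that drops empty pieces.
(defun split-words (string)
  (loop with start = 0
        ;; END is the next space at or after START, or NIL at the last piece.
        for end = (position #\Space string :start start)
        for word = (subseq string start end)
        unless (string= word "") collect word
        while end
        do (setf start (1+ end))))
```

It still returns strings, not symbols, so it would have to be combined with
INTERN to produce a list of symbols.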
From: Takehiko Abe
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <keke-F0F644.22010710042007@nnrp.gol.com>
> > when the original file is:
> > ----------
> > prova
> > secondo perché?
> > ----------
> 
> Is it unicode ?
> 
> Afaik CMUCL (but check it, I haven't used CMUCL for years) does not support it.

If you use UTF-8, delete-if should work because the chars to remove
are all ascii even if CMUCL is oblivious of unicode.
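
A rough sketch of why this is safe: in UTF-8 every byte of a multibyte
sequence has its high bit set, so removing ASCII punctuation octets can
never touch part of an accented letter. The names below are hypothetical,
and the sketch assumes CHAR-CODE returns ASCII codes for these characters
(true in the common implementations, but not guaranteed by the standard):

```lisp
;; Hypothetical names; assumes CHAR-CODE yields ASCII codes for punctuation.
(defparameter *punctuation-octets*
  (map 'vector #'char-code "\\\"()-`:;,.?!"))

(defun strip-punctuation-octets (octets)
  ;; Returns a fresh sequence without the ASCII punctuation octets.
  ;; Octets >= #x80 (pieces of UTF-8 multibyte characters) are never removed.
  (remove-if (lambda (octet) (find octet *punctuation-octets*)) octets))
```

For example, the octets of "perché?" in UTF-8 end in #xC3 #xA9 #x3F; only
the final #x3F (the question mark) is removed.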
From: Ralf Mattes
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <pan.2007.04.10.15.18.46.912657@mh-freiburg.de>
On Tue, 10 Apr 2007 22:01:07 +0900, Takehiko Abe wrote:

>> > when the original file is:
>> > ----------
>> > prova
>> > secondo perché?
>> > ----------
>> 
>> Is it unicode ?
>> 
>> Afaik CMUCL (but check it, I haven't used CMUCL for years) does not support it.
> 
> If you use UTF-8, delete-if should work because the chars to remove
> are all ascii even if CMUCL is oblivious of unicode.


Yes, but how does:

 (nsubstitute-if #\Space
                   (complement #'alphanumericp) buffer)

Will CMUCL treat the (multibyte) accented character fragments as
alphanumeric?

Cheers, Ralf Mattes
 
From: hifi25nl
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <mhOSh.25575$tx6.9860@tornado.fastwebnet.it>
(nsubstitute-if #\Space
                   (complement #'alphanumericp) buffer)
will delete all accented letters, that is � � � � � 

I am trying to use some of the suggestions, but some of the functions are
not working in CMUCL (?); I have only been using it for a few days. I would
also like to point out that I want a list of words like (city motorbike
suddenly) and not a list of strings like ("city" "motorbike" "suddenly").


Now I am trying this: 

(defun punctuationp (input)
   (not (member input '(#\a #\b #\c #\d #\e #\f #\g #\h #\i #\l #\m #\n #\o
#\p #\q #\r #\s #\t #\u #\v #\z #\� #\� #\� #\� #\� #\� #\space))))

(defun read-words (file)
  (with-open-file (input file :direction :input)
    (let ((buffer (make-sequence 'simple-vector (file-length input)
                                 :initial-element #\Space)))
      (read-sequence buffer input)  ; read the file into BUFFER 
      (read-from-string ; convert the buffer to a list
       (concatenate 'simple-string
             ;; replace punctuation with a space in place
             ;; i.e. do not create a new vector for BUFFER
         "(" (delete-if #'punctuationp (substitute #\space #\newline
buffer)) ")")))))

But the result is a list of words like this:

(PROVA GFEG SECONDO PERCH)
Note that I have lost the Italian letter "é" from the word "perché"!

Piero

  

Ralf Mattes wrote:

> On Tue, 10 Apr 2007 22:01:07 +0900, Takehiko Abe wrote:
> 
>>> > when the original file is:
>>> > ----------
>>> > prova
>>> > secondo perché?
>>> > ----------
>>> 
>>> Is it unicode ?
>>> 
>>> Afaik CMUCL (but check it, I haven't used CMUCL for years) does not
>>> support it.
>> 
>> If you use UTF-8, delete-if should work because the chars to remove
>> are all ascii even if CMUCL is oblivious of unicode.
> 
> 
> Yes, but how does:
> 
>  (nsubstitute-if #\Space
>                    (complement #'alphanumericp) buffer)
> 
> Will CMUCL treat the (multibyte) accented character fragments as
> alphanumeric?
> 
> Cheers, Ralf Mattes
From: Szymon
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <evgd19$8n1$1@nemesis.news.tpi.pl>
What about the data encoding? Can you convert UTF-8 to ISO-8859-something?
If not, list all the non-ASCII characters you need.

btw:

CL-USER> (defparameter *non-word-characters*
            #.(map 'vector #'char-code
                   (format nil "~@{~C~}\\\"()-`:;,.?!" #\Tab #\Space #\Newline #\Null)))

==> *NON-WORD-CHARACTERS*

CL-USER> (defun next-symbol-word (stream)
            (assert (equal (stream-element-type stream) '(unsigned-byte 8)))
            (let ((octet-bag
                   (load-time-value (make-array 100 :element-type '(unsigned-byte 8) :fill-pointer 0)))
                  (nw-chars *non-word-characters*))
              (setf (fill-pointer octet-bag) 0)
              (loop (handler-case (read-byte stream t)
                      (end-of-file () (return))
                      (:no-error (octet)
                        (if (find octet nw-chars)
                            (when (plusp (length octet-bag)) (return))
                            (vector-push octet octet-bag)))))
              (unless (zerop (length octet-bag))
                (intern (map-into (make-string (length octet-bag)) #'code-char octet-bag)))))

==> NEXT-SYMBOL-WORD

CL-USER> (with-open-file (in "/home/tichy/suck-next-word.test.txt" :element-type '(unsigned-byte 8) :direction    :input)
            (loop collect (let ((word (next-symbol-word in))) (unless word (loop-finish)) word)))

==> (|prova| |secondo| |perché|)


So you have a list of symbols... as you wanted.
From: Szymon
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <evgdfu$892$1@atlantis.news.tpi.pl>
(defun next-symbol-word (stream)
   (assert (equal (stream-element-type stream) '(unsigned-byte 8)))
   (let ((char-bag
          (load-time-value (make-array 100 :element-type 'character :fill-pointer 0)))
         (nw-chars *non-word-characters*))
     (setf (fill-pointer char-bag) 0)
     (loop (handler-case (read-byte stream t)
             (end-of-file () (return))
             (:no-error (octet)
               (if (find octet nw-chars)
                   (when (plusp (length char-bag)) (return))
                 (vector-push (code-char octet) char-bag)))))
     (when (plusp (length char-bag)) (intern char-bag))))

Just optimized next-symbol-word (no superfluous string creation).


CL-USER> (with-open-file (in "/home/tichy/suck-next-word.test.txt" :element-type '(unsigned-byte 8) :direction    :input)
            (loop collect (let ((word (next-symbol-word in))) (unless word (loop-finish)) word)))

==> (|prova| |secondo| |perché|)
From: hifi25nl
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <PsPSh.25711$tx6.17197@tornado.fastwebnet.it>
This seems to work . Thank you. I will think about it

Piero
From: Szymon
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <evhehs$8j5$1@nemesis.news.tpi.pl>
hifi25nl wrote:

> This seems to work . Thank you. I will think about it
> 
> Piero

Nice.

I presume your data lives in a file encoded in UTF-8
and uses only a very small subset of Unicode, namely ISO 8859-1.

If so I can help you more, I modified my code once more:

(defconstant +iso-8859-1-word-characters+ ; IMHO ;>
   #(#x41 #x42 #x43 #x44 #x45 #x46 #x47 #x48 #x49 #x4A #x4B
     #x4C #x4D #x4E #x4F #x50 #x51 #x52 #x53 #x54 #x55 #x56
     #x57 #x58 #x59 #x5A #x61 #x62 #x63 #x64 #x65 #x66 #x67
     #x68 #x69 #x6A #x6B #x6C #x6D #x6E #x6F #x70 #x71 #x72
     #x73 #x74 #x75 #x76 #x77 #x78 #x79 #x7A #xC0 #xC1 #xC2
     #xC3 #xC4 #xC5 #xC6 #xC7 #xC8 #xC9 #xCA #xCB #xCC #xCD
     #xCE #xCF #xD0 #xD1 #xD2 #xD3 #xD4 #xD5 #xD6 #xD9 #xDA
     #xDB #xDC #xDD #xDE #xDF #xE0 #xE1 #xE2 #xE3 #xE4 #xE5
     #xE6 #xE7 #xE8 #xE9 #xEA #xEB #xEC #xED #xEE #xEF #xF0
     #xF1 #xF2 #xF3 #xF4 #xF5 #xF6 #xF9 #xFA #xFB #xFC #xFD
     #xFE #xFF))

;; stream elements:
;; (small: ISO 8859-1) subset of unicode encoded in UTF-8.
;;
(defun next-symbol-word (stream)
   (assert (equal (stream-element-type stream) '(unsigned-byte 8)))
   (let ((char-bag
          (load-time-value (make-array 100 :element-type 'character :fill-pointer 0))))
     (setf (fill-pointer char-bag) 0)
     (loop (handler-case (read-byte stream t)
             (end-of-file () (return))
             (:no-error (octet)
               (unless (zerop (ldb (byte 1 7) octet))
                 (setf (ldb (byte 5 6) octet) (ldb (byte 5 0) octet)
                       (ldb (byte 6 0) octet) (ldb (byte 6 0) (read-byte stream))))
               (if (find octet +iso-8859-1-word-characters+)
                   (vector-push (code-char octet) char-bag)
                 (when (plusp (length char-bag)) (return))))))
     (when (plusp (length char-bag)) (intern char-bag))))
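
The LDB manipulation above decodes a two-byte UTF-8 sequence: the low five
bits of the lead byte become bits 6 to 10 of the code point, and the low six
bits of the continuation byte become bits 0 to 5. The same arithmetic can be
sketched as a standalone function (DECODE-TWO-BYTE-UTF-8 is a hypothetical
name used only for illustration, and it does no validation of its input):

```lisp
;; Hypothetical helper: decode a two-byte UTF-8 sequence into a code point.
;; LEAD is expected to look like 110xxxxx, CONTINUATION like 10yyyyyy.
(defun decode-two-byte-utf-8 (lead continuation)
  (logior (ash (ldb (byte 5 0) lead) 6)     ; xxxxx -> bits 6..10
          (ldb (byte 6 0) continuation)))   ; yyyyyy -> bits 0..5
```

For "é", stored in UTF-8 as the octets #xC3 #xA9, this yields #xE9, which is
the ISO 8859-1 code of that letter.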

test file contains:
------------------
�����
�����
�����
�����
�����
�����
�����
�����
�����
�����
�����
�����
------------------

CL-USER> (with-open-file (in "/home/tichy/suck-next-word.test.1.txt" :element-type '(unsigned-byte 8) :direction :input)
            (loop collect (let ((word (next-symbol-word in))) (unless word (loop-finish)) word)))

==> (����� ����� ����� ����� ����� ����� |�����| |�����| |�����| |�����| |�����| |�����|)

HTH, Szymon.
From: hifi25nl
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <AY4Th.28198$tx6.24464@tornado.fastwebnet.it>
OK, now I see exactly this output:

(|prova| |secondo| |perché|)

and this is good!, but a...question.
How to get rid of vertical bars?. I would like this:

(prova secondo perché)


Thank you





Szymon wrote:

> ;; stream elements:
> ;; (small: ISO 8859-1) subset of unicode encoded in UTF-8.
> ;;
> (defun next-symbol-word (stream)
> (assert (equal (stream-element-type stream) '(unsigned-byte 8)))
> (let ((char-bag
> (load-time-value (make-array 100 :element-type 'character :fill-pointer
> 0)))) (setf (fill-pointer char-bag) 0)
> (loop (handler-case (read-byte stream t)
> (end-of-file () (return))
> (:no-error (octet)
> (unless (zerop (ldb (byte 1 7) octet))
> (setf (ldb (byte 5 6) octet) (ldb (byte 5 0) octet)
> (ldb (byte 6 0) octet) (ldb (byte 6 0) (read-byte stream))))
> (if (find octet +iso-8859-1-word-characters+)
> (vector-push (code-char octet) char-bag)
> (when (plusp (length char-bag)) (return))))))
> (when (plusp (length char-bag)) (intern char-bag))))
From: Matthias Benkard
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <1176297500.097329.319190@o5g2000hsb.googlegroups.com>
Hi,

> How to get rid of vertical bars?

That's the wrong question.  The right question is: What do the
vertical bars mean?

Consider the following CLISP session:

[1]> 'hello
HELLO
[2]> 'HELLO
HELLO
[3]> 'HeLlO
HELLO
[4]> '\h\e\l\l\o
|hello|
[5]> '|hello|
|hello|
[6]> '|Hello|
|Hello|
[7]> '|HeLlO|
|HeLlO|
[8]> '|HELLO|
HELLO
[9]> 'hell\o
|HELLo|
[10]> 'this\ is\ a\ single\ symbol
|THIS IS A SINGLE SYMBOL|
[11]> '|THIS IS A SINGLE SYMBOL|
|THIS IS A SINGLE SYMBOL|

Does that help your understanding?

Bye-bye,
Matthias
From: Pascal Bourguignon
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <87r6qrko7f.fsf@voyager.informatimago.com>
hifi25nl <········@fastmail.fm> writes:

> OK, now I see exactly this output:
>
> (|prova| |secondo| |perché|)
>
> and this is good!, but a...question.
> How to get rid of vertical bars?. 

Besides Matthias's answer, why are you interning symbols for this kind
of data?  This is something that was done in the ancient times when
there were no strings in Lisp.  Nowadays, words are better stored in
strings than in symbols.  (If you want to unify equal strings, you can
always put them in a hash table).

> I would like this:
>
> (prova secondo perché)

Now, the question is what symbols you would want to get if the file contained:

"PROVA prova Prova proVA"


That is, do you want to get distinct symbols:

   (PROVA |prova| |Prova| |proVA|)

or all the same:

  (PROVA PROVA PROVA PROVA)

?


Finally the question is how you want to display these symbols.  You
can avoid the escapes (either \ or ||) by explicitly printing the
symbols with *PRINT-ESCAPE* set to NIL (or by using PRINC instead of
PRIN1 or PRINT):


[97]> (princ '(|Prova| prova |prova| pro\va))
Prints:  (Prova PROVA prova PROvA)
Returns: (|Prova| PROVA |prova| |PROvA|)

Note how PRINC returns its first argument, and this gets printed by
the REPL with PRINT.
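
The same difference can be checked without looking at the REPL output by
printing to strings. A small sketch, with PRINTED-FORMS as a hypothetical
helper name and assuming the default readtable case and printer settings:

```lisp
;; Hypothetical helper: return both printed forms of the symbol named NAME.
(defun printed-forms (name)
  (let ((sym (intern name)))
    (values (prin1-to-string sym)     ; with escapes, e.g. vertical bars
            (princ-to-string sym))))  ; without escapes, just the name
```

Calling (printed-forms "prova") should return the escaped form as its first
value and the plain name "prova" as its second; the symbol itself is the same
object in both cases.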

-- 
__Pascal Bourguignon__
http://www.informatimago.com
http://pjb.ogamita.org
From: hifi25nl
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <3y8Th.28571$tx6.27049@tornado.fastwebnet.it>
The software I am working on operates on lists of italian words.
Since these are italian words, I am not interested in a distinction between
PROVA, |prova|, |Prova| and |proVA|. The code must treat all as the same
italian word PROVA.
This is old software that uses lists such as (NEL MEZZO DEL CAMMIN DI
NOSTRA VITA) (this is Dante) and manipulates them with the classical
Lisp list functions.
So my problem is not whether this is the best way, but how to achieve it.
For example, each word such as VITA stores some properties (gender feminine,
singular, etc., using frames as suggested in Winston's book on Lisp,
not the latest edition...) and I don't want to redo all
that work!
This is not only a problem of "printing" the symbols. The symbol must be PROVA
and nothing else...

Piero 

Pascal Bourguignon wrote:

> hifi25nl <········@fastmail.fm> writes:
> 
>> OK, now I see exactly this output:
>>
>> (|prova| |secondo| |perché|)
>>
>> and this is good!, but a...question.
>> How to get rid of vertical bars?.
> 
> Beside Matthias's answer, why are you interning symbols for this kind
> of data?  This is something that was done in the ancient times when
> there was no strings in lisp.  Nowadays, words are better stored in
> strings than in symbols.  (If you want to unify equal strings, you can
> always put them in a hash table).
> 
>> I would like this:
>>
>> (prova secondo perché)
> 
> Now, the question is what symbols you would want to get if the file
> contained:
> 
> "PROVA prova Prova proVA"
> 
> 
> That is, do you want to get distinct symbols:
> 
>    (PROVA |prova| |Prova| |proVA|)
> 
> or all the same:
> 
>   (PROVA PROVA PROVA PROVA)
> 
> ?
> 
> 
> Finally the question is how you want to display these symbols.  You
> can avoid the escapes (either \ or ||) by explicitly printing the
> symbols with *PRINT-ESCAPE* set to NIL (or by using PRINC instead of
> PRIN1 or PRINT):
> 
> 
> [97]> (princ '(|Prova| prova |prova| pro\va))
> Prints:  (Prova PROVA prova PROvA)
> Returns: (|Prova| PROVA |prova| |PROvA|)
> 
> Note how PRINC returns its first argument, and this gets printed by
> the REPL with PRINT.
> 
From: Pascal Bourguignon
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <87bqhultgt.fsf@voyager.informatimago.com>
hifi25nl <········@fastmail.fm> writes:

> The software I am working on operates on lists of italian words.
> Since these are italian words, I am not interested in a distinction between
> PROVA, |prova|, |Prova| and |proVA|. The code must treat all as the same
> italian word PROVA.
> This is an old software that is using lists as (NEL MEZZO DEL CAMMIN DI
> NOSTRA VITA), ====> this is Dante, and manipulate them with the classical
> lisp functions for lists. 
> So my problem is not if this is the best way, but how to achieve it. 
> For example in each word as VITA are stored some properties (genre female,
> singular, etc..., using frames as suggested in the book on Lisp from
> Winston, not the last edition...) and I don't want to do another time all
> the work about it!
> Is not a problem only of "printing" the symbols. The symbol must be PROVA
> and nothing else...

Ok.  In this situation I'd just explicitly call:

    (intern (string-upcase char-bag))

instead of:

   (intern char-bag)

in NEXT-SYMBOL-WORD.
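
As a small illustration of why upcasing before INTERN gives the symbol the
old software expects, here is a sketch with a hypothetical helper named
WORD->SYMBOL. It takes the package explicitly so the result does not depend
on the current value of *PACKAGE*, and it assumes STRING-UPCASE handles every
letter in the word:

```lisp
;; Hypothetical helper: map any capitalization of WORD to one symbol.
(defun word->symbol (word &optional (package *package*))
  (intern (string-upcase word) package))
```

With this, (word->symbol "prova" :cl-user) and (word->symbol "proVA" :cl-user)
both return the symbol that the reader produces for PROVA in CL-USER, so any
properties stored on that symbol are found again.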



-- 
__Pascal Bourguignon__
http://www.informatimago.com
http://pjb.ogamita.org
From: Matthias Benkard
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <1176312780.764249.5030@l77g2000hsb.googlegroups.com>
Hi,

> Ok.  In this situation I'd just explicitly call:
>
>     (intern (string-upcase char-bag))

Yeah, except...  If CMUCL doesn't know that the characters it works
with are alphanumeric, it may not be able to STRING-UPCASE them
either.  For instance:

  ; Loading #P"/home/mulk/.cmucl-init.lisp".
  * (string-upcase "perché")

  "PERCHé"

That's not quite right.  I guess if one really wanted to do this in
CMUCL, they would have to write their own version of STRING-UPCASE
that special-cased the accented letters that their language needed.
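
A minimal sketch of such a special-cased upcase, working on characters
rather than octets. The names are hypothetical, and the sketch assumes that
CHAR-CODE of an accented letter equals its ISO 8859-1 code (which holds for
8-bit Latin-1 characters and for Unicode code points alike):

```lisp
;; Hypothetical helpers; assume CHAR-CODE matches ISO 8859-1 for these letters.
(defun latin-1-char-upcase (char)
  (let ((code (char-code char)))
    ;; In ISO 8859-1 the lowercase accented letters #xE0..#xFE (except the
    ;; division sign #xF7) sit exactly #x20 above their uppercase forms.
    (if (or (<= #xE0 code #xF6) (<= #xF8 code #xFE))
        (code-char (- code #x20))
        (char-upcase char))))

(defun latin-1-string-upcase (string)
  (map 'string #'latin-1-char-upcase string))
```

Plain ASCII letters still go through CHAR-UPCASE, so unaccented words behave
exactly as they do with STRING-UPCASE.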

Bye-bye,
Matthias
From: Szymon
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <evja6t$3il$1@atlantis.news.tpi.pl>
Matthias Benkard wrote:
> Hi,

......

> 
>   "PERCHé"
> 
> That's not quite right.  I guess if one really wanted to do this in
> CMUCL, they would have to write their own version of STRING-UPCASE
> that special-cased the accented letters that their language needed.

Right. blah

Code (not tested, but it should work; it should return symbols
with upcased names):


(defun custom-upcase (octet)
   (if (and (>= octet #.(char-code #\a)) (<= octet #.(char-code #\z)))
       (+ #.(char-code #\A) (- octet #.(char-code #\a)))
     (case octet
       (#xE0 #xC0)(#xE1 #xC1)(#xE2 #xC2)(#xE3 #xC3)(#xE4 #xC4)
       (#xE5 #xC5)(#xE6 #xC6)(#xE7 #xC7)(#xE8 #xC8)(#xE9 #xC9)
       (#xEA #xCA)(#xEB #xCB)(#xEC #xCC)(#xED #xCD)(#xEE #xCE)
       (#xEF #xCF)(#xF0 #xD0)(#xF1 #xD1)(#xF2 #xD2)(#xF3 #xD3)
       (#xF4 #xD4)(#xF5 #xD5)(#xF6 #xD6)(#xF9 #xD9)(#xFA #xDA)
       (#xFB #xDB)(#xFC #xDC)(#xFD #xDD)(#xFE #xDE)(otherwise octet))))

(defconstant +iso-8859-1-word-characters+ ; IMHO ;>
   #(#x41 #x42 #x43 #x44 #x45 #x46 #x47 #x48 #x49 #x4A #x4B
     #x4C #x4D #x4E #x4F #x50 #x51 #x52 #x53 #x54 #x55 #x56
     #x57 #x58 #x59 #x5A #x61 #x62 #x63 #x64 #x65 #x66 #x67
     #x68 #x69 #x6A #x6B #x6C #x6D #x6E #x6F #x70 #x71 #x72
     #x73 #x74 #x75 #x76 #x77 #x78 #x79 #x7A #xC0 #xC1 #xC2
     #xC3 #xC4 #xC5 #xC6 #xC7 #xC8 #xC9 #xCA #xCB #xCC #xCD
     #xCE #xCF #xD0 #xD1 #xD2 #xD3 #xD4 #xD5 #xD6 #xD9 #xDA
     #xDB #xDC #xDD #xDE #xDF #xE0 #xE1 #xE2 #xE3 #xE4 #xE5
     #xE6 #xE7 #xE8 #xE9 #xEA #xEB #xEC #xED #xEE #xEF #xF0
     #xF1 #xF2 #xF3 #xF4 #xF5 #xF6 #xF9 #xFA #xFB #xFC #xFD
     #xFE #xFF))

;; stream is (small (8859-1) subset of) unicode encoded in UTF-8.
;;
(defun next-symbol-word (stream)
   (assert (equal (stream-element-type stream) '(unsigned-byte 8)))
   (let ((char-bag
          (load-time-value (make-array 100 :element-type 'character :fill-pointer 0))))
     (setf (fill-pointer char-bag) 0)
     (loop (handler-case (read-byte stream t)
             (end-of-file () (return))
             (:no-error (octet)
               (unless (zerop (ldb (byte 1 7) octet))
                 (setf (ldb (byte 5 6) octet) (ldb (byte 5 0) octet)
                       (ldb (byte 6 0) octet) (ldb (byte 6 0) (read-byte stream))))
               (if (find octet +iso-8859-1-word-characters+)
                   (vector-push (code-char (custom-upcase octet)) char-bag)
                 (when (plusp (length char-bag)) (return))))))
     (when (plusp (length char-bag)) (intern char-bag))))
From: hifi25nl
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <M39Th.28647$tx6.17185@tornado.fastwebnet.it>
Great! Now I have 
CMUCL lisp ==> (PROVA GFEYG SECONDO |PERCHé|)
SBCL lisp ==> (PROVA GFEYG SECONDO PERCHé)
With SBCL lisp I have what I want, but PERCHé instead of PERCH� with the
right accented letter....
Is this a problem of setting SBCL lisp?

This  was the code..

(defparameter *non-word-characters*
#.(map 'vector #'char-code
(format nil "~@{~C~}\\\"()-`:;,.?!" #\Tab #\Space #\Newline #\Null)))

(defun next-symbol-word (stream)
(assert (equal (stream-element-type stream) '(unsigned-byte 8)))
(let ((char-bag
(load-time-value (make-array 100 :element-type 'character :fill-pointer 0)))
(nw-chars *non-word-characters*))
(setf (fill-pointer char-bag) 0)
(loop (handler-case (read-byte stream t)
(end-of-file () (return))
(:no-error (octet)
(if (find octet nw-chars)
(when (plusp (length char-bag)) (return))
(vector-push (code-char octet) char-bag)))))
(when (plusp (length char-bag)) (intern (string-upcase char-bag)))))

(with-open-file
(in "/home/olmeda/Downloads/temp/test" :element-type '(unsigned-byte
8) :direction :input) 
                (loop collect (let ((word (next-symbol-word in))) (unless
word (loop-finish)) word)))

Pascal Bourguignon wrote:

> hifi25nl <········@fastmail.fm> writes:
> 
>> The software I am working on operates on lists of italian words.
>> Since these are italian words, I am not interested in a distinction
>> between PROVA, |prova|, |Prova| and |proVA|. The code must treat all as
>> the same italian word PROVA.
>> This is an old software that is using lists as (NEL MEZZO DEL CAMMIN DI
>> NOSTRA VITA), ====> this is Dante, and manipulate them with the classical
>> lisp functions for lists.
>> So my problem is not if this is the best way, but how to achieve it.
>> For example in each word as VITA are stored some properties (genre
>> female, singular, etc..., using frames as suggested in the book on Lisp
>> from Winston, not the last edition...) and I don't want to do another
>> time all the work about it!
>> Is not a problem only of "printing" the symbols. The symbol must be PROVA
>> and nothing else...
> 
> Ok.  In this situation I'd just explicitly call:
> 
>     (intern (string-upcase char-bag))
> 
> instead of:
> 
>    (intern char-bag)
> 
> in NEXT-SYMBOL-WORD.
> 
> 
> 
From: Szymon
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <evjcpv$qhq$1@nemesis.news.tpi.pl>
hifi25nl wrote:

> Is this a problem of setting SBCL lisp?

No. My code is designed to work regardless of implementation:
it reads octets, not characters. But you just used the old version
without this:

(unless (zerop (ldb (byte 1 7) octet))
   (setf (ldb (byte 5 6) octet) (ldb (byte 5 0) octet)
         (ldb (byte 6 0) octet) (ldb (byte 6 0) (read-byte stream))))

If you plan to switch to SBCL you can use (much) simpler approach,
because SBCL fully supports UTF-8, among others.
But yesterday you wanted CMUCL...

I just posted version with upcasing.

My advice: read about ISO 8859 and UTF-8.

http://en.wikipedia.org/wiki/ISO_8859

http://en.wikipedia.org/wiki/UTF-8
From: hifi25nl
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <YScTh.29138$tx6.2627@tornado.fastwebnet.it>
Thank you very much for your advice. I have not decided which Lisp to use.
When I asked, I thought this was a trivial problem, but in the end, with
CMUCL, it was not so trivial, it seems...
I need a lisp that is free, standard and can be used in Linux and under
Windows at the same time with a GUI that is NOT Emacs. Maybe all these
conditions cannot be fulfilled together!
My knowledge of Lisp (and character encoding) is not good enough, thank you
for pointing to some documentation.

The last code gives me this result in CMUCL (original file is encoded in
utf-8)

(PROVA SECONDO |PERCHÉ|)

Note the word PERCHÉ surrounded by vertical bars. The other words are not. I
would need all the words without vertical bars...

Piero

P.S.

I have tried SBCL with the same code and the result is perfect (for me)

(PROVA GFEYG SECONDO PERCHÉ)

I must point out that also in SBCL other code proposed in this thread gives
me results like this

(PROVA GFEYG SECONDO PERCHé)

So I find very interesting a code that can be applied to all available lisp
that "clean from punctuation"


Szymon wrote:

> hifi25nl wrote:
> 
>> Is this a problem of setting SBCL lisp?
> 
> No. My code is designed to work regardless of implementation:
> it reads octets, not characters. But you just used the old version
> without this:
> 
> (unless (zerop (ldb (byte 1 7) octet))
>    (setf (ldb (byte 5 6) octet) (ldb (byte 5 0) octet)
>          (ldb (byte 6 0) octet) (ldb (byte 6 0) (read-byte stream))))
> 
> If you plan to switch to SBCL you can use (much) simpler approach,
> because SBCL fully supports UTF-8, among others.
> But yesterday you wanted CMUCL...
> 
> I just posted version with upcasing.
> 
> My advice: read about ISO 8859 and UTF-8.
> 
> http://en.wikipedia.org/wiki/ISO_8859
> 
> http://en.wikipedia.org/wiki/UTF-8
From: Szymon
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <evjqu8$gbo$1@atlantis.news.tpi.pl>
hifi25nl wrote:

 > [ .... ]

> I need a lisp that is free, standard and can be used in Linux and under
> Windows at the same time with a GUI that is NOT Emacs. Maybe all these
> conditions cannot be fulfilled together!

Looks like so.

_Maybe_ CLISP ?

Personally I'm unix-oriented (no MS Win after 3.1), and I like Emacs very much.

|PERCHÉ| and PERCHÉ are printed representations of the same symbol!
Vertical bars are not "real", see:

CL-USER> (length (symbol-name '|PERCHÉ|))

==> 6

6, NOT 8.

CL-USER> (length (symbol-name '||))

==> 0

> Note the word PERCHÉ surrounded by vertical bars. The other words are not. I
> would need all the words without vertical bars...

Educate yourself about: lisp symbols, lisp reader, lisp printer.
There are many books about lisp, and CLTL2 and Hyperspec are available online.
Online introductory books:

http://www.cs.cmu.edu/~dst/LispBook/index.html
http://www.psg.com/~dlamkins/Site/sl.html
http://www.gigamonkeys.com/book/

but, first, RE-READ Pascal's post about *PRINT-ESCAPE*.

some more examples:

CL-USER> (eq '|PERCHÉ| 'PERCHÉ)

==> T

CMUCL escapes that word because it does not treat É as a regular character
(it handles it much like whitespace).

;; four identical F O O's:

CL-USER> (reduce (lambda (a b) (when (eq a b) a)) '(|F O O| f\ o\ o f| |o| |o) :initial-value (intern "F O O"))

==> |F O O|

CL-USER> (map nil #'print '(|F O O| f\ o\ o f| |o| |o))

|F O O|
|F O O|
|F O O|

code for SBCL, here:

;; version without byte-play.
;;
;; NOT for CMUCL (for SBCL).

(defun next-symbol-word (stream)
   (let ((char-bag (load-time-value
                    (make-array 100 :element-type 'character :fill-pointer 0))))
     (setf (fill-pointer char-bag) 0)
     (loop (handler-case (read-char stream)
             (end-of-file () (return))
             (:no-error (char)
               (if (find char
                         #.(format nil "abcdefghijklmnopqrstuvwxyz~
                                        ABCDEFGHIJKLMNOPQRSTUVWXYZ~
                                        ������������������������������~
                                        ������������������������������"))
                   (vector-push (char-upcase char) char-bag)
                 (when (plusp (length char-bag)) (return))))))
     (when (plusp (length char-bag))
       (intern char-bag))))

;; test:

CL-USER> (with-open-file
              (in "/home/tichy/suck-next-word.test.1.txt" :element-type 'character :direction :input)
            (loop collect (let ((word (next-symbol-word in))) (unless word (loop-finish)) word)))

==> (ABCDE ����� ����� ����� ����� ����� ����� ����� ����� ����� ����� ����� ���ޟ)
From: Szymon
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <evl2rp$3hr$1@atlantis.news.tpi.pl>
Newsreader silently changed encoding to windows-1252 :(
So, I'm reposting

SBCL

(little changed) version:

(defun next-symbol-word (stream)
   (let ((char-bag (load-time-value
                    (make-array 100 :element-type 'character :fill-pointer 0))))
     (setf (fill-pointer char-bag) 0)
     (loop (handler-case (char-upcase (read-char stream))
             (end-of-file () (return))
             (:no-error (char)
               (if (find char "ABCDEFGHIJKLMNOPQRSTUVWXYZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖÙÚÛÜÝÞ")
                   (vector-push char char-bag)
                 (when (plusp (length char-bag)) (return))))))
     (when (plusp (length char-bag))
       (intern char-bag))))
From: Matthias Benkard
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <1176378523.661524.117160@b75g2000hsg.googlegroups.com>
Hi,

> (little changed) version

This new version strikes me as a regression.  On the one hand, I don't
think it will work in any Unicode-unaware implementation (such as
CMUCL), as CHAR-UPCASE can't upcase characters it doesn't recognise as
alphanumeric.  On the other hand, using FIND as an alternative to
ALPHANUMERICP doesn't make much sense in any Unicode-aware
implementation (such as SBCL), as you can just use ALPHANUMERICP
instead.

Bye-bye,
Matthias
From: Szymon
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <evlfa6$rhj$1@atlantis.news.tpi.pl>
Matthias Benkard wrote:
> Hi,
> 
>> (little changed) version
> 
> This new version strikes me as a regression.  On the one hand, I don't
> think it will work in any Unicode-unaware

OP is (probably) going to move to SBCL, so I posted an
SBCL-only version. The versatile version is here:
http://groups.google.pl/group/comp.lang.lisp/msg/78e09e4c4bd49c80?dmode=source

 > [ ... ]

> implementation (such as SBCL), as you can just use ALPHANUMERICP
> instead.

Ok, FIND was not a good choice. Btw, ALPHANUMERICP also "accepts"
digits. ALPHA-CHAR-P does the job.
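
A quick way to see the difference, as a minimal sketch. The helper name
WORD-CHARS is made up for this illustration; ALPHANUMERICP, ALPHA-CHAR-P
and REMOVE-IF-NOT are standard Common Lisp.

```lisp
;; ALPHANUMERICP is true for letters and digits,
;; ALPHA-CHAR-P is true for letters only.
(defun word-chars (predicate string)
  "Return a fresh string holding the characters of STRING that satisfy PREDICATE."
  (remove-if-not predicate string))

(word-chars #'alphanumericp "abc 123, xyz!")  ; keeps letters and digits
(word-chars #'alpha-char-p  "abc 123, xyz!")  ; keeps letters only
```

Which accented characters count as alphabetic still depends on the
implementation's character set, so this only shows the digit question.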

I hope this is the last one ;>

(defun next-symbol-word (stream)
   (load-time-value (assert (string= (lisp-implementation-type) :SBCL)))
   (let ((char-bag (load-time-value
                    (make-array 100 :element-type 'character :fill-pointer 0))))
     (setf (fill-pointer char-bag) 0)
     (loop (handler-case (read-char stream)
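             (end-of-file ()
               ;; at end of file, still return a word that is pending in CHAR-BAG
               (return (when (string/= char-bag "") (intern char-bag))))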
             (:no-error (char)
               (cond ((alpha-char-p char)
                      (vector-push (char-upcase char) char-bag))
                     ((string/= char-bag "")
                      (return (intern char-bag)))))))))
From: Szymon
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <evl3dq$pb2$1@nemesis.news.tpi.pl>
Newsreader silently changed encoding to windows-1252 :(
So, I'm reposting:

(defun next-symbol-word (stream)
   (let ((char-bag (load-time-value
                    (make-array 100 :element-type 'character :fill-pointer 0))))
     (setf (fill-pointer char-bag) 0)
     (loop (handler-case (char-upcase (read-char stream))
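             (end-of-file ()
               ;; at end of file, still return a word that is pending in CHAR-BAG
               (return (when (plusp (length char-bag)) (intern char-bag))))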
             (:no-error (char)
               (if (find char "ABCDEFGHIJKLMNOPQRSTUVWXYZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ")
                   (vector-push char char-bag)
                 (when (plusp (length char-bag))
                   (return (intern char-bag)))))))))
From: Takehiko Abe
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <keke-A0724B.14595411042007@nnrp.gol.com>
> > If you use UTF-8, delete-if should work because the chars to remove
> > are all ascii even if CMUCL is oblivious of unicode.
> 
> 
> Yes, but how does:
> 
>  (nsubstitute-if #\Space
>                    (complement #'alphanumericp) buffer)
> 
> Will CMUCL treat the (multibyte) accented character fragments as
> alphanumeric?

No, it won't. Anyways, CMUCL's alphanumericp tests only
in ascii range (at least the version I have --19d).
From: Pascal Bourguignon
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <87mz1gmb1i.fsf@voyager.informatimago.com>
Takehiko Abe <····@gol.com> writes:

>> > when the original file is:
>> > ----------
>> > prova
>> > secondo perché?
>> > ----------
>> 
>> Is it unicode ?
>> 
>> Afaik CMUCL (but check it, I haven't used CMUCL for years) does not support it.
>
> If you use UTF-8, delete-if should work because the chars to remove
> are all ascii even if CMUCL is oblivious of unicode.

Obviously, not.  UTF-8 is an 8-bit encoding, which uses the 8th bit,
and ASCII is only a 7-bit encoding.

What you can try is to use an encoding that maps 256 different
characters to the 256 octets.  With that kind of 1-1 octet encoding,
you can indeed try to process any other encoding.  But why convert
octets to strange meaningless intermediary characters when you can
just open a file with :element-type '(unsigned-byte 8)?

http://www.cliki.net/CloserLookAtCharacters

-- 
__Pascal Bourguignon__
http://www.informatimago.com
http://pjb.ogamita.org
From: Takehiko Abe
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <keke-868000.02012811042007@nnrp.gol.com>
> > If you use UTF-8, delete-if should work because the chars to remove
> > are all ascii even if CMUCL is oblivious of unicode.
> 
> Obviously, not. 

Your confidence is amazing.
From: Pascal Bourguignon
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <874pnnmaow.fsf@voyager.informatimago.com>
Takehiko Abe <····@gol.com> writes:

>> > If you use UTF-8, delete-if should work because the chars to remove
>> > are all ascii even if CMUCL is oblivious of unicode.
>> 
>> Obviously, not. 
>
> Your confidence is amazing.

It's a question of definition.  In my book, ASCII is defined as a
character set of 128 so-called "characters" (actually, there's 95
characters and 33 control codes), and a mapping of these 128
"characters" to the subset of integers between 0 and 127 inclusive.

That's 2^7 different codes, or 7 bits, not 8 bits.


Unfortunately, there are a lot of lousy people out there who call
ASCII anything where you can store an ASCII code, like for example an
octet. (What if my bytes are 32-bit?)  I don't know what they do with
the values that don't encode any ASCII character or control code...
(I don't think they get the excuse of the parity bit, because I don't
believe most of them even know what it is.)



Now, the question is what does this implementation do when you specify
for external-format the ASCII code, and it reads some bytes with the
8th bit set as it is to expect in a UTF-8 file that doesn't contain
only ASCII characters.

Most implementations I've used signal an error on reading such a byte.



If your ASCII is not a 7-bit encoding (that is, it is not ASCII,
but something else, so why do you call it "ascii"?), then indeed you
could try to read a UTF-8 BYTE stream and decode it into some
characters, as long as all the 256 bytes are mapped to characters.
Some implementations have a 1-1 external-format or an extended
ISO-8859-1 encoding to include the 256 codes, and that can be used.
But this is not ASCII.


-- 
__Pascal Bourguignon__
http://www.informatimago.com
http://pjb.ogamita.org
From: John Thingstad
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <op.tqmc5yulpqzri1@pandora.upc.no>
On Wed, 11 Apr 2007 13:07:11 +0200, Pascal Bourguignon  
<···@informatimago.com> wrote:

>
> If your ASCII is not a 7-bit encoding (that is, it is not ASCII,
> but something else, so why do you call it "ascii"?), then indeed you
> could try to read a UTF-8 BYTE stream and decode it into some
> characters, as long as all the 256 bytes are mapped to characters.
> Some implementations have a 1-1 external-format or an extended
> ISO-8859-1 encoding to include the 256 codes, and that can be used.
> But this is not ASCII.
>
>

Sort of. ISO-8859-1 (iso-latin) is ASCII + 128 extended codes for
other European languages. More to the point, a UTF-8 character isn't
necessarily 1 octet in length. THIS is only true for a valid ASCII code.
(It can be up to 4 octets.)
You seem to be mixing up code and encoding.
Unicode is a character code table. UTF-8 is an encoding, as are UTF-16 and UTF-32.
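
To make the code/encoding distinction concrete, here is a minimal sketch
of the UTF-8 encoding rule. The function name UTF-8-OCTETS is made up for
this illustration, it works on integer code points rather than Lisp
characters so it does not depend on the implementation's character set,
and it does not reject surrogate code points.

```lisp
(defun utf-8-octets (code-point)
  "Return the list of octets that encode the integer CODE-POINT in UTF-8."
  (cond ((< code-point #x80)            ; ASCII range: one octet, 8th bit clear
         (list code-point))
        ((< code-point #x800)           ; two octets
         (list (logior #xC0 (ldb (byte 5 6) code-point))
               (logior #x80 (ldb (byte 6 0) code-point))))
        ((< code-point #x10000)         ; three octets
         (list (logior #xE0 (ldb (byte 4 12) code-point))
               (logior #x80 (ldb (byte 6 6) code-point))
               (logior #x80 (ldb (byte 6 0) code-point))))
        (t                              ; four octets
         (list (logior #xF0 (ldb (byte 3 18) code-point))
               (logior #x80 (ldb (byte 6 12) code-point))
               (logior #x80 (ldb (byte 6 6) code-point))
               (logior #x80 (ldb (byte 6 0) code-point))))))

(utf-8-octets #x3F)   ; "?" stays a single octet below 128
(utf-8-octets #xE9)   ; e-acute becomes two octets, both above 127
```

Every octet of a multi-octet sequence has the 8th bit set, so none of
them can be mistaken for an ASCII punctuation code.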

never seen BYTE mean anything but 8 bits.
WORD on the other hand...
-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
From: Szymon
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <evii4o$rep$1@nemesis.news.tpi.pl>
John Thingstad wrote:

> never seen BYTE mean anything but 8 bits.

hm, there are/was computers with variable BYTE length.
There are/was computers with 4, 6, 7, 9, ... (?) bit bytes.
and so on.
From: Espen Vestre
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <m1mz1fm75h.fsf@vestre.net>
Szymon <···········@o2.pl> writes:

> hm, there are/was computers with variable BYTE length.
> There are/was computers with 4, 6, 7, 9, ... (?) bit bytes.
> and so on.

You forgot 5 bits, which was/is used in TELEX communication. And TOPS-10
used five-bit bytes in its file name character set!
-- 
  (espen)
From: Rob Warnock
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <dp6dnRflv9_fBoDbnZ2dnUVZ_jqdnZ2d@speakeasy.net>
Espen Vestre  <·····@vestre.net> wrote:
+---------------
| Szymon <···········@o2.pl> writes:
| > hm, there are/was computers with variable BYTE length.
| > There are/was computers with 4, 6, 7, 9, ... (?) bit bytes.
| > and so on.
| 
| You forgot 5 bits, which was/is used in TELEX communication.
| And TOPS-10 used five-bit bytes in its file name character set!
+---------------

Actually, TOPS-10 used *six*-bit bytes for filenames [uppercase
alpha, digits, and some punct, with 6 chars for name + 3 for type]
and system call names. SIXBIT used exactly the same characters as
ASCII codes #\Space through #\_ (underscore), but 32 less. That is,
0 was a SIXBIT #\Space, and 63 (#o77) was #\_. The following has
a table:

    http://nemesis.lonestar.org/reference/telecom/codes/sixbit.html
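
The "32 less" rule can be sketched in two lines. The function names are
made up for this illustration, and the sketch works on integer ASCII
codes so it does not assume anything about the Lisp's own character
codes.

```lisp
(defun ascii-to-sixbit (code)
  "Map an ASCII code in 32..95 (space through underscore) to SIXBIT 0..63."
  (check-type code (integer 32 95))
  (- code 32))

(defun sixbit-to-ascii (sixbit)
  "Map a SIXBIT code in 0..63 back to the ASCII code 32..95."
  (check-type sixbit (integer 0 63))
  (+ sixbit 32))

(ascii-to-sixbit 32)   ; space
(ascii-to-sixbit 95)   ; underscore, the last SIXBIT code (#o77)
```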

All filenames, that is, except on DECtape, where there was a
compressed encoding called RADIX50. The "50" in RADIX50 is actually
octal, #o50 or 40 decimal, so "RADIX50" is actually a radix-40 format,
and supports *only* uppercase alpha, digits, #\., #\$, #\%, and space
[meaning NUL]. The table here:

    http://nemesis.lonestar.org/reference/telecom/codes/radix50.html

has the correct character codes, but doesn't tell you how it
was packed into words. The RADIX50 code used on the PDP-11
packed three RADIX50 characters into 16 bits [I think]:
(+ (* 1600 c1) (* 40 c2) c3), where "c1" is the leftmost of
the three. So "A2" would be encoded as 17720 (#x4538). On the
PDP-10, you could pack six RADIX50 characters into 32 bits,
leaving 4 bits for some flags they just *had* to squeeze
into the PDP-10 DECtape directories for some arcane reason
that escapes me at the moment...  ;-}
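
The packing formula above can be sketched as follows. The names
*RADIX50-ALPHABET* and RADIX50-PACK are made up for this illustration,
and the code order (space, digits, letters, then ".", "$", "%") is an
assumption chosen because it reproduces the "A2" example; check it
against the linked table before relying on it.

```lisp
(defparameter *radix50-alphabet* " 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ.$%"
  "Assumed code order: a character's RADIX50 code is its index in this string.")

(defun radix50-pack (string)
  "Pack STRING into one integer, base 40, leftmost character most significant."
  (reduce (lambda (acc ch)
            (+ (* 40 acc)
               (or (position ch *radix50-alphabet*)
                   (error "~S is not a RADIX50 character." ch))))
          string
          :initial-value 0))

(radix50-pack "A2 ")      ; three characters, same as (+ (* 1600 c1) (* 40 c2) c3)
(radix50-pack "%%%%%%")   ; largest six-character value, still below (expt 2 32)
```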

ISTR that RADIX50 was also used inside the PDP-10 assembler,
MACRO-10, and in the linker's symbol table format, again because
you could pack six chars & those extra four flag bits into a
36-bit word.



-Rob

p.s. For those not familiar, the DEC PDP-10 hardware provided
byte access (including auto-incrementing) for all sizes of
bytes from 1 through 36 bits. The only restriction was that
bytes had to fit within the 36-bit PDP-10 words. When an even
number couldn't so fit, the remaining bits were "wasted". E.g.,
the most common format for English plaintext was 7-bit ASCII,
stored five bytes per 36-bit word, with the least significant
bit of each word unused [except... well, that's another story].

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: Espen Vestre
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <m1irc2m8wb.fsf@vestre.net>
····@rpw3.org (Rob Warnock) writes:

> Actually, TOPS-10 used *six*-bit bytes for filenames [uppercase

Hmm. Of course. Thank you for correcting me!
-- 
  (espen)
From: Edi Weitz
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <ubqhvjcf5.fsf@agharta.de>
On Wed, 11 Apr 2007 14:33:24 +0200, "John Thingstad" <··············@chello.no> wrote:

> never seen BYTE mean anything but 8 bits.

  http://www.lispworks.com/documentation/HyperSpec/Body/f_by_by.htm

-- 

Lisp is not dead, it just smells funny.

Real email: (replace (subseq ·········@agharta.de" 5) "edi")
From: Rob Warnock
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <dp6dnRblv9_3AYDbnZ2dnUVZ_jqdnZ2d@speakeasy.net>
Edi Weitz  <········@agharta.de> wrote:
+---------------
| "John Thingstad" <··············@chello.no> wrote:
| > never seen BYTE mean anything but 8 bits.
| 
|   http://www.lispworks.com/documentation/HyperSpec/Body/f_by_by.htm
+---------------

And as it says at the bottom of these:

    http://www.lispworks.com/documentation/HyperSpec/Body/f_ldb.htm#ldb
    http://www.lispworks.com/documentation/HyperSpec/Body/f_dpb.htm#dpb

    Historically, the name ``ldb'' comes from a DEC PDP-10 assembly
    language instruction meaning ``load byte.''

    Historically, the name ``dpb'' comes from a DEC PDP-10 assembly
    language instruction meaning ``deposit byte.''

The PDP-10 also provided ILDB and IDPB (Increment byte pointer then LDB,
and Increment byte pointer then DPB).


-Rob

p.s. I frequently make mistakes with CL's BYTE function, since
the order of "size" & "position" in a "byte-spec" is *backwards*
from what the PDP-10 hardware used. (*sigh*)
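
As a reminder of the CL argument order, here is a minimal sketch that
uses BYTE, DPB and LDB to imitate the five-characters-per-36-bit-word
layout described earlier in this thread. The function names are made up
for this illustration, a plain integer stands in for a PDP-10 word, and
the sketch assumes CHAR-CODE returns ASCII codes for these characters.

```lisp
;; (byte SIZE POSITION): size first, position counted from the low-order bit.
(defun pack-ascii-word (string)
  "Pack up to five 7-bit character codes of STRING into a 36-bit integer,
left-justified, leaving the least significant bit unused."
  (let ((word 0))
    (loop for ch across string
          for position from 29 downto 1 by 7   ; 29, 22, 15, 8, 1
          do (setf word (dpb (char-code ch) (byte 7 position) word)))
    word))

(defun unpack-ascii-word (word)
  "Return the five 7-bit characters stored in WORD as a string."
  (coerce (loop for position from 29 downto 1 by 7
                collect (code-char (ldb (byte 7 position) word)))
          'string))

(unpack-ascii-word (pack-ascii-word "HELLO"))
```

Strings shorter than five characters come back padded with NUL
characters, since the unused 7-bit fields stay zero.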

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
From: Takehiko Abe
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <keke-7CB5E5.21184511042007@nnrp.gol.com>
> >> > If you use UTF-8, delete-if should work because the chars to remove
> >> > are all ascii even if CMUCL is oblivious of unicode.
> >> 
> >> Obviously, not. 
> >
> > Your confidence is amazing.
> 
> It's a question of definition.

Obviously, not.
From: Szymon
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <evgh1o$l3m$1@nemesis.news.tpi.pl>
Takehiko Abe wrote:
>>> when the original file is:
>>> ----------
>>> prova
>>> secondo perché?
>>> ----------
>> Is it unicode ?
>>
>> Afaik CMUCL (but check it, I haven't used CMUCL for years) does not support it.
> 
> If you use UTF-8, delete-if should work because the chars to remove
> are all ascii even if CMUCL is oblivious of unicode.

Page linked by Pascal contains important information:

"CMUCL apparently supports utf-8 I/O through its simple-streams implementation,
as well as iso-8859-1; its regular file streams support only iso-8859-1."

.
From: Takehiko Abe
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <keke-AD6D96.14244111042007@nnrp.gol.com>
> > If you use UTF-8, delete-if should work because the chars to remove
> > are all ascii even if CMUCL is oblivious of unicode.
> 
> Page linked by Pascal contains important information:
> 
> "CMUCL apparently supports utf-8 I/O through its simple-streams implementation,
> as well as iso-8859-1; its regular file streams support only iso-8859-1."
> 

I haven't read Pascal Bourguignon's nonsense to the end.
But sorry, that information is utterly irrelevant. delete-if does
not involve i/o nor stream.

And the best way to use utf-8 in CMUCL is to do nothing special.
If all you need is latin-1 characters, then there's
no point in using utf-8 in the first place.
From: Pascal Bourguignon
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <873b37ma6f.fsf@voyager.informatimago.com>
Takehiko Abe <····@gol.com> writes:

>> > If you use UTF-8, delete-if should work because the chars to remove
>> > are all ascii even if CMUCL is oblivious of unicode.
>> 
>> Page linked by Pascal contains important information:
>> 
>> "CMUCL apparently supports utf-8 I/O through its simple-streams implementation,
>> as well as iso-8859-1; its regular file streams support only iso-8859-1."
>> 
>
> I haven't read Pascal Bourguignon's nonsense to the end.
> But sorry, that information is utterly irrelevant. delete-if does
> not involve i/o nor stream.
>
> And the best way to use utf-8 in CMUCL is to do nothing special.
> If all you need is latin-1 characters, then there's
> no point in using utf-8 in the first place.

It's not what I'd call the "best way".  It's a pis aller.  In some
very special cases, you can do some meaningful processing of the
UTF-8 byte stream this way, but since you're processing bytes, why
don't you just open the file as :element-type '(unsigned-byte 8)?

It's child's play: delete works as well on bytes as on characters.

So with a utf-8 file containing:

------------(/tmp/myfile.utf-8)-------------------
:::::::::::::::::::::::::::::::::::::::
:: C'est mieux avec des octets! éíóú ::
:::::::::::::::::::::::::::::::::::::::
--------------------------------------------------


(defun ascii-bytes (string)
  (map 'vector 
    (lambda (ch)
      (+ 32 (or (position ch " !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~")
                (error "Your string contains a non-ASCII character."))))
    string))


(with-open-file (bytes "/tmp/myfile.utf-8" :element-type '(unsigned-byte 8))
   (let ((buffer (make-array 4096 :element-type  '(unsigned-byte 8) :initial-element 0 :fill-pointer t)))
     (setf (fill-pointer buffer) (read-sequence buffer bytes))
     (print buffer)
     (print (delete-if (lambda (byte) (position byte #.(ascii-bytes "!',.:;?"))) ; you don't need to type bytes!
                       buffer))
     (values)))

prints:

#(58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 10 58 58 32 67
  39 101 115 116 32 109 105 101 117 120 32 97 118 101 99 32 100 101 115 32 111 99 116 101 116 115 33 32 195 169 195 173 195 179 195 186
  32 58 58 10 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 10) 
#(10 32 67 101 115 116 32 109 105 101 117 120 32 97 118 101 99 32 100 101 115 32 111 99 116 101 116 115 32 195 169 195 173 195 179 195
  186 32 10 10) 


-- 
__Pascal Bourguignon__
http://www.informatimago.com
http://pjb.ogamita.org
From: Raffael Cavallaro
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <2007041110054816807-raffaelcavallaro@pasdespamsilvousplaitmaccom>
On 2007-04-11 07:18:16 -0400, Pascal Bourguignon <···@informatimago.com> said:

> It's not what I'd call the "best way".  It's a pis aller.

pis aller = "last resort"
From: John Thingstad
Subject: Re: Clean punctuation problem
Date: 
Message-ID: <op.tqlaynvapqzri1@pandora.upc.no>
On Tue, 10 Apr 2007 15:01:07 +0200, Takehiko Abe <····@gol.com> wrote:

>> > when the original file is:
>> > ----------
>> > prova
>> > secondo perché?
>> > ----------
>>
>> Is it unicode ?
>>
>> Afaik CMUCL (but check it, I haven't used CMUCL for years) does not  
>> support it.
>
> If you use UTF-8, delete-if should work because the chars to remove
> are all ascii even if CMUCL is oblivious of unicode.

You could look at Edi Weitz's library flexi-streams
http://weitz.de/flexi-streams/

Something like

(octets-to-string seq :external-format :utf8)

is what I use.
The advantage of this function is that it can convert between a plethora of
formats: UTF-8, UTF-16, UTF-32, ISO-8859-x, KOI8-R, ASCII.
Similarly, there is a function string-to-octets (it returns an array of
octets, i.e. 8-bit elements).

Here is an example of some code I wrote to read UTF-16 strings and convert
them to iso-latin-1 (ISO-8859-1). It uses CFFI and FLEXI-STREAMS.
(If you have Hunchentoot you should already have this library.)

(defun utf16-string-to-byte-array (wstr length)
   (let* ((size (* 2 length))
          (seq (make-array size :element-type '(unsigned-byte 8))))
     (dotimes (i size)
       (setf (aref seq i) (mem-aref wstr ':uchar i)))
     seq))

(defconstant +double-zero+
   (make-array 2 :element-type '(unsigned-byte 8) :initial-element 0))

(defun utf16-string-to-latin1 (wstr length)
   (let* ((seq (utf16-string-to-byte-array wstr length))
          (size (1+ (search +double-zero+ seq))))
     (octets-to-string seq :external-format :utf16 :end size)))

-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
From: Szymon
Subject: Re: Clean punctuation problem, diffrent approach
Date: 
Message-ID: <evgbkc$44j$1@nemesis.news.tpi.pl>
;; example 1

CL-USER> (defparameter *octet-stream* (open "/home/tichy/suck-next-word.test.txt" :element-type '(unsigned-byte 8) :direction :input))

==> *OCTET-STREAM*

CL-USER> (suck-next-word *octet-stream*)

==> "prova"

CL-USER> (suck-next-word *octet-stream*)

==> "secondo"

CL-USER> (suck-next-word *octet-stream*)

==> "perché"

CL-USER> (suck-next-word *octet-stream*)

==> NIL

CL-USER> (close *octet-stream*)

==> T


;; example 2

CL-USER> (with-open-file (in "/home/tichy/suck-next-word.test.txt"
                           :element-type '(unsigned-byte 8)
                           :direction    :input)
            (loop collect (let ((word (suck-next-word in))) (unless word (loop-finish)) word)))

==> ("prova" "secondo" "perché")

;; code, not well tested. so be careful.

(defparameter *non-word-characters*
   #.(map 'vector #'char-code
          (format nil "~@{~C~}\\\"()-`:;,.?!" #\Tab #\Space #\Newline #\Null)))

(defun suck-next-word (stream)
   (assert (equal (stream-element-type stream) '(unsigned-byte 8)))
   (let ((octet-bag
          (load-time-value (make-array 100 :element-type '(unsigned-byte 8) :fill-pointer 0)))
         (nw-chars *non-word-characters*))
     (setf (fill-pointer octet-bag) 0)
     (loop (handler-case (read-byte stream t)
             (end-of-file () (return))
             (:no-error (octet)
               (if (find octet nw-chars)
                   (unless (zerop (fill-pointer octet-bag)) (return))
                 (vector-push octet octet-bag)))))
     (unless (zerop (length octet-bag))
       (map-into (make-string (length octet-bag)) #'code-char octet-bag))))

HTH, Szymon.