From Word to comma-delimited files

From: An[z]elmus
Subject: From Word to comma-delimited files
Date: Thu, 08 Jan 2009 13:04:15 +0000
Message-ID: <njtbm4tshm55p2si2s0qqmo5v64856e0n9@4ax.com>

I have not practiced Lisp for a year, but now I think I need it, and
it seems that I have forgotten even the few things I learned.
I want to build comma-delimited files starting from a bunch of Word
documents, then I want to import all the data into a MySQL database.
The documents contain the inventory of all the books of a library.
There are thousands of books. I don't earn money for this job.
After saving the file(s) in text format, each entry appears separated
from the other by a blank line and is organized in the following way,
line by line:

1) Classification (always present)
2) Author (sometimes missing)
3) Title (always present)
4) Subtitle or Translation (sometimes missing)
5) Publication data (always present)

As I said, I want to transfer everything into a CSV file with the 5
fields mentioned above. The data for columns 2 and 4 are sometimes
missing, in which case I have to insert an empty field.
Here is a sample of the text files that need to be treated:

------------------------------------------------------
400 BUSON c 2
Buson - Sono futatsu no tabi
[Buson - I due viaggi]
TOKYO: Asahi Shimbunsha, 2001 – p. 183

400 BUSON c 3
Buson
TOKYO: Nihon Keizai Shimbunsha

400 BUSON c 4
Yosa Buson – Kakemeguru omoi
Yosa Buson – On the wings of art
SHIGA: Miho Museum, 2008 – p. 397

400 CHIKUDEN a 1
SASAKI, Kozo
Chikuden – Toyo bijutsu sensho
[Chikuden – Collana d’Arte Estremo-orientale]
TOKYO: Sansaisha, 1970, 1977 – tav. 38 + p. 74
------------------------------------------------------

As you can see, there are probably "regularities" that may be captured
with regular expressions. For example: line 2 (Author, which is
sometimes missing) always starts with the surname all in capital
letters, while the third field (Title, always present) usually has only
the first letter capitalized. So my idea is to deal first with line 2,
inserting a blank field where appropriate. After I have done that,
only one field out of 5 could still be missing.
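
The idea described above could be sketched roughly as follows in Common
Lisp. This is only an illustration under stated assumptions, not code from
the thread: the names AUTHOR-LINE-P and NORMALIZE-CARD are hypothetical, a
card is assumed to be a list of 3 to 5 strings, and the "first word all in
capitals" test is a guess that will misfire on inconsistently typed records
(for example a title that starts with an acronym).

```lisp
(defun author-line-p (line)
  "Guess whether LINE is an author line: its first word has at least two
characters, all of them upper-case letters (e.g. \"SASAKI, Kozo\")."
  (let* ((end (or (position-if (lambda (c) (member c '(#\Space #\,))) line)
                  (length line)))
         (word (subseq line 0 end)))
    (and (>= (length word) 2)
         (every #'upper-case-p word))))

(defun normalize-card (card)
  "Return CARD as a list of exactly five fields, using \"\" for a missing
author or subtitle. Return NIL if CARD does not have 3, 4 or 5 lines."
  (case (length card)
    (5 card)
    ;; Four lines: either the author or the subtitle is missing.
    (4 (destructuring-bind (classification line2 line3 publication) card
         (if (author-line-p line2)
             (list classification line2 line3 "" publication)
             (list classification "" line2 line3 publication))))
    ;; Three lines: both the author and the subtitle are missing.
    (3 (destructuring-bind (classification title publication) card
         (list classification "" title "" publication)))
    (t nil)))
```

Cards for which NORMALIZE-CARD returns NIL do not fit the expected layout
and would have to be checked by hand.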


From: An[z]elmus
Subject: Re: From Word to comma-delimited files
Date: Thu, 08 Jan 2009 13:09:55 +0000
Message-ID: <uiubm49l5g4omtsaqk877i22f4j9a1e3r9@4ax.com>

On Thu, 08 Jan 2009 14:04:15 +0100, "An[z]elmus"
<·······@somewhere.org> wrote:
>I want to build comma-delimited files starting from a bunch of Word
>documents, then I want to import all the data into a MySQL database.
>The documents contain the inventory of all the books of a library.
>There are thousands of books

The entire catalog in MS Word may be downloaded from:
http://csaeo.altervista.org/biblio.htm

From: Kaz Kylheku
Subject: Re: From Word to comma-delimited files
Date: Thu, 08 Jan 2009 20:08:17 +0000
Message-ID: <20090114012421.618@gmail.com>

On 2009-01-08, An[z]elmus <·······@somewhere.org> wrote:
> As you can see, there are probably "regularities" that may be captured
> with regular expressions. For example: line 2 (Author, which is
> sometimes missing) always starts with the surname all in capital

``Always'' is a strong word to use on data that was typed by librarians into MS
Word.

This conversion job calls for some measure of manual gruntwork, with just
the right amount of automation of the low-level tasks.

What I would do is write a text-to-text filtering program that normalizes a
single catalog record. Or at least tries to.

Then from my text editor, I would interactively pipe successive records
through the filter, verifying that the results make sense.

I would expect that there will likely be additional hand-tweaking necessary to
improve the fidelity of some of the data. There may be spelling errors, etc.
Those changes will be harder to make once the data is in an unwieldy SQL
database.

In Vim you can easily filter a paragraph of text through an external command
using !]command<Enter>.   Add another ] to skip over the filtered paragraph.
These commands can be stored in a buffer, say buffer z, and invoked using @z.
then repeated using @@.

You could munge through thousands of records fairly quickly.

From: An[z]elmus
Subject: Re: From Word to comma-delimited files
Date: Thu, 08 Jan 2009 21:10:01 +0000
Message-ID: <34pcm4lebl6orfconotvvjlfvdjq36use9@4ax.com>

On Thu, 8 Jan 2009 20:08:17 +0000 (UTC), Kaz Kylheku
<········@gmail.com> wrote:
>``Always'' is a strong word to use on data that was typed by librarians into MS
>Word.
>This conversion job calls for some measure of manual gruntwork, with just
>the right amount of automation of the low-level tasks.
>
Yes, the data are not uniformly and coherently typed. Some manual checking
has to be done before and after the automated adjustments. For
example, after having "normalized" the records I open the CSV file using
a special free editor (CSVed) that allows me to further split the
columns, so that from the starting 5 fields I end up with 11.
 
>What I would do is write a text-to-text filtering program that normalizes a
>single catalog record. Or at least tries to.
>
I don't know exactly what you mean, but I can't go thoroughly through
each record, I don't have enough energy, and I prefer to alter the
information as little as possible. I can just skim the data
superficially before and, with a little more attention, afterwards. If too
many records are wrong I'll try again, making some changes. What I am
trying to do is in fact a kind of normalization of the records into 5 base
fields.

>Then from my text editor, I would interactively pipe successive records
>through the filter, verifying that the results make sense.
>
>I would expect that there will likely be additional hand-tweaking necessary to
>improve the fidelity of some of the data. There may be spelling errors, etc.
>Those changes will be harder to make once the data is in an unwieldy SQL
>database.
>
On the contrary, I thought that some tweaking could be done more easily
column by column once the data have been transferred into the
database. For example, some punctuation marks would no longer be
necessary, and I thought they could be easily removed.

>In Vim you can easily filter a paragraph of text through an external command
>using !]command<Enter>.   Add another ] to skip over the filtered paragraph.
>These commands can be stored in a buffer, say buffer z, and invoked using @z.
>then repeated using @@.
>
I know Vim very superficially and I am afraid a little demonstration
would be necessary.

From: Kaz Kylheku
Subject: Re: From Word to comma-delimited files
Date: Thu, 08 Jan 2009 22:03:58 +0000
Message-ID: <20090114032411.229@gmail.com>

On 2009-01-08, An[z]elmus <·······@somewhere.org> wrote:
> On Thu, 8 Jan 2009 20:08:17 +0000 (UTC), Kaz Kylheku
><········@gmail.com> wrote:
>>``Always'' is a strong word to use on data that was typed by librarians into MS
>>Word.
>>This conversion job calls for some measure of manual gruntwork, with just
>>the right amount of automation of the low-level tasks.
>>
> Yes, the data are not uniformly and coherently typed. Some manual checking
> has to be done before and after the automated adjustments. For
> example, after having "normalized" the records I open the CSV file using
> a special free editor (CSVed) that allows me to further split the
> columns, so that from the starting 5 fields I end up with 11.
>  
>>What I would do is write a text-to-text filtering program that normalizes a
>>single catalog record. Or at least tries to.
>>
> I don't know exactly what you mean, but I can't go thoroughly through
> each record, I don't have enough energy, and I prefer to alter the
> information as little as possible.

But if you leave it to the machine, the algorithm might make mistakes.  A title
may end up interpreted as an author name, and a caption as a title, etc.

>>In Vim you can easily filter a paragraph of text through an external command
>>using !]command<Enter>.   Add another ] to skip over the filtered paragraph.
>>These commands can be stored in a buffer, say buffer z, and invoked using @z.
>>then repeated using @@.
>>
> I know Vim very superficially and I am afraid a little demonstration
> would be necessary.

If you have a paragraph like:

asdf asdf asdf adf
adsf asdf as adsf
asdfasdf

Put the cursor on the empty line above the paragraph. You can move to the
empty line just after the paragraph using ] and back using [. These motions
can be combined with the ! command to put the text through a pipe instead.
Example: pipe paragraph through cut program to extract character columns 1
through 8: !]cut -c 1-8<Enter>. This leaves the cursor back at the top of the
chopped paragraph.  You want to be at the bottom, so you can just repeat the
same thing for the next paragraph and so on. That requires one more ] . You can
record all this to a named buffer (like the buffer called z) like this: 
qz!]cut -c 1-8<Enter>]q.  The qz command starts recording under z, and q stops
recording. Now you can execute the buffer using @z. The last execution can be
repeated with @@, which can be typed fast with keyboard auto-repeat.

From: Jeff Schwab
Subject: Re: From Word to comma-delimited files
Date: Thu, 08 Jan 2009 22:24:10 +0000
Message-ID: <N4CdneiWocmX4PvUnZ2dnUVZ_t7inZ2d@giganews.com>

Kaz Kylheku wrote:
> On 2009-01-08, An[z]elmus <·······@somewhere.org> wrote:
>> On Thu, 8 Jan 2009 20:08:17 +0000 (UTC), Kaz Kylheku
>> <········@gmail.com> wrote:

>>> In Vim you can easily filter a paragraph of text through an external command
>>> using !]command<Enter>.   Add another ] to skip over the filtered paragraph.
>>> These commands can be stored in a buffer, say buffer z, and invoked using @z.
>>> then repeated using @@.
>>>
>> I know Vim very superficially and I am afraid a little demonstration
>> would be necessary.
> 
> If you have a paragraph like:
> 
> asdf asdf asdf adf
> adsf asdf as adsf
> asdfasdf
> 
> Put the cursor on the empty line above the paragraph. You can move to the
> empty line just after the paragraph using ] and back using [. These motions
> can be combined with the ! command to put the text through a pipe instead.
> Example: pipe paragraph through cut program to extract character columns 1
> through 8: !]cut -c 1-8<Enter>. This leaves the cursor back at the top of the
> chopped paragraph.  You want to be at the bottom, so you can just repeat the
> same thing for the next paragraph and so on. That requires one more ] . You can
> record all this to a named buffer (like the buffer called z) like this: 
> qz!]cut -c 1-8<Enter>]q.  The qz command starts recording under z, and q stops
> recording. Now you can execute the buffer using @z. The last execution can be
> repeated with @@, which can be typed fast with keyboard auto-repeat.

:%s/buffer/register/g
:%s/]/}/g
:%s/\[/\{/
:%s/]/}/

From: Kaz Kylheku
Subject: Re: From Word to comma-delimited files
Date: Thu, 08 Jan 2009 22:47:35 +0000
Message-ID: <20090114041756.451@gmail.com>

On 2009-01-08, Jeff Schwab <····@schwabcenter.com> wrote:
>:%s/]/}/g

Oh yes, I now see that indeed I am hitting the Shift key at the time the
paragraph motion is happening. Sorry!

From: Jeff Schwab
Subject: [OT] Re: From Word to comma-delimited files
Date: Thu, 08 Jan 2009 22:50:16 +0000
Message-ID: <fKidnet3obO1HvvUnZ2dnUVZ_tLinZ2d@giganews.com>

Kaz Kylheku wrote:
> On 2009-01-08, Jeff Schwab <····@schwabcenter.com> wrote:
>> :%s/]/}/g
> 
> Oh yes, I now see that indeed I am hitting the Shift key at the time the
> paragraph motion is happening. Sorry!

I just started teaching my wife to use Vim, and I make the same kinds of
mistakes.  It's amazing how much I don't know about what my fingers are
doing while I work. :)

From: An[z]elmus
Subject: Re: From Word to comma-delimited files
Date: Thu, 08 Jan 2009 22:38:53 +0000
Message-ID: <0dvcm457vjd0iugdr26pq7f8f7pr7ie2nn@4ax.com>

On Thu, 8 Jan 2009 22:03:58 +0000 (UTC), Kaz Kylheku
<········@gmail.com> wrote:

>> I don't know exactly what you mean, but I can't go thoroughly through
>> each record, I don't have enough energy, and I prefer to alter the
>> information as little as possible.
>
>But if you leave it to the machine, the algorithm might make mistakes.  A title
>may end up interpreted as an author name, and a caption as a title, etc.

Yes, this happens all the time. I don't expect everything to be
perfect. But if the misinterpretations, mostly because the underlying
data were not consistently typed, are a few out of, say, one hundred
records, it is not too bad. If there are many, the algorithm is wrong
or, even worse, no satisfactory algorithm is possible for the task.
I'll eventually try Vim.

From: An[z]elmus
Subject: Re: From Word to comma-delimited files
Date: Thu, 08 Jan 2009 19:36:30 +0000
Message-ID: <gmkcm4lp3ttl5qrfol8i5tugjooq5rcsu4@4ax.com>

On Thu, 08 Jan 2009 14:04:15 +0100, "An[z]elmus"
<·······@somewhere.org> wrote:

>1) Classification (always present)
>2) Author (sometimes missing)
>3) Title (always present)
>4) Subtitle or Translation (sometimes missing)
>5) Publication data (always present)

So far I have come up with this:

(defun file-lines (path)
  "Sucks up an entire file from PATH into a list of freshly-allocated
      strings, returning two values: the list of strings and the number of
      lines read."
  (with-open-file (s path)
    (loop for line = (read-line s nil nil)
          and line-count from 0
          while line
          collect line into lines
          finally (return (values lines line-count)))))

The function above (by Rob Warnock) collects all the lines of the file
into a list.

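(defun collect-cards (lines)
  "Group LINES into cards (lists of lines), using blank lines as separators."
  (let ((schede '())   ; all cards collected so far, in reverse order
        (scheda '()))  ; lines of the current card, in reverse order
    (flet ((close-card ()
             ;; Store the current card, if any, and start a new one.
             (when scheda
               (push (reverse scheda) schede)
               (setq scheda '()))))
      (dolist (x lines)
        (if (equal x "")
            (close-card)
            (push x scheda)))
      ;; The last card may not be followed by a blank line.
      (close-card)
      (reverse schede))))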

And this one collects the cards into a list of lists, skipping the blank
lines.

TO DO:
- Insert separators (a pipe or a dollar sign) including consecutive
ones for blank fields.

- Print back the lists line by line to a CSV file.
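
The two TO DO items might be sketched along these lines. This is only a
rough illustration, not code from the thread: the names CARD-TO-LINE and
WRITE-CARDS are hypothetical, each card is assumed to have already been
padded to five string fields (with "" for a missing one), and the pipe is
used as separator. No escaping is done, so a separator character that
occurs inside a field would corrupt the output.

```lisp
(defun card-to-line (fields &optional (separator #\|))
  "Join FIELDS (a list of strings) into one string, with SEPARATOR between
consecutive fields. Empty fields yield consecutive separators."
  (with-output-to-string (out)
    (loop for (field . more) on fields
          do (write-string field out)
          when more do (write-char separator out))))

(defun write-cards (cards path &optional (separator #\|))
  "Write each card in CARDS to the file PATH, one delimited line per card.
An existing file at PATH is overwritten."
  (with-open-file (out path :direction :output :if-exists :supersede)
    (dolist (card cards)
      (write-line (card-to-line card separator) out))))
```

For example, (card-to-line '("400 BUSON c 3" "" "Buson" "" "TOKYO: Nihon
Keizai Shimbunsha")) gives "400 BUSON c 3||Buson||TOKYO: Nihon Keizai
Shimbunsha". A pipe is a safer choice than a comma here because the sample
records already contain commas inside fields.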