substitute #\& "&amp;" my-str

From: verec
Subject: substitute #\& "&amp;" my-str
Date: Fri, 27 Jan 2006 19:29:50 +0000
Message-ID: <43da74af$0$87292$5a6aecb4@news.aaisp.net.uk>

I'm trying to "un-escape" an "html-encoded" string, and
thought that I could use substitute as in:

> (substitute #\& "&amp;" "lamb &amp; elephant &amp; &quot;squale&quot;")

unfortunately, this doesn't seem to work.

Nor does:

> (substitute "&" "&amp;" "lamb &amp; elephant &amp; &quot;squale&quot;")

as both return the sequence, unchanged.

The CLHS entry for

  substitute newitem olditem sequence

doesn't seem to say anything about the "type compatibility" among
newitem, olditem and sequence, only that the first two should be
"objects" and the third one a "proper sequence".

Did I miss some magic builtin CL function, or should I roll my own?

BTW: I had a look at split-sequence, and it seems that this couldn't
work either, because it uses position internally, and

> (position "&amp;" "lamb &amp; elephant &amp; &quot;squale&quot;")
returns nil.

Many thanks
--
JFB

From: Kaz Kylheku
Subject: Re: substitute #\& "&amp;" my-str
Date: Fri, 27 Jan 2006 19:40:02 +0000
Message-ID: <1138390802.221169.75280@g44g2000cwa.googlegroups.com>

verec wrote:
> I'm trying to "un-escape" an "html-encoded" string, and
> thought that I could use substitute as in:
>
> > (substitute #\& "&amp;" "lamb &amp; elephant &amp; &quot;squale&quot;")
>
> unfortunately, this doesn't seem to work.
>
> Nor does:
>
> > (substitute "&" "&amp;" "lamb &amp; elephant &amp; &quot;squale&quot;")
>
> as both return the sequence, unchanged.
>
> The CLHS entry for
>
>   substitute newitem olditem sequence

The sequences in question are strings.

SUBSTITUTE finds and replaces individual elements of a sequence.

The element of a string, the item, is a character, not a substring.

> Did I miss some magic builtin CL function, or should I roll my own?

SEARCH, REPLACE
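
To make the element-versus-substring point concrete, here is a small
sketch using only standard functions. The sample strings are just
illustrations. Note that REPLACE overwrites in place and never changes
the length of its target, so by itself it cannot turn "&amp;" into "&".

```lisp
;; SUBSTITUTE compares each element (a character) with OLDITEM using EQL,
;; so character-for-character replacement works:
(substitute #\+ #\& "lamb & elephant")          ; => "lamb + elephant"

;; A string is never EQL to a character, so nothing matches here and
;; the result has the same contents as the input:
(substitute "&" "&amp;" "lamb &amp; elephant")  ; => "lamb &amp; elephant"

;; POSITION also works on elements; SEARCH is the one that finds
;; a subsequence:
(position "&amp;" "lamb &amp; elephant")        ; => NIL
(search "&amp;" "lamb &amp; elephant")          ; => 5

;; REPLACE copies elements into an existing sequence, in place.
;; COPY-SEQ is used because modifying a literal string is undefined.
(replace (copy-seq "lamb & elephant") "+" :start1 5) ; => "lamb + elephant"
```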

From: verec
Subject: Re: substitute #\& "&amp;" my-str
Date: Fri, 27 Jan 2006 19:52:15 +0000
Message-ID: <43da79ef$0$87294$5a6aecb4@news.aaisp.net.uk>

On 2006-01-27 19:40:02 +0000, "Kaz Kylheku" <········@gmail.com> said:

>> Did I miss some magic builtin CL function, or should I roll my own?
> 
> SEARCH, REPLACE

Sounds like "roll-my-own" time then, obviously using primitives
like search and replace.

Many thanks
--
JFB

From: Kaz Kylheku
Subject: Re: substitute #\& "&amp;" my-str
Date: Fri, 27 Jan 2006 19:56:32 +0000
Message-ID: <1138391792.264725.137980@f14g2000cwb.googlegroups.com>

verec wrote:
> On 2006-01-27 19:40:02 +0000, "Kaz Kylheku" <········@gmail.com> said:
>
> >> Did I miss some magic builtin CL function, or should I roll my own?
> >
> > SEARCH, REPLACE
>
> Sounds like "roll-my-own" time then, obviously using primitives
> like search and replace.
>
> Many thanks
> --
> JFB

Isn't there some HTML or XML library out there with escape and
unescape?

Also, regex maybe?

CL-PPCRE probably has some slick way of doing this.

I'm looking at the web page, and it looks like REGEX-REPLACE-ALL is
the thing.

From: Wade Humeniuk
Subject: Re: substitute #\& "&amp;" my-str
Date: Fri, 27 Jan 2006 19:58:58 +0000
Message-ID: <6YuCf.110866$AP5.22455@edtnps84>

verec wrote:
> I'm trying to "un-escape" an "html-encoded" string, and
> thought that I could use substitute as in:
> 
>> (substitute #\& "&amp;" "lamb &amp; elephant &amp; &quot;squale&quot;")
> 
> unfortunately, this doesn't seem to work.
> 

Use cl-ppcre

http://www.cliki.net/CL-PPCRE

CL-USER 4 > (cl-ppcre:regex-replace-all "&amp;" "lamb &amp; elephant &amp; &quot;squale&quot;" "&")
"lamb & elephant & &quot;squale&quot;"

CL-USER 5 >

Wade

From: verec
Subject: Re: substitute #\& "&amp;" my-str
Date: Fri, 27 Jan 2006 20:56:35 +0000
Message-ID: <43da8903$0$87292$5a6aecb4@news.aaisp.net.uk>

On 2006-01-27 19:58:58 +0000, Wade Humeniuk 
<··················@telus.net> said:

> verec wrote:
>> I'm trying to "un-escape" an "html-encoded" string, and
>> thought that I could use substitute as in:
>> 
>>> (substitute #\& "&amp;" "lamb &amp; elephant &amp; &quot;squale&quot;")
>> 
>> unfortunately, this doesn't seem to work.
>> 
> 
> Use cl-ppcre
> 
> http://www.cliki.net/CL-PPCRE
> 
> CL-USER 4 > (cl-ppcre:regex-replace-all "&amp;" "lamb &amp; elephant &amp; &quot;squale&quot;" "&")
> "lamb & elephant & &quot;squale&quot;"

Thanks for the pointer.

In the mean-time, I had come up with:

(defmacro while (test &body body)
  `(do ()
       ((not ,test))
     ,@body))

(defun search-replace (what with where)
  (let ((pos nil)
        (what-len (length what)))
    (while (setf pos (search what where))
      (setf where
            (with-output-to-string (out)
              (write-string (subseq where 0 pos) out)
              (write-string with out)
              (write-string (subseq where (+ pos what-len)) out))))
    where))

CL-USER 46 > (search-replace "&amp;" "&" "lamb &amp; elephant &amp; &quot;squale&quot;")
"lamb & elephant & &quot;squale&quot;"

CL-USER 47 > (search-replace "&quot;" "\"" "lamb & elephant & &quot;squale&quot;")
"lamb & elephant & \"squale\""

which seems to work, but that I am not too happy with because
it generates three temporary strings for each occurrence of
the searched for string.

I'm having a look at CL-PPCRE now.

Many thanks
--
JFB

From: Kaz Kylheku
Subject: Re: substitute #\& "&amp;" my-str
Date: Fri, 27 Jan 2006 22:22:57 +0000
Message-ID: <1138400577.402716.296560@o13g2000cwo.googlegroups.com>

verec wrote:
> (defun search-replace (what with where)
>   (let ((pos nil)
>         (what-len (length what)))
>     (while (setf pos (search what where))
>       (setf where
>             (with-output-to-string (out)
>               (write-string (subseq where 0 pos) out)
>               (write-string with out)
>               (write-string (subseq where (+ pos what-len)) out))))
>     where))

This idiom is somewhat awkward with the SETF:

(let ((var nil))
  (while (setf var (compute-var))
     (expr)))

Idea:

(loop for var = (compute-var)
      while var
      do (expr))

> CL-USER 46 > (search-replace "&amp;" "&" "lamb &amp; elephant &amp; &quot;squale&quot;")
> "lamb & elephant & &quot;squale&quot;"
>
> CL-USER 47 > (search-replace "&quot;" "\"" "lamb & elephant & &quot;squale&quot;")
> "lamb & elephant & \"squale\""
>
> which seems to work, but that I am not too happy with because

Does it really work? Ha ha!

What if you try to replace "foo" with "foo" in a string that contains
"foo"? Because your algorithm re-scans the string from the beginning,
it will never terminate. It will keep finding "foo" and replacing it
with "foo".

What's worse, if you replace "foo" with "foos", it will keep growing
the string until memory runs out: "...foo..." -> "...foos..." ->
"...fooss..." and so forth.

Tant pis!
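
One way around both problems, as a rough sketch: resume each search
just past the previous match, so the replacement text is never
rescanned. The name SEARCH-REPLACE-ONCE-THROUGH is made up for this
example, and the empty-pattern check is my own assumption about what
should happen in that case.

```lisp
(defun search-replace-once-through (what with where)
  "Return a copy of WHERE with each occurrence of WHAT replaced by WITH.
The scan moves strictly forward, so it terminates even when WITH
contains WHAT."
  (if (zerop (length what))
      where                      ; an empty pattern matches everywhere; give up
      (with-output-to-string (out)
        (loop with what-len = (length what)
              with start = 0
              for pos = (search what where :start2 start)
              ;; Copy the untouched stretch up to the match, or up to
              ;; the end of the string when POS is NIL.
              do (write-string where out :start start :end pos)
              while pos
              do (write-string with out)
                 (setf start (+ pos what-len))))))
```

With this, replacing "foo" with "foos" in "a foo b" gives "a foos b"
and stops.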

> it generates three temporary strings for each occurrence of
> the searched for string.

It also generates an entire string-stream object to build one of those
strings.

What you can do is make two passes over the input string. In the first
pass, calculate exactly how many characters are required to do the
replacement.  Then make a string which is exactly that big using
MAKE-STRING.

Then, in a second pass, copy pieces of the original string, and the
replacement string, into that buffer using REPLACE.
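
A minimal sketch of that two-pass idea, under a couple of assumptions
of mine: the first pass collects the match positions in a list (rather
than only counting them) so the second pass need not search again, and
an empty pattern simply yields a copy. REPLACE-ALL-TWO-PASS is a
made-up name.

```lisp
(defun replace-all-two-pass (what with where)
  "Return a fresh string: WHERE with each occurrence of WHAT replaced by WITH."
  (let ((what-len (length what))
        (with-len (length with)))
    (if (zerop what-len)
        (copy-seq where)
        (let* (;; Pass 1: find every non-overlapping match, left to right.
               (matches (loop with start = 0
                              for pos = (search what where :start2 start)
                              while pos
                              collect pos
                              do (setf start (+ pos what-len))))
               ;; The exact size is now known, so allocate once.
               (result (make-string (+ (length where)
                                       (* (length matches)
                                          (- with-len what-len)))))
               (src 0)
               (dst 0))
          ;; Pass 2: copy pieces of WHERE and copies of WITH into RESULT.
          (dolist (pos matches)
            (replace result where :start1 dst :start2 src :end2 pos)
            (incf dst (- pos src))
            (replace result with :start1 dst)
            (incf dst with-len)
            (setf src (+ pos what-len)))
          ;; Copy whatever follows the last match.
          (replace result where :start1 dst :start2 src)
          result))))
```

The only allocations are the list of positions and the result string
itself.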

From: verec
Subject: Re: substitute #\& "&amp;" my-str
Date: Sat, 28 Jan 2006 11:45:37 +0000
Message-ID: <43db5961$0$87293$5a6aecb4@news.aaisp.net.uk>

On 2006-01-27 22:22:57 +0000, "Kaz Kylheku" <········@gmail.com> said:

> This idiom is somewhat awkward with the SETF:
> 
> (let ((var nil))
>   (while (setf var (compute-var))
>      (expr)))

I'm more than happy to correct anything that is non-idiomatic
or inefficient, but what is it, precisely, that you find
awkward? Using the value returned by setf? Or using it as
an argument to while?

> (loop for var = (compute-var)
>       while var
>       do (expr))

Noted.

>> which seems to work, but that I am not too happy with because
> 
> Does it really work? Ha ha!

I am aware of those cases. One more reason to look elsewhere :-)

> What you can do is make two passes over the input string. In the first
> pass, calculate exactly how many characters are required to do the
> replacement.  Then make a string which is exactly that big using
> MAKE-STRING.
> 
> Then, in a second pass, copy pieces of the original string, and the
> replacement string, into that buffer using REPLACE.

Good idea. The reason I didn't use replace is because it cannot
expand or shrink the sequence. But in the design you suggest, this
is what is wanted.

I'll give it a try.

[Right now, I'm trying to tell CL-PPCRE to replace &#41 with #\A,
ie &#xy with (code-char (from-base-16 xy))]

Many thanks
--
JFB

From: Edi Weitz
Subject: Re: substitute #\& "&amp;" my-str
Date: Sat, 28 Jan 2006 15:24:39 +0000
Message-ID: <ulkx0xtmw.fsf@agharta.de>

On Sat, 28 Jan 2006 11:45:37 +0000, verec <·····@mac.com> wrote:

> [Right now, I'm trying to tell CL-PPCRE to replace &#41 with #\A, ie
> &#xy with (code-char (from-base-16 xy))]

Assuming you forgot the semicolon and any number of hex digits (not
only two) is OK, it'd look more or less like this (untested):

  (defun foo (string)
    (ppcre:regex-replace-all "&#([0-9a-fA-F]+);"
                             string
                             (lambda (whole-match hex-string)
                               (declare (ignore whole-match))
                               (string
                                (code-char
                                 (parse-integer hex-string :radix 16))))
                             :simple-calls t))

If you're very much concerned about micro-efficiency you should use
REGEX-REPLACE-ALL without the SIMPLE-CALLS argument.  The closure will
be more complicated then, of course.
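
For illustration, a sketch of what the version without SIMPLE-CALLS
might look like. It assumes the replacement function then receives the
target string plus match and register boundaries, with the register
boundaries passed as arrays; HEX-ENTITY-REPLACEMENT is a made-up name,
and this is untested against CL-PPCRE itself.

```lisp
(defun hex-entity-replacement (target-string start end
                               match-start match-end
                               reg-starts reg-ends)
  (declare (ignore start end match-start match-end))
  ;; Parse the hex digits straight out of TARGET-STRING, using the
  ;; bounds of the first register, without creating a substring.
  (string
   (code-char
    (parse-integer target-string
                   :start (aref reg-starts 0)
                   :end (aref reg-ends 0)
                   :radix 16))))

;; Intended use (needs CL-PPCRE loaded, so left as a comment here):
;; (ppcre:regex-replace-all "&#([0-9a-fA-F]+);" string
;;                          #'hex-entity-replacement)
```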

HTH,
Edi.

-- 

Lisp is not dead, it just smells funny.

Real email: (replace (subseq ·········@agharta.de" 5) "edi")

From: verec
Subject: Re: substitute #\& "&amp;" my-str
Date: Sat, 28 Jan 2006 23:57:42 +0000
Message-ID: <43dc04f6$0$87290$5a6aecb4@news.aaisp.net.uk>

On 2006-01-28 15:24:39 +0000, Edi Weitz <········@agharta.de> said:

>  (defun foo (string)
>     (ppcre:regex-replace-all "&#([0-9a-fA-F]+);"
>                              string
>                              (lambda (whole-match hex-string)
>                                (declare (ignore whole-match))
>                                (string
>                                 (code-char
>                                  (parse-integer hex-string :radix 16))))
>                              :simple-calls t))

Excellent.

(defun replace-&#xx (string)
  (ppcre:regex-replace-all "&#([0-9a-fA-F]+);"
                           string
                           (lambda (whole-match hex-string)
                             (declare (ignore whole-match))
                             (string
                              (code-char
                               (parse-integer hex-string :radix 16))))
                           :simple-calls t))

(defun simple-replace (text)
  (dolist (x '(("&amp;" . "&") ("&lt;" . "<") ("&gt;" . ">")
               ("&quot;" . "\"")))
    (setf text (ppcre:regex-replace-all (car x) text (cdr x))))
  text)

(defun unescape (text)
  (simple-replace (replace-&#xx text)))

CL-USER> (unescape "&#41;propos &lt;&#5e; was &quot;easy&quot;?&gt; &amp; simple?")
"Apropos <^ was \"easy\"?> & simple?"

Many thanks.
--
JFB

From: Edi Weitz
Subject: Re: substitute #\& "&amp;" my-str
Date: Sun, 29 Jan 2006 00:16:08 +0000
Message-ID: <uvew3x513.fsf@agharta.de>

On Sat, 28 Jan 2006 23:57:42 +0000, verec <·····@mac.com> wrote:

> (defun simple-replace (text)
>   (dolist (x '(("&amp;" . "&") ("&lt;" . "<") ("&gt;" . ">")
>                ("&quot;" . "\"")))
>     (setf text (ppcre:regex-replace-all (car x) text (cdr x))))
>   text)

I'd suggest something like the following:

  (defun simple-replace (text)
    (ppcre:regex-replace-all "&(amp|lt|gt|quot);"
                             text
                             #'find-simple-replacement))

  (defun find-simple-replacement (target-string start end
                                  match-start match-end
                                  reg-starts reg-ends)
    (declare (ignore start end match-end reg-starts reg-ends))
    (case (char target-string (1+ match-start))
      (#\a "&")
      (#\l "<")
      (#\g ">")
      (#\q "\"")))

In your implementation CL-PPCRE has to compile four regular
expressions at run time for each call of SIMPLE-REPLACE.  In my
version it will compile one regular expression once at compile time
and that's it.  See the compiler macros in `api.lisp' for details, or
trace the function PPCRE:CREATE-SCANNER for both versions.

Cheers,
Edi.

From: verec
Subject: Re: substitute #\& "&amp;" my-str
Date: Sun, 29 Jan 2006 00:48:27 +0000
Message-ID: <43dc10dc$0$87290$5a6aecb4@news.aaisp.net.uk>

On 2006-01-29 00:16:08 +0000, Edi Weitz <········@agharta.de> said:

> I'd suggest something like the following:
> 
>   (defun simple-replace (text)
>     (ppcre:regex-replace-all "&(amp|lt|gt|quot);"
>                              text
>                              #'find-simple-replacement))
> 
>   (defun find-simple-replacement (target-string start end
>                                   match-start match-end
>                                   reg-starts reg-ends)
>     (declare (ignore start end match-end reg-starts reg-ends))
>     (case (char target-string (1+ match-start))
>       (#\a "&")
>       (#\l "<")
>       (#\g ">")
>       (#\q "\"")))
> 
> In your implementation CL-PPCRE has to compile four regular
> expressions at run time for each call of SIMPLE-REPLACE.  In my
> version it will compile one regular expression once at compile time
> and that's it.  See the compiler macros in `api.lisp' for details, or
> trace the function PPCRE:CREATE-SCANNER for both versions.

What can I say?

Thanks again, Edi. Works like a charm.
--
JFB