INTERN hygiene in Web applications

From: Ivan Shvedunov
Subject: INTERN hygiene in Web applications
Date: Tue, 17 Jan 2006 17:18:30 +0000
Message-ID: <434n77F1jhkj2U1@individual.net>

   Hi all!

   I've noticed an unpleasant potential security problem present in many 
CL libraries related to XML and Web applications in general. Most likely 
I'm very far from being the first to notice this, but I didn't see much 
discussion on this topic.

   It's not uncommon for aforementioned libraries to call INTERN on 
user-supplied strings, such as XML element/attribute names or GET/POST 
parameters (more sure about XML). More often than not, the symbols go to 
KEYWORD package. This is rather unfortunate situation as besides opening
up possibilities of various DoS attacks on package system 
implementations ("polluting" packages to make subsequent INTERNs much 
slower; see e.g. http://www.cs.rice.edu/~scrosby/hash/ ) this just 
allows one to pump chunks of rubbish into application's memory with each 
request (e.g. by sending random junk XML to a Lisp-based Web Service or 
so-called "AJAX" application). The rubbish will not be removed until 
Lisp exits.

   The problem is unlikely to be widely exploited in nearest future, but
if CL usage becomes widespread (*) we'll get some problems not unlike MS 
Windows WMF flaws which were inherited from old times when it was safe 
to trust WMFs. Fixing vulnerable applications may turn out to be rather 
hard.

   As possible solutions I see either interning such symbols in 
temporary packages which should be cleared after each request, or not 
calling INTERN for user-supplied data at all (in this case, it's 
possible to preintern all necessary symbols based e.g. on DTD / XML 
Schema / possible request parameters and then just use FIND-SYMBOL). Any 
other suggestions?

----
(*) This one-branch IF should probably be replaced by WHEN.

Re: INTERN hygiene in Web applications Wade Humeniuk
- Re: INTERN hygiene in Web applications Wade Humeniuk
Re: INTERN hygiene in Web applications Adam Warner
- Re: INTERN hygiene in Web applications Thomas F. Burdick
  - Re: INTERN hygiene in Web applications Adam Warner
    - Re: INTERN hygiene in Web applications Marcin 'Qrczak' Kowalczyk
      - Re: INTERN hygiene in Web applications Adam Warner
- Re: INTERN hygiene in Web applications Wade Humeniuk
  - Re: INTERN hygiene in Web applications Rob Warnock
    - Re: INTERN hygiene in Web applications Joerg Hoehle
      - Re: INTERN hygiene in Web applications Rob Warnock
        Re: INTERN hygiene in Web applications Thomas A. Russ
        Re: INTERN hygiene in Web applications Juho Snellman

From: Wade Humeniuk
Subject: Re: INTERN hygiene in Web applications
Date: Tue, 17 Jan 2006 17:32:51 +0000
Message-ID: <7T9zf.103754$km.56946@edtnps89>

Ivan Shvedunov wrote:

>   As possible solutions I see either interning such symbols in temporary 
> packages which should be cleared after each request, or not calling 
> INTERN for user-supplied data at all (in this case, it's possible to 
> preintern all necessary symbols based e.g. on DTD / XML Schema / 
> possible request parameters and then just use FIND-SYMBOL). Any other 
> suggestions?
> 

Yes you are not the first to notice the problem.  I have used FIND-SYMBOL
in the past and if the symbol was not found I left the attribute/symbol
as a STRING.  Then I had mix of interned symbols and strings and used
EQUAL instead of EQL to do the necessary processing.  Other alternatives
include using UNINTERNED SYMBOLS "interned" in (maybe weak) HASH_TABLES.  Then
you essentially do your own interning and EQL/et.al. will then work for you.

Wade

From: Wade Humeniuk
Subject: Re: INTERN hygiene in Web applications
Date: Wed, 18 Jan 2006 01:48:38 +0000
Message-ID: <W7hzf.96227$6K2.86588@edtnps90>

In LispWorks a possible weak-intern can be written as

(defvar *weak-web-symbols* (make-hash-table :test 'equal :weak-kind :value))

(defun weak-intern (string)
   (multiple-value-bind (value present-p)
       (gethash string *weak-web-symbols*)
     (if present-p
         value
       (setf (gethash string *weak-web-symbols*)
             (make-symbol string)))))

When the value of this weak table is not longer bound to any other
object.  See,

http://www.lispworks.com/documentation/lw445/LWRM/html/lwref-217.htm#set-hash-table-weak

CL-USER 133 > (eql (weak-intern "test") (weak-intern "test"))
T

CL-USER 134 > (weak-intern "test")
#:|test|

CL-USER 135 > (describe *)

#:|test| is a SYMBOL
NAME          "test"
VALUE         #<unbound value>
FUNCTION      #<unbound function>
PLIST         (PKG::SYMBOL-NAME-STRING "test")
PACKAGE       NIL

CL-USER 136 >

Wade

From: Adam Warner
Subject: Re: INTERN hygiene in Web applications
Date: Wed, 18 Jan 2006 01:08:58 +0000
Message-ID: <pan.2006.01.18.01.08.55.621484@consulting.net.nz>

On Tue, 17 Jan 2006 20:18:30 +0300, Ivan Shvedunov wrote:
>    Hi all!
> 
>    I've noticed an unpleasant potential security problem present in many
> CL libraries related to XML and Web applications in general. Most likely
> I'm very far from being the first to notice this, but I didn't see much
> discussion on this topic.
> 
>    It's not uncommon for aforementioned libraries to call INTERN on
> user-supplied strings, such as XML element/attribute names or GET/POST
> parameters (more sure about XML). More often than not, the symbols go to
> KEYWORD package. This is rather unfortunate situation as besides opening
> up possibilities of various DoS attacks on package system
> implementations ("polluting" packages to make subsequent INTERNs much
> slower; see e.g. http://www.cs.rice.edu/~scrosby/hash/ ) this just
> allows one to pump chunks of rubbish into application's memory with each
> request (e.g. by sending random junk XML to a Lisp-based Web Service or
> so-called "AJAX" application). The rubbish will not be removed until
> Lisp exits.
> 
>    The problem is unlikely to be widely exploited in nearest future, but
> if CL usage becomes widespread (*) we'll get some problems not unlike MS
> Windows WMF flaws which were inherited from old times when it was safe
> to trust WMFs. Fixing vulnerable applications may turn out to be rather
> hard.
> 
>    As possible solutions I see either interning such symbols in
> temporary packages which should be cleared after each request, or not
> calling INTERN for user-supplied data at all (in this case, it's
> possible to preintern all necessary symbols based e.g. on DTD / XML
> Schema / possible request parameters and then just use FIND-SYMBOL). Any
> other suggestions?

A similar issue exists when interning Unicode graphemes as intermediate
objects. Since the set of graphemes is unbounded a malicious person could
compose a long string/stream of unique graphemes that eventually consumes
all free memory as the graphemes are interned.

Language support for weak references permits a robust solution to be
developed while keeping the uniqueness (reference equality) of real
objects. Interned chunks of rubbish will eventually be garbage collected
when the only reference to the rubbish is within, say, a weak hash table
of symbols.

ANSI Common Lisp does not provide weak references nor weak hash tables. If
you intend to rely upon weak references to mitigate DoS attacks then only
those implementations that support weak references can be secure against
them.

This is a reason Java 1.2+ provides a better foundation for building
some abstractions upon than ANSI Common Lisp:
<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/ref/WeakReference.html>

Regards,
Adam

From: Thomas F. Burdick
Subject: Re: INTERN hygiene in Web applications
Date: Wed, 18 Jan 2006 09:52:56 +0000
Message-ID: <xcvy81d277r.fsf@conquest.OCF.Berkeley.EDU>

Adam Warner <······@consulting.net.nz> writes:

> A similar issue exists when interning Unicode graphemes as intermediate
> objects. Since the set of graphemes is unbounded a malicious person could
> compose a long string/stream of unique graphemes that eventually consumes
> all free memory as the graphemes are interned.

Characters are not interned objects, so I don't see what the
vulnerability is here.

-- 
           /|_     .-----------------------.                        
         ,'  .\  / | Free Mumia Abu-Jamal! |
     ,--'    _,'   | Abolish the racist    |
    /       /      | death penalty!        |
   (   -.  |       `-----------------------'
   |     ) |                               
  (`-.  '--.)                              
   `. )----'

From: Adam Warner
Subject: Re: INTERN hygiene in Web applications
Date: Wed, 18 Jan 2006 11:25:40 +0000
Message-ID: <pan.2006.01.18.11.25.38.155024@consulting.net.nz>

On Wed, 18 Jan 2006 01:52:56 -0800, Thomas F. Burdick wrote:

> Adam Warner <······@consulting.net.nz> writes:
> 
>> A similar issue exists when interning Unicode graphemes as intermediate
>> objects. Since the set of graphemes is unbounded a malicious person
>> could compose a long string/stream of unique graphemes that eventually
>> consumes all free memory as the graphemes are interned.
> 
> Characters are not interned objects, so I don't see what the
> vulnerability is here.

A language's 'characters' typically map to a small set of positive
integers that fit within a native machine integer. There is no mapping to
a bounded integer for Unicode graphemes so one has to come up with a
different way to store and compare them. ONE way is to make each grapheme
an object and define a mapping from the string representation of the
grapheme to the grapheme object. If one always consults the mapping before
creating a new grapheme object then only one object will exist per
grapheme. Graphemes can then be compared via their pointers (and pointer
comparisons are one of the fastest operations a machine can perform).

In any text there are likely to be no more than a handful of graphemes so
iterating over such text as interned graphemes is no problem. Text can
however be composed of an infinite variety of legal graphemes (Unicode
code point sequences). If such text is iterated over as interned grapheme
objects then a new mapping will be added for each unique grapheme. Even if
each grapheme that's read is immediately discarded the table of interned
graphemes will continue to grow and eventually consume all resources
unless there is some way for an implementation to discard otherwise
unreferenced graphemes.

Regards,
Adam

From: Marcin 'Qrczak' Kowalczyk
Subject: Re: INTERN hygiene in Web applications
Date: Thu, 19 Jan 2006 13:56:18 +0000
Message-ID: <87zmls1ful.fsf@qrnik.zagroda>

Adam Warner <······@consulting.net.nz> writes:

> If one always consults the mapping before creating a new grapheme
> object then only one object will exist per grapheme. Graphemes can
> then be compared via their pointers (and pointer comparisons are one
> of the fastest operations a machine can perform).

But their creation would be slower. I've once read that trying to keep
bignums unique usually doesn't pay. This fact doesn't have to carry
over to graphemes, but perhaps it does.

If a library works on graphemes, it's probably better to represent
them as strings.

AFAIK most libraries working on text don't work in terms on graphemes
but code points or even code units.

-- 
   __("<         Marcin Kowalczyk
   \__/       ······@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From: Adam Warner
Subject: Re: INTERN hygiene in Web applications
Date: Fri, 20 Jan 2006 02:05:16 +0000
Message-ID: <pan.2006.01.20.02.05.14.280089@consulting.net.nz>

On Thu, 19 Jan 2006 14:56:18 +0100, Marcin 'Qrczak' Kowalczyk wrote:
> Adam Warner <······@consulting.net.nz> writes:
> 
>> If one always consults the mapping before creating a new grapheme
>> object then only one object will exist per grapheme. Graphemes can then
>> be compared via their pointers (and pointer comparisons are one of the
>> fastest operations a machine can perform).
> 
> But their creation would be slower. I've once read that trying to keep
> bignums unique usually doesn't pay. This fact doesn't have to carry over
> to graphemes, but perhaps it does.

I concur. I mentioned one way to approach the comparison of graphemes
that's analogous to symbols.

In a multithreaded environment maintaining a globally consistent mapping
imposes additional costs.

Regards,
Adam

From: Wade Humeniuk
Subject: Re: INTERN hygiene in Web applications
Date: Wed, 18 Jan 2006 01:59:20 +0000
Message-ID: <Yhhzf.96236$6K2.60286@edtnps90>

Adam Warner wrote:

> 
> Language support for weak references permits a robust solution to be
> developed while keeping the uniqueness (reference equality) of real
> objects. Interned chunks of rubbish will eventually be garbage collected
> when the only reference to the rubbish is within, say, a weak hash table
> of symbols.
> 
> ANSI Common Lisp does not provide weak references nor weak hash tables. If
> you intend to rely upon weak references to mitigate DoS attacks then only
> those implementations that support weak references can be secure against
> them.
> 

I think weak references and hash tables should be added to the spec (or in
some other way.)  But as far as I know LispWorks, clisp, Allegro CL and (I think)
CMUCL all have weak references (hash-tables, arrays).  That these CLs have
extensions for weakness is proof enough of their usefulness.

http://clisp.cons.org/impnotes/hash-dict.html#make-hash

http://www.franz.com/support/documentation/7.0/doc/implementation.htm#cl-make-hash-table-2

Wade

> This is a reason Java 1.2+ provides a better foundation for building
> some abstractions upon than ANSI Common Lisp:
> <http://java.sun.com/j2se/1.5.0/docs/api/java/lang/ref/WeakReference.html>
>

From: Rob Warnock
Subject: Re: INTERN hygiene in Web applications
Date: Wed, 18 Jan 2006 08:17:10 +0000
Message-ID: <eeydnf9j1okbZFDeRVn-iA@speakeasy.net>

Wade Humeniuk  <··················@telus.net> wrote:
+---------------
| I think weak references and hash tables should be added to the spec
| (or in some other way.)  But as far as I know LispWorks, clisp,
| Allegro CL and (I think) CMUCL all have weak references (hash-tables,
| arrays).
+---------------

CMUCL has weak pointers and weak hashtables (and also supports
finalization), but from looking at the source it would seem
that symbol tables *don't* use normal hashtables [they use
a specialized internal "package-hashtable" type], and thus
can't use weak hashtables! (Oops.)

So CMUCL might suffer from an XML->keyword-package DoS attack.
It might be safest to maintain one's own XML symbol table,
possibly using a weak hashtable.


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607

From: Joerg Hoehle
Subject: Re: INTERN hygiene in Web applications
Date: Wed, 18 Jan 2006 14:01:59 +0000
Message-ID: <u7j8xljmw.fsf@users.sourceforge.net>

····@rpw3.org (Rob Warnock) writes:
> CMUCL has weak pointers and weak hashtables (and also supports
> finalization), but from looking at the source it would seem
> that symbol tables *don't* use normal hashtables [they use
> a specialized internal "package-hashtable" type], and thus
> can't use weak hashtables! (Oops.)

Rob,
if you manage to change the way an implementation manages symbol
tables, then you're clearly hacking a particular version of that
implementation. -- You shouldn't do that.

Your posting seems to imply that cmucl is in a worse position than the
other CL implementations because it does not use "normal" hashtables
for symbol tables. This is IMHO not justified.

Instead, you should use a portable library of weak tables[*] (IIRC I
heard someone write that) and roll your own XML element identifiers,
as others suggested.

[*] that portable library of weak tables is IIRC an API unification,
providing one exported API that internally maps to all CL
implementations that actually support weak tables.

Regards,
	Jorg Hohle
Telekom/T-Systems Technology Center

From: Rob Warnock
Subject: Re: INTERN hygiene in Web applications
Date: Thu, 19 Jan 2006 04:25:06 +0000
Message-ID: <6LOdnUy8Hqg_iVLeRVn-pw@speakeasy.net>

Joerg Hoehle  <······@users.sourceforge.net> wrote:
+---------------
| ····@rpw3.org (Rob Warnock) writes:
| > CMUCL has weak pointers and weak hashtables (and also supports
| > finalization), but from looking at the source it would seem
| > that symbol tables *don't* use normal hashtables [they use
| > a specialized internal "package-hashtable" type], and thus
| > can't use weak hashtables! (Oops.)
| 
| Rob, if you manage to change the way an implementation manages symbol
| tables, then you're clearly hacking a particular version of that
| implementation. -- You shouldn't do that.
+---------------

Uh... I think you have me confused with the OP who started this
thread. I wasn't proposing to hack *anything*, only describing
how I think current CMUCL actually behaves in this regard.
Specifically, it doesn't appear that symbols are "weak" in CMUCL,
even if they have no "contents" (e.g., a global or function or
macro binding).

+---------------
| Your posting seems to imply that cmucl is in a worse position than the
| other CL implementations because it does not use "normal" hashtables
| for symbol tables. This is IMHO not justified.
+---------------

I was not criticizing in any way. I was only trying to describe
the details of what I *saw* when I looked at the CMUCL source.
Remember, the immediately-previous poster (Wade Humeniuk) had
listed a bunch of implementations with various kinds of weak
references, saying "...and (I think) CMUCL...", and I was simply
clarifying the extent to which [as I saw it] CMUCL matched the
rest of the implementations in that list.

In the process, I noticed that while it *does* provide weak
hash-tables, it doesn't use them for its own symbol tables,
and also that its own symbol tables didn't seem to support
GC'ing symbols that were no longer referenced. Thus it --
AND PROBABLY MOST OTHER CL IMPLEMENTATIONS!! -- appears to be
vulnerable to the OP's hypothesized XML->keyword-package DoS
attack. That's all.

+---------------
| Instead, you should use a portable library of weak tables[*] (IIRC I
| heard someone write that) and roll your own XML element identifiers,
| as others suggested.
+---------------

Again, you're confusing me with the OP who was asking about
the possibility of an XML->keyword-package DoS attack. I'm not
doing anything with XML myself. [At least, not if I can help it!!]


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607

From: Thomas A. Russ
Subject: Re: INTERN hygiene in Web applications
Date: Thu, 19 Jan 2006 17:56:33 +0000
Message-ID: <ymid5iogkz2.fsf@sevak.isi.edu>

····@rpw3.org (Rob Warnock) writes:

> ... only describing
> how I think current CMUCL actually behaves in this regard.
> Specifically, it doesn't appear that symbols are "weak" in CMUCL,
> even if they have no "contents" (e.g., a global or function or
> macro binding).

My first inclination was to object that CL can't treat symbols weakly
because of the requirement that reading the same (interned) symbol name
is supposed to produce the same symbol.

But then it occurred to me that operationally it wouldn't make any
difference if unreferenced symbols with no other properties besides
their symbol-names were GC'd.  Since without any other reference, and no
other associated information such as functions, class definitions, etc.,
you wouldn't have any object to compare it to.  So, given enough
attention to making sure the other information named by symbols is
checked, the could disappear if not referenced.

Now whether this would be useful in practice is another matter.  I would
expect that in most applications, there are not all that many symbols
that appear briefly and then disappear.  Perhaps symbols naming only
lexical variables could be eliminated after compilation, but unless
space is very tight (embedded applications?) it hardly seems useful.  I
guess that would leave applications with some form of evaluation or
processing of external information, like the posited web server.

-- 
Thomas A. Russ,  USC/Information Sciences Institute

From: Juho Snellman
Subject: Re: INTERN hygiene in Web applications
Date: Fri, 20 Jan 2006 04:41:43 +0000
Message-ID: <slrndt0qg7.77b.jsnell@sbz-30.cs.Helsinki.FI>

Thomas A. Russ <···@sevak.isi.edu> wrote:
> My first inclination was to object that CL can't treat symbols weakly
> because of the requirement that reading the same (interned) symbol name
> is supposed to produce the same symbol.
> 
> But then it occurred to me that operationally it wouldn't make any
> difference if unreferenced symbols with no other properties besides
> their symbol-names were GC'd. [...]

It'd make a difference for FIND-SYMBOL, which must not return NIL
if a symbol with the given name has been interned into the package.
e.g:

  (intern "FOO")
  (gc)
  (assert (find-symbol "FOO"))   

-- 
Juho Snellman