*Using* uninterned symbols, alterative to printlevel objects and marshalling

From: Robert Maas, http://tinyurl.com/uh3t
Subject: *Using* uninterned symbols, alterative to printlevel objects and marshalling
Date: Fri, 25 Apr 2008 17:11:04 +0000
Message-ID: <rem-2008apr25-001@yahoo.com>

Like *REALLY* using uninterned/noninterned symbols to the **MAX**!

Suppose you are building a large data-structure (tree of CONS
cells) in memory, and you want to print just the topmost part of it
for debugging purposes. The standard way is to set print-level etc.
to make the prettyprinter limit printing to only so many levels
deep and only so many elements at each level. This is a very crude
way to solve the problem. Most often the result is printing several
pages where it's hard to find the info you need, or omitting the
key part you need to see because it's deeper in the structure than
a lot of worthless junk you are trying to suppress.

Many months ago I devised an alternate way to solve this problem.
Instead of building the structure entirely out of CONS cells, at
major divisions between container and contained-structure I insert
an uninterned symbol. The value of that symbol is the rest of the
structure below that level. The name of the symbol is devised to
give some idea what is in that hidden structure under it. For
example, a list of 7 items might have the print-name "L7", while a
string of 69265 characters might have the print-name "S69265", and
a structure of a particular intentional type might have a
print-name that indicated what kind of structure it is and perhaps
also some key parameter or registration number.

This past month I extended that same idea to allowing more than one
type of data underneath each uninterned symbol, by using the
property list instead of the value cell, but the basic principle is
the same. In effect this gives you something like a CLOS object in
regard to data, although without any methods or generic functions,
but with a custom print-name, and without the overhead of needing
to define a CLOS class before objects can be created in that class.
You can write constructor functions like MAKE-whatever to create
the symbol with the desired print name and then bind the
properties, and any convenience functions you may want for setting
and getting various properties, and it doesn't matter what order
you define functions or load files.

Now very shortly I'm planning to extend this one more step to allow
saving structures with embedded uninterned symbols to file in
ordinary s-expression form and then later reading them back in
correctly. This will require a hash table (or package, but I'll use
a hash table for my application) to look up the string name of each
uninterned symbol (which must be unique, of course) and replace the
read-in symbol (which will be a new symbol each time) by the entry
in the hash table (which will always be the same symbol for the
same name in any given core-image).

Java has a well-designed system for marshalling structures to be
transmitted from one location to another, but the external
representation of such data is inscrutable. It seems to me that my
use of uninterned symbols with hashtable merging of various copies
of the same symbol would be a better tradeoff between ease of
coding and ease of operation and ease of debugging etc.

See my recent thread on ProxHash for details of my current project
where uninterned symbols will be used in this way. This article
today is the first description of the uninterned-sumbol idea as an
alternative to Java's marshalling.

Note that Java doesn't have symbols per se in the runtime
environment, at least not in the default library as shipped, so it
wouldn't be easy to implment my idea in Java, because we'd first
have to implement a class for symbols that worked like in Lisp.

So is anybody curious to watch my progress on implementing my idea?
Has anybody else already done anything very similar?

Re: *Using* uninterned symbols, alterative to printlevel objects and marshalling Pascal Bourguignon
- Re: *Using* uninterned symbols, alterative to printlevel objects and marshalling Robert Maas, http://tinyurl.com/uh3t
  - Re: *Using* uninterned symbols, alterative to printlevel objects and marshalling Pascal J. Bourguignon

From: Pascal Bourguignon
Subject: Re: *Using* uninterned symbols, alterative to printlevel objects and marshalling
Date: Fri, 25 Apr 2008 18:42:43 +0000
Message-ID: <871w4tu670.fsf@hubble.informatimago.com>

·················@SpamGourmet.Com (Robert Maas, http://tinyurl.com/uh3t) writes:

> Like *REALLY* using uninterned/noninterned symbols to the **MAX**!
> [...]
> Many months ago I devised an alternate way to solve this problem.
> Instead of building the structure entirely out of CONS cells, at
> major divisions between container and contained-structure I insert
> an uninterned symbol. The value of that symbol is the rest of the
> structure below that level. The name of the symbol is devised to
> give some idea what is in that hidden structure under it. For
> example, a list of 7 items might have the print-name "L7", while a
> string of 69265 characters might have the print-name "S69265", and
> a structure of a particular intentional type might have a
> print-name that indicated what kind of structure it is and perhaps
> also some key parameter or registration number.
> [...]
> So is anybody curious to watch my progress on implementing my idea?
> Has anybody else already done anything very similar?

It's nice, but why not use just structures, defining a print-object
method on them?

If you have so deep and big lists (vectors, etc) to print, isn't it
because you've not abstracted enough your data?

(defstruct gene 
    (bases (make-array 0 :element-type '(member #\A #\C #\G #\T))
           :type (vector (member #\A #\C #\G #\T) *)))

(defmethod print-object ((self gene) stream)
   (print-unreadable-object (self stream :identity t :type t)
      (format stream "with ~A base~:*~P" (length (gene-bases self))))
   self)

(make-gene :bases (make-array (random 1000000)
                      :initial-element #\A
                      :element-type '(member #\A #\C #\G #\T)))

(make-gene :bases (make-array (random 1000000)
                      :initial-element #\A
                      :element-type '(member #\A #\C #\G #\T)))
--> #<GENE with 461891 bases #x207F45BE>

I find it better than #:GENE-WITH-461891-BASES.

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/

PLEASE NOTE: Some quantum physics theories suggest that when the
consumer is not directly observing this product, it may cease to
exist or will exist only in a vague and undetermined state.

From: Robert Maas, http://tinyurl.com/uh3t
Subject: Re: *Using* uninterned symbols, alterative to printlevel objects and marshalling
Date: Tue, 29 Apr 2008 15:32:16 +0000
Message-ID: <rem-2008apr29-001@yahoo.com>

> > Many months ago I devised an alternate way to solve this problem.
> > Instead of building the structure entirely out of CONS cells, at
> > major divisions between container and contained-structure I insert
> > an uninterned symbol. The value of that symbol is the rest of the
> > structure below that level. The name of the symbol is devised to
> > give some idea what is in that hidden structure under it. For
> > example, a list of 7 items might have the print-name "L7", while a
> > string of 69265 characters might have the print-name "S69265", ...
> From: Pascal Bourguignon <····@informatimago.com>
> It's nice, but why not use just structures, defining a
> print-object method on them?

Mostly because of the hassle of needing to define a structure class
before I can make objects of that type, which introduces a
dependency as to what sequence I set up things. I prefer defining
only functions, whereby I can define things in any order
whatsoever. Also I am not aware that I can easily set up a
different print method for each individual object, or cause the
print effect to be parameterized to yield something really compact
like #:S36482 without any alphanum characters except the ones I
want and automatically generated from the contents. Maybe someday
I'll take a look at your idea and see if it'd be workable. However
I don't in any case see any actual advantage to your way compared
to mine. My way is totally simple. For example, to coerce to
wrapped form, I have one line of code like this:
  (or (symbolp varwholemsg) (setq varwholemsg (lensym-bind varwholemsg)))

> If you have so deep and big lists (vectors, etc) to print, isn't
> it because you've not abstracted enough your data?

No, I don't think so. In the varwholemsg case, it isn't that any
structures are really deep, but that a key variable is tens of
thousands of characters (the entire contents of a spam message that
contains a huge attachment, which my software is processing to
determine the IP number so that a complaint can be sent to the
correct place, and to replace all mention of my own e-mail address
by dummy values so I don't reveal to spam-friendly ISPs that my
e-mail address "works"). I have to hide the huge string to avoid
overflowing VT100 dialup buffers whenever a breakpoint is hit.

In another application, my code explores a **huge** search space to
generate pseudo-random very-large prime numbers by a recursive
process, building an exploration-of-possibilities tree as it
recurses deeper, keeping track of the continuation of each level of
process at that point in the tree. For example, at the top level it
may be looking for a prime in the range of 12E300 to 13E300, which
means it needs a fully-factored even number in the range of 12E300
to 13E300-2, which means it needs a fully-factored number in the
range 6E300 to 5.6E300-1, which it can senthesize by various means
which it needs to explore. Typically it synthesizes several factors
then the quotient is the one remaining factor which must be prime.
So it recurses to looking for a prime in some range, several levels
down inside the tree that shows dividing out the other factors
already determined.

In the current project, I generate consecutive two-letter codes for
nodes in a tree (so that they'll print out compactly in ASCII-art
graphics), and then set up various properties on the symbols such
as the original record, the normalized&parsed record, the raw
histogram of characters, the normalized frequency-ratio histogram,
the ProxHash vector, the list of links in the graph having this
node as one endpoint, the row,col position in the current graph
being laid out, and maybe more later. It's nice with a symbol's
property list that I can tack on a new type of property any time I
want. Can new fields of a defstruct be added (to already-created
objects) any time you want just as easily?

> (defstruct gene
>     (bases (make-array 0 :element-type '(member #\A #\C #\G #\T))
>            :type (vector (member #\A #\C #\G #\T) *)))
(etc.)

In your example, the type gene is defined entirely in terms of more
primitive types. But what if you want to make two or more mutually
recursive defstructs, where one or more of them must be defined in
terms of other types which haven't yet been defined. Does this
work? For my random-prime-number generator, I can imagine:
  <find-prime> defined in terms of <small-prime> | (+ 1 (* 2 <composite>))
  <composite> defined in terms of <explore-factors-1> | ...
  <explore-factors-1> is defined in terms of a gradually resolving
    process that eventually resolves to (* <small-prime> ... <find-prime>)

> (defmethod print-object ((self gene) stream)
>    (print-unreadable-object (self stream :identity t :type t)
>       (format stream "with ~A base~:*~P" (length (gene-bases self))))
>    self)
..
> (make-gene ...)
> --> #<GENE with 461891 bases #x207F45BE>

That's awfully verbose if this is going to be embedded in some
large structure that you're trying to print out during debugging.

> I find it better than #:GENE-WITH-461891-BASES.

Actually I'd find #:G461891 much more compact for embedding in
larger printouts. Now if it's not going to be embedded, then the
longer name which gives more info may be desirable. In fact a name
that gives even more info may or may not be desireable, such as
#:CHIMP-C14-hox42-461891.

Anyway, thanks for ideas about possible alternatives which may or
may not be suitable for my purposes. In particular I wasn't aware
that a print method could be defined for DEFSTRUCTs. I looked in
CLtL1 just now to see if I could find mention of print-object, but
all I could find was print-function, which I hadn't noticed before
because I never had any application for DEFSTRUCTs in the first
place hence no need to read that chapter carefully. Is this a
change since CLOS and generic functions were introduced with
standardization? Hmm, I looked in the HyperSpec just now, and the
index shows both print-object and print-function. Looking at:
<http://www.lispworks.com/documentation/HyperSpec/Body/22_acl.htm>
          All methods for print-object must obey *print-readably*.
Hmm, I don't see how your proposed print-object method obeys that.
Am I missing something, or did you violate the rule?
Does the print-unreadable-object clause override just the default
way to print unreadably but not override what happens if
*print-readably* is true? It looks like there's a lot I'd need to
learn before I could make use of DEFSTRUCTs in this way. Uninterned
symbols are so much easier to work with because I already know what
there is to know about them. Maybe someday I'll have the time to
make a serious attempt at DEFSTRUCTs. For now, I'm having enough
trouble finding time to complete the project just using what I
already know.

(OT: The other day I got distracted by teaching myself how to set
 cookies in PHP so that they can be found later, then unsuccessfully
 tried to learn how to do the same in Perl, finally decided I was
 wasting too much time trying to find a tutorial on how to do it
 using existing packages, and I should just TELNET to my PHP script
 and observe what comes back in the header and just hand-code
 generating similar text, if I ever feel the urge to spend time on
 it again. Then just hand-code the same text from Lisp and Java and
 C and C++ and Flaming Thunder too.)

From: Pascal J. Bourguignon
Subject: Re: *Using* uninterned symbols, alterative to printlevel objects and marshalling
Date: Tue, 29 Apr 2008 17:33:10 +0000
Message-ID: <7c1w4o4ld5.fsf@pbourguignon.anevia.com>

·················@SpamGourmet.Com (Robert Maas, http://tinyurl.com/uh3t) writes:

> Can new fields of a defstruct be added (to already-created
> objects) any time you want just as easily?

To do it implementation independently, you'd have to use CLOS objects.

For example, clisp breaks like this:

C/USER[68]> (defstruct test a b)
TEST
C/USER[69]> (setf s (make-test :a 1 :b 2))
#S(TEST :A 1 :B 2)
C/USER[70]> (defstruct test a b c)
TEST
C/USER[71]> s

*** - SYSTEM::%%STRUCTURE-REF: 3 is not a valid index into
*** - SYSTEM::%%STRUCTURE-REF: 3 is not a valid index into
*** - SYSTEM::%%STRUCTURE-REF: 3 is not a valid index into
*** - Unprintable error message
The following restarts are available:
ABORT          :R1      ABORT
C/Break 1 USER[72]> :q

as it is allowed to. "The consequences of redefining a defstruct structure are undefined."

>> (defstruct gene
>>     (bases (make-array 0 :element-type '(member #\A #\C #\G #\T))
>>            :type (vector (member #\A #\C #\G #\T) *)))
> (etc.)
>
> In your example, the type gene is defined entirely in terms of more
> primitive types. But what if you want to make two or more mutually
> recursive defstructs, where one or more of them must be defined in
> terms of other types which haven't yet been defined. Does this
> work? 

Yes, specification of field types, or CLOS object slot types is
facultative, so you can define:

(defstruct country name persons)
(defstruct person name origin-country)

(make-country
 :name "USA" 
 :persons (list (make-person
                 :name "Albert"
                 :origin-country (make-country
                                  :name "Germany"
                                  :persons (list (make-person
                                                  :name "Adolf"
                                                  :origin-country '#1=#.(make-country
                                                                         :name "Austria")))))

                (make-person
                 :name "Arnold"
                 :origin-country '#1#)))

--> #S(COUNTRY :NAME "USA" :PERSONS (#S(PERSON :NAME "Albert" :ORIGIN-COUNTRY #S(COUNTRY :NAME "Germany" :PERSONS (#S(PERSON :NAME "Adolf" :ORIGIN-COUNTRY #1=#S(COUNTRY :NAME "Austria" :PERSONS NIL))))) #S(PERSON :NAME "Arnold" :ORIGIN-COUNTRY #1#)))

>> (defmethod print-object ((self gene) stream)
>>    (print-unreadable-object (self stream :identity t :type t)
>>       (format stream "with ~A base~:*~P" (length (gene-bases self))))
>>    self)
> ..
>> (make-gene ...)
>> --> #<GENE with 461891 bases #x207F45BE>
>
> That's awfully verbose if this is going to be embedded in some
> large structure that you're trying to print out during debugging.
>
>> I find it better than #:GENE-WITH-461891-BASES.
>
> Actually I'd find #:G461891 much more compact for embedding in
> larger printouts. Now if it's not going to be embedded, then the
> longer name which gives more info may be desirable. In fact a name
> that gives even more info may or may not be desireable, such as
> #:CHIMP-C14-hox42-461891.

Well, it's up to you really.

(defmethod print-object ((self gene) stream)
  (if *print-readably*
     (error "Cannot print objects of class ~S readably." (class-of self))
     (format stream "#<~A-~A>" (class-of self) (length (gene-bases self))))
  self)

If you want to print them readably, you could use normal symbols,
possibly interned in a special package.  Once upon a time, I made
packages for each class of objects I had to deal with (messages,
message queues, smtp sessions, etc), and interned there symbols named
by the identifiers (message-id, queue-id, etc) whose values were the
objects identified.

So you could have:  CHIMP:C14-HOX42-461891

If you have a list of chimps, you can print and read the symbols in
(let ((*package* (find-package "CHIMP"))) ...)  which gives only
C14-HOX42-461891, and if you must mix chimps with bananas, you stay in
CL-USER, and get CHIMP:C14-HOX42-461891 and BANANA:B12-S123.

You can keep the symbol naming the object or structure in a field, or
just recompute the symbol-name and FIND-SYMBOL it.

(defmethod print-object ((self gene) stream)
  ;; so you can write:
  (format stream "~A:B~A" (class-of self) (length (gene-bases self)))
  ;; xor:
  (format stream "~S" (identifier self))
  self)

>           All methods for print-object must obey *print-readably*.
> Hmm, I don't see how your proposed print-object method obeys that.

Yes, the use of the PRINT-UNREADABLE-OBJECT ensures that.
If you don't use this macro, you have to test it yourself.

In the last print-object above, we print more or less readably. 
Actually, here is how clisp prints symbols readably:

(write 'user1::test :readably t)
prints:
|USER1|::|TEST|
--> USER1::TEST

Does your implementation print them like this too?

Remember that printing readably must take into account the fact that
the readtable case, or the *read-base* may be different when the text
will be read. 

So, just putting (when *print-readably* (error "Sorry no can do")) at
the beginning of your print-object method would be good enough, from
the standard point of view.

-- 
__Pascal Bourguignon__