Big time difference between (unix) find and DIRECTORY : why?

From: Jeff Cunningham
Subject: Big time difference between (unix) find and DIRECTORY : why?
Date: Wed, 10 Aug 2005 02:19:36 +0000
Message-ID: <pan.2005.08.10.02.19.33.560334@cunningham.net>

;; first method
(let* ((s (with-output-to-string (str)
           (ext:run-program "find"
                            '("traplogs" "-name" "*.msg" "-print") :output
                            str)))
       (msgs (sort (split "[\\n]" s) #'string-lessp)))
  (format t "number of msg files=~s~%" (length msgs)))
                                                                                          
;; second method
(let ((files (directory "traplogs/*.msg")))
  (format t "number of msg files=~s~%" (length files)))
                                                                                          
The latter method takes a couple minutes to produce a list of 12931 files.
The former method takes less than a second.

Why such a difference?
                                                                                          
--jeff

Re: Big time difference between (unix) find and DIRECTORY : why? Kaz Kylheku
- Re: Big time difference between (unix) find and DIRECTORY : why? Ulrich Hobelmann
  - Re: Big time difference between (unix) find and DIRECTORY : why? ··········@tfeb.org
    - Re: Big time difference between (unix) find and DIRECTORY : why? Kaz Kylheku
      - Re: Big time difference between (unix) find and DIRECTORY : why? Tim Bradshaw
        Re: Big time difference between (unix) find and DIRECTORY : why? Matthias Buelow
        Re: Big time difference between (unix) find and DIRECTORY : why? Tim Bradshaw
        Re: Big time difference between (unix) find and DIRECTORY : why? Matthias Buelow
        Re: Big time difference between (unix) find and DIRECTORY : why? Kaz Kylheku
  - Re: Big time difference between (unix) find and DIRECTORY : why? Kaz Kylheku
    - Re: Big time difference between (unix) find and DIRECTORY : why? Ulrich Hobelmann
      - Re: Big time difference between (unix) find and DIRECTORY : why? Pascal Bourguignon
  - Re: Big time difference between (unix) find and DIRECTORY : why? Matthias Buelow
    - Re: Big time difference between (unix) find and DIRECTORY : why? Ulrich Hobelmann
      - Re: Big time difference between (unix) find and DIRECTORY : why? Greg Menke
      - Re: Big time difference between (unix) find and DIRECTORY : why? Matthias Buelow
        Re: Big time difference between (unix) find and DIRECTORY : why? Ulrich Hobelmann
        Re: Big time difference between (unix) find and DIRECTORY : why? Matthias Buelow
        Re: Big time difference between (unix) find and DIRECTORY : why? Marcin 'Qrczak' Kowalczyk
        Re: Big time difference between (unix) find and DIRECTORY : why? Ulrich Hobelmann
        Re: Big time difference between (unix) find and DIRECTORY : why? Hartmann Schaffer
        Re: Big time difference between (unix) find and DIRECTORY : why? Matthias Buelow
        Re: Big time difference between (unix) find and DIRECTORY : why? Larry Clapp
        Re: Big time difference between (unix) find and DIRECTORY : why? Matthias Buelow
        Re: Big time difference between (unix) find and DIRECTORY : why? Larry Clapp
        Re: Big time difference between (unix) find and DIRECTORY : why? Marcin 'Qrczak' Kowalczyk
        Re: Big time difference between (unix) find and DIRECTORY : why? Ulrich Hobelmann
        Re: Big time difference between (unix) find and DIRECTORY : why? Matthias Buelow
        Re: Big time difference between (unix) find and DIRECTORY : why? Ulrich Hobelmann
        Re: Big time difference between (unix) find and DIRECTORY : why? Matthias Buelow
        Re: Big time difference between (unix) find and DIRECTORY : why? Pascal Bourguignon
        Re: Big time difference between (unix) find and DIRECTORY : why? Tim X
        Re: Big time difference between (unix) find and DIRECTORY : why? Matthias Buelow
        Re: Big time difference between (unix) find and DIRECTORY : why? Rob Warnock
Re: Big time difference between (unix) find and DIRECTORY : why? Matthias Buelow
- Re: Big time difference between (unix) find and DIRECTORY : why? Jeff Cunningham
  - Re: Big time difference between (unix) find and DIRECTORY : why? Jeff Cunningham
Re: Big time difference between (unix) find and DIRECTORY : why? Alexander Kjeldaas
- Re: Big time difference between (unix) find and DIRECTORY : why? Juho Snellman

From: Kaz Kylheku
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Wed, 10 Aug 2005 06:52:13 +0000
Message-ID: <1123656733.541677.194160@g47g2000cwa.googlegroups.com>

Jeff Cunningham wrote:
> ;; first method
> (let* ((s (with-output-to-string (str)
>            (ext:run-program "find"
>                             '("traplogs" "-name" "*.msg" "-print") :output
>                             str)))
>        (msgs (sort (split "[\\n]" s) #'string-lessp)))
>   (format t "number of msg files=~s~%" (length msgs)))
>
> ;; second method
> (let ((files (directory "traplogs/*.msg")))
>   (format t "number of msg files=~s~%" (length files)))
>
> The latter method takes a couple minutes to produce a list of 12931 files.
> The former method takes less than a second.
>
> Why such a difference?

Which find program is that? The GNU one?

The find program tends to be quite optimized. For instance, I believe
that GNU find will know that since you are only interested in the names
(your only search criterion is -name) leaf directories can be processed
efficiently.

How do you know a directory contains only leaf nodes? Well, you look at
the link count of the directory. If traplogs/ has no subdirectories,
then it has a link count of 2. One for the traplogs reference down from
the parent, and one for the . (dot) entry from itself. Every
subdirectory contributes an additional link count via its .. (dot dot)
parent entry.

If you know that none of the entries in a directory are subdirectoris,
and you are not interested in timestamps, permissions or other inode
metadata of these entries, just their names, then you can just scan the
directory and not stat() any of the inodes.

It may also be that the directory() function does silly things, like
building up pathnames rather than descending into directories via
chdir(). That would chew up extra cycles in the VFS layer doing
redundant path component parsing.

The first investigation step would be to conduct a system call trace on
the two. That alone might explain the difference.

From: Ulrich Hobelmann
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Wed, 10 Aug 2005 07:46:25 +0000
Message-ID: <3ltpmiF147fl2U1@individual.net>

Kaz Kylheku wrote:
> The find program tends to be quite optimized. For instance, I believe
> that GNU find will know that since you are only interested in the names
> (your only search criterion is -name) leaf directories can be processed
> efficiently.
> 
> How do you know a directory contains only leaf nodes? Well, you look at
> the link count of the directory. If traplogs/ has no subdirectories,
> then it has a link count of 2. One for the traplogs reference down from
> the parent, and one for the . (dot) entry from itself. Every
> subdirectory contributes an additional link count via its .. (dot dot)
> parent entry.

Unix VFS is a graph, not a tree.  There can be multiple hard links 
pointing down the tree at certain files, or at certain directories (but 
not upward, unless "root" tells it to, AFAIK, so there are no refcount 
cycles).  Nobody nowadays uses hard links, but they're great and allow 
you to put files and directories in different locations, like what all 
those attribute-my-file-and-find-it-later do, only the tech has been 
there decades.

> If you know that none of the entries in a directory are subdirectoris,
> and you are not interested in timestamps, permissions or other inode
> metadata of these entries, just their names, then you can just scan the
> directory and not stat() any of the inodes.

Even if you don't know it, you just iterate through the entries.  If 
it's a directory you stat, otherwise you don't.  I don't see where the 
leaf-node information comes in.

-- 
I believe in Karma.  That means I can do bad things to people
all day long and I assume they deserve it.
	Dogbert

From: ··········@tfeb.org
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Wed, 10 Aug 2005 15:27:57 +0000
Message-ID: <1123687677.202570.24430@o13g2000cwo.googlegroups.com>

Ulrich Hobelmann wrote:

> Unix VFS is a graph, not a tree.  There can be multiple hard links
> pointing down the tree at certain files, or at certain directories (but
> not upward, unless "root" tells it to, AFAIK, so there are no refcount
> cycles).  Nobody nowadays uses hard links, but they're great and allow
> you to put files and directories in different locations, like what all
> those attribute-my-file-and-find-it-later do, only the tech has been
> there decades.
>

Specifically it is safe to assume that only directories have hard links
to other directories, and that the structure of those links will be a
tree.

> Even if you don't know it, you just iterate through the entries.  If
> it's a directory you stat, otherwise you don't.  I don't see where the
> leaf-node information comes in.

stat is *how you find out if it is a directory*.  The entries in a
directory do not contain this information - they just contain the name
and the inode number, the inode contains everything else (but not,
obviously, the name, as files can have many names).

--tim

From: Kaz Kylheku
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Wed, 10 Aug 2005 16:16:37 +0000
Message-ID: <1123690597.108251.162730@o13g2000cwo.googlegroups.com>

··········@tfeb.org wrote:
> Specifically it is safe to assume that only directories have hard links
> to other directories, and that the structure of those links will be a
> tree.

Only directories contain any kind of links to anything. A link is just
an association between a name and an object, contained in an
association list known as a directory. :)

> > Even if you don't know it, you just iterate through the entries.  If
> > it's a directory you stat, otherwise you don't.  I don't see where the
> > leaf-node information comes in.
>
> stat is *how you find out if it is a directory*.  The entries in a
> directory do not contain this information - they just contain the name
> and the inode number, the inode contains everything else (but not,
> obviously, the name, as files can have many names).

That's right. So if you only care whether or not an entry is a
directory (do not care about any other properties) and you can /deduce/
that it's not a directory, you can avoid the stat call.

That can be quite a performance improvement, even under caching
conditions. When you don't have the inodes cached, it can be huge,
because the inodes are quite possibly scattered on the disk surface.
They are not arranged for a nice linear scan.

From: Tim Bradshaw
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Wed, 10 Aug 2005 19:27:25 +0000
Message-ID: <1123702045.295847.323740@f14g2000cwb.googlegroups.com>

Kaz Kylheku wrote:
> ··········@tfeb.org wrote:
> > Specifically it is safe to assume that only directories have hard links
> > to other directories, and that the structure of those links will be a
> > tree.
>
> Only directories contain any kind of links to anything. A link is just
> an association between a name and an object, contained in an
> association list known as a directory. :)
>

Sorry, that's what I meant to say, I think.  What I actually meant was
that it is safe to assume that the link structure of a Unix filesystem
is a tree, not a general directed graph.

>
> That's right. So if you only care whether or not an entry is a
> directory (do not care about any other properties) and you can /deduce/
> that it's not a directory, you can avoid the stat call.

Yes, exactly so.  The person to whom I was replying seemed to think
that there was some magic way of knowing that a file was a directory
without doing stat() on it, but just by looking at the directory entry
for it, which there isn't, except in the special case where you've
looked at the link count of the containing directory and it is 2.

And I agree about the performance stuff (though actually decent
filesystems make efforts to arrange that the inodes refereenced by a
directory are physically near it).

--tim

From: Matthias Buelow
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Thu, 11 Aug 2005 23:44:35 +0000
Message-ID: <86wtmsujh8.fsf@drjekyll.mkbuelow.net>

"Tim Bradshaw" <··········@tfeb.org> writes:

>Yes, exactly so.  The person to whom I was replying seemed to think
>that there was some magic way of knowing that a file was a directory
>without doing stat() on it, but just by looking at the directory entry

Generally, yes. However, BSD FFS (and I think Linux EXT2 aswell) store
the file type in the directory entry aswell (a performance hack), and
export it to the user via the d_type field in the dirent structure, so
you needn't stat here. Using the d_type field however doesn't work on
all filesystems, so one would need to check first if the filesystem
being searched supports it, and it's not portable but I would think all
systems that use FFS (UFS) support it.

mkb.

From: Tim Bradshaw
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Fri, 12 Aug 2005 16:22:31 +0000
Message-ID: <1123863750.975013.109000@f14g2000cwb.googlegroups.com>

Matthias Buelow wrote:

>
> Generally, yes. However, BSD FFS (and I think Linux EXT2 aswell) store
> the file type in the directory entry aswell (a performance hack), and
> export it to the user via the d_type field in the dirent structure, so
> you needn't stat here. Using the d_type field however doesn't work on
> all filesystems, so one would need to check first if the filesystem
> being searched supports it, and it's not portable but I would think all
> systems that use FFS (UFS) support it.

I'm fairly sure this was not in the original FFS (4.2BSD? or was it
4.1?) but is some more recent addition.

--tim

From: Matthias Buelow
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Fri, 12 Aug 2005 19:45:58 +0000
Message-ID: <86hddvym4p.fsf@drjekyll.mkbuelow.net>

"Tim Bradshaw" <··········@tfeb.org> writes:

>I'm fairly sure this was not in the original FFS (4.2BSD? or was it
>4.1?) but is some more recent addition.

Indeed.. apparently it has been introduced with 4.4BSD:

     The filesystem format used in 4.4BSD has several  addi-
tions.   Directory entries have an additional field, d_type,
that identifies the type of the entry (normally found in the
st_mode field of the stat structure).  This field is partic-
ularly useful for identifying directories without  the  need
to use stat(2).

(from: McKusick, Bostic, Karels, Leffler,
Installing and Operating 4.4BSD UNIX, July 27, 1993.)

mkb.

From: Kaz Kylheku
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Fri, 12 Aug 2005 18:50:11 +0000
Message-ID: <1123872610.983099.59960@g47g2000cwa.googlegroups.com>

Matthias Buelow wrote:
> "Tim Bradshaw" <··········@tfeb.org> writes:
>
> >Yes, exactly so.  The person to whom I was replying seemed to think
> >that there was some magic way of knowing that a file was a directory
> >without doing stat() on it, but just by looking at the directory entry
>
> Generally, yes. However, BSD FFS (and I think Linux EXT2 aswell) store
> the file type in the directory entry aswell (a performance hack), and
> export it to the user via the d_type field in the dirent structure, so
> you needn't stat here.

I didn't know that, good to know! According to the manual page for
mke2fs, this is a few feature which is turned on by default if you are
running over at least a 2.2 kernel. But the filesystem won't be
mountable on some old kernels.

From: Kaz Kylheku
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Wed, 10 Aug 2005 16:39:25 +0000
Message-ID: <1123691965.019454.92500@g43g2000cwa.googlegroups.com>

Ulrich Hobelmann wrote:
> Kaz Kylheku wrote:
> > The find program tends to be quite optimized. For instance, I believe
> > that GNU find will know that since you are only interested in the names
> > (your only search criterion is -name) leaf directories can be processed
> > efficiently.
> >
> > How do you know a directory contains only leaf nodes? Well, you look at
> > the link count of the directory. If traplogs/ has no subdirectories,
> > then it has a link count of 2. One for the traplogs reference down from
> > the parent, and one for the . (dot) entry from itself. Every
> > subdirectory contributes an additional link count via its .. (dot dot)
> > parent entry.
>
> Unix VFS is a graph, not a tree.

The VFS is a layer that indirects upon different filesystem
implementations, some of which may allow the link() operation to work
on directories.

Anyway, that is irrelevant. Whether or not it's allowed to be an
arbitrary DAG, or is restricted to the special case of a tree, the fact
holds that a link count of 2 means that the directory has no further
links to any other directories -- unless it's some weird filesystem in
which things like .. are faked out and there aren't any real links.

But you see the find program can do a statfs() to figure that out. It
can also detect when it's crossing a mount point, etc.

Also, I never said that the find program's optimizations are bullet
proof. :)

> There can be multiple hard links
> pointing down the tree at certain files, or at certain directories (but
> not upward, unless "root" tells it to, AFAIK, so there are no refcount
> cycles).

Actually, it's not just the refcount cycles that are the problem (a
full fsck could in principle hunt down the garbage and release it), but
deadlocks. In the BSD VFS, the upper layers above the switch assume
that there are no backlinks. If you create backlinks, deadlocks can
happen. Some operations that chase links obtain a lock on the source
and target. If you have links going in opposite direction, the locks
can cause a deadly embrace. What's worse, the wait operations to
acquire these locks are uninterruptible sleeps. Nothing will wake the
sleeping processes; the machine has to be rebooted to clear them away.

I've seen this happen because on one job I wrote a filesystem interface
under BSD to a distributed system which allowed arbitrary cycles in its
structure (it used a distributed garbage collection algorithm to deal
with deleted objects).

I think I may have ended up representing the graph as an virtual tree,
faking everything so it looks like there are no backward references,
and all the vnodes look unique, even if sometimes two or more active
vnode objects in the kernel map to the same object in the distributed
system.

My test case was a bunch of shell scripts running in parallel: creating
and destroying files, doing a find, traversing up and down the tree,
etc. That ferreted out all kinds of deadlocks and crashes.

At one point I put in a cycle detection algorithm in the locking
function, so I could actually detect cases of deadlock and turn them
into error returns. But guess what, the kernel code above in most
places assumed that the lock would always succeed. Ooops! That was a
coding dead-end. (And, to make this topical, I should mention that I
prototyped the algorithm in Common Lisp first).

If the VFS upper layers were coded defensively around the calls to the
vnode locking operation, then the filesystem implementation could
report deadlocks in the vnode locking virtual function. Strategies
could be put in whereby if a deadlock is reported, the system call
would bail all the way to the top, releasing all the vnode locks that
it already holds, and then maybe re-try so the application doesn't see
the EDEADLCK error at all.

> Nobody nowadays uses hard links, but they're great and allow

That's because they learned that when they try to hard link
directories, the link() operation usually returns -1 and sets errno to
EPERM. :)

> you to put files and directories in different locations, like what all
> those attribute-my-file-and-find-it-later do, only the tech has been
> there decades.
>
> > If you know that none of the entries in a directory are subdirectoris,
> > and you are not interested in timestamps, permissions or other inode
> > metadata of these entries, just their names, then you can just scan the
> > directory and not stat() any of the inodes.
>
> Even if you don't know it, you just iterate through the entries.  If
> it's a directory you stat, otherwise you don't.

You don't /know/ whether it's a directory until you stat. A directory
entry in a traditional UNIX-like filesystem is only a name (character
string) and an inode number.

That's the point of the optimization. To logically deduce that an entry
cannot be a directory without doing a stat.

From: Ulrich Hobelmann
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Wed, 10 Aug 2005 16:50:22 +0000
Message-ID: <3lupigF14f985U1@individual.net>

Kaz Kylheku wrote:

>> Nobody nowadays uses hard links, but they're great and allow
> 
> That's because they learned that when they try to hard link
> directories, the link() operation usually returns -1 and sets errno to
> EPERM. :)

Well, even for files nobody uses them.  I'd like to, once in a while, 
but then how do you for instance tar/untar such a directory graph?  I 
guess you'd end up with the files twice after untarring, and a cp -R 
wouldn't be much different I suppose.  That's where indexable file 
attributes (one of which is the filename, like on BeOS) are nicer.

>> Even if you don't know it, you just iterate through the entries.  If
>> it's a directory you stat, otherwise you don't.
> 
> You don't /know/ whether it's a directory until you stat. A directory
> entry in a traditional UNIX-like filesystem is only a name (character
> string) and an inode number.

Right.  The stat() is in my "iterate...".  I was assuming that for a 
couple 100 files you usually have the number of syscalls is efficient 
enough...

> That's the point of the optimization. To logically deduce that an entry
> cannot be a directory without doing a stat.

This might be nice for huge directories, yes.  I guess it'd be better 
though, to design a un-unixy FS with better semantics.  The way it is 
now, everybody (Linux, Mac OS, Windows, BSD) designs their own extension 
of UFS (or NTFS) with incompatible semantics.  A standard could help.

To get on-topic, maybe we need a filesystem written in portable (kind 
of) Lisp...  It has garbage collection built-in, and could use the 
pathname stuff to provide one standard interface, even with versioning 
and all. ;)

-- 
I believe in Karma.  That means I can do bad things to people
all day long and I assume they deserve it.
	Dogbert

From: Pascal Bourguignon
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Wed, 10 Aug 2005 20:24:36 +0000
Message-ID: <87wtmtleuz.fsf@thalassa.informatimago.com>

Ulrich Hobelmann <···········@web.de> writes:
> Well, even for files nobody uses them.  I'd like to, once in a while,
> but then how do you for instance tar/untar such a directory graph?  I
> guess you'd end up with the files twice after untarring, and a cp -R 
> wouldn't be much different I suppose.  That's where indexable file
> attributes (one of which is the filename, like on BeOS) are nicer.

Not with GNU tar or GNU cp.  But they unify only the paths they work on.
Also cpio can handle them.

Anyways, hard links are useful.  For example, I've got a script to
clean-up path names (removing spaces and accents and other special
characters) that has an option of doing the renaming in place, or
copying or creating symbolic links or creating hard links.  The
latest is the more practical: you keep the original paths if you need
to reverse the process,  you don't need as much space or time as when
copying, and you can more easily tar the hard linked files or some
other manipulations than with symbolic links.

They are very useful to manage access rights too.  But few people bother...

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
I need a new toy.
Tail of black dog keeps good time.
Pounce! Good dog! Good dog!

From: Matthias Buelow
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Wed, 10 Aug 2005 16:45:45 +0000
Message-ID: <3lup9pF14k179U2@news.dfncis.de>

Ulrich Hobelmann <···········@web.de> wrote:

>(but 
>not upward, unless "root" tells it to, AFAIK, so there are no refcount 
>cycles).

That's not true. The only thing you cannot do with hardlinks is
link across devices and link to directories.

Linking to directories was originally allowed for root in early
versions but was later disabled due to the too easy creation
of cycles in the filesystem.

>Nobody nowadays uses hard links

Funny that you a) claim to know everybody on this planet, or the
universe, who is using Unix, and b) proclaim dogmatically something
which can instantly proven to be untrue:

$ ls -i /usr/bin|sort|awk '{print $1}'|uniq -d|wc -l
      37

(FreeBSD 5.4-stable)

mkb.

From: Ulrich Hobelmann
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Wed, 10 Aug 2005 16:57:14 +0000
Message-ID: <3lupvbF14p9j1U1@individual.net>

Matthias Buelow wrote:
> Linking to directories was originally allowed for root in early
> versions but was later disabled due to the too easy creation
> of cycles in the filesystem.

Looks like you're right.  ln just complains that foo "is a directory". 
It's too bad Unix can't just use some simple GC in the kernel.  Nicer 
implementations like LFS (on NetBSD) could profit from that as well, 
though one for links, and the other for disk blocks.

>> Nobody nowadays uses hard links
> 
> Funny that you a) claim to know everybody on this planet, or the
> universe, who is using Unix, and b) proclaim dogmatically something
> which can instantly proven to be untrue:
> 
> $ ls -i /usr/bin|sort|awk '{print $1}'|uniq -d|wc -l
>       37
> 
> (FreeBSD 5.4-stable)

Sure, but those are micro-uses.  What I meant was having one file (or 
directory) in multiple places (directories).  Few if any people do that. 
  Instead now we have WinFS, Spotlight, and whatever the upcoming Gnome 
equivalent is called.

Maybe one could simulate them by having one directory /Files (with 
numbers as the filename) and use symlinks to that everywhere :D

-- 
I believe in Karma.  That means I can do bad things to people
all day long and I assume they deserve it.
	Dogbert

From: Greg Menke
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Wed, 10 Aug 2005 17:11:44 +0000
Message-ID: <m3d5ol3een.fsf@athena.pienet>

Ulrich Hobelmann <···········@web.de> writes:

> Matthias Buelow wrote:
> 
> >> Nobody nowadays uses hard links
> > Funny that you a) claim to know everybody on this planet, or the
> > universe, who is using Unix, and b) proclaim dogmatically something
> > which can instantly proven to be untrue:
> > $ ls -i /usr/bin|sort|awk '{print $1}'|uniq -d|wc -l
> >       37
> > (FreeBSD 5.4-stable)
> 
> Sure, but those are micro-uses.  What I meant was having one file (or
> directory) in multiple places (directories).  Few if any people do
> that. Instead now we have WinFS, Spotlight, and whatever the upcoming
> Gnome equivalent is called.
> 
> Maybe one could simulate them by having one directory /Files (with
> numbers as the filename) and use symlinks to that everywhere :D


Solaris init.d scripts are often hardlinked from the rc#.d directories.
Hardly a micro-use...

Gregm

From: Matthias Buelow
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Wed, 10 Aug 2005 18:59:05 +0000
Message-ID: <86u0hxliti.fsf@drjekyll.mkbuelow.net>

Ulrich Hobelmann <···········@web.de> writes:

>Sure, but those are micro-uses.  What I meant was having one file (or

Why are those "micro-uses" (what's a "micro-use" anyways)? They
constitute a substantial portion of the binaries in /usr/bin here,
maybe as much as 1/3 (I won't count them now but 37 unique hardlinks
out of a set of 417 files probably will add to a substantial
percentage, even if each of those files has only two entries).

>directory) in multiple places (directories).  Few if any people do
>that.

Again, I'm amazed how you extrapolate from your own understanding to
the entire Unix+Linux community. I at least see (and have seen, and
have used myself) hardlinks being used where they make sense. There
aren't that many places where you want files doubly and often devices
are crossed, which mandates symbolic links.

>Instead now we have WinFS, Spotlight, and whatever the upcoming
>Gnome equivalent is called.

I don't know what you're talking about.

>Maybe one could simulate them by having one directory /Files (with
>numbers as the filename) and use symlinks to that everywhere :D

See above.

mkb.

From: Ulrich Hobelmann
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Thu, 11 Aug 2005 00:38:10 +0000
Message-ID: <3lvkvkF143ns0U1@individual.net>

Matthias Buelow wrote:
> Ulrich Hobelmann <···········@web.de> writes:
> 
>> Sure, but those are micro-uses.  What I meant was having one file (or
> 
> Why are those "micro-uses" (what's a "micro-use" anyways)? They
> constitute a substantial portion of the binaries in /usr/bin here,
> maybe as much as 1/3 (I won't count them now but 37 unique hardlinks
> out of a set of 417 files probably will add to a substantial
> percentage, even if each of those files has only two entries).

I called them micro-uses because it's basically just aliasing, and not 
putting files under different categories / dir hierarchies.

>> directory) in multiple places (directories).  Few if any people do
>> that.
> 
> Again, I'm amazed how you extrapolate from your own understanding to
> the entire Unix+Linux community. I at least see (and have seen, and
> have used myself) hardlinks being used where they make sense. There
> aren't that many places where you want files doubly and often devices
> are crossed, which mandates symbolic links.

Certainly hard links are used by Unix people.  But I don't even know of 
casual users who even know what hard links are and how they differ from 
symlinks.  The way the OS organizes data for the user just doesn't 
really employ symlinks, while OSes like BeOS or Mac OS 10.4 explicitly 
publish their notion of how to organize files to the casual user.

>> Instead now we have WinFS, Spotlight, and whatever the upcoming
>> Gnome equivalent is called.
> 
> I don't know what you're talking about.

Finding files not by just one (possibly hierarchic) attribute (user 
local emacs lisp bla), but by different attributes (which could 
correspond to different dir hierarchies (Unix), or different attributes 
(in the BeOS model), or different keywords (Mac Spotlight & clones).

>> Maybe one could simulate them by having one directory /Files (with
>> numbers as the filename) and use symlinks to that everywhere :D
> 
> See above.

-- 
I believe in Karma.  That means I can do bad things to people
all day long and I assume they deserve it.
	Dogbert

From: Matthias Buelow
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Thu, 11 Aug 2005 23:45:42 +0000
Message-ID: <86slxgujfd.fsf@drjekyll.mkbuelow.net>

Ulrich Hobelmann <···········@web.de> writes:

>Finding files not by just one (possibly hierarchic) attribute (user
>local emacs lisp bla), but by different attributes (which could
>correspond to different dir hierarchies (Unix), or different
>attributes (in the BeOS model), or different keywords (Mac Spotlight &
>clones).

Aha, I know nothing about that. I hope it doesn't end up in a symlink
zoo.

mkb.

From: Marcin 'Qrczak' Kowalczyk
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Sun, 21 Aug 2005 13:12:40 +0000
Message-ID: <87br3rzb5j.fsf@qrnik.zagroda>

Ulrich Hobelmann <···········@web.de> writes:

> I called them micro-uses because it's basically just aliasing, and
> not putting files under different categories / dir hierarchies.

Gnus uses hardlinks to store crossposted articles.

-- 
   __("<         Marcin Kowalczyk
   \__/       ······@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From: Ulrich Hobelmann
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Sun, 21 Aug 2005 14:22:20 +0000
Message-ID: <3mrh0tF188aeaU1@individual.net>

Marcin 'Qrczak' Kowalczyk wrote:
> Ulrich Hobelmann <···········@web.de> writes:
> 
>> I called them micro-uses because it's basically just aliasing, and
>> not putting files under different categories / dir hierarchies.
> 
> Gnus uses hardlinks to store crossposted articles.

Ah, that's good to hear. :)

I just wonder how one would archive such a tree.  I can only think of 
getting a kind of disk image (directory table, and inode to data 
mapping), but not something like tar.

-- 
I believe in Karma.  That means I can do bad things to people
all day long and I assume they deserve it.
	Dogbert

From: Hartmann Schaffer
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Sun, 21 Aug 2005 20:16:50 +0000
Message-ID: <Gi5Oe.2149$Dd.9815@newscontent-01.sprint.ca>

Ulrich Hobelmann wrote:
> Marcin 'Qrczak' Kowalczyk wrote:
> 
>> Ulrich Hobelmann <···········@web.de> writes:
>>
>>> I called them micro-uses because it's basically just aliasing, and
>>> not putting files under different categories / dir hierarchies.
>>
>>
>> Gnus uses hardlinks to store crossposted articles.
> 
> 
> Ah, that's good to hear. :)
> 
> I just wonder how one would archive such a tree.  I can only think of 
> getting a kind of disk image (directory table, and inode to data 
> mapping), but not something like tar.


gnu tar handles hard links

hs

From: Matthias Buelow
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Sun, 21 Aug 2005 16:38:34 +0000
Message-ID: <3mrp0aF18ejemU1@news.dfncis.de>

Ulrich Hobelmann <···········@web.de> wrote:

>I just wonder how one would archive such a tree.  I can only think of 
>getting a kind of disk image (directory table, and inode to data 
>mapping), but not something like tar.

Cpio should work, since it has fields for storing the inode and
device number in its format specification, aswell as a field for
the number of links for the file.

mkb.

From: Larry Clapp
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Mon, 22 Aug 2005 02:23:56 +0000
Message-ID: <slrndgido8.isn.larry@theclapp.ddts.net>

On 2005-08-21, Ulrich Hobelmann <···········@web.de> wrote:
> Marcin 'Qrczak' Kowalczyk wrote:
>> Ulrich Hobelmann <···········@web.de> writes:
>>> I called them micro-uses because it's basically just aliasing, and
>>> not putting files under different categories / dir hierarchies.
>> 
>> Gnus uses hardlinks to store crossposted articles.
>
> Ah, that's good to hear. :)
>
> I just wonder how one would archive such a tree.  I can only think
> of getting a kind of disk image (directory table, and inode to data
> mapping), but not something like tar.

Why would you expect such a widely used tool to be so fundamentally
broken?  Why would you expect such difficulties in tracking inode
numbers and recreating links as appropriate?

From: Matthias Buelow
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Mon, 22 Aug 2005 06:26:37 +0000
Message-ID: <3mt9gtF18d1kqU1@news.dfncis.de>

Larry Clapp <·····@theclapp.org> wrote:

>Why would you expect such a widely used tool to be so fundamentally
>broken?  Why would you expect such difficulties in tracking inode
>numbers and recreating links as appropriate?

The standard ustar format (aswell as the old tar format) doesn't
have fields for inode/device numbers.
cpio does, as mentioned previously.

mkb.

From: Larry Clapp
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Mon, 22 Aug 2005 21:53:58 +0000
Message-ID: <slrndgki9v.isn.larry@theclapp.ddts.net>

On 2005-08-22, Matthias Buelow <···@incubus.de> wrote:
> Larry Clapp <·····@theclapp.org> wrote:
>>Why would you expect such a widely used tool to be so fundamentally
>>broken?  Why would you expect such difficulties in tracking inode
>>numbers and recreating links as appropriate?
>
> The standard ustar format (aswell as the old tar format) doesn't
> have fields for inode/device numbers.  cpio does, as mentioned
> previously.

Well, /usr/bin/tar (not a GNU tar) under AIX handles hard and
softlinks correctly.  Loading a tar of

  65551 -rw-r--r--    1 user  group            0 Aug 22 17:29 a
  65552 -rw-r--r--    3 user  group            0 Aug 22 17:29 b
  65552 -rw-r--r--    3 user  group            0 Aug 22 17:29 c
  65553 lrwxrwxrwx    1 user  group            1 Aug 22 17:30 d -> a
  65552 -rw-r--r--    3 user  group            0 Aug 22 17:29 e

into an editor shows that the entry for b is "primary" and the entries
for c and e are links to b; d is a link to a, with a different flag
in front of it.  Each entry also contains the string "ustar", so it
seems to think it's in ustar format.

I've used, at various times since 1986, HP-UX, AIX, SunOS, Solaris,
and Minix; it would've surprised me a lot if tar didn't handle links
correctly on any of those systems (except maybe Minix).

On the other hand, I came to this thread late, so maybe I'm missing
the entire point of the discussion.  :)

-- Larry

From: Marcin 'Qrczak' Kowalczyk
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Tue, 23 Aug 2005 17:42:27 +0000
Message-ID: <871x4ksg70.fsf@qrnik.zagroda>

Matthias Buelow <···@incubus.de> writes:

>>Why would you expect such a widely used tool to be so fundamentally
>>broken?  Why would you expect such difficulties in tracking inode
>>numbers and recreating links as appropriate?
>
> The standard ustar format (aswell as the old tar format) doesn't
> have fields for inode/device numbers.

There is no need to store them inside the archive. The archive must
contain something analogous to a symlink, i.e. a directive to create
a hardlink to a previously unarchived file. Tar only needs to build a
(device, inode) -> filename mapping in memory, for files it encounters
which have the link count greater than 1, so it can discover which of
them are shared which which others.

It works only for hardlinks within the archive. The user shouldn't
expect tar to restore hardlink relationships with files not being
archived; it's unimplementable without scanning the whole volume.

It's like Lisp #n= / #n# notation. It works only within one form, and
it would be unimplementable if it was expected to preserve sharing
with objects not being written together. The notation is asymmetric,
but the underlying sharing semantics doesn't distinguish any path.

-- 
   __("<         Marcin Kowalczyk
   \__/       ······@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

From: Ulrich Hobelmann
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Mon, 22 Aug 2005 10:11:54 +0000
Message-ID: <3mtmnaF17n00mU1@individual.net>

Larry Clapp wrote:
> On 2005-08-21, Ulrich Hobelmann <···········@web.de> wrote:
>> Marcin 'Qrczak' Kowalczyk wrote:
>>> Ulrich Hobelmann <···········@web.de> writes:
>>>> I called them micro-uses because it's basically just aliasing, and
>>>> not putting files under different categories / dir hierarchies.
>>> Gnus uses hardlinks to store crossposted articles.
>> Ah, that's good to hear. :)
>>
>> I just wonder how one would archive such a tree.  I can only think
>> of getting a kind of disk image (directory table, and inode to data
>> mapping), but not something like tar.
> 
> Why would you expect such a widely used tool to be so fundamentally
> broken?  Why would you expect such difficulties in tracking inode
> numbers and recreating links as appropriate?

Because tar simply reads a stream and creates files from it (or reads 
files and pipes them to a tape device/file).

-- 
I believe in Karma.  That means I can do bad things to people
all day long and I assume they deserve it.
	Dogbert

From: Matthias Buelow
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Mon, 22 Aug 2005 12:09:01 +0000
Message-ID: <3mttitF180sq7U1@news.dfncis.de>

Ulrich Hobelmann <···········@web.de> wrote:

>> Why would you expect such a widely used tool to be so fundamentally
>> broken?  Why would you expect such difficulties in tracking inode
>> numbers and recreating links as appropriate?
>
>Because tar simply reads a stream and creates files from it (or reads 
>files and pipes them to a tape device/file).

This is not the reason. All programs create files sequentially, how
should they otherwise?

mkb.

From: Ulrich Hobelmann
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Mon, 22 Aug 2005 14:21:17 +0000
Message-ID: <3mu5atF187gk0U1@individual.net>

Matthias Buelow wrote:
> Ulrich Hobelmann <···········@web.de> wrote:
> 
>>> Why would you expect such a widely used tool to be so fundamentally
>>> broken?  Why would you expect such difficulties in tracking inode
>>> numbers and recreating links as appropriate?
>> Because tar simply reads a stream and creates files from it (or reads 
>> files and pipes them to a tape device/file).
> 
> This is not the reason. All programs create files sequentially, how
> should they otherwise?

Sure :)  But since tar is just a stream-like format, without a header 
(like .zip) there's no way to tell how it should handle a hardlink.  If 
tar archives a hardlink, it would have to find out if the file is 
already in the archive (ok, maybe it could memorize that) and then do 
what?  If tar had a header, it could make a note, "this is file XY, but 
it's a link to file Z", but instead tar can only tar it up again.

When unpacking, tar reads the filename and data and creates the file. 
No way to know if it should instead hardlink to another file, since 
there's no file table, just more files to come...

I'd be interested to know how GNU tar handles this.

-- 
I believe in Karma.  That means I can do bad things to people
all day long and I assume they deserve it.
	Dogbert

From: Matthias Buelow
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Mon, 22 Aug 2005 18:12:19 +0000
Message-ID: <3muis3F18tesaU1@news.dfncis.de>

Ulrich Hobelmann <···········@web.de> wrote:

>Sure :)  But since tar is just a stream-like format, without a header 
>(like .zip) there's no way to tell how it should handle a hardlink.  If 
>tar archives a hardlink, it would have to find out if the file is 
>already in the archive (ok, maybe it could memorize that) and then do 
>what?  If tar had a header, it could make a note, "this is file XY, but 
>it's a link to file Z", but instead tar can only tar it up again.
>When unpacking, tar reads the filename and data and creates the file. 
>No way to know if it should instead hardlink to another file, since 
>there's no file table, just more files to come...

The name is stored in the "link name" field.  Actually, it works
just as expected with tar (I was a bit confused about the length
of link name, which is just 100 bytes, whereas the actual pathname
is 100 + 155 ("prefix") but apparently the prefix is assumed to be
common..) Since the name is stored, there's no need for storing
dev/inode numbers in the file.
With cpio, the inode/device numbers are stored in the file. With
them, cpio can also reproduce hardlinks upon extraction.

>I'd be interested to know how GNU tar handles this.

Unfortunately, the Gnutar code is pretty much of a mess (as is so
often with Gnu tools) but I don't know where the problem should be.
The easiest to read (imho) is the source for the open source version
of pax.

mkb.

From: Pascal Bourguignon
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Mon, 22 Aug 2005 16:46:03 +0000
Message-ID: <87ll2t7wdw.fsf@thalassa.informatimago.com>

Matthias Buelow <···@incubus.de> writes:

> Ulrich Hobelmann <···········@web.de> wrote:
>
>>> Why would you expect such a widely used tool to be so fundamentally
>>> broken?  Why would you expect such difficulties in tracking inode
>>> numbers and recreating links as appropriate?
>>
>>Because tar simply reads a stream and creates files from it (or reads 
>>files and pipes them to a tape device/file).
>
> This is not the reason. All programs create files sequentially, how
> should they otherwise?

2 threads on a bi-processors?

In unix, file creation is serialized, but we can easily imagine a
filesystem where file creation is not serialized.


-- 
"You cannot really appreciate Dilbert unless you read it in the
original Klingon"

From: Tim X
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Tue, 23 Aug 2005 12:00:51 +0000
Message-ID: <87y86s3lsc.fsf@tiger.rapttech.com.au>

Matthias Buelow <···@incubus.de> writes:

> Ulrich Hobelmann <···········@web.de> wrote:
> 
> >> Why would you expect such a widely used tool to be so fundamentally
> >> broken?  Why would you expect such difficulties in tracking inode
> >> numbers and recreating links as appropriate?
> >
> >Because tar simply reads a stream and creates files from it (or reads 
> >files and pipes them to a tape device/file).
> 
> This is not the reason. All programs create files sequentially, how
> should they otherwise?
> 

What about Unix "sparse' files which have big holes in them (like the
wtmp/utmp? I don't think these are necessarily created sequencially
(however, this doesn't negate your point re: tars limitation not being
related to sequencial writing of data). 

Note also that tar doesn't have a problem with symbolic
links. Generally you can cntrol wether it retains the sym link info or
dumps what the symlink points to. 

Hard links I'm not so sure about. Once upon a time you couldn't hard
link from one filesystem to another (or across partitions). However,
I'm not sure what the current state of things is as filesystems have
moved on considerably since the days I had to worry about things at
that level. 

Tim

-- 
Tim Cross
The e-mail address on this message is FALSE (obviously!). My real e-mail is
to a company in Australia called rapttech and my login is tcross - if you 
really need to send mail, you should be able to work it out!

From: Matthias Buelow
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Tue, 23 Aug 2005 12:08:16 +0000
Message-ID: <3n0htfF18u0tnU1@news.dfncis.de>

Tim X <····@spamto.devnul.com> wrote:

>Hard links I'm not so sure about. Once upon a time you couldn't hard
>link from one filesystem to another (or across partitions). However,
>I'm not sure what the current state of things is as filesystems have
>moved on considerably since the days I had to worry about things at
>that level. 

Hardlinks work only on the same filesystem.

mkb.

From: Rob Warnock
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Thu, 25 Aug 2005 09:38:27 +0000
Message-ID: <Rd6dnftsbbAODJDeRVn-ug@speakeasy.net>

Marcin 'Qrczak' Kowalczyk  <······@knm.org.pl> wrote:
+---------------
| Ulrich Hobelmann <···········@web.de> writes:
| > I called them micro-uses because it's basically just aliasing, and
| > not putting files under different categories / dir hierarchies.
| 
| Gnus uses hardlinks to store crossposted articles.
+---------------

A small company I worked for long, long ago -- so long ago there were
no symlinks! [yes, we were using Unix v.7] -- used massive numbers of
hardlinks to implement a kind of version control and recompilation
optimizer for our product builds [which included the Unix kernel].
We modified *all* the editors people used ["ed", "vi", Rand "e"]
to move files out of the way before writing to them, and we also
modified *all* the Unix Makefiles to remove targets before writing
to them [but only if it really was going to write to one]. Then we
modified "cp" [later renamed "cpt"] to add a "-l" option that said
"make links if possible" [hardlinks, as noted above]. This meant that
all it took to clone a new version of the kernel source tree, make
some tweaks to a few source files, then build a new kernel was this:

	$ mkdir -p my_workspace/src
	$ cpt -rl /kernel/src/{version} . my_workspace/src
	$ cd my_workspace/src
	$ ...edit stuff..
	$ make
	...
	$

Since the "cpt" did hardlinks, all of the object files would be
preserved for source files that hadn't changed, and *huge* amounts
of compile time and disk space would be saved. And because of the
Makefile changes to pre-remove targets, your "make" wouldn't change
anyone else's already-compiled object files.

Integration of multiple peoples' changes into the {version+1} snapshot
wasn't too bad, except where multiple people had changed the same files.
But N-way merges are a pain no matter how you do them...


-Rob

-----
Rob Warnock			<····@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607

From: Matthias Buelow
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Wed, 10 Aug 2005 02:33:28 +0000
Message-ID: <3lt7bnF122463U1@news.dfncis.de>

Jeff Cunningham <·······@cunningham.net> wrote:
>;; first method
>(let* ((s (with-output-to-string (str)
>           (ext:run-program "find"
>                            '("traplogs" "-name" "*.msg" "-print") :output
>                            str)))
>       (msgs (sort (split "[\\n]" s) #'string-lessp)))
>  (format t "number of msg files=~s~%" (length msgs)))
>                                                                                          
>;; second method
>(let ((files (directory "traplogs/*.msg")))
>  (format t "number of msg files=~s~%" (length files)))
>                                                                                          
>The latter method takes a couple minutes to produce a list of 12931 files.
>The former method takes less than a second.
>
>Why such a difference?

Probably because you ran the "second method" before the first?

mkb.

From: Jeff Cunningham
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Wed, 10 Aug 2005 03:11:26 +0000
Message-ID: <pan.2005.08.10.03.11.25.557547@cunningham.net>

On Wed, 10 Aug 2005 02:33:28 +0000, Matthias Buelow wrote:

> Jeff Cunningham <·······@cunningham.net> wrote:
>>;; first method
>>(let* ((s (with-output-to-string (str)
>>           (ext:run-program "find"
>>                            '("traplogs" "-name" "*.msg" "-print")
>>                            :output str)))
>>       (msgs (sort (split "[\\n]" s) #'string-lessp)))
>>  (format t "number of msg files=~s~%" (length msgs)))
>>                                                                                          
>>;; second method
>>(let ((files (directory "traplogs/*.msg")))
>>  (format t "number of msg files=~s~%" (length files)))
>>                                                                                          
>>The latter method takes a couple minutes to produce a list of 12931
>>files. The former method takes less than a second.
>>
>>Why such a difference?
> 
> Probably because you ran the "second method" before the first?
> 
> mkb.

Its not a caching phenomenon. I checked that. The time difference is
invariant to order. The DIRECTORY execution time just seems to be a heck
of a lot slower when there are a lot of files involved.

-jeff

From: Jeff Cunningham
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Thu, 11 Aug 2005 15:32:13 +0000
Message-ID: <pan.2005.08.11.15.32.13.8755@cunningham.net>

On Tue, 09 Aug 2005 20:11:26 -0700, Jeff Cunningham wrote:

What would it take (I wonder) to improve the performance of DIRECTORY when
it is improvable? At least on UNIX file systems (ext3,xfs,reiserfs)? There
is something esthetically unpleasant about having to use ext:run-program
to do something in Lisp it could perfectly well do itself. 

-jeff

From: Alexander Kjeldaas
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Tue, 23 Aug 2005 00:53:47 +0000
Message-ID: <1124758427.494546.94720@g43g2000cwa.googlegroups.com>

Jeff Cunningham wrote:
> ;; first method
> (let* ((s (with-output-to-string (str)
>            (ext:run-program "find"
>                             '("traplogs" "-name" "*.msg" "-print") :output
>                             str)))
>        (msgs (sort (split "[\\n]" s) #'string-lessp)))
>   (format t "number of msg files=~s~%" (length msgs)))
>
> ;; second method
> (let ((files (directory "traplogs/*.msg")))
>   (format t "number of msg files=~s~%" (length files)))
>
> The latter method takes a couple minutes to produce a list of 12931 files.
> The former method takes less than a second.
>
> Why such a difference?
>

When running the second method through the sb-sprof sampling profiler
in SBCL/x86-64, the following functions come up on top:

           Self        Cumul        Total
  Nr  Count     %  Count     %  Count     % Function
------------------------------------------------------------------------
   1     36  13.6     36  13.6     36  13.6 "foreign function
__xstat64"
   2     20   7.6     26   9.8     56  21.2 (SB-C::TL-XEP
SB-KERNEL:HAIRY-DATA-VECTOR-REF)
   3     16   6.1    120  45.5     72  27.3 (SB-C::TL-XEP
SB-IMPL::CONCAT-TO-SIMPLE*)
   4     16   6.1     16   6.1     88  33.3 (SB-C::TL-XEP ARRAY-RANK)
   5     13   4.9     19   7.2    101  38.3 (SB-C::TL-XEP
SB-KERNEL:HAIRY-DATA-VECTOR-SET)
   6     13   4.9     13   4.9    114  43.2 "foreign function __lxstat"
   7     12   4.5     16   6.1    126  47.7 (SB-C::TL-XEP
ARRAY-DIMENSION)
   8     10   3.8     10   3.8    136  51.5 (FLET #:CLEANUP-FUN-54)
   9     10   3.8     10   3.8    146  55.3 "foreign function
sigprocmask"
  10      8   3.0      8   3.0    154  58.3 (SB-C::TL-XEP
SB-KERNEL:%SP-STRING-COMPARE)
  11      8   3.0      8   3.0    162  61.4 "foreign function readlink"
  12      6   2.3     14   5.3    168  63.6 (SB-C::TL-XEP
SB-KERNEL:SPECIFIER-TYPE)
  13      6   2.3      6   2.3    174  65.9 "no debug information for
frame"
  14      5   1.9    137  51.9    179  67.8 (SB-C::TL-XEP CONCATENATE)
  15      5   1.9     32  12.1    184  69.7 (SB-C::TL-XEP
MAKE-SEQUENCE)
  16      5   1.9      8   3.0    189  71.6 (SB-C::TL-XEP NCONC)
  17      5   1.9      5   1.9    194  73.5 (SB-C::TL-XEP
SB-IMPL::SPLIT-AT-SLASHES)
  18      4   1.5      4   1.5    198  75.0 (SB-C::TL-XEP
SB-KERNEL:%FIND-POSITION)
  19      3   1.1      3   1.1    201  76.1 (LABELS
SB-IMPL::SXHASH-RECURSE)
  20      3   1.1      9   3.4    204  77.3 (SB-C::TL-XEP
SB-KERNEL:VALUES-SPECIFIER-TYPE)
  21      3   1.1     18   6.8    207  78.4 (SB-C::TL-XEP
SB-IMPL::PARSE-UNIX-NAMESTRING)
  22      3   1.1      3   1.1    210  79.5 (SB-C::TL-XEP
SB-IMPL::LAST1)
  23      3   1.1      4   1.5    213  80.7 (LAMBDA
(SB-KERNEL:INSTANCE))
  24      3   1.1      3   1.1    216  81.8 (SB-C::TL-XEP
SB-KERNEL:TYPE=)
  25      3   1.1      3   1.1    219  83.0 (SB-C::TL-XEP STRING<)
  26      2   0.8     80  30.3    221  83.7 (SB-C::TL-XEP
SB-IMPL::%ENUMERATE-DIRECTORIES)
  27      2   0.8      2   0.8    223  84.5 (SB-C::TL-XEP ALIEN-SAP)
  28      2   0.8      3   1.1    225  85.2 (SB-C::TL-XEP
SB-KERNEL:STRING=*)
  29      2   0.8      2   0.8    227  86.0 (SB-C::TL-XEP
SB-INT:EQUAL-BUT-NO-CAR-RECURSION)
  30      2   0.8      2   0.8    229  86.7 (SB-C::TL-XEP LENGTH)
  31      2   0.8      2   0.8    231  87.5 (SB-C::TL-XEP
SB-KERNEL:TYPE-SPECIFIER)
  32      2   0.8      2   0.8    233  88.3 (SB-C::XEP (LAMBDA
(SB-KERNEL:INSTANCE)))
  33      2   0.8      2   0.8    235  89.0 (SB-C::TL-XEP
SB-KERNEL:UB8-BASH-COPY)
  34      2   0.8      3   1.1    237  89.8 (SB-C::TL-XEP
SB-IMPL::MAYBE-MAKE-PATTERN)
  35      2   0.8      3   1.1    239  90.5 (SB-C::TL-XEP MAKE-ARRAY)
  36      1   0.4      1   0.4    240  90.9 (SB-C::TL-XEP
SB-KERNEL:SHRINK-VECTOR)
  37      1   0.4      2   0.8    241  91.3 (SB-C::TL-XEP
SB-IMPL::REHASH)
  38      1   0.4      4   1.5    242  91.7 (SB-C::TL-XEP
SB-KERNEL:%PUTHASH)
  39      1   0.4     25   9.5    243  92.0 (SB-C::TL-XEP
PARSE-NAMESTRING)
  40      1   0.4     10   3.8    244  92.4 (SB-C::TL-XEP
SB-UNIX:UNIX-READLINK)
  41      1   0.4      7   2.7    245  92.8 (SB-C::TL-XEP
SB-KERNEL::CHARACTER-SET-UNPARSE-TYPE-METHOD)
  42      1   0.4      1   0.4    246  93.2 (SB-C::TL-XEP
SB-IMPL::LIST-NREVERSE*)
  43      1   0.4      1   0.4    247  93.6 (SB-C::TL-XEP
SB-KERNEL:%TYPEP)
  44      1   0.4     25   9.5    248  93.9 (SB-C::TL-XEP
SB-IMPL::%PARSE-NAMESTRING)
  45      1   0.4      2   0.8    249  94.3 (SB-C::TL-XEP
SB-ALIEN-INTERNALS:MAKE-ALIEN-VALUE)
  46      1   0.4      1   0.4    250  94.7 "foreign function
stat_wrapper"
  47      1   0.4      1   0.4    251  95.1 (SB-C::TL-XEP
SB-KERNEL:CLASSOID-TYPEP)
  48      1   0.4      2   0.8    252  95.5 (SB-C::TL-XEP
SB-KERNEL::%%TYPEP)
  49      1   0.4      1   0.4    253  95.8 "foreign function
closure_tramp"
  50      1   0.4      1   0.4    254  96.2 (SB-C::TL-XEP
SB-KERNEL:CSUBTYPEP)
  51      1   0.4      1   0.4    255  96.6 (SB-C::TL-XEP IDENTITY)
  52      1   0.4      1   0.4    256  97.0 (SB-C::TL-XEP ASH)
  53      1   0.4      1   0.4    257  97.3 (SB-C::TL-XEP
PATHNAME-VERSION)
  54      1   0.4      1   0.4    258  97.7 (SB-C::TL-XEP
SB-KERNEL:TWO-ARG-AND)
  55      1   0.4      1   0.4    259  98.1 (FLET SB-IMPL::TRICK)
  56      1   0.4      1   0.4    260  98.5 "foreign function strlen"
  57      1   0.4      1   0.4    261  98.9 (SB-C::TL-XEP
SB-IMPL::UNPARSE-UNIX-PIECE)
  58      1   0.4      1   0.4    262  99.2 (SB-C::TL-XEP
SB-KERNEL::%MAKE-INSTANCE-WITH-LAYOUT)
  59      1   0.4      1   0.4    263  99.6 (SB-C::TL-XEP
SB-KERNEL::%CHECK-STRUCTURE-TYPE-FROM-LAYOUT)


So although optimizing the number of stat calls would improve, there
are other issues that make this code slow as well.  Also, the code
conses about 100MB in my tests - that is roughly 10kB per directory
entry.

The code probably accesses a lot of data through a slow FFI interface,
conses a lot of memory, and generally calls stat() too often.   In my
tests, the find utility is roughly 200x as fast as the SBCL code on
"hot" directory data.

astor

From: Juho Snellman
Subject: Re: Big time difference between (unix) find and DIRECTORY : why?
Date: Tue, 23 Aug 2005 01:49:03 +0000
Message-ID: <slrndgl04f.ml6.jsnell@sbz-31.cs.Helsinki.FI>

<··················@gmail.com> wrote:
> Jeff Cunningham wrote:
>>            (ext:run-program "find"
>>                             '("traplogs" "-name" "*.msg" "-print") :output
[...]
>> The latter method takes a couple minutes to produce a list of 12931 files.
>> The former method takes less than a second.
>>
>> Why such a difference?
> 
> When running the second method through the sb-sprof sampling profiler
> in SBCL/x86-64, the following functions come up on top:
[...]
> The code probably accesses a lot of data through a slow FFI interface,
> conses a lot of memory, and generally calls stat() too often.   In my
> tests, the find utility is roughly 200x as fast as the SBCL code on
> "hot" directory data.

There's definitely room for improvement in the SBCL DIRECTORY (and
some of it is obviously low-hanging fruit). However, based on the
example code the original poster is more probably using CMUCL, where
the performance on large directories is even more painful. On a
directory with 35k files CMUCL 19a took 181 seconds and SBCL 2.5
seconds.

-- 
Juho Snellman
"Premature profiling is the root of all evil."