Of a bad sort: 1 10 2 3 4 5 6 7 8 9

From: Dave Yost
Subject: Of a bad sort: 1 10 2 3 4 5 6 7 8 9
Date: Sun, 09 Apr 1995 00:00:00 +0000
Message-ID: <3m9lsi$s1@Yost.com>

Consider the following list.  Is it sorted?

    1
    2
    10

How about this one:

    1
    10
    2

Most people would say the latter is not sorted.  Yet we continue
to write software that sorts strings this way.  Why do we do it?
Because the available string comparison functions embody a
straightforward, simple implementation--which happens to be wrong
when the sorting of the result is done solely for the convenience
of a user.

I propose that the following functions be added to Common Lisp
so that sorting appropriate for humans can be accomplished
straightforwardly and efficiently:

   human-string<
   human-string>
   human-string<=
   human-string>=
   human-string-lessp
   human-string-greaterp
   human-string-not-greaterp
   human-string-not-lessp

When there is an embedded numeric string at the same position
in both strings, compare those substrings as if they each occupied
a single character position.

Has anyone implemented this in lisp?
Perhaps someone can come up with better names?

Dave Yost

Re: Of a bad sort: 1 10 2 3 4 5 6 7 8 9 Barry Margolin
- Re: Of a bad sort: 1 10 2 3 4 5 6 7 8 9 Thomas Breuel

From: Barry Margolin
Subject: Re: Of a bad sort: 1 10 2 3 4 5 6 7 8 9
Date: Mon, 10 Apr 1995 00:00:00 +0000
Message-ID: <3mc99l$70h@tools.near.net>

In article <·········@Yost.com> ····@Yost.com (Dave Yost) writes:
]    1
]    10
]    2
]
]Most people would say the latter is not sorted.  

That's why "order" is relative to a particular comparison function.
Lexicographically, it's ordered, but it's not ordered numerically.

]						  Yet we continue
]to write software that sorts strings this way.  Why do we do it?
]Because the available string comparison functions embody a
]straightforward, simple implementation--which happens to be wrong
]when the sorting of the result is done solely for the convenience
]of a user.

(sort list-of-strings #'< :key parse-integer)

will handle the above simple case.

]I propose that the following functions be added to Common Lisp
]so that sorting appropriate for humans can be accomplished
]straightforwardly and efficiently:
]
]   human-string<
]   human-string>
]   human-string<=
]   human-string>=
]   human-string-lessp
]   human-string-greaterp
]   human-string-not-greaterp
]   human-string-not-lessp
]
]When there is an embedded numeric string at the same position
]in both strings, compare those substrings as if they each occupied
]a single character position.

While this is feasible, I think it's vast overkill and likely to be
expensive to search every string for numeric substrings.  If all the
strings are expected to have the numeric portion at the same position, then
that position is probably known by the application.  Therefore, it would be
more appropriate to use a comparison function designed specifically for
that application, which knows how to parse the strings and extract the
numeric portion.
-- 
Barry Margolin
BBN Planet Corporation, Cambridge, MA
······@bbnplanet.com

From: Thomas Breuel
Subject: Re: Of a bad sort: 1 10 2 3 4 5 6 7 8 9
Date: Wed, 12 Apr 1995 00:00:00 +0000
Message-ID: <TMB.95Apr11205133@netcom6.netcom.com>

In article <··········@tools.near.net> ······@nic.near.net (Barry Margolin) writes:
| ]When there is an embedded numeric string at the same position
| ]in both strings, compare those substrings as if they each occupied
| ]a single character position.
|
| While this is feasible, I think it's vast overkill and likely to be
| expensive to search every string for numeric substrings.

I think it's actually pretty cheap to implement this, even in
CommonLisp. You walk the strings left-to-right; if and when you hit
digits in both strings, you note the beginning of the non-zero leading
digit. You then continue walking forward. By the time you reach the
end of the digit substrings, they may either have turned out to be the
same, in which case you just continue never looking back at the
digits, or they are different, in which case you just walk backwards
and compare them in reverse lexicographic order. There is not really
a separate search involved, and most of the loop looks just like the
usual character-wise string comparison function. The only thing that
you can't do anymore is use word comparison for counted (rather than
zero terminated) strings.

In fact, this might be a nice sorting option for "ls", since people
never seem to remember to put enough leading zeros on their file names
when they are numerical.

On the other hand, I don't think something like this belongs in
a language standard. It might be part of a standard library.

Now, what about embedded g-format floating point numbers :-)

Thomas.