From: Joachim Durchholz
Subject: Re: is free, open source software ethical?
Message-ID: <1204795923.7577.70.camel@kurier>
On Thursday, 06.03.2008, at 09:11 +0000, Richard Heathfield wrote:
> It's a sort of reductio ad absurdum. Let's start by assuming that random 
> data *cannot* be compressed. That's your claim, right?

Right.

> Now, either there is such a thing as a process for generating random bits, 
> or there isn't.

There is no *algorithmic* process for generating random data.

What we usually call "random data" is data that isn't actually random,
but close enough: we know it isn't random, but we can't exploit that, so
for the concrete task at hand we can treat it as random without
affecting the outcome.
Data that we can treat as random even though it isn't is called
"pseudo-random".

> If it is your further claim that there is no such process, 
> then you are effectively arguing that there is no such thing as 
> randomness.

In fact that is not the claim.
Quantum-physical processes are truly random (unless current
quantum-physical models are fundamentally wrong).

> So let's agree that we have a process for generating a genuinely random 
> stream of bits, where for ease of discussion we'll call a 0-bit a "tail" 
> and a 1-bit a "head".
> 
> Okay, here comes the reductio.
> 
> We examine the first N results, and then try to compress { N results + 1 
> head }. Now, either we can or we can't. If we can't, well, we can't. But 
> if we *can* do so, then by your argument the data cannot be random if the 
> next result is a head, and yet we are using a genuinely random stream of 
> bits, so the data must be random, and therefore the next result *must* be 
> a tail - which means that we've predicted the next bit with guaranteed 
> success, which means that the stream can't be random - and this 
> contradicts our initial assumption.

You are ignoring that you need to emit headers to mark the compressed
sequence, and those headers add to the size of the stream.
If you have a truly random stream, IIRC the claim is that these
additional headers will require at least as many bits as you can
compress out.
(In fact this is the typical outcome when compressing even pseudo-random
data: the zip file comes out slightly longer than the original.
Information theory says you can reduce that worst-case overhead to zero
in the asymptotic case, but no more.)
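
You can watch that happen with any off-the-shelf compressor; a quick
Python sketch of mine (not part of Richard's experiment):

import os, zlib

data = os.urandom(100000)          # 100,000 pseudo-random bytes from the OS
packed = zlib.compress(data, 9)    # best compression effort
print(len(data), len(packed))      # packed is typically slightly *longer*

The deflate stream finds nothing to exploit in the pseudo-random input,
so what remains visible is just the container overhead.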

> It is not part of my argument that 10 (or whatever) heads is or is not 
> random. My argument uses the fact that it is not only possible but trivial 
> to compress repeated data if there are enough repeats.

Sure, but you get overhead in the case of non-repeated data. For
example, you need to prefix a block of non-repeated data with a header
that contains the length of the block, which means you have an overhead
of log N bits.
That's exactly what you can gain by compressing runs of equal bits (in
fact you gain a little less, because a run doesn't compress down to a
single bit: you have to store the run length as well; with a clever
encoding scheme you might be able to reduce the overhead to zero, but
that's all).
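
A toy run-length coder makes that trade-off visible. This is my own
sketch, not Richard's zip run; each run costs 1 bit for the repeated
value plus an 11-bit length field:

import random

def rle_encode(bits, length_bits=11):
    # Each run is stored as 1 bit (the repeated value) plus a length field.
    out = []
    i = 0
    max_run = 2 ** length_bits - 1
    while i < len(bits):
        j = i
        while j < len(bits) and bits[j] == bits[i] and j - i < max_run:
            j += 1
        out.append(bits[i])
        out.extend(int(c) for c in format(j - i, '0%db' % length_bits))
        i = j
    return out

print(len(rle_encode([1] * 1600)))                                    # 12
print(len(rle_encode([random.getrandbits(1) for _ in range(1600)])))  # ~9600

The all-heads input collapses to 12 bits, but fair coin flips average
runs of length 2, so 1600 of them balloon to roughly 9600 bits: the
length fields are exactly the overhead that eats the gain.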

(Snipping the rest, insofar as it builds on the arguments refuted above.)

> Obviously the fewer the repeats, the harder it is, and the less likely that 
> it can be done. (10 is very low, and was merely a frinstance.) But there's 
> nothing magic about 10 in my argument.

Agreed.

> I generated a run of 1600 heads (where 'set bit' = 'head'), and shoved them 
> through zip, which reduced it to 1216 bits (including all the file header 
> information and stuff - the actual compression reported by zip was 97%; it 
> was the overhead that made the end result so large).

Yes, but if you draw the bits from a truly random source, the
probability of getting that particular sequence is 1 in 2^1600. To make
the compression scheme work, you need to add a header of at least 1 bit
to every sequence that is not all heads, to mark it as uncompressed;
averaged over all 2^1600 equally likely inputs, those flag bits cost at
least as much as the rare saving in the all-heads case.
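
Putting rough numbers on that (mine, under the simplifying assumption
that only the all-heads sequence is compressed, to Richard's 1216 bits,
and every other sequence pays the 1-bit flag):

from fractions import Fraction

n = 1600
p_all_heads = Fraction(1, 2 ** n)            # chance of drawing the all-heads run
expected_gain = p_all_heads * (1600 - 1216)  # the saving, weighted by probability
expected_cost = (1 - p_all_heads) * 1        # the 1-bit flag on every other sequence
print(expected_cost > expected_gain)         # True: on average the flag costs more

Averaged over a truly random source, the scheme loses.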

If you use a more intelligent algorithm that selects the right
compression scheme for each stream of data you get, you'll have to add
bits to tell the decompressor which algorithm to use, which again means
you're adding, on average, at least as many bits as you were able to
compress out.
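
That's the pigeonhole principle at work: there are more n-bit inputs
than there are strictly shorter outputs, so no lossless scheme, however
cleverly it picks an algorithm per stream, can shrink every input. A
quick count (my own, for a small n):

n = 16
inputs = 2 ** n                                   # every possible n-bit string
shorter_outputs = sum(2 ** k for k in range(n))   # every string shorter than n bits
print(inputs, shorter_outputs)                    # 65536 versus 65535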

(In practice, compression works because the data we're dealing with is
highly non-random.)
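
For contrast, here's a structured input next to a pseudo-random one of
the same size (again just a sketch of mine):

import os, zlib

text = b"heads tails " * 8000           # ~96 KB of very repetitive data
noise = os.urandom(len(text))           # the same amount of pseudo-random bytes
print(len(zlib.compress(text, 9)))      # a few hundred bytes
print(len(zlib.compress(noise, 9)))     # slightly more than len(noise)

The compressor wins exactly to the extent that the input is predictable.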

> > Certainly RMS is not such a person.
> 
> Whether he is or whether he isn't, the existence of people who write 
> software to give away for motives other than altruism does not imply the 
> non-existence of people who write software to give away altruistically.

Agreed.

Regards,
Jo