Re: Compression and on-disk sorting - Mailing list pgsql-hackers

From Andrew Piskorski
Subject Re: Compression and on-disk sorting
Date
Msg-id 20060517085230.GA53017@tehun.pair.com
In response to Re: Compression and on-disk sorting  (Greg Stark <gsstark@mit.edu>)
Responses Re: Compression and on-disk sorting
List pgsql-hackers
On Tue, May 16, 2006 at 11:48:21PM -0400, Greg Stark wrote:

> There are some very fast decompression algorithms:
> 
> http://www.oberhumer.com/opensource/lzo/

Sure, and for some tasks in PostgreSQL perhaps it would be useful.
But at least as of July 2005, a Sandor Heman, one of the MonetDB guys,
had looked at zlib, bzlib2, lzrw, and lzo, and claimed that:
 "... in general, it is very unlikely that we could achieve any
 bandwidth gains with these algorithms. LZRW and LZO might increase
 bandwidth on relatively slow disk systems, with bandwidths up to
 100MB/s, but this would induce high processing overheads, which
 interferes with query execution. On a fast disk system, such as our
 350MB/s 12 disk RAID, all the generic algorithms will fail to achieve
 any speedup."
 
 http://www.google.com/search?q=MonetDB+LZO+Heman&btnG=Search
 http://homepages.cwi.nl/~heman/downloads/msthesis.pdf
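
To make the shape of that claim concrete, here is a toy model of my own
(not taken from the thesis). It assumes decompression overlaps perfectly
with I/O, so the rate of uncompressed data delivered is capped by
whichever is slower: the disk, scaled by the compression ratio, or the
decompressor. The ratio and decompressor speed are hypothetical
placeholders, not measurements; only the 100MB/s and 350MB/s disk
figures come from the quote.

```python
def effective_read_bandwidth(disk_mb_s, ratio, decompress_mb_s):
    """Uncompressed MB/s delivered when reading compressed data.

    Toy model: I/O and decompression are assumed to overlap perfectly,
    so throughput is the minimum of the two pipeline stages.
    """
    return min(disk_mb_s * ratio, decompress_mb_s)


# Hypothetical values, chosen only for illustration.
RATIO = 2.0              # compressed data is half the original size
DECOMPRESS_MB_S = 300.0  # uncompressed output rate of the decompressor

# Disk bandwidths mentioned in the quoted passage.
slow = effective_read_bandwidth(100.0, RATIO, DECOMPRESS_MB_S)
fast = effective_read_bandwidth(350.0, RATIO, DECOMPRESS_MB_S)

print("100MB/s disk:", slow, "MB/s effective")
print("350MB/s disk:", fast, "MB/s effective")
```

Under these assumed numbers the slow disk gains and the fast RAID
loses, because the decompressor becomes the bottleneck. The model
ignores the CPU time taken away from query execution, which is the
other cost the quote mentions.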

> I think most of the mileage from "lookup tables" would be better implemented
> at a higher level by giving tools to data modellers that let them achieve
> denser data representations. Things like convenient enum data types, 1-bit
> boolean data types, short integer data types, etc.

Things like enums and 1-bit booleans certainly could be useful, but
they cannot take advantage of duplicate values across multiple rows at
all, even if 1000 rows have the exact same value in their "date"
column and are all in the same disk block, right?

Thus I suspect that the exact opposite is true: a good table
compression scheme would render special denser data types largely
redundant and obsolete.
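
As a rough illustration of what exploiting duplicate values within a
block could look like, here is a minimal sketch of per-block dictionary
encoding for a single column. This is only one possible scheme, not how
Oracle or any other system actually does it, and the function names are
made up for the example.

```python
def encode_block(values):
    """Dictionary-encode one column's values within a single block.

    Returns (dictionary, codes): each distinct value is stored once,
    and every row holds only a small integer index into the dictionary.
    """
    dictionary = []
    index = {}
    codes = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes


def decode_block(dictionary, codes):
    """Reverse encode_block: expand the codes back into row values."""
    return [dictionary[c] for c in codes]


# 1000 rows in one block that all share the same "date" value.
dates = ["2006-05-16"] * 1000
dictionary, codes = encode_block(dates)
restored = decode_block(dictionary, codes)

print(len(dictionary), "distinct value(s) for", len(codes), "rows")
```

The date is stored once and each row carries only a small code, which
is a saving that a narrower per-row data type cannot provide.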

Good table compression might be a lot harder to do, of course.
Certainly Oracle's implementation of it had some bugs that made it
difficult to use reliably in practice (in certain circumstances
updates could fail or, if they did not fail, perform pathologically).
Those bugs are supposed to be fixed in 10.2.0.2, which was released
only within the last few months.

-- 
Andrew Piskorski <atp@piskorski.com>
http://www.piskorski.com/