Compression in LO and other fields - Mailing list pgsql-hackers

From Cary O'Brien
Subject Compression in LO and other fields
Date
Msg-id 199911132223.RAA05987@saltmine.radix.net
Whole thread Raw
List pgsql-hackers
Some observations about compression (sorry if this is obvious):

Compression won't work unless there is redundancy in the information
being compressed.  In a normal [1] database, there may be a little
redundancy in each row.  The real redundancy will be when you group
rows together.  I.E. in it is quite likely that many rows will have
identical or similar values for some columns.  It may be worth
considering compressing each storage block[2], in the hopes that each
block will contain several rows, and that the rows will have some
similar (redundant) information.  Then compression will work.  Just
compressing one column of one row may work in some situations, but
in general, it won't.

Once a while ago we implemented a storage system where we grouped
records (about 1k in size) into blocks (about 16k in size) and
compressed each block individually.  To get at a record you had to
decompress the block it was in, and skip to the desired offset.  Even
though the system had to throw away a lot [3] of data for each
retrieval, the system turned out to be be both economical of disk[4]
storage and fast at retrieval.  For this app we found that compression
rates increased as we got up to about 16 records, and then flattend
off.  I.E. compressing 16 records used less space than compressing 4
groups of 4, but compressing 256 records together used as much space
as compressing 16 groups of 16.

The point?

I wish I understood the PostgreSQL backend storage algorithms better,
but it seems that the combination of a Tuneable block size [5] and
an option to compress individual storage blocks [6] might be worth
looking at.

-- cary


[1] Whatever the heck that is.

[2] My lack of understanding of PostgreSQL storage manager   internals is showing.

[3] Half a block on average.

[4] Optical disk in this particular app.

[5] This is an option now, right?  But is it compile time, or run-time?   and is it postmaster-wide, per database?  I
*try*to read the pgsql-hackers-digest   but there is just too much going on!
 

[6] Can this work with the existing data file structure?











pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Status of sql_help.h
Next
From: Tim Holloway
Date:
Subject: Re: [HACKERS] Thread-safe queueing?