Re: Compression and on-disk sorting - Mailing list pgsql-hackers

From Jim C. Nasby
Subject Re: Compression and on-disk sorting
Date
Msg-id 20060519193230.GH64371@pervasive.com
In response to Re: Compression and on-disk sorting  (Hannu Krosing <hannu@skype.net>)
List pgsql-hackers
On Fri, May 19, 2006 at 10:02:50PM +0300, Hannu Krosing wrote:
> On one fine day, Fri, 2006-05-19 at 14:53, Tom Lane wrote:
> > "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > > On Fri, May 19, 2006 at 09:29:03AM +0200, Martijn van Oosterhout wrote:
> > >> I'm seeing 250,000 blocks being cut down to 9,500 blocks. That's almost
> > >> unbelievable. What's in the table? It would seem to imply that our
> > >> tuple format is far more compressible than we expected.
> > 
> > > It's just SELECT count(*) FROM (SELECT * FROM accounts ORDER BY bid) a;
> > > If the tape routines were actually storing visibility information, I'd
> > > expect that to be pretty compressible in this case since all the tuples
> > > were presumably created in a single transaction by pgbench.
> > 
> > It's worse than that: IIRC what passes through a heaptuple sort are
> > tuples manufactured by heap_form_tuple, which will have consistently
> > zeroed header fields.  However, the above isn't very helpful since the
> > rest of us have no idea what that "accounts" table contains.  How wide
> > is the tuple data, and what's in it?
> 
> Was he not using pgbench data?

I am. For reference:

bench=# \d accounts
       Table "public.accounts"
  Column  |     Type      | Modifiers 
----------+---------------+-----------
 aid      | integer       | not null
 bid      | integer       | 
 abalance | integer       | 
 filler   | character(84) | 
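
For anyone without a pgbench database at hand, here is a rough sketch of why rows shaped like this could compress so well. It assumes (and this is only an assumption about what pgbench loads) that abalance starts at 0 and filler is all blanks. The 96-byte packed layout, the sample size and ROWS_PER_BRANCH are made up for illustration, and tuple headers and alignment are ignored entirely. zlib is only a stand-in for whatever compressor is actually being tested.

```python
import struct
import zlib

ROWS = 10_000            # assumed sample size, far smaller than the real table
ROWS_PER_BRANCH = 1_000  # hypothetical, just so the sample has several bid values


def make_row(aid: int) -> bytes:
    """Pack one accounts-like row: three 4-byte integers plus character(84)."""
    bid = (aid - 1) // ROWS_PER_BRANCH + 1
    abalance = 0              # assumption: balances are all zero after loading
    filler = b" " * 84        # assumption: filler is blank-padded
    return struct.pack("<iii", aid, bid, abalance) + filler


def compression_ratio(rows) -> float:
    """Return raw size divided by zlib-compressed size for the given rows."""
    raw = b"".join(rows)
    return len(raw) / len(zlib.compress(raw))


# Rows generated in aid order are also ordered by bid, as in the ORDER BY bid sort.
rows = [make_row(aid) for aid in range(1, ROWS + 1)]
ratio = compression_ratio(rows)
print(f"bytes per row: {len(rows[0])}, raw/compressed: {ratio:.1f}")
```

Only the aid bytes change from one row to the next in this sketch, so nearly the whole row repeats. Real tape contents will differ, so treat the printed ratio as an illustration rather than a prediction.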
 


> > (This suggests that we might try harder to strip unnecessary header info
> > from tuples being written to tape inside tuplesort.c.  I think most of
> > the required fields could be reconstructed given the TupleDesc.)
> 
> I guess that tapefiles compress better than the average table because they
> are sorted, and thus at least a little more repetitive than the rest. 
> If there are varlen types, then they usually also have an abundance of
> small 4-byte integers, which should also compress at least better than
> 4:1, maybe a lot better.
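
The sorted-data point quoted above is easy to try outside the backend. The sketch below is only an illustration: it packs the same set of random 4-byte integers once in shuffled order and once sorted, then compares the zlib-compressed sizes. The value range, the sample size and the use of zlib are all assumptions chosen for the example, not anything taken from tuplesort.c.

```python
import random
import struct
import zlib


def packed_sizes(values):
    """Return (raw_bytes, compressed_bytes) for values packed as 4-byte big-endian ints."""
    raw = b"".join(struct.pack(">I", v) for v in values)
    return len(raw), len(zlib.compress(raw))


rng = random.Random(0)  # fixed seed so repeated runs see the same data
keys = [rng.randrange(1_000_000) for _ in range(50_000)]

shuffled_raw, shuffled_z = packed_sizes(keys)
sorted_raw, sorted_z = packed_sizes(sorted(keys))

print(f"shuffled: {shuffled_raw} -> {shuffled_z} bytes")
print(f"sorted:   {sorted_raw} -> {sorted_z} bytes")
```

In sorted order, neighbouring values share their high-order bytes, which gives the compressor short repeats that the shuffled stream lacks. How much that matters for real tuples depends on the column mix, so this says nothing about the exact ratio.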

If someone wants to provide a patch that strips out the headers I can test that
as well.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

