Update on sort-compression stuff

From: Martijn van Oosterhout
I'm going to be offline for a few days, but here are some things I've
tested in the meantime.

Once the compression ratio drops below 4-to-1, the overhead of zlib
becomes overwhelming compared to the savings. I'm not sure how common
that is; I basically had to fill a table with random data to get it
that low.
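
For concreteness, here's the kind of measurement I mean (untested
sketch, standalone zlib rather than the backend code; the 32KB block
size and level 1 are just what I happened to try):

/* Sketch: compression ratio vs. cost of zlib on a 32KB block.
 * Build with: cc ratio.c -lz */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <zlib.h>

#define BLOCK_SIZE (32 * 1024)

int
main(void)
{
    static unsigned char src[BLOCK_SIZE];
    uLongf      cap = compressBound(BLOCK_SIZE);
    uLongf      dlen = cap;
    unsigned char *dst = malloc(cap);
    clock_t     t0, t1;
    int         i;

    /* Random bytes barely compress; real table data does much better. */
    for (i = 0; i < BLOCK_SIZE; i++)
        src[i] = rand() & 0xff;

    t0 = clock();
    for (i = 0; i < 100; i++)
    {
        dlen = cap;
        if (compress2(dst, &dlen, src, BLOCK_SIZE, 1) != Z_OK)
            return 1;
    }
    t1 = clock();

    printf("ratio %.2f-to-1, %.0f us per block\n",
           (double) BLOCK_SIZE / dlen,
           (t1 - t0) * 1e6 / CLOCKS_PER_SEC / 100);
    free(dst);
    return 0;
}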

I also put together a basic implementation using pg_lzcompress. It
appears that pg_lzcompress is very, very slow; I was afraid I'd written
an infinite loop, but it was just really slow. Mind you, the per-call
overhead might have been the problem, since it was being called on
every 32KB block.
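
To separate that per-call overhead from the compressor itself, one
could compare a single call over a whole buffer against many per-32KB
calls. This sketch uses zlib so it compiles on its own; doing the same
with pg_lzcompress would mean pulling in the backend headers:

/* Sketch: one big compress2() call vs. many 32KB-block calls, to see
 * how much is per-call overhead.  Build with: cc blocks.c -lz */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <zlib.h>

#define BLOCK   (32 * 1024)
#define NBLOCKS 256             /* 8MB total */

static double
run(const unsigned char *src, size_t total, size_t chunk)
{
    uLongf      cap = compressBound(chunk);
    unsigned char *dst = malloc(cap);
    clock_t     t0 = clock();
    size_t      off;

    for (off = 0; off < total; off += chunk)
    {
        uLongf      dlen = cap;

        compress2(dst, &dlen, src + off, chunk, 1);
    }
    free(dst);
    return (clock() - t0) * 1000.0 / CLOCKS_PER_SEC;
}

int
main(void)
{
    size_t      total = (size_t) BLOCK * NBLOCKS;
    unsigned char *src = malloc(total);
    size_t      i;

    /* runs of repeated bytes, so the data actually compresses */
    for (i = 0; i < total; i++)
        src[i] = (i / 7) & 0xff;

    printf("whole buffer: %.1f ms\n", run(src, total, total));
    printf("32KB blocks:  %.1f ms\n", run(src, total, BLOCK));
    free(src);
    return 0;
}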

My suggestions at this point are:

- Test a way of storing tuples with less overhead than a HeapTuple
header. If you could do it for in-memory sorts, you could fit more
tuples in memory before spilling to disk. Given that the "compression"
in that case is extremely cheap, it'd be much more likely to be
beneficial (see the sketch after these suggestions).

- Consider replacing pg_lzcompress with zlib if available. Or at least
test pg_lzcompress in a more realistic environment, because it seems
quite slow.
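
For the first suggestion, the potential saving per tuple is roughly
this. (Standalone sketch: the first struct only approximates what
HeapTupleHeaderData carries, and the trimmed layout is entirely
hypothetical.)

#include <stdio.h>
#include <stdint.h>

/* Simplified stand-in for HeapTupleHeaderData. */
typedef struct FullHeader
{
    uint32_t    t_xmin;     /* inserting xact  - unused during a sort */
    uint32_t    t_xmax;     /* deleting xact   - unused during a sort */
    uint32_t    t_cid;      /* command id      - unused during a sort */
    uint8_t     t_ctid[6];  /* tuple location  - unused during a sort */
    int16_t     t_natts;
    uint16_t    t_infomask;
    uint8_t     t_hoff;
    /* null bitmap + attribute data follow */
} FullHeader;

/* Hypothetical trimmed layout: just enough to find the attributes. */
typedef struct SortHeader
{
    uint32_t    s_len;      /* total length, for tape read-back */
    int16_t     s_natts;
    uint8_t     s_hoff;     /* offset past the null bitmap */
    /* null bitmap + attribute data follow */
} SortHeader;

int
main(void)
{
    printf("full header:    %zu bytes/tuple\n", sizeof(FullHeader));
    printf("trimmed header: %zu bytes/tuple\n", sizeof(SortHeader));
    printf("saved:          %zu bytes/tuple\n",
           sizeof(FullHeader) - sizeof(SortHeader));
    return 0;
}

On a sort of a few million narrow rows, that per-tuple difference adds
up to a decent fraction of work_mem.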

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: Update on sort-compression stuff

From: "Jim C. Nasby"
BTW, the test I ran this weekend ended up filling the disk, so I wasn't
able to get results. I hope to have results for a compressed sort that's
still larger than memory in the morning. Unfortunately I'm doing all
this on a machine I use for other things, so it's hard to do testing and
other things at the same time...
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Update on sort-compression stuff

From: Tom Lane
Martijn van Oosterhout <kleptog@svana.org> writes:
> I also put together a basic implementation using pg_lzcompress. It
> appears that pg_lzcompress is very, very slow; I was afraid I'd
> written an infinite loop, but it was just really slow. Mind you, the
> per-call overhead might have been the problem, since it was being
> called on every 32KB block.

Something's wrong there, because the entire point of pg_lzcompress was
to be pretty fast.  I agree it needs looking into.
        regards, tom lane


Re: Update on sort-compression stuff

From: Tom Lane
Martijn van Oosterhout <kleptog@svana.org> writes:
> - Test a way of storing tuples with less overhead than a HeapTuple
> header. If you could do it for in-memory sorts, you could fit more
> tuples in memory before spilling to disk. Given that the "compression"
> in that case is extremely cheap, it'd be much more likely to be
> beneficial.

I looked into this and decided that trimming the headers for the
in-memory copies is not as attractive as all that.  The killer problem
is that comparetup_heap() needs to be able to apply heap_getattr() to
the stored tuples to extract sort keys.  Unless we want to support a
variant copy of the heap_getattr() infrastructure just for sort tuples,
it ain't gonna work.  Another issue is that we'd be increasing the
palloc traffic for in-memory sorts, because tuplesort_gettuple() would
have to cons up a freshly palloc'd complete tuple to hand back to the
caller.

However, we can definitely trim a lot of overhead from what gets written
to "tape", so I'll have a go at doing that.
        regards, tom lane


Re: Update on sort-compression stuff

From: Simon Riggs
On Tue, 2006-05-23 at 14:27 -0400, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > - Test a way of storing tuples with less overhead than a HeapTuple
> > header. If you could do it for in-memory sorts, you could fit more
> > tuples in memory before spilling to disk. Given that the
> > "compression" in that case is extremely cheap, it'd be much more
> > likely to be beneficial.
> 
> I looked into this and decided that trimming the headers for the
> in-memory copies is not as attractive as all that.  The killer problem
> is that comparetup_heap() needs to be able to apply heap_getattr() to
> the stored tuples to extract sort keys.  Unless we want to support a
> variant copy of the heap_getattr() infrastructure just for sort tuples,
> it ain't gonna work.  Another issue is that we'd be increasing the
> palloc traffic for in-memory sorts, because tuplesort_gettuple() would
> have to cons up a freshly palloc'd complete tuple to hand back to the
> caller.
> 
> However, we can definitely trim a lot of overhead from what gets written
> to "tape", so I'll have a go at doing that.

If we write the tuples in compressed form and read them back in that
same form, there wouldn't be any additional palloc overhead at all.
The freelists would fill up with too-large blocks, but that might not
be such a problem.
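
For instance, one grow-only buffer per tuplesort, reused across reads,
removes the per-tuple allocation entirely. (Sketch; ReadBuf is made up
for illustration, and realloc stands in for repalloc.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct ReadBuf
{
    char       *data;
    size_t      cap;
} ReadBuf;

/* Hand back a pointer valid until the next call; caller never frees. */
static char *
readbuf_fetch(ReadBuf *buf, const char *stored, size_t len)
{
    if (len > buf->cap)
    {
        buf->cap = len * 2;     /* grow-only, so reallocs stay rare */
        buf->data = realloc(buf->data, buf->cap);
    }
    memcpy(buf->data, stored, len);     /* or decompress into it */
    return buf->data;
}

int
main(void)
{
    ReadBuf     buf = {NULL, 0};
    const char *tuples[] = {"short", "a somewhat longer tuple body"};
    int         i;

    for (i = 0; i < 2; i++)
        puts(readbuf_fetch(&buf, tuples[i], strlen(tuples[i]) + 1));
    free(buf.data);
    return 0;
}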

heap_getattr() is called from so few other places that it makes sense
to have a sort-specific version.

-- 
Simon Riggs
EnterpriseDB    http://www.enterprisedb.com