Thread: Compression and on-disk sorting

Compression and on-disk sorting

From
"Jim C. Nasby"
Date:
A recent post Tom made in -bugs about how bad performance would be if we
spilled after-commit triggers to disk got me thinking... There are
several operations the database performs that potentially spill to disk.
Given that any time that happens we end up caring much less about CPU
usage and much more about disk IO, for any of these cases that use
non-random access, compressing the data before sending it to disk would
potentially be a sizeable win.

On-disk sorts are the most obvious candidate, but I suspect there's
others.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Compression and on-disk sorting

From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> A recent post Tom made in -bugs about how bad performance would be if we
> spilled after-commit triggers to disk got me thinking... There are
> several operations the database performs that potentially spill to disk.
> Given that any time that happens we end up caring much less about CPU
> usage and much more about disk IO, for any of these cases that use
> non-random access, compressing the data before sending it to disk would
> potentially be a sizeable win.

Note however that what the code thinks is a spill to disk and what
actually involves disk I/O are two different things.  If you think
of it as a spill to kernel disk cache then the attraction is a lot
weaker...
        regards, tom lane


Re: Compression and on-disk sorting

From
"Jim C. Nasby"
Date:
On Mon, May 15, 2006 at 02:18:03PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > A recent post Tom made in -bugs about how bad performance would be if we
> > spilled after-commit triggers to disk got me thinking... There are
> > several operations the database performs that potentially spill to disk.
> > Given that any time that happens we end up caring much less about CPU
> > usage and much more about disk IO, for any of these cases that use
> > non-random access, compressing the data before sending it to disk would
> > potentially be a sizeable win.
> 
> Note however that what the code thinks is a spill to disk and what
> actually involves disk I/O are two different things.  If you think
> of it as a spill to kernel disk cache then the attraction is a lot
> weaker...

I'm really starting to see why other databases want the OS out of their
way...

I guess at this point the best we could do would be to have a
configurable limit for when compression started. The first X number of
bytes go out uncompressed, everything after that is compressed. I don't
know of any technical reason why you couldn't switch in the middle of a
file, so long as you knew exactly where you switched.
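
As a rough illustration, here is a minimal sketch of that idea in C (the
names spill_write and RAW_LIMIT and the length-prefixed block format are
invented for this example, and it uses zlib's one-shot compress2() rather
than anything in the backend): the first chunk of the file goes out raw,
everything after the switch point is written as compressed blocks, and the
offset of the switch is remembered so a reader knows where to start
decompressing.

#include <stdint.h>
#include <stdio.h>
#include <zlib.h>

#define RAW_LIMIT (1024L * 1024L)      /* spill this much uncompressed first */

static long raw_written = 0;           /* uncompressed bytes emitted so far */
static long switch_offset = -1;        /* file offset where compression began */

/* Write one chunk (assumed <= 8 KB) to the spill file, switching to
 * length-prefixed compressed blocks once RAW_LIMIT has gone out raw. */
static int
spill_write(FILE *fp, const char *buf, size_t len)
{
    if (switch_offset < 0 && raw_written + (long) len <= RAW_LIMIT)
    {
        raw_written += (long) len;
        return fwrite(buf, 1, len, fp) == len ? 0 : -1;
    }

    if (switch_offset < 0)
        switch_offset = ftell(fp);     /* remember exactly where we switched */

    Bytef    cbuf[8192 + 1024];        /* room for worst-case expansion */
    uLongf   clen = sizeof(cbuf);

    if (compress2(cbuf, &clen, (const Bytef *) buf, len, Z_BEST_SPEED) != Z_OK)
        return -1;

    uint32_t hdr = (uint32_t) clen;    /* block header: compressed length */
    if (fwrite(&hdr, sizeof(hdr), 1, fp) != 1 ||
        fwrite(cbuf, 1, clen, fp) != clen)
        return -1;
    return 0;
}

A reader would consume raw bytes up to switch_offset and length-prefixed
zlib blocks after it.
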
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Compression and on-disk sorting

From
Andrew Dunstan
Date:
Jim C. Nasby wrote:
> On Mon, May 15, 2006 at 02:18:03PM -0400, Tom Lane wrote:
>   
>> "Jim C. Nasby" <jnasby@pervasive.com> writes:
>>     
>>> A recent post Tom made in -bugs about how bad performance would be if we
>>> spilled after-commit triggers to disk got me thinking... There are
>>> several operations the database performs that potentially spill to disk.
>>> Given that any time that happens we end up caring much less about CPU
>>> usage and much more about disk IO, for any of these cases that use
>>> non-random access, compressing the data before sending it to disk would
>>> potentially be a sizeable win.
>>>       
>> Note however that what the code thinks is a spill to disk and what
>> actually involves disk I/O are two different things.  If you think
>> of it as a spill to kernel disk cache then the attraction is a lot
>> weaker...
>>     
>
> I'm really starting to see why other databases want the OS out of their
> way...
>   

Some of it is pure NIH syndrome. I recently heard of some tests done by 
a major DB team that showed their finely crafted raw file system stuff 
performing at best a few percent better than a standard file system, and 
sometimes worse. I have often heard of the supposed benefits of our 
being able to go behind the OS, but I am very dubious about it. What 
makes people think that we could do any better than the OS guys?


cheers

andrew


Re: Compression and on-disk sorting

From
"Jim C. Nasby"
Date:
On Mon, May 15, 2006 at 03:44:50PM -0400, Andrew Dunstan wrote:
> Jim C. Nasby wrote:
> >On Mon, May 15, 2006 at 02:18:03PM -0400, Tom Lane wrote:
> >  
> >>"Jim C. Nasby" <jnasby@pervasive.com> writes:
> >>    
> >>>A recent post Tom made in -bugs about how bad performance would be if we
> >>>spilled after-commit triggers to disk got me thinking... There are
> >>>several operations the database performs that potentially spill to disk.
> >>>Given that any time that happens we end up caring much less about CPU
> >>>usage and much more about disk IO, for any of these cases that use
> >>>non-random access, compressing the data before sending it to disk would
> >>>potentially be a sizeable win.
> >>>      
> >>Note however that what the code thinks is a spill to disk and what
> >>actually involves disk I/O are two different things.  If you think
> >>of it as a spill to kernel disk cache then the attraction is a lot
> >>weaker...
> >>    
> >
> >I'm really starting to see why other databases want the OS out of their
> >way...
> >  
> 
> Some of it is pure NIH syndrome. I recently heard of some tests done by 
> a major DB team that showed their finely crafted raw file system stuff 
> performing at best a few percent better than a standard file system, and 
> sometimes worse. I have often heard of the supposed benefits of our 
> being able to go behind the OS, but I am very dubious about it. What 
> makes people think that we could do any better than the OS guys?

The problem is that it seems like there's never enough ability to clue
the OS in on what the application is trying to accomplish. For a long
time we didn't have a background writer, because the OS should be able
to flush things out on its own before checkpoint. Now there's talk of a
background reader, because backends keep stalling while waiting on disk IO.

In this case the problem is that we want to tell the OS "Hey, if this
stuff is actually going to go out to the spindles then compress it. And
by the way, we won't be doing any random access on it, either." But
AFAIK there's no option like that in fopen... :)

I agree, when it comes to base-level stuff like how to actually put the
data on the physical media, there's not much to be gained in this day
and age by using RAW storage, and in fact Oracle hasn't favored RAW for
quite some time. Every DBA I've ever talked to says that the only reason
to use RAW is if you're trying to eke every last ounce of performance
out of the hardware that you can, which for 99.99% of installs makes
absolutely no sense.

But, there's a big range between writing your own filesystem and
assuming that the OS should just handle everything for you. I think a
lot of commercial databases lean too far towards not trusting the OS
(which is understandable to a degree, given how much OSes have
improved), while in some areas I think we still rely too much on the OS
(like read-ahead).
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Compression and on-disk sorting

From
Martijn van Oosterhout
Date:
On Mon, May 15, 2006 at 03:02:07PM -0500, Jim C. Nasby wrote:
> The problem is that it seems like there's never enough ability to clue
> the OS in on what the application is trying to accomplish. For a long
> time we didn't have a background writer, because the OS should be able
> to flush things out on its own before checkpoint. Now there's talk of a
> background reader, because backends keep stalling while waiting on disk IO.

Hmm, I thought the background writer was created to reduce the cost of
checkpoints, which have to write out modified pages in the buffers. The
background writer tries to keep the number of dirty pages down.

I don't know about Oracle, but with or without an OS, backends are
going to block on I/O and you'd still need a background reader in both
cases for precisely the same reason. We might be able to do without a
background reader if we did asynchronous I/O, but that can't be done
portably.

> In this case the problem is that we want to tell the OS "Hey, if this
> stuff is actually going to go out to the spindles then compress it. And
> by the way, we won't be doing any random access on it, either." But
> AFAIK there's no option like that in fopen... :)

posix_fadvise(). We don't use it and many OSes don't support it, but it
is there.
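
For reference, the hints it does offer only describe the access pattern
and cache retention; there is nothing along the lines of "compress this if
it reaches the platters". A minimal sketch (the temp file name is made up):

#define _XOPEN_SOURCE 600              /* expose posix_fadvise() */
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
    int fd = open("pgsql_tmp_spill", O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return 1;

    /* we'll scan this sequentially; let the kernel read ahead */
    (void) posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    /* ... write and read the spill data here ... */

    /* done with it; the kernel may drop these pages from its cache */
    (void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    close(fd);
    return 0;
}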

The O/S is some overhead, but the benefits outweigh the costs IMHO.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.

Re: Compression and on-disk sorting

From
"Jim C. Nasby"
Date:
On Mon, May 15, 2006 at 10:09:47PM +0200, Martijn van Oosterhout wrote:
> > In this case the problem is that we want to tell the OS "Hey, if this
> > stuff is actually going to go out to the spindles then compress it. And
> > by the way, we won't be doing any random access on it, either." But
> > AFAIK there's no option like that in fopen... :)
> 
> posix_fadvise(). We don't use it and many OSes don't support it, but it
> is there.

There's an fadvise that tells the OS to compress the data if it actually
makes it to disk?
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Compression and on-disk sorting

From
Ron Mayer
Date:
Jim C. Nasby wrote:
> 
> There's an fadvise that tells the OS to compress the data if it actually
> makes it to disk?

Compressed-filesystem extension (like e2compr, and I think either
Fat or NTFS) can do that.

I think the reasons against adding this feature to postgresql are
largely the same as the reasons why compressed filesystems aren't
very popular.

Has anyone tried running postgresql on a compressing file-system?
I'd expect the penalties to outweigh the benefits (or they'd be
more common); but if it gives impressive results, it might add
weight to this feature idea.

  Ron M

I think the real reason Oracle and others practically re-wrote
their own VM-system and filesystems is that at the time it was
important for them to run under Windows98; where it was rather
easy to write better filesystems than your customer's OS was
bundled with.


Re: Compression and on-disk sorting

From
Tom Lane
Date:
Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
> I think the real reason Oracle and others practically re-wrote
> their own VM-system and filesystems is that at the time it was
> important for them to run under Windows98; where it was rather
> easy to write better filesystems than your customer's OS was
> bundled with.

Windows98?  No, those decisions predate any thought of running Oracle
on Windows, probably by decades.  But I think the thought process was
about as above whenever they did make it; they were running on some
pretty stupid OSes way back when.
        regards, tom lane


Re: Compression and on-disk sorting

From
"Joshua D. Drake"
Date:
Tom Lane wrote:
> Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
>> I think the real reason Oracle and others practically re-wrote
>> their own VM-system and filesystems is that at the time it was
>> important for them to run under Windows98; where it was rather
>> easy to write better filesystems than your customer's OS was
>> bundled with.
> 
> Windows98?  No, those decisions predate any thought of running Oracle
> on Windows, probably by decades.  But I think the thought process was
> about as above whenever they did make it; they were running on some
> pretty stupid OSes way back when.

Windows XP?

****runs****

> 
>             regards, tom lane


-- 
   === The PostgreSQL Company: Command Prompt, Inc. ===
   Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
   Providing the most comprehensive PostgreSQL solutions since 1997
   http://www.commandprompt.com/




Re: Compression and on-disk sorting

From
mark@mark.mielke.cc
Date:
On Mon, May 15, 2006 at 05:42:53PM -0700, Joshua D. Drake wrote:
> >Windows98?  No, those decisions predate any thought of running Oracle
> >on Windows, probably by decades.  But I think the thought process was
> >about as above whenever they did make it; they were running on some
> >pretty stupid OSes way back when.
> Windows XP?
> ****runs****

You guys have to kill your Windows hate - in jest or otherwise. It's
zealous, and blinding. I'm picking on you Joshua, only because your
message is the one I saw last. Sorry...

Writing your own block caching layer can make a lot of sense. Why would it
be assumed that a file system designed for use from a desktop would be
optimal at all for database style loads?

Why would it be assumed that a file system to be used for many different
smaller files would be optimal at all for database style loads?

It's always going to be true that the more specific the requirements,
the more highly optimized one can design a system. The Linux block
caching layer or file system layout can be beaten *for sure* for database
loads.

The real question - and I believe Tom and others have correctly harped
on it in the past is - is it worth it? Until somebody actually pulls
up their sleeves, invests a month or more of their life to it, and
does it, we really won't know. And even then, the cost of maintenance
would have to be considered. Who is going to keep up-to-date on
theoretical storage models? What happens when generic file system
levels again surpass the first attempt?

Personally, I believe it would be worth it - but only to a few. And
most of these few are likely using Oracle. So, no gain unless
you can convince them to switch back... :-)

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada
 One ring to rule them all, one ring to find them, one ring to bring them all
                       and in the darkness bind them...
 
                          http://mark.mielke.cc/



Re: Compression and on-disk sorting

From
"Gregory Maxwell"
Date:
Oh come on. Sorry to troll, but this is too easy.

On 5/15/06, mark@mark.mielke.cc <mark@mark.mielke.cc> wrote:
> You guys have to kill your Windows hate - in jest or otherwise. It's
> zealous, and blinding.
[snip]
> Why would it
> be assumed that a file system designed for use from a desktop would be
> optimal at all for database style loads?

It wouldn't.
Why would someone use a desktop OS for a database?
Why would you call the result of answering the previous question
zealous and blinding?

PG's use of the OS's block cache is a good move because it makes PG
tend to 'just work' where the alternatives require non-trivial tuning
(sizing their caches not to push the OS into swap).  The advantages of
this are great enough that if additional smarts are needed in the OS
cache it might well be worth the work to add it there and to ask for
new fadvise flags to get the desired behavior.

That's something that would be easy enough for a dedicated hacker to
do, or easy enough to collaborate with the OS developers if the need
could be demonstrated clearly enough.

What reasonable OS couldn't you do that with?

:)


Re: Compression and on-disk sorting

From
Bruce Momjian
Date:
mark@mark.mielke.cc wrote:
> The real question - and I believe Tom and others have correctly harped
> on it in the past is - is it worth it? Until somebody actually pulls
> up their sleeves, invests a month or more of their life to it, and
> does it, we really won't know. And even then, the cost of maintenance
> would have to be considered. Who is going to keep up-to-date on
> theoretical storage models? What happens when generic file system
> levels again surpass the first attempt?
> 
> Personally, I believe it would be worth it - but only to a few. And
> most of these few are likely using Oracle. So, no gain unless
> you can convince them to switch back... :-)

We do know that for commercial databases that support both raw and
file system storage, raw storage is only a few percentage points
faster.

-- 
  Bruce Momjian   http://candle.pha.pa.us
  EnterpriseDB    http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: Compression and on-disk sorting

From
"Zeugswetter Andreas DCP SD"
Date:
> > Given that any time that happens we end up caring much less about CPU
> > usage and much more about disk IO, for any of these cases that use
> > non-random access, compressing the data before sending it to disk would
> > potentially be a sizeable win.
>
> Note however that what the code thinks is a spill to disk and what
> actually involves disk I/O are two different things.  If you think
> of it as a spill to kernel disk cache then the attraction is a lot
> weaker...

Yes, that is very true. However it would also increase the probability
that spill to disk is not needed, since more data fits in RAM.

It would probably need some sort of plugin architecture, since the
fastest compression algorithms (LZO) that also reach good ratios are
GPL. LZO is proven to increase physical IO write speed with low CPU
overhead.

Andreas


Re: Compression and on-disk sorting

From
"Zeugswetter Andreas DCP SD"
Date:
> > Personally, I believe it would be worth it - but only to a few. And
> > most of these few are likely using Oracle. So, no gain unless
> > you can convince them to switch back... :-)
>
> We do know that for commercial databases that support both raw and
> file system storage, raw storage is only a few percentage points
> faster.

IMHO it is really not comparable, because they all use direct or async IO
that bypasses the OS buffer cache even when using filesystem files for
storage. A substantial speed difference is allocation of space for restore
(no filesystem format and no file allocation needed).

I am not saying this to advocate moving in that direction, however.
I do think that there is substantial headroom in reducing the number
of IO calls and reducing on-disk storage requirements, especially in
concurrent load scenarios.

Andreas


Re: Compression and on-disk sorting

From
"Bort, Paul"
Date:
>
> Compressed-filesystem extension (like e2compr, and I think either
> Fat or NTFS) can do that.
>

Windows (NT/2000/XP) can compress individual directories and files under
NTFS; new files in a compressed directory are compressed by default.

So if the 'spill-to-disk' all happened in its own specific directory, it
would be trivial to mark that directory for compression.

I don't know enough Linux/Unix to know if it has similar capabilities.



Re: Compression and on-disk sorting

From
Andrew Dunstan
Date:
Bort, Paul wrote:
>> Compressed-filesystem extension (like e2compr, and I think either
>> Fat or NTFS) can do that.
>>
>>     
>
> Windows (NT/2000/XP) can compress individual directories and files under
> NTFS; new files in a compressed directory are compressed by default.
>
> So if the 'spill-to-disk' all happened in its own specific directory, it
> would be trivial to mark that directory for compression. 
>
> I don't know enough Linux/Unix to know if it has similar capabilities.
>
>
>   

Or would want to ...

I habitually turn off all compression on my Windows boxes, because it's 
a performance hit in my experience. Disk is cheap ...

cheers

andrew


Re: Compression and on-disk sorting

From
Rod Taylor
Date:
On Tue, 2006-05-16 at 11:53 -0400, Andrew Dunstan wrote:
> Bort, Paul wrote:
> >> Compressed-filesystem extension (like e2compr, and I think either
> >> Fat or NTFS) can do that.
> >>
> >>     
> >
> > Windows (NT/2000/XP) can compress individual directories and files under
> > NTFS; new files in a compressed directory are compressed by default.
> >
> > So if the 'spill-to-disk' all happened in its own specific directory, it
> > would be trivial to mark that directory for compression. 
> >
> > I don't know enough Linux/Unix to know if it has similar capabilities.

> Or would want to ...
> 
> I habitually turn off all compression on my Windows boxes, because it's 
> a performance hit in my experience. Disk is cheap ...

Disk storage is cheap. Disk bandwidth or throughput is very expensive.
-- 



Re: Compression and on-disk sorting

From
Andrew Dunstan
Date:
Rod Taylor wrote:
>> I habitually turn off all compression on my Windows boxes, because it's 
>> a performance hit in my experience. Disk is cheap ...
>>     
>
> Disk storage is cheap. Disk bandwidth or throughput is very expensive.
>   

Sure, but in my experience using Windows File System compression is not 
a win here. Presumably if it were an unqualified win they would have it 
turned on everywhere. The fact that there's an option is a good 
indication that it isn't in many cases. It is most commonly used for 
files like executables that are in effect read-only - but that doesn't 
help us.

cheers

andrew



Re: Compression and on-disk sorting

From
"Jim C. Nasby"
Date:
On Tue, May 16, 2006 at 09:24:38AM +0200, Zeugswetter Andreas DCP SD wrote:
> 
> > > Given that any time that happens we end up caring much less about CPU
> > > usage and much more about disk IO, for any of these cases that use
> > > non-random access, compressing the data before sending it to disk would
> > > potentially be a sizeable win.
> > 
> > Note however that what the code thinks is a spill to disk and what
> > actually involves disk I/O are two different things.  If you think
> > of it as a spill to kernel disk cache then the attraction is a lot
> > weaker...
> 
> Yes, that is very true. However it would also increase the probability
> that spill to disk is not needed, since more data fits in RAM.

That's a pretty thin margin though, depending on how good the
compression is. This also assumes that you have a compression algorithm
that supports random access.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Compression and on-disk sorting

From
"Jim C. Nasby"
Date:
On Tue, May 16, 2006 at 12:27:42PM -0400, Andrew Dunstan wrote:
> Rod Taylor wrote:
> >>I habitually turn off all compression on my Windows boxes, because it's 
> >>a performance hit in my experience. Disk is cheap ...
> >>    
> >
> >Disk storage is cheap. Disk bandwidth or throughput is very expensive.

Hey, that's my line! :P

> Sure, but in my experience using Windows File System compression is not 
> a win here. Presumably if it were an unqualified win they would have it 
> turned on everywhere. The fact that there's an option is a good 
> indication that it isn't in many cases. It is most commonly used for 
> files like executables that are in effect read-only - but that doesn't 
> help us.

The issue with filesystem level compression is that it has to support
things like random access, which isn't needed for on-disk sorting (not
sure about other things like hashing, etc).

In any case, my curiosity is aroused, so I'm currently benchmarking
pgbench on both a compressed and uncompressed $PGDATA/base. I'll also do
some benchmarks with pgsql_tmp compressed.

Does anyone have time to hack some kind of compression into the on-disk
sort code just to get some benchmark numbers? Unfortunately, doing so is
beyond my meager C ability...
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Compression and on-disk sorting

From
"Jim C. Nasby"
Date:
On Tue, May 16, 2006 at 12:31:07PM -0500, Jim C. Nasby wrote:
> In any case, my curiosity is aroused, so I'm currently benchmarking
> pgbench on both a compressed and uncompressed $PGDATA/base. I'll also do
> some benchmarks with pgsql_tmp compressed.
Results: http://jim.nasby.net/bench.log

As expected, compressing $PGDATA/base was a loss. But compressing
pgsql_tmp and then doing some disk-based sorts did show an improvement,
from 366.1 seconds to 317.3 seconds, an improvement of 13.3%. This is on
a Windows XP laptop (Dell Latitude D600) with 512MB, so it's somewhat of
a worst-case scenario. On the other hand, XP's compression algorithm
appears to be pretty aggressive, as it cut the size of the on-disk sort
file from almost 700MB to 82MB. There's probably gains to be had from a
different compression algorithm.

> Does anyone have time to hack some kind of compression into the on-disk
> sort code just to get some benchmark numbers? Unfortunately, doing so is
> beyond my meager C ability...
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Compression and on-disk sorting

From
Martijn van Oosterhout
Date:
On Tue, May 16, 2006 at 12:31:07PM -0500, Jim C. Nasby wrote:
> Does anyone have time to hack some kind of compression into the on-disk
> sort code just to get some benchmark numbers? Unfortunately, doing so is
> beyond my meager C ability...

I had a look at this. At first glance it doesn't seem too hard, except
the whole logtape process kinda gets in the way. If it weren't for the
mark/restore it'd be trivial. Might take a stab at it some time, if I
can think of a way to handle the seeking...

--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.

Re: Compression and on-disk sorting

From
"Jim C. Nasby"
Date:
On Tue, May 16, 2006 at 11:46:15PM +0200, Martijn van Oosterhout wrote:
> On Tue, May 16, 2006 at 12:31:07PM -0500, Jim C. Nasby wrote:
> > Does anyone have time to hack some kind of compression into the on-disk
> > sort code just to get some benchmark numbers? Unfortunately, doing so is
> > beyond my meager C ability...
> 
> I had a look at this. At first glance it doesn't seem too hard, except
> the whole logtape process kinda gets in the way. If it weren't for the
> mark/restore it'd be trivial. Might take a stab at it some time, if I
> can think of a way to handle the seeking...

Oh, do we need to randomly seek? Is that how we switch from one tape to
another?

It might be easier to switch to giving each tape its own file...
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Compression and on-disk sorting

From
Martijn van Oosterhout
Date:
On Tue, May 16, 2006 at 04:50:22PM -0500, Jim C. Nasby wrote:
> > I had a look at this. At first glance it doesn't seem too hard, except
> > the whole logtape process kinda gets in the way. If it weren't for the
> > mark/restore it'd be trivial. Might take a stab at it some time, if I
> > can think of a way to handle the seeking...
>
> Oh, do we need to randomly seek? Is that how we switch from one tape to
> another?

Not seek, mark/restore. As the code describes, sometimes you go back a
tuple. The primary reason I think is for the final pass, a merge sort
might read the tuples multiple times, so it needs to support it there.

> It might be easier to switch to giving each tape its own file...

I don't think it would make much difference. OTOH, if this turns out to
be a win, the tuplestore could have the same optimisation.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.

Re: Compression and on-disk sorting

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> Not seek, mark/restore. As the code describes, sometimes you go back a
> tuple. The primary reason I think is for the final pass, a merge sort
> might read the tuples multiple times, so it needs to support it there.

However it'd be possible to tell logtape in advance whether a particular
tape needs to support that, and only apply compression when not; it
would work all the time for intermediate merge passes, and with the
recent executor changes to pass down you-need-to-support-mark flags,
it'd work for the output pass in a lot of cases too.
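
A sketch of that decision, with invented field names rather than anything
actually in logtape.c:

#include <stdbool.h>

/* Hypothetical per-tape state, purely for illustration. */
typedef struct TapeSketch
{
    bool needs_mark;       /* caller may Seek/Tell (mark/restore) this tape */
    bool use_compression;  /* safe to stream-compress the tape's data */
} TapeSketch;

static void
tape_set_random_access(TapeSketch *tape, bool needs_mark)
{
    tape->needs_mark = needs_mark;

    /* Intermediate merge tapes are written and then read strictly
     * sequentially, so they could be compressed; a tape whose consumer
     * needs mark/restore stays uncompressed. */
    tape->use_compression = !needs_mark;
}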

If you're just trying to get some quick and dirty numbers: do
compression, replace Seek/Tell with PANICs, and only test on plain
sorts, no joins ;-)
        regards, tom lane


Re: Compression and on-disk sorting

From
Greg Stark
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:

> > It might be easier to switch to giving each tape its own file...
> 
> I don't think it would make much difference. OTOH, if this turns out to
> be a win, the tuplestore could have the same optimisation.

Would giving each tape its own file make it easier to allow multiple temporary
sort areas and allow optimizing to avoid seeking when multiple spindles are
available?


-- 
greg



Re: Compression and on-disk sorting

From
Andrew Piskorski
Date:
On Tue, May 16, 2006 at 12:31:07PM -0500, Jim C. Nasby wrote:
> On Tue, May 16, 2006 at 12:27:42PM -0400, Andrew Dunstan wrote:
> > Rod Taylor wrote:
> > >>I habitually turn off all compression on my Windows boxes, because it's 
> > >>a performance hit in my experience. Disk is cheap ...
> > >
> > >Disk storage is cheap. Disk bandwidth or throughput is very expensive.

> > Sure, but in my experience using Windows File System compression is not 
> > a win here. Presumably if it were an unqualified win they would have it 

> Does anyone have time to hack some kind of compression into the on-disk
> sort code just to get some benchmark numbers? Unfortunately, doing so is
> beyond my meager C ability...

Folks, first of all, I'm in no way an expert on data compression in
RDBMSs, but other databases DO include data compression features and
claim it as a SIGNIFICANT win in I/O reduction.

Looking at performance of the Windows File System compression, etc.,
doesn't make too much sense when there are actual RDBMS compression
implementations to compare to, on the commercial market, in open
source code, and in the academic literature.

Oracle has included "table compression" since 9iR2.  They report table
size reductions of 2x to 4x as typical, with proportional reductions
in I/O, and supposedly, usually low to negligible overhead for writes:
  http://download-east.oracle.com/docs/cd/B19306_01/server.102/b14211/build_db.htm#sthref289
  Decision Speed: Table Compression In Action by Meikel Poess and Hermann Baer (2003):
  http://www.oracle.com/technology/oramag/webcolumns/2003/techarticles/poess_tablecomp.html
  Compressing Data for Space and Speed by Sanjay Mishra (2004):
  http://www.oracle.com/technology/oramag/oracle/04-mar/o24tech_data.html
  Order For Maximum Compression:
  http://oramossoracle.blogspot.com/2005/11/table-compression-order-for-maximum.html

I don't remember whether the current (Open Source) MonetDB includes
table compression or not, but they've published papers with lots of
interesting detail on the compression and other high performance OLAP
features in their latest (not released) "X100" MonetDB research
codebase:
  http://monetdb.cwi.nl/
  http://homepages.cwi.nl/~mk/MonetDB/
  http://sourceforge.net/projects/monetdb/
  ftp://ftp.research.microsoft.com/pub/debull/A05june/issue1.htm

Now, the docs and papers above are all focused on query performance,
they say nothing directly about using using compression for on-disk
sorts.  But, I'd bet that similar rules of thumb will apply in both
cases.

The main tricks seem to be:  One, EXTREMELY lightweight compression
schemes - basically table lookups designed to be as CPU friendly as
possible.  Two, keep the data compressed in RAM as well so that you can
also cache more of the data, and indeed keep it compressed until as
late in the CPU processing pipeline as possible.

A corollary of that is forget compression schemes like gzip - it
reduces data size nicely but is far too slow on the CPU to be
particularly useful in improving overall throughput rates.

Note, I have not really tested ANY of the above myself, your mileage
may well vary from what I recall from those various articles...

-- 
Andrew Piskorski <atp@piskorski.com>
http://www.piskorski.com/


Re: Compression and on-disk sorting

From
Greg Stark
Date:
Andrew Piskorski <atp@piskorski.com> writes:

> The main tricks seem to be:  One, EXTREMELY lightweight compression
> schemes - basically table lookups designed to be as CPU friendly as
> possible.  Two, keep the data compressed in RAM as well so that you can
> also cache more of the data, and indeed keep it compressed until
> as late in the CPU processing pipeline as possible.
> 
> A corollary of that is forget compression schemes like gzip - it
> reduces data size nicely but is far too slow on the CPU to be
> particularly useful in improving overall throughput rates.

There are some very fast decompression algorithms:

http://www.oberhumer.com/opensource/lzo/


I think most of the mileage from "lookup tables" would be better implemented
at a higher level by giving tools to data modellers that let them achieve
denser data representations. Things like convenient enum data types, 1-bit
boolean data types, short integer data types, etc.

-- 
greg



Re: Compression and on-disk sorting

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> Andrew Piskorski <atp@piskorski.com> writes:
>> A corollary of that is forget compression schemes like gzip - it
>> reduces data size nicely but is far too slow on the CPU to be
>> particularly useful in improving overall throughput rates.

> There are some very fast decompression algorithms:

AFAICS the only sane choice here is to use
src/backend/utils/adt/pg_lzcompress.c, on the grounds that (1) it's
already in the backend, and (2) data compression in general is such a
minefield of patents that we'd be foolish to expose ourselves in more
than one direction.

Certainly, if you can't prototype a convincing performance win using
that algorithm, it's unlikely to be worth anyone's time to look harder.
        regards, tom lane


Re: Compression and on-disk sorting

From
"Albe Laurenz"
Date:
Andrew Piskorski wrote:
>>> Rod Taylor wrote:
>>>>Disk storage is cheap. Disk bandwidth or throughput is very expensive.
>
> Oracle has included "table compression" since 9iR2.  They report table
> size reductions of 2x to 4x as typical, with proportional reductions
> in I/O, and supposedly, usually low to negligible overhead for writes:

[...]

> The main tricks seem to be:  One, EXTREMELY lightweight compression
> schemes - basically table lookups designed to be as CPU friendly as
> possible.  Two, keep the data compressed in RAM as well so that you can
> also cache more of the data, and indeed keep it compressed until
> as late in the CPU processing pipeline as possible.

Oracle's compression seems to work as follows:
- At the beginning of each data block, there is a 'lookup table'
  containing frequently used values in table entries (of that block).
- This lookup table is referenced from within the block.
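
As an illustration of that description only (the struct layout is
invented, not Oracle's or PostgreSQL's on-disk format):

#include <stdint.h>

#define BLOCK_DICT_MAX 255                  /* fits in a one-byte reference */

typedef struct BlockDictEntry
{
    uint16_t off;                           /* offset of the shared value within the block */
    uint16_t len;                           /* length of the shared value */
} BlockDictEntry;

typedef struct CompressedBlock
{
    uint8_t        ndict;                   /* dictionary slots in use */
    BlockDictEntry dict[BLOCK_DICT_MAX];    /* per-block lookup table */
    /* ... row data follows; a field whose value appears in the dictionary
     * is stored as a one-byte index into dict[] instead of the value, so
     * 1000 identical dates in one block cost the value once plus 1000
     * one-byte references ... */
} CompressedBlock;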

There is a White Paper that describes the algorithm and contains
praise for the effects:
http://www.oracle.com/technology/products/bi/pdf/o9ir2_compression_performance_twp.pdf

Oracle does not compress tables by default.
This is what they have to say about it:

Table compression should be used with highly redundant data, such as
tables with many foreign keys. You should avoid compressing tables with
much update or other DML activity. Although compressed tables or
partitions are updatable, there is some overhead in updating these
tables, and high update activity may work against compression by
causing some space to be wasted.

Yours,
Laurenz Albe


Re: Compression and on-disk sorting

From
Martijn van Oosterhout
Date:
On Wed, May 17, 2006 at 09:45:35AM +0200, Albe Laurenz wrote:
> Oracle's compression seems to work as follows:
> - At the beginning of each data block, there is a 'lookup table'
>   containing frequently used values in table entries (of that block).
> - This lookup table is referenced from within the block.

Clever idea, pity we can't use it (what's the bet it's patented?). I'd
wager anything beyond simple compression is patented by someone.

The biggest issue is really that once postgres reads a block from disk
and uncompresses it, this block will be much larger than 8K. Somehow
you have to arrange storage for this.

I have some ideas though, but as Tom says, should go for the quick and
dirty numbers first, to determine if it's even worth doing.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.

Re: Compression and on-disk sorting

From
Martijn van Oosterhout
Date:
On Wed, May 17, 2006 at 12:03:15AM -0400, Tom Lane wrote:
> AFAICS the only sane choice here is to use
> src/backend/utils/adt/pg_lzcompress.c, on the grounds that (1) it's
> already in the backend, and (2) data compression in general is such a
> minefield of patents that we'd be foolish to expose ourselves in more
> than one direction.

Unfortunately, the interface provided by pg_lzcompress.c is probably
insufficient for this purpose. You want to be able to compress tuples
as they get inserted and start a new block once the output reaches a
certain size. pg_lzcompress.c only has the options compress-whole-block
and decompress-whole-block.

zlib allows you to compress as the data comes along, keeping an eye on
the output buffer while you do it. For an initial test, using zlib
directly would probably be easier. If it works out we can look into
alternatives.
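
A sketch of that incremental style (emit_block() is a hypothetical
callback standing in for handing a finished block to the tape code, and
it assumes individual tuples are small compared to the slack in the
output buffer):

#include <string.h>
#include <zlib.h>

#define BLCKSZ 8192

extern void emit_block(const unsigned char *data, unsigned len);

static z_stream      zs;
static unsigned char outbuf[BLCKSZ + 1024];   /* slack for the final flush */

static void
start_block(void)
{
    memset(&zs, 0, sizeof(zs));               /* zalloc/zfree/opaque = Z_NULL */
    deflateInit(&zs, Z_BEST_SPEED);
    zs.next_out = outbuf;
    zs.avail_out = sizeof(outbuf);
}

static void
finish_block(void)
{
    deflate(&zs, Z_FINISH);                   /* flush buffered input */
    emit_block(outbuf, sizeof(outbuf) - zs.avail_out);
    deflateEnd(&zs);
}

/* Compress one tuple as it arrives; close out the block once roughly
 * BLCKSZ of compressed output has accumulated. */
static void
add_tuple(const void *tuple, unsigned len)
{
    zs.next_in = (Bytef *) tuple;
    zs.avail_in = len;
    deflate(&zs, Z_NO_FLUSH);

    if (zs.total_out >= BLCKSZ)
    {
        finish_block();
        start_block();
    }
}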

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.

Re: Compression and on-disk sorting

From
Andrew Piskorski
Date:
On Tue, May 16, 2006 at 11:48:21PM -0400, Greg Stark wrote:

> There are some very fast decompression algorithms:
> 
> http://www.oberhumer.com/opensource/lzo/

Sure, and for some tasks in PostgreSQL perhaps it would be useful.
But at least as of July 2005, a Sandor Heman, one of the MonetDB guys,
had looked at zlib, bzlib2, lzrw, and lzo, and claimed that:
 "... in general, it is very unlikely that we could achieve any bandwidth gains with these algorithms. LZRW and LZO
mightincrease bandwidth on relatively slow disk systems, with bandwidths up to 100MB/s, but this would induce high
processingoverheads, which interferes with query execution. On a fast disk system, such as our 350MB/s 12 disk RAID,
allthe generic algorithms will fail to achieve any speedup."
 
  http://www.google.com/search?q=MonetDB+LZO+Heman&btnG=Search
  http://homepages.cwi.nl/~heman/downloads/msthesis.pdf

> I think most of the mileage from "lookup tables" would be better implemented
> at a higher level by giving tools to data modellers that let them achieve
> denser data representations. Things like convenient enum data types, 1-bit
> boolean data types, short integer data types, etc.

Things like enums and 1 bit booleans certainly could be useful, but
they cannot take advantage of duplicate values across multiple rows at
all, even if 1000 rows have the exact same value in their "date"
column and are all in the same disk block, right?

Thus I suspect that the exact opposite is true: a good table
compression scheme would render special denser data types largely
redundant and obsolete.

Good table compression might be a lot harder to do, of course.
Certainly Oracle's implementation of it had some bugs which made it
difficult to use reliably in practice (in certain circumstances
updates could fail, or if not fail perhaps have pathological
performance), bugs which are supposed to be fixed in 10.2.0.2, which
was only released within the last few months.

-- 
Andrew Piskorski <atp@piskorski.com>
http://www.piskorski.com/


Re: Compression and on-disk sorting

From
"Zeugswetter Andreas DCP SD"
Date:
> Certainly, if you can't prototype a convincing performance win using
> that algorithm, it's unlikely to be worth anyone's time to
> look harder.

That should be easily possible with LZO. It would need to be the lib
that we can optionally link to (--with-lzo), since the lib is GPL.

lzo even allows for inplace decompression and overlapping compression.

Andreas


Re: Compression and on-disk sorting

From
Hannu Krosing
Date:
On Wed, 2006-05-17 at 12:20, Zeugswetter Andreas DCP SD wrote:
> > Certainly, if you can't prototype a convincing performance win using
> > that algorithm, it's unlikely to be worth anyone's time to 
> > look harder.
> 
> That should be easily possible with LZO. It would need to be the lib
> that we can optionally link to (--with-lzo), since the lib is GPL.
> 
> lzo even allows for inplace decompression and overlapping compression.

Does being GPL also automatically imply that it is patent-free, so that
we could re-implement it under BSD license if it gives good results?

------------
Hannu




Re: Compression and on-disk sorting

From
"Zeugswetter Andreas DCP SD"
Date:
> Unfortunately, the interface provided by pg_lzcompress.c is probably
> insufficient for this purpose. You want to be able to compress tuples
> as they get inserted and start a new block once the output reaches a

I don't think anything that compresses single tuples without context is
going to be a win under realistic circumstances.

I would at least compress whole pages. Allow a max ratio of 1:n,
have the pg buffercache be uncompressed, and only compress on write
(filesystem cache then holds compressed pages).

The tricky part is predicting whether a tuple still fits in an n*8k
uncompressed / 8k compressed page, but since LZO is fast you might even
test it in corner cases. (This logic probably also needs to be in the
available page freespace calculation.) Choosing a good n is also
tricky; probably 2 (or 3?) is good.

You probably also want to always keep the header part of the page
uncompressed.
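
A sketch of that fit test (illustration only: it uses zlib's one-shot
compress2() where LZO is suggested above, and page_would_fit(), the
header size, and the n*BLCKSZ layout are invented for the example):

#include <stdbool.h>
#include <zlib.h>

#define BLCKSZ      8192
#define PAGE_RATIO  2                      /* the "n": up to 2x uncompressed */
#define PAGE_HDR_SZ 24                     /* header kept uncompressed */

/* 'used' is the total bytes of the (scratch) page image in use, header
 * included, with the candidate tuple already appended. */
static bool
page_would_fit(const char *page_image, size_t used)
{
    if (used > (size_t) PAGE_RATIO * BLCKSZ)
        return false;                      /* over the 1:n uncompressed cap */

    Bytef  cbuf[PAGE_RATIO * BLCKSZ + 4096];
    uLongf clen = sizeof(cbuf);

    /* test-compress the body; the header would stay uncompressed */
    if (compress2(cbuf, &clen, (const Bytef *) page_image + PAGE_HDR_SZ,
                  used - PAGE_HDR_SZ, Z_BEST_SPEED) != Z_OK)
        return false;

    return PAGE_HDR_SZ + clen <= BLCKSZ;   /* fits in one 8k block on disk */
}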

Andreas


Re: Compression and on-disk sorting

From
"Jonah H. Harris"
Date:
On 5/17/06, Martijn van Oosterhout <kleptog@svana.org> wrote:
> Clever idea, pity we can't use it (what's the bet it's patented?). I'd
> wager anything beyond simple compression is patented by someone.

Oracle's patent application 20040054858 covers the method itself
including the process for storing and retrieving compressed data.

--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation            | fax: 732.331.1301
33 Wood Ave S, 2nd Floor            | jharris@enterprisedb.com
Iselin, New Jersey 08830            | http://www.enterprisedb.com/


Re: Compression and on-disk sorting

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> Clever idea, pity we can't use it (what's the bet it's patented?). I'd
> wager anything beyond simple compression is patented by someone.

You're in for a rude awakening: even "simple compression" is anything
but simple.  As I said, it's a minefield of patents.  I recall reading a
very long statement by one of the zlib developers (Jean-loup Gailly, I
think) explaining exactly how they had threaded their way through that
minefield, and why they were different enough from half-a-dozen
similar-looking patented methods to not infringe any of them.

I feel fairly confident that zlib is patent-free, first because they did
their homework and second because they've now been out there and highly
visible for a good long time without getting sued.  I've got no such
confidence in any other random algorithm you might choose --- in fact,
I'm not at all sure that pg_lzcompress.c is safe.  If we were
aggressively trying to avoid patent risks we might well consider
dropping pg_lzcompress.c and using zlib exclusively.
        regards, tom lane


Re: Compression and on-disk sorting

From
"Jim C. Nasby"
Date:
On Tue, May 16, 2006 at 06:48:25PM -0400, Greg Stark wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> 
> > > It might be easier to switch to giving each tape its own file...
> > 
> > I don't think it would make much difference. OTOH, if this turns out to
> > be a win, the tuplestore could have the same optimisation.
> 
> Would giving each tape its own file make it easier to allow multiple temporary
> sort areas and allow optimizing to avoid seeking when multiple spindles are
> available?

Only if those spindles weren't all in a single RAID array and if we went
through the trouble of creating all the machinery so you could tell
PostgreSQL where all those spindles were mounted in the filesystem. And
even after all that work, there's not much that says it would perform
better than a simple RAID10.

What *might* make sense would be to provide two locations for pgsql_tmp,
because a lot of operations in there involve reading and writing at the
same time:

Read from heap while writing tapes to pgsql_tmp
read from tapes while writing final version to pgsql_tmp

There's probably some gain to be had by writing the final version to a
tablespace other than the default one (which is where pgsql_tmp would
be, I think). But recent changes in -HEAD mean that the second step is
now only performed in certain cases.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Compression and on-disk sorting

From
"Jim C. Nasby"
Date:
If we're going to consider table-level compression, ISTM the logical
first step is to provide greater control over TOASTing; namely
thresholds for when to compress and/or go to external storage that can
be set on a per-field or at least per-table basis.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Compression and on-disk sorting

From
"Jim C. Nasby"
Date:
On Wed, May 17, 2006 at 10:06:04AM +0200, Martijn van Oosterhout wrote:
> On Wed, May 17, 2006 at 09:45:35AM +0200, Albe Laurenz wrote:
> > Oracle's compression seems to work as follows:
> > - At the beginning of each data block, there is a 'lookup table'
> >   containing frequently used values in table entries (of that block).
> > - This lookup table is referenced from within the block.
> 
> Clever idea, pity we can't use it (what's the bet it's patented?). I'd
> wager anything beyond simple compression is patented by someone.
> 
> The biggest issue is really that once postgres reads a block from disk
> and uncompresses it, this block will be much larger than 8K. Somehow
> you have to arrange storage for this.

It's entirely possible that the best performance would be found from not
un-compressing blocks when putting them into shared_buffers, though.
That would mean you'd "only" have to deal with compression when pulling
individual tuples. Simple, right? :)
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Compression and on-disk sorting

From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> What *might* make sense would be to provide two locations for pgsql_tmp,
> because a lot of operations in there involve reading and writing at the
> same time:

> Read from heap while writing tapes to pgsql_tmp
> read from tapes while writing final version to pgsql_tmp

Note that a large part of the reason for the current logtape.c design
is to avoid requiring 2X or more disk space to sort X amount of data.
AFAICS, any design that does the above will put us right back in the 2X
regime.  That's a direct, measurable penalty; it'll take more than
handwaving arguments to convince me we should change it in pursuit of
unquantified speed benefits.
        regards, tom lane


Re: Compression and on-disk sorting

From
"Jim C. Nasby"
Date:
On Wed, May 17, 2006 at 11:38:05AM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > What *might* make sense would be to provide two locations for pgsql_tmp,
> > because a lot of operations in there involve reading and writing at the
> > same time:
> 
> > Read from heap while writing tapes to pgsql_tmp
> > read from tapes while writing final version to pgsql_tmp
> 
> Note that a large part of the reason for the current logtape.c design
> is to avoid requiring 2X or more disk space to sort X amount of data.
> AFAICS, any design that does the above will put us right back in the 2X
> regime.  That's a direct, measurable penalty; it'll take more than
> handwaving arguments to convince me we should change it in pursuit of
> unquantified speed benefits.

I certainly agree that there's no reason to make this change without
testing, but if you've got enough spindles laying around to actually
make use of this I suspect that requiring 2X the disk space to sort X
won't bother you.

Actually, I suspect in most cases it won't matter; I don't think people
make a habit of trying to sort their entire database. :) But we'd want
to protect for the oddball cases... yech.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Compression and on-disk sorting

From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> On Wed, May 17, 2006 at 11:38:05AM -0400, Tom Lane wrote:
>> Note that a large part of the reason for the current logtape.c design
>> is to avoid requiring 2X or more disk space to sort X amount of data.

> Actually, I suspect in most cases it won't matter; I don't think people
> make a habit of trying to sort their entire database. :)

Well, some years ago we used something like 4X space to sort X amount of
data (this is the native behavior of 7-way polyphase merge, it turns out)
and we got yelled at.  Which is what prompted the writing of logtape.c.
Maybe disk space has gotten so cheap since then that it no longer
matters ... but I suspect the nature of DB applications is that people
are always pushing the envelope of what their hardware can do.
        regards, tom lane


Re: Compression and on-disk sorting

From
Rod Taylor
Date:
> Actually, I suspect in most cases it won't matter; I don't think people
> make a habit of trying to sort their entire database. :) But we'd want
> to protect for the oddball cases... yech.

I can make query result sets that are far larger than the database
itself.

create table fat_table_with_few_tuples(fat_status_id serial primary key,
fat_1 text, fat_2 text);

create table thin_table_with_many_tuples(fat_status_id integer
references fat_table_with_few_tuples, thin_1 integer, thin_2 integer);

SELECT * FROM thin_table_with_many_tuples NATURAL JOIN
fat_table_with_few_tuples order by fat_1, thin_1, thin_2, fat_2;


I would be asking the folks trying to use PostgreSQL for data
warehousing what their opinion is. A few fact tables in an audit query
could easily result in a very large amount of temporary diskspace being
required.

-- 



Re: Compression and on-disk sorting

From
Martijn van Oosterhout
Date:
For all those people not subscribed to -patches (should appear in
archive soon), I just posted a patch there implementing zlib compression
for logtape.c. If people have test machines for speed-testing this sort
of stuff, please have at it.

You can also download it here:
http://svana.org/kleptog/temp/compress-sort.patch

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.

Re: Compression and on-disk sorting

From
Greg Stark
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:

> Only if those spindles weren't all in a single RAID array and if we went
> through the trouble of creating all the machinery so you could tell
> PostgreSQL where all those spindles were mounted in the filesystem.

I think the way you do this is simply by documenting that the admin should
create precisely one temp area per physical spindle (or raid array).


-- 
greg



Re: Compression and on-disk sorting

From
Greg Stark
Date:
Andrew Piskorski <atp@piskorski.com> writes:

> Things like enums and 1 bit booleans certainly could be useful, but
> they cannot take advantage of duplicate values across multiple rows at
> all, even if 1000 rows have the exact same value in their "date"
> column and are all in the same disk block, right?

That's an interesting direction to go in. Generic algorithms would still help
in that case: since the identical value would occur more frequently than other
values, it would be encoded as a smaller symbol. But there's going to be a
limit to how compressed it can get the data.

The ideal way to handle the situation you're describing would be to interleave
the tuples so that you have all 1000 values of the first column, followed by
all 1000 values of the second column and so on. Then you run a generic
algorithm on this and it achieves very high compression rates since there are
a lot of repeating patterns.

I don't see how you build a working database with data in this form however.
For example, a single insert would require updating small pieces of data
across the entire table. Perhaps there's some middle ground with interleaving
the tuples within a single compressed page, or something like that?
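
A toy illustration of that interleaving for fixed-width int32 columns
(purely to show the layout idea, not a workable heap format):

#include <stdint.h>

#define NROWS 1000
#define NCOLS 4

/* Transpose a row-major page image into column-major order before
 * compression, so the 1000 values of each column sit contiguously and a
 * generic compressor sees long runs of similar bytes (e.g. 1000 equal
 * dates back to back). */
static void
interleave_page(const int32_t rows[NROWS][NCOLS],
                int32_t cols[NCOLS][NROWS])
{
    for (int c = 0; c < NCOLS; c++)
        for (int r = 0; r < NROWS; r++)
            cols[c][r] = rows[r][c];
}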

-- 
greg



Re: Compression and on-disk sorting

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> The ideal way to handle the situation you're describing would be to interleave
> the tuples so that you have all 1000 values of the first column, followed by
> all 1000 values of the second column and so on. Then you run a generic
> algorithm on this and it achieves very high compression rates since there are
> a lot of repeating patterns.

It's not obvious to me that that yields a form more compressible than
what we have now.  As long as the previous value is within the lookback
window, an LZ-style compressor will still be able to use it.  More
importantly, the layout you describe would be unable to take advantage
of any cross-column correlation, which in real data is likely to be a
useful property for compression.
        regards, tom lane


Re: Compression and on-disk sorting

From
"Jim C. Nasby"
Date:
On Wed, May 17, 2006 at 12:55:53PM -0400, Greg Stark wrote:
> 
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> 
> > Only if those spindles weren't all in a single RAID array and if we went
> > through the trouble of creating all the machinery so you could tell
> > PostgreSQL where all those spindles were mounted in the filesystem.
> 
> I think the way you do this is simply by documenting that the admin should
> create precisely one temp area per physical spindle (or raid array).

And you still need some way to tell PostgreSQL about all of that.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Compression and on-disk sorting

From
"Jim C. Nasby"
Date:
On Wed, May 17, 2006 at 12:16:13PM -0400, Rod Taylor wrote:
> > Actually, I suspect in most cases it won't matter; I don't think people
> > make a habit of trying to sort their entire database. :) But we'd want
> > to protect for the oddball cases... yech.
> 
> I can make query result sets that are far larger than the database
> itself.
> 
> create table fat_table_with_few_tuples(fat_status_id serial primary key,
> fat_1 text, fat_2 text);
> 
> create table thin_table_with_many_tuples(fat_status_id integer
> references fat_table_with_few_tuples, thin_1 integer, thin_2 integer);
> 
> SELECT * FROM thin_table_with_many_tuples NATURAL JOIN
> fat_table_with_few_tuples order by fat_1, thin_1, thin_2, fat_2;
> 
> 
> I would be asking the folks trying to use PostgreSQL for data
> warehousing what their opinion is. A few fact tables in an audit query
> could easily result in a very large amount of temporary diskspace being
> required.

Note my last sentence: we'd need to provide for cases where this was a
problem. How much that would complicate the code, I don't know...

This is another case where someone (with more skills than me) would need
to hack the backend enough to be able to test it and see how big a
performance gain there was.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Compression and on-disk sorting

From
Hannu Krosing
Date:
On Wednesday, 2006-05-17 at 10:01, Jim C. Nasby wrote:
> If we're going to consider table-level compression, ISTM the logical
> first step is to provide greater control over TOASTing; namely
> thresholds for when to compress and/or go to external storage that can
> be set on a per-field or at least per-table basis.

also, we would get a lot of "compression" if we could get rid of the index
on the toast table and use the ctid directly.

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com




Re: Compression and on-disk sorting

From
"Jim C. Nasby"
Date:
On Wed, May 17, 2006 at 10:55:19PM +0300, Hannu Krosing wrote:
> On Wednesday, 2006-05-17 at 10:01, Jim C. Nasby wrote:
> > If we're going to consider table-level compression, ISTM the logical
> > first step is to provide greater control over TOASTing; namely
> > thresholds for when to compress and/or go to external storage that can
> > be set on a per-field or at least per-table basis.
> 
> also, we would get a lot of "compression" if we could get rid of the index
> on the toast table and use the ctid directly.

It'd also be damn handy to be able to ditch the 2-pass vacuum
requirement. I've often wondered if treating the toast table as a real
table was overkill; for example, it should be possible to include WAL
information for it in with the parent table. That plus using ctid as a
reference would hopefully allow for removing a lot of overhead from it.
Presumably all the MVCC info could go, since that would be taken care of
by the parent table.

Of course the downside is that it'd mean a different set of code for
handling toast storage...
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Compression and on-disk sorting

From
Greg Stark
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:

> On Wed, May 17, 2006 at 12:55:53PM -0400, Greg Stark wrote:
> > 
> > "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > 
> > > Only if those spindles weren't all in a single RAID array and if we went
> > > through the trouble of creating all the machinery so you could tell
> > > PostgreSQL where all those spindles were mounted in the filesystem.
> > 
> > I think the way you do this is simply by documenting that the admin should
> > create precisely one temp area per physical spindle (or raid array).
> 
> And you still need some way to tell PostgreSQL about all of that.

No, my point was that you tell Postgres how many spindles you have and where to
find them by creating precisely one temp area on each spindle. It then knows
that it should strive to maximize sequential reads within one temp area and
expect switching between temp areas (which represent multiple spindles) to be
better than multiplexing multiple tapes within a single temp area (which
represents a single spindle).

-- 
greg



Re: Compression and on-disk sorting

From
"Jim C. Nasby"
Date:
On Wed, May 17, 2006 at 05:44:22PM -0400, Greg Stark wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> 
> > On Wed, May 17, 2006 at 12:55:53PM -0400, Greg Stark wrote:
> > > 
> > > "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > > 
> > > > Only if those spindles weren't all in a single RAID array and if we went
> > > > through the trouble of creating all the machinery so you could tell
> > > > PostgreSQL where all those spindles were mounted in the filesystem.
> > > 
> > > I think the way you do this is simply by documenting that the admin should
> > > create precisely one temp area per physical spindle (or raid array).
> > 
> > And you still need some way to tell PostgreSQL about all of that.
> 
> No, my point was that you tell Postgres how many spindles you have and where to
> find them by creating precisely one temp area on each spindle. It then knows

Which means we need all the interface bits to be able to tell PostgreSQL
where every single temp storage area is. Presumably much of the
tablespace mechanism could be used for this, but it's still a bunch of
work. And you can't just say "I have 8 spindles", you have to tell
PostgreSQL exactly where to put each temporary area (unless you just
have it put one on every tablespace you have defined).

> that it should strive to maximize sequential reads within one temp area and
> expect switching between temp areas (which represent multiple spindles) to be
> better than multiplexing multiple tapes within a single temp area (which
> represents a single spindle).

Which adds yet more complexity to all the code that uses the temp area.
And as others have brought up, you still have to allow for the case when
splitting all of this out into multiple files means you end up using
substantially more disk space. That further drives up the complexity.

My point is that unless someone shows that there's a non-trivial
performance gain here, it's not going to happen.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Compression and on-disk sorting

From
Greg Stark
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:

> Which means we need all the interface bits to be able to tell PostgreSQL
> where every single temp storage area is. Presumably much of the
> tablespace mechanism could be used for this, but it's still a bunch of
> work. And you can't just say "I have 8 spindles", you have to tell
> PostgreSQL exactly where to put each temporary area (unless you just
> have it put one on every tablespace you have defined).

Yes, if you have more than one temporary area you definitely need a way to
tell Postgres where to put them since obviously they won't be in the postgres
base directory. But I think that's all Postgres really needs to know.

One could imagine a more complex version where Postgres has meta information
about the bandwidth and seek penalty for each sort area separately. Presumably
also for each table space. But that's a whole lot more complexity than
Postgres's current cost model.


> > that it should strive to maximize sequential reads within one temp area and
> > expect switching between temp areas (which represent multiple spindles) to be
> > better than multiplexing multiple tapes within a single temp area (which
> > represents a single spindle).
> 
> Which adds yet more complexity to all the code that uses the temp area.
> And as others have brought up, you still have to allow for the case when
> splitting all of this out into multiple files means you end up using
> substantially more disk space. That further drives up the complexity.

You also have to consider that it won't always be a benefit to spread the sort
over multiple sort areas. If there's only one sort going on and you can reach
a 1:1 ratio between tapes and spindles then I think it would be a huge
benefit. Effectively boosting the sort speed by random_page_cost.

But if you don't have as many spindles as your algorithm needs tapes
then it's unclear which to multiplex down and whether you gain any benefit
once you're multiplexing over simply using a single sort area.

And worse, if there are multiple sorts going on in the system then you're not
going to get sequential access even if you have multiple sort areas available.
If you have N sort areas and N sorts are going on then you're probably better
off multiplexing each one down to a single sort area and letting them each
proceed without interfering with each other rather than having each one hog
all the sort areas and forcing the OS to do the multiplexing blindly.

> My point is that unless someone shows that there's a non-trivial
> performance gain here, it's not going to happen.

I think two extreme cases are well worth pursuing: 

1) Use n sort areas for n tapes making everything purely sequential access.
   That would be useful for large DSS systems where large sorts are running
   and i/o bandwidth is high for sequential access. That gives effectively a
   random_page_cost speed boost.

2) Use the current algorithm unchanged but have each sort use a different sort
   area in some sort of round-robin fashion. That helps the OLTP type
   environment (ignoring for the moment that OLTP environments really ought
   not be doing disk sorts) where people complain about unreliable execution
   times more than slow execution times. If you can provide enough sort areas
   then it would remove one big reason concurrent queries impact the
   execution time of other queries.

-- 
greg



Re: Compression and on-disk sorting

From
Florian Weimer
Date:
* Jim C. Nasby:

> The problem is that it seems like there's never enough ability to clue
> the OS in on what the application is trying to accomplish. For a long
> time we didn't have a background writer, because the OS should be able
> to flush things out on its own before checkpoint. Now there's talk of a
> background reader, because backends keep stalling on waiting on disk IO.

I've recently seen this on one of our test systems -- neither CPU nor
disk I/O were maxed out.

Have you considered using asynchronous I/O?  Maybe it results in less
complexity and fewer context switches than a background reader.


Re: Compression and on-disk sorting

From
"Zeugswetter Andreas DCP SD"
Date:
> 1) Use n sort areas for n tapes making everything purely sequential
access.

Some testing I did a while ago showed that if the IO block size is large
enough (256k), it does not really matter that much whether the blocks are at
random locations. I think that is still true for current model disks.

So unless we parallelize, it is imho sufficient to see to it that we write
(and read) large enough blocks with single calls. This also has no problem
in highly concurrent scenarios, where you do not have enough spindles.
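(Back-of-envelope support for that, with assumed, purely illustrative numbers
of 8 ms per seek and 60 MB/s sequential transfer:)

seek_ms = 8.0        # assumed average seek + rotational latency
seq_mb_s = 60.0      # assumed sequential transfer rate

for blk_kb in (8, 64, 256, 1024):
    xfer_ms = blk_kb / 1024.0 / seq_mb_s * 1000.0
    eff_mb_s = (blk_kb / 1024.0) / ((seek_ms + xfer_ms) / 1000.0)
    print("%5d KB blocks: %4.1f MB/s effective at random locations"
          % (blk_kb, eff_mb_s))

At 8 KB per I/O a random layout gets about 1 MB/s out of such a disk; at
256 KB it is already above 20 MB/s, i.e. within a small factor of sequential.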

Andreas


Re: Compression and on-disk sorting

From
Simon Riggs
Date:
On Tue, 2006-05-16 at 15:42 -0500, Jim C. Nasby wrote:
> On Tue, May 16, 2006 at 12:31:07PM -0500, Jim C. Nasby wrote:
> > In any case, my curiosity is aroused, so I'm currently benchmarking
> > pgbench on both a compressed and uncompressed $PGDATA/base. I'll also do
> > some benchmarks with pg_tmp compressed.
>  
> Results: http://jim.nasby.net/bench.log
> 
> As expected, compressing $PGDATA/base was a loss. But compressing
> pgsql_tmp and then doing some disk-based sorts did show an improvement,
> from 366.1 seconds to 317.3 seconds, an improvement of 13.3%. This is on
> a Windows XP laptop (Dell Latitude D600) with 512MB, so it's somewhat of
> a worst-case scenario. On the other hand, XP's compression algorithm
> appears to be pretty aggressive, as it cut the size of the on-disk sort
> file from almost 700MB to 82MB. There's probably gains to be had from a
> different compression algorithm.

Can you re-run these tests using "SELECT aid FROM accounts ..."?
"SELECT 1 ..." is of course highly compressible.

I also note that the compressed file fits within memory and may not even
have been written out at all. That's good, but this sounds like the very
best case of what we can hope to achieve. We need to test a whole range
of cases to see if it is generally applicable, or only in certain cases
- and if so which ones.

Would you be able to write up some extensive testing of Martijn's patch?
He's followed through on your idea and written it (on -patches now...)

-- 
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com



Re: Compression and on-disk sorting

From
Bruce Momjian
Date:
Uh, TODO already has:
        o %Add a GUC variable to control the tablespace for temporary objects
          and sort files

          It could start with a random tablespace from a supplied list and
          cycle through the list.

Do we need to add to this?
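(For what it's worth, the cycling behaviour that TODO item describes amounts
to something like this; a sketch only, with the GUC value and list handling
invented here rather than taken from the backend:)

import itertools, random

# hypothetical GUC value: comma-separated list of temp-file tablespaces
temp_tablespaces = "temp_a,temp_b,temp_c".split(",")

# start at a random entry, then cycle round-robin for each new temp file
start = random.randrange(len(temp_tablespaces))
cycle = itertools.cycle(temp_tablespaces[start:] + temp_tablespaces[:start])

for sortfile in range(5):
    print("temp file", sortfile, "-> tablespace", next(cycle))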

---------------------------------------------------------------------------

Greg Stark wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> 
> > Which means we need all the interface bits to be able to tell PostgreSQL
> > where every single temp storage area is. Presumably much of the
> > tablespace mechanism could be used for this, but it's still a bunch of
> > work. And you can't just say "I have 8 spindles", you have to tell
> > PostgreSQL exactly where to put each temporary area (unless you just
> > have it put one on every tablespace you have defined).
> 
> Yes, if you have more than one temporary area you definitely need a way to
> tell Postgres where to put them since obviously they won't be in the postgres
> base directory. But I think that's all Postgres really needs to know.
> 
> One could imagine a more complex version where Postgres has meta information
> about the bandwidth and seek penalty for each sort area separately. Presumably
> also for each table space. But that's a whole lot more complexity than
> Postgres's current cost model.
> 
> 
> > > that it should strive to maximize sequential reads within one temp area and
> > > expect switching between temp areas (which represent multiple spindles) to be
> > > better than multiplexing multiple tapes within a single temp area (which
> > > represents a single spindle).
> > 
> > Which adds yet more complexity to all the code that uses the temp area.
> > And as others have brought up, you still have to allow for the case when
> > splitting all of this out into multiple files means you end up using
> > substantially more disk space. That further drives up the complexity.
> 
> You also have to consider that it won't always be a benefit to spread the sort
> over multiple sort areas. If there's only one sort going on and you can reach
> a 1:1 ratio between tapes and spindles then I think it would be a huge
> benefit. Effectively boosting the sort speed by random_page_cost.
> 
> But if you don't have as many spindles as your algorithm needs tapes
> then it's unclear which to multiplex down and whether you gain any benefit
> once you're multiplexing over simply using a single sort area.
> 
> And worse, if there are multiple sorts going on in the system then you're not
> going to get sequential access even if you have multiple sort areas available.
> If you have N sort areas and N sorts are going on then you're probably better
> off multiplexing each one down to a single sort area and letting them each
> proceed without interfering with each other rather than having each one hog
> all the sort areas and forcing the OS to do the multiplexing blindly.
> 
> > My point is that unless someone shows that there's a non-trivial
> > performance gain here, it's not going to happen.
> 
> I think two extreme cases are well worth pursuing: 
> 
> 1) Use n sort areas for n tapes making everything purely sequential access.
>    That would be useful for large DSS systems where large sorts are running
>    and i/o bandwidth is high for sequential access. That gives effectively a
>    random_page_cost speed boost.
> 
> 2) Use the current algorithm unchanged but have each sort use a different sort
>    area in some sort of round-robin fashion. That helps the OLTP type
>    environment (ignoring for the moment that OLTP environments really ought
>    not be doing disk sorts) where people complain about unreliable execution
>    times more than slow execution times. If you can provide enough sort areas
>    then it would remove one big reason concurrent queries impact the
>    execution time of other queries.
> 
> -- 
> greg
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
> 
>                http://www.postgresql.org/docs/faq
> 

-- 
  Bruce Momjian   http://candle.pha.pa.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Compression and on-disk sorting

From
"Jim C. Nasby"
Date:
On Thu, May 18, 2006 at 11:34:51AM +0100, Simon Riggs wrote:
> On Tue, 2006-05-16 at 15:42 -0500, Jim C. Nasby wrote:
> > On Tue, May 16, 2006 at 12:31:07PM -0500, Jim C. Nasby wrote:
> > > In any case, my curiosity is aroused, so I'm currently benchmarking
> > > pgbench on both a compressed and uncompressed $PGDATA/base. I'll also do
> > > some benchmarks with pg_tmp compressed.
> >  
> > Results: http://jim.nasby.net/bench.log
> > 
> > As expected, compressing $PGDATA/base was a loss. But compressing
> > pgsql_tmp and then doing some disk-based sorts did show an improvement,
> > from 366.1 seconds to 317.3 seconds, an improvement of 13.3%. This is on
> > a Windows XP laptop (Dell Latitude D600) with 512MB, so it's somewhat of
> > a worst-case scenario. On the other hand, XP's compression algorithm
> > appears to be pretty aggressive, as it cut the size of the on-disk sort
> > file from almost 700MB to 82MB. There's probably gains to be had from a
> > different compression algorithm.
> 
> Can you re-run these tests using "SELECT aid FROM accounts ..."?
> "SELECT 1 ..." is of course highly compressible.
> 
> I also note that the compressed file fits within memory and may not even
> have been written out at all. That's good, but this sounds like the very
> best case of what we can hope to achieve. We need to test a whole range
> of cases to see if it is generally applicable, or only in certain cases
> - and if so which ones.
> 
> Would you be able to write up some extensive testing of Martijn's patch?
> He's followed through on your idea and written it (on -patches now...)

Yes, I'm working on that. I'd rather test his stuff than XP's
compression anyway.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Compression and on-disk sorting

From
"Jim C. Nasby"
Date:
On Thu, May 18, 2006 at 10:57:16AM +0200, Zeugswetter Andreas DCP SD wrote:
> 
> > 1) Use n sort areas for n tapes making everything purely sequential access.
> 
> Some testing I did a while ago showed that if the IO block size is large
> enough (256k), it does not really matter that much whether the blocks are at
> random locations. I think that is still true for current model disks.
> 
> So unless we parallelize, it is imho sufficient to see to it that we write
> (and read) large enough blocks with single calls. This also has no problem
> in highly concurrent scenarios, where you do not have enough spindles.

AFAIK logtape currently reads in much less than 256k blocks. Of course
if you get lucky you'll read from one tape for some time before
switching to another, which should have a sort-of similar effect if the
drives aren't very busy with other things.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Compression and on-disk sorting

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Uh, TODO already has:

>         o %Add a GUC variable to control the tablespace for temporary objects
>           and sort files

>           It could start with a random tablespace from a supplied list and
>           cycle through the list.

> Do we need to add to this?

The list bit strikes me as lily-gilding, or at least designing features
well beyond proven need.
        regards, tom lane


Re: Compression and on-disk sorting

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Uh, TODO already has:
> 
> >         o %Add a GUC variable to control the tablespace for temporary objects
> >           and sort files
> 
> >           It could start with a random tablespace from a supplied list and
> >           cycle through the list.
> 
> > Do we need to add to this?
> 
> The list bit strikes me as lily-gilding, or at least designing features
> well beyond proven need.

I have seen other databases do the round-robin idea, so it is worth
testing.

-- 
  Bruce Momjian   http://candle.pha.pa.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Compression and on-disk sorting

From
Martijn van Oosterhout
Date:
On Thu, May 18, 2006 at 11:22:46AM -0500, Jim C. Nasby wrote:
> AFAIK logtape currently reads in much less than 256k blocks. Of course
> if you get lucky you'll read from one tape for some time before
> switching to another, which should have a sort-of similar effect if the
> drives aren't very busy with other things.

Logtape works with BLCKSZ blocks. Whether it's advisable or not, I
don't know. One thing I'm noticing with this compression-in-logtape is
that the memory cost per tape increases considerably. Currently we rely
heavily on the OS to cache for us.

Let's say we buffer 256KB per tape, and we assume a large sort with many
tapes, you're going to blow all your work_mem on buffers. If using
compression uses more memory so that you can't have as many tapes and
thus need multiple passes, well, we need to test if this is still a
win.
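(To put rough numbers on that, with a work_mem value assumed just for
illustration:)

work_mem_kb = 16 * 1024   # assume work_mem = 16MB, for illustration only
blcksz_kb = 8             # roughly what logtape buffers per tape today (BLCKSZ)
big_buf_kb = 256          # the per-tape buffer size discussed above

for buf_kb in (blcksz_kb, big_buf_kb):
    print("%3d KB per tape -> buffers alone cap you at %4d tapes"
          % (buf_kb, work_mem_kb // buf_kb))

With 256 KB buffers a 16 MB work_mem supports at most 64 tapes before the
buffers eat the whole budget, so a wide merge either shrinks its buffers or
falls back to multiple passes.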

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.

Re: Compression and on-disk sorting

From
"Jim C. Nasby"
Date:
On Thu, May 18, 2006 at 08:32:10PM +0200, Martijn van Oosterhout wrote:
> On Thu, May 18, 2006 at 11:22:46AM -0500, Jim C. Nasby wrote:
> > AFAIK logtape currently reads in much less than 256k blocks. Of course
> > if you get lucky you'll read from one tape for some time before
> > switching to another, which should have a sort-of similar effect if the
> > drives aren't very busy with other things.
> 
> Logtape works with BLCKSZ blocks. Whether it's advisable or not, I
> don't know. One thing I'm noticing with this compression-in-logtape is
> that the memory cost per tape increases considerably. Currently we rely
> heavily on the OS to cache for us.
> 
> Let's say we buffer 256KB per tape, and we assume a large sort with many
> tapes, you're going to blow all your work_mem on buffers. If using
> compression uses more memory so that you can't have as many tapes and
> thus need multiple passes, well, we need to test if this is still a
> win.

So you're sticking with 8K blocks on disk, after compression? And then
decompressing blocks as they come in?

Actually, I guess the amount of memory used for zlib's lookback buffer
(or whatever they call it) could be pretty substantial, and I'm not sure
if there would be a way to combine that across all tapes.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Compression and on-disk sorting

From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> Actually, I guess the amount of memory used for zlib's lookback buffer
> (or whatever they call it) could be pretty substantial, and I'm not sure
> if there would be a way to combine that across all tapes.

But there's only one active write tape at a time.  My recollection of
zlib is that compression is memory-hungry but decompression not so much,
so it seems like this shouldn't be a huge deal.
        regards, tom lane


Re: Compression and on-disk sorting

From
"Jim C. Nasby"
Date:
On Thu, May 18, 2006 at 04:55:17PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > Actually, I guess the amount of memory used for zlib's lookback buffer
> > (or whatever they call it) could be pretty substantial, and I'm not sure
> > if there would be a way to combine that across all tapes.
> 
> But there's only one active write tape at a time.  My recollection of
> zlib is that compression is memory-hungry but decompression not so much,
> so it seems like this shouldn't be a huge deal.

It seems more appropriate to discuss results here, rather than on
-patches...

http://jim.nasby.net/misc/compress_sort.txt is preliminary results.
I've run into a slight problem in that even at a compression level of
-3, zlib is cutting the on-disk size of sorts by 25x. So my pgbench sort
test with scale=150 that was producing a 2G on-disk sort is now
producing a 80M sort, which obviously fits in memory. And cuts sort
times by more than half.

So, if nothing else, it looks like compression is definitely a win if it
means you can now fit the sort within the disk cache. While that doesn't
sound like something very worthwhile, it essentially extends work_mem
from a fraction of memory to up to ~25x memory.

I'm currently loading up a pgbench database with a scaling factor of
15000; hopefully I'll have results from that testing in the morning.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Compression and on-disk sorting

From
Martijn van Oosterhout
Date:
On Thu, May 18, 2006 at 10:02:44PM -0500, Jim C. Nasby wrote:
> http://jim.nasby.net/misc/compress_sort.txt is preliminary results.
> I've run into a slight problem in that even at a compression level of
> -3, zlib is cutting the on-disk size of sorts by 25x. So my pgbench sort
> test with scale=150 that was producing a 2G on-disk sort is now
> producing a 80M sort, which obviously fits in memory. And cuts sort
> times by more than half.

I'm seeing 250,000 blocks being cut down to 9,500 blocks. That's almost
unbelievable. What's in the table? It would seem to imply that our
tuple format is far more compressible than we expected.

Do you have any stats on CPU usage? Memory usage?

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.

Re: Compression and on-disk sorting

From
"Luke Lonergan"
Date:
Jim,

> http://jim.nasby.net/misc/compress_sort.txt is preliminary results.
> I've run into a slight problem in that even at a compression
> level of -3, zlib is cutting the on-disk size of sorts by
> 25x. So my pgbench sort test with scale=150 that was
> producing a 2G on-disk sort is now producing a 80M sort,
> which obviously fits in memory. And cuts sort times by more than half.

When you're ready, we can test this on some other interesting cases and
on fast hardware.

BTW - external sorting is *still* 4x slower than a popular commercial DBMS
(PCDB) on real workloads when full rows are used in queries.  The final
results we had after the last round of sort improvements were limited to
cases where only the sort column was used in the query; for that case the
improved external sort code was as fast as PCDB, provided lots of work_mem
is used.  But when the whole contents of the row are consumed (as with
TPC-H and in many real world cases) the performance is still far slower.

So, compression of the tuples may be just what we're looking for.

- Luke



Re: Compression and on-disk sorting

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> I'm seeing 250,000 blocks being cut down to 9,500 blocks. That's almost
> unbelievable. What's in the table?

Yeah, I'd tend to question the test data being used.  gzip does not do
that well on typical text (especially not at the lower settings we'd
likely want to use).
        regards, tom lane


Re: Compression and on-disk sorting

From
Martijn van Oosterhout
Date:
On Fri, May 19, 2006 at 09:03:31AM -0400, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > I'm seeing 250,000 blocks being cut down to 9,500 blocks. That's almost
> > unbelievable. What's in the table?
>
> Yeah, I'd tend to question the test data being used.  gzip does not do
> that well on typical text (especially not at the lower settings we'd
> likely want to use).

However, postgres tables are very highly compressible; 10-to-1 is not
that uncommon. pg_proc and pg_index compress by that for example.
Indexes compress even more (a few on my system compress 25-to-1 but
that could just be slack space, the record being 37-to-1
(pg_constraint_conname_nsp_index)).

The only table on my test system over 32KB that doesn't reach 2-to-1
compression with gzip -3 is one of the toast tables.

So getting 25-to-1 is a lot, but possibly not that extreme.
pg_statistic, which is about as close to random data as you're going to
get on a postgres system, compresses 5-to-1.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.

Re: Compression and on-disk sorting

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> However, postgres tables are very highly compressible; 10-to-1 is not
> that uncommon. pg_proc and pg_index compress by that for example.
> Indexes compress even more (a few on my system compress 25-to-1 but
> that could just be slack space, the record being 37-to-1
> (pg_constraint_conname_nsp_index)).

Anything containing a column of type "name" will compress amazingly well
because of all the padding spaces.  I don't think that's representative
of user data though ... except maybe for the occasional novice using
"char(255)" ...
        regards, tom lane


Re: Compression and on-disk sorting

From
"Jim C. Nasby"
Date:
On Fri, May 19, 2006 at 09:29:03AM +0200, Martijn van Oosterhout wrote:
> On Thu, May 18, 2006 at 10:02:44PM -0500, Jim C. Nasby wrote:
> > http://jim.nasby.net/misc/compress_sort.txt is preliminary results.
> > I've run into a slight problem in that even at a compression level of
> > -3, zlib is cutting the on-disk size of sorts by 25x. So my pgbench sort
> > test with scale=150 that was producing a 2G on-disk sort is now
> > producing a 80M sort, which obviously fits in memory. And cuts sort
> > times by more than half.
> 
> I'm seeing 250,000 blocks being cut down to 9,500 blocks. That's almost
> unbelievable. What's in the table? It would seem to imply that our
> tuple format is far more compressible than we expected.

It's just SELECT count(*) FROM (SELECT * FROM accounts ORDER BY bid) a;
If the tape routines were actually storing visibility information, I'd
expect that to be pretty compressible in this case since all the tuples
were presumably created in a single transaction by pgbench.

If need be, I could try the patch against http://stats.distributed.net,
assuming that it would apply to REL_8_1.

> Do you have any stats on CPU usage? Memory usage?

I've only been taking a look at vmstat from time-to-time, and I have yet
to see the machine get CPU-bound. Haven't really paid much attention to
memory. Is there anything in particular you're looking for? I can log
vmstat for the next set of runs (with a scaling factor of 10000). I plan
on doing those runs tonight...
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Compression and on-disk sorting

From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> On Fri, May 19, 2006 at 09:29:03AM +0200, Martijn van Oosterhout wrote:
>> I'm seeing 250,000 blocks being cut down to 9,500 blocks. That's almost
>> unbelievable. What's in the table? It would seem to imply that our
>> tuple format is far more compressible than we expected.

> It's just SELECT count(*) FROM (SELECT * FROM accounts ORDER BY bid) a;
> If the tape routines were actually storing visibility information, I'd
> expect that to be pretty compressible in this case since all the tuples
> were presumably created in a single transaction by pgbench.

It's worse than that: IIRC what passes through a heaptuple sort are
tuples manufactured by heap_form_tuple, which will have consistently
zeroed header fields.  However, the above isn't very helpful since the
rest of us have no idea what that "accounts" table contains.  How wide
is the tuple data, and what's in it?

(This suggests that we might try harder to strip unnecessary header info
from tuples being written to tape inside tuplesort.c.  I think most of
the required fields could be reconstructed given the TupleDesc.)
        regards, tom lane


Re: Compression and on-disk sorting

From
Hannu Krosing
Date:
On Friday, 2006-05-19 at 14:53, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > On Fri, May 19, 2006 at 09:29:03AM +0200, Martijn van Oosterhout wrote:
> >> I'm seeing 250,000 blocks being cut down to 9,500 blocks. That's almost
> >> unbelievable. What's in the table? It would seem to imply that our
> >> tuple format is far more compressible than we expected.
> 
> > It's just SELECT count(*) FROM (SELECT * FROM accounts ORDER BY bid) a;
> > If the tape routines were actually storing visibility information, I'd
> > expect that to be pretty compressible in this case since all the tuples
> > were presumably created in a single transaction by pgbench.
> 
> It's worse than that: IIRC what passes through a heaptuple sort are
> tuples manufactured by heap_form_tuple, which will have consistently
> zeroed header fields.  However, the above isn't very helpful since the
> rest of us have no idea what that "accounts" table contains.  How wide
> is the tuple data, and what's in it?

Was he not using pg_bench data ?

> (This suggests that we might try harder to strip unnecessary header info
> from tuples being written to tape inside tuplesort.c.  I think most of
> the required fields could be reconstructed given the TupleDesc.)

I guess that tapefiles compress better than the average table because they
are sorted, and thus at least a little more repetitive than the rest.
If there are varlen types, then they usually also have an abundance of
small 4-byte integers, which should also compress at least better than
4/1, maybe a lot better.


-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com




Re: Compression and on-disk sorting

From
Martijn van Oosterhout
Date:
On Fri, May 19, 2006 at 10:02:50PM +0300, Hannu Krosing wrote:
> > > It's just SELECT count(*) FROM (SELECT * FROM accounts ORDER BY bid) a;
> > > If the tape routines were actually storing visibility information, I'd
> > > expect that to be pretty compressible in this case since all the tuples
> > > were presumably created in a single transaction by pgbench.
>
> Was he not using pg_bench data ?

Hmm, so there were only 3 integer fields and one varlena structure which
was always empty. Prepend that with a tuple header whose fields are mostly
blank, or at least repeated, and yes, I can see how we might get 25-to-1
compression.

Maybe we need to change pgbench so that it puts random text in the
filler field, that would at least put some strain on the compression
algorithm...

> I guess that tapefiles compress better than the average table because they
> are sorted, and thus at least a little more repetitive than the rest.
> If there are varlen types, then they usually also have an abundance of
> small 4-byte integers, which should also compress at least better than
> 4/1, maybe a lot better.

Hmm, that makes sense. That also explains the 37-to-1 compression I was
seeing on indexes :).

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.

Re: Compression and on-disk sorting

From
"Jim C. Nasby"
Date:
On Fri, May 19, 2006 at 10:02:50PM +0300, Hannu Krosing wrote:
> On Friday, 2006-05-19 at 14:53, Tom Lane wrote:
> > "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > > On Fri, May 19, 2006 at 09:29:03AM +0200, Martijn van Oosterhout wrote:
> > >> I'm seeing 250,000 blocks being cut down to 9,500 blocks. That's almost
> > >> unbelievable. What's in the table? It would seem to imply that our
> > >> tuple format is far more compressible than we expected.
> > 
> > > It's just SELECT count(*) FROM (SELECT * FROM accounts ORDER BY bid) a;
> > > If the tape routines were actually storing visibility information, I'd
> > > expect that to be pretty compressible in this case since all the tuples
> > > were presumably created in a single transaction by pgbench.
> > 
> > It's worse than that: IIRC what passes through a heaptuple sort are
> > tuples manufactured by heap_form_tuple, which will have consistently
> > zeroed header fields.  However, the above isn't very helpful since the
> > rest of us have no idea what that "accounts" table contains.  How wide
> > is the tuple data, and what's in it?
> 
> Was he not using pg_bench data ?

I am. For reference:

bench=# \d accounts
       Table "public.accounts"
  Column  |     Type      | Modifiers
----------+---------------+-----------
 aid      | integer       | not null
 bid      | integer       |
 abalance | integer       |
 filler   | character(84) |


> > (This suggests that we might try harder to strip unnecessary header info
> > from tuples being written to tape inside tuplesort.c.  I think most of
> > the required fields could be reconstructed given the TupleDesc.)
> 
> I guess that tapefiles compress better than the average table because they
> are sorted, and thus at least a little more repetitive than the rest.
> If there are varlen types, then they usually also have an abundance of
> small 4-byte integers, which should also compress at least better than
> 4/1, maybe a lot better.

If someone wants to provide a patch that strips out the headers I can test that
as well.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Compression and on-disk sorting

From
"Jim C. Nasby"
Date:
On Fri, May 19, 2006 at 09:29:44PM +0200, Martijn van Oosterhout wrote:
> On Fri, May 19, 2006 at 10:02:50PM +0300, Hannu Krosing wrote:
> > > > It's just SELECT count(*) FROM (SELECT * FROM accounts ORDER BY bid) a;
> > > > If the tape routines were actually storing visibility information, I'd
> > > > expect that to be pretty compressible in this case since all the tuples
> > > > were presumably created in a single transaction by pgbench.
> > 
> > Was he not using pg_bench data ?
> 
> Hmm, so there were only 3 integer fields and one varlena structure which
> was always empty. Prepend that with a tuple header whose fields are mostly
> blank, or at least repeated, and yes, I can see how we might get 25-to-1
> compression.
> 
> Maybe we need to change pgbench so that it puts random text in the
> filler field, that would at least put some strain on the compression
> algorithm...

Wow, I thought there was actually something in there...

True random data wouldn't be such a great test either; what would
probably be best is a set of random words, since in real life you're
unlikely to have truly random data.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Compression and on-disk sorting

From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> True random data wouldn't be such a great test either; what would
> probably be best is a set of random words, since in real life you're
> unlikely to have truly random data.

True random data would provide worst-case compression behavior, so
we'd want to try that to find out what the downside is; but we shouldn't
consider it to be the design center.
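(The two extremes are easy to eyeball with zlib directly; a throwaway sketch,
with the usual /usr/share/dict/words path assumed to exist:)

import os, random, zlib

incompressible = os.urandom(1 << 20)              # worst case: true random bytes
words = open('/usr/share/dict/words').read().split()
texty = ' '.join(random.choice(words) for _ in range(200000)).encode()[:1 << 20]

for name, data in (("random bytes", incompressible), ("random words", texty)):
    print(name, "%.2fx" % (len(data) / float(len(zlib.compress(data, 3)))))

Random bytes come out at about 1.0x (zlib can even expand them slightly), while
random English words still compress a few times over, which is closer to what
real text columns would do.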
        regards, tom lane


Re: Compression and on-disk sorting

From
Hannu Krosing
Date:
On Friday, 2006-05-19 at 14:57, Jim C. Nasby wrote:
> On Fri, May 19, 2006 at 09:29:44PM +0200, Martijn van Oosterhout wrote:
> > On Fri, May 19, 2006 at 10:02:50PM +0300, Hannu Krosing wrote:
> > > > > It's just SELECT count(*) FROM (SELECT * FROM accounts ORDER BY bid) a;
> > > > > If the tape routines were actually storing visibility information, I'd
> > > > > expect that to be pretty compressible in this case since all the tuples
> > > > > were presumably created in a single transaction by pgbench.
> > > 
> > > Was he not using pg_bench data ?
> > 
> > Hmm, so there were only 3 integer fields and one varlena structure which
> > was always empty. Prepend that with a tuple header whose fields are mostly
> > blank, or at least repeated, and yes, I can see how we might get 25-to-1
> > compression.
> > 
> > Maybe we need to change pgbench so that it puts random text in the
> > filler field, that would at least put some strain on the compression
> > algorithm...
> 
> Wow, I thought there was actually something in there...
> 
> True random data wouldn't be such a great test either; what would
> probably be best is a set of random words, since in real life you're
> unlikely to have truly random data.

I usually use something like the following for my "random name" tests:

#!/usr/bin/python

import random

words = [line.strip() for line in open('/usr/share/dict/words')]

def make_random_name(min_items, max_items):
    l = []
    for w in range(random.randint(min_items, max_items)):
        l.append(random.choice(words))
    return ' '.join(l)
 

it gives out somewhat justifiable but still quite amusing results:

>>> make_random_name(2,4)
'encroaches Twedy'
>>> make_random_name(2,4)
'annuloida Maiah commends imputatively'
>>> make_random_name(2,4)
'terebral wine-driven pacota'
>>> make_random_name(2,4)
'ballads disenfranchise cabriolets spiny-fruited'
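(Continuing that script, one way to turn it into less compressible pgbench
data would be to write a COPY file for the accounts table; just a sketch, with
the column order aid, bid, abalance, filler as in the pgbench schema quoted
elsewhere in this thread and an arbitrary output path:)

# load afterwards with:  COPY accounts FROM '/tmp/accounts.copy';
with open('/tmp/accounts.copy', 'w') as f:
    for aid in range(1, 100001):                       # one pgbench scale unit
        filler = make_random_name(2, 4)[:84]           # fits the character(84) column
        f.write("%d\t%d\t%d\t%s\n" % (aid, 1 + (aid - 1) // 100000, 0, filler))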


-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com




Re: Compression and on-disk sorting

From
Martijn van Oosterhout
Date:
On Fri, May 19, 2006 at 01:39:45PM -0500, Jim C. Nasby wrote:
> > Do you have any stats on CPU usage? Memory usage?
>
> I've only been taking a look at vmstat from time-to-time, and I have yet
> to see the machine get CPU-bound. Haven't really paid much attention to
> memory. Is there anything in partucular you're looking for? I can log
> vmstat for the next set of runs (with a scaling factor of 10000). I plan
> on doing those runs tonight...

I've got some more info on zlibs memory usage:

Compression: 5816 bytes + 256KB buffer = approx 261KB
Decompression: 9512 bytes + 32KB buffer = approx 42KB
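(Those figures line up with zlib's documented estimates: deflate needs roughly
(1 << (windowBits+2)) + (1 << (memLevel+9)) bytes of buffers and inflate
roughly 1 << windowBits, so the per-tape cost is tunable if we accept a worse
ratio. A quick sketch:)

import zlib

def deflate_buf_bytes(wbits, memlevel):
    # zlib's documented estimate of deflate's buffer memory
    return (1 << (wbits + 2)) + (1 << (memlevel + 9))

print(deflate_buf_bytes(15, 8))   # defaults: 128K + 128K = 256K, as measured above
print(deflate_buf_bytes(12, 5))   # smaller window and memLevel: 16K + 16K = 32K

# the knobs are exposed when creating the compressor
small = zlib.compressobj(3, zlib.DEFLATED, 12, 5)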

As Tom said, you only run one compression at a time but logtape doesn't
know that. It can only free the compression structures on Rewind or
Freeze, neither of which are run until the merge pass. I don't
understand the algorithm enough to know if it's safe to rewind the old
tape in selectnewtape. That would seem to defeat the "freeze if only
one tape" optimisation.

One final thing, with trace_sort=on on my machine I get this with
compression:

LOG:  performsort done (except 28-way final merge): CPU 1.48s/7.49u sec elapsed 10.24 sec
LOG:  external sort ended, 163 disk blocks used: CPU 1.48s/7.49u sec elapsed 10.30 sec

and without compression:

LOG:  performsort done (except 28-way final merge): CPU 2.85s/1.90u sec elapsed 14.76 sec
LOG:  external sort ended, 18786 disk blocks used: CPU 2.88s/1.90u sec elapsed 15.70 sec

This indicates an awful lot of I/O waiting, some 60% of the time
without compression. The compression has cut the I/O wait from 10sec to
1.5sec at the expense of 5.5sec of compression time. If you had a
faster compression algorithm (zlib is not that fast) the results would
be even better...
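(The I/O-wait figures quoted above fall straight out of those log lines, i.e.
elapsed time minus sys+user CPU:)

runs = {
    "compressed":   dict(sys=1.48, user=7.49, elapsed=10.30),
    "uncompressed": dict(sys=2.88, user=1.90, elapsed=15.70),
}
for name, r in sorted(runs.items()):
    wait = r["elapsed"] - (r["sys"] + r["user"])
    print("%-12s ~%.1f s not accounted for by CPU (mostly I/O wait)" % (name, wait))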

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.

Re: Compression and on-disk sorting

From
"Jim C. Nasby"
Date:
Finally completed testing of a dataset that doesn't fit in memory with
compression enabled. Results are at
http://jim.nasby.net/misc/pgsqlcompression .

Summary:
                    work_mem    compressed  not compressed  gain
in-memory           20000       400.1       797.7           49.8%
in-memory           2000        371.4       805.7           53.9%
not in-memory       20000       8537        17436           51.0%
not in-memory       2000        8152        17820           54.3%

I find it very interesting that the gains are identical even when the
tapes should fit in memory. My guess is that for some reason the OS is
flushing those to disk anyway. In fact, watching gstat during a run, I
do see write activity hitting the drives. So if there was some way to
tune that behavior, the in-memory case would probably be much, much
faster. Anyone know FreeBSD well enough to suggest how to change this?
Anyone want to test on linux and see if the results are the same? This
could indicate that it might be advantageous to attempt an in-memory
sort with compressed data before spilling that compressed data to
disk...

As for CPU utilization, it was ~33% with compression and ~13% without.
That tells me that CPU could become a factor if everything was truly in
memory (including the table we were reading from), but if that's the
case there's a good chance that we wouldn't even be switching to an
on-disk sort. If everything isn't in memory then you're likely to be IO
bound anyway...
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Compression and on-disk sorting

From
"Joshua D. Drake"
Date:
Jim C. Nasby wrote:
> Finally completed testing of a dataset that doesn't fit in memory with
> compression enabled. Results are at
> http://jim.nasby.net/misc/pgsqlcompression .
> 
> Summary:
>                     work_mem    compressed  not compressed  gain
> in-memory           20000       400.1       797.7           49.8%
> in-memory           2000        371.4       805.7           53.9%
> not in-memory       20000       8537        17436           51.0%
> not in-memory       2000        8152        17820           54.3%
> 
> I find it very interesting that the gains are identical even when the
> tapes should fit in memory. My guess is that for some reason the OS is
> flushing those to disk anyway. In fact, watching gstat during a run, I
> do see write activity hitting the drives. So if there was some way to
> tune that behavior, the in-memory case would probably be much, much
> faster. Anyone know FreeBSD well enough to suggest how to change this?
> Anyone want to test on linux and see if the results are the same? This
> could indicate that it might be advantageous to attempt an in-memory
> sort with compressed data before spilling that compressed data to
> disk...
> 

I can test it on linux just let me know what you need.

J


> As for CPU utilization, it was ~33% with compression and ~13% without.
> That tells me that CPU could become a factor if everything was truly in
> memory (including the table we were reading from), but if that's the
> case there's a good chance that we wouldn't even be switching to an
> on-disk sort. If everything isn't in memory then you're likely to be IO
> bound anyway...


-- 
   === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
   Providing the most comprehensive PostgreSQL solutions since 1997
             http://www.commandprompt.com/
 




Re: Compression and on-disk sorting

From
"Jim C. Nasby"
Date:
On Wed, May 24, 2006 at 02:20:43PM -0700, Joshua D. Drake wrote:
> Jim C. Nasby wrote:
> >Finally completed testing of a dataset that doesn't fit in memory with
> >compression enabled. Results are at
> >http://jim.nasby.net/misc/pgsqlcompression .
> >
> >Summary:
> >                    work_mem    compressed  not compressed  gain
> >in-memory           20000       400.1       797.7           49.8%
> >in-memory           2000        371.4       805.7           53.9%
> >not in-memory       20000       8537        17436           51.0%
> >not in-memory       2000        8152        17820           54.3%
> >
> >I find it very interesting that the gains are identical even when the
> >tapes should fit in memory. My guess is that for some reason the OS is
> >flushing those to disk anyway. In fact, watching gstat during a run, I
> >do see write activity hitting the drives. So if there was some way to
> >tune that behavior, the in-memory case would probably be much, much
> >faster. Anyone know FreeBSD well enough to suggest how to change this?
> >Anyone want to test on linux and see if the results are the same? This
> >could indicate that it might be advantageous to attempt an in-memory
> >sort with compressed data before spilling that compressed data to
> >disk...
> >
> 
> I can test it on linux just let me know what you need.

Actually, after talking to Larry he mentioned that it'd be worth
checking to see if we're doing something like opening the files in
O_DIRECT, which I haven't had a chance to do. Might be worth looking at
that before running more tests.

Anyway, I've posted the patch now as well, and compress_sort.txt has the
commands I was running. Those are just against a plain pgbench database
that's been freshly initialized (ie: no dead tuples). I just created two
install directories from a checkout of HEAD via --prefix=, one with the
patch and one without. Both hit the same $PGDATA. I've posted the
postgresql.conf as well.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Compression and on-disk sorting

From
"Jim C. Nasby"
Date:
I've done some more testing with Tom's recently committed changes to
tuplesort.c, which remove the tuple headers from the sort data. It does
about 10% better than compression alone does. What's interesting is that
the gains are about 10% regardless of compression, which means
compression isn't helping very much on all the redundant header data,
which I find very odd. And the header data is very redundant:

bench=# select xmin,xmax,cmin,cmax,aid from accounts order by aid limit 1;
  xmin  | xmax | cmin | cmax | aid
--------+------+------+------+-----
 280779 |    0 |    0 |    0 |   1
(1 row)

bench=# select xmin,xmax,cmin,cmax,aid from accounts order by aid desc limit 1;
  xmin  | xmax | cmin | cmax |    aid
--------+------+------+------+-----------
 310778 |    0 |    0 |    0 | 300000000
(1 row)

Makes sense, since pgbench loads the database via a string of COPY commands,
each of which loads 10000 rows.
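(That arithmetic checks out against the xmin values above:)

rows = 300000000           # max aid seen above
rows_per_copy = 10000
print(rows // rows_per_copy)       # 30000 COPY transactions
print(310778 - 280779 + 1)         # span of xmin values seen above: also 30000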

Something else worth mentioning is that sort performance is worse with
larger work_mem for all cases except the old HEAD, prior to the
tuplesort.c changes. It looks like whatever was done to fix that will
need to be adjusted/rethought pending the outcome of using compression.

In any case, compression certainly seems to be a clear win, at least in
this case. If there's interest, I can test this on some larger hardware,
or if someone wants to produce a patch for pgbench that will load some
kind of real data into accounts.filler, I can test that as well.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Compression and on-disk sorting

From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> Something else worth mentioning is that sort performance is worse with
> larger work_mem for all cases except the old HEAD, prior to the
> tuplesort.c changes. It looks like whatever was done to fix that will
> need to be adjusted/rethought pending the outcome of using compression.

Please clarify.  What are you comparing here exactly, and what cases did
you test?
        regards, tom lane


Re: Compression and on-disk sorting

From
"Jim C. Nasby"
Date:
On Fri, May 26, 2006 at 12:35:36PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > Something else worth mentioning is that sort performance is worse with
> > larger work_mem for all cases except the old HEAD, prior to the
> > tuplesort.c changes. It looks like whatever was done to fix that will
> > need to be adjusted/rethought pending the outcome of using compression.
> 
> Please clarify.  What are you comparing here exactly, and what cases did
> you test?

Sorry, forgot to put the url in:
http://jim.nasby.net/misc/pgsqlcompression/compress_sort.txt

But the meat is:
                                        -- work_mem --
                        Scale           2000    20000
not compressed          150             805.7   797.7
not compressed          3000            17820   17436
compressed              150             371.4   400.1
compressed              3000            8152    8537
compressed, no headers  3000            7325    7876

Performance degrades with more work_mem any time compression is used. I
thought I had data on just your tuplesort.c change without compression,
but I guess I don't. :( I can run that tonight if desired.

As for the code, the 3 things I've tested are HEAD as of 5/17/06 with no
patches (labeled 'not compressed'); that code with the compression patch
(compressed), and that code with both the compression patch and your change to
tuplesort.c that removes tuple headers from the sorted data (compressed, no
headers).
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Compression and on-disk sorting

From
Simon Riggs
Date:
On Fri, 2006-05-26 at 14:47 -0500, Jim C. Nasby wrote:

> But the meat is:
>                                         -- work_mem --
>                         Scale           2000    20000
> not compressed          150             805.7   797.7
> not compressed          3000            17820   17436
> compressed              150             371.4   400.1
> compressed              3000            8152    8537
> compressed, no headers  3000            7325    7876

Since Tom has committed the header-removing patch, we need to test
not compressed, no headers v compressed, no headers

There is a noticeable rise in sort time with increasing work_mem, but
that needs to be offset against the benefit that in general comes from
using a large heap for the sort. With the data you're using that always
looks like a loss, but that isn't true with all input data orderings.

-- 
Simon Riggs
EnterpriseDB          http://www.enterprisedb.com



Re: Compression and on-disk sorting

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> There is a noticeable rise in sort time with increasing work_mem, but
> that needs to be offset against the benefit that in general comes from
> using a large heap for the sort. With the data you're using that always
> looks like a loss, but that isn't true with all input data orderings.

Yeah, these are all the exact same test data, right?  We need a bit more
variety in the test cases before drawing any sweeping conclusions.
        regards, tom lane


Re: Compression and on-disk sorting

From
"Jim C. Nasby"
Date:
On Fri, May 26, 2006 at 04:41:51PM -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > There is a noticeable rise in sort time with increasing work_mem, but
> > that needs to be offset from the benefit that in-general comes from
> > using a large Heap for the sort. With the data you're using that always
> > looks like a loss, but that isn't true with all input data orderings.
> 
> Yeah, these are all the exact same test data, right?  We need a bit more
> variety in the test cases before drawing any sweeping conclusions.

All testing is select count(*) from (select * from accounts order by
bid) a; hitting a pgbench database, since that's something anyone can
(presumably) reproduce. Suggestions for other datasets welcome.
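
For anyone who wants to reproduce it, the setup is roughly this (just a
sketch; the scale and work_mem values are the ones from the table above,
and the table/column names are the stock pgbench schema):

    -- after initializing with: pgbench -i -s <scale>   (150 and 3000 above)
    SET work_mem = 2000;    -- in kB; the second column used 20000
    \timing
    SELECT count(*) FROM (SELECT * FROM accounts ORDER BY bid) a;

work_mem is kept small relative to the data so the sort spills to disk, and
the count(*) wrapper just throws the rows away so only the sort cost gets
measured.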
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Compression and on-disk sorting

From
"Jim C. Nasby"
Date:
On Fri, May 26, 2006 at 09:21:44PM +0100, Simon Riggs wrote:
> On Fri, 2006-05-26 at 14:47 -0500, Jim C. Nasby wrote:
> 
> > But the meat is:
> >                                         -- work_mem --
> >                         Scale           2000    20000
> > not compressed          150             805.7   797.7
> > not compressed          3000            17820   17436
> > compressed              150             371.4   400.1
> > compressed              3000            8152    8537
> > compressed, no headers  3000            7325    7876
> 
> Since Tom has committed the header-removing patch, we need to test
> 
>     not compressed, no headers v compressed, no headers

                                        -- work_mem --
                        Scale           2000    20000
not compressed          150             805.7   797.7
not compressed          3000            17820   17436
not compressed, no hdr  3000            14470   14507
compressed              150             371.4   400.1
compressed              3000            8152    8537
compressed, no headers  3000            7325    7876

> There is a noticeable rise in sort time with increasing work_mem, but
> that needs to be offset from the benefit that in-general comes from
> using a large Heap for the sort. With the data you're using that always
> looks like a loss, but that isn't true with all input data orderings.

I thought that a change had been made to the on-disk sort specifically to
eliminate the problem of more work_mem making the sort take longer. I also
thought that there was something about that fix that was tunable.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Compression and on-disk sorting

From
Simon Riggs
Date:
On Wed, 2006-06-07 at 01:33 -0500, Jim C. Nasby wrote:
> On Fri, May 26, 2006 at 09:21:44PM +0100, Simon Riggs wrote:
> > On Fri, 2006-05-26 at 14:47 -0500, Jim C. Nasby wrote:
> > 
> > > But the meat is:
> > >                                         -- work_mem --
> > >                         Scale           2000    20000
> > > not compressed          150             805.7   797.7
> > > not compressed          3000            17820   17436
> > > compressed              150             371.4   400.1
> > > compressed              3000            8152    8537
> > > compressed, no headers  3000            7325    7876
> > 
> > Since Tom has committed the header-removing patch, we need to test
> > 
> >     not compressed, no headers v compressed, no headers
> 
>                                         -- work_mem --
>                         Scale           2000    20000
> not compressed          150             805.7   797.7
> not compressed          3000            17820   17436
> not compressed, no hdr  3000            14470   14507
> compressed              150             371.4   400.1
> compressed              3000            8152    8537
> compressed, no headers  3000            7325    7876

That looks fairly conclusive. Can we try tests with data in reverse
order, so we use more tapes? We're still using a single tape, so the
additional overhead of compression doesn't cause any pain.

> > There is a noticeable rise in sort time with increasing work_mem, but
> > that needs to be offset from the benefit that in-general comes from
> > using a large Heap for the sort. With the data you're using that always
> > looks like a loss, but that isn't true with all input data orderings.
> 
> I thought that a change had been made to the on-disk sort specifically to
> eliminate the problem of more work_mem making the sort take longer. 

There was a severe non-optimal piece of code...but the general effect
still exists. As does the effect that having higher work_mem produces
fewer runs which speeds up the final stages of the sort.

> I also
> thought that there was something about that fix that was tunable.

Increasing work_mem makes *this* test case take longer. 

-- 
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: Compression and on-disk sorting

From
"Jim C. Nasby"
Date:
On Wed, Jun 07, 2006 at 11:59:50AM +0100, Simon Riggs wrote:
> On Wed, 2006-06-07 at 01:33 -0500, Jim C. Nasby wrote:
> > On Fri, May 26, 2006 at 09:21:44PM +0100, Simon Riggs wrote:
> > > On Fri, 2006-05-26 at 14:47 -0500, Jim C. Nasby wrote:
> > > 
> > > > But the meat is:
> > > >                                         -- work_mem --
> > > >                         Scale           2000    20000
> > > > not compressed          150             805.7   797.7
> > > > not compressed          3000            17820   17436
> > > > compressed              150             371.4   400.1
> > > > compressed              3000            8152    8537
> > > > compressed, no headers  3000            7325    7876
> > > 
> > > Since Tom has committed the header-removing patch, we need to test
> > > 
> > >     not compressed, no headers v compressed, no headers
> > 
> >                                         -- work_mem --
> >                         Scale           2000    20000
> > not compressed          150             805.7   797.7
> > not compressed          3000            17820   17436
> > not compressed, no hdr  3000            14470   14507
> > compressed              150             371.4   400.1
> > compressed              3000            8152    8537
> > compressed, no headers  3000            7325    7876
> 
> That looks fairly conclusive. Can we try tests with data in reverse
> order, so we use more tapes? We're still using a single tape, so the
> additional overhead of compression doesn't cause any pain.

Would simply changing the ORDER BY to DESC suffice for this? FWIW:

bench=# select correlation from pg_stats where tablename='accounts' and attname='bid';
 correlation 
-------------
           1
(1 row)
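
If so, the descending run would presumably just be the same query with the
sort key reversed (sketch):

    -- same measurement, but the input now arrives in reverse of its physical order
    SELECT count(*) FROM (SELECT * FROM accounts ORDER BY bid DESC) a;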

> > > There is a noticeable rise in sort time with increasing work_mem, but
> > > that needs to be offset from the benefit that in-general comes from
> > > using a large Heap for the sort. With the data you're using that always
> > > looks like a loss, but that isn't true with all input data orderings.
> > 
> > I thought that a change had been made to the on-disk sort specifically to
> > eliminate the problem of more work_mem making the sort take longer. 
> 
> There was a severe non-optimal piece of code...but the general effect
> still exists. As does the effect that having higher work_mem produces
> fewer runs which speeds up the final stages of the sort.
> 
> > I also
> > thought that there was something about that fix that was tunable.
> 
> Increasing work_mem makes *this* test case take longer. 

-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Compression and on-disk sorting

From
Simon Riggs
Date:
On Wed, 2006-06-07 at 09:35 -0500, Jim C. Nasby wrote:

> Would simply changing the ORDER BY to DESC suffice for this? FWIW:

Try sorting on aid also, both ascending and descending.

We need to try lots of tests, not just one that's chosen to show the
patch in the best light. I want this, but we need to check.

-- 
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: Compression and on-disk sorting

From
"Jim C. Nasby"
Date:
On Wed, Jun 07, 2006 at 04:11:57PM +0100, Simon Riggs wrote:
> On Wed, 2006-06-07 at 09:35 -0500, Jim C. Nasby wrote:
> 
> > Would simply changing the ORDER BY to DESC suffice for this? FWIW:
> 
> Try sorting on aid also, both ascneding and descending.
> 
> We need to try lots of tests, not just one thats chosen to show the
> patch in the best light. I want this, but we need to check.

Well, correlation on everything in that table is 1. At this point maybe
it makes more sense to just come up with a different test case, possibly
using generate_series and random(). Better yet would be if someone came up
with a patch that actually populated the filler field in pgbench, and
better still would be letting the user define how large they want the
filler field to be...
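
Something along those lines might look like this (just a sketch; the table
name, row count, and filler width are arbitrary placeholders):

    -- throwaway table with a random, uncorrelated sort key and a filler
    -- column whose width can be adjusted to taste
    CREATE TABLE sort_test AS
    SELECT g AS aid,
           (random() * 100000)::int AS bid,    -- random sort key
           repeat('x', 84) AS filler           -- pick whatever width to test
    FROM generate_series(1, 10000000) AS g;

    ANALYZE sort_test;

    -- then the same style of measurement as before
    SELECT count(*) FROM (SELECT * FROM sort_test ORDER BY bid) a;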
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461