Re: Is there anything special about pg_dump's compression? - Mailing list pgsql-sql

From Shane Ambler
Subject Re: Is there anything special about pg_dump's compression?
Date
Msg-id 473DC81A.3010806@Sheeky.Biz
In response to Re: Is there anything special about pg_dump's compression?  (Jean-David Beyer <jeandavid8@verizon.net>)
Responses Re: Is there anything special about pg_dump's compression?  (Jean-David Beyer <jeandavid8@verizon.net>)
List pgsql-sql
Jean-David Beyer wrote:
> Tom Lane wrote:
>> Jean-David Beyer <jeandavid8@verizon.net> writes:
>>> I turned the software compression off. It took:
>>> 524487428 bytes (524 MB) copied, 125.394 seconds, 4.2 MB/s
>>> When I let the software compression run, it uses only 30 MBytes. So whatever
>>> compression it uses is very good on this kind of data.
>>> 29810260 bytes (30 MB) copied, 123.145 seconds, 242 kB/s
>> Seems to me the conclusion is obvious: you are writing about the same
>> number of bits to physical tape either way. 
> 
> I guess so. I _am_ impressed by how much compression is achieved.

Plain text tends to compress well with most algorithms, and repetitive 
content improves things a lot. (Think of how many CREATE TABLE, 
COPY FROM stdin, ALTER TABLE ADD CONSTRAINT, GRANT ALL ON SCHEMA, REVOKE 
ALL ON SCHEMA ... statements are in your backup files.)

To test that, create a text file with one line - "this is data\n"
Then bzip that file - the original uses 13 bytes, the compressed file 
uses 51 bytes.

Now change the file to have 4000 lines of "this is data\n"
The original is 52,000 bytes and compressed it is 76 bytes
- it uses only 25 more bytes to indicate that the same string is 
repeated 4000 times.
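
A rough way to repeat that test without creating the files by hand is 
the Python sketch below. It assumes Python's standard bz2 module in 
place of the bzip command line tool, so the exact byte counts it prints 
may differ from the ones quoted above. compressed_size is just a helper 
name made up for this example.

```python
import bz2

LINE = b"this is data\n"  # 13 bytes, as in the test above


def compressed_size(payload: bytes) -> int:
    """Return the size in bytes of payload after bzip2 compression."""
    return len(bz2.compress(payload))


one_line = LINE
many_lines = LINE * 4000  # 52,000 bytes of the same repeated line

# Print original and compressed sizes for both cases.
for label, payload in (("1 line", one_line), ("4000 lines", many_lines)):
    print(f"{label}: {len(payload)} bytes -> {compressed_size(payload)} bytes")
```

The single line should come out larger than it went in because of the 
fixed header overhead, while the repeated lines should shrink to a tiny 
fraction of the original.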

>> The physical tape speed is
>> surely the real bottleneck here, and the fact that the total elapsed
>> time is about the same both ways proves that about the same number of
>> bits went onto tape both ways.
> 
> I do not get that. If the physical tape speed is the bottleneck, why is it
> only about 242 kB/s in the software-compressed case, and 4.2 MB/s in the
> hardware-uncompressed case? The tape drive usually gives over 6 MB/s rates
> when running a BRU (similar to find > cpio) when doing a backup of the rest

It would really depend on where the speed measurement comes from and how 
it is calculated. Is it data going to the drive controller or is it 
data going to tape? Is it the uncompressed size of data going to tape?

My guess is that it is calculated as the uncompressed size going to 
tape. In the two examples you give similar times for the same original 
uncompressed data.

I would say that both methods send 30MB to tape, which takes around 124 
seconds.

The first example states 4.2MB/s - calculated from the uncompressed size 
of 524MB, yet the drive compresses that to 30MB which is written to 
tape. So it is saying it got 524MB and saved it to tape in 125 seconds 
(4.2MB/s), but it still only put 30MB on the tape.

524MB/125 seconds = 4.192MB per second

The second example states 242KB/s - calculated from the size sent to the 
drive. As the data the drive gets is already compressed, it can't 
compress it any smaller - the data received is the same size as the data 
written to tape. This would indicate your tape speed.

30MB/123 seconds = 244KB/s

To verify this -

524/30 is about 17 - the compressed data is roughly 1/17 the original 
size.

242*17=4114 - that's almost the 4.2MB/s that you get sending 
uncompressed data. I would say you get a little more compression from 
the tape hardware, which gives you the slightly better transfer rate. 
Or sending compressed data to the drive while it is set to compress 
incoming data is causing a delay, as the drive tries to compress the 
data without reducing the size sent to tape. (My guess is that if you 
disabled the drive compression and sent the compressed pg_dump to the 
drive you would get about 247KB/s.)
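
The same arithmetic can be redone from the exact byte counts and times 
in the dd output quoted at the top, rather than the rounded 524MB and 
30MB. This is only a sketch of the check, and it assumes the reported 
rates are simply bytes handed to the drive divided by elapsed seconds, 
with MB meaning 10^6 bytes. The variable names are mine.

```python
# Byte counts and elapsed times quoted from the dd output earlier in the thread.
RAW_BYTES, RAW_SECONDS = 524_487_428, 125.394        # software compression off
PACKED_BYTES, PACKED_SECONDS = 29_810_260, 123.145   # software compression on

raw_rate = RAW_BYTES / RAW_SECONDS            # bytes/s handed to the drive
packed_rate = PACKED_BYTES / PACKED_SECONDS   # bytes/s handed to the drive
ratio = RAW_BYTES / PACKED_BYTES              # software compression ratio
scaled_rate = packed_rate * ratio             # compressed run in uncompressed terms

print(f"uncompressed run: {raw_rate / 1e6:.2f} MB/s")
print(f"compressed run:   {packed_rate / 1e3:.0f} kB/s")
print(f"compression ratio: {ratio:.1f}")
print(f"compressed run scaled by ratio: {scaled_rate / 1e6:.2f} MB/s")
```

The scaled figure and the uncompressed figure should agree to within a 
few percent, which is all the check is meant to show. Which side the 
small gap falls on depends on how the ratio is rounded.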


I would also say the 6MB/s from a drive backup would come about from -
1. Less overhead, as data is sent directly from disk to tape (DMA should 
reduce the software overhead as well). pg_dump formats the data it gets 
and waits for responses from postgres - no DMA.

And maybe -
2. A variety of file contents would also offer different rates of 
compression - some of your file system contents can be compressed more 
than pg_dump output.
3. Streaming it as one lot to the drive may also allow the drive to 
treat your entire drive contents as one file - allowing duplicates in 
different files to be compressed the way the above example does.
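
Point 3 can be illustrated roughly with the Python sketch below. It 
assumes bzip2 (via Python's bz2 module) as a stand-in for whatever the 
tape hardware really does, which is almost certainly a different 
algorithm, and the "files" are made-up byte strings that share most of 
their content.

```python
import bz2

# Hypothetical stand-ins for 20 small files that share most of their content.
files = [(b"CREATE TABLE t%d (id integer);\n" % i) * 200 for i in range(20)]

separate = sum(len(bz2.compress(f)) for f in files)   # each file on its own
combined = len(bz2.compress(b"".join(files)))         # one continuous stream

print(f"compressed separately:    {separate} bytes")
print(f"compressed as one stream: {combined} bytes")
```

Compressing each file on its own pays the header overhead every time 
and cannot reuse what it learned from the previous file, so the single 
stream should come out smaller.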


-- 

Shane Ambler
pgSQL@Sheeky.Biz

Get Sheeky @ http://Sheeky.Biz

