Re: libpq compression - Mailing list pgsql-hackers

From Daniil Zakhlystov
Subject Re: libpq compression
Msg-id D5354E7A-3B9F-4D32-B3AB-F65058D36500@yandex-team.ru
In response to Re: libpq compression  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
> On Dec 10, 2020, at 1:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> I still think this is excessively baroque and basically useless.
> Nobody wants to allow compression levels 1, 3, and 5 but disallow 2
> and 4. At the very most, somebody might want to set a maximum or
> minimum level. But even that I think is pretty pointless. Check out
> the "Decompression Time" and "Decompression Speed" sections from this
> link:
>
> https://www.rootusers.com/gzip-vs-bzip2-vs-xz-performance-comparison/
>
> This shows that decompression time and speed is basically independent
> of compression method for all three of these compressors; to the
> extent that there is a difference, higher compression levels are
> generally slightly faster to decompress. I don't really see the
> argument for letting either side be proscriptive here. Deciding which
> algorithms you're willing to accept is totally reasonable since
> different things may be supported, security concerns, etc. but
> deciding you're only willing to accept certain levels seems unuseful.
> It's also unenforceable, I think, since the receiving side has no way
> of knowing what the sender actually did.

I agree that decompression time and speed are basically the same for different compression ratios for most algorithms.
But it seems that this may not be true for memory usage.

Check out these links: http://mattmahoney.net/dc/text.html and
https://community.centminmod.com/threads/round-4-compression-comparison-benchmarks-zstd-vs-brotli-vs-pigz-vs-bzip2-vs-xz-etc.18669/

According to these sources, zstd uses significantly more memory when decompressing data that was compressed with
a high compression ratio.

So I’ll test the different ZSTD compression ratios with the current version of the patch and post the results later
this week.
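
For context on why this matters for the receiver: zstd’s decompression memory is driven mostly by the window size
recorded in the frame header, and higher compression levels use larger windows. The receiving side can cap this
up front. A minimal sketch, assuming libzstd >= 1.4.0 and its advanced parameter API (this is not code from the
patch):

    #include <stdio.h>
    #include <zstd.h>

    int
    main(void)
    {
        ZSTD_DCtx  *dctx = ZSTD_createDCtx();
        size_t      rc;

        /*
         * Refuse frames whose window exceeds 2^23 bytes (8 MB).  Frames
         * produced at high levels carry larger windows and will then fail
         * inside ZSTD_decompressStream() instead of forcing a large
         * allocation on the receiving side.
         */
        rc = ZSTD_DCtx_setParameter(dctx, ZSTD_d_windowLogMax, 23);
        if (ZSTD_isError(rc))
            fprintf(stderr, "windowLogMax: %s\n", ZSTD_getErrorName(rc));

        /* ... feed data with ZSTD_decompressStream(dctx, &out, &in) ... */

        ZSTD_freeDCtx(dctx);
        return 0;
    }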


> On Dec 10, 2020, at 1:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>
> Good points. I guess you need to arrange to "flush" at the compression
> layer as well as the libpq layer so that you don't end up with data
> stuck in the compression buffers.

I think that “flushing” the libpq and compression buffers before setting the new compression method will help to
solve issues only at the compressing (sender) side, but won’t help much on the decompressing (receiver) side.

In the current version of the patch, the decompressor acts as a proxy between secure_read and PqRecvBuffer /
conn->inBuffer. It is unaware of the Postgres protocol and cannot do anything other than decompress the bytes
received from the secure_read function and append them to the PqRecvBuffer.
So the problem is that we can’t separate the compressed bytes from the uncompressed ones (ZSTD actually detects
the end of a compressed frame, but some other algorithms don’t).
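
To illustrate the ZSTD part of that parenthetical: with the streaming API, ZSTD_decompressStream() returns 0
exactly when a frame has been completely decoded and flushed, so a ZSTD receiver can at least tell where one
compressed stream ends. A self-contained sketch (the message bytes are made up):

    #include <stdio.h>
    #include <zstd.h>

    int
    main(void)
    {
        const char  msg[] = "some protocol bytes";
        char        comp[256];
        char        plain[256];
        size_t      csize;
        size_t      ret;

        /* one chunk -> one complete zstd frame */
        csize = ZSTD_compress(comp, sizeof(comp), msg, sizeof(msg), 3);
        if (ZSTD_isError(csize))
            return 1;

        ZSTD_DCtx  *dctx = ZSTD_createDCtx();
        ZSTD_inBuffer in = {comp, csize, 0};
        ZSTD_outBuffer out = {plain, sizeof(plain), 0};

        /* a return value of 0 means the frame is fully decoded */
        ret = ZSTD_decompressStream(dctx, &out, &in);
        printf("ret=%zu (0 = end of frame), %zu plain bytes\n", ret, out.pos);

        ZSTD_freeDCtx(dctx);
        return 0;
    }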

We could introduce some hooks to control the decompressor behavior from the underlying levels after reading the
SetCompressionMethod message from PqRecvBuffer, but I don’t think that is the correct approach.

> On Dec 10, 2020, at 1:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Another idea is that you could have a new message type that says "hey,
> the payload of this is 1 or more compressed messages." It uses the
> most-recently set compression method. This would make switching
> compression methods easier since the SetCompressionMethod message
> itself could always be sent uncompressed and/or not take effect until
> the next compressed message. It also allows for a prudential decision
> not to bother compressing messages that are short anyway, which might
> be useful. On the downside it adds a little bit of overhead. Andres
> was telling me on a call that he liked this approach; I'm not sure if
> it's actually best, but have you considered this sort of approach?

This may help to solve the above issue. For example, we may introduce the CompressedData message:

CompressedData (F & B)

Byte1(‘m’) // I am not so sure about the ‘m’ identifier :)
Identifies the message as compressed data.

Int32
Length of message contents in bytes, including self.

Byten
Data that forms part of a compressed data stream.

Basically, it wraps some chunk of compressed data (like the CopyData message).

On the sender side, the compressor will wrap all outgoing compressed data chunks into CompressedData messages.
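
A rough sketch of that sender-side framing (the ‘m’ type byte is the tentative identifier from above; the function
name and layout are my assumptions, not code from the patch):

    #include <stdint.h>
    #include <string.h>
    #include <arpa/inet.h>          /* htonl */

    /*
     * Wrap one chunk of already-compressed bytes into a CompressedData
     * message: Byte1('m'), self-inclusive Int32 length, then the payload.
     * The caller must provide at least 5 + plen bytes in dst.  Returns
     * the total message size.
     */
    size_t
    build_compressed_data(char *dst, const char *payload, uint32_t plen)
    {
        uint32_t    len = htonl(plen + 4);  /* length counts itself, not the type byte */

        dst[0] = 'm';
        memcpy(dst + 1, &len, 4);
        memcpy(dst + 5, payload, plen);
        return 1 + 4 + plen;
    }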

On the receiver side, some intermediate component between secure_read and the decompressor will do the following (a sketch follows the list):
1. Read the next 5 bytes (type and length) from the buffer
2.1 If the message type is other than CompressedData, forward it straight to the PqRecvBuffer / conn->inBuffer.
2.2 If the message type is CompressedData, forward its contents to the current decompressor.
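
Here is the sketch of that dispatch step; pass_through() and feed_decompressor() are stand-ins for “append to
PqRecvBuffer” and “hand to the current decompressor”, not functions from the patch:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <arpa/inet.h>          /* ntohl */

    static void
    pass_through(const char *buf, size_t n)
    {
        (void) buf;
        printf("plain message: %zu bytes\n", n);        /* -> PqRecvBuffer */
    }

    static void
    feed_decompressor(const char *buf, size_t n)
    {
        (void) buf;
        printf("compressed payload: %zu bytes\n", n);   /* -> current decompressor */
    }

    /* msg holds one complete message: type byte + self-inclusive Int32 length */
    static void
    dispatch_message(const char *msg)
    {
        uint32_t    len;

        memcpy(&len, msg + 1, 4);   /* step 1: parse the 5-byte header */
        len = ntohl(len);

        if (msg[0] == 'm')
            feed_decompressor(msg + 5, len - 4);    /* step 2.2: unwrap */
        else
            pass_through(msg, 1 + len);             /* step 2.1: forward as-is */
    }

    int
    main(void)
    {
        /* a fake ReadyForQuery message: 'Z', Int32 length 5, status byte 'I' */
        char        zmsg[] = {'Z', 0, 0, 0, 5, 'I'};

        dispatch_message(zmsg);
        return 0;
    }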

What do you think of this approach?

—
Daniil Zakhlystov

