Re: libpq compression - Mailing list pgsql-hackers

From Konstantin Knizhnik
Subject Re: libpq compression
Date
Msg-id 826687dc-f9f6-e854-5975-65122a7549a4@garret.ru
Whole thread Raw
In response to Re: libpq compression  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers


On 24.11.2020 21:35, Robert Haas wrote:
On Tue, Nov 24, 2020 at 12:35 PM Daniil Zakhlystov
<usernamedt@yandex-team.ru> wrote:
To sum up, I think that the current implementation already introduces good benefits. As I proposed in the Usability review, we may introduce the new approaches later as separate compression 'algorithms'.
I don't think the current patch is so close to being committable that
we shouldn't be considering what we really want to have here. It's one
thing to say, well, this patch is basically done, let's not start
redesigning it now. But that's not the case here. For example, I don't
see any committer accepting the comments in zpq_stream.c as adequate,
or the documentation, either. Some comments that have been made
previously, like Andres's remark about the non-standard message
construction in pq_configure(), have not been addressed, and I do not
think any committer is going to agree with the idea that the novel
method chosen by the patch is superior here, not least but not only
because it seems like it's endian-dependent. That function also uses
goto, which anybody thinking of committing this will surely try to get
rid of, and I'm pretty sure the sscanf() isn't good enough to reject
trailing garbage, and the error message that follows is improperly
capitalized. I'm sure there's other stuff, too: this is just based on
a quick look.

Before we start worrying about any of that stuff in too much detail, I
think it makes a lot of sense to step back and consider the design.
Honestly, the work of changing the design might be smaller than the
amount of cleanup the patch needs. But even if it's larger, it's
probably not vastly larger. And in any case, I quite disagree with the
idea that we should commit to a user-visible interface that exposes a
subset of the functionality that we needed and then try to glue the
rest of the functionality on top of it later. If we make a libpq
connection option called compression that controls the type of
compression that is used in both direction, then how exactly would we
extend that later to allow for different compression in the two
directions? Some syntax like compression=zlib/none, where the value
before the slash controls one direction and the value after the slash
controls the other? Maybe. But on the other hand, maybe it's better to
have separate connection options for client compression and server
compression. Or, maybe the kind of compression used by the server
should be controlled via a GUC rather than a connection option. Or,
maybe none of that is right and we should stick with the approach the
patch currently takes. But it's not like we can do something for v1
and then just change things randomly later: there will be
backward-compatibility to worry about. So the time to talk about the
general approach here is now, before anything gets committed, before
the project has committed itself to any particular design. If we
decide in that discussion that certain things can be left for the
future, that's fine. If we have discussed how they could be added
without breaking backward compatibility, even better. But we can't
just skip over having that discussion.

--
Robert Haas
EDB: http://www.enterprisedb.com

First of all, thank you for the review.
And I completely agree with you that this patch is not ready for committing -
at the very least, the documentation (written in my bad English) has to be checked and fixed. Also, I have not tested it much
on Windows and other non-Unix systems.

I do not want to discuss small technical things like sending the compression message
in pq_configure. I have already answered Andres why I can not use the standard functions in this case:
this function is called during connection initialization, so the connection handle is not ready yet.
But this can certainly be changed (although it is not endian-dependent: libpq messages use big-endian order).
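To illustrate the big-endian point: the usual endian-independent way to frame a libpq message is one type byte followed by a 4-byte length in network byte order. Here is a minimal sketch in Python (the real patch does this in C, and the 'z' message type here is purely hypothetical, chosen for illustration):

```python
import struct

def build_compression_msg(algorithm: bytes) -> bytes:
    """Frame a hypothetical 'z' (compression) message in the libpq style:
    one type byte, then a 4-byte length that counts itself plus the
    payload. The '!' format prefix means network (big-endian) order, so
    the wire bytes are identical on little- and big-endian hosts."""
    return b'z' + struct.pack('!I', 4 + len(algorithm)) + algorithm

msg = build_compression_msg(b'zlib')
print(msg.hex())  # -> 7a000000087a6c6962 (type 0x7a, length 8, "zlib")
```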
It also seems strange to discuss the presence of "goto" in the code: we are not "puritans", are we? ;)
Yes, I also try to avoid gotos as much as possible, since they can make code less readable.
But completely prohibiting gotos and replacing them with artificial loops and redundant checks is
the kind of radical approach that rarely leads to anything good. By the way, there are about three thousand gotos in the Postgres code
(some of them may be in generated code - I have not checked :)

So let's discuss more fundamental things, like your suggestion to completely redesign the compression support.
I am not against such a discussion, although I personally do not think there are many topics to discuss here.
I certainly do not claim that my implementation is perfect or that it can not be reimplemented in a better way. Certainly it can.
But compression itself, especially compression of protocol messages, is not a novel area. It has been implemented many times in many
different systems (recently it was added to MySQL), and it is hard to find much room for invention here.

We can suggest many different things that would make compression more complex and flexible:
- the ability to switch it on/off on the fly
- using different compression algorithms in the two directions
- the ability to change the compression algorithm and compression level on the fly
- dynamically choosing the best compression algorithm based on the content of the data stream
...

I am sure that after a few moments of thought any of us could add several more items to this list.
If we try to develop a protocol that supports all of them,
it will most likely be too complicated, too expensive, and inefficient.
There is a general rule: the more flexible something is, the less efficient it is...

Before suggesting any of these advanced features, we need strong arguments for why a feature is actually needed
and which problem it is trying to solve.

Right now, the status quo is that we have a few well-known compression algorithms - zlib, zstd, lz4, ... - whose characteristics are not so
different. For some of them it is possible to say that algorithm A is almost always better than algorithm B; for example,
modern zstd is almost always better than the old zlib. But zlib is available almost everywhere and is already supported by Postgres by default.
Some algorithms are faster, some provide a better compression ratio. But for small enough protocol messages, speed is much more
important than compression quality.
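As a quick illustration of that tradeoff (this is my own toy sketch, not a benchmark from the patch), compressing a small repetitive payload - the kind of thing many protocol messages look like - at different zlib levels shows how modest the ratio gains are beyond the fast levels:

```python
import time
import zlib

# Illustrative only: a small, repetitive payload, as is typical for
# protocol traffic, compressed at a fast, default, and maximum level.
payload = b'INSERT INTO t VALUES (1, 2, 3);\n' * 100

for level in (1, 6, 9):
    start = time.perf_counter()
    out = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    print(f'level {level}: {len(payload)} -> {len(out)} bytes '
          f'({elapsed * 1e6:.0f} us)')
```

On such data the size differences between levels are small, while higher levels cost more CPU per message.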

Sorry for my somewhat chaotic arguments. I really do not think there is much sense in supporting
a lot of different compression algorithms.
Both zstd and lz4 provide comparable speed and compression ratios, and zlib can be used as a universal fallback.
Any of them will definitely be much better than no compression at all.

What really worries me is that there are several places in Postgres that require compression, and all of them are implemented
in their own way: TOAST compression, WAL compression, backup compression, ... zedheap compression.
There is now a custom compression methods patch in the commitfest, which proposes a very flexible way of fixing just one of
these cases - and which is completely orthogonal to the proposed libpq compression.
My main concern is this awful code duplication, and the fact that Postgres is still using its own pglz implementation,
which is much worse (according to zedstore benchmarks and my own measurements) than lz4 and other popular algorithms.
And I do not believe this situation will change in PG14 or PG15.

So why do we need the ability to switch compression on/off?
High CPU overhead? Daniil's measurements do not support that... Compression should not be used for all connections.
There are some well-known kinds of connections for which it will be useful: first of all, the replication protocol.
If a client uploads data to the database using the COPY command, or performs complex OLAP queries returning huge results,
then compression may also be useful.
I can not imagine a scenario where it is necessary to switch compression on/off on the fly.
And it is not clear who (the user, the libpq library, the JDBC driver, ...?) would control it, and how.

The possibility of using different algorithms in different directions seems even less sensible,
especially if we do not use data-specific compression algorithms. And using such algorithms is very
doubtful, given that data between client and server is currently mostly transferred in text format
(ok, in the case of replication this is not true).

Dynamically switching the compression algorithm depending on the content of the data stream is also an interesting
but practically useless idea, because modern compression algorithms already do their best to adapt to
the input data.

Finally, I just want to note once again that, from my point of view, compression is not an area where we need to invent
something new. It is enough to provide functionality similar to what is available in most other communication protocols.
IMHO there is no sense in allowing the user to specify a particular algorithm and compression level.
It should work like it does in SSL: you can switch compression on or off, but you can not specify the algorithm or compression level,
much less dynamically control compression on the fly.

Unfortunately, your proposal to throw away this patch and restart the design discussion from scratch most likely means
that we will not have compression in the foreseeable future. And that is a pity, because I see many real use cases
where compression is actually needed, and where Oracle beats Postgres several times over just because it sends less data
through the network.

