Re: libpq compression - Mailing list pgsql-hackers
From: Daniil Zakhlystov
Subject: Re: libpq compression
Msg-id: 67498D86-006D-4D56-85E2-88ABA07B2715@yandex-team.ru
In response to: Re: libpq compression (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
List: pgsql-hackers
Hi,
I have a couple of comments regarding the latest patch; most of them are minor issues.
In src/backend/libpq/pqcomm.c, starting from line 1114:

int
pq_getbyte_if_available(unsigned char *c)
{
    int r;

    Assert(PqCommReadingMsg);

    if (PqRecvPointer < PqRecvLength || (0) > 0) // not easy to understand optimization (maybe add a comment?)
    {
        *c = PqRecvBuffer[PqRecvPointer++];
        return 1;
    }
    return r; // returned value is not initialized
}
In src/interfaces/libpq/fe-connect.c, starting from line 3255:

pqGetc(&algorithm, conn);
impl = zpq_get_algorithm_impl(algorithm);
{ // I believe an if (impl < 0) condition is missing here, otherwise this is always an error
    appendPQExpBuffer(&conn->errorMessage,
                      libpq_gettext("server is not supported requested compression algorithm %c\n"),
                      algorithm);
    goto error_return;
}
In configure, starting from line 1587:

  --without-zlib          do not use Zlib
  --with-zstd             do not use zstd // is this correct?
Thanks
--
Daniil Zakhlystov
On Nov 1, 2020, at 12:08 AM, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:

Hi,

On 31.10.2020 00:03, Andres Freund wrote:
> Hi,
>
> On 2020-10-29 16:45:58 +0300, Konstantin Knizhnik wrote:
>
> - What does "and it is up to the client whether to continue work
> without compression or report error" actually mean for a libpq parameter?
> I think it's pretty important for features like this to be able to fail
> softly when the feature is not available on the other side. Otherwise a
> lot of applications need to have unnecessary version dependencies coded
> into them.

Sorry, maybe I do not completely understand your suggestion, but it cannot happen.
The client requests use of the compressed protocol only if "compression=XXX"
was specified in the connection string, and XXX must be supported by the
client, otherwise the request is rejected. So the list of supported
algorithms sent by the client can never be empty.

Right now the user just specifies that he wants to use compression.
The libpq client sends the server the list of supported algorithms,
and the server chooses the best one among those it supports.
It sends its choice to the client, and then both use this algorithm.
Sorry that in the previous mail I used incorrect examples: the client does
not explicitly specify a compression algorithm - it just requests
compression, and the server chooses the most efficient algorithm supported
by both client and server. So the client does not need to know the names of
the particular algorithms (i.e. zlib, zstd), and the choice is based on the
assumption that the server (or, better said, the programmer) knows better
than the user which algorithm is more efficient.
This last assumption may be contested, because the user knows better which
content will be sent and which algorithm is more efficient for that content.
But right now, when the choice is between zstd and zlib, the first one is
always better: faster and with a better compression ratio.

>>> - What is the point of the "streaming mode" reference?
>>
>> There are two ways of performing compression:
>> - block mode, when each block is individually compressed (the compressor
>>   stores a dictionary in each compressed block)
>> - stream mode
>> Block mode allows each block to be decompressed independently. It is good
>> for implementing page or field compression (as pglz is used to compress
>> toast values).
>> But it is not efficient for compressing client-server protocol messages.
>> It seems important to me to explain that libpq is using stream mode,
>> and why there is no pglz compressor.
>
> To me that seems like unnecessary detail in the user oriented parts of
> the docs at least.

Ok, I will remove this phrase.

>>> Why are compression methods identified by one byte identifiers? That
>>> seems unnecessarily small, given this is commonly a once-per-connection
>>> action?
>>
>> It is mostly for simplicity of implementation: it is always simpler to
>> work with fixed-size messages (or with an array of chars rather than an
>> array of strings).
>> And I do not think that it can somehow decrease flexibility: these
>> one-letter algorithm codes are not visible to the user. And I do not
>> think that we will ever support more than 127 (or even 64) different
>> compression algorithms.
>
> It's pretty darn trivial to have a variable width protocol message,
> given that we have all the tooling for that. Having a variable length
> descriptor allows us to extend the compression identifier with e.g. the
> level without needing to change the protocol. E.g. zstd:9 or
> zstd:level=9 or such.
>
> I suggest using a variable length string as the identifier, and split it
> at the first : for the lookup, and pass the rest to the compression
> method.

Yes, I agree that it provides more flexibility, and it is not a big problem
to handle arbitrary strings instead of chars.
But right now my intention was to prevent the user from choosing the
compression algorithm, and especially from specifying the compression level
(which is algorithm-specific). It is a matter of the server-client handshake
to choose the best compression algorithm supported by both of them.
I can definitely rewrite it, but IMHO giving too much flexibility to the
user will just complicate things without any positive effect.

>>> The protocol sounds to me like there's no way to enable/disable
>>> compression in an existing connection. To me it seems better to have an
>>> explicit, client initiated, request to use a specific method of
>>> compression (including none). That allows to enable compression for bulk
>>> work, and disable it in other cases (e.g. for security sensitive
>>> content, or for unlikely to compress well content).
>>
>> It will significantly complicate implementation (because of buffering at
>> different levels).
>
> Really? We need to be able to ensure a message is flushed out to the
> network at precise moments - otherwise a lot of stuff will break.

Definitely: the stream is flushed after writing each message.
The problem is at the receiver side: several messages can be sent without
waiting for a response and will be read into the buffer. If the first one is
compressed and the subsequent ones are not, then it will not be trivial to
handle. I have already faced this problem when compression is switched on:
the backend may send some message right after acknowledging the compression
protocol. This message is already compressed, but it is delivered together
with the uncompressed compression acknowledgement message. I have solved
this problem for switching on compression at the client side, but I really
do not want to solve it at arbitrary moments. And once again: my opinion is
that too much flexibility is not so good here: there is no sense in
switching off compression for short messages (the overhead of switching can
be larger than that of the compression itself).

>> Also it is not clear to me who and how will control enabling/disabling
>> compression in this case.
>> I can imagine that "\copy" should trigger compression. But what about a
>> (server side) "copy" command?
>
> The client could still trigger that. I don't think it should ever be
> server controlled.

Sorry, but I still think that the possibility to turn compression on/off on
the fly is a very doubtful idea. It would also require an extension of the
protocol (adding some extra functions to the libpq library to control it).

>> And concerning security risks... In most cases such a problem is not
>> relevant at all, because both client and server are located within a
>> single reliable network.
>
> The trend is towards never trusting the network, even if internal, not
> the opposite.

>> If security of communication really matters, you should not switch on
>> compression in any case (including COPY and other bulk data transfer).
>> It is a very strange idea to let the client decide which data is
>> "security sensitive" and which is not.
>
> Huh?

Sorry if I was unclear; I am not a specialist in security.
But it seems very dangerous to me when the client (and not the server)
decides which data is "security sensitive" and which is not. Just an
example: you have VerySecretTable, and when you perform selects on this
table, you switch compression off. But then the client wants to execute a
COPY command for this table, and since it requires bulk data transfer, the
user decides to switch on compression.
Actually, speaking about security risks makes no sense here: if you want to
provide security, you use TLS, and it provides its own compression. So there
is no need to perform compression at the libpq level; libpq compression is
for the cases when you do not need SSL at all.

>>> I think that would also make cross-version handling easier, because a
>>> newer client driver can send the compression request and handle the
>>> error, without needing to reconnect or such.
>>>
>>> Most importantly, I think such a design is basically a necessity to make
>>> connection poolers work in a sensible way.
>>
>> I do not completely understand the problem with connection poolers.
>> Right now the developers of Yandex Odyssey are trying to support libpq
>> compression in their pooler.
>> If they are faced with some problems, I will definitely address them.
>
> It makes poolers a lot more expensive if they have to decompress and
> then recompress again. It'd be awesome if even the decompression could
> be avoided in at least some cases, but compression is the really
> expensive part. So if a client connects to the pooler with
> compression=zlib and an existing server connection is used, the pooler
> should be able to switch the existing connection to zlib.

My expectation was that the pooler is installed in the same network as the
database, and there is no need to compress traffic between the pooler and
the backends.
But even if it is really needed, I do not see problems here.
Definitely we need the zpq library to support different compression
algorithms in the same application - but that is done.
To let the pooler avoid performing decompression+compression, it would be
necessary to send message headers in raw format, but that would greatly
reduce the effect of stream compression.

>> But if you think that it is so important, I will try to implement it.
>> Many questions arise in this case: which side should control the
>> compression level? Should the client affect the compression level both at
>> the client side and at the server side? Or should it be possible to
>> specify the compression level separately for client and for server?
>
> I don't think the server needs to know about what the client does,
> compression level wise.

Sorry, I do not understand you.
Compression takes place at the sender side: the client compresses its
messages before sending them to the server, and the server compresses the
responses sent to the client. In both cases a compression level may be
specified. And it is not clear whether the client should control the
compression level of the server.

>>> +<para>
>>> + Used compression algorithm. Right now the following streaming compression algorithms are supported: 'f' - Facebook zstd, 'z' - zlib, 'n' - no compression.
>>> +</para>
>>>
>>> I would prefer this just be referenced as zstd or zstandard, not
>>> facebook zstd. There's an RFC (albeit only "informational"), and it
>>> doesn't name facebook, except as an employer:
>>> https://tools.ietf.org/html/rfc8478
>>
>> Please notice that it is an internal encoding; the user will specify
>>
>> psql -d "dbname=postgres compression=zstd"
>>
>> If the name "zstd" is not good, I can choose any other.
>
> All I was saying is that I think you should not name it "Facebook zstd",
> just "zstd".

Sorry, it was my error.
Right now the compression name is not specified by the user at all: it is
just a boolean on/off value.

>>> I think you should just use something like
>>>
>>> pq_beginmessage(buf, 'z');
>>> pq_sendbyte(buf, compression_method_byte);
>>> pq_endmessage(buf);
>>>
>>> like most of the backend does? And if you, as I suggest, use a variable
>>> length compression identifier, use
>>>
>>> pq_sendcountedtext(buf, compression_method, strlen(compression_method));
>>>
>>> or such.
>>
>> Sorry, I can not do it this way, because the pq connection with the
>> client is not yet initialized, so I can not use pq_endmessage here and
>> have to call secure_write manually.
>
> I am not saying it's necessarily the best approach, but it doesn't seem
> like it'd be hard to have zpq not send / receive the data from the
> network, but just transform the data actually to-be-sent/received from
> the network.
>
> E.g. in pq_recvbuf() you could do something roughly like the following,
> after a successful secure_read()
>
> if (compression_enabled)
> {
>     zpq_decompress(network_data, network_data_length,
>                    &input_processed, &output_data, &output_data_length);
>     copy_data_into_recv_buffer(output_data, output_data_length);
>     network_data += input_processed;
>     network_data_length -= input_processed;
> }
>
> and the inverse on the network receive side. The advantage would be that
> we can then more easily use zpq stuff in other places.

Sorry, it is not possible, because to produce some output the compressor may
need to perform multiple reads. And doing it in two different places, i.e.
in pqsecure_read and secure_read, is worse than doing it in one place, as it
is done now.

> If we do go with the current callback model, I think we should at least
> extend it so that a 'context' parameter is shuffled through zpq, so that
> other code can work without global state.

The callback has a void* argument through which arbitrary data can be passed
to the callback.

>>> + r = PqStream
>>> +     ? zpq_read(PqStream, PqRecvBuffer + PqRecvLength,
>>> +                PQ_RECV_BUFFER_SIZE - PqRecvLength, &processed)
>>> +     : secure_read(MyProcPort, PqRecvBuffer + PqRecvLength,
>>> +                   PQ_RECV_BUFFER_SIZE - PqRecvLength);
>>> + PqRecvLength += processed;
>>>
>>> ?: doesn't make sense to me in this case. This should be an if/else.
>>
>> Isn't it a matter of style preference? Why is if/else in principle better
>> than ?: ?
>> From my point of view, the main role of the ?: construction is not just
>> saving some bytes/lines of code. It emphasizes that both branches of the
>> conditional produce the same result - as in this case, the result of a
>> read operation. If you replace it with if/else, this relation between the
>> actions done in the two branches is lost (or at least less clear).
>> But do you think that if/else in this case will lead to clearer or more
>> readable code?
>
> ?: makes sense when it's either much shorter, or when it allows to avoid
> repeating a bunch of code. E.g. if it's just a conditional argument to a
> function with a lot of other arguments.
> But when, as here, it just leads to weirder formatting, it really just
> makes the code harder to read.
>
> Yes, pretty clearly for me.
>
> r = PqStream
>     ? zpq_read(PqStream, PqRecvBuffer + PqRecvLength,
>                PQ_RECV_BUFFER_SIZE - PqRecvLength, &processed)
>     : secure_read(MyProcPort, PqRecvBuffer + PqRecvLength,
>                   PQ_RECV_BUFFER_SIZE - PqRecvLength);
>
> vs
>
> if (PqStream)
>     r = zpq_read(PqStream, PqRecvBuffer + PqRecvLength,
>                  PQ_RECV_BUFFER_SIZE - PqRecvLength, &processed);
> else
>     r = secure_read(MyProcPort, PqRecvBuffer + PqRecvLength,
>                     PQ_RECV_BUFFER_SIZE - PqRecvLength);
>
> the if / else are visually more clearly distinct, the sole added
> repetition is "r =". And most importantly if / else are more common.

I agree that sometimes ?: leads to more complex and obscure expressions, and
that ?: may be less clear than if/else: in most modern programming languages
there is no ?: construction, but if/else can be used as an expression. And
repetition is always bad, isn't it? Not just because of writing extra code,
but because of possible inconsistencies and errors (just like normal forms
in databases).
In any case, I do not think that it is a question of principle: if you
prefer, I can replace it with if/else.

>> Another question is whether a conditional expression here is really a
>> good idea at all. I would prefer to replace it with an indirect function
>> call...
>
> A branch is cheaper than indirect calls, FWIW

Really? Please notice that we are comparing not just a branch against an
indirect call, but a branch plus a direct call against an indirect call.
I didn't measure it, but in any case IMHO the performance aspect is less
important here than clarity and maintainability of the code. And an indirect
call is definitely better in this sense... Unfortunately I can not use it
here (certainly it is not absolutely impossible, but it would require many
more changes).

>>> +#define ZSTD_BUFFER_SIZE (8*1024)
>>> +#define ZSTD_COMPRESSION_LEVEL 1
>>>
>>> Add some arguments for choosing these parameters.
>>
>> What is the suggested way to specify them?
>> I can not put them in GUCs (because they are also needed at the client
>> side).
>
> I don't mean arguments as in configurable, I mean that you should add a
> comment explaining why 8KB / level 1 was chosen.

Sorry, will do.

>>> +/*
>>> + * Get list of the supported algorithms.
>>> + * Each algorithm is identified by one letter: 'f' - Facebook zstd, 'z' - zlib.
>>> + * Algorithm identifiers are appended to the provided buffer and terminated by '\0'.
>>> + */
>>> +void
>>> +zpq_get_supported_algorithms(char algorithms[ZPQ_MAX_ALGORITHMS])
>>> +{
>>> +    int i;
>>> +
>>> +    for (i = 0; zpq_algorithms[i].name != NULL; i++)
>>> +    {
>>> +        Assert(i < ZPQ_MAX_ALGORITHMS);
>>> +        algorithms[i] = zpq_algorithms[i].name();
>>> +    }
>>> +    Assert(i < ZPQ_MAX_ALGORITHMS);
>>> +    algorithms[i] = '\0';
>>> +}
>>>
>>> Uh, doesn't this bake ZPQ_MAX_ALGORITHMS into the ABI? That seems
>>> entirely unnecessary?
>>
>> I tried to avoid use of dynamic memory allocation because zpq is used
>> both in client and server environments with different memory allocation
>> policies.
>
> We support palloc in both. But even without that, you can just have
> one function to get the number of algorithms, and then have the caller
> pass in a large enough array.

Certainly it is possible.
But why complicate the implementation if it is an internal function used in
only a single place in the code?

>> And I am afraid that using the _pq_ parameter stuff makes enabling of
>> compression even less user friendly.
>
> I didn't intend to suggest that the _pq_ stuff should be used in a
> client facing manner. What I am suggesting is that if the connection string
> contains compression=, the driver would internally translate that to
> _pq_.compression to pass that on to the server, which would allow the
> connection establishment to succeed, even if the server is a bit older.

A new client can establish a connection with an old server if it is not
using compression.
And if it wants to use compression, then _pq_.compression will not help
much. An old server will ignore it (unlike the compression= option), but
then the client will wait for an acknowledgement from the server for the
chosen compression algorithm and will not receive one. Certainly it is
possible to handle this situation (if we did not receive the expected
message), but why is that better than adding a compression option to the
startup packet?
But if you think that adding new options to the startup packet should be
avoided as much as possible, then I will replace it with _pq_.compression.

Thank you very much for the review and explanations.

> Greetings,
>
> Andres Freund