Thread: adding 'zstd' as a compression algorithm

adding 'zstd' as a compression algorithm

From

Robert Haas

Date:

15 February 2022, 18:20:32

Hi,

Over on the rather-long "refactoring basebackup.c" thread, there is a
proposal, which I endorse, to add base-backup compression via zstd. To
do that, we'd need to patch configure.ac to create a new --with-zstd
flag and appropriate supporting infrastructure. My colleague Jeevan
Ladhe included that in a patch posted over there; I've extracted just
the part adding configure support for libzstd and attach it here. I
thought it would be good to have a new thread specifically devoted to
the topic of whether zstd is a thing that PostgreSQL ought to support
in general.

In general, deciding on new compression algorithms can feel a bit like
debating the merits of vi vs. emacs, or one political party vs.
another. Everyone has their own favorites, for reasons that can often
seem idiosyncratic. One advantage of zstd is that it is already being
used by other prominent open-source projects, including the Linux
kernel.[1] This means that it is unlikely to just dry up and vanish,
and it also reduces the risk of legal issues. On a technical level,
zstd offers compression ratios similar to or better than gzip, but
with much faster compression speed. Furthermore, the zstd library has
built-in multi-threaded compression which we may be able to leverage
for even better performance. In fact, parallel zstd might be able to
compress faster than lz4, which is already extremely fast.

What I imagine if this patch is accepted is that we (or our users)
will end up using lz4 for places where compression needs to be very
lightweight, and zstd for places where it's acceptable or even
desirable to spend more CPU cycles in exchange for better compression.
I think that gzip and pglz are really only of historical interest -
and I don't say that to mean that we shouldn't continue to support
them or that they won't get use. Lots of people are perfectly happy
with TOAST compression using pglz, and I'm perfectly happy if they
continue to do that forever, even though I'm glad LZ4 is now an
option. Likewise, I still download the .tar.gz version of anything
that gives me that option, basically because I'm familiar with the
format and it's easy for me to just carry on using it -- and in a
similar way I expect a lot of people will be happy to continue to
compress backups with gzip for many years to come. But I think there
is value in supporting newer and better technology, too. I realize
that we don't want to support every new and shiny thing that shows up,
but I don't think that's what I am proposing here.

Anyway, those are my thoughts. What are yours?

Thanks,

[1] https://en.wikipedia.org/wiki/Zstd#Usage

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

0001-Add-support-for-building-with-ZSTD.patch

Re: adding 'zstd' as a compression algorithm

From

Justin Pryzby

Date:

15 February 2022, 18:55:55

I think the WAL patch (4035cd5d4) should support zstd if library support is
added.  A supplementary patch for that already exists.
https://www.postgresql.org/message-id/YNqWd2GSMrnqWIfx@paquier.xyz

There's also some in-progress patches:

 - Konstantin and Daniil have a patch to add libpq compression, which ultimately
   wants to use zstd.
   https://commitfest.postgresql.org/37/3499/

 - I had a patch to add zstd to pg_dump, but I ended up storing backups to
   ZFS+zstd rather than waiting to progress the patch.
   https://commitfest.postgresql.org/32/2888/

Regarding the patch:

I suppose it should support windows, and whatever patches use zstd should
update the install instructions.  See 9ca40dcd4, 02a93e7ef, 4035cd5d4.

Re: adding 'zstd' as a compression algorithm

From

Peter Geoghegan

Date:

15 February 2022, 19:40:20

On Tue, Feb 15, 2022 at 10:20 AM Robert Haas <robertmhaas@gmail.com> wrote:
> In general, deciding on new compression algorithms can feel a bit like
> debating the merits of vi vs. emacs, or one political party vs.
> another.

Really? I don't get that impression myself. (That's not important, though.)

> What I imagine if this patch is accepted is that we (or our users)
> will end up using lz4 for places where compression needs to be very
> lightweight, and zstd for places where it's acceptable or even
> desirable to spend more CPU cycles in exchange for better compression.

Sounds reasonable to me.

> Likewise, I still download the .tar.gz version of anything
> that gives me that option, basically because I'm familiar with the
> format and it's easy for me to just carry on using it -- and in a
> similar way I expect a lot of people will be happy to continue to
> compress backups with gzip for many years to come.

Maybe, but it seems more likely that they'll usually do whatever the
easiest reasonable-seeming thing is. Perhaps we need to distinguish
between "container format" (e.g., a backup that is produced by a tool
like pgBackrest, including backup manifest) and compression algorithm.

Isn't it an incontrovertible fact that LZ4 is superior to pglz in
every way? LZ4 is pretty much its successor. And so it seems totally
fine to assume that users will always want to use the clearly better
option, and that that option will be generally available going
forward. TOAST compression is applied selectively already, based on
various obscure implementation details, some of which are quite
arbitrary.

I know less about zstd, but I imagine that a similar dynamic exists
there. Overall, everything that you've said makes sense to me.

-- 
Peter Geoghegan

Re: adding 'zstd' as a compression algorithm

From

Andres Freund

Date:

15 February 2022, 19:54:10

Hi,

On 2022-02-15 11:40:20 -0800, Peter Geoghegan wrote:
> On Tue, Feb 15, 2022 at 10:20 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > Likewise, I still download the .tar.gz version of anything
> > that gives me that option, basically because I'm familiar with the
> > format and it's easy for me to just carry on using it -- and in a
> > similar way I expect a lot of people will be happy to continue to
> > compress backups with gzip for many years to come.

There's a difference between downloading a source tarball of 1.5x the size,
and 3x the decompression cost (made up numbers), because the cost of either is
not a relevant factor. I think basebackups are a different story.

> Isn't it an incontrovertible fact that LZ4 is superior to pglz in
> every way? LZ4 is pretty much its successor. And so it seems totally
> fine to assume that users will always want to use the clearly better
> option, and that that option will be generally available going
> forward. TOAST compression is applied selectively already, based on
> various obscure implementation details, some of which are quite
> arbitrary.

Yea, we should really default to lz4 in initdb when available. Expecting users
to know about magic incantation to get often vastly superior performance isn't
a good approach. And it even makes initdb itself faster, because it turns out
we compress enough that the cpu cycles for it are a measurable factor.

> I know less about zstd, but I imagine that a similar dynamic exists
> there. Overall, everything that you've said makes sense to me.

IMO zstd has pretty much won the "gzip" type usecase of compression. Even
prior lz4 uses have even been converted to it because it's often cheap enough.

Greetings,

Andres Freund

Re: adding 'zstd' as a compression algorithm

From

Robert Haas

Date:

15 February 2022, 20:00:33

On Tue, Feb 15, 2022 at 2:40 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Feb 15, 2022 at 10:20 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > In general, deciding on new compression algorithms can feel a bit like
> > debating the merits of vi vs. emacs, or one political party vs.
> > another.
>
> Really? I don't get that impression myself. (That's not important, though.)

Well, that's a good thing. Maybe we're getting better at keeping
things constructive...

> > What I imagine if this patch is accepted is that we (or our users)
> > will end up using lz4 for places where compression needs to be very
> > lightweight, and zstd for places where it's acceptable or even
> > desirable to spend more CPU cycles in exchange for better compression.
>
> Sounds reasonable to me.

Cool.

> > Likewise, I still download the .tar.gz version of anything
> > that gives me that option, basically because I'm familiar with the
> > format and it's easy for me to just carry on using it -- and in a
> > similar way I expect a lot of people will be happy to continue to
> > compress backups with gzip for many years to come.
>
> Maybe, but it seems more likely that they'll usually do whatever the
> easiest reasonable-seeming thing is. Perhaps we need to distinguish
> between "container format" (e.g., a backup that is produced by a tool
> like pgBackrest, including backup manifest) and compression algorithm.

I'm not sure I completely follow this. There are cases where we use
compression algorithms for internal purposes, and we can change the
defaults without users knowing or caring. For example, we could decide
that LZ4-compressing FPIs in WAL records is a sensible default and
just start doing it, and users need not care. But backups are
different, because when you pg_basebackup -Ft, you get .tar or .tar.gz
or .tar.lz4 files which we don't give you tools to extract. You have
to be able to extract them with OS tools. In cases like that, people
are going to be more likely to pick formats with which they are
familiar even if they are not technically best, and I don't think we
really dare change the defaults. That seems OK, though.

> Isn't it an incontrovertible fact that LZ4 is superior to pglz in
> every way? LZ4 is pretty much its successor. And so it seems totally
> fine to assume that users will always want to use the clearly better
> option, and that that option will be generally available going
> forward. TOAST compression is applied selectively already, based on
> various obscure implementation details, some of which are quite
> arbitrary.

I pretty much agree with all of that. Nevertheless, there's a lot of
pglz-compressed data that's already on a disk someplace which people
are not necessarily keen to rewrite, and that will probably remain
true for the foreseable future. If we change the default to something
that's not pglz, and then wait 10 years, we MIGHT be able to get rid
of it without pushback. But I wouldn't be surprised to learn that even
then there are a lot of people who have just pg_upgrade'd the same
database over and over again. And honestly I think that's fine. Yeah,
lz4 is better. But, on some data sets, there's very little real
difference in the compression ratio you get. Until keeping pglz around
starts costing us something tangible, there's no reason to spend any
effort worrying about it.

> I know less about zstd, but I imagine that a similar dynamic exists
> there. Overall, everything that you've said makes sense to me.

Great, thanks for chiming in.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: adding 'zstd' as a compression algorithm

From

Robert Haas

Date:

15 February 2022, 20:09:37

On Tue, Feb 15, 2022 at 2:54 PM Andres Freund <andres@anarazel.de> wrote:
> There's a difference between downloading a source tarball of 1.5x the size,
> and 3x the decompression cost (made up numbers), because the cost of either is
> not a relevant factor. I think basebackups are a different story.

To be clear, I'm not saying that people shouldn't choose to adopt the
new stuff. I'm just saying that many of them won't, for various
reasons, including inertia. There may come a point when we want to
make a push to remove obsolete stuff, but I think what is far more
important right now is making the new stuff available.

> Yea, we should really default to lz4 in initdb when available. Expecting users
> to know about magic incantation to get often vastly superior performance isn't
> a good approach. And it even makes initdb itself faster, because it turns out
> we compress enough that the cpu cycles for it are a measurable factor.

I don't mind if you want to propose a patch for that, but it seems
like a separate topic from this thread.

> IMO zstd has pretty much won the "gzip" type usecase of compression. Even
> prior lz4 uses have even been converted to it because it's often cheap enough.

You know more about this than I do...

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: adding 'zstd' as a compression algorithm

From

Robert Haas

Date:

15 February 2022, 20:23:30

On Tue, Feb 15, 2022 at 1:55 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> Regarding the patch:
>
> I suppose it should support windows, and whatever patches use zstd should
> update the install instructions.  See 9ca40dcd4, 02a93e7ef, 4035cd5d4.

Thanks, this is helpful. To me, it looks like we need something
similar to 9ca40dcd4, plus something like the last hunk of 02a93e7ef.
The rest of 02a93e7ef and all of 4035cd5d4 seem like they address the
application of lz4 rather than just adding support for it, so I don't
see the need for those elements in this patch. Am I missing something?

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: adding 'zstd' as a compression algorithm

From

Justin Pryzby

Date:

15 February 2022, 20:33:32

On Tue, Feb 15, 2022 at 03:23:30PM -0500, Robert Haas wrote:
> On Tue, Feb 15, 2022 at 1:55 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > Regarding the patch:
> >
> > I suppose it should support windows, and whatever patches use zstd should
> > update the install instructions.  See 9ca40dcd4, 02a93e7ef, 4035cd5d4.
> 
> Thanks, this is helpful. To me, it looks like we need something
> similar to 9ca40dcd4, plus something like the last hunk of 02a93e7ef.

If you mean a patch which only adds the configure options, I don't think 02a9
is applicable at all.

What I mean is that any patches which *use* the zstd support should update
those for completeness.

 doc/src/sgml/config.sgml                      | 14 +++++++++-----
 doc/src/sgml/install-windows.sgml             |  2 +-
 doc/src/sgml/installation.sgml                |  5 +++--

> The rest of 02a93e7ef and all of 4035cd5d4 seem like they address the
> application of lz4 rather than just adding support for it, so I don't
> see the need for those elements in this patch. Am I missing something?

Re: adding 'zstd' as a compression algorithm

From

Andres Freund

Date:

15 February 2022, 21:42:55

Hi,

On 2022-02-15 15:09:37 -0500, Robert Haas wrote:
> On Tue, Feb 15, 2022 at 2:54 PM Andres Freund <andres@anarazel.de> wrote:
> > There's a difference between downloading a source tarball of 1.5x the size,
> > and 3x the decompression cost (made up numbers), because the cost of either is
> > not a relevant factor. I think basebackups are a different story.
> 
> To be clear, I'm not saying that people shouldn't choose to adopt the
> new stuff. I'm just saying that many of them won't, for various
> reasons, including inertia. There may come a point when we want to
> make a push to remove obsolete stuff, but I think what is far more
> important right now is making the new stuff available.

I think well before removing stuff we should default to decent compression
algorithms. E.g. -Z/--compress should probably not continue to use gzip if
better things are available.

Greetings,

Andres Freund

Re: adding 'zstd' as a compression algorithm

From

Peter Geoghegan

Date:

15 February 2022, 22:10:47

On Tue, Feb 15, 2022 at 12:00 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I'm not sure I completely follow this. There are cases where we use
> compression algorithms for internal purposes, and we can change the
> defaults without users knowing or caring. For example, we could decide
> that LZ4-compressing FPIs in WAL records is a sensible default and
> just start doing it, and users need not care. But backups are
> different, because when you pg_basebackup -Ft, you get .tar or .tar.gz
> or .tar.lz4 files which we don't give you tools to extract.

What I meant is that you're buying into an ecosystem by choosing to
use a tool like pgBackrest. That might not be the right thing to do
universally, but it comes pretty close. You as a user are largely
deferring to the maintainers and their choice of defaults. Not
entirely, of course, but to a significant degree, especially in
matters like this. There aren't that many dimensions to the problem
once the question of compatibility is settled (it's settled here
because you've already bought into a tool that requires the library as
standard, say).

A lot of things are *not* like that, but ISTM that backup compression
really is -- we have the luxury of a constrained problem space, for
once. There aren't that many metrics to consider, because it must be
lossless compression in any case, and because the requirements are
relatively homogenous. The truly important thing (once compatibility
is accounted for) is to use something basically reasonable, with no
noticeable weaknesses relative to any of the alternatives.

> I pretty much agree with all of that. Nevertheless, there's a lot of
> pglz-compressed data that's already on a disk someplace which people
> are not necessarily keen to rewrite, and that will probably remain
> true for the foreseable future. If we change the default to something
> that's not pglz, and then wait 10 years, we MIGHT be able to get rid
> of it without pushback. But I wouldn't be surprised to learn that even
> then there are a lot of people who have just pg_upgrade'd the same
> database over and over again. And honestly I think that's fine.

I think that it's fine too. It's unlikely that anybody is going to go
to any trouble to get better compression, certainly, but we should
give them every opportunity to use better alternatives. Ideally
without anybody having to even think about it.

-- 
Peter Geoghegan

Re: adding 'zstd' as a compression algorithm

From

David Steele

Date:

16 February 2022, 00:09:07

On 2/15/22 16:10, Peter Geoghegan wrote:
> On Tue, Feb 15, 2022 at 12:00 PM Robert Haas <robertmhaas@gmail.com> wrote:
>> I'm not sure I completely follow this. There are cases where we use
>> compression algorithms for internal purposes, and we can change the
>> defaults without users knowing or caring. For example, we could decide
>> that LZ4-compressing FPIs in WAL records is a sensible default and
>> just start doing it, and users need not care. 

I'm not sure that this is entirely true. lz4 is available on most 
non-EOL distros but you don't need to look to far to find a distro that 
does not have it. So, choosing lz4 by default could impact binary 
portability. Of course, there are lots of other factors that impact 
binary portability, e.g. collations, but this would add one more.

This is even more true for zstd since it is not as ubiquitous as lz4. In 
fact, it is not even included with base RHEL8. You need to install EPEL 
to get zstd.

>> But backups are
>> different, because when you pg_basebackup -Ft, you get .tar or .tar.gz
>> or .tar.lz4 files which we don't give you tools to extract.
> 
> What I meant is that you're buying into an ecosystem by choosing to
> use a tool like pgBackrest. That might not be the right thing to do
> universally, but it comes pretty close. You as a user are largely
> deferring to the maintainers and their choice of defaults. Not
> entirely, of course, but to a significant degree, especially in
> matters like this. There aren't that many dimensions to the problem
> once the question of compatibility is settled (it's settled here
> because you've already bought into a tool that requires the library as
> standard, say).

As much as we would like to, we have not changed the default compression 
for pgBackRest. lz4 and zstd are both optional at build time since they 
are not always available (though lz4 is getting close). So we have left 
gzip as the default, though a lot of that has to do with maintaining 
compatibility with prior versions of pgBackRest.

Having said all that, I'm absolutely in favor of adding zstd to Postgres 
as an optional compression format.

Regards,
David

Re: adding 'zstd' as a compression algorithm

From

Michael Paquier

Date:

16 February 2022, 00:58:44

On Tue, Feb 15, 2022 at 01:42:55PM -0800, Andres Freund wrote:
> I think well before removing stuff we should default to decent compression
> algorithms. E.g. -Z/--compress should probably not continue to use gzip if
> better things are available.

Which one of lz4 or zstd would it be though?  LZ4 is light on CPU, and
compresses less than zstd which is more CPU intensive with more
compression, so the choice does not look that straight-forward to me.
Both of them are better than zlib, still they are less available on
the platforms Postgres needs to provide support for.  It does not seem
really intuitive to me to have different defaults depending on the
build options provided, either :/

Saying that, I'd like to think that zstd would still be a better
default choice than LZ4.
--
Michael

Attachment

signature.asc

Re: adding 'zstd' as a compression algorithm

From

Robert Haas

Date:

16 February 2022, 01:35:20

On Tue, Feb 15, 2022 at 7:09 PM David Steele <david@pgmasters.net> wrote:
> I'm not sure that this is entirely true. lz4 is available on most
> non-EOL distros but you don't need to look to far to find a distro that
> does not have it. So, choosing lz4 by default could impact binary
> portability. Of course, there are lots of other factors that impact
> binary portability, e.g. collations, but this would add one more.
>
> This is even more true for zstd since it is not as ubiquitous as lz4. In
> fact, it is not even included with base RHEL8. You need to install EPEL
> to get zstd.

Yeah, I agree. One thing I was thinking about but didn't include in
the previous email is that if we did choose to make something like LZ4
the default, it would presumably only be the default on builds that
include LZ4 support. Other builds would need to use something else,
unless we just chose to make LZ4 a hard requirement, which would be
bolder than we usually are. And that has the consequence that you
mention. It's something we should consider as we think about changing
defaults.

> Having said all that, I'm absolutely in favor of adding zstd to Postgres
> as an optional compression format.

Cool.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: adding 'zstd' as a compression algorithm

From

Tom Lane

Date:

16 February 2022, 01:56:15

Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Feb 15, 2022 at 7:09 PM David Steele <david@pgmasters.net> wrote:
>> This is even more true for zstd since it is not as ubiquitous as lz4. In
>> fact, it is not even included with base RHEL8. You need to install EPEL
>> to get zstd.

> Yeah, I agree. One thing I was thinking about but didn't include in
> the previous email is that if we did choose to make something like LZ4
> the default, it would presumably only be the default on builds that
> include LZ4 support. Other builds would need to use something else,
> unless we just chose to make LZ4 a hard requirement, which would be
> bolder than we usually are. And that has the consequence that you
> mention. It's something we should consider as we think about changing
> defaults.

Yeah.  I'm +1 on adding zstd as an option, but I think we need to
move *very* slowly on changing any defaults for user-accessible
data (like backup files).  Maybe we have a bit more flexibility
for TOAST, not sure.

            regards, tom lane

Re: adding 'zstd' as a compression algorithm

From

Andres Freund

Date:

16 February 2022, 02:24:13

Hi,

On 2022-02-16 09:58:44 +0900, Michael Paquier wrote:
> On Tue, Feb 15, 2022 at 01:42:55PM -0800, Andres Freund wrote:
> > I think well before removing stuff we should default to decent compression
> > algorithms. E.g. -Z/--compress should probably not continue to use gzip if
> > better things are available.
> 
> Which one of lz4 or zstd would it be though?  LZ4 is light on CPU, and
> compresses less than zstd which is more CPU intensive with more
> compression, so the choice does not look that straight-forward to me.

For backups it's pretty obviously zstd imo. At the lower levels it achieves
considerably higher compression ratios while still being vastly faster than
gzip. Without even using the threaded compression support the library has.

For something like wal_compression it'd be a harder question.

Greetings,

Andres Freund

Re: adding 'zstd' as a compression algorithm

From

Michael Paquier

Date:

16 February 2022, 02:33:19

On Tue, Feb 15, 2022 at 02:33:32PM -0600, Justin Pryzby wrote:
> What I mean is that any patches which *use* the zstd support should update
> those for completeness.
>
>  doc/src/sgml/config.sgml                      | 14 +++++++++-----
>  doc/src/sgml/install-windows.sgml             |  2 +-
>  doc/src/sgml/installation.sgml                |  5 +++--

Yes, the patch misses the fact that each ./configure switch is
documented.  For consistency, I think that you should also add that in
the MSVC scripts in the first version.  It is really straight-forward
to do so, and it should be just a matter of grepping for the code
paths of lz4, then adjust things for zstd.
--
Michael

Attachment

signature.asc

Re: adding 'zstd' as a compression algorithm

From

Andres Freund

Date:

16 February 2022, 02:42:35

Hi,m

On 2022-02-15 20:56:15 -0500, Tom Lane wrote:
> Maybe we have a bit more flexibility for TOAST, not sure.

It'd be nice to at least add it as an option for initdb. Afaics there's no way
to change the default at that point. initdb itself is measurably
faster. Although sadly it's a bigger difference for optimized builds.

I think it matter a bit even after initdb, pg_rewrite is most of the
compressed data and some of the views are regularly used...

- Andres

Re: adding 'zstd' as a compression algorithm

From

Robert Haas

Date:

16 February 2022, 15:40:01

On Tue, Feb 15, 2022 at 9:33 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Feb 15, 2022 at 02:33:32PM -0600, Justin Pryzby wrote:
> > What I mean is that any patches which *use* the zstd support should update
> > those for completeness.
> >
> >  doc/src/sgml/config.sgml                      | 14 +++++++++-----
> >  doc/src/sgml/install-windows.sgml             |  2 +-
> >  doc/src/sgml/installation.sgml                |  5 +++--
>
> Yes, the patch misses the fact that each ./configure switch is
> documented.  For consistency, I think that you should also add that in
> the MSVC scripts in the first version.  It is really straight-forward
> to do so, and it should be just a matter of grepping for the code
> paths of lz4, then adjust things for zstd.

Makes sense. Here's an attempt by me to do that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

v2-0001-Add-support-for-building-with-ZSTD.patch

Re: adding 'zstd' as a compression algorithm

From

Michael Paquier

Date:

17 February 2022, 01:48:13

On Tue, Feb 15, 2022 at 06:24:13PM -0800, Andres Freund wrote:
> For backups it's pretty obviously zstd imo. At the lower levels it achieves
> considerably higher compression ratios while still being vastly faster than
> gzip. Without even using the threaded compression support the library has.

Noted.

> For something like wal_compression it'd be a harder question.

FWIW, I have done some measurements for wal_compression with zstd, as
of:
https://www.postgresql.org/message-id/YMmlvyVyAFlxZ+/H@paquier.xyz

The result is not surprising, a bit more CPU for zstd with more
compression compared to LZ4, both outclassing easily zlib.  I am not
sure which one would be more adapted as default as FPI patterns depend
on the workload, for one, and this is just one corner case.
--
Michael

Attachment

signature.asc

Re: adding 'zstd' as a compression algorithm

From

Michael Paquier

Date:

17 February 2022, 04:34:08

On Wed, Feb 16, 2022 at 10:40:01AM -0500, Robert Haas wrote:
> On Tue, Feb 15, 2022 at 9:33 PM Michael Paquier <michael@paquier.xyz> wrote:
>> Yes, the patch misses the fact that each ./configure switch is
>> documented.  For consistency, I think that you should also add that in
>> the MSVC scripts in the first version.  It is really straight-forward
>> to do so, and it should be just a matter of grepping for the code
>> paths of lz4, then adjust things for zstd.
>
> Makes sense. Here's an attempt by me to do that.

Thanks.  This looks pretty much right, except for two things that I
have taken the freedom to fix as of the v3 attached.

%define needs to include HAVE_LIBZSTD, HAVE_ZSTD_H and USE_ZSTD, so
this version fails the sanity check between pg_config.h.in and the
MSVC scripts checking that all flags exist.

@@ -351,6 +351,7 @@ XGETTEXT = @XGETTEXT@
 GZIP   = gzip
 BZIP2  = bzip2
 LZ4    = @LZ4@
+ZSTD   = @ZSTD@
A similar refresh is needed in vcregress.pl.

+       $proj->AddLibrary($self->{options}->{zstd} . '\lib\libzstd.lib');
The upstream code is also using this library name, so that should be
fine.
--
Michael

Attachment

Re: adding 'zstd' as a compression algorithm

From

Andrey Borodin

Date:

17 February 2022, 06:12:50


> 15 февр. 2022 г., в 23:20, Robert Haas <robertmhaas@gmail.com> написал(а):
>
> Anyway, those are my thoughts. What are yours?

+1 for adding Zstd everywhere. Quite similar patches are already a part of "libpq compression" and "zstd for pg_dump"
commitfestentries, so pushing this will simplify those entries IMV. 

Also I like the idea of defaulting FPI compression to lz4 and adding Zstd as wal_compression option.
I don't think Zstd is obvious choice for default FPI compression: there are HW setups when disks are almost as fast as
RAM.In this case lz4 can improve performance, while Zstd might make things slower. 

Thanks!

Best regards, Andrey Borodin.

Re: adding 'zstd' as a compression algorithm

From

Robert Haas

Date:

17 February 2022, 14:40:13

On Wed, Feb 16, 2022 at 11:34 PM Michael Paquier <michael@paquier.xyz> wrote:
> Thanks.  This looks pretty much right, except for two things that I
> have taken the freedom to fix as of the v3 attached.

Ah, OK, cool. It seems like we have consensus on this being a good
direction, so I plan to commit this later today unless objections or
additional review comments turn up.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: adding 'zstd' as a compression algorithm

From

Michael Paquier

Date:

18 February 2022, 01:18:51

On Thu, Feb 17, 2022 at 09:40:13AM -0500, Robert Haas wrote:
> Ah, OK, cool. It seems like we have consensus on this being a good
> direction, so I plan to commit this later today unless objections or
> additional review comments turn up.

So, will there be a part of the system where we'll make use of that in
15?  pg_basebackup for server-side and client-side compression, at
least, as far as I can guess?
--
Michael

Attachment

signature.asc

Re: adding 'zstd' as a compression algorithm

From

Andres Freund

Date:

18 February 2022, 01:36:12

Hi,

On 2022-02-17 13:34:08 +0900, Michael Paquier wrote:
> %define needs to include HAVE_LIBZSTD, HAVE_ZSTD_H and USE_ZSTD, so
> this version fails the sanity check between pg_config.h.in and the
> MSVC scripts checking that all flags exist.

Do we really need all three defines? How about using AC_CHECK_HEADER() instead
of AC_CHECK_HEADERS()? That wouldn't define HAVE_ZSTD_H. Cases where we error
out if a header isn't found make it a bit pointless to then still define
HAVE_*_H.  Plenty other cases in configure.ac just use AC_CHECK_HEADER.

Greetings,

Andres Freund

Re: adding 'zstd' as a compression algorithm

From

Robert Haas

Date:

18 February 2022, 13:08:23

On Thu, Feb 17, 2022 at 8:18 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Thu, Feb 17, 2022 at 09:40:13AM -0500, Robert Haas wrote:
> > Ah, OK, cool. It seems like we have consensus on this being a good
> > direction, so I plan to commit this later today unless objections or
> > additional review comments turn up.
>
> So, will there be a part of the system where we'll make use of that in
> 15?  pg_basebackup for server-side and client-side compression, at
> least, as far as I can guess?

That's my plan, yes. I think the patches only need a small amount of
work at this point, but I'd like to get this part committed first.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: adding 'zstd' as a compression algorithm

From

Robert Haas

Date:

18 February 2022, 13:48:42

On Thu, Feb 17, 2022 at 8:36 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-02-17 13:34:08 +0900, Michael Paquier wrote:
> > %define needs to include HAVE_LIBZSTD, HAVE_ZSTD_H and USE_ZSTD, so
> > this version fails the sanity check between pg_config.h.in and the
> > MSVC scripts checking that all flags exist.
>
> Do we really need all three defines? How about using AC_CHECK_HEADER() instead
> of AC_CHECK_HEADERS()? That wouldn't define HAVE_ZSTD_H. Cases where we error
> out if a header isn't found make it a bit pointless to then still define
> HAVE_*_H.  Plenty other cases in configure.ac just use AC_CHECK_HEADER.

I have to admit to being somewhat confused by the apparent lack of
consistency in the way we do configure checks. The ZSTD check we added
here is just based on the LZ4 check just above it, which was the
result of my commit of Dilip's patch to add LZ4 TOAST compression. So
if we want to do something different we should change them both. But
that just begs the question of why the LZ4 support looks the way it
does, and to be honest I don't recall. The zlib and XSLT formulas just
above are much simpler, but for some reason what we're doing here
seems to be based on the more-complex formula we use for XML support
instead of either of those.

But having said that, the proposed patch adds no new call to
AC_CHECK_HEADER(), and does add a new call to AC_CHECK_HEADERS(), so I
don't understand the specifics of your complaint here.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: adding 'zstd' as a compression algorithm

From

Robert Haas

Date:

18 February 2022, 14:02:43

On Fri, Feb 18, 2022 at 8:48 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Feb 17, 2022 at 8:36 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2022-02-17 13:34:08 +0900, Michael Paquier wrote:
> > > %define needs to include HAVE_LIBZSTD, HAVE_ZSTD_H and USE_ZSTD, so
> > > this version fails the sanity check between pg_config.h.in and the
> > > MSVC scripts checking that all flags exist.
> >
> > Do we really need all three defines? How about using AC_CHECK_HEADER() instead
> > of AC_CHECK_HEADERS()? That wouldn't define HAVE_ZSTD_H. Cases where we error
> > out if a header isn't found make it a bit pointless to then still define
> > HAVE_*_H.  Plenty other cases in configure.ac just use AC_CHECK_HEADER.
>
> I have to admit to being somewhat confused by the apparent lack of
> consistency in the way we do configure checks. The ZSTD check we added
> here is just based on the LZ4 check just above it, which was the
> result of my commit of Dilip's patch to add LZ4 TOAST compression. So
> if we want to do something different we should change them both. But
> that just begs the question of why the LZ4 support looks the way it
> does, and to be honest I don't recall. The zlib and XSLT formulas just
> above are much simpler, but for some reason what we're doing here
> seems to be based on the more-complex formula we use for XML support
> instead of either of those.
>
> But having said that, the proposed patch adds no new call to
> AC_CHECK_HEADER(), and does add a new call to AC_CHECK_HEADERS(), so I
> don't understand the specifics of your complaint here.

Oh wait ... you want it the other way. Yeah, that seems harmless to
change. I wonder how many others there are that could be changed
similarly...

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: adding 'zstd' as a compression algorithm

From

Robert Haas

Date:

18 February 2022, 14:24:41

On Fri, Feb 18, 2022 at 9:02 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Oh wait ... you want it the other way. Yeah, that seems harmless to
> change. I wonder how many others there are that could be changed
> similarly...

I went through configure.ac looking for instances of
AC_CHECK_HEADERS() where the corresponding symbol was not used. I
found four:

AC_CHECK_HEADERS(lz4.h, [], [AC_MSG_ERROR([lz4.h header file is
required for LZ4])])
AC_CHECK_HEADERS(gssapi/gssapi.h, [], [AC_MSG_ERROR([gssapi.h header
file is required for GSSAPI])])])
AC_CHECK_HEADERS(ldap.h, [], [AC_MSG_ERROR([header file <ldap.h> is
required for LDAP])]
AC_CHECK_HEADER(winldap.h, [], ...stuff...)

I guess we could clean all of those up similarly.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: adding 'zstd' as a compression algorithm

From

Robert Haas

Date:

18 February 2022, 14:52:52

On Fri, Feb 18, 2022 at 9:24 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Feb 18, 2022 at 9:02 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > Oh wait ... you want it the other way. Yeah, that seems harmless to
> > change. I wonder how many others there are that could be changed
> > similarly...
>
> I went through configure.ac looking for instances of
> AC_CHECK_HEADERS() where the corresponding symbol was not used. I
> found four:
>
> AC_CHECK_HEADERS(lz4.h, [], [AC_MSG_ERROR([lz4.h header file is
> required for LZ4])])
> AC_CHECK_HEADERS(gssapi/gssapi.h, [], [AC_MSG_ERROR([gssapi.h header
> file is required for GSSAPI])])])
> AC_CHECK_HEADERS(ldap.h, [], [AC_MSG_ERROR([header file <ldap.h> is
> required for LDAP])]
> AC_CHECK_HEADER(winldap.h, [], ...stuff...)
>
> I guess we could clean all of those up similarly.

So here's a revised patch for zstd support that uses
AC_CHECK_HEADER(), plus a second patch to change the occurrences above
to use AC_CHECK_HEADER() and remove all traces of the corresponding
preprocessor symbol.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: adding 'zstd' as a compression algorithm

From

Robert Haas

Date:

18 February 2022, 18:53:39

On Fri, Feb 18, 2022 at 9:52 AM Robert Haas <robertmhaas@gmail.com> wrote:
> So here's a revised patch for zstd support that uses
> AC_CHECK_HEADER(), plus a second patch to change the occurrences above
> to use AC_CHECK_HEADER() and remove all traces of the corresponding
> preprocessor symbol.

And I've now committed the first of these.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: adding 'zstd' as a compression algorithm

From

Andres Freund

Date:

19 February 2022, 21:54:24

On 2022-02-18 09:52:52 -0500, Robert Haas wrote:
> plus a second patch to change the occurrences above to use AC_CHECK_HEADER()
> and remove all traces of the corresponding preprocessor symbol.

LGTM.  I'm not entirely sure the gssapi one is a real improvement, because we
kind of test for that branch, via the #else in HAVE_GSSAPI_H. But on balance,
I'd probably go for it.