Thread: wal_compression=zstd
Since 4035cd5d4, wal_compression=lz4 is supported.  I think pg15 should also
support wal_compression=zstd.  There are free bits in the WAL header.

The last message on that thread includes a patch doing just that, which I've
rebased:
https://www.postgresql.org/message-id/YNqWd2GSMrnqWIfx@paquier.xyz

It might be nice to also support wal_compression=zstd-level, but that seems
optional and could wait for pg16...

In [0], I showed a case where lz4 is just as fast as uncompressed data, and
writes ~60% as much.  zstd writes half as much as LZ4 and is 12% slower.

[0] https://www.postgresql.org/message-id/20210614012412.GA31772%40telsasoft.com

I am not claiming that zstd is generally better for WAL.  Rather, if we're
going to support alternate compression methods, it's nice to give a couple of
options (in addition to pglz).  Some use cases would certainly suffer from
slower WAL, but providing the option will benefit some others.  Note that a
superuser can set wal_compression for a given session - this would probably
benefit bulk-loading in an otherwise OLTP environment.

As written, this patch uses zstd level=1 (whereas the ZSTD's default compress
level is 6).

0001 adds support for zstd
0002 makes our use of bits in the WAL header more efficient
0003 makes it the default compression, to exercise it on CI (windows will fail).
0004 adds support for zstd-level
0005-6 are needed to allow recovery tests to pass when wal compression is
enabled;
0007 (not included) also adds support for zlib.  I'm of the impression nobody
cares about this, otherwise it would've been included 5-10 years ago.

Only 0001 should be reviewed for pg15 - the others are optional/future work.
On Tue, Feb 22, 2022 at 05:19:48PM -0600, Justin Pryzby wrote:
> I am not claiming that zstd is generally better for WAL.  Rather, if we're
> going to support alternate compression methods, it's nice to give a couple
> options (in addition to pglz).  Some use cases would certainly suffer from
> slower WAL.  But providing the option will benefit some others.  Note that a
> superuser can set wal_compression for a given session - this would probably
> benefit bulk-loading in an otherwise OLTP environment.

Well, I cannot say which one is better either as it depends on your
workload, mostly, but we know for sure that both have benefits, so I
don't mind adding it now.  What you are proposing is built on top of
the existing code, making it a very simple change.

> As written, this patch uses zstd level=1 (whereas the ZSTD's default compress
> level is 6).

Why?  ZSTD using this default has its reasons, no?  And it would be
consistent to do the same for ZSTD as for the other two methods.

-        The supported methods are <literal>pglz</literal> and
-        <literal>lz4</literal> (if <productname>PostgreSQL</productname> was
-        compiled with <option>--with-lz4</option>).  The default value is
-        <literal>off</literal>.  Only superusers can change this setting.
+        The supported methods are <literal>pglz</literal>, and
+        (if configured when <productname>PostgreSQL</productname>was built)
+        <literal>lz4</literal> and zstd.
+        The default value is <literal>off</literal>.
+        Only superusers can change this setting.

This is making the docs less verbose.  I think that you should keep the
mention to respectively --with-lz4 and --with-zstd after each value.

     <para>
      Build with <productname>ZSTD</productname> compression support.
+     This enables use of <productname>ZSTD</productname> for
+     compression of WAL data.

This addition is not necessary, see d7a9786.
Not related to your patch, but we have more reasons to add a check in the
block of BKPIMAGE_COMPRESSED() in RestoreBlockImage() of xlogreader.c to
make sure that only one bit is set for the compression type.  We could use
a pg_popcount() == 1 for that, returning report_invalid_record() when we
see some corrupted data.

As a whole, 0001 looks pretty good to me after a detailed read (not tested,
though).

> Only 0001 should be reviewed for pg15 - the others are optional/future work.

That's wiser for v15.  FWIW, I have the impression that we don't need most
of what's proposed in 0002~.

 /* compression methods supported */
-#define BKPIMAGE_COMPRESS_PGLZ 0x04
-#define BKPIMAGE_COMPRESS_LZ4  0x08
-#define BKPIMAGE_COMPRESS_ZSTD 0x10
-
+#define BKPIMAGE_COMPRESS_METHOD1 0x04 /* bits to encode compression method */
+#define BKPIMAGE_COMPRESS_METHOD2 0x08 /* 0=none, 1=pglz, 2=lz4, 3=zstd */

As of 0002.  We still have some room for this data and this makes the code
harder to follow, so I'd leave this part of the code as it is proposed in
0001.

0003, defaulting to zstd, and 0004 to extend compression to support a level
would require a lot of benchmarking to be justified.  I have argued against
making the code more complicated for such things in the past, with
reloptions.  The footprint on the code is much smaller, here, still..

0007, for ZLIB, does not make sense once one can choose between LZ4 and
ZSTD.
--
Michael
On Fri, Mar 04, 2022 at 04:19:32PM +0900, Michael Paquier wrote:
> On Tue, Feb 22, 2022 at 05:19:48PM -0600, Justin Pryzby wrote:
>> As written, this patch uses zstd level=1 (whereas the ZSTD's default
>> compress level is 6).
>
> Why?  ZSTD using this default has its reasons, no?  And it would be
> consistent to do the same for ZSTD as for the other two methods.

In my 1-off test, it gets 610/633 = 96% of the benefit at 209/273 = 77% of
the cost.

postgres=# SET wal_compression = 'zstd-6';
postgres=# \set QUIET \\ \timing on \\ SET max_parallel_maintenance_workers=0; SELECT pg_stat_reset_shared('wal'); begin; CREATE INDEX ON t(a); rollback; SELECT * FROM pg_stat_wal;
Time: 2729.017 ms (00:02.729)
wal_bytes | 6102403

postgres=# SET wal_compression = 'zstd-1';
postgres=# \set QUIET \\ \timing on \\ SET max_parallel_maintenance_workers=0; SELECT pg_stat_reset_shared('wal'); begin; CREATE INDEX ON t(a); rollback; SELECT * FROM pg_stat_wal;
Time: 2090.459 ms (00:02.090)
wal_bytes | 6330269

It may be relevant that we're only compressing 8k [0].  It would probably
pay off to preserve a compression "dictionary" to be re-used across FPIs.

> -        The supported methods are <literal>pglz</literal> and
> -        <literal>lz4</literal> (if <productname>PostgreSQL</productname> was
> -        compiled with <option>--with-lz4</option>).  The default value is
> -        <literal>off</literal>.  Only superusers can change this setting.
> +        The supported methods are <literal>pglz</literal>, and
> +        (if configured when <productname>PostgreSQL</productname>was built)
> +        <literal>lz4</literal> and zstd.
> +        The default value is <literal>off</literal>.
> +        Only superusers can change this setting.
>
> This is making the docs less verbose.  I think that you should keep
> the mention to respectively --with-lz4 and --with-zstd after each
> value.

I changed this relative to your latest zstd patch in July since it reads
better to me.  YMMV.
On Tue, Feb 22, 2022 at 05:19:48PM -0600, Justin Pryzby wrote:
>> 0001 adds support for zstd
>> 0002 makes more efficient our use of bits in the WAL header
>> 0003 makes it the default compression, to exercise on CI (windows will fail).
>> 0004 adds support for zstd-level
>> 0005-6 are needed to allow recovery tests to pass when wal compression is enabled;
>> 0007 (not included) also adds support for zlib.  I'm of the impression nobody
>> cares about this, otherwise it would've been included 5-10 years ago.
>
> 0003, defaulting to zstd, would require a lot of benchmarking to be
> justified.

Maybe you didn't see that I wrote that it was for CI?  (In any case, it's
impossible to do that without first taking care of 0005-6.)

> and 0004 to extend compression to support a level
> I have argued against making the code more complicated for such things in
> the past, with reloptions.

I suppose that you're referring to reloptions for toast compression, which
was about controlling LZ4 compression level.
https://www.postgresql.org/message-id/flat/CAFiTN-vMHU_yakMgNo90SUih_6LtvmqZ5hpQb2iTgQtVyOjyFA@mail.gmail.com

I think you're right that it's not interesting to control the compression
level of LZ4 - but there's no reason to say the same thing of zstd.  We're
already debating which is the "right" level to use (1 or 6).  I don't think
there is a "right" level - some people will want to trade more CPU for disk
writes and some people won't.
On Fri, Mar 4, 2022 at 6:44 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > Why?  ZSTD using this default has its reasons, no?  And it would be
> > consistent to do the same for ZSTD as for the other two methods.
>
> In my 1-off test, it gets 610/633 = 96% of the benefit at 209/273 = 77% of
> the cost.

I agree with Michael.  Your 1-off test is exactly that, and the results
will have depended on the data you used for the test.  I'm not saying we
could never decide to default to a compression level other than the
library's default, but I do not think we should do it casually or as the
result of any small number of tests.  There should be a strong presumption
that the authors of the library have a good idea what is sensible in
general unless we can explain compellingly why our use case is different
from typical ones.

There's an ease-of-use concern here too.  It's not going to make things any
easier for users to grok if zstd is available in different parts of the
system but has different defaults in each place.  It wouldn't be the end of
the world if that happened, but neither would it be ideal.

--
Robert Haas
EDB: http://www.enterprisedb.com
On 2/22/22 18:19, Justin Pryzby wrote:
> As written, this patch uses zstd level=1 (whereas the ZSTD's default
> compress level is 6).

I think this choice needs to be supported by some benchmarks.

> 0001 adds support for zstd
> 0002 makes more efficient our use of bits in the WAL header
> 0003 makes it the default compression, to exercise on CI (windows will fail).
> 0004 adds support for zstd-level
> 0005-6 are needed to allow recovery tests to pass when wal compression is enabled;
> 0007 (not included) also adds support for zlib.  I'm of the impression nobody
> cares about this, otherwise it would've been included 5-10 years ago.
>
> Only 0001 should be reviewed for pg15 - the others are optional/future work.

I don't see why patch 5 shouldn't be applied forthwith.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Fri, Mar 04, 2022 at 08:08:03AM -0500, Robert Haas wrote:
> On Fri, Mar 4, 2022 at 6:44 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
>> In my 1-off test, it gets 610/633 = 96% of the benefit at 209/273 = 77% of
>> the cost.

Hmm, it may be good to start afresh and compile numbers in a single chart.
I did that here with some numbers on the user and system CPU:
https://www.postgresql.org/message-id/YMmlvyVyAFlxZ+/H@paquier.xyz

For this test, regarding ZSTD, the lowest level did not have much
difference with the default level, and at the highest level the user CPU
spiked for little gain in compression.  All of them compressed more than
LZ4, with more CPU used in each case, but the default or a level value
lower than the default gives me the impression that it won't matter much in
terms of compression gains and CPU usage.

> I agree with Michael.  Your 1-off test is exactly that, and the results
> will have depended on the data you used for the test.  I'm not saying we
> could never decide to default to a compression level other than the
> library's default, but I do not think we should do it casually or as the
> result of any small number of tests.  There should be a strong presumption
> that the authors of the library have a good idea what is sensible in
> general unless we can explain compellingly why our use case is different
> from typical ones.
>
> There's an ease-of-use concern here too.  It's not going to make things
> any easier for users to grok if zstd is available in different parts of
> the system but has different defaults in each place.  It wouldn't be the
> end of the world if that happened, but neither would it be ideal.

I'd like to believe that anybody who writes his/her own compression
algorithm has a good idea of the default behavior they want to show, so we
could remain simple, and believe in them.
Now, I would not object to see some fresh numbers, and assuming that all
FPIs have the same page size, we could go down to designing a couple of
test cases that produce a fixed number of FPIs and measure the
compressibility in a single session.  Repeatability and randomness of data
counts; we could have for example one case with a set of 5~7 int
attributes, a second with text values that include random data, up to
10~12 bytes each, to count on the tuple header to be able to compress some
data, and a third with more repeatable data, like one attribute with an
int column populated with generate_series().  Just to give an idea.
--
Michael
On Fri, Mar 04, 2022 at 08:10:35AM -0500, Andrew Dunstan wrote:
> I don't see why patch 5 shouldn't be applied forthwith.

Only applying 0005 would result in a failure in the TAP test for a problem
whose fix is attempted in 0006.  This is an issue unrelated to this thread.
FWIW, I am a bit disturbed by EnsureTopTransactionIdLogged() and its design
in 0006, where we'd finish by using a XLogFlush() call within two SQL
functions, but I have not really looked at the problem to see if it is a
viable solution or not.
--
Michael
On Sat, Mar 05, 2022 at 07:26:39PM +0900, Michael Paquier wrote:
> Repeatability and randomness of data counts; we could have for example
> one case with a set of 5~7 int attributes, a second with text values
> that include random data, up to 10~12 bytes each, to count on the tuple
> header to be able to compress some data, and a third with more
> repeatable data, like one attribute with an int column populated with
> generate_series().  Just to give an idea.

And that's what I did as of the attached set of tests:
- Cluster on tmpfs.
- max_wal_size, min_wal_size at 2GB and shared_buffers at 1GB, aka large
  enough to include the full data set in memory.
- Rather than using Justin's full patch set, I have just patched the code
  in xloginsert.c to switch the level.
- One case with a table that uses one int attribute, with rather repetitive
  data worth 484MB.
- One case with a table using (int, text), where the text data is made of
  10~11 bytes of random data, worth ~340MB.
- Use pg_prewarm to load the data into shared buffers.  With the cluster
  mounted on a tmpfs that should not matter, though.
- Both tables have a fillfactor at 50 to give room to the updates.

I have measured the CPU usage with a toy extension, also attached, called
pg_rusage.  It is a simple wrapper around upstream's pg_rusage.c to
initialize a rusage snapshot and print its data, with two SQL functions
that are called just before and after the FPIs are generated (aka the
UPDATE query that rewrites the whole table in the script attached).

The quickly-hacked test script and the results are in test.tar.gz, for
reference.  The toy extension is pg_rusage.tar.gz.
Here are the results I compiled, as of results_format.sql in the tarball
attached:

             descr             | rel_size | fpi_size | time_s
-------------------------------+----------+----------+--------
 int column no compression     | 429 MB   | 727 MB   |  13.15
 int column zstd default level | 429 MB   | 523 MB   |  14.23
 int column zstd level 1       | 429 MB   | 524 MB   |  13.94
 int column zstd level 10      | 429 MB   | 523 MB   |  23.46
 int column zstd level 19      | 429 MB   | 523 MB   | 103.71
 int column lz4 default level  | 429 MB   | 575 MB   |  13.37
 int/text no compression       | 344 MB   | 558 MB   |  10.08
 int/text lz4 default level    | 344 MB   | 463 MB   |  10.29
 int/text zstd default level   | 344 MB   | 415 MB   |  11.48
 int/text zstd level 1         | 344 MB   | 418 MB   |  11.25
 int/text zstd level 10        | 344 MB   | 415 MB   |  20.59
 int/text zstd level 19        | 344 MB   | 413 MB   |  62.64
(12 rows)

I did not expect zstd to be this slow at a level of 10~, actually.  The
runtime (elapsed CPU time) got severely impacted at level 19, which I ran
just for fun to see how it would compare to a level of 10.  There is a
slight difference between the default level and a level of 1, but the
compressed size does not change much, nor does the CPU usage really change.

Attached is an updated patch, while on it, that I have tweaked before
running my own tests.  At the end, I'd still like to think that we'd better
stick with the default level for this parameter, as that's the suggestion
of upstream.  So I would like to move on with that for this patch.
--
Michael
On Fri, Mar 04, 2022 at 05:44:06AM -0600, Justin Pryzby wrote:
> On Fri, Mar 04, 2022 at 04:19:32PM +0900, Michael Paquier wrote:
>> On Tue, Feb 22, 2022 at 05:19:48PM -0600, Justin Pryzby wrote:
>>> As written, this patch uses zstd level=1 (whereas the ZSTD's default
>>> compress level is 6).
>>
>> Why?  ZSTD using this default has its reasons, no?  And it would be
>> consistent to do the same for ZSTD as for the other two methods.
>
> In my 1-off test, it gets 610/633 = 96% of the benefit at 209/273 = 77% of
> the cost.

Actually, my test used zstd-6, rather than the correct default of 3.  The
comparison should have been:

postgres=# SET wal_compression='zstd-1';
postgres=# \set QUIET \\ \timing on \\ SET max_parallel_maintenance_workers=0; SELECT pg_stat_reset_shared('wal'); begin; CREATE INDEX ON t(a); rollback; SELECT * FROM pg_stat_wal;
Time: 2074.046 ms (00:02.074)
        2763 |    2758 |   6343591 |                0 |         5 |        5 |              0 |             0 | 2022-03-05 05:04:08.599867-06

vs

postgres=# SET wal_compression='zstd-3';
postgres=# \set QUIET \\ \timing on \\ SET max_parallel_maintenance_workers=0; SELECT pg_stat_reset_shared('wal'); begin; CREATE INDEX ON t(a); rollback; SELECT * FROM pg_stat_wal;
Time: 2471.552 ms (00:02.472)
 wal_records | wal_fpi | wal_bytes | wal_buffers_full | wal_write | wal_sync | wal_write_time | wal_sync_time |          stats_reset
-------------+---------+-----------+------------------+-----------+----------+----------------+---------------+-------------------------------
        2762 |    2746 |   6396890 |              274 |       274 |        0 |              0 |             0 | 2022-03-05 05:04:31.283432-06

=> zstd-1 actually wrote less than zstd-3 (which is odd) but by an
insignificant amount.  It's no surprise that zstd-1 is faster than zstd-3,
but (of course) by a smaller amount than zstd-6.

Anyway there's no compelling reason to not use the default.  If we were to
use a non-default default, we'd have to choose between 1 and 2 (or some
negative compression level).
My thinking was that zstd-1 would give the lowest-hanging fruits for zstd, while minimizing performance tradeoff, since WAL affects interactivity. But choosing between 1 and 2 seems like bikeshedding.
On Wed, Mar 09, 2022 at 07:14:11AM -0600, Justin Pryzby wrote:
> Anyway there's no compelling reason to not use the default.  If we were to
> use a non-default default, we'd have to choose between 1 and 2 (or some
> negative compression level).  My thinking was that zstd-1 would give the
> lowest-hanging fruits for zstd, while minimizing performance tradeoff,
> since WAL affects interactivity.  But choosing between 1 and 2 seems like
> bikeshedding.

Yeah, I have looked again at the patch today, and I saw no reason not to
apply it to give more options to the user, as zstd and lz4 are both good in
their own ways.  So done, with the default level used.
--
Michael
On Fri, Mar 11, 2022 at 12:23:59PM +0900, Michael Paquier wrote:
> On Wed, Mar 09, 2022 at 07:14:11AM -0600, Justin Pryzby wrote:
>> Anyway there's no compelling reason to not use the default.  If we were to
>> use a non-default default, we'd have to choose between 1 and 2 (or some
>> negative compression level).  My thinking was that zstd-1 would give the
>> lowest-hanging fruits for zstd, while minimizing performance tradeoff,
>> since WAL affects interactivity.  But choosing between 1 and 2 seems like
>> bikeshedding.
>
> Yeah, I have looked again at the patch today, and I saw no reason to
> not apply it to give more options to the user as zstd or lz4 are both
> good in their own ways.  So done, with the default level used.

It's great news - thanks.
--
Justin
While rebasing, I realized this should have bumped XLOG_PAGE_MAGIC.

Also, there's a dangling "and".

+        The supported methods are <literal>pglz</literal>,
+        <literal>lz4</literal> (if <productname>PostgreSQL</productname>
+        was compiled with <option>--with-lz4</option>) and
+        <literal>zstd</literal> (if <productname>PostgreSQL</productname>
+        was compiled with <option>--with-zstd</option>) and
+        The default value is <literal>off</literal>.
+        Only superusers can change this setting.

--
Justin
On Fri, Mar 11, 2022 at 03:49:00PM -0600, Justin Pryzby wrote:
> While rebasing, I realized this should have bumped XLOG_PAGE_MAGIC.
>
> Also, there's a dangling "and".

Right.  I'll address that a bit later today.  Thanks!
--
Michael