Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward

From:           Tom Lane <tgl@sss.pgh.pa.us>
Subject:        Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward
Msg-id:         717335.1760389672@sss.pgh.pa.us
In response to: Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward (Chao Li <li.evan.chao@gmail.com>)
List:           pgsql-hackers
Chao Li <li.evan.chao@gmail.com> writes:
>> On Oct 12, 2025, at 09:25, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> While playing around with the test cases for pg_dump compression,
>> I was startled to discover that the performance of compress_lz4's
>> "stream API" code is absolutely abysmal.

> I tested the patch on my MacBook M4 (Sequoia 15.6.1), compiled with
> -O2 optimization:

Thanks for looking at it!

> I also reviewed the code change, basically LGTM. Just a couple of
> trivial comments:

> 1 - 0001
> In WriteDataToArchiveLZ4()
> ```
> +        required = LZ4F_compressBound(chunk, &state->prefs);
> +        if (required > state->buflen - state->bufdata)
> +        {
> +            cs->writeF(AH, state->buffer, state->bufdata);
> +            state->bufdata = 0;
> +        }
> ```
> And in EndCompressorLZ4()
> ```
> +    required = LZ4F_compressBound(0, &state->prefs);
> +    if (required > state->buflen - state->bufdata)
> +    {
> +        cs->writeF(AH, state->buffer, state->bufdata);
> +        state->bufdata = 0;
> +    }
> ```
> These two code pieces are similar; the only difference is the first
> parameter passed to LZ4F_compressBound(). I think we could create an
> inline function for it. But these are just 5 lines, so I don't have
> a strong opinion on that.

Yeah, I don't think that would improve code readability noticeably.
By the time you got done writing a documentation comment for the new
subroutine, the code would probably be longer, not shorter.

I've pushed the parts of that patch set that I thought were
uncontroversial.  What's left is the business about increasing
DEFAULT_IO_BUFFER_SIZE and then adjusting the tests appropriately.

So, v4-0001 attached is the previous v3-0002 to increase
DEFAULT_IO_BUFFER_SIZE, plus additions in compress_none.c to make
--compress=none also produce predictably large data blocks.  I decided
that if we're going to rely on that behavior as part of the solution
for this thread's original problem, we'd better make it happen for all
compression options.

0002 adds a test case in 002_pg_dump.pl to exercise --compress=none,
because without that we don't have any coverage of the new code 0001
added in compress_none.c.  That makes for a small increase in the
runtime of 002_pg_dump.pl, but I'm inclined to think it's worth doing.

0003 modifies the existing test cases that manually compress blobs.toc
files so that they also compress toc.dat.  I feel like it's mostly an
oversight that that wasn't done to begin with; if it had been done,
we'd have caught the Gzip_read bug right away.  Also, AFAICT, this
doesn't cost anything measurable in test runtime.

0004 increases the row width in the existing test case that says it's
trying to push more than DEFAULT_IO_BUFFER_SIZE through the
compressors.  While I agree with the premise, this solution is hugely
expensive: it adds about 12% to the already-long runtime of
002_pg_dump.pl.  I'd like to find a better way, but ran out of energy
for today.  (I think the reason this costs so much is that it's
effectively iterated hundreds of times, because of 002_pg_dump.pl's
more-or-less cross-product approach to testing everything.  Maybe we
should pull it out of that structure?)

            regards, tom lane
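[For illustration, a much-simplified toy sketch of the cross-product
shape described in the last paragraph above — every run's output gets
matched against every test's regexp, which is why one over-wide COPY
row multiplies across the whole suite. All names and data here are
made up; this is not the real harness:]

```
use strict;
use warnings;

# Toy stand-ins: the real script defines dozens of runs and hundreds
# of tests, which is what makes the cost multiplicative.
my %pgdump_runs = (
    defaults                => { output => "COPY t (col1) FROM stdin;\n123\n\\.\n" },
    compression_none_custom => { output => "COPY t (col1) FROM stdin;\n123\n\\.\n" },
);
my %tests = (
    'COPY t' => { regexp => qr/^COPY t \(col1\) FROM stdin;$/m },
);

# Shape of the driver loop: each test's regexp is re-applied to each
# run's (re-dumped) output, so widening one row costs once per run.
foreach my $run (sort keys %pgdump_runs)
{
    foreach my $test (sort keys %tests)
    {
        my $ok = $pgdump_runs{$run}{output} =~ $tests{$test}{regexp};
        printf "%s: %s => %s\n", $run, $test, $ok ? "like" : "unlike";
    }
}
```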
From 969b1ba5a94449e10e56103f18ccf4f9c5481796 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Mon, 13 Oct 2025 16:09:36 -0400
Subject: [PATCH v4 1/4] Align the data block sizes of pg_dump's various
 compression modes.

After commit fe8192a95, compress_zstd.c tends to produce data block
sizes around 128K, and we don't really have any control over that
unless we want to overrule ZSTD_CStreamOutSize(), which seems like a
bad idea.  But let's try to align the other compression modes to
produce block sizes roughly comparable to that, so that pg_restore's
skip-data performance isn't enormously different across modes.

gzip compression can be brought into line simply by setting
DEFAULT_IO_BUFFER_SIZE = 128K, which this patch does.  That increases
some unrelated buffer sizes, but none of them seem problematic for
modern platforms.

lz4's idea of appropriate block size is highly nonlinear: if we just
increase DEFAULT_IO_BUFFER_SIZE then the output blocks end up around
200K.  I found that adjusting the slop factor in
LZ4State_compression_init was a not-too-ugly way of bringing that
number roughly into line.

With compress = none you get data blocks the same sizes as the table
rows, which seems potentially problematic for narrow tables.
Introduce a layer of buffering to make that case match the others.

Comments in compress_io.h and 002_pg_dump.pl suggest that if we
increase DEFAULT_IO_BUFFER_SIZE then we need to increase the amount
of data fed through the tests in order to improve coverage.  I've not
done that here, leaving it for a separate patch.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3515357.1760128017@sss.pgh.pa.us
---
 src/bin/pg_dump/compress_io.h    |  4 +-
 src/bin/pg_dump/compress_lz4.c   |  9 ++++-
 src/bin/pg_dump/compress_none.c  | 64 +++++++++++++++++++++++++++++++-
 src/tools/pgindent/typedefs.list |  1 +
 4 files changed, 72 insertions(+), 6 deletions(-)

diff --git a/src/bin/pg_dump/compress_io.h b/src/bin/pg_dump/compress_io.h
index 25a7bf0904d..ae008585c89 100644
--- a/src/bin/pg_dump/compress_io.h
+++ b/src/bin/pg_dump/compress_io.h
@@ -22,9 +22,9 @@
  *
  * When changing this value, it's necessary to check the relevant test cases
  * still exercise all the branches.  This applies especially if the value is
- * increased, in which case the overflow buffer may not be needed.
+ * increased, in which case some loops may not get iterated.
  */
-#define DEFAULT_IO_BUFFER_SIZE 4096
+#define DEFAULT_IO_BUFFER_SIZE (128 * 1024)
 
 extern char *supports_compression(const pg_compress_specification compression_spec);
 
diff --git a/src/bin/pg_dump/compress_lz4.c b/src/bin/pg_dump/compress_lz4.c
index b817a083d38..450afd4e2be 100644
--- a/src/bin/pg_dump/compress_lz4.c
+++ b/src/bin/pg_dump/compress_lz4.c
@@ -100,9 +100,14 @@ LZ4State_compression_init(LZ4State *state)
     state->buflen = LZ4F_compressBound(DEFAULT_IO_BUFFER_SIZE, &state->prefs);
 
     /*
-     * Then double it, to ensure we're not forced to flush every time.
+     * Add some slop to ensure we're not forced to flush every time.
+     *
+     * The present slop factor of 50% is chosen so that the typical output
+     * block size is about 128K when DEFAULT_IO_BUFFER_SIZE = 128K.  We might
+     * need a different slop factor to maintain that equivalence if
+     * DEFAULT_IO_BUFFER_SIZE is changed dramatically.
      */
-    state->buflen *= 2;
+    state->buflen += state->buflen / 2;
 
     /*
      * LZ4F_compressBegin requires a buffer that is greater or equal to
diff --git a/src/bin/pg_dump/compress_none.c b/src/bin/pg_dump/compress_none.c
index 4abb2e95abc..94c155a572d 100644
--- a/src/bin/pg_dump/compress_none.c
+++ b/src/bin/pg_dump/compress_none.c
@@ -22,6 +22,18 @@
  *----------------------
  */
 
+/*
+ * We buffer outgoing data, just to ensure that data blocks written to the
+ * archive file are of reasonable size.  The read side could use this struct,
+ * but there's no need because it does not retain data across calls.
+ */
+typedef struct NoneCompressorState
+{
+    char       *buffer;         /* buffer for unwritten data */
+    size_t      buflen;         /* allocated size of buffer */
+    size_t      bufdata;        /* amount of valid data currently in buffer */
+} NoneCompressorState;
+
 /*
  * Private routines
  */
@@ -49,13 +61,45 @@ static void
 WriteDataToArchiveNone(ArchiveHandle *AH, CompressorState *cs,
                        const void *data, size_t dLen)
 {
-    cs->writeF(AH, data, dLen);
+    NoneCompressorState *nonecs = (NoneCompressorState *) cs->private_data;
+    size_t      remaining = dLen;
+
+    while (remaining > 0)
+    {
+        size_t      chunk;
+
+        /* Dump buffer if full */
+        if (nonecs->bufdata >= nonecs->buflen)
+        {
+            cs->writeF(AH, nonecs->buffer, nonecs->bufdata);
+            nonecs->bufdata = 0;
+        }
+        /* And fill it */
+        chunk = nonecs->buflen - nonecs->bufdata;
+        if (chunk > remaining)
+            chunk = remaining;
+        memcpy(nonecs->buffer + nonecs->bufdata, data, chunk);
+        nonecs->bufdata += chunk;
+        data = ((const char *) data) + chunk;
+        remaining -= chunk;
+    }
 }
 
 static void
 EndCompressorNone(ArchiveHandle *AH, CompressorState *cs)
 {
-    /* no op */
+    NoneCompressorState *nonecs = (NoneCompressorState *) cs->private_data;
+
+    if (nonecs)
+    {
+        /* Dump buffer if nonempty */
+        if (nonecs->bufdata > 0)
+            cs->writeF(AH, nonecs->buffer, nonecs->bufdata);
+        /* Free working state */
+        pg_free(nonecs->buffer);
+        pg_free(nonecs);
+        cs->private_data = NULL;
+    }
 }
 
 /*
@@ -71,6 +115,22 @@ InitCompressorNone(CompressorState *cs,
     cs->end = EndCompressorNone;
 
     cs->compression_spec = compression_spec;
+
+    /*
+     * If the caller has defined a write function, prepare the necessary
+     * buffer.
+     */
+    if (cs->writeF)
+    {
+        NoneCompressorState *nonecs;
+
+        nonecs = (NoneCompressorState *) pg_malloc(sizeof(NoneCompressorState));
+        nonecs->buflen = DEFAULT_IO_BUFFER_SIZE;
+        nonecs->buffer = pg_malloc(nonecs->buflen);
+        nonecs->bufdata = 0;
+
+        cs->private_data = nonecs;
+    }
 }
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5290b91e83e..63f9387044b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1758,6 +1758,7 @@ NextValueExpr
 Node
 NodeTag
 NonEmptyRange
+NoneCompressorState
 Notification
 NotificationList
 NotifyStmt
-- 
2.43.7
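[The effect of the compress_none.c buffering above is easy to see in
miniature. The following toy Perl script — not pg_dump code; all names
here are invented — replays the same flush-when-full-then-fill loop as
WriteDataToArchiveNone and counts output calls for many narrow "rows":]

```
use strict;
use warnings;

my $BUFLEN = 128 * 1024;    # mirrors the new DEFAULT_IO_BUFFER_SIZE
my ($buf, $writes) = ('', 0);

sub write_block { $writes++ }    # stand-in for cs->writeF()

sub buffered_write
{
    my ($data) = @_;
    while (length($data) > 0)
    {
        # Dump buffer if full (as in WriteDataToArchiveNone) ...
        if (length($buf) >= $BUFLEN)
        {
            write_block($buf);
            $buf = '';
        }
        # ... and fill it: take no more than the remaining buffer space.
        $buf .= substr($data, 0, $BUFLEN - length($buf), '');
    }
}

buffered_write('x' x 20) for 1 .. 100_000;    # 100k narrow "rows"
write_block($buf) if length($buf);            # EndCompressorNone's final flush
print "write calls: $writes\n";    # 16 block writes, versus 100000 unbuffered
```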
From 4407d68c4a4b95979e24fc225573944def989fc9 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Mon, 13 Oct 2025 16:20:38 -0400
Subject: [PATCH v4 2/4] Add a test case to cover pg_dump with
 --compress=none.

This brings the coverage of compress_none.c up from about 64% to 90%,
in particular covering the new code added in the previous patch.

This adds perhaps 2% to the runtime of 002_pg_dump.pl.  It could be
argued that compress_none.c is simple enough to not require permanent
memorialization in a test case; but it's less simple than it was
before, so maybe we want this.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3515357.1760128017@sss.pgh.pa.us
---
 src/bin/pg_dump/t/002_pg_dump.pl | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index 6be6888b977..8a08f9a5f6f 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -81,6 +81,24 @@ my %pgdump_runs = (
         ],
     },
 
+    compression_none_custom => {
+        test_key => 'compression',
+        dump_cmd => [
+            'pg_dump',
+            '--format' => 'custom',
+            '--compress' => 'none',
+            '--file' => "$tempdir/compression_none_custom.dump",
+            '--statistics',
+            'postgres',
+        ],
+        restore_cmd => [
+            'pg_restore',
+            '--file' => "$tempdir/compression_none_custom.sql",
+            '--statistics',
+            "$tempdir/compression_none_custom.dump",
+        ],
+    },
+
     # Do not use --no-sync to give test coverage for data sync.
     compression_gzip_custom => {
         test_key => 'compression',
-- 
2.43.7

From 4f0c4e17ac17dd8fd36cbfc4f887fff8cfecb1f6 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Mon, 13 Oct 2025 16:30:55 -0400
Subject: [PATCH v4 3/4] Include compression of toc.dat in
 manually-compressed test cases.

We would have found the bug fixed in commit a239c4a0c much sooner if
we'd done this.  As far as I can tell, this doesn't reduce test
coverage at all, since there are other tests of directory format that
still use an uncompressed toc.dat.  And its effect on the total
runtime of 002_pg_dump.pl seems lost in the noise.

While here, fix a glitch that I noticed in testing: the
$glob_patterns tests were incapable of failing, because glob() will
return 'foo' as 'foo' whether there is a matching file or not.
(Indeed, the stanza just above this one relies on that.)  I'm
slightly tempted to back-patch this part.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3515357.1760128017@sss.pgh.pa.us
Discussion: https://postgr.es/m/25345.1760289877@sss.pgh.pa.us
---
 src/bin/pg_dump/t/002_pg_dump.pl | 41 +++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 17 deletions(-)

diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index 8a08f9a5f6f..8b287a673bf 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -140,15 +140,18 @@ my %pgdump_runs = (
             '--statistics',
             'postgres',
         ],
-        # Give coverage for manually compressed blobs.toc files during
-        # restore.
+        # Give coverage for manually-compressed TOC files during restore.
         compress_cmd => {
             program => $ENV{'GZIP_PROGRAM'},
-            args => [ '-f', "$tempdir/compression_gzip_dir/blobs_*.toc", ],
+            args => [
+                '-f',
+                "$tempdir/compression_gzip_dir/toc.dat",
+                "$tempdir/compression_gzip_dir/blobs_*.toc",
+            ],
         },
-        # Verify that only data files were compressed
+        # Verify that TOC and data files were compressed
         glob_patterns => [
-            "$tempdir/compression_gzip_dir/toc.dat",
+            "$tempdir/compression_gzip_dir/toc.dat.gz",
             "$tempdir/compression_gzip_dir/*.dat.gz",
         ],
         restore_cmd => [
@@ -219,18 +222,18 @@ my %pgdump_runs = (
             '--statistics',
             'postgres',
         ],
-        # Give coverage for manually compressed blobs.toc files during
-        # restore.
+        # Give coverage for manually-compressed TOC files during restore.
         compress_cmd => {
             program => $ENV{'LZ4'},
             args => [
                 '-z', '-f', '-m', '--rm',
+                "$tempdir/compression_lz4_dir/toc.dat",
                 "$tempdir/compression_lz4_dir/blobs_*.toc",
             ],
         },
-        # Verify that data files were compressed
+        # Verify that TOC and data files were compressed
         glob_patterns => [
-            "$tempdir/compression_lz4_dir/toc.dat",
+            "$tempdir/compression_lz4_dir/toc.dat.lz4",
             "$tempdir/compression_lz4_dir/*.dat.lz4",
         ],
         restore_cmd => [
@@ -303,18 +306,18 @@ my %pgdump_runs = (
             '--statistics',
             'postgres',
         ],
-        # Give coverage for manually compressed blobs.toc files during
-        # restore.
+        # Give coverage for manually-compressed TOC files during restore.
         compress_cmd => {
             program => $ENV{'ZSTD'},
             args => [
-                '-z', '-f',
-                '--rm', "$tempdir/compression_zstd_dir/blobs_*.toc",
+                '-z', '-f', '--rm',
+                "$tempdir/compression_zstd_dir/toc.dat",
+                "$tempdir/compression_zstd_dir/blobs_*.toc",
             ],
         },
-        # Verify that data files were compressed
+        # Verify that TOC and data files were compressed
         glob_patterns => [
-            "$tempdir/compression_zstd_dir/toc.dat",
+            "$tempdir/compression_zstd_dir/toc.dat.zst",
             "$tempdir/compression_zstd_dir/*.dat.zst",
         ],
         restore_cmd => [
@@ -5523,8 +5526,12 @@ foreach my $run (sort keys %pgdump_runs)
         foreach my $glob_pattern (@{$glob_patterns})
         {
             my @glob_output = glob($glob_pattern);
-            is(scalar(@glob_output) > 0,
-                1, "$run: glob check for $glob_pattern");
+            my $ok = 0;
+            # certainly found some files if glob() returned multiple matches
+            $ok = 1 if (scalar(@glob_output) > 1);
+            # if just one match, we need to check if it's real
+            $ok = 1 if (scalar(@glob_output) == 1 && -f $glob_output[0]);
+            is($ok, 1, "$run: glob check for $glob_pattern");
         }
     }
-- 
2.43.7
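[The glob() gotcha that the last hunk above fixes is easy to reproduce
standalone. In this demonstration the path is made up and no such file
needs to exist:]

```
use strict;
use warnings;

# A wildcard-free pattern comes back verbatim from Perl's csh-style
# glob(), whether or not a matching file exists ...
my @hits = glob('/no/such/dir/toc.dat');
printf "glob returned %d name(s)\n", scalar(@hits);    # 1, spuriously

# ... so "returned at least one name" can never fail for such patterns.
# The fix above: a single returned name must also be verified with -f.
my $ok = @hits > 1 || (@hits == 1 && -f $hits[0]);
print $ok ? "real match\n" : "no real match\n";        # no real match
```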
From 163956c37940aeee1bf4c48fb62ae22612ec89be Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Mon, 13 Oct 2025 16:43:06 -0400
Subject: [PATCH v4 4/4] Widen the wide row used to verify correct
 (de)compression.

Commit 1a05c1d25 advises us (not without reason) to ensure that this
test case fully fills DEFAULT_IO_BUFFER_SIZE, so that loops within
the compression logic will iterate completely.  To follow that advice
with the proposed DEFAULT_IO_BUFFER_SIZE of 128K, we need something
close to this.  This does indeed increase the reported code coverage
by a few lines.

This is a very expensive test, however: it increases the total
runtime of 002_pg_dump.pl by about 12% for me, and that's already one
of the more expensive TAP tests.  I wonder if there isn't a less
brute-force way of getting the same result.  If there's not, is this
really worth it?

One idea is to take this out of the main structure of 002_pg_dump.pl
so that the test instance is run just once per compression type,
rather than many many times as it is here.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3515357.1760128017@sss.pgh.pa.us
---
 src/bin/pg_dump/t/002_pg_dump.pl | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index 8b287a673bf..8bbd6ed47a9 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -3732,14 +3732,15 @@ my %tests = (
     },
 
     # Insert enough data to surpass DEFAULT_IO_BUFFER_SIZE during
-    # (de)compression operations
+    # (de)compression operations.  The weird regex is because Perl
+    # restricts us to repeat counts of less than 32K.
     'COPY test_compression_method' => {
         create_order => 111,
         create_sql => 'INSERT INTO dump_test.test_compression_method (col1) '
-          . 'SELECT string_agg(a::text, \'\') FROM generate_series(1,4096) a;',
+          . 'SELECT string_agg(a::text, \'\') FROM generate_series(1,65536) a;',
         regexp => qr/^
             \QCOPY dump_test.test_compression_method (col1) FROM stdin;\E
-            \n(?:\d{15277}\n){1}\\\.\n
+            \n(?:(?:\d\d\d\d\d\d\d\d\d\d){31657}\d\d\d\d\n){1}\\\.\n
             /xm,
         like => {
             %full_runs,
-- 
2.43.7
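[A footnote on the "weird regex" above: nesting quantifiers keeps each
individual repeat count under Perl's per-quantifier limit (32766 in
older perls) while the product covers the full row width. A quick
self-check of the arithmetic — the concatenation of 1..65536 is 316574
digits, and 31657 repeats of ten digits plus four more equals exactly
that:]

```
use strict;
use warnings;

# A single \d{316574} would exceed the quantifier limit on perls with
# the older 32766 cap, so the test nests quantifiers instead.
my $row = join('', 1 .. 65536);
printf "row width: %d digits\n", length($row);    # 316574

print "matched\n"
  if $row =~ /^(?:\d\d\d\d\d\d\d\d\d\d){31657}\d\d\d\d$/;
```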