Re: Patch: dumping tables data in multiple chunks in pg_dump - Mailing list pgsql-hackers

From Hannu Krosing
Subject Re: Patch: dumping tables data in multiple chunks in pg_dump
Date
Msg-id CAMT0RQSwbGpf4WSCb41dcyvHUEYAnVTYxdJ6SqLYf7HiUag+TA@mail.gmail.com
In response to Re: Patch: dumping tables data in multiple chunks in pg_dump  (Dilip Kumar <dilipbalaut@gmail.com>)
List pgsql-hackers
Hi Zsolt and Dilip,

Thanks for the review and the useful comments!

On Tue, Feb 3, 2026 at 10:10 PM Zsolt Parragi <zsolt.parragi@percona.com> wrote:
>
> Hello!
>
> I did some testing with this patch, and I think there are some issues
> during restoration:
>
> 1. Isn't there a possible race / scheduling mistake during restore
> because of missing dependencies? The code now prints out "TABLE DATA
> (pages %u:%u)", while the restore code checks for the explicit "TABLE
> DATA" string for dependency tracking (pg_backup_archiver.c:2013 and a
> few other places). This causes POST DATA to have no dependency on the
> table data, and can be scheduled before we load all table data.

I have resolved this by adding a second array to the reverse-dependencies
mechanism in buildTocEntryArrays(): for chunked dumps I collect the arrays
of chunk ids in AH->tableDataChunkIds.

For this I extracted the list-management part from DumpableObject:

typedef struct _DependencyList
{
    DumpId     *dependencies;   /* dumpIds of objects this one depends on */
    int         nDeps;          /* number of valid dependencies */
    int         allocDeps;      /* allocated size of dependencies[] */
} DependencyList;

And I added addStandaloneDependency(), based on addObjectDependency().

I simplified it to always use pg_realloc_array(), as it handles the NULL case:

void
addStandaloneDependency(DependencyList *dobj, DumpId refId)
{
    if (dobj->nDeps >= dobj->allocDeps)
    {
        dobj->allocDeps = (dobj->allocDeps <= 0) ? 16 : dobj->allocDeps * 2;
        dobj->dependencies = pg_realloc_array(dobj->dependencies,
                                              DumpId, dobj->allocDeps);
    }
    dobj->dependencies[dobj->nDeps++] = refId;
}

And then I use AH->tableDataChunkIds in repoint_table_dependencies() to
- replace the dependency on the table definition with a dependency on the
  first chunk
- add the remaining chunks at the end of the dependency list.

> I was able to verify the scheduling issue with an index: the INDEX
> part is scheduled too early, before all TABLE DATA completes, but then
> locking prevents it from progressing, so everything completed fine in
> the end. Even if that's guaranteed, which I'm not 100% sure of, it's
> still based on luck and not proper logic, and takes up a slot (or
> multiple), reducing parallelism.
>
> 2. Fixing the TABLE DATA strcmp checks solves the scheduling issue,
> but it's not that simple, because then it causes truncation issues
> during restore, which needs additional changes in the restore code. I
> did a quick fix for that by adding an additional condition to the
> created flag, and with that it seems to restore everything properly,
> and with proper ordering, only starting index/constraint/etc after all
> table data is completed. However this was definitely just a quick test
> fix, this needs a proper better solution.
>
> Other issues I see are more minor, but numerous:

I collect the chunk dependencies in a separate array, which
should solve the truncation issue.

Can you advise a good check to add to tap tests for verifying?

> 3. The patch still has lots of debug output (pg_log_WARNING("CHUNKING
> ...")); Is this intended? Shouldn't these be behind some verbose
> check, and maybe use info instead of warning?

These were left in to ease the initial review. I have since either removed
them or turned them into pg_log_debug().

> 4. The is_segment macro should have () around the use of tdiptr

Thanks, fixed.

> 5. There's still a 32000 magic constant, shouldn't that have some
> descriptive name / explanatory comment?

I turned this into "ctid < '(pagenr+1,0)'" for clarity and
future-proofing, as it is not entirely impossible that at some point
we could have more than 32000 items per page.

> 6. formatting issues at multiple places, mostly missing spaces after
> if/while/for statements

My hope was that the pre-release automatic formatting run would take care
of this.

I will eyeball the patch to see if I find them, but I don't think I have a
good way to detect them all.

Suggestions very much welcome!

> 7. inconsistent error messages (not in range vs must be in range)

> 8. There's a remaining TODO that seems stale, current_chunk_start is
> already uint64

Removed.

> 9. typo: "the computed from pog_relation_size" -> "then computed from
> pg_relation_size"

Fixed.

On Thu, Feb 12, 2026 at 7:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Jan 28, 2026 at 11:00 PM Hannu Krosing <hannuk@google.com> wrote:
> >
> > v13 has added a proper test comparing original and restored table data
> >
> I was reviewing v13 and here are some initial comments I have
>
> 1. IMHO the commit message details about the work progress instead of
> a high level idea about what it actually does and how.
> Suggestion:
>
> SUBJECT: Add --max-table-segment-pages option to pg_dump for parallel
> table dumping.
>
> This patch introduces the ability to split large heap tables into segments
> based on a specified number of pages. These segments can then be dumped in
> parallel using the existing jobs infrastructure, significantly reducing
> the time required to dump very large tables.
>
> The implementation uses ctid-based range queries (e.g., WHERE ctid >=
> '(start,1)'
> AND ctid <= '(end,32000)') to extract specific chunks of the relation.
>
> <more architecture details and limitation if any>

SUBJECT: Add --max-table-segment-pages option to pg_dump for parallel
table dumping.

This patch introduces the ability to split large heap tables into segments
based on a specified number of pages. These segments can then be dumped in
parallel using the existing jobs infrastructure, significantly reducing
the time required to dump very large tables.

The --max-table-segment-pages limit applies specifically to main-table
pages and therefore guarantees nothing about output size: the output could
be empty if there are no live tuples in the page range, or it could be
almost 200 GB if the pages hold only pointers to 1 GB TOAST items.

The implementation uses ctid-based range queries (e.g., WHERE ctid >=
'(startPage,1)' AND ctid <= '(endPage+1,0)') to extract specific chunks of
the relation.

This is only supported efficiently on PostgreSQL version 14+, though it
does work (inefficiently) on earlier versions.

The patch only supports the "heap" access method, as other access methods
may not even have a ctid column.

> 2.
> + pg_log_warning("CHUNKING: set dopt.max_table_segment_pages to [%u]",
> dopt.max_table_segment_pages);
> + break;
>
> IMHO we don't need to place warning here while processing the input parameters

Either removed or turned to pg_log_debug()

> 3.
> + printf(_("  --max-table-segment-pages=NUMPAGES\n"
> +      "                               Number of main table pages
> above which data is \n"
> + "                               copied out in chunks, also
> determines the chunk size\n"));
>
> Check the comment formatting, all the parameter description starts
> with lower case, so better we start with "number" rather than "Number"

Fixed.

> 4.
> + if (is_segment(tdinfo))
> + {
> + appendPQExpBufferStr(q, tdinfo->filtercond?" AND ":" WHERE ");
> + if(tdinfo->startPage == 0)
> + appendPQExpBuffer(q, "ctid <= '(%u,32000)'", tdinfo->endPage);
> + else if(tdinfo->endPage != InvalidBlockNumber)
> + appendPQExpBuffer(q, "ctid BETWEEN '(%u,1)' AND '(%u,32000)'",
> + tdinfo->startPage, tdinfo->endPage);
> + else
> + appendPQExpBuffer(q, "ctid >= '(%u,1)'", tdinfo->startPage);
> + pg_log_warning("CHUNKING: pages [%u:%u]",tdinfo->startPage, tdinfo->endPage);
> + }
>
> IMHO we should explain this chunking logic in the comment above this code block?

I added the comment.
I also changed the chunk end logic to "ctid < '(LastPage+1,0)'" for clarity and
future-proofing.

----
Best Regards

Hannu
