Re: Changed behavior in rewriteheap - Mailing list pgsql-hackers

From Matthias van de Meent
Subject Re: Changed behavior in rewriteheap
Date
Msg-id CAEze2Wj3t_N_i=vNm3yHB0TLtgsieYx0_uSMqLqN2Y8qAAbJkw@mail.gmail.com
Whole thread Raw
In response to Changed behavior in rewriteheap  (Erik Nordström <erik@timescale.com>)
Responses Re: Changed behavior in rewriteheap
List pgsql-hackers
On Thu, 21 Nov 2024, 17:18 Erik Nordström, <erik@timescale.com> wrote:
Hello,

I've noticed a change in behavior of the heap rewrite functionality in PostgreSQL 17, used by, e.g., CLUSTER. I've been experimenting with the functionality to implement a way to merge partitions in TimescaleDB. I am using table_relation_copy_for_cluster() to write the data of several tables to a single merged table, and then I do a heap swap on one of the original tables while dropping the others. So, if you have 3 partitions and want to merge them to one, then I write all three partitions to a temporary heap, swap the new heap on partition 1 and then drop partitions 2 and 3.

Now, this approach worked fine for PostgreSQL 15 and 16, but 17 introduced some changes that altered the behavior so that I only see data from one of the partitions after merge (the last one written).

The commit that I think is responsible is the following: https://github.com/postgres/postgres/commit/8af256524893987a3e534c6578dd60edfb782a77

Now, I realize that this is not a problem for PostgreSQL itself since the rewrite functionality isn't used for the purpose I am using it. To my defense, the rewrite code seems to imply that it should be possible to write more data to an existing heap according to this comment in begin_heap_rewrite: /* new_heap needn't be empty, just locked */.

I've also tried recompiling PG17 with the rewriteheap.c file from PG16 and then it works again. I haven't yet been able to figure out exactly what is different but I will continue to try to narrow it down. In the meantime, maybe someone on the mailing list has some insight on what could be the issue and whether my approach is viable?

I think the origin of the issue is at bulk_write.c, in smgr_bulk_flush (specifically after line 282), which assumes the bulk write system is used exclusively on empty relations.

If you use a separate pair of begin/end_heap_rewrite for every relation you merge into that heap, it'll re-initialize the bulk writer, which will thus overwrite the previous rewrites' pages. The pre-PG17 rewriteheap.c doesn't use that API, and thus isn't affected.

I've CC-ed Heikki as author of that patch; maybe a new API to indicate bulk writes into an existing fork would make sense, to solve Timescale's issue and fix rewriteheap's behavior?


Kind regards,

Matthias van de Meent

pgsql-hackers by date:

Previous
From: "Giacchino, Luca"
Date:
Subject: SIMD optimization for list_sort
Next
From: Tom Lane
Date:
Subject: Re: [EXTERNAL] Re: Add non-blocking version of PQcancel