Re: Adding REPACK [concurrently] - Mailing list pgsql-hackers
| From | Amit Kapila |
|---|---|
| Subject | Re: Adding REPACK [concurrently] |
| Date | |
| Msg-id | CAA4eK1LygCDP3FiFzXY9iVNFcHxhf7TT_DFf7tryTu2oipmfpA@mail.gmail.com |
| In response to | Re: Adding REPACK [concurrently] (Antonin Houska <ah@cybertec.at>) |
| Responses | Re: Adding REPACK [concurrently] |
| List | pgsql-hackers |
On Wed, May 6, 2026 at 1:55 PM Antonin Houska <ah@cybertec.at> wrote:
>
> Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> > On 2026-May-05, Antonin Houska wrote:
> >
> > > However, I failed to notice that COMMIT record of
> > > a transaction listed in the xl_running_xacts WAL record is not guaranteed to
> > > follow the xl_running_xacts record in WAL. In other words, even if
> > > xl_running_xacts is created before a COMMIT record of the contained
> > > transaction, it may end up at higher LSN in WAL. So the cleanup I relied on
> > > might not take place.
> >
> > That's pretty bad news.
> >
> > > I've got no good idea how to fix that.
>
> One idea occurred to me yet, effectively it's just a cleanup. Part of it was
> already proposed [1].

I see some issues/inefficiencies with this fix, and with the base code for db-specific snapshots built during decoding:

* After the fix, whenever a db-specific decoder sees a cluster-wide xl_running_xacts record, it unconditionally calls LogStandbySnapshot(MyDatabaseId) and returns. This happens for every cluster-wide record the decoder encounters (including after snapbuild has reached the CONSISTENT state), for the entire duration of the decoding session. LogStandbySnapshot acquires ProcArrayLock and XidGenLock, calls GetRunningTransactionData, and writes WAL, so with N active db-specific decoding sessions, each cluster-wide record now triggers N additional WAL writes (see the first sketch at the end of this message). Additionally, LogStandbySnapshot also logs AccessExclusiveLocks before the running_xacts record. Physical standbys skip db-specific XLOG_RUNNING_XACTS records via standby_redo(), but they do process the preceding XLOG_STANDBY_LOCK records, and the same locks may already have been logged in the most recent cluster-wide snapshot. Physical standbys could therefore end up processing these lock records twice; that may not be harmful, because I think we avoid re-acquiring the lock, but it is still new overhead in the system (second sketch below).

* When a cluster-wide running_xacts record arrives, SnapBuildProcessRunningXacts calls LogStandbySnapshot and returns early. ReorderBufferAbortOld is called, but with the cluster-wide oldestRunningXid, which can lag far behind the db-specific value (due to a long-running transaction in another database). When a db-specific record arrives, SnapBuildProcessRunningXacts processes it and advances builder->xmin with the db-specific (more current) oldestRunningXid, but ReorderBufferAbortOld is NOT called for db-specific records. This means the reorder buffer is cleaned up using a conservative, potentially very old, cluster-wide oldestRunningXid, even though builder->xmin has already advanced much further; the reorder buffer holds stale entries longer than necessary, increasing memory pressure (third sketch below).

* I also see a design-level problem with plugins that have need_shared_catalogs=false and use failover slots. IIUC, the db-specific optimization was designed around a live decoding session on the primary, which can emit and immediately read its own db-specific records in the WAL stream to reach a consistent state. The LogicalSlotAdvanceAndCheckSnapState path used by slotsync has a bounded WAL window and cannot exploit that feedback loop, making the two mechanisms fundamentally incompatible.

I know the slot created by pgrepack doesn't enable the failover option, but we have not even added any guards for, or thought about, how db-specific snapbuilds interact with other parts of the system that rely on cluster-wide running_xacts records, so there could be more problems that the current set of tests doesn't expose.
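To make the first point concrete, here is a rough sketch, not the actual patch, of the decoder-side flow; the dbid fields on SnapBuild and xl_running_xacts, and the Oid parameter of LogStandbySnapshot(), are my guesses at how the fix is wired up:

```c
/*
 * Rough sketch only: assumes the fix adds a dbid to SnapBuild and to
 * xl_running_xacts, and an Oid parameter to LogStandbySnapshot().
 */
void
SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn,
							 xl_running_xacts *running)
{
	if (OidIsValid(builder->dbid) &&	/* db-specific decoder ... */
		!OidIsValid(running->dbid))		/* ... sees a cluster-wide record */
	{
		/* cleanup still uses the (possibly stale) cluster-wide horizon */
		ReorderBufferAbortOld(builder->reorder, running->oldestRunningXid);

		/*
		 * Unconditionally request a fresh db-specific snapshot, even after
		 * the CONSISTENT state has been reached.  Each call takes
		 * ProcArrayLock and XidGenLock, runs GetRunningTransactionData(),
		 * and writes WAL, so N db-specific sessions turn one cluster-wide
		 * record into N extra WAL writes.
		 */
		LogStandbySnapshot(builder->dbid);
		return;
	}

	/* ... normal processing of a db-specific record follows ... */
}
```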
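The duplicate lock replay on a physical standby looks roughly like this; the XLOG_STANDBY_LOCK handling mirrors the existing standby_redo(), and only the dbid check on xl_running_xacts is an assumed patch detail:

```c
/* Sketch of the standby redo path; dbid field is an assumption. */
if (info == XLOG_STANDBY_LOCK)
{
	xl_standby_locks *xlrec = (xl_standby_locks *) XLogRecGetData(record);

	/*
	 * These lock records are replayed even when they precede a
	 * db-specific running_xacts record, so a standby can see the same
	 * AccessExclusiveLocks again shortly after the latest cluster-wide
	 * snapshot already logged them.
	 */
	for (int i = 0; i < xlrec->nlocks; i++)
		StandbyAcquireAccessExclusiveLock(xlrec->locks[i].xid,
										  xlrec->locks[i].dbOid,
										  xlrec->locks[i].relOid);
}
else if (info == XLOG_RUNNING_XACTS)
{
	xl_running_xacts *xlrec = (xl_running_xacts *) XLogRecGetData(record);

	if (OidIsValid(xlrec->dbid))	/* assumed field added by the fix */
		return;						/* db-specific snapshot: skipped here */

	/* ... existing cluster-wide processing ... */
}
```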
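And the cleanup asymmetry from the second point, sketched under the same assumed dbid field; ReorderBufferAbortOld() keeps its existing signature:

```c
/* Sketch of the cleanup asymmetry; dbid field is an assumption. */
if (!OidIsValid(running->dbid))
{
	/*
	 * Cluster-wide record: the reorder buffer is trimmed, but only up to
	 * the cluster-wide horizon, which a long-running transaction in some
	 * other database can hold back arbitrarily far.
	 */
	ReorderBufferAbortOld(builder->reorder, running->oldestRunningXid);
}
else
{
	/*
	 * Db-specific record: builder->xmin advances to the newer db-local
	 * horizon, but nothing trims the reorder buffer, so it keeps entries
	 * older than builder->xmin alive longer than necessary.
	 */
	builder->xmin = running->oldestRunningXid;
}
```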
--
With Regards,
Amit Kapila.