Re: Logical decoding on standby - Mailing list pgsql-hackers
From | Craig Ringer |
---|---|
Subject | Re: Logical decoding on standby |
Date | |
Msg-id | CAMsr+YFYkw9+GhR--yVhdGDHKpgowU+_w0vHkiSpDdjqY5hcjA@mail.gmail.com |
In response to | Re: Logical decoding on standby (Craig Ringer <craig@2ndquadrant.com>) |
Responses | Re: Logical decoding on standby |
List | pgsql-hackers |
On 22 November 2016 at 10:20, Craig Ringer <craig@2ndquadrant.com> wrote:

> I'm currently looking at making detection of replay conflict with a
> slot work by separating the current catalog_xmin into two effective
> parts - the catalog_xmin currently needed by any known slots
> (ProcArray->replication_slot_catalog_xmin, as now), and the oldest
> actually valid catalog_xmin where we know we haven't removed anything
> yet.

OK, more detailed plan.

The last checkpoint's oldestXid, and ShmemVariableCache's oldestXid, are already held down by ProcArray's catalog_xmin. But that doesn't mean we haven't removed newer tuples from specific relations and logged that in xl_heap_clean, etc, including catalogs or user catalogs; it only means the clog still exists for those XIDs.

We don't emit a WAL record when we advance oldestXid in SetTransactionIdLimit(), and doing so would be useless, because vacuum will already have removed needed tuples from needed catalogs before calling SetTransactionIdLimit() from vac_truncate_clog(). We know that if oldestXid is n, the true valid catalog_xmin where no needed tuples have been removed must be >= n. But we need to know the lower bound of valid catalog_xmin, which oldestXid doesn't give us.

So right now a standby has no way to reliably know whether the catalog_xmin requirement of a given replication slot can be satisfied. A standby can't tell from an xl_heap_cleanup_info record, an xl_heap_clean record, etc, whether the affected table is a catalog or not, and it shouldn't generate conflicts for non-catalogs, since otherwise it would be constantly clobbering walsenders.

A 2-phase advance of the global catalog_xmin would mean that GetOldestXmin() would return a value from ShmemVariableCache, not the oldest catalog xmin from ProcArray like it does now.
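To make the proposed split concrete, here is a minimal sketch of the two values it would track. All names here are illustrative, not actual PostgreSQL symbols, and the wraparound-safe TransactionIdPrecedes() comparison is simplified to a plain integer compare:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Hypothetical sketch of the proposed split of catalog_xmin into two
 * effective parts, as described in the email.
 */
typedef struct SketchShmemVariableCache
{
    /* oldest catalog_xmin still needed by any known slot (as now) */
    TransactionId replication_slot_catalog_xmin;
    /* oldest catalog_xmin for which we know nothing has been removed yet */
    TransactionId oldest_valid_catalog_xmin;
} SketchShmemVariableCache;

/*
 * A decoding session is only safe to start if catalog vacuuming has not
 * already overtaken its required catalog_xmin.  (Plain >= stands in for
 * a wraparound-safe XID comparison.)
 */
static bool
can_start_decoding(const SketchShmemVariableCache *cache,
                   TransactionId slot_catalog_xmin)
{
    return slot_catalog_xmin >= cache->oldest_valid_catalog_xmin;
}
```

The point of the second field is exactly the lower bound the email says oldestXid cannot provide: a threshold below which needed catalog tuples may already be gone.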
(auto)vacuum would then be responsible for:

* Reading the oldest catalog_xmin from the procarray
* If it has advanced vs what's present in ShmemVariableCache, writing a new xlog record type recording an advance of the oldest catalog xmin
* Advancing ShmemVariableCache's oldest catalog xmin

and it would do so before calling GetOldestXmin via vacuum_set_xid_limits() to determine what it can remove.

GetOldestXmin would return the ProcArray's copy of the oldest catalog_xmin when in recovery, so we report it via hot_standby_feedback to the upstream, it's recorded on our physical slot, and it in turn causes vacuum to advance the master's effective oldest catalog_xmin for vacuum.

On the standby we'd replay the catalog_xmin advance record, advance the standby's ShmemVariableCache's oldest catalog xmin, and check whether any replication slots, active or not, have a catalog_xmin < the new threshold. If none do, there's no conflict and we're fine. If any do, we wait max_standby_streaming_delay/max_standby_archive_delay as appropriate, then generate recovery conflicts against all backends that have an active replication slot, based on the replication slot state in shmem. Those backends - walsender or normal decoding backend - would promptly die. New decoding sessions will check ShmemVariableCache and refuse to start if their catalog_xmin is < the threshold. Since we advance it before generating recovery conflicts, there's no race with clients trying to reconnect after their backend is killed with a conflict.

If we wanted to get fancy, we could set the latches of walsender backends at risk of conflicting; they could check ShmemVariableCache's oldest valid catalog xmin, send immediate keepalives with reply_requested set, and hopefully get flush confirmation from the client and advance their catalog_xmin before we terminate them as conflicting with recovery. But that can IMO be done later/separately.

Going to prototype this.
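The standby-side replay step above can be sketched as a pure conflict check: given the new threshold replayed from the hypothetical catalog_xmin-advance record, scan the slots (active or not) and decide whether any recovery conflicts are needed. The types and function names below are made up for illustration, and a plain integer compare again stands in for wraparound-safe XID comparison:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Minimal stand-in for a replication slot's state in shmem. */
typedef struct SketchSlot
{
    bool          in_use;
    TransactionId catalog_xmin;
} SketchSlot;

/*
 * On replaying a hypothetical "advance oldest catalog_xmin" record,
 * count the slots that would conflict with the new threshold.  Zero
 * means redo can proceed without generating any recovery conflicts;
 * nonzero means we'd wait out max_standby_*_delay and then kill the
 * backends holding those slots.
 */
static int
count_conflicting_slots(const SketchSlot *slots, size_t nslots,
                        TransactionId new_oldest_catalog_xmin)
{
    int conflicts = 0;

    for (size_t i = 0; i < nslots; i++)
    {
        if (slots[i].in_use &&
            slots[i].catalog_xmin < new_oldest_catalog_xmin)
            conflicts++;
    }
    return conflicts;
}
```

Because the shared threshold is advanced before the conflicts are generated, a client reconnecting after its backend is killed is re-checked against the already-advanced value, which is the race-freedom property the email relies on.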
Alternate approach:
-------------------

The oldest xid in heap_xlog_cleanup_info is relation-specific, but the standby has no way to know during redo whether the relation is a catalog, and so whether to kill slots and decoding sessions based on its latestRemovedXid. The same goes for xl_heap_clean and the other records that can cause snapshot conflicts (xl_heap_visible, xl_heap_freeze_page, xl_btree_delete, xl_btree_reuse_page, spgxlogVacuumRedirect).

Instead of adding a 2-phase advance of the global catalog_xmin, we could instead add a rider to each of these records identifying whether the affected table is a catalog or not. The rider would only be emitted when wal_level >= logical, but it *would* increase WAL size a bit when logical decoding is enabled, even if it's not going to be used on a standby. It would be a simple:

    typedef struct xl_rel_catalog_info
    {
        bool rel_accessible_from_logical_decoding;
    } xl_rel_catalog_info;

or similar. During redo we'd call a new ResolveRecoveryConflictWithLogicalSlot function from each of those records' redo routines, doing what I outlined above.

This approach adds more info to more xlog records, and the upstream has to use RelationIsAccessibleInLogicalDecoding() to set up the records when writing the xlogs. In exchange, we don't have to add a new field to CheckPoint or ShmemVariableCache, or add a new xlog record type. It seems the worse option to me.

(BTW, as the comments on GetOldestSafeDecodingTransactionId() note, we can't rely on KnownAssignedXidsGetOldestXmin(), since it can be incomplete at least on standby.)

--
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
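Under the rider approach, the redo-side decision becomes purely local to each cleanup record. The sketch below shows that decision, assuming the rider from the email; the function name and the plain (non-wraparound-safe) XID compare are illustrative only:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* The rider proposed in the email (typedef name made consistent). */
typedef struct xl_rel_catalog_info
{
    bool rel_accessible_from_logical_decoding;
} xl_rel_catalog_info;

/*
 * Hypothetical redo-side check: a cleanup record can only conflict with
 * a slot's catalog_xmin if the upstream flagged the relation as a
 * (user) catalog, and the record removed tuples the slot still needs.
 */
static bool
cleanup_conflicts_with_slot(const xl_rel_catalog_info *rider,
                            TransactionId latest_removed_xid,
                            TransactionId slot_catalog_xmin)
{
    if (!rider->rel_accessible_from_logical_decoding)
        return false;       /* plain table: never conflicts with slots */
    return latest_removed_xid >= slot_catalog_xmin;
}
```

This is what makes the trade-off in the email concrete: the standby needs no new shared state or record type, but every such cleanup record grows by the rider, and the upstream pays a RelationIsAccessibleInLogicalDecoding() check per record.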