Re: [HACKERS] Logical decoding on standby - Mailing list pgsql-hackers
From | Craig Ringer |
---|---|
Subject | Re: [HACKERS] Logical decoding on standby |
Date | |
Msg-id | CAMsr+YHLsiDF2CvDgRv9GOqwEPOat5UdwGqr09+gim8KM-MG5w@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] Logical decoding on standby (Craig Ringer <craig@2ndquadrant.com>) |
Responses |
Re: [HACKERS] Logical decoding on standby
|
List | pgsql-hackers |
On 21 March 2017 at 09:05, Craig Ringer <craig@2ndquadrant.com> wrote: > Thanks, that's a helpful point. The commit in question is 978b2f65. I > didn't notice that it introduced XLogReader use in twophase.c, though > I should've realised given the discussion about fetching recent 2pc > info from xlog. I don't see any potential for issues at first glance, > but I'll go over it in more detail. The main concern is validity of > ThisTimeLineID, but since it doesn't run in recovery I don't see much > of a problem there. That also means it can afford to use the current > timeline-oblivious read_local_xlog_page safely. > > TAP tests for 2pc were added by 3082098. I'll check to make sure they > have appropriate coverage for this. The TAP tests pass fine, and I can't see any likely issues either. XLogReader for 2PC doesn't happen on standby, and RecoveryInProgress() will update ThisTimeLineID on promotion. >> Did you check whether ThisTimeLineID is actually always valid in the >> processes logical decoding could run in? IIRC it's not consistently >> update during recovery in any process but the startup process. > > I share your concerns that it may not be well enough maintained. > Thankyou for the reminder, that's been on my TODO and got lost when I > had to task-hop to other priorities. The main place we maintain ThisTimeLineID (outside StartupXLOG of course) is in walsender's GetStandbyFlushRecPtr, which calls GetWalRcvWriteRecPtr. That's not used in walsender's logical decoding or in the SQL interface. I've changed the order of operations in read_local_xlog_page to ensure that RecoveryInProgress() updates ThisTimeLineID if we're promoted, and made it update ThisTimeLineID from GetXLogReplayRecPtr otherwise. pg_logical_slot_get_changes_guts was fine already. Because xlog read callbacks must not attempt to read pages past the flush limit (master) or replay limit (standby), it doesn't matter if ThisTimeLineID is completely up to date, only that it's valid as-of that LSN. I did identify one problem. The startup process renames the last segment in a timeline to .partial when it processes a timeline switch. See xlog.c:7597. So if we have the following order of operations: * Update ThisTimeLineID to 2 at latest redo ptr * XLogReadDetermineTimeline chooses timeline 2 to read from * startup process replays timeline switch to TL 3 and renames last segment in old timeline to .partial * XLogRead() tries to open segment with TL 2 we'll fail. I don't think it matters much though. We're not actually supporting streaming decoding from standby this release by the looks, and even if we did the effect would be limited to an ERROR and a reconnect. It doesn't look like there's really any sort of lock or other synchronisation we can rely on to prevent this, and we should probably just live with it. If we have already opened the segment we'll just keep reading from it without noticing the rename; if we haven't and are switching to it just as it's renamed we'll ERROR when we try to open it. I had cascading and promotion tests in progress for decoding on standby, but doubt there's much point finishing them off now that it's not likely that decoding on standby can be added for this CF. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
pgsql-hackers by date: