Thread: Streaming replication and pg_xlogfile_name()
Hi,

In relation to the functions added recently, I found an annoying problem: pg_xlogfile_name(pg_last_xlog_receive/replay_location()) might report the wrong name because pg_xlogfile_name() always uses the current timeline, and a backend doesn't know the actual timeline related to the location which pg_last_xlog_receive/replay_location() reports. Even if a backend knew that, pg_xlogfile_name() would be unable to determine which timeline should be used.

To solve this problem, I'm thinking of adding the following functions:

* pg_current_timeline() reports the current timeline ID.
* pg_last_receive_timeline() reports the timeline ID which is related to the last WAL receive location.
* pg_last_replay_timeline() reports the timeline ID which is related to the last WAL replay location.
* pg_xlogfile_name(location text [, timeline bigint ]) reports the WAL file name using the given timeline. By default, the current timeline is used.
* pg_xlogfile_name_offset(location text [, timeline bigint]) reports the WAL file name and offset using the given timeline. By default, the current timeline is used.

If the second parameter is omitted, pg_xlogfile_name() would behave as it does now. We can get the right WAL file name by giving it the result of pg_last_receive/replay_timeline().

Thoughts? Or should we just drop the support of pg_xlogfile_name() for pg_last_xlog_receive/replay_location()?

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Fujii Masao wrote:
> In relation to the functions added recently, I found an annoying problem;
> pg_xlogfile_name(pg_last_xlog_receive/replay_location()) might report the
> wrong name because pg_xlogfile_name() always uses the current timeline,
> and a backend doesn't know the actual timeline related to the location
> which pg_last_xlog_receive/replay_location() reports. Even if a backend
> knows that, pg_xlogfile_name() would be unable to determine which timeline
> should be used.

Hmm, I'm not sure what the use case for this is, but I agree it seems annoying that you can almost reconstruct the exact filename, but not quite because of the possible change in timeline ID.

> To solve this problem, I'm thinking to add the following functions:
>
> * pg_current_timeline() reports the current timeline ID.
> * pg_last_receive_timeline() reports the timeline ID which is related
> to the last WAL receive location.
> * pg_last_replay_timeline() reports the timeline ID which is related
> to the last WAL replay location.
> * pg_xlogfile_name(location text [, timeline bigint ]) reports the WAL
> file name using the given timeline. By default, the current timeline
> is used.
> * pg_xlogfile_name_offset(location text [, timeline bigint]) reports
> the WAL file name and offset using the given timeline. By default,
> the current timeline is used.

That gets quite complicated to use. And there's a little race condition too: when you call the pg_last_replay_timeline() and pg_last_xlog_replay_location() functions to get the timeline and XLogRecPtr of the last replayed record, the timeline might change in between the calls, so you end up with a combination that was never actually replayed.

How about extending the format of the string returned by pg_last_xlog_receive/replay_location() to include the timeline ID? When it currently returns e.g. '6/200016C', it could return '1/6/200016C', where 1 is the timeline ID. Then just teach pg_xlogfile_name[_offset]() to accept that format as well.
-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
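To make the proposed format concrete, here is a minimal standalone sketch (not the patch itself) of parsing a 'tli/xlogid/xrecoff' string and deriving a WAL file name from it. The struct and function names are hypothetical. It assumes the default 16 MB segment size, that the timeline is written in hex like the other two fields, and that a location exactly on a segment boundary is attributed to the previous segment, which is what the psql output reported later in this thread suggests.

```c
#include <stdio.h>
#include <stdint.h>

/* Assumption: default WAL segment size of 16 MB. */
#define WAL_SEG_SIZE 0x1000000u

/* Hypothetical holder for a timeline-qualified WAL location. */
typedef struct
{
    uint32_t tli;      /* timeline ID */
    uint32_t xlogid;   /* logical log file number */
    uint32_t xrecoff;  /* byte offset within the logical log file */
} WalLocation;

/*
 * Parse either the proposed "tli/xlogid/xrecoff" form or the existing
 * "xlogid/xrecoff" form. In the two-part case the caller's default
 * timeline is used. Returns 0 on success, -1 on malformed input.
 */
static int
parse_wal_location(const char *s, uint32_t default_tli, WalLocation *out)
{
    unsigned int a, b, c;
    char extra;

    /* Exactly three hex fields and nothing after them. */
    if (sscanf(s, "%X/%X/%X%c", &a, &b, &c, &extra) == 3)
    {
        out->tli = a;
        out->xlogid = b;
        out->xrecoff = c;
        return 0;
    }
    /* Exactly two hex fields and nothing after them. */
    if (sscanf(s, "%X/%X%c", &a, &b, &extra) == 2)
    {
        out->tli = default_tli;
        out->xlogid = a;
        out->xrecoff = b;
        return 0;
    }
    return -1;
}

/*
 * Build the 24-hex-digit WAL file name and the offset within that file.
 * A location on a segment boundary maps to the end of the previous
 * segment. xrecoff == 0 is not handled in this sketch.
 */
static int
wal_file_name_offset(const WalLocation *loc, char *buf, size_t buflen,
                     uint32_t *offset)
{
    uint32_t seg;

    if (loc->xrecoff == 0)
        return -1;
    seg = (loc->xrecoff - 1) / WAL_SEG_SIZE;
    *offset = loc->xrecoff - seg * WAL_SEG_SIZE;
    snprintf(buf, buflen, "%08X%08X%08X",
             (unsigned int) loc->tli, (unsigned int) loc->xlogid,
             (unsigned int) seg);
    return 0;
}
```

Because the timeline travels inside the same string as the location, a caller gets both from a single function result, which avoids the race between two separate calls described above.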
On Thu, Jan 28, 2010 at 5:28 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> How about extending the format of the string returned by
> pg_last_xlog_receive/replay_location() to include the timeline ID? When
> it currently returns e.g '6/200016C', it could return '1/6/200016C',
> where 1 is the timeline ID. Then just teach pg_xlogfile_name[_offset]()
> to accept that format as well.

Sounds good. The attached patch does so. Also the code is available in the 'replication' branch in my git repository.

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Attachment
Fujii Masao wrote:
> On Thu, Jan 28, 2010 at 5:28 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> How about extending the format of the string returned by
>> pg_last_xlog_receive/replay_location() to include the timeline ID? When
>> it currently returns e.g '6/200016C', it could return '1/6/200016C',
>> where 1 is the timeline ID. Then just teach pg_xlogfile_name[_offset]()
>> to accept that format as well.
>
> Sounds good. The attached patch does so. Also the code is available
> in the 'replication' branch in my git repository.

> --- 5866,5882 ----
>       /* use volatile pointer to prevent code rearrangement */
>       volatile XLogCtlData *xlogctl = XLogCtl;
>
> !     /*
> !      * initialize shared replayEndRecPtr, recoveryLastRecPtr and
> !      * recoveryLastTLI. Actually, the latter two variables don't need to
> !      * be initialized here since they are expected to be updated at least
> !      * once until read only connections will have read them. But just in
> !      * case.
> !      */
>       SpinLockAcquire(&xlogctl->info_lck);
>       xlogctl->replayEndRecPtr = ReadRecPtr;
>       xlogctl->recoveryLastRecPtr = ReadRecPtr;
> +     xlogctl->recoveryLastTLI = curFileTLI;
>       SpinLockRelease(&xlogctl->info_lck);
>
>       InRedo = true;

Thinking about this again, I'm not sure this is a good idea. Using curFileTLI makes sense if you're going to call pg_xlogfile_name() and would expect it to return the filename of the file containing the WAL record being replayed. But in other contexts, it seems strange for pg_last_replay_timeline() to return the TLI of the first record in the file, rather than the actual record replayed.

I don't have any better ideas, though.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Mon, Feb 22, 2010 at 9:30 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> Thinking about this again, I'm not sure this is a good idea. Using
> curFileTLI makes sense if you're going to call pg_xlogfile_name() and
> would expect it to return the filename of the file containing the WAL
> record being replayed. But in other contexts, it seems strange for
> pg_last_replay_timeline() to return the TLI of the first record in the
> file, rather than the actual record replayed.

Umm... I might be misunderstanding your point, but curFileTLI is the TLI appearing in the name of the WAL file. So it's not the TLI of the first record in the file, is it?

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Fujii Masao wrote:
> On Mon, Feb 22, 2010 at 9:30 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> Thinking about this again, I'm not sure this is a good idea. Using
>> curFileTLI makes sense if you're going to call pg_xlogfile_name() and
>> would expect it to return the filename of the file containing the WAL
>> record being replayed. But in other contexts, it seems strange for
>> pg_last_replay_timeline() to return the TLI of the first record in the
>> file, rather than the actual record replayed.
>
> Umm... though I might misunderstand your point, curFileTLI is the TLI
> appearing in the name of WAL file.

Yes.

> So it's not the TLI of the first record in the file, isn't it?

Hmm, or is it the TLI of the last record? Not sure. Anyway, if there's a TLI switch in the current WAL file, curFileTLI doesn't always represent the TLI of the current record.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, 2010-01-28 at 10:28 +0200, Heikki Linnakangas wrote:
> Fujii Masao wrote:
> > In relation to the functions added recently, I found an annoying problem;
> > pg_xlogfile_name(pg_last_xlog_receive/replay_location()) might report the
> > wrong name because pg_xlogfile_name() always uses the current timeline,
> > and a backend doesn't know the actual timeline related to the location
> > which pg_last_xlog_receive/replay_location() reports. Even if a backend
> > knows that, pg_xlogfile_name() would be unable to determine which timeline
> > should be used.
>
> Hmm, I'm not sure what the use case for this is

Agreed. What is the use case for this?

-- Simon Riggs www.2ndQuadrant.com
On Tue, Feb 23, 2010 at 4:08 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
>> So it's not the TLI of the first record in the file, isn't it?
>
> Hmm, or is it the TLI of the last record? Not sure. Anyway, if there's a
> TLI switch in the current WAL file, curFileTLI doesn't always represent
> the TLI of the current record.

Hmm. How about using lastPageTLI instead of curFileTLI? lastPageTLI would always represent the TLI of the current record.

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Wed, Feb 24, 2010 at 7:56 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Thu, 2010-01-28 at 10:28 +0200, Heikki Linnakangas wrote:
>> Fujii Masao wrote:
>> > In relation to the functions added recently, I found an annoying problem;
>> > pg_xlogfile_name(pg_last_xlog_receive/replay_location()) might report the
>> > wrong name because pg_xlogfile_name() always uses the current timeline,
>> > and a backend doesn't know the actual timeline related to the location
>> > which pg_last_xlog_receive/replay_location() reports. Even if a backend
>> > knows that, pg_xlogfile_name() would be unable to determine which timeline
>> > should be used.
>>
>> Hmm, I'm not sure what the use case for this is
>
> Agreed. What is the use case for this?

Since the current behavior would annoy many users (e.g., [*1]), I proposed to change it.

[*1] http://archives.postgresql.org/pgsql-hackers/2010-02/msg02014.php

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, 2010-02-25 at 12:02 +0900, Fujii Masao wrote:
> On Wed, Feb 24, 2010 at 7:56 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> > On Thu, 2010-01-28 at 10:28 +0200, Heikki Linnakangas wrote:
> >> Fujii Masao wrote:
> >> > In relation to the functions added recently, I found an annoying problem;
> >> > pg_xlogfile_name(pg_last_xlog_receive/replay_location()) might report the
> >> > wrong name because pg_xlogfile_name() always uses the current timeline,
> >> > and a backend doesn't know the actual timeline related to the location
> >> > which pg_last_xlog_receive/replay_location() reports. Even if a backend
> >> > knows that, pg_xlogfile_name() would be unable to determine which timeline
> >> > should be used.
> >>
> >> Hmm, I'm not sure what the use case for this is
> >
> > Agreed. What is the use case for this?
>
> Since the current behavior would annoy many users (e.g., [*1]),
> I proposed to change it.
>
> [*1]
> http://archives.postgresql.org/pgsql-hackers/2010-02/msg02014.php

OK, go for it.

If we expose the timeline as part of an "xlog location", then we should do that everywhere as a change for 9.0. Clearly, "xlog location" has no meaning without the timeline anyway, so this seems like a necessary change not just a quick fix. It breaks compatibility, but since we're changing replication in 9.0 that shouldn't be a problem.

-- Simon Riggs www.2ndQuadrant.com
On Thu, Feb 25, 2010 at 6:33 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> If we expose the timeline as part of an "xlog location", then we should
> do that everywhere as a change for 9.0.

Everywhere? You mean changing the format of the return value of all the following functions?

- pg_start_backup()
- pg_stop_backup()
- pg_switch_xlog()
- pg_current_xlog_location()
- pg_current_xlog_insert_location()

> Clearly, "xlog location" has no
> meaning without the timeline anyway, so this seems like a necessary
> change not just a quick fix. It breaks compatibility, but since we're
> changing replication in 9.0 that shouldn't be a problem.

Umm... ISTM a large number of users would complain about that change because of compatibility.

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, Feb 25, 2010 at 11:57 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Tue, Feb 23, 2010 at 4:08 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>>> So it's not the TLI of the first record in the file, isn't it?
>>
>> Hmm, or is it the TLI of the last record? Not sure. Anyway, if there's a
>> TLI switch in the current WAL file, curFileTLI doesn't always represent
>> the TLI of the current record.
>
> Hmm. How about using lastPageTLI instead of curFileTLI? lastPageTLI
> would always represent the TLI of the current record.

I attached the revised patch which uses lastPageTLI instead of curFileTLI as the timeline of the last applied record.

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Attachment
On Thu, February 25, 2010 17:34, Fujii Masao wrote:
>
> I attached the revised patch which uses lastPageTLI instead of curFileTLI
> as the timeline of the last applied record.
>

With this patch the standby compiles, tests, installs OK.
I wanted to check with you if the following is expected.

With standby (correctly) as follows:

LOG: redo starts at 0/1000020
LOG: consistent recovery state reached at 0/2000000
LOG: database system is ready to accept read only connections

This is OK.

However, initially (even after the above 'ready' message) the timeline value as reported by pg_xlogfile_name_offset(pg_last_xlog_replay_location()) is zero. After 5 minutes or so (without any activity on primary or standby), it proceeds to 1 (see below):

(standby) 2010.02.25 21:58:21 $ psql
psql (9.0devel)
Type "help" for help.

replicas=# \x
Expanded display is on.
replicas=# select pg_last_xlog_replay_location()
, pg_xlogfile_name_offset(pg_last_xlog_replay_location())
, pg_last_xlog_receive_location()
, pg_xlogfile_name_offset(pg_last_xlog_receive_location())
, now();
-[ RECORD 1 ]-----------------+------------------------------------
pg_last_xlog_replay_location  | 0/0/2000000
pg_xlogfile_name_offset       | (000000000000000000000001,16777216)
pg_last_xlog_receive_location | 1/0/2000000
pg_xlogfile_name_offset       | (000000010000000000000001,16777216)
now                           | 2010-02-25 22:03:41.585808+01

replicas=# select pg_last_xlog_replay_location()
, pg_xlogfile_name_offset(pg_last_xlog_replay_location())
, pg_last_xlog_receive_location()
, pg_xlogfile_name_offset(pg_last_xlog_receive_location())
, now();
-[ RECORD 1 ]-----------------+------------------------------------
pg_last_xlog_replay_location  | 0/0/2000000
pg_xlogfile_name_offset       | (000000000000000000000001,16777216)
pg_last_xlog_receive_location | 1/0/2000000
pg_xlogfile_name_offset       | (000000010000000000000001,16777216)
now                           | 2010-02-25 22:06:56.008181+01

replicas=# select pg_last_xlog_replay_location()
, pg_xlogfile_name_offset(pg_last_xlog_replay_location())
, pg_last_xlog_receive_location()
, pg_xlogfile_name_offset(pg_last_xlog_receive_location())
, now();
-[ RECORD 1 ]-----------------+-------------------------------
pg_last_xlog_replay_location  | 1/0/20000B8
pg_xlogfile_name_offset       | (000000010000000000000002,184)
pg_last_xlog_receive_location | 1/0/20000B8
pg_xlogfile_name_offset       | (000000010000000000000002,184)
now                           | 2010-02-25 22:07:51.368363+01

I'm not sure this qualifies as a bug, but if not, it should probably be mentioned somewhere in the documentation.

(Oh, and to answer Heikki's earlier question, "what are you trying to achieve?": I am trying to keep track of how far behind the standby is when I restore a large dump (500 GB or so) into the primary. Eventually I also want to run pgbench on both at the same time.)

thanks,

Erik Rijkers
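For the lag-tracking use case mentioned above, one rough way to turn two xlog locations into a byte distance is sketched below. This is only an illustration under stated assumptions: the names are hypothetical, the segment size is taken to be the default 16 MB, and each logical log file is assumed to hold 255 usable segments (0xFF000000 bytes), which is my understanding of the WAL layout in this release. The result is only meaningful when both locations are on the same timeline.

```c
#include <stdint.h>

/* Assumption: default 16 MB segments, 255 usable segments per log file. */
#define WAL_SEG_SIZE     0x1000000u
#define WAL_SEGS_PER_LOG (0xFFFFFFFFu / WAL_SEG_SIZE)
#define WAL_LOG_SIZE     ((uint64_t) WAL_SEGS_PER_LOG * WAL_SEG_SIZE)

/* Hypothetical holder for the "xlogid/xrecoff" part of a location. */
typedef struct
{
    uint32_t xlogid;
    uint32_t xrecoff;
} WalPos;

/* Flatten a two-part position into a single byte count. */
static uint64_t
wal_pos_to_bytes(WalPos p)
{
    return (uint64_t) p.xlogid * WAL_LOG_SIZE + p.xrecoff;
}

/*
 * Bytes by which 'behind' trails 'ahead'. Negative if 'behind' is in
 * fact further along than 'ahead'.
 */
static int64_t
wal_bytes_behind(WalPos ahead, WalPos behind)
{
    return (int64_t) wal_pos_to_bytes(ahead)
         - (int64_t) wal_pos_to_bytes(behind);
}
```

Feeding it the primary's current location as 'ahead' and the standby's last replayed location as 'behind' would give an estimate of how many bytes of WAL the standby still has to apply.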
Sorry for the delay.

On Fri, Feb 26, 2010 at 6:26 AM, Erik Rijkers <er@xs4all.nl> wrote:
> With this patch the standby compiles, tests, installs OK.
> I wanted to check with you if the following is expected.

Thanks for the test and bug report!

> With standby (correctly) as follows :
> LOG: redo starts at 0/1000020
> LOG: consistent recovery state reached at 0/2000000
> LOG: database system is ready to accept read only connections
>
> This is OK.
>
> However, initially (even after the above 'ready' message)
> the timeline value as reported by
> pg_xlogfile_name_offset(pg_last_xlog_replay_location())
> is zero.

When we try to read the WAL record discontinuously (e.g., the REDO starting record and the last applied record), the lastPageTLI is always reset. If that record is not in the buffer, it's read from the disk and the lastPageTLI is set to the right timeline. Otherwise, the lastPageTLI wrongly remains at zero. This is the cause of the problem that you reported.

I revised the patch so that the lastPageTLI is always set correctly. Please try this new patch.

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Attachment
Fujii Masao wrote:
> On Fri, Feb 26, 2010 at 6:26 AM, Erik Rijkers <er@xs4all.nl> wrote:
>> With this patch the standby compiles, tests, installs OK.
>> I wanted to check with you if the following is expected.
>
> Thanks for the test and bug report!
>
>> With standby (correctly) as follows :
>> LOG: redo starts at 0/1000020
>> LOG: consistent recovery state reached at 0/2000000
>> LOG: database system is ready to accept read only connections
>>
>> This is OK.
>>
>> However, initially (even after the above 'ready' message)
>> the timeline value as reported by
>> pg_xlogfile_name_offset(pg_last_xlog_replay_location())
>> is zero.
>
> When we try to read the WAL record discontinuously (e.g., the REDO
> starting record and the last applied record), the lastPageTLI is
> always reset. If that record is not in the buffer, it's read from
> the disk and the lastPageTLI is set to the right timeline. Otherwise,
> the lastPageTLI remains at zero wrongly. This is the cause of the
> problem that you reported.
>
> I revised the patch so that the lastPageTLI is always set correctly.
> Please try this new patch.

This still suffers from ambiguity around a shutdown checkpoint that changes the TLI. On the page the shutdown checkpoint is on, what is the TLI in the page header? The TLI before the checkpoint record, I presume. Now consider a record on the same page after the checkpoint record. It's on the new timeline, but pg_last_xlog_replay_location() will return the old TLI, because that's on the page header.

It's not clear what it should return: a TLI corresponding to the filename of the WAL segment the record was replayed from, so that you can use pg_xlogfile_name() to find out the filename of the WAL segment being replayed, or the accurate TLI of the record being replayed. I'm leaning towards the latter, it feels more correct and accurate, but you could argue for the former too. In any case, it needs to be well-defined.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, Mar 2, 2010 at 8:54 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> This still suffers from ambiguity around a shutdown checkpoint that
> changes the TLI. On the page the shutdown checkpoint is on, what is the
> TLI in the page header? The TLI before the checkpoint record, I presume.
> Now consider a record on the same page after the checkpoint record. It's
> on the new timeline, but pg_last_xlog_replay_location() will return the
> old TLI, because that's on the page header.

Oh, I see. You are right.

> It's not clear what it should return, a TLI corresponding the filename
> of the WAL segment the record was replayed from, so that you can use
> pg_xlogfile_name() to find out the filename of the WAL segment being
> replayed, or the accurate TLI of the record being replayed. I'm leaning
> towards the latter, it feels more correct and accurate, but you could
> argue for the former too. In any case, it needs to be well-defined.

I agree with you that the latter is more correct and accurate. The simple fix is updating the lastPageTLI with the CheckPoint->ThisTimeLineID when replaying the shutdown checkpoint record. Though we might need to use a new variable to keep the last applied timeline instead of the lastPageTLI.

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Tue, Mar 2, 2010 at 10:52 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> It's not clear what it should return, a TLI corresponding the filename
>> of the WAL segment the record was replayed from, so that you can use
>> pg_xlogfile_name() to find out the filename of the WAL segment being
>> replayed, or the accurate TLI of the record being replayed. I'm leaning
>> towards the latter, it feels more correct and accurate, but you could
>> argue for the former too. In any case, it needs to be well-defined.
>
> I agree with you that the latter is more correct and accurate. The simple
> fix is updating the lastPageTLI with the CheckPoint->ThisTimeLineID when
> replaying the shutdown checkpoint record. Though we might need to use new
> variable to keep the last applied timeline instead of the lastPageTLI.

Here is the revised patch. I used a new local variable instead of lastPageTLI to track the TLI of the last applied record. It is updated with the TLI of the log page header when reading the page, and with the TLI of the checkpoint record when replaying a shutdown checkpoint record that changes the TLI. So pg_last_xlog_replay_location() can return the accurate TLI of the last applied record.

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Attachment
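The tracking rule described in the message above can be pictured with a small standalone model. This is not code from the patch: the struct, field and function names are hypothetical, and it only mirrors the two update points named in the message (reading a page, and replaying a shutdown checkpoint that changes the timeline).

```c
#include <stdint.h>

/* Hypothetical replay-side state; not the actual variable in the patch. */
typedef struct
{
    uint32_t last_applied_tli;  /* TLI reported for the last applied record */
} ReplayState;

/* Called when a new WAL page is read: adopt the TLI in its page header. */
static void
on_page_read(ReplayState *st, uint32_t page_header_tli)
{
    st->last_applied_tli = page_header_tli;
}

/*
 * Called after replaying a shutdown checkpoint record. If the checkpoint
 * carries a different TLI, records that follow it on the same page are
 * already on the new timeline, so switch immediately instead of waiting
 * for the next page header.
 */
static void
on_shutdown_checkpoint_replayed(ReplayState *st, uint32_t checkpoint_tli)
{
    if (checkpoint_tli != st->last_applied_tli)
        st->last_applied_tli = checkpoint_tli;
}
```

With this rule, a record that sits on the same page as the shutdown checkpoint but after it is reported with the new timeline, which is the case the page-header TLI alone could not cover.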
On Wed, March 3, 2010 15:03, Fujii Masao wrote:
> On Tue, Mar 2, 2010 at 10:52 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>
> Here is the revised patch. I used new local variable instead of lastPageTLI
> to track the tli of last applied record. It is updated with the tli of the
> log page header when reading the page, and with the tli of the checkpoint
> record when replaying the checkpoint shutdown record that changes the tli.
> So pg_last_xlog_replay_location() can return the accurate tli of the last
> applied record.
>

extend_format_of_recovery_info_funcs_v4.patch looks good: on the standby, the initial xlog file_name immediately after startup is now 000000010000000000000001, as expected.

I'll do my further testing of HS/SR with this patch included.

thanks,

Erik Rijkers