Thread: More subtle issues with cascading replication over timeline switches
When a standby starts up, and catches up with the master through the archive, it copies the target timeline's history file from the archive to pg_xlog. That's enough for that standby's purposes, but if there is a cascading standby or pg_receivexlog connected to it, it will request the timeline history files of *all* timelines between the starting point and current target. For example, create a master, and take a base backup from it. Use the base backup to initialize two standby servers. Now perform failover first to the first standby, and once the second standby has caught up, fail over again, to the second standby. (Or as a shortcut, forget about the standbys, and just create a recovery.conf file in the master with restore_command='/bin/false' and restart it. That causes a timeline switch. Repeat twice) Now use the base backup to initialize a new standby server (you can kill and delete the old ones), using the WAL archive. Set up a second, cascading, standby server that connects to the first standby using streaming replication, not WAL archive. This cascading standby will fail to cross the timeline switch, because it doesn't find all the history files in the standby: C 2013-01-18 13:38:46.047 EET 7695 FATAL: could not receive timeline history file from the primary server: ERROR: could not open file "pg_xlog/00000002.history": No such file or directory Indeed, looking at the pg_xlog, it's not there (I did a couple of extra timeline switches: ~/pgsql.master$ ls -l data-master/pg_xlog/ total 131084 -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000010000000000000001 -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000010000000000000002 -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000010000000000000003 -rw------- 1 heikki heikki 41 Jan 18 13:38 00000002.history -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000020000000000000003 -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000020000000000000004 -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000020000000000000005 -rw------- 1 heikki heikki 83 Jan 18 13:38 00000003.history -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000030000000000000005 -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000030000000000000006 drwx------ 2 heikki heikki 4096 Jan 18 13:38 archive_status ~/pgsql.master$ ls -l data-standbyB/pg_xlog/ total 81928 -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000010000000000000001 -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000010000000000000002 -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000020000000000000003 -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000020000000000000004 -rw------- 1 heikki heikki 83 Jan 18 13:38 00000003.history -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000030000000000000005 drwx------ 2 heikki heikki 4096 Jan 18 13:38 archive_status This can be thought of as another variant of the same issue that was fixed by commit 60df192aea0e6458f20301546e11f7673c102101. When standby B scans for the latest timeline, it finds it to be 3, and it reads the timeline history file for 3. After that patch, it also saves it in pg_xlog. It doesn't save the timeline history file for timeline 2, because that's included in the history of timeline 3. However, when standby C connects, it will try to fetch all the history files that it doesn't have, including 00000002.history, which throws the error. A related problem is that at the segment containing the timeline switch, standby has only restored from archive the WAL file of the new timeline, not the old one. For example above, the timeline switch 1 -> 2 happened while inserting to segment 000000010000000000000003, and a copy of that partial segment was created with the timeline's ID as 000000020000000000000003. The standby only has the segment from the new timeline in pg_xlog, which is enough for that standby's purposes, but will cause an error when the cascading standby tries to stream it: C 2013-01-18 13:46:12.334 EET 8579 FATAL: error reading result of streaming command: ERROR: requested WAL segment 000000010000000000000003 has already been removed A straightforward fix would be for the standby to restore those files that the cascading standby needs from the WAL archive, even if they're not strictly required for that standby itself. But actually, isn't it a bad idea that we store the partial segment, 000000010000000000000003 in this case, in the WAL archive? There's no way to tell that it's partial, and it can clash with a complete segment if more WAL is generated on that timeline. I just argued that pg_receivexlog should not do that, and hence keep the .partial suffix in the same situation, in http://www.postgresql.org/message-id/50F56245.8050802@vmware.com. This needs some more thought. I'll try to come up with something next week, but if anyone has any ideas.. - Heikki
On Friday, January 18, 2013 5:27 PM Heikki Linnakangas wrote: > Indeed, looking at the pg_xlog, it's not there (I did a couple of extra > timeline switches: > ~/pgsql.master$ ls -l data-master/pg_xlog/ > total 131084 > -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000010000000000000001 > -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000010000000000000002 > -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000010000000000000003 > -rw------- 1 heikki heikki 41 Jan 18 13:38 00000002.history > -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000020000000000000003 > -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000020000000000000004 > -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000020000000000000005 > -rw------- 1 heikki heikki 83 Jan 18 13:38 00000003.history > -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000030000000000000005 > -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000030000000000000006 > drwx------ 2 heikki heikki 4096 Jan 18 13:38 archive_status > ~/pgsql.master$ ls -l data-standbyB/pg_xlog/ > total 81928 > -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000010000000000000001 > -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000010000000000000002 > -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000020000000000000003 > -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000020000000000000004 > -rw------- 1 heikki heikki 83 Jan 18 13:38 00000003.history > -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000030000000000000005 > drwx------ 2 heikki heikki 4096 Jan 18 13:38 archive_status > This can be thought of as another variant of the same issue that was > fixed by commit 60df192aea0e6458f20301546e11f7673c102101. When standby B > scans for the latest timeline, it finds it to be 3, and it reads the > timeline history file for 3. After that patch, it also saves it in > pg_xlog. It doesn't save the timeline history file for timeline 2, > because that's included in the history of timeline 3. However, when > standby C connects, it will try to fetch all the history files that it > doesn't have, including 00000002.history, which throws the error. Is the file 00000002.history really required by standby C for any useful purpose? Can we think of change in current designsuch that when standby C connects, even if some old history file (like 00000002.history) is not present, it ignoresthe same and continue. With Regards, Amit Kapila.
Re: More subtle issues with cascading replication over timeline switches
From
Heikki Linnakangas
Date:
On 19.01.2013 14:26, Amit kapila wrote: > On Friday, January 18, 2013 5:27 PM Heikki Linnakangas wrote: > > >> Indeed, looking at the pg_xlog, it's not there (I did a couple of extra >> timeline switches: > >> ~/pgsql.master$ ls -l data-master/pg_xlog/ >> total 131084 >> -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000010000000000000001 >> -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000010000000000000002 >> -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000010000000000000003 >> -rw------- 1 heikki heikki 41 Jan 18 13:38 00000002.history >> -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000020000000000000003 >> -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000020000000000000004 >> -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000020000000000000005 >> -rw------- 1 heikki heikki 83 Jan 18 13:38 00000003.history >> -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000030000000000000005 >> -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000030000000000000006 >> drwx------ 2 heikki heikki 4096 Jan 18 13:38 archive_status >> ~/pgsql.master$ ls -l data-standbyB/pg_xlog/ >> total 81928 >> -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000010000000000000001 >> -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000010000000000000002 >> -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000020000000000000003 >> -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000020000000000000004 >> -rw------- 1 heikki heikki 83 Jan 18 13:38 00000003.history >> -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000030000000000000005 >> drwx------ 2 heikki heikki 4096 Jan 18 13:38 archive_status > >> This can be thought of as another variant of the same issue that was >> fixed by commit 60df192aea0e6458f20301546e11f7673c102101. When standby B >> scans for the latest timeline, it finds it to be 3, and it reads the >> timeline history file for 3. After that patch, it also saves it in >> pg_xlog. It doesn't save the timeline history file for timeline 2, >> because that's included in the history of timeline 3. However, when >> standby C connects, it will try to fetch all the history files that it >> doesn't have, including 00000002.history, which throws the error. > > Is the file 00000002.history really required by standby C for any useful purpose? No, not really. > Can we think of change in current design such that when standby C connects, even if some old history file (like 00000002.history) > is not present, it ignores the same and continue. That would be possible too, with some rejiggering of the code. At the moment, however, the code to find the latest timeline works by checking the existence of timeline history files in order. So it first checks for 00000002.history, then 00000003.history, then 00000004.history and so on, until it gets a file-not-found. That logic doesn't work if there are gaps in the sequence. So I'm inclined to just make sure the history files are always copied. I think it's good to have them around anyway, for debugging purposes. - Heikki
Re: More subtle issues with cascading replication over timeline switches
From
Heikki Linnakangas
Date:
On 18.01.2013 13:57, Heikki Linnakangas wrote: > When a standby starts up, and catches up with the master through the > archive, it copies the target timeline's history file from the archive > to pg_xlog. That's enough for that standby's purposes, but if there is a > cascading standby or pg_receivexlog connected to it, it will request the > timeline history files of *all* timelines between the starting point and > current target. > > For example, create a master, and take a base backup from it. Use the > base backup to initialize two standby servers. Now perform failover > first to the first standby, and once the second standby has caught up, > fail over again, to the second standby. (Or as a shortcut, forget about > the standbys, and just create a recovery.conf file in the master with > restore_command='/bin/false' and restart it. That causes a timeline > switch. Repeat twice) > > Now use the base backup to initialize a new standby server (you can kill > and delete the old ones), using the WAL archive. Set up a second, > cascading, standby server that connects to the first standby using > streaming replication, not WAL archive. This cascading standby will fail > to cross the timeline switch, because it doesn't find all the history > files in the standby: > > C 2013-01-18 13:38:46.047 EET 7695 FATAL: could not receive timeline > history file from the primary server: ERROR: could not open file > "pg_xlog/00000002.history": No such file or directory > > Indeed, looking at the pg_xlog, it's not there (I did a couple of extra > timeline switches: > > ~/pgsql.master$ ls -l data-master/pg_xlog/ > total 131084 > -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000010000000000000001 > -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000010000000000000002 > -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000010000000000000003 > -rw------- 1 heikki heikki 41 Jan 18 13:38 00000002.history > -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000020000000000000003 > -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000020000000000000004 > -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000020000000000000005 > -rw------- 1 heikki heikki 83 Jan 18 13:38 00000003.history > -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000030000000000000005 > -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000030000000000000006 > drwx------ 2 heikki heikki 4096 Jan 18 13:38 archive_status > ~/pgsql.master$ ls -l data-standbyB/pg_xlog/ > total 81928 > -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000010000000000000001 > -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000010000000000000002 > -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000020000000000000003 > -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000020000000000000004 > -rw------- 1 heikki heikki 83 Jan 18 13:38 00000003.history > -rw------- 1 heikki heikki 16777216 Jan 18 13:38 000000030000000000000005 > drwx------ 2 heikki heikki 4096 Jan 18 13:38 archive_status > > This can be thought of as another variant of the same issue that was > fixed by commit 60df192aea0e6458f20301546e11f7673c102101. When standby B > scans for the latest timeline, it finds it to be 3, and it reads the > timeline history file for 3. After that patch, it also saves it in > pg_xlog. It doesn't save the timeline history file for timeline 2, > because that's included in the history of timeline 3. However, when > standby C connects, it will try to fetch all the history files that it > doesn't have, including 00000002.history, which throws the error. > > A related problem is that at the segment containing the timeline switch, > standby has only restored from archive the WAL file of the new timeline, > not the old one. For example above, the timeline switch 1 -> 2 happened > while inserting to segment 000000010000000000000003, and a copy of that > partial segment was created with the timeline's ID as > 000000020000000000000003. The standby only has the segment from the new > timeline in pg_xlog, which is enough for that standby's purposes, but > will cause an error when the cascading standby tries to stream it: > > C 2013-01-18 13:46:12.334 EET 8579 FATAL: error reading result of > streaming command: ERROR: requested WAL segment 000000010000000000000003 > has already been removed > > A straightforward fix would be for the standby to restore those files > that the cascading standby needs from the WAL archive, even if they're > not strictly required for that standby itself. But actually, isn't it a > bad idea that we store the partial segment, 000000010000000000000003 in > this case, in the WAL archive? There's no way to tell that it's partial, > and it can clash with a complete segment if more WAL is generated on > that timeline. I just argued that pg_receivexlog should not do that, and > hence keep the .partial suffix in the same situation, in > http://www.postgresql.org/message-id/50F56245.8050802@vmware.com. I came up with the attached. I decided to use different approaches for the timeline history files and the WAL segments. For timeline history files, try to copy any missing timeline history files from archive into pg_xlog. For example, if recovery begins at timeline 5, and target timeline is 7, recovery will try to restore 00000006.history even though it doesn't itself need it. A cascading standby or pg_receivexlog will then find it there. For WAL files, I changed walsender so that when it reads WAL belonging to a segment that also contains a timeline switch to the next timeline, ie. the last segment on the old timeline, walsender opens the file with the new timeline's ID instead. This comment explains it better: /*------- * When reading from a historic timeline, and there is a timeline * switch within the this segment, read from the WAL segment * belonging to the new timeline. * * For example, imagine that this server is currently on timeline * 5, and we're streaming timeline 4. The switch from timeline 4 * to 5 happened at 0/13002088. In pg_xlog, we have these files: * * ... * 000000040000000000000012 * 000000040000000000000013 * 000000050000000000000013 * 000000050000000000000014 * ... * * In this situation, when requested to send the WAL from * segment 0x13, on timeline 4, we read the WAL from file * 000000050000000000000013. Archive recovery prefers files from * newer timelines, so if the segment was restored from the * archive on this server, the file belonging to the old timeline, * 000000040000000000000013, might not exist. Their contents are * equal up to the switchpoint, because at a timeline switch, the * used portion of the old segment is copied to the new file. *------- */ That seems a bit weird at first, but it matches how archive recovery behaves when it restores the file with the newer timeline's ID, and reads the WAL belonging to the old timeline from that. - Heikki