Thread: out-of-order XID insertion in KnownAssignedXids
Hi All,
we are running many postgresql master/slave setups. The slaves are initialised from a pg_basebackup from the master and are sync streaming from the master. When we determine the master has failed, the slave is promoted. Some time after that, the old master is again initialised with a pg_basebackup and starts streaming from the new master.
Our setup seems pretty "stock" and has been running for us for some years (with different versions of postgresql but also different OSes).
Recently, we have gotten this error a fair amount of times: "out-of-order XID insertion in KnownAssignedXids" when postgresql attempts to start after being initialised with a pg_basebackup from the current master.
The only reference I can find on that particular error, is from 2012 and the resulting issue is long fixed in our version of postgresql (9.5.3) ... (https://www.postgresql.org/message-id/201205230849.59825.andres@anarazel.de)
Once the issue has occurred, a subsequent re-initialisation (with a completely new pg_basebackup) does not resolve the issue.
I have a setup in the failing state, so I can produce any kind of log messages / details that would be helpful.
Thank you for your support,
Fredrik
Hi,
On 2016-10-18 14:57:52 +0200, fredrik@huitfeldt.com wrote:
> we are running many postgresql master/slave setups. The slaves are
> initialised from a pg_basebackup from the master and are sync
> streaming from the master. When we determine the master has failed,
> the slave is promoted. Some time after that, the old master is again
> initialised with a pg_basebackup and starts streaming from the new
> master.
Could you describe in a bit more detail how exactly you're setting up the standbys? E.g. the exact recovery.conf used, and whether you remove any files when starting a standby. Also, how exactly are you promoting standbys?
> Recently, we have gotten this error a fair amount of times: "out-of-order XID insertion in KnownAssignedXids" when postgresql attempts to start after being initialised with a pg_basebackup from the current master.
Which version are you encountering this on precisely?
> Once the issue has occurred, a subsequent re-initialisation (with a completely new pg_basebackup) does not resolve the issue.
How have you recovered from this so far?
> I have a setup in the failing state, so I can produce any kind of log messages / details that would be helpful.
Could you use pg_xlogdump to dump the WAL file on which replay failed? And then attach the output in a compressed manner?
Greetings,
Andres Freund
On Thu, Oct 20, 2016 at 10:21 PM, <fredrik@huitfeldt.com> wrote:
> Version : 9.2.13
You are largely out of date here. The last minor version available in the 9.2 series is 9.2.18, meaning that you are missing more than one year worth of bug fixes.
> - remove a file called backup_label, but I am not certain that this file is
> in fact there (any more).
It is never a good idea when you are trying to restore from a backup, backup_label contains critical information when restoring from a backup, so you may finish with a corrupted data folder.
--
Michael
On 2016-10-20 22:37:15 +0900, Michael Paquier wrote:
> On Thu, Oct 20, 2016 at 10:21 PM, <fredrik@huitfeldt.com> wrote:
> > - remove a file called backup_label, but I am not certain that this file is
> > in fact there (any more).
>
> It is never a good idea when you are trying to restore from a backup,
> backup_label contains critical information when restoring from a
> backup, so you may finish with a corrupted data folder.
And this actually seems like a likely source of these errors. Removing a backup label unfortunately causes hard to diagnose errors, because everything appears to be ok as long as there's no checkpoints while taking the base backups (or when the control file was copied early enough). But as soon as a second checkpoint happens before the control file is copied...
Fredrik, how did you end up removing the label?
Greetings,
Andres Freund
On Thu, Oct 20, 2016 at 8:21 AM, <fredrik@huitfeldt.com> wrote:
> Version : 9.2.13
You are missing over a year's worth of bug fixes.
https://www.postgresql.org/support/versioning/
> - remove a file called backup_label
http://tbeitr.blogspot.com/2015/07/deleting-backuplabel-on-restore-will.html
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
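[Editor's sketch] The advice above, never remove backup_label from a restored base backup, could be enforced by a small pre-flight check in the script that sets up a standby. The following is only a sketch under assumptions: the function names are hypothetical, and it assumes a plain-format pg_basebackup that has been unpacked into the data directory and not yet started (once recovery begins, the server itself renames backup_label, so the check only makes sense before the first start).

```python
import os


def missing_backup_files(data_dir, required=("backup_label",)):
    """Return the names in `required` that are absent from data_dir."""
    return [name for name in required
            if not os.path.isfile(os.path.join(data_dir, name))]


def check_restored_backup(data_dir):
    """Raise if a freshly restored base backup lacks backup_label.

    Hypothetical helper: intended to run after pg_basebackup finishes
    and before the standby is started for the first time.
    """
    missing = missing_backup_files(data_dir)
    if missing:
        raise RuntimeError(
            "refusing to start standby in %s: missing %s"
            % (data_dir, ", ".join(missing)))
    return True
```

A setup script would call check_restored_backup() on the new data directory right before starting the server, and abort the re-initialisation if it raises.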
On Mon, Oct 24, 2016 at 8:10 AM, <fredrik@huitfeldt.com> wrote:
> This was actually introduced some time back, and I am not completely certain
> how it crept into our codebase. I think that at least part of the
> explanation lies in the fact that we are experiencing a fair amount of
> growth in the database size and use on some of our installations. This could
> be the reason why extensive testing did not show the issue back then and why
> we are seeing it now.
If there is no checkpoint during the backup, you dodge the corruption. Higher update volume increases the frequency of checkpoints and larger cluster size makes the backup take longer -- either of which would make corruption more likely.
> Would it make sense to log a warning in the case of a missing backup_label
> file, or would it be difficult to identify that situation in the code?
The problem is, without a backup_label file things look exactly like a crash recovery, which is why it just goes to the last usable checkpoint; that's the correct behavior for crash recovery.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
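[Editor's sketch] The reasoning in this thread can be illustrated with a toy model. This is not PostgreSQL's code: plain integers stand in for WAL positions, and every name below is hypothetical. The model assumes that replay starts at the checkpoint named in backup_label when the file is present, and otherwise at the last checkpoint recorded in the copied control file, as it would in crash recovery. A backup is treated as safe only if replay starts no later than the checkpoint taken when the backup began.

```python
def recovery_start(control_checkpoint, backup_label_checkpoint=None):
    """Pick the position where WAL replay begins in this toy model."""
    if backup_label_checkpoint is not None:
        # backup_label present: start from the backup's own checkpoint.
        return backup_label_checkpoint
    # No backup_label: indistinguishable from crash recovery, so use
    # the last checkpoint the control file knows about.
    return control_checkpoint


def simulate(backup_start, checkpoints_during_backup,
             control_copied_at, keep_label):
    """Return (replay_start, safe) for one backup scenario."""
    # The copied control file only knows checkpoints that completed
    # before the moment it was copied.
    known = [backup_start] + [c for c in checkpoints_during_backup
                              if c <= control_copied_at]
    control_checkpoint = max(known)
    label = backup_start if keep_label else None
    start = recovery_start(control_checkpoint, label)
    # Safe only if no WAL generated since the backup began is skipped.
    return start, start <= backup_start


# No checkpoint during the backup: removing the label goes unnoticed.
print(simulate(100, [], 300, keep_label=False))
# Checkpoint at 200 before the control file is copied: WAL is skipped.
print(simulate(100, [200], 300, keep_label=False))
# Same checkpoint, but the control file was copied early enough.
print(simulate(100, [200], 150, keep_label=False))
# backup_label kept: replay always starts at the backup's checkpoint.
print(simulate(100, [200], 300, keep_label=True))
```

In this model only the second scenario is unsafe, which is consistent with the observation above that a longer backup or more frequent checkpoints make the problem more likely to surface.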