Re: BUG #4565: nextval not updated during wal replication, leading to pk violations - Mailing list pgsql-bugs
From | Marc Schablewski |
---|---|
Subject | Re: BUG #4565: nextval not updated during wal replication, leading to pk violations |
Date | |
Msg-id | 493D55E6.8010204@clickware.de Whole thread Raw |
In response to | BUG #4565: nextval not updated during wal replication, leading to pk violations ("Marc Schablewski" <ms@clickware.de>) |
List | pgsql-bugs |
FYI: We found another anomaly in that database (duplicated record with a unique index on a column (non PK)) and so we did some more research. As before, the record got replicated, but this time, the index entry was not. Later, a job tried to check if that record was already there by querying the column (thus, using the index) and couldn't find it so it was inserted again. We first thought it could be a WAL that wasn't restored. If I understand pg_backup_start() and _stop() right, the first WAL that should have been requested after the online backup had finished should have been the one mentioned in the "START WAL LOCATION" line of the backup history file, but the first file that actually gets requested is that in the "STOP WAL LOCATION" line. 000000020000004B00000027.00396450.backup: START WAL LOCATION: 4B/27396450 (file 000000020000004B00000027) STOP WAL LOCATION: 4B/28DDF968 (file 000000020000004B00000028) CHECKPOINT LOCATION: 4B/27396450 START TIME: 2008-11-28 16:29:17 CET LABEL: backup-20081128-162916 STOP TIME: 2008-11-28 17:49:09 CET Both WAL files were copied to the backup system. The log switch occurred before 17:06, because at that time the file was copied to the backup server by the archive command. But sadly, the records were all part of a single transaction which ran between 17:32 and 17:33, so obviously this is not a part of the problem. Though it seems a bit strange to me that the ...27 WAL didn't get recovered, but maybe I just don't fully understand that part of the docs (Chapter 24.3.2., fifth paragraph after the ordered list). Right now, I run a fresh online backup to check if we can reproduce the problem. I made some minor corrections on my scripts which prepare the backup database for replication, which might have caused the problem. I now omit the pg_xlog directory when taking the initial backup, after rereading chapter 24.3.2. saying there is a "risk of mistakes" if one copies those files, whatever that means. Maybe someone could clarify what the risks and mistakes could be? Speaking of chapter 24.3.2.: The last sentence in second paragraph after the ordered list ("You must ensure that these steps are carried out in sequence without any possible overlap, or you will invalidate the backup.") seems a bit misplaced. As far as I understand, it makes a reference to the ordered list ("steps"), but I had to read it several times to get it. I think, a better place would be right after the list. ...But then, maybe it's just me, not sleeping well last night. ;) Marc Marc Schablewski wrote: > The following bug has been logged online: > > Bug reference: 4565 > Logged by: Marc Schablewski > Email address: ms@clickware.de > PostgreSQL version: 8.3.5 > Operating system: Debian, 2.6.24-19 kernel > Description: nextval not updated during wal replication, leading to > pk violations > Details: > > Hi! > > We are using "Continuous Archiving" of WAL to keep a warm standby database > server. We recently switch over to that backup database for testing purpose. > We first took an online backup of the master database, let the WAL shipping > run for some days and finally started the backup for normal use. > > A job tried to insert some records into a few tables and complained about a > PK violation. PK are of type bigint. The column is populated by a sequence > and a "default nextval()" on the PK column. We found that the sequence's > currval was lower than the maximum value in the table and that this value > was already present. Further investigation showed us that the last records > in the table were inserted on the former master server while taking the > initial online backup for the replication. It seems that the records got > replicated but not the currval/nextval of the sequence. When running "select > nextval()" on the backup database it returned the PK value of the first > record inserted during that last run on the former master server. > > I couldn't reproduce the problem with two different servers and a simple > database containing one table. Right now, I don't have any idea how to > narrow down the problem or what to check next. So any hint would be > helpful. > > Marc > >
pgsql-bugs by date: