Thread: BUG #6661: out-of-order XID insertion in KnownAssignedXids
The following bug has been logged on the website:

Bug reference:      6661
Logged by:          Valentine Gogichashvili
Email address:      valgog@gmail.com
PostgreSQL version: 9.0.7
Operating system:   Linux version 2.6.32-5-amd64 (Debian 2.6.32-41sque
Description:

Hello,

today when trying to restore a replication database I got the following error:

...
2012-05-22 21:20:24.823 CEST,,,23804,,4fbbe69e.5cfc,124,,2012-05-22 21:18:54 CEST,1/0,0,LOG,00000,"restored log file ""0000000200001C9C0000001E"" from archive",,,,,,,,,""
2012-05-22 21:20:24.954 CEST,,,23804,,4fbbe69e.5cfc,125,,2012-05-22 21:18:54 CEST,1/0,0,LOG,00000,"restored log file ""0000000200001C9C0000001F"" from archive",,,,,,,,,""
2012-05-22 21:20:26.374 CEST,,,23804,,4fbbe69e.5cfc,126,,2012-05-22 21:18:54 CEST,1/0,0,LOG,00000,"restored log file ""0000000200001C9C00000020"" from archive",,,,,,,,,""
2012-05-22 21:20:27.238 CEST,,,23804,,4fbbe69e.5cfc,127,,2012-05-22 21:18:54 CEST,1/0,0,LOG,00000,"restored log file ""0000000200001C9C00000021"" from archive",,,,,,,,,""
2012-05-22 21:20:27.443 CEST,,,23804,,4fbbe69e.5cfc,128,,2012-05-22 21:18:54 CEST,1/0,0,LOG,00000,"consistent recovery state reached at 1C9C/21318C20",,,,,,,,,""
2012-05-22 21:20:27.868 CEST,,,23804,,4fbbe69e.5cfc,129,,2012-05-22 21:18:54 CEST,1/0,0,LOG,00000,"4 KnownAssignedXids (num=4 tail=0 head=916) [0]=3674726497 [1]=3674727041 [2]=3674727042 [3]=3674727128 ",,,,,"xlog redo running xacts: nextXid 3674728633 latestCompletedXid 3674728632 oldestRunningXid 3674726497; 4 xacts: 3674727041 3674727042 3674727128 3674726497",,,,""
2012-05-22 21:20:27.868 CEST,,,23804,,4fbbe69e.5cfc,130,,2012-05-22 21:18:54 CEST,1/0,0,FATAL,XX000,"out-of-order XID insertion in KnownAssignedXids",,,,,"xlog redo running xacts: nextXid 3674728633 latestCompletedXid 3674728632 oldestRunningXid 3674726497; 4 xacts: 3674727041 3674727042 3674727128 3674726497",,,,""
2012-05-22 21:20:27.939 CEST,,,23793,,4fbbe69e.5cf1,1,,2012-05-22 21:18:54 CEST,,0,LOG,00000,"startup process (PID 23804) exited with exit code 1",,,,,,,,,""
2012-05-22 21:20:27.939 CEST,,,23793,,4fbbe69e.5cf1,2,,2012-05-22 21:18:54 CEST,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,""

This happened after replaying 124 WAL files from archive.

Full log is here: https://gist.github.com/ff51ccf506f5e522e271

I tried to start the database 3 times, with the same result.

Additionally, I removed all existing WAL files from pg_xlog and tried to start again, but got the same result this time as well (though probably I should have tried the other way round, as WAL files are taken from the archive anyway if they are there... in any case, I just want to describe everything I did).

The database on that machine is running as a read-only hot standby for long-running statistical queries that we do not want to run on the production system. It has significantly bigger shared_buffers = 48GB (in comparison to 8GB on the master).

Sometimes there is really a lot happening on the master machine, and the replication machine does not seem to be able to catch up with the master. This usually happens when replay is paused because of a long-running query; but after the conflicting query is gone and the replication delay is bigger than 2-3 gigabytes, it seems unable to catch up any longer. It looks like a snowball effect.
At this time the recovery process is using 100% of one CPU with practically no IOWait (so to me it looks like a CPU bottleneck rather than an IO one, which is rather confusing). Here is the replication delay graph:
https://lh4.googleusercontent.com/-UBf1iBHZGlQ/T7v9L-bpLvI/AAAAAAAAMms/Ve-eAOFEXjw/s800/replication_delay.png

In the past, if the replication delay got bigger than 20GB, I would stop the server and restart it so that it replays the WAL files from the archive; subjectively it seemed like the replay was faster if shared_buffers was set to a much lower value (4GB).

So today I did the same, but got the error "out-of-order XID insertion in KnownAssignedXids" and could not restore the machine. Now, while rebuilding the hot standby of that 250GB database, I have some time to write this report.

More information about the affected machine:

Intel(R) Xeon(R) CPU E5540 @ 2.53GHz (16 cores)

cat /proc/version
Linux version 2.6.32-5-amd64 (Debian 2.6.32-41squeeze2) (dannf@debian.org) (gcc version 4.3.5 (Debian 4.3.5-4) ) #1 SMP Thu Mar 22 17:26:33 UTC 2012

free -m
             total       used       free     shared    buffers     cached
Mem:        129192     128858        333          0         31     124969
-/+ buffers/cache:       3857     125334
Swap:         1952          1       1951

df -h
/dev/mapper/vg00-data    54G  1.3G   52G   3% /data
/dev/mapper/vg01-data1  1.7T  373G  1.3T  23% /data1
Oh, that is not good: I am getting the same error after rebuilding the hot standby from the master.

2012-05-23 02:08:36.960 CEST,,,21080,,4fbc2a84.5258,1,,2012-05-23 02:08:36 CEST,,0,LOG,00000,"database system was interrupted while in recovery at log time 2012-05-22 21:33:49 CEST",,"If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.",,,,,,,""
2012-05-23 02:08:36.967 CEST,,,21080,,4fbc2a84.5258,2,,2012-05-23 02:08:36 CEST,,0,LOG,00000,"entering standby mode",,,,,,,,,""
2012-05-23 02:08:36.987 CEST,,,21080,,4fbc2a84.5258,3,,2012-05-23 02:08:36 CEST,,0,LOG,00000,"restored log file ""0000000200001CA2000000BE"" from archive",,,,,,,,,""
2012-05-23 02:08:36.987 CEST,,,21080,,4fbc2a84.5258,4,,2012-05-23 02:08:36 CEST,1/0,0,LOG,00000,"redo starts at 1CA2/BE38F0E0",,,,,,,,,""
2012-05-23 02:08:36.997 CEST,,,21080,,4fbc2a84.5258,5,,2012-05-23 02:08:36 CEST,1/0,0,LOG,00000,"file ""pg_multixact/offsets/0BF1"" doesn't exist, reading as zeroes",,,,,"xlog redo create multixact 200389058 offset 473492754: 3675639270 3675639273",,,,""
2012-05-23 02:08:36.997 CEST,,,21080,,4fbc2a84.5258,6,,2012-05-23 02:08:36 CEST,1/0,0,LOG,00000,"file ""pg_multixact/members/1C38"" doesn't exist, reading as zeroes",,,,,"xlog redo create multixact 200389058 offset 473492754: 3675639270 3675639273",,,,""
2012-05-23 02:08:37.096 CEST,,,21080,,4fbc2a84.5258,7,,2012-05-23 02:08:36 CEST,1/0,0,LOG,00000,"restored log file ""0000000200001CA2000000BF"" from archive",,,,,,,,,""
2012-05-23 02:08:37.213 CEST,,,21080,,4fbc2a84.5258,8,,2012-05-23 02:08:36 CEST,1/0,0,LOG,00000,"restored log file ""0000000200001CA2000000C0"" from archive",,,,,,,,,""
2012-05-23 02:08:37.393 CEST,,,21080,,4fbc2a84.5258,9,,2012-05-23 02:08:36 CEST,1/0,0,LOG,00000,"restored log file ""0000000200001CA2000000C1"" from archive",,,,,,,,,""
2012-05-23 02:08:37.568 CEST,,,21080,,4fbc2a84.5258,10,,2012-05-23 02:08:36 CEST,1/0,0,LOG,00000,"restored log file ""0000000200001CA2000000C2"" from archive",,,,,,,,,""
2012-05-23 02:08:37.932 CEST,,,21080,,4fbc2a84.5258,11,,2012-05-23 02:08:36 CEST,1/0,0,LOG,00000,"restored log file ""0000000200001CA2000000C3"" from archive",,,,,,,,,""
2012-05-23 02:08:38.297 CEST,,,21080,,4fbc2a84.5258,12,,2012-05-23 02:08:36 CEST,1/0,0,LOG,00000,"restored log file ""0000000200001CA2000000C4"" from archive",,,,,,,,,""
....
2012-05-23 02:08:41.617 CEST,,,21080,,4fbc2a84.5258,32,,2012-05-23 02:08:36 CEST,1/0,0,LOG,00000,"restored log file ""0000000200001CA2000000D8"" from archive",,,,,,,,,""
2012-05-23 02:08:41.753 CEST,,,21080,,4fbc2a84.5258,33,,2012-05-23 02:08:36 CEST,1/0,0,LOG,00000,"restored log file ""0000000200001CA2000000D9"" from archive",,,,,,,,,""
2012-05-23 02:08:41.881 CEST,,,21080,,4fbc2a84.5258,34,,2012-05-23 02:08:36 CEST,1/0,0,LOG,00000,"restored log file ""0000000200001CA2000000DA"" from archive",,,,,,,,,""
2012-05-23 02:08:41.888 CEST,,,21080,,4fbc2a84.5258,35,,2012-05-23 02:08:36 CEST,1/0,0,LOG,00000,"4 KnownAssignedXids (num=4 tail=0 head=873) [0]=3675654963 [1]=3675655620 [2]=3675655621 [3]=3675655814 ",,,,,"xlog redo running xacts: nextXid 3675658265 latestCompletedXid 3675658264 oldestRunningXid 3675654963; 4 xacts: 3675655620 3675655621 3675655814 3675654963",,,,""
2012-05-23 02:08:41.888 CEST,,,21080,,4fbc2a84.5258,36,,2012-05-23 02:08:36 CEST,1/0,0,FATAL,XX000,"out-of-order XID insertion in KnownAssignedXids",,,,,"xlog redo running xacts: nextXid 3675658265 latestCompletedXid 3675658264 oldestRunningXid 3675654963; 4 xacts: 3675655620 3675655621 3675655814 3675654963",,,,""
2012-05-23 02:08:41.909 CEST,,,21078,,4fbc2a84.5256,1,,2012-05-23 02:08:36 CEST,,0,LOG,00000,"startup process (PID 21080) exited with exit code 1",,,,,,,,,""
2012-05-23 02:08:41.909 CEST,,,21078,,4fbc2a84.5256,2,,2012-05-23 02:08:36 CEST,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,""

On Tue, May 22, 2012 at 11:36 PM, <valgog@gmail.com> wrote:
> The following bug has been logged on the website:
>
> Bug reference: 6661
> [...]
Hi Valentine,

On Tuesday, May 22, 2012 11:36:23 PM valgog@gmail.com wrote:
> 2012-05-22 21:20:27.868 CEST,,,23804,,4fbbe69e.5cfc,129,,2012-05-22 21:18:54 CEST,1/0,0,LOG,00000,"4 KnownAssignedXids (num=4 tail=0 head=916) [0]=3674726497 [1]=3674727041 [2]=3674727042 [3]=3674727128 ",,,,,"xlog redo running xacts: nextXid 3674728633 latestCompletedXid 3674728632 oldestRunningXid 3674726497; 4 xacts: 3674727041 3674727042 3674727128 3674726497",,,,""
> 2012-05-22 21:20:27.868 CEST,,,23804,,4fbbe69e.5cfc,130,,2012-05-22 21:18:54 CEST,1/0,0,FATAL,XX000,"out-of-order XID insertion in KnownAssignedXids",,,,,"xlog redo running xacts: nextXid 3674728633 latestCompletedXid 3674728632 oldestRunningXid 3674726497; 4 xacts: 3674727041 3674727042 3674727128 3674726497",,,,""
> [...]

Could you provide a log with log_min_messages=DEBUG4? This will be quite long though...

Andres
> Could you provide a log with log_min_messages=DEBUG4? This will be quite
> long though...
>
> Andres

Hm... I have the data directory, but not the WAL files any more. Forgot to back them up :(
On Wednesday, May 23, 2012 12:30:31 PM Valentine Gogichashvili wrote:
> > Could you provide a log with log_min_messages=DEBUG4? This will be quite
> > long though...
> Hm... I have the data directory, but not the WAL files any more. Forgot to
> back them up :(

Too bad, I think without more information we won't be able to fix this. I feel bad for it, but I actually hope you run into it again ;)

Here is what I found strange, should somebody ever come back to the issue:
- the standby was in a consistent state
- according to the error annotation we were processing an XLOG_RUNNING_XACTS record
- ProcArrayApplyRecoveryInfo with STANDBY_SNAPSHOT_READY only does
      ExpireOldKnownAssignedTransactionIds(running->oldestRunningXid);
      StandbyReleaseOldLocks(running->xcnt, running->xids);
  before short-circuiting. How come we call KnownAssignedXidsAdd in that case?

Greetings,

Andres
--
Andres Freund                 http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
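For readers without the source at hand, the short-circuit path described above looks roughly like the following. This is a paraphrased, compilable sketch reconstructed from memory of 9.0-era procarray.c and xlog.h, not verbatim source; apply_running_xacts is a stand-in name for the relevant part of ProcArrayApplyRecoveryInfo, and the two prune routines are declared as stubs:

    #include <stdint.h>

    typedef uint32_t TransactionId;

    /* HotStandbyState progression, per 9.0-era xlog.h */
    typedef enum HotStandbyState
    {
        STANDBY_DISABLED,
        STANDBY_INITIALIZED,
        STANDBY_SNAPSHOT_PENDING,
        STANDBY_SNAPSHOT_READY
    } HotStandbyState;

    /* stand-ins for the real prune routines in procarray.c/standby.c */
    void ExpireOldKnownAssignedTransactionIds(TransactionId oldestRunningXid);
    void StandbyReleaseOldLocks(int nxids, TransactionId *xids);

    void
    apply_running_xacts(HotStandbyState standbyState,
                        TransactionId oldestRunningXid,
                        int xcnt, TransactionId *xids)
    {
        if (standbyState == STANDBY_SNAPSHOT_READY)
        {
            /* prune-only path: KnownAssignedXidsAdd() is never reached */
            ExpireOldKnownAssignedTransactionIds(oldestRunningXid);
            StandbyReleaseOldLocks(xcnt, xids);
            return;             /* short-circuit */
        }

        /* ... STANDBY_INITIALIZED / STANDBY_SNAPSHOT_PENDING handling ... */
    }

The point is that once the snapshot is READY, a XLOG_RUNNING_XACTS record only prunes state and nothing should reach KnownAssignedXidsAdd(), which is why the FATAL above is surprising.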
On Wednesday, May 23, 2012 06:27:48 PM Andres Freund wrote:
> Here is what I found strange, should somebody ever come back to the issue:
> - the standby was in a consistent state
> - according to the error annotation we were processing an XLOG_RUNNING_XACTS record
> - ProcArrayApplyRecoveryInfo with STANDBY_SNAPSHOT_READY only does
>       ExpireOldKnownAssignedTransactionIds(running->oldestRunningXid);
>       StandbyReleaseOldLocks(running->xcnt, running->xids);
>   before short-circuiting. How come we call KnownAssignedXidsAdd in that case?

Looking over the code again: reachedConsistency doesn't imply STANDBY_SNAPSHOT_READY.

Andres
--
Andres Freund                 http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
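That distinction is easy to miss in the log output: "consistent recovery state reached" reflects a different piece of startup-process state than the snapshot machinery does. Roughly (variable names as recalled from sources of this era, reusing the HotStandbyState enum from the sketch above; a hedged sketch, not verbatim source):

    /* xlog.c: flipped once replay passes minRecoveryPoint; drives the
     * "consistent recovery state reached" message */
    extern bool reachedConsistency;

    /* procarray.c: advanced to STANDBY_SNAPSHOT_READY only once a usable
     * running-xacts snapshot has been applied or assembled incrementally */
    extern HotStandbyState standbyState;

So a standby can be consistent while standbyState is still STANDBY_SNAPSHOT_PENDING, which is exactly the window the rest of the thread digs into.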
On Wednesday, May 23, 2012 12:30:31 PM Valentine Gogichashvili wrote:
> Hm... I have the data directory, but not the WAL files any more. Forgot to
> back them up :(

Do you have both long-running transactions and transactions with many subtransactions?

Andres
Hello Andres,

> Do you have both long-running transactions and transactions with many
> subtransactions?

Yes: long-running plain SQL queries, as well as many short SQL queries running one after another in one long transaction (we are working on eliminating this kind of behavior, but it is some of the analysis tools opening a transaction and then fetching lots of data with small queries). Also, many relatively short-running calls (up to 500ms) to quite complicated read-only stored procedures are made constantly on that machine.

Regards,

-- Valentine
Hello again,

I have the situation again: one of the 3 slaves was slow to replay all the WAL files and, being about 10GB behind, it crashed with the same error again.

I collected DEBUG4 output this time: https://docs.google.com/open?id=0B2NMMrfiBQcLZjNDbU0xQ3lvWms

I hope it will be helpful.

Regards,

-- Valentine Gogichashvili

On Wed, May 23, 2012 at 10:51 PM, Valentine Gogichashvili <valgog@gmail.com> wrote:
>> But no subtransactions with writes (plpgsql + EXCEPTION also counts if it does
>> DML)?
>
> On the master, yes. Practically all the stored procedures there are
> catching exceptions in their own bodies and returning result codes back.
>
> Regards,
>
> -- Valentine
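That detail matters because each plpgsql BEGIN ... EXCEPTION block runs inside a subtransaction, and once that subtransaction writes it is assigned its own XID. A backend's PGPROC caches only a bounded number of subtransaction XIDs; past that limit, snapshots that include the backend are marked subxid-overflowed. The constant is from src/include/storage/proc.h as recalled from sources of this era (treat the exact wording as a sketch):

    /* src/include/storage/proc.h (9.0-era; value recalled from memory) */
    #define PGPROC_MAX_CACHED_SUBXIDS  64

    /* Consequence: with more than 64 live subxids in one backend, the
     * running-xacts data can no longer enumerate them all and the
     * snapshot is flagged overflowed, e.g. the "subxid ovf" seen in the
     * DEBUG4 log excerpts below. */

With hundreds of exception-catching procedure calls per transaction, the master routinely produces exactly such overflowed snapshots.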
Hi,

On Thursday, June 07, 2012 12:44:08 PM Valentine Gogichashvili wrote:
> I have the situation again: one of the 3 slaves was slow to replay all the WAL
> files and, being about 10GB behind, it crashed with the same error again.
>
> I collected DEBUG4 output this time:
> https://docs.google.com/open?id=0B2NMMrfiBQcLZjNDbU0xQ3lvWms

Ok, I stared at this some time and I think I see what the problem is. Some log excerpts that led my reasoning:

2012-06-06 15:35:51.954 "recovery snapshot waiting for non-overflowed snapshot or until oldest active xid on standby is at least 3730302193 (now 3730301766)",,,,,"xlog redo running xacts: nextXid 3730302194 latestCompletedXid 3730302179 oldestRunningXid 3730301766; 68 xacts: ... subxid ovf",,,,""

So we found an overflowed snapshot after we are already in STANDBY_SNAPSHOT_PENDING. This implies we have seen an overflowed snapshot before. In STANDBY_SNAPSHOT_PENDING we start recording known assigned xids to build our snapshot incrementally.

2012-06-06 15:35:51.954 record known xact 3730301766 latestObservedXid 3730302197
2012-06-06 15:37:49.580 record known xact 3730316333 latestObservedXid 3730316332 3730316517

Ok, we started filling the KnownAssignedXids machinery. That means procArray->numKnownAssignedXids != 0.

2012-06-06 15:37:54.401 FATAL,XX000,"out-of-order XID insertion in KnownAssignedXids",,,,,"xlog redo running xacts: nextXid 3730316518 latestCompletedXid 3730316517 oldestRunningXid 3730316333; 4 xacts: 3730316429 3730316508 3730316333 3730316430",,,,""

The "xlog redo running xacts" bit tells us we have got a snapshot that is *not* overflowed. That means that in ProcArrayApplyRecoveryInfo, in the following part:

    if (standbyState == STANDBY_SNAPSHOT_PENDING)
    {
        /*
         * If the snapshot isn't overflowed or if its empty we can
         * reset our pending state and use this snapshot instead.
         */
        if (!running->subxid_overflow || running->xcnt == 0)
        {
            standbyState = STANDBY_INITIALIZED;
        }
        else
        {
            ...
        }
    }

we reset our state back to STANDBY_INITIALIZED and fall back to computing our state at once, in one piece, instead of incrementally as we started doing for SNAPSHOT_PENDING.

The problem is that that code assumes we run without any previously recorded xids. This is even formalized in asserts, which we don't hit because we're running without assertions enabled:

    ...
    Assert(standbyState == STANDBY_INITIALIZED);
    ...
    Assert(procArray->numKnownAssignedXids == 0);
    ...

After that we start adding all currently running xids from the snapshot to the KnownAssigned machinery. They are already recorded though, so we fail in KnownAssignedXidsAdd with the OP's error.

The simplest fix for that seems to be to simply reset the KnownAssignedXids state in the above branch. Any arguments against that?

This seems to have been broken in commit 10b7c686e52a6d1bb10194ebf9331ef06f044d46.

Andres
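To make the failure concrete, here is a minimal standalone model of the mechanism (plain C written for this thread, not PostgreSQL source; the array and the check are simplified, and real XID comparisons are wraparound-aware). KnownAssignedXids is kept strictly increasing so it can be binary-searched, so replaying the STANDBY_INITIALIZED path on top of xids already recorded during STANDBY_SNAPSHOT_PENDING necessarily appends a non-increasing xid and trips the check:

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    static TransactionId known[16];
    static int head = 0;        /* next free slot; tail stays 0 in this model */

    static void
    known_assigned_xids_add(TransactionId xid)
    {
        /* simplified stand-in for the sanity check in KnownAssignedXidsAdd() */
        if (head > 0 && known[head - 1] >= xid)
        {
            fprintf(stderr,
                    "FATAL:  out-of-order XID insertion in KnownAssignedXids\n");
            exit(1);
        }
        known[head++] = xid;
    }

    int
    main(void)
    {
        /* the four xids from the log above, in sorted order */
        TransactionId snap[] = {3730316333u, 3730316429u, 3730316430u, 3730316508u};
        int i;

        /* STANDBY_SNAPSHOT_PENDING: xids recorded incrementally from WAL */
        for (i = 0; i < 4; i++)
            known_assigned_xids_add(snap[i]);

        /*
         * A non-overflowed running-xacts record arrives; the buggy path falls
         * back to STANDBY_INITIALIZED and re-adds the snapshot's xids without
         * clearing the array first.  The very first duplicate fails the
         * ordering check, mirroring the FATAL in the log.
         */
        for (i = 0; i < 4; i++)
            known_assigned_xids_add(snap[i]);

        return 0;
    }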
On Thursday, June 07, 2012 03:58:24 PM Andres Freund wrote:
> Ok, I stared at this some time and I think I see what the problem is.
> ...
> After that we start adding all currently running xids from the snapshot to
> the KnownAssigned machinery. They are already recorded though, so we fail
> in KnownAssignedXidsAdd with the OP's error.
>
> The simplest fix for that seems to be to simply reset the KnownAssignedXids
> state in the above branch. Any arguments against that?

A patch implementing that is attached. Unfortunately it is not really tested yet, because it's kinda hard to hit that code path.

Valentine, can you test that patch?

Andres
--
Andres Freund                 http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
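The attachment itself is not reproduced in this archive. Based purely on the description above, the fix plausibly amounts to clearing the known-assigned machinery in the fallback branch, roughly like this (a hedged sketch; KnownAssignedXidsReset is the obvious shape for such a helper, but the actual patch may name or structure it differently):

    if (standbyState == STANDBY_SNAPSHOT_PENDING)
    {
        /*
         * If the snapshot isn't overflowed or if its empty we can
         * reset our pending state and use this snapshot instead.
         */
        if (!running->subxid_overflow || running->xcnt == 0)
        {
            /*
             * Throw away any xids recorded while we were PENDING, so the
             * STANDBY_INITIALIZED path below starts from the empty state
             * its Asserts expect (hypothetical helper; see lead-in).
             */
            KnownAssignedXidsReset();
            standbyState = STANDBY_INITIALIZED;
        }
        else
        {
            ...
        }
    }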
On 7 June 2012 17:49, Andres Freund <andres@2ndquadrant.com> wrote:
> A patch implementing that is attached. Unfortunately it is not really tested yet,
> because it's kinda hard to hit that code path.
>
> Valentine, can you test that patch?

Patch looks good, so as soon as we confirm it we can apply.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi Andres,

unfortunately I did not manage to keep the snapshot of the database that had this issue.

Could it be that this issue only happens when recovering from the WAL files, and not over streaming replication? On the other standbys, having no WAL delay, we have not seen this problem yet.

What I can do is switch streaming replication off on one of the machines, and wait :(

Would it be possible to somehow generate a workload that would lead to this problem?

With best regards,

-- Valentine Gogichashvili

On Thu, Jun 7, 2012 at 6:49 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> A patch implementing that is attached. Unfortunately it is not really tested yet,
> because it's kinda hard to hit that code path.
>
> Valentine, can you test that patch?
Hi Valentine,

On Friday, June 08, 2012 12:25:25 PM Valentine Gogichashvili wrote:
> unfortunately I did not manage to keep the snapshot of the database that
> had this issue.

Too bad. Let me see if I can reproduce it manually.

> Could it be that this issue only happens when recovering from the WAL files,
> and not over streaming replication? On the other standbys, having no WAL
> delay, we have not seen this problem yet.

No, the delivery method is not relevant. The relevant part is when the standby was started. A bit tricky to hit the right spot ;)

> Would it be possible to somehow generate a workload that would lead to
> this problem?

I think I can do that. Let me play for a bit.

Andres
Hi All,

On Friday, June 08, 2012 12:42:00 PM Andres Freund wrote:
> On Friday, June 08, 2012 12:25:25 PM Valentine Gogichashvili wrote:
> > unfortunately I did not manage to keep the snapshot of the database that
> > had this issue.

Ok, after some playing I could reproduce the issue:

master:
S1: CREATE FUNCTION insertone() RETURNS void LANGUAGE plpgsql AS $$
    BEGIN
        INSERT INTO data(data) VALUES (random());
    EXCEPTION WHEN division_by_zero THEN
        RAISE NOTICE 'huh';
    END;$$;
S1: BEGIN;
S1: SELECT insertone() FROM generate_series(1, 1000); -- subxid overflow
S2: BEGIN;
S2: SELECT insertone(); -- allocate xid

standby:
pg_basebackup
LOG:  00000: consistent recovery state reached at 0/5C0001C0
LOG:  00000: recovery snapshot waiting for non-overflowed snapshot or until oldest active xid on standby is at least 4752 (now 3750)

master:
S1: commit;
S3: checkpoint;

standby:
before the patch: crash in assert/elog
after: LOG:  00000: recovery snapshots are now enabled

So I can confirm the above patch fixes at least that bug.

Andres
--
Andres Freund                 http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 8 June 2012 12:40, Andres Freund <andres@2ndquadrant.com> wrote:
> Ok, after some playing I could reproduce the issue:
> ...
> So I can confirm the above patch fixes at least that bug.

Committed, with minor changes in wording of comments and error messages.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services