BUG #6661: out-of-order XID insertion in KnownAssignedXids - Mailing list pgsql-bugs

From valgog@gmail.com
Subject BUG #6661: out-of-order XID insertion in KnownAssignedXids
Date
Msg-id E1SWwkh-0000M9-Rj@wrigleys.postgresql.org
Whole thread Raw
Responses Re: BUG #6661: out-of-order XID insertion in KnownAssignedXids  (Valentine Gogichashvili <valgog@gmail.com>)
Re: BUG #6661: out-of-order XID insertion in KnownAssignedXids  (Andres Freund <andres@anarazel.de>)
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      6661
Logged by:          Valentine Gogichashvili
Email address:      valgog@gmail.com
PostgreSQL version: 9.0.7
Operating system:   Linux version 2.6.32-5-amd64 (Debian 2.6.32-41sque
Description:=20=20=20=20=20=20=20=20

Hello,

today when trying to restore a replication database I got the following
error:=20

...
2012-05-22 21:20:24.823 CEST,,,23804,,4fbbe69e.5cfc,124,,2012-05-22 21:18:54
CEST,1/0,0,LOG,00000,"restored log file ""0000000200001C9C0000001E"" from
archive",,,,,,,,,""
2012-05-22 21:20:24.954 CEST,,,23804,,4fbbe69e.5cfc,125,,2012-05-22 21:18:54
CEST,1/0,0,LOG,00000,"restored log file ""0000000200001C9C0000001F"" from
archive",,,,,,,,,""
2012-05-22 21:20:26.374 CEST,,,23804,,4fbbe69e.5cfc,126,,2012-05-22 21:18:54
CEST,1/0,0,LOG,00000,"restored log file ""0000000200001C9C00000020"" from
archive",,,,,,,,,""
2012-05-22 21:20:27.238 CEST,,,23804,,4fbbe69e.5cfc,127,,2012-05-22 21:18:54
CEST,1/0,0,LOG,00000,"restored log file ""0000000200001C9C00000021"" from
archive",,,,,,,,,""
2012-05-22 21:20:27.443 CEST,,,23804,,4fbbe69e.5cfc,128,,2012-05-22 21:18:54
CEST,1/0,0,LOG,00000,"consistent recovery state reached at
1C9C/21318C20",,,,,,,,,""
2012-05-22 21:20:27.868 CEST,,,23804,,4fbbe69e.5cfc,129,,2012-05-22 21:18:54
CEST,1/0,0,LOG,00000,"4 KnownAssignedXids (num=3D4 tail=3D0 head=3D916)
[0]=3D3674726497 [1]=3D3674727041 [2]=3D3674727042 [3]=3D3674727128 ",,,,,"=
xlog redo
 running xacts: nextXid 3674728633 latestCompletedXid 3674728632
oldestRunningXid 3674726497; 4 xacts: 3674727041 3674727042 3674727128
3674726497",,,,""
2012-05-22 21:20:27.868 CEST,,,23804,,4fbbe69e.5cfc,130,,2012-05-22 21:18:54
CEST,1/0,0,FATAL,XX000,"out-of-order XID insertion in
KnownAssignedXids",,,,,"xlog redo  running xacts: nextXid 3674728633
latestCompletedXid 3674728632 oldestRunningXid 3674726497; 4 xacts:
3674727041 3674727042 3674727128 3674726497",,,,""
2012-05-22 21:20:27.939 CEST,,,23793,,4fbbe69e.5cf1,1,,2012-05-22 21:18:54
CEST,,0,LOG,00000,"startup process (PID 23804) exited with exit code
1",,,,,,,,,""
2012-05-22 21:20:27.939 CEST,,,23793,,4fbbe69e.5cf1,2,,2012-05-22 21:18:54
CEST,,0,LOG,00000,"terminating any other active server
processes",,,,,,,,,""


This happened after replaying 124 WAL files from archive.=20

Full log is here: https://gist.github.com/ff51ccf506f5e522e271

I tried to start the database 3 times, with the same result.=20

Additionally I removed all existing WAL files from pg_xlog, and tried to
start again, but got the same result this time as well (though probably I
should have tried the other way round, as WAL files are anyway taken from
the archive if there are there... anyway, just want to describe everything I
did)

The database on that machine is running as a read-only hot-standby machine
for running long running statistical queries, that we do not want to run on
the production system. It has significantly bigger shared_buffers =3D 48GB (
in comparison to 8GB on the master ).=20

Sometimes there is really a lot happening on the master machine, and the
replication machine does not seem to be able to catch up with the master.
This usually happens, when the replay is paused because of a long running
query, but after the conflicting query is gone and the replication delay is
bigger then 2-3 Gigabytes, it seems not to be able catch up any longer, it
looks like a snowball effect. At this time it has a 100% of one CPU on
recovery process with practically no IOWait (so for me it seems like a CPU
bottleneck rather then IO one, that is rather confusing).

Here is the replication delay graph
https://lh4.googleusercontent.com/-UBf1iBHZGlQ/T7v9L-bpLvI/AAAAAAAAMms/Ve-e=
AOFEXjw/s800/replication_delay.png

In the past, if the replication delay was bigger then 20GB, I was stopping
the server, restarting it so, that is replays the WAL files from the archive
and subjectively it seemed, like the replay was faster, if shared_buffers
was set to much lower value (4GB).=20

So today, I did the same, but got an error "out-of-order XID insertion in
KnownAssignedXids" and could not restore the machine. Now, while rebuilding
hot-standby of that 250GB database I have some time to write this report.=
=20

More information about the affected machine:=20

Intel(R) Xeon(R) CPU E5540 @ 2.53GHz (16 Cores)

cat /proc/version=20
Linux version 2.6.32-5-amd64 (Debian 2.6.32-41squeeze2) (dannf@debian.org)
(gcc version 4.3.5 (Debian 4.3.5-4) ) #1 SMP Thu Mar 22 17:26:33 UTC 2012
=20
free -m
             total       used       free     shared    buffers     cached
Mem:        129192     128858        333          0         31     124969
-/+ buffers/cache:       3857     125334
Swap:         1952          1       1951


df -h
/dev/mapper/vg00-data
                       54G  1.3G   52G   3% /data
/dev/mapper/vg01-data1
                      1.7T  373G  1.3T  23% /data1

pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: sequential scans that pick up only deleted records do not honor query cancel or timeout
Next
From: Merlin Moncure
Date:
Subject: Re: sequential scans that pick up only deleted records do not honor query cancel or timeout