Thread: Hot Standby: too many KnownAssignedXids
Hi, I am seeing the following here on 9.0.1 on Linux x86-64: LOG: redo starts at 1F8/FC00E978 FATAL: too many KnownAssignedXids CONTEXT: xlog redo insert: rel 1663/16384/18373; tid 3829898/23 and this is the complete history: postgres was running as HS in foreground, Ctrl-C'ed it for a restart. LOG: received fast shutdown request LOG: aborting any active transactions FATAL: terminating walreceiver process due to administrator command FATAL: terminating connection due to administrator command LOG: shutting down LOG: database system is shut down Started it up again: $ postgres -D /db/ LOG: database system was shut down in recovery at 2010-11-19 14:36:30 EST LOG: entering standby mode cp: cannot stat `/archive/00000001000001F900000001': No such file or directory cp: cannot stat `/archive/00000001000001F8000000FC': No such file or directory LOG: redo starts at 1F8/FC00E978 FATAL: too many KnownAssignedXids CONTEXT: xlog redo insert: rel 1663/16384/18373; tid 3829898/23 LOG: startup process (PID 30052) exited with exit code 1 LOG: terminating any other active server processes (copied the log files over...) ./postgres -D /db/ LOG: database system was interrupted while in recovery at log time 2010-11-19 14:36:12 EST HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target. LOG: entering standby mode LOG: restored log file "00000001000001F900000001" from archive LOG: restored log file "00000001000001F8000000FC" from archive LOG: redo starts at 1F8/FC00E978 FATAL: too many KnownAssignedXids CONTEXT: xlog redo insert: rel 1663/16384/18373; tid 3829898/23 LOG: startup process (PID 31581) exited with exit code 1 LOG: terminating any other active server processes Changing the line in the source code to give some more output gives me: FATAL: too many KnownAssignedXids. head: 0, tail: 0, nxids: 9978, pArray->maxKnownAssignedXids: 6890 I still have the server, if you want me to debug anything or send a patch against 9.0.1 that gives more output, just let me know. Joachim
On Sat, Nov 20, 2010 at 6:46 AM, Joachim Wieland <joe@mcknight.de> wrote: > I still have the server, if you want me to debug anything or send a > patch against 9.0.1 that gives more output, just let me know. Per previous Simon's comment, the following information would be useful. http://archives.postgresql.org/pgsql-general/2010-10/msg00154.php -------------- If you suspect a bug in Hot Standby, please settrace_recovery_messages = DEBUG2 in postgresql.conf and repeat the action Always useful to know * max_connections * current number of sessions * whether we have two phase commits happening -------------- Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Sun, Nov 21, 2010 at 11:48 PM, Fujii Masao <masao.fujii@gmail.com> wrote: > -------------- > If you suspect a bug in Hot Standby, please set > trace_recovery_messages = DEBUG2 > in postgresql.conf and repeat the action > > Always useful to know > * max_connections > * current number of sessions > * whether we have two phase commits happening > -------------- The trace_recovery_messages parameter does not give more output... max_connections is set to 100 there have been no sessions on the standby itself, but a few on the primary database, I don't know how much but probably not more than 10. The sessions there were doing quite some load however, among them slony synchronization, since the hot standby master database was actually a slony replica. max_prepared_transactions has not been changed from its default value of 0. Joachim
On 19.11.2010 23:46, Joachim Wieland wrote: > FATAL: too many KnownAssignedXids. head: 0, tail: 0, nxids: 9978, > pArray->maxKnownAssignedXids: 6890 Hmm, that's a lot of entries in KnownAssignedXids. Can you recompile with WAL_DEBUG, and run the recovery again with wal_debug=on ? That will print all the replayed WAL records, which is a lot of data, but it might give a hint what's going on. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, Nov 23, 2010 at 8:45 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 19.11.2010 23:46, Joachim Wieland wrote: >> >> FATAL: too many KnownAssignedXids. head: 0, tail: 0, nxids: 9978, >> pArray->maxKnownAssignedXids: 6890 > > Hmm, that's a lot of entries in KnownAssignedXids. > > Can you recompile with WAL_DEBUG, and run the recovery again with > wal_debug=on ? That will print all the replayed WAL records, which is a lot > of data, but it might give a hint what's going on. Sure, but this gives me only one more line: [...] LOG: redo starts at 1F8/FC00E978 LOG: REDO @ 1F8/FC00E978; LSN 1F8/FC00EE90: prev 1F8/FC00E930; xid 385669; len 21; bkpb1: Heap - insert: rel 1663/16384/18373; tid 3829898/23 FATAL: too many KnownAssignedXids CONTEXT: xlog redo insert: rel 1663/16384/18373; tid 3829898/23 LOG: startup process (PID 4587) exited with exit code 1 LOG: terminating any other active server processes Joachim
On 24.11.2010 06:56, Joachim Wieland wrote: > On Tue, Nov 23, 2010 at 8:45 AM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> On 19.11.2010 23:46, Joachim Wieland wrote: >>> >>> FATAL: too many KnownAssignedXids. head: 0, tail: 0, nxids: 9978, >>> pArray->maxKnownAssignedXids: 6890 >> >> Hmm, that's a lot of entries in KnownAssignedXids. >> >> Can you recompile with WAL_DEBUG, and run the recovery again with >> wal_debug=on ? That will print all the replayed WAL records, which is a lot >> of data, but it might give a hint what's going on. > > Sure, but this gives me only one more line: > > [...] > LOG: redo starts at 1F8/FC00E978 > LOG: REDO @ 1F8/FC00E978; LSN 1F8/FC00EE90: prev 1F8/FC00E930; xid > 385669; len 21; bkpb1: Heap - insert: rel 1663/16384/18373; tid > 3829898/23 > FATAL: too many KnownAssignedXids > CONTEXT: xlog redo insert: rel 1663/16384/18373; tid 3829898/23 > LOG: startup process (PID 4587) exited with exit code 1 > LOG: terminating any other active server processes Thanks, I can reproduce this now. This happens when you have a wide gap between the oldest still active xid and the latest xid. When recovery starts, we fetch the oldestActiveXid from the checkpoint record. Let's say that it's 100. We then start replaying WAL records from the Redo pointer, and the first record (heap insert in your case) contains an Xid that's much larger than 100, say 10000. We call RecordKnownAssignedXids() to make note that all xids between that range are in-progress, but there isn't enough room in the array for that. We normally get away with a smallish array because the array is trimmed at commit and abort records, and the special xid-assignment record to handle the case of a lot of subtransactions. We initialize the array from the running-xacts record that's written at a checkpoint. That mechanism fails in this case because the heap insert record is seen before the running-xacts record, causing all those xids in the range 100-10000 to be considered in-progress. The running-xacts record that comes later would prune them, but we don't have enough slots to hold them until that. Hmm. I'm not sure off the top of my head how to fix that. Perhaps stash the xids we see during WAL replay in private memory instead of putting them in the KnownAssignedXids array until we see the running-xacts record. To reproduce this, I did this in the master: postgres=# CREATE FUNCTION insertfunc(n integer) RETURNS VOID AS $$ declare i integer; begin FOR i IN 1..n LOOP BEGIN INSERT INTO foo VALUES (1); EXCEPTION WHEN division_by_zero THEN RAISE NOTICE'divbyzero'; END; END LOOP; end; $$ LANGUAGE plpgsql; postgres=# SELECT insertfunc(100000000); After letting that run for a while, so that a couple of checkpoints have occurred, kill the master and start standby to recover that from archive. After it has replayed all the WAL, stop the standby and restart it. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 24.11.2010 12:48, Heikki Linnakangas wrote: > On 24.11.2010 06:56, Joachim Wieland wrote: >> On Tue, Nov 23, 2010 at 8:45 AM, Heikki Linnakangas >> <heikki.linnakangas@enterprisedb.com> wrote: >>> On 19.11.2010 23:46, Joachim Wieland wrote: >>>> >>>> FATAL: too many KnownAssignedXids. head: 0, tail: 0, nxids: 9978, >>>> pArray->maxKnownAssignedXids: 6890 >>> >>> Hmm, that's a lot of entries in KnownAssignedXids. >>> >>> Can you recompile with WAL_DEBUG, and run the recovery again with >>> wal_debug=on ? That will print all the replayed WAL records, which is >>> a lot >>> of data, but it might give a hint what's going on. >> >> Sure, but this gives me only one more line: >> >> [...] >> LOG: redo starts at 1F8/FC00E978 >> LOG: REDO @ 1F8/FC00E978; LSN 1F8/FC00EE90: prev 1F8/FC00E930; xid >> 385669; len 21; bkpb1: Heap - insert: rel 1663/16384/18373; tid >> 3829898/23 >> FATAL: too many KnownAssignedXids >> CONTEXT: xlog redo insert: rel 1663/16384/18373; tid 3829898/23 >> LOG: startup process (PID 4587) exited with exit code 1 >> LOG: terminating any other active server processes > > Thanks, I can reproduce this now. This happens when you have a wide gap > between the oldest still active xid and the latest xid. > > When recovery starts, we fetch the oldestActiveXid from the checkpoint > record. Let's say that it's 100. We then start replaying WAL records > from the Redo pointer, and the first record (heap insert in your case) > contains an Xid that's much larger than 100, say 10000. We call > RecordKnownAssignedXids() to make note that all xids between that range > are in-progress, but there isn't enough room in the array for that. > > We normally get away with a smallish array because the array is trimmed > at commit and abort records, and the special xid-assignment record to > handle the case of a lot of subtransactions. We initialize the array > from the running-xacts record that's written at a checkpoint. That > mechanism fails in this case because the heap insert record is seen > before the running-xacts record, causing all those xids in the range > 100-10000 to be considered in-progress. The running-xacts record that > comes later would prune them, but we don't have enough slots to hold > them until that. > > Hmm. I'm not sure off the top of my head how to fix that. Perhaps stash > the xids we see during WAL replay in private memory instead of putting > them in the KnownAssignedXids array until we see the running-xacts record. Looking closer at RecordKnownAssignedTransactionIds(), there's a related much more serious bug there too. When latestObservedXid is initialized to the oldest still-running xid, oldestActiveXid, at WAL recovery, we zero the CLOG starting from the oldestActiveXid. That will zap away the clog bits of any old transactions that had already committed before the checkpoint started, but were younger than the oldest still running transaction. The transactions will be lost :-(. It's dangerous to initialize latestObservedXid to anything to an older value. The idea of keeping the seen xids in a temporary list private to the startup process until the running-xacts record would solve that problem too. ProcArrayInitRecoveryInfo() would not be needed anymore, the KnownAssignedXids tracking would start at the first running-xacts record (or shutdown checkpoint) we see, not any sooner than that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 24.11.2010 13:38, Heikki Linnakangas wrote: > It's dangerous to initialize latestObservedXid to anything to an older > value. older value than the nextXid-1 from the checkpoint record, I meant to say. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Wed, 2010-11-24 at 12:48 +0200, Heikki Linnakangas wrote: > When recovery starts, we fetch the oldestActiveXid from the checkpoint > record. Let's say that it's 100. We then start replaying WAL records > from the Redo pointer, and the first record (heap insert in your case) > contains an Xid that's much larger than 100, say 10000. We call > RecordKnownAssignedXids() to make note that all xids between that > range are in-progress, but there isn't enough room in the array for > that. Agreed. > Hmm. I'm not sure off the top of my head how to fix that. Perhaps stash > the xids we see during WAL replay in private memory instead of > putting > them in the KnownAssignedXids array until we see the running-xacts > record. Moving LogStandbySnapshot() earlier will help but won't solve it fully. Will think. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On 24.11.2010 12:48, Heikki Linnakangas wrote: > When recovery starts, we fetch the oldestActiveXid from the checkpoint > record. Let's say that it's 100. We then start replaying WAL records > from the Redo pointer, and the first record (heap insert in your case) > contains an Xid that's much larger than 100, say 10000. We call > RecordKnownAssignedXids() to make note that all xids between that range > are in-progress, but there isn't enough room in the array for that. > > We normally get away with a smallish array because the array is trimmed > at commit and abort records, and the special xid-assignment record to > handle the case of a lot of subtransactions. We initialize the array > from the running-xacts record that's written at a checkpoint. That > mechanism fails in this case because the heap insert record is seen > before the running-xacts record, causing all those xids in the range > 100-10000 to be considered in-progress. The running-xacts record that > comes later would prune them, but we don't have enough slots to hold > them until that. > > Hmm. I'm not sure off the top of my head how to fix that. Perhaps stash > the xids we see during WAL replay in private memory instead of putting > them in the KnownAssignedXids array until we see the running-xacts record. So, here's a patch using that approach. Another approach would be to revisit the way the running-xacts snapshot is taken. Currently, we first take a snapshot, and then WAL-log it. There is a small window between the steps where backends can begin/end transactions, and recovery has to deal with that. When this was designed, there was long discussion on whether we should instead grab WALInsertLock and ProcArrayLock at the same time, to ensure that the running-xacts snapshot represents an up-to-date situation at the point in WAL where it's inserted. We didn't want to do that because both locks can be heavily contended. But maybe we should after all. It would make the recovery code simpler. If we want to get fancy, we wouldn't necessarily need to hold both locks for the whole duration. We could first grab ProcArrayLock and construct the snapshot. Then grab WALInsertLock and release ProcArrayLock, and finally write the WAL record and release WALInsertLock. But that would require small changes to XLogInsert. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Attachment
On 01.12.2010 20:51, Heikki Linnakangas wrote: > Another approach would be to revisit the way the running-xacts snapshot > is taken. Currently, we first take a snapshot, and then WAL-log it. > There is a small window between the steps where backends can begin/end > transactions, and recovery has to deal with that. When this was > designed, there was long discussion on whether we should instead grab > WALInsertLock and ProcArrayLock at the same time, to ensure that the > running-xacts snapshot represents an up-to-date situation at the point > in WAL where it's inserted. > > We didn't want to do that because both locks can be heavily contended. > But maybe we should after all. It would make the recovery code simpler. > > If we want to get fancy, we wouldn't necessarily need to hold both locks > for the whole duration. We could first grab ProcArrayLock and construct > the snapshot. Then grab WALInsertLock and release ProcArrayLock, and > finally write the WAL record and release WALInsertLock. But that would > require small changes to XLogInsert. I took a look at that approach. We don't actually need to hold ProcArrayLock while the WAL-record is written, we need to hold XidGenLock. I believe that's less severe than holding the ProcArrayLock as there's already precedence for writing a WAL record while holding that: we do that when we advance to a new clog page and write a zero-clog-page record. So this is what we should do IMHO. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Attachment
On Thu, 2010-12-02 at 10:39 +0200, Heikki Linnakangas wrote: > On 01.12.2010 20:51, Heikki Linnakangas wrote: > > Another approach would be to revisit the way the running-xacts snapshot > > is taken. Currently, we first take a snapshot, and then WAL-log it. > > There is a small window between the steps where backends can begin/end > > transactions, and recovery has to deal with that. When this was > > designed, there was long discussion on whether we should instead grab > > WALInsertLock and ProcArrayLock at the same time, to ensure that the > > running-xacts snapshot represents an up-to-date situation at the point > > in WAL where it's inserted. > > > > We didn't want to do that because both locks can be heavily contended. > > But maybe we should after all. It would make the recovery code simpler. > > > > If we want to get fancy, we wouldn't necessarily need to hold both locks > > for the whole duration. We could first grab ProcArrayLock and construct > > the snapshot. Then grab WALInsertLock and release ProcArrayLock, and > > finally write the WAL record and release WALInsertLock. But that would > > require small changes to XLogInsert. > > I took a look at that approach. We don't actually need to hold > ProcArrayLock while the WAL-record is written, we need to hold > XidGenLock. I believe that's less severe than holding the ProcArrayLock > as there's already precedence for writing a WAL record while holding > that: we do that when we advance to a new clog page and write a > zero-clog-page record. > > So this is what we should do IMHO. Oh, thanks for looking at this. I've been looking at this also and as we might expect had a slightly different design. First, your assessment of the locking above is better than mine. I agree with your analysis so we should do it that way. The locking issue was the reason I haven't patched this yet so I'm glad you've improved this. In terms of the rest of the patch, it seems we have different designs, I think I have a much simpler, less invasive solution: The cause of the issue is that replay starts at one LSN and there is a delay until the RunningXacts WAL record occurs. If there was no delay, there would be no issue at all. In CreateCheckpoint() we start by grabbing the WAInsertLock and later recording that pointer as part of the checkpoint record. My proposal is to replace the "grab the lock" code with the insert of the RunningXacts WAL record (when wal_level set), so that recovery always starts with that record type. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On 02.12.2010 11:02, Simon Riggs wrote: > The cause of the issue is that replay starts at one LSN and there is a > delay until the RunningXacts WAL record occurs. If there was no delay, > there would be no issue at all. In CreateCheckpoint() we start by > grabbing the WAInsertLock and later recording that pointer as part of > the checkpoint record. My proposal is to replace the "grab the lock" > code with the insert of the RunningXacts WAL record (when wal_level > set), so that recovery always starts with that record type. Oh, interesting idea. But AFAICS closing the gap between acquiring the running-xacts snapshot and writing it to the log is sufficient, I don't see what moving the running-xacts record buys us. Does it allow some further simplifications somewhere? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Thu, 2010-12-02 at 12:41 +0200, Heikki Linnakangas wrote: > On 02.12.2010 11:02, Simon Riggs wrote: > > The cause of the issue is that replay starts at one LSN and there is a > > delay until the RunningXacts WAL record occurs. If there was no delay, > > there would be no issue at all. In CreateCheckpoint() we start by > > grabbing the WAInsertLock and later recording that pointer as part of > > the checkpoint record. My proposal is to replace the "grab the lock" > > code with the insert of the RunningXacts WAL record (when wal_level > > set), so that recovery always starts with that record type. > > Oh, interesting idea. But AFAICS closing the gap between acquiring the > running-xacts snapshot and writing it to the log is sufficient, I don't > see what moving the running-xacts record buys us. Does it allow some > further simplifications somewhere? Your patch is quite long and you do a lot more than just alter the locking. I don't think we need those changes at all and especially would not wish to backpatch that. Earlier on this thread, we discussed: On Wed, 2010-11-24 at 15:19 +0000, Simon Riggs wrote: > On Wed, 2010-11-24 at 12:48 +0200, Heikki Linnakangas wrote: > > When recovery starts, we fetch the oldestActiveXid from the checkpoint > > record. Let's say that it's 100. We then start replaying WAL records > > from the Redo pointer, and the first record (heap insert in your case) > > contains an Xid that's much larger than 100, say 10000. We call > > RecordKnownAssignedXids() to make note that all xids between that > > range are in-progress, but there isn't enough room in the array for > > that. > > Agreed. The current code fails because of the gap between the redo pointer and the XLOG_RUNNING_XACTS WAL record. If there is no gap, there is no problem. So my preferred solution would: * Log XLOG_RUNNING_XACTS while holding XidGenLock, as you suggest * Move logging to occur at the Redo pointer That is a much smaller patch with a smaller footprint. -- Simon Riggs http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services
On 02.12.2010 13:25, Simon Riggs wrote: > On Thu, 2010-12-02 at 12:41 +0200, Heikki Linnakangas wrote: >> On 02.12.2010 11:02, Simon Riggs wrote: >>> The cause of the issue is that replay starts at one LSN and there is a >>> delay until the RunningXacts WAL record occurs. If there was no delay, >>> there would be no issue at all. In CreateCheckpoint() we start by >>> grabbing the WAInsertLock and later recording that pointer as part of >>> the checkpoint record. My proposal is to replace the "grab the lock" >>> code with the insert of the RunningXacts WAL record (when wal_level >>> set), so that recovery always starts with that record type. >> >> Oh, interesting idea. But AFAICS closing the gap between acquiring the >> running-xacts snapshot and writing it to the log is sufficient, I don't >> see what moving the running-xacts record buys us. Does it allow some >> further simplifications somewhere? > > Your patch is quite long and you do a lot more than just alter the > locking. I don't think we need those changes at all and especially would > not wish to backpatch that. Most of the changes to procarray.c were about removing code that's no longer necessary when we close the gap between acquiring and writing the running-xacts WAL record. You can leave it as it is as a historical curiosity, but I'd prefer to simplify it, given that we now know that it doesn't actually work correctly if the gap is not closed. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On 02.12.2010 12:31, Heikki Linnakangas wrote: > On 02.12.2010 13:25, Simon Riggs wrote: >> On Thu, 2010-12-02 at 12:41 +0200, Heikki Linnakangas wrote: >>> On 02.12.2010 11:02, Simon Riggs wrote: >>>> The cause of the issue is that replay starts at one LSN and there is a >>>> delay until the RunningXacts WAL record occurs. If there was no delay, >>>> there would be no issue at all. In CreateCheckpoint() we start by >>>> grabbing the WAInsertLock and later recording that pointer as part of >>>> the checkpoint record. My proposal is to replace the "grab the lock" >>>> code with the insert of the RunningXacts WAL record (when wal_level >>>> set), so that recovery always starts with that record type. >>> >>> Oh, interesting idea. But AFAICS closing the gap between acquiring the >>> running-xacts snapshot and writing it to the log is sufficient, I don't >>> see what moving the running-xacts record buys us. Does it allow some >>> further simplifications somewhere? >> >> Your patch is quite long and you do a lot more than just alter the >> locking. I don't think we need those changes at all and especially would >> not wish to backpatch that. > > Most of the changes to procarray.c were about removing code that's no > longer necessary when we close the gap between acquiring and writing the > running-xacts WAL record. You can leave it as it is as a historical > curiosity, but I'd prefer to simplify it, given that we now know that it > doesn't actually work correctly if the gap is not closed. Ok, I've committed this patch now. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, Dec 7, 2010 at 3:42 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Ok, I've committed this patch now. I can confirm that I could continue replaying the logfiles on the standby host with this patch. Thanks a lot, Joachim