Index: doc/src/sgml/config.sgml =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/doc/src/sgml/config.sgml,v retrieving revision 1.191 diff -c -r1.191 config.sgml *** doc/src/sgml/config.sgml 30 Sep 2008 10:52:09 -0000 1.191 --- doc/src/sgml/config.sgml 1 Nov 2008 14:49:38 -0000 *************** *** 5284,5289 **** --- 5284,5315 ---- + + trace_recovery_messages (string) + + trace_recovery_messages configuration parameter + + + + Controls which message levels are written to the server log + for system modules needed for recovery processing. This allows + the user to override the normal setting of log_min_messages, + but only for specific messages. This is intended for use in + debugging Hot Standby. + Valid values are DEBUG5, DEBUG4, + DEBUG3, DEBUG2, DEBUG1, + INFO, NOTICE, WARNING, + ERROR, LOG, FATAL, and + PANIC. Each level includes all the levels that + follow it. The later the level, the fewer messages are sent + to the log. The default is WARNING. Note that + LOG has a different rank here than in + client_min_messages. + Parameter should be set in the postgresql.conf only. + + + + zero_damaged_pages (boolean) Index: doc/src/sgml/func.sgml =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/doc/src/sgml/func.sgml,v retrieving revision 1.451 diff -c -r1.451 func.sgml *** doc/src/sgml/func.sgml 27 Oct 2008 09:37:46 -0000 1.451 --- doc/src/sgml/func.sgml 1 Nov 2008 14:49:38 -0000 *************** *** 12419,12424 **** --- 12419,12615 ---- . + + pg_is_in_recovery + + + pg_last_completed_xact_timestamp + + + pg_last_completed_xid + + + pg_recovery_pause + + + pg_recovery_continue + + + pg_recovery_pause_cleanup + + + pg_recovery_pause_xid + + + pg_recovery_pause_time + + + pg_recovery_stop + + + + The functions shown in assist in archive recovery. + Except for the first three functions, these are restricted to superusers. + All of these functions can only be executed during recovery. + + + + Recovery Control Functions + + + Name Return Type Description + + + + + + + pg_is_in_recovery() + + bool + True if recovery is still in progress. + + + + pg_last_completed_xact_timestamp() + + timestamp with time zone + Returns the original completion timestamp with timezone of the + last completed transaction in the current recovery. + + + + + pg_last_completed_xid() + + integer + Returns the transaction id (32-bit) of last completed transaction + in the current recovery. Later numbered transaction ids may already have + completed. This is unrelated to transactions on the source server. + + + + + + pg_recovery_pause() + + void + Pause recovery processing, unconditionally. + + + + pg_recovery_continue() + + void + If recovery is paused, continue processing. + + + + pg_recovery_stop() + + void + End recovery and begin normal processing. + + + + pg_recovery_pause_xid() + + void + Continue recovery until specified xid completes, if it is ever + seen, then pause recovery. + + + + + pg_recovery_pause_time() + + void + Continue recovery until a transaction with specified timestamp + completes, if one is ever seen, then pause recovery. + + + + + pg_recovery_pause_cleanup() + + void + Continue recovery until the next cleanup record, then pause. + + + + pg_recovery_pause_advance() + + void + Advance recovery specified number of records then pause. + + + +
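+ 
+ As a simple illustration of the functions above, a superuser on a server
+ that is still performing recovery might check progress and set a trial
+ recovery target. This is only a sketch: the values shown are invented,
+ and the argument forms of the pause functions are assumed from their
+ descriptions in the table.
+ 
+ select pg_is_in_recovery();
+ 
+ select pg_last_completed_xid(), pg_last_completed_xact_timestamp();
+ 
+ -- replay as far as a hypothetical xid, then pause rather than stop
+ select pg_recovery_pause_xid(1234567);
+ 
+ -- once satisfied, resume replay
+ select pg_recovery_continue();
+ 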
+ + + pg_recovery_pause and pg_recovery_continue allow + a superuser to control the progress of recovery on the database server. + While recovery is paused, queries can be executed to determine how far + forward recovery should progress. Recovery can never go backwards + because previous values are overwritten. If the superuser wishes recovery + to complete and normal processing mode to start, execute + pg_recovery_stop. + + + + Variations of the pause function exist, mainly to allow PITR to dynamically + control where it should progress to. pg_recovery_pause_xid and + pg_recovery_pause_time allow the specification of a trial + recovery target, similarly to . + Recovery will then progress to the specified point and then pause, rather + than stopping permanently, allowing assessment of whether this is the + desired stopping point for recovery. + + + + pg_recovery_pause_cleanup allows recovery to progress only + as far as the next cleanup record. This is useful where a long-running + query needs to access the database in a consistent state and it is + more important that the query executes than it is that we keep processing + new WAL records. This can be used as shown: + + select pg_recovery_pause_cleanup(); + + -- run very important query + select + from big_table1 join big_table2 + on ... + where ... + + select pg_recovery_continue(); + + + + + pg_recovery_advance allows recovery to progress record by + record, for very careful analysis or debugging. Step size can be 1 or + more records. If recovery is not yet paused then pg_recovery_advance + will process the specified number of records and then pause. If recovery + is already paused, recovery will continue for another N records before + pausing again. + + + + If you pause recovery while the server is waiting for a WAL file when + operating in standby mode, it will apparently have no effect until the + file arrives. Once the server begins processing WAL records again it + will notice the pause request and will act upon it. This is not a bug. + + + + Pausing recovery will also prevent restartpoints from starting, since they + are triggered by events in the WAL stream. In all other ways processing + will continue; for example, the background writer will continue to clean + shared_buffers while recovery is paused. + + The functions shown in calculate the actual disk space usage of database objects. Index: src/backend/access/heap/heapam.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/heap/heapam.c,v retrieving revision 1.268 diff -c -r1.268 heapam.c *** src/backend/access/heap/heapam.c 31 Oct 2008 19:40:26 -0000 1.268 --- src/backend/access/heap/heapam.c 1 Nov 2008 14:49:38 -0000 *************** *** 3715,3733 **** } /* * Perform XLogInsert for a heap-clean operation. Caller must already * have modified the buffer and marked it dirty. * * Note: prior to Postgres 8.3, the entries in the nowunused[] array were * zero-based tuple indexes. Now they are one-based like other uses * of OffsetNumber. */ XLogRecPtr log_heap_clean(Relation reln, Buffer buffer, OffsetNumber *redirected, int nredirected, OffsetNumber *nowdead, int ndead, OffsetNumber *nowunused, int nunused, ! bool redirect_move) { xl_heap_clean xlrec; uint8 info; --- 3715,3791 ---- } /* + * Update the latestRemovedXid for the current VACUUM. This gets called + * only rarely, since we probably already removed rows earlier. + * see comments for vacuum_log_cleanup_info(). 
+ */ + void + HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple, + TransactionId *latestRemovedXid) + { + TransactionId xmin = HeapTupleHeaderGetXmin(tuple); + TransactionId xmax = HeapTupleHeaderGetXmax(tuple); + TransactionId xvac = HeapTupleHeaderGetXvac(tuple); + + if (tuple->t_infomask & HEAP_MOVED_OFF || + tuple->t_infomask & HEAP_MOVED_IN) + { + if (TransactionIdPrecedes(*latestRemovedXid, xvac)) + *latestRemovedXid = xvac; + } + + if (TransactionIdPrecedes(*latestRemovedXid, xmax)) + *latestRemovedXid = xmax; + + if (TransactionIdPrecedes(*latestRemovedXid, xmin)) + *latestRemovedXid = xmin; + + Assert(TransactionIdIsValid(*latestRemovedXid)); + } + + /* + * Perform XLogInsert to register a heap cleanup info message. These + * are typically called just once per VACUUM and are require because + * of the phasing of removal operations during a lazy VACUUM. + * see comments for vacuum_log_cleanup_info(). + */ + XLogRecPtr + log_heap_cleanup_info(RelFileNode rnode, TransactionId latestRemovedXid) + { + xl_heap_cleanup_info xlrec; + XLogRecPtr recptr; + XLogRecData rdata; + + xlrec.node = rnode; + xlrec.latestRemovedXid = latestRemovedXid; + + rdata.data = (char *) &xlrec; + rdata.len = SizeOfHeapCleanupInfo; + rdata.buffer = InvalidBuffer; + rdata.next = NULL; + + recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_CLEANUP_INFO, &rdata); + + return recptr; + } + + /* * Perform XLogInsert for a heap-clean operation. Caller must already * have modified the buffer and marked it dirty. * * Note: prior to Postgres 8.3, the entries in the nowunused[] array were * zero-based tuple indexes. Now they are one-based like other uses * of OffsetNumber. + * + * For 8.4 we also include the latestRemovedXid which allows recovery + * processing to abort standby queries in conflict with these changes. */ XLogRecPtr log_heap_clean(Relation reln, Buffer buffer, OffsetNumber *redirected, int nredirected, OffsetNumber *nowdead, int ndead, OffsetNumber *nowunused, int nunused, ! TransactionId latestRemovedXid, bool redirect_move) { xl_heap_clean xlrec; uint8 info; *************** *** 3739,3744 **** --- 3797,3803 ---- xlrec.node = reln->rd_node; xlrec.block = BufferGetBlockNumber(buffer); + xlrec.latestRemovedXid = latestRemovedXid; xlrec.nredirected = nredirected; xlrec.ndead = ndead; *************** *** 4028,4034 **** if (record->xl_info & XLR_BKP_BLOCK_1) return; ! buffer = XLogReadBuffer(xlrec->node, xlrec->block, false); if (!BufferIsValid(buffer)) return; page = (Page) BufferGetPage(buffer); --- 4087,4093 ---- if (record->xl_info & XLR_BKP_BLOCK_1) return; ! buffer = XLogReadBufferForCleanup(xlrec->node, xlrec->block, false); if (!BufferIsValid(buffer)) return; page = (Page) BufferGetPage(buffer); *************** *** 4088,4094 **** if (record->xl_info & XLR_BKP_BLOCK_1) return; ! buffer = XLogReadBuffer(xlrec->node, xlrec->block, false); if (!BufferIsValid(buffer)) return; page = (Page) BufferGetPage(buffer); --- 4147,4153 ---- if (record->xl_info & XLR_BKP_BLOCK_1) return; ! buffer = XLogReadBufferForCleanup(xlrec->node, xlrec->block, false); if (!BufferIsValid(buffer)) return; page = (Page) BufferGetPage(buffer); *************** *** 4664,4669 **** --- 4723,4734 ---- case XLOG_HEAP2_CLEAN_MOVE: heap_xlog_clean(lsn, record, true); break; + case XLOG_HEAP2_CLEANUP_INFO: + /* + * Actual operation is a no-op. 
Record type exists + * to provide information to the recovery subsystem + */ + break; default: elog(PANIC, "heap2_redo: unknown op code %u", info); } *************** *** 4793,4809 **** { xl_heap_clean *xlrec = (xl_heap_clean *) rec; ! appendStringInfo(buf, "clean: rel %u/%u/%u; blk %u", xlrec->node.spcNode, xlrec->node.dbNode, ! xlrec->node.relNode, xlrec->block); } else if (info == XLOG_HEAP2_CLEAN_MOVE) { xl_heap_clean *xlrec = (xl_heap_clean *) rec; ! appendStringInfo(buf, "clean_move: rel %u/%u/%u; blk %u", xlrec->node.spcNode, xlrec->node.dbNode, ! xlrec->node.relNode, xlrec->block); } else appendStringInfo(buf, "UNKNOWN"); --- 4858,4883 ---- { xl_heap_clean *xlrec = (xl_heap_clean *) rec; ! appendStringInfo(buf, "clean: rel %u/%u/%u; blk %u remxid %u", xlrec->node.spcNode, xlrec->node.dbNode, ! xlrec->node.relNode, xlrec->block, ! xlrec->latestRemovedXid); } else if (info == XLOG_HEAP2_CLEAN_MOVE) { xl_heap_clean *xlrec = (xl_heap_clean *) rec; ! appendStringInfo(buf, "clean_move: rel %u/%u/%u; blk %u remxid %u", xlrec->node.spcNode, xlrec->node.dbNode, ! xlrec->node.relNode, xlrec->block, ! xlrec->latestRemovedXid); ! } ! else if (info == XLOG_HEAP2_CLEANUP_INFO) ! { ! xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) rec; ! ! appendStringInfo(buf, "cleanup info: remxid %u", ! xlrec->latestRemovedXid); } else appendStringInfo(buf, "UNKNOWN"); Index: src/backend/access/heap/pruneheap.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/heap/pruneheap.c,v retrieving revision 1.16 diff -c -r1.16 pruneheap.c *** src/backend/access/heap/pruneheap.c 13 Jul 2008 20:45:47 -0000 1.16 --- src/backend/access/heap/pruneheap.c 1 Nov 2008 14:49:38 -0000 *************** *** 30,35 **** --- 30,36 ---- typedef struct { TransactionId new_prune_xid; /* new prune hint value for page */ + TransactionId latestRemovedXid; /* latest xid to be removed by this prune */ int nredirected; /* numbers of entries in arrays below */ int ndead; int nunused; *************** *** 85,90 **** --- 86,99 ---- return; /* + * We can't write WAL in recovery mode, so there's no point trying to + * clean the page. The master will likely issue a cleaning WAL record + * soon anyway, so this is no particular loss. + */ + if (IsRecoveryProcessingMode()) + return; + + /* * We prune when a previous UPDATE failed to find enough space on the page * for a new tuple version, or when free space falls below the relation's * fill-factor target (but not less than 10%). *************** *** 176,181 **** --- 185,191 ---- * Also initialize the rest of our working state. */ prstate.new_prune_xid = InvalidTransactionId; + prstate.latestRemovedXid = InvalidTransactionId; prstate.nredirected = prstate.ndead = prstate.nunused = 0; memset(prstate.marked, 0, sizeof(prstate.marked)); *************** *** 258,264 **** prstate.redirected, prstate.nredirected, prstate.nowdead, prstate.ndead, prstate.nowunused, prstate.nunused, ! redirect_move); PageSetLSN(BufferGetPage(buffer), recptr); PageSetTLI(BufferGetPage(buffer), ThisTimeLineID); --- 268,274 ---- prstate.redirected, prstate.nredirected, prstate.nowdead, prstate.ndead, prstate.nowunused, prstate.nunused, ! 
prstate.latestRemovedXid, redirect_move); PageSetLSN(BufferGetPage(buffer), recptr); PageSetTLI(BufferGetPage(buffer), ThisTimeLineID); *************** *** 396,401 **** --- 406,413 ---- == HEAPTUPLE_DEAD && !HeapTupleHeaderIsHotUpdated(htup)) { heap_prune_record_unused(prstate, rootoffnum); + HeapTupleHeaderAdvanceLatestRemovedXid(htup, + &prstate->latestRemovedXid); ndeleted++; } *************** *** 521,527 **** --- 533,543 ---- * find another DEAD tuple is a fairly unusual corner case.) */ if (tupdead) + { latestdead = offnum; + HeapTupleHeaderAdvanceLatestRemovedXid(htup, + &prstate->latestRemovedXid); + } else if (!recent_dead) break; Index: src/backend/access/transam/clog.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/clog.c,v retrieving revision 1.48 diff -c -r1.48 clog.c *** src/backend/access/transam/clog.c 20 Oct 2008 19:18:18 -0000 1.48 --- src/backend/access/transam/clog.c 1 Nov 2008 14:49:38 -0000 *************** *** 459,464 **** --- 459,467 ---- /* * This must be called ONCE during postmaster or standalone-backend startup, * after StartupXLOG has initialized ShmemVariableCache->nextXid. + * + * We access just a single clog page, so this action is atomic and safe + * for use if other processes are active during recovery. */ void StartupCLOG(void) Index: src/backend/access/transam/multixact.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/multixact.c,v retrieving revision 1.28 diff -c -r1.28 multixact.c *** src/backend/access/transam/multixact.c 1 Aug 2008 13:16:08 -0000 1.28 --- src/backend/access/transam/multixact.c 1 Nov 2008 14:49:38 -0000 *************** *** 1413,1420 **** * MultiXactSetNextMXact and/or MultiXactAdvanceNextMXact. Note that we * may already have replayed WAL data into the SLRU files. * ! * We don't need any locks here, really; the SLRU locks are taken ! * only because slru.c expects to be called with locks held. */ void StartupMultiXact(void) --- 1413,1423 ---- * MultiXactSetNextMXact and/or MultiXactAdvanceNextMXact. Note that we * may already have replayed WAL data into the SLRU files. * ! * We want this operation to be atomic to ensure that other processes can ! * use MultiXact while we complete recovery. We access one page only from the ! * offset and members buffers, so once locks are acquired they will not be ! * dropped and re-acquired by SLRU code. So we take both locks at start, then ! * hold them all the way to the end. */ void StartupMultiXact(void) *************** *** 1426,1431 **** --- 1429,1435 ---- /* Clean up offsets state */ LWLockAcquire(MultiXactOffsetControlLock, LW_EXCLUSIVE); + LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE); /* * Initialize our idea of the latest page number. *************** *** 1452,1461 **** MultiXactOffsetCtl->shared->page_dirty[slotno] = true; } - LWLockRelease(MultiXactOffsetControlLock); - /* And the same for members */ - LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE); /* * Initialize our idea of the latest page number. --- 1456,1462 ---- *************** *** 1483,1488 **** --- 1484,1490 ---- } LWLockRelease(MultiXactMemberControlLock); + LWLockRelease(MultiXactOffsetControlLock); /* * Initialize lastTruncationPoint to invalid, ensuring that the first *************** *** 1543,1549 **** * SimpleLruTruncate would get confused. It seems best not to risk * removing any data during recovery anyway, so don't truncate. */ ! 
if (!InRecovery) TruncateMultiXact(); TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_DONE(true); --- 1545,1551 ---- * SimpleLruTruncate would get confused. It seems best not to risk * removing any data during recovery anyway, so don't truncate. */ ! if (!IsRecoveryProcessingMode()) TruncateMultiXact(); TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_DONE(true); Index: src/backend/access/transam/rmgr.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/rmgr.c,v retrieving revision 1.26 diff -c -r1.26 rmgr.c *** src/backend/access/transam/rmgr.c 30 Sep 2008 10:52:11 -0000 1.26 --- src/backend/access/transam/rmgr.c 1 Nov 2008 14:49:38 -0000 *************** *** 20,25 **** --- 20,26 ---- #include "commands/sequence.h" #include "commands/tablespace.h" #include "storage/freespace.h" + #include "storage/sinval.h" #include "storage/smgr.h" *************** *** 32,38 **** {"Tablespace", tblspc_redo, tblspc_desc, NULL, NULL, NULL}, {"MultiXact", multixact_redo, multixact_desc, NULL, NULL, NULL}, {"FreeSpaceMap", fsm_redo, fsm_desc, NULL, NULL, NULL}, ! {"Reserved 8", NULL, NULL, NULL, NULL, NULL}, {"Heap2", heap2_redo, heap2_desc, NULL, NULL, NULL}, {"Heap", heap_redo, heap_desc, NULL, NULL, NULL}, {"Btree", btree_redo, btree_desc, btree_xlog_startup, btree_xlog_cleanup, btree_safe_restartpoint}, --- 33,39 ---- {"Tablespace", tblspc_redo, tblspc_desc, NULL, NULL, NULL}, {"MultiXact", multixact_redo, multixact_desc, NULL, NULL, NULL}, {"FreeSpaceMap", fsm_redo, fsm_desc, NULL, NULL, NULL}, ! {"Relation", relation_redo, relation_desc, NULL, NULL, NULL}, {"Heap2", heap2_redo, heap2_desc, NULL, NULL, NULL}, {"Heap", heap_redo, heap_desc, NULL, NULL, NULL}, {"Btree", btree_redo, btree_desc, btree_xlog_startup, btree_xlog_cleanup, btree_safe_restartpoint}, Index: src/backend/access/transam/slru.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/slru.c,v retrieving revision 1.44 diff -c -r1.44 slru.c *** src/backend/access/transam/slru.c 1 Jan 2008 19:45:48 -0000 1.44 --- src/backend/access/transam/slru.c 1 Nov 2008 14:49:38 -0000 *************** *** 619,624 **** --- 619,632 ---- if (lseek(fd, (off_t) offset, SEEK_SET) < 0) { + if (InRecovery) + { + ereport(LOG, + (errmsg("file \"%s\" doesn't exist, reading as zeroes", + path))); + MemSet(shared->page_buffer[slotno], 0, BLCKSZ); + return true; + } slru_errcause = SLRU_SEEK_FAILED; slru_errno = errno; close(fd); *************** *** 628,633 **** --- 636,649 ---- errno = 0; if (read(fd, shared->page_buffer[slotno], BLCKSZ) != BLCKSZ) { + if (InRecovery) + { + ereport(LOG, + (errmsg("file \"%s\" doesn't exist, reading as zeroes", + path))); + MemSet(shared->page_buffer[slotno], 0, BLCKSZ); + return true; + } slru_errcause = SLRU_READ_FAILED; slru_errno = errno; close(fd); Index: src/backend/access/transam/subtrans.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/subtrans.c,v retrieving revision 1.23 diff -c -r1.23 subtrans.c *** src/backend/access/transam/subtrans.c 1 Aug 2008 13:16:08 -0000 1.23 --- src/backend/access/transam/subtrans.c 1 Nov 2008 14:49:38 -0000 *************** *** 223,257 **** /* * This must be called ONCE during postmaster or standalone-backend startup, * after StartupXLOG has initialized ShmemVariableCache->nextXid. 
- * - * oldestActiveXID is the oldest XID of any prepared transaction, or nextXid - * if there are none. */ void StartupSUBTRANS(TransactionId oldestActiveXID) { ! int startPage; ! int endPage; - /* - * Since we don't expect pg_subtrans to be valid across crashes, we - * initialize the currently-active page(s) to zeroes during startup. - * Whenever we advance into a new page, ExtendSUBTRANS will likewise zero - * the new page without regard to whatever was previously on disk. - */ LWLockAcquire(SubtransControlLock, LW_EXCLUSIVE); ! startPage = TransactionIdToPage(oldestActiveXID); ! endPage = TransactionIdToPage(ShmemVariableCache->nextXid); ! ! while (startPage != endPage) ! { ! (void) ZeroSUBTRANSPage(startPage); ! startPage++; ! } ! (void) ZeroSUBTRANSPage(startPage); LWLockRelease(SubtransControlLock); } /* --- 223,244 ---- /* * This must be called ONCE during postmaster or standalone-backend startup, * after StartupXLOG has initialized ShmemVariableCache->nextXid. */ void StartupSUBTRANS(TransactionId oldestActiveXID) { ! TransactionId xid = ShmemVariableCache->nextXid; ! int pageno = TransactionIdToPage(xid); LWLockAcquire(SubtransControlLock, LW_EXCLUSIVE); ! /* ! * Initialize our idea of the latest page number. ! */ ! SubTransCtl->shared->latest_page_number = pageno; LWLockRelease(SubtransControlLock); + } /* Index: src/backend/access/transam/twophase.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/twophase.c,v retrieving revision 1.46 diff -c -r1.46 twophase.c *** src/backend/access/transam/twophase.c 20 Oct 2008 19:18:18 -0000 1.46 --- src/backend/access/transam/twophase.c 1 Nov 2008 14:49:38 -0000 *************** *** 1708,1713 **** --- 1708,1715 ---- /* Emit the XLOG commit record */ xlrec.xid = xid; xlrec.crec.xact_time = GetCurrentTimestamp(); + xlrec.crec.slotId = MyProc->slotId; + xlrec.crec.xinfo = 0; xlrec.crec.nrels = nrels; xlrec.crec.nsubxacts = nchildren; rdata[0].data = (char *) (&xlrec); *************** *** 1786,1791 **** --- 1788,1795 ---- /* Emit the XLOG abort record */ xlrec.xid = xid; xlrec.arec.xact_time = GetCurrentTimestamp(); + xlrec.arec.slotId = MyProc->slotId; + xlrec.arec.xinfo = 0; xlrec.arec.nrels = nrels; xlrec.arec.nsubxacts = nchildren; rdata[0].data = (char *) (&xlrec); Index: src/backend/access/transam/xact.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/xact.c,v retrieving revision 1.266 diff -c -r1.266 xact.c *** src/backend/access/transam/xact.c 20 Oct 2008 19:18:18 -0000 1.266 --- src/backend/access/transam/xact.c 1 Nov 2008 14:49:38 -0000 *************** *** 38,43 **** --- 38,44 ---- #include "storage/fd.h" #include "storage/lmgr.h" #include "storage/procarray.h" + #include "storage/sinval.h" #include "storage/sinvaladt.h" #include "storage/smgr.h" #include "utils/combocid.h" *************** *** 72,77 **** --- 73,82 ---- */ bool MyXactAccessedTempRel = false; + /* + * Bookkeeping for tracking emulated transactions in Recovery Procs. + */ + static TransactionId latestObservedXid = InvalidTransactionId; /* * transaction states - transaction state from server perspective *************** *** 139,144 **** --- 144,151 ---- Oid prevUser; /* previous CurrentUserId setting */ bool prevSecDefCxt; /* previous SecurityDefinerContext setting */ bool prevXactReadOnly; /* entry-time xact r/o state */ + bool xidMarkedInWAL; /* is this xid present in WAL yet? 
*/ + bool hasUnMarkedSubXids; /* had unmarked subxids */ struct TransactionStateData *parent; /* back link to parent */ } TransactionStateData; *************** *** 167,172 **** --- 174,181 ---- InvalidOid, /* previous CurrentUserId setting */ false, /* previous SecurityDefinerContext setting */ false, /* entry-time xact r/o state */ + false, /* initial state for xidMarkedInWAL */ + false, /* hasUnMarkedSubXids */ NULL /* link to parent state block */ }; *************** *** 235,241 **** /* local function prototypes */ ! static void AssignTransactionId(TransactionState s); static void AbortTransaction(void); static void AtAbort_Memory(void); static void AtCleanup_Memory(void); --- 244,250 ---- /* local function prototypes */ ! static void AssignTransactionId(TransactionState s, int recursion_level); static void AbortTransaction(void); static void AtAbort_Memory(void); static void AtCleanup_Memory(void); *************** *** 329,335 **** GetTopTransactionId(void) { if (!TransactionIdIsValid(TopTransactionStateData.transactionId)) ! AssignTransactionId(&TopTransactionStateData); return TopTransactionStateData.transactionId; } --- 338,344 ---- GetTopTransactionId(void) { if (!TransactionIdIsValid(TopTransactionStateData.transactionId)) ! AssignTransactionId(&TopTransactionStateData, 0); return TopTransactionStateData.transactionId; } *************** *** 359,365 **** TransactionState s = CurrentTransactionState; if (!TransactionIdIsValid(s->transactionId)) ! AssignTransactionId(s); return s->transactionId; } --- 368,374 ---- TransactionState s = CurrentTransactionState; if (!TransactionIdIsValid(s->transactionId)) ! AssignTransactionId(s, 0); return s->transactionId; } *************** *** 376,382 **** return CurrentTransactionState->transactionId; } - /* * AssignTransactionId * --- 385,390 ---- *************** *** 387,397 **** * following its parent's. */ static void ! AssignTransactionId(TransactionState s) { bool isSubXact = (s->parent != NULL); ResourceOwner currentOwner; /* Assert that caller didn't screw up */ Assert(!TransactionIdIsValid(s->transactionId)); Assert(s->state == TRANS_INPROGRESS); --- 395,408 ---- * following its parent's. */ static void ! AssignTransactionId(TransactionState s, int recursion_level) { bool isSubXact = (s->parent != NULL); ResourceOwner currentOwner; + if (IsRecoveryProcessingMode()) + elog(FATAL, "cannot assign TransactionIds during recovery"); + /* Assert that caller didn't screw up */ Assert(!TransactionIdIsValid(s->transactionId)); Assert(s->state == TRANS_INPROGRESS); *************** *** 401,407 **** * than its parent. */ if (isSubXact && !TransactionIdIsValid(s->parent->transactionId)) ! AssignTransactionId(s->parent); /* * Generate a new Xid and record it in PG_PROC and pg_subtrans. --- 412,418 ---- * than its parent. */ if (isSubXact && !TransactionIdIsValid(s->parent->transactionId)) ! AssignTransactionId(s->parent, recursion_level + 1); /* * Generate a new Xid and record it in PG_PROC and pg_subtrans. *************** *** 413,419 **** */ s->transactionId = GetNewTransactionId(isSubXact); ! if (isSubXact) SubTransSetParent(s->transactionId, s->parent->transactionId); /* --- 424,437 ---- */ s->transactionId = GetNewTransactionId(isSubXact); ! /* ! * If we have overflowed the subxid cache then we must mark subtrans ! * with the parent xid. Prior to 8.4 we marked subtrans for each ! * subtransaction, though that is no longer necessary because the ! * way snapshots are searched in XidInMVCCSnapshot() has changed to ! 
* allow searching of both subxid cache and subtrans, not either/or. ! */ ! if (isSubXact && MyProc->subxids.overflowed) SubTransSetParent(s->transactionId, s->parent->transactionId); /* *************** *** 435,442 **** } PG_END_TRY(); CurrentResourceOwner = currentOwner; - } /* * GetCurrentSubTransactionId --- 453,518 ---- } PG_END_TRY(); CurrentResourceOwner = currentOwner; + elog(trace_recovery(DEBUG2), + "AssignXactId xid %d nest %d recursion %d xidMarkedInWAL %s hasParent %s", + s->transactionId, + GetCurrentTransactionNestLevel(), + recursion_level, + s->xidMarkedInWAL ? "t" : "f", + s->parent ? "t" : "f"); + + /* + * WAL log this assignment, if required. + * + * If we have large numbers of connections, we need to log also. + */ + if (recursion_level > 1 || + (recursion_level == 1 && isSubXact) || + (MyProc && MyProc->slotId >= XLOG_MAX_SLOT_ID)) + { + XLogRecData rdata; + xl_xact_assignment xlrec; + + xlrec.xassign = s->transactionId; + xlrec.isSubXact = (s->parent != NULL); + xlrec.slotId = MyProc->slotId; + + if (xlrec.isSubXact) + xlrec.xparent = s->parent->transactionId; + else + xlrec.xparent = InvalidTransactionId; + + START_CRIT_SECTION(); + + rdata.data = (char *) (&xlrec); + rdata.len = sizeof(xl_xact_assignment); + rdata.buffer = InvalidBuffer; + rdata.next = NULL; + + /* + * These WAL records look like no other. We are assigning a + * TransactionId to upper levels of the transaction stack. The + * transaction level we are looking may *not* be the *current* + * transaction. We have not yet assigned the xid for the current + * transaction, so the xid of this WAL record will be + * InvalidTransactionId, even though we are in a transaction. + * Got that? + * + * So we stuff the newly assigned xid into the WAL record and + * let WAL replay sort it out later. + */ + (void) XLogInsert(RM_XACT_ID, XLOG_XACT_ASSIGNMENT, &rdata); + + END_CRIT_SECTION(); + + /* + * Mark this transaction level, so we can avoid issuing WAL records + * for later subtransactions also. + */ + s->xidMarkedInWAL = true; + } + } /* * GetCurrentSubTransactionId *************** *** 822,831 **** --- 898,914 ---- bool haveNonTemp; int nchildren; TransactionId *children; + int nmsgs; + SharedInvalidationMessage *invalidationMessages = NULL; + bool RelcacheInitFileInval; /* Get data needed for commit record */ nrels = smgrGetPendingDeletes(true, &rels, &haveNonTemp); nchildren = xactGetCommittedChildren(&children); + nmsgs = xactGetCommittedInvalidationMessages(&invalidationMessages, + &RelcacheInitFileInval); + if (nmsgs > 0) + Assert(invalidationMessages != NULL); /* * If we haven't been assigned an XID yet, we neither can, nor do we want *************** *** 860,866 **** /* * Begin commit critical section and insert the commit XLOG record. */ ! XLogRecData rdata[3]; int lastrdata = 0; xl_xact_commit xlrec; --- 943,949 ---- /* * Begin commit critical section and insert the commit XLOG record. */ ! XLogRecData rdata[4]; int lastrdata = 0; xl_xact_commit xlrec; *************** *** 884,889 **** --- 967,982 ---- * This makes checkpoint's determination of which xacts are inCommit a * bit fuzzy, but it doesn't matter. 
*/ + xlrec.xinfo = 0; + if (CurrentTransactionState->hasUnMarkedSubXids) + xlrec.xinfo |= XACT_COMPLETION_UNMARKED_SUBXIDS; + if (AtEOXact_Database_FlatFile_Update_Needed()) + xlrec.xinfo |= XACT_COMPLETION_UPDATE_DB_FILE; + if (AtEOXact_Auth_FlatFile_Update_Needed()) + xlrec.xinfo |= XACT_COMPLETION_UPDATE_AUTH_FILE; + if (RelcacheInitFileInval) + xlrec.xinfo |= XACT_COMPLETION_UPDATE_RELCACHE_FILE; + START_CRIT_SECTION(); MyProc->inCommit = true; *************** *** 891,896 **** --- 984,992 ---- xlrec.xact_time = xactStopTimestamp; xlrec.nrels = nrels; xlrec.nsubxacts = nchildren; + xlrec.nmsgs = nmsgs; + xlrec.slotId = MyProc->slotId; + rdata[0].data = (char *) (&xlrec); rdata[0].len = MinSizeOfXactCommit; rdata[0].buffer = InvalidBuffer; *************** *** 912,917 **** --- 1008,1022 ---- rdata[2].buffer = InvalidBuffer; lastrdata = 2; } + /* dump committed invalidation messages */ + if (nmsgs > 0) + { + rdata[lastrdata].next = &(rdata[3]); + rdata[3].data = (char *) invalidationMessages; + rdata[3].len = nmsgs * sizeof(SharedInvalidationMessage); + rdata[3].buffer = InvalidBuffer; + lastrdata = 3; + } rdata[lastrdata].next = NULL; (void) XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT, rdata); *************** *** 1215,1222 **** --- 1320,1331 ---- SetCurrentTransactionStopTimestamp(); xlrec.xact_time = xactStopTimestamp; } + xlrec.slotId = MyProc->slotId; xlrec.nrels = nrels; xlrec.nsubxacts = nchildren; + xlrec.xinfo = 0; + if (CurrentTransactionState->hasUnMarkedSubXids) + xlrec.xinfo |= XACT_COMPLETION_UNMARKED_SUBXIDS; rdata[0].data = (char *) (&xlrec); rdata[0].len = MinSizeOfXactAbort; rdata[0].buffer = InvalidBuffer; *************** *** 1523,1528 **** --- 1632,1639 ---- s->childXids = NULL; s->nChildXids = 0; s->maxChildXids = 0; + s->xidMarkedInWAL = false; + s->hasUnMarkedSubXids = false; GetUserIdAndContext(&s->prevUser, &s->prevSecDefCxt); /* SecurityDefinerContext should never be set outside a transaction */ Assert(!s->prevSecDefCxt); *************** *** 1635,1641 **** * must be done _before_ releasing locks we hold and _after_ * RecordTransactionCommit. */ ! ProcArrayEndTransaction(MyProc, latestXid); /* * This is all post-commit cleanup. Note that if an error is raised here, --- 1746,1752 ---- * must be done _before_ releasing locks we hold and _after_ * RecordTransactionCommit. */ ! ProcArrayEndTransaction(MyProc, latestXid, 0, NULL); /* * This is all post-commit cleanup. Note that if an error is raised here, *************** *** 2047,2053 **** * must be done _before_ releasing locks we hold and _after_ * RecordTransactionAbort. */ ! ProcArrayEndTransaction(MyProc, latestXid); /* * Post-abort cleanup. See notes in CommitTransaction() concerning --- 2158,2164 ---- * must be done _before_ releasing locks we hold and _after_ * RecordTransactionAbort. */ ! ProcArrayEndTransaction(MyProc, latestXid, 0, NULL); /* * Post-abort cleanup. See notes in CommitTransaction() concerning *************** *** 3746,3751 **** --- 3857,3868 ---- /* Must CCI to ensure commands of subtransaction are seen as done */ CommandCounterIncrement(); + /* + * Make sure we keep tracking xids that haven't marked WAL. + */ + if (!s->xidMarkedInWAL || s->hasUnMarkedSubXids) + s->parent->hasUnMarkedSubXids = true; + /* * Prior to 8.4 we marked subcommit in clog at this point. We now only * perform that step, if required, as part of the atomic update of the *************** *** 3865,3870 **** --- 3982,3993 ---- s->state = TRANS_ABORT; /* + * Make sure we keep tracking xids that haven't marked WAL. 
+ */ + if (!s->xidMarkedInWAL || s->hasUnMarkedSubXids) + s->parent->hasUnMarkedSubXids = true; + + /* * Reset user ID which might have been changed transiently. (See notes * in AbortTransaction.) */ *************** *** 4207,4237 **** } /* * XLOG support routines */ static void ! xact_redo_commit(xl_xact_commit *xlrec, TransactionId xid) { TransactionId *sub_xids; TransactionId max_xid; int i; - /* Mark the transaction committed in pg_clog */ - sub_xids = (TransactionId *) &(xlrec->xnodes[xlrec->nrels]); - TransactionIdCommitTree(xid, xlrec->nsubxacts, sub_xids); - /* Make sure nextXid is beyond any XID mentioned in the record */ max_xid = xid; for (i = 0; i < xlrec->nsubxacts; i++) { if (TransactionIdPrecedes(max_xid, sub_xids[i])) max_xid = sub_xids[i]; } if (TransactionIdFollowsOrEquals(max_xid, ShmemVariableCache->nextXid)) { ShmemVariableCache->nextXid = max_xid; TransactionIdAdvance(ShmemVariableCache->nextXid); } --- 4330,4964 ---- } /* + * Fill in additional transaction information for an XLogRecord. + * We do this here so we can inspect various transaction state data, + * plus no need to further clutter XLogInsert(). + */ + void + GetStandbyInfoForTransaction(RmgrId rmid, uint8 info, XLogRecData *rdata, + TransactionId *xid2, uint16 *info2) + { + int level; + int slotId; + + if (!MyProc) + *info2 |= XLR2_INVALID_SLOT_ID; + else + { + slotId = MyProc->slotId; + + if (slotId >= XLOG_MAX_SLOT_ID) + *info2 |= XLR2_INVALID_SLOT_ID; + else + *info2 = ((uint16) slotId) & XLR2_INFO2_MASK; + } + + if (rmid == RM_XACT_ID && info == XLOG_XACT_ASSIGNMENT) + { + xl_xact_assignment *xlrec = (xl_xact_assignment *) rdata->data; + + /* + * We set the flag for records written by AssignTransactionId + * to allow that record type to be handled by + * RecordKnownAssignedTransactionIds(). This looks a little + * strange, but it avoids the need to alter the API of XLogInsert. + */ + if (xlrec->isSubXact) + *info2 |= XLR2_FIRST_SUBXID_RECORD; + else + *info2 |= XLR2_FIRST_XID_RECORD; + } + else + { + TransactionState s = CurrentTransactionState; + + /* + * If we haven't assigned an xid yet, don't flag the record. + * We currently assign xids when we make database entries, so + * things like storage creation and oid assignment does not + * have xids assigned on them. So no need to mark xid2 either. + */ + if (!TransactionIdIsValid(GetCurrentTransactionIdIfAny())) + return; + + level = GetCurrentTransactionNestLevel(); + + if (level >= 1 && !s->xidMarkedInWAL) + { + if (level == 1) + *info2 |= XLR2_FIRST_XID_RECORD; + else + { + *info2 |= XLR2_FIRST_SUBXID_RECORD; + + if (level == 2 && + !CurrentTransactionState->parent->xidMarkedInWAL) + { + *info2 |= XLR2_FIRST_XID_RECORD; + CurrentTransactionState->parent->xidMarkedInWAL = true; + } + } + CurrentTransactionState->xidMarkedInWAL = true; + + /* + * Decide whether we need to mark subtrans or not, for this xid. + * Top-level transaction is level=1, so we need to be careful to + * start at the right subtransaction. 
+ */ + if (level > (PGPROC_MAX_CACHED_SUBXIDS + 1)) + *info2 |= XLR2_MARK_SUBTRANS; + } + + /* + * Set the secondary TransactionId for this record + */ + if (*info2 & XLR2_FIRST_SUBXID_RECORD) + *xid2 = CurrentTransactionState->parent->transactionId; + else if (rmid == RM_HEAP2_ID) + *xid2 = InvalidTransactionId; // XXX: GetLatestRemovedXidIfAny(); + } + + elog(trace_recovery(DEBUG3), "info2 %d xid2 %d", *info2, *xid2); + } + + void + LogCurrentRunningXacts(void) + { + RunningTransactions CurrRunningXacts = GetRunningTransactionData(); + xl_xact_running_xacts xlrec; + XLogRecData rdata[3]; + int lastrdata = 0; + XLogRecPtr recptr; + + xlrec.xcnt = CurrRunningXacts->xcnt; + xlrec.subxcnt = CurrRunningXacts->subxcnt; + xlrec.latestRunningXid = CurrRunningXacts->latestRunningXid; + xlrec.latestCompletedXid = CurrRunningXacts->latestCompletedXid; + + /* Header */ + rdata[0].data = (char *) (&xlrec); + rdata[0].len = MinSizeOfXactRunningXacts; + rdata[0].buffer = InvalidBuffer; + + /* array of RunningXact */ + if (xlrec.xcnt > 0) + { + rdata[0].next = &(rdata[1]); + rdata[1].data = (char *) CurrRunningXacts->xrun; + rdata[1].len = xlrec.xcnt * sizeof(RunningXact); + rdata[1].buffer = InvalidBuffer; + lastrdata = 1; + } + + /* array of RunningXact */ + if (xlrec.subxcnt > 0) + { + rdata[lastrdata].next = &(rdata[2]); + rdata[2].data = (char *) CurrRunningXacts->subxip; + rdata[2].len = xlrec.subxcnt * sizeof(TransactionId); + rdata[2].buffer = InvalidBuffer; + lastrdata = 2; + } + + rdata[lastrdata].next = NULL; + + START_CRIT_SECTION(); + + recptr = XLogInsert(RM_XACT_ID, XLOG_XACT_RUNNING_XACTS, rdata); + + END_CRIT_SECTION(); + + elog(trace_recovery(DEBUG1), "captured snapshot of running xacts %X/%X", recptr.xlogid, recptr.xrecoff); + } + + /* + * Is the data available to allow valid snapshots? + */ + bool + IsRunningXactDataIsValid(void) + { + if (TransactionIdIsValid(latestObservedXid)) + return true; + + return false; + } + + /* + * We need to issue shared invalidations and hold locks. Holding locks + * means others may want to wait on us, so we need to make lock table + * inserts to appear like a transaction. We could create and delete + * lock table entries for each transaction but its simpler just to create + * one permanent entry and leave it there all the time. Locks are then + * acquired and released as needed. Yes, this means you can see the + * Startup process in pg_locks once we have run this. + */ + void + InitRecoveryTransactionEnvironment(void) + { + VirtualTransactionId vxid; + + /* + * Initialise shared invalidation management for Startup process, + * being careful to register ourselves as a sendOnly process so + * we don't need to read messages, nor will we get signalled + * when the queue starts filling up. + */ + SharedInvalBackendInit(true); + + /* + * Lock a virtual transaction id for Startup process. + * + * We need to do GetNextLocalTransactionId() because + * SharedInvalBackendInit() leaves localTransactionid invalid and + * the lock manager doesn't like that at all. + * + * Note that we don't need to run XactLockTableInsert() because nobody + * needs to wait on xids. That sounds a little strange, but table locks + * are held by vxids and row level locks are held by xids. All queries + * hold AccessShareLocks so never block while we write or lock new rows. 
+ */ + vxid.backendId = MyBackendId; + vxid.localTransactionId = GetNextLocalTransactionId(); + VirtualXactLockTableInsert(vxid); + } + + /* + * Called during archive recovery when we already know the WAL record is + * a cleanup record that might remove data that should be visible to + * some currently active snapshot. + * + * * First pull the latestRemovedXid and databaseId out of WAL record. + * * Get all virtual xids whose xmin is earlier than latestRemovedXid + * and who are in the same database + * * Check/Wait until we either give up waiting or vxids end + * * Blow away any backends that we gave up waiting for + */ + void + XactResolveRedoVisibilityConflicts(XLogRecPtr lsn, XLogRecord *record) + { + VirtualTransactionId *old_snapshots; + uint8 info = record->xl_info & ~XLR_INFO_MASK; + Oid recDatabaseOid = 0; + TransactionId latestRemovedXid = 0; + + if (info == XLOG_HEAP2_CLEAN || info == XLOG_HEAP2_CLEAN_MOVE ) + { + xl_heap_clean *xlrec = (xl_heap_clean *) XLogRecGetData(record); + + latestRemovedXid = xlrec->latestRemovedXid; + recDatabaseOid = xlrec->node.dbNode; + } + else if (info == XLOG_HEAP2_CLEANUP_INFO) + { + xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) XLogRecGetData(record); + + latestRemovedXid = xlrec->latestRemovedXid; + recDatabaseOid = xlrec->node.dbNode; + } + else + elog(FATAL, "unrecognised cleanup record"); + + old_snapshots = GetCurrentVirtualXIDs(latestRemovedXid, + recDatabaseOid, + 0 /* no need to exclude vacuum */); + + ResolveRecoveryConflictWithVirtualXIDs(old_snapshots, + "cleanup redo"); + } + + #define XACT_IS_TOP_XACT false + #define XACT_IS_SUBXACT true + /* + * During recovery we maintain the ProcArray with incoming xids + * when we first observe them in use. Uses local variables, so + * should only be called by the Startup process. + * + * We record all xids that we know have been assigned. That includes + * all the xids on the WAL record, plus all unobserved xids that + * we can deduce have been assigned. We can deduce the existence of + * unobserved xids because we know xids are in sequence, with no gaps. + * + * XXX Be careful of what happens when we use pg_resetxlog. + */ + void + RecordKnownAssignedTransactionIds(XLogRecPtr lsn, XLogRecord *record) + { + uint8 info = record->xl_info & ~XLR_INFO_MASK; + TransactionId xid, + parent_xid; + int slotId; + PGPROC *proc; + TransactionId next_expected_xid = latestObservedXid; + + if (!TransactionIdIsValid(latestObservedXid)) + return; + + TransactionIdAdvance(next_expected_xid); + + /* + * If it's an assignment record, we need to extract data from + * the body of the record, rather than take the header values. This + * is because an assignment record can be issued when + * GetCurrentTransactionIdIfAny() returns InvalidTransactionId. + * We also use the supplied slotId rather than the header value, + * so we can cope with backends above XLOG_MAX_SLOT_ID. + */ + if (record->xl_rmid == RM_XACT_ID && info == XLOG_XACT_ASSIGNMENT) + { + xl_xact_assignment *xlrec = (xl_xact_assignment *) XLogRecGetData(record); + + xid = xlrec->xassign; + parent_xid = xlrec->xparent; + slotId = xlrec->slotId; + } + else + { + xid = record->xl_xid; + parent_xid = record->xl_xid2; + slotId = XLogRecGetSlotId(record); + } + + elog(trace_recovery(DEBUG4), "RecordKnown xid %d parent %d slot %d" + " latestObsvXid %d firstXid %s firstSubXid %s markSubtrans %s", + xid, parent_xid, slotId, latestObservedXid, + XLogRecIsFirstXidRecord(record) ? "t" : "f", + XLogRecIsFirstSubXidRecord(record) ? 
"t" : "f", + XLogRecMustMarkSubtrans(record) ? "t" : "f"); + + if (XLogRecIsFirstSubXidRecord(record)) + Assert(TransactionIdIsValid(parent_xid) && TransactionIdPrecedes(parent_xid, xid)); + else + Assert(!TransactionIdIsValid(parent_xid)); + + /* + * Identify the recovery proc that holds replay info for this xid + */ + proc = SlotIdGetRecoveryProc(slotId); + + LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); + + /* + * Record the newly observed xid onto the correct proc. + */ + if (XLogRecIsFirstXidRecord(record)) + { + if (XLogRecIsFirstSubXidRecord(record)) + { + /* + * If both flags are set, then we are seeing both the + * subtransaction xid and its top-level parent xid + * for the first time. So start the top-level transaction + * first, then add the subtransaction. + * + * Note that we don't need locks in all cases here + * because it is normal to start each of these atomically, + * in sequence. + */ + ProcArrayStartRecoveryTransaction(proc, parent_xid, lsn, XACT_IS_TOP_XACT); + ProcArrayStartRecoveryTransaction(proc, xid, lsn, XACT_IS_SUBXACT); + } + else + { + /* + * First observation of top-level xid only. + */ + ProcArrayStartRecoveryTransaction(proc, xid, lsn, XACT_IS_TOP_XACT); + } + } + else if (XLogRecIsFirstSubXidRecord(record)) + { + /* + * First observation of subtransaction xid. + */ + ProcArrayStartRecoveryTransaction(proc, xid, lsn, XACT_IS_SUBXACT); + } + + /* + * When a newly observed xid arrives, it is frequently the case + * that it is *not* the next xid in sequence. When this occurs, we + * must treat the intervening xids as running also. So we maintain + * a special list of these UnobservedXids, so that snapshots can + * see what's happening. + * + * We maintain both recovery Procs *and* UnobservedXids because we + * need them both. Recovery procs allow us to store top-level xids + * and subtransactions separately, otherwise we wouldn't know + * when to overflow the subxid cache. UnobservedXids allow us to + * make sense of the out-of-order arrival of xids. 
+ * + * Some examples: + * 1) latestObservedXid = 647 + * next xid observed in WAL = 651 (a top-level transaction) + * so we add 648, 649, 650 to UnobservedXids + * + * 2) latestObservedXid = 769 + * next xid observed in WAL = 771 (a subtransaction) + * so we add 770 to UnobservedXids + * + * 3) latestObservedXid = 769 + * next xid observed in WAL = 810 (a subtransaction) + * 810's parent had not yet recorded WAL = 807 + * so we add 770 thru 809 inclusive to UnobservedXids + * then remove 807 + * + * 4) latestObservedXid = 769 + * next xid observed in WAL = 771 (a subtransaction) + * 771's parent had not yet recorded WAL = 770 + * so do nothing + * + * 5) latestObservedXid = 7747 + * next xid observed in WAL = 7748 (a subtransaction) + * 7748's parent had not yet recorded WAL = 7742 + * so we add 7748 and removed 7742 + */ + if (!XLogRecIsFirstXidRecord(record) || !XLogRecIsFirstSubXidRecord(record)) + { + /* + * Just have one xid to process, so fairly simple + */ + if (next_expected_xid == xid) + { + Assert(!XidInUnobservedTransactions(xid)); + Assert(!XLogRecIsFirstSubXidRecord(record) || + !XidInUnobservedTransactions(parent_xid)); + latestObservedXid = xid; + } + else if (TransactionIdPrecedes(next_expected_xid, xid)) + { + UnobservedTransactionsAddXids(next_expected_xid, xid); + latestObservedXid = xid; + } + else + UnobservedTransactionsRemoveXid(xid, true); + } + else + { + TransactionId next_plus_one_xid = next_expected_xid; + TransactionIdAdvance(next_plus_one_xid); + + /* + * Just remember when reading this logic that by definition we have + * Assert(TransactionIdPrecedes(parent_xid, xid)) + */ + if (next_expected_xid == parent_xid && next_plus_one_xid == xid) + { + Assert(!XidInUnobservedTransactions(xid)); + Assert(!XidInUnobservedTransactions(parent_xid)); + latestObservedXid = xid; + } + else if (next_expected_xid == xid) + { + latestObservedXid = xid; + UnobservedTransactionsRemoveXid(parent_xid, true); + } + else if (TransactionIdFollowsOrEquals(xid, next_plus_one_xid)) + { + UnobservedTransactionsAddXids(next_expected_xid, xid); + latestObservedXid = xid; + UnobservedTransactionsRemoveXid(parent_xid, true); + } + else if (TransactionIdPrecedes(xid, next_expected_xid)) + { + UnobservedTransactionsRemoveXid(xid, true); + UnobservedTransactionsRemoveXid(parent_xid, true); + } + else + elog(FATAL, "there are more combinations than you thought about"); + } + + LWLockRelease(ProcArrayLock); + + /* + * Now we've upated the proc we can update subtrans, if appropriate. + * We must do this step last to avoid race conditions. See comments + * and code for AssignTransactionId(). + */ + if (XLogRecMustMarkSubtrans(record)) + { + Assert(XLogRecIsFirstSubXidRecord(record)); + elog(trace_recovery(DEBUG2), + "subtrans setting parent %d for xid %d", parent_xid, xid); + SubTransSetParent(xid, parent_xid); + } + } + + /* * XLOG support routines */ + /* + * Before 8.4 this was a fairly short function, but now it performs many + * actions for which the order of execution is critical. + */ static void ! xact_redo_commit(xl_xact_commit *xlrec, TransactionId xid, bool preparedXact) { + PGPROC *proc = NULL; TransactionId *sub_xids; TransactionId max_xid; int i; /* Make sure nextXid is beyond any XID mentioned in the record */ max_xid = xid; + sub_xids = (TransactionId *) &(xlrec->xnodes[xlrec->nrels]); + + /* + * Find the highest xid and remove unobserved xids if required. 
+ */ for (i = 0; i < xlrec->nsubxacts; i++) { if (TransactionIdPrecedes(max_xid, sub_xids[i])) max_xid = sub_xids[i]; } + + if (InArchiveRecovery) + { + /* + * If we've just observed some new xids on the commit record + * make sure they're visible before we update clog. + */ + if (XactCompletionHasUnMarkedSubxids(xlrec)) + { + if (!TransactionIdIsValid(latestObservedXid)) + latestObservedXid = xid; + + if (TransactionIdPrecedes(latestObservedXid, max_xid)) + { + TransactionId next_expected_xid = latestObservedXid; + + TransactionIdAdvance(next_expected_xid); + if (TransactionIdPrecedes(next_expected_xid, max_xid)) + { + LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); + UnobservedTransactionsAddXids(next_expected_xid, max_xid); + LWLockRelease(ProcArrayLock); + } + latestObservedXid = max_xid; + } + } + + /* + * Even though there is a slotId on the xlrec header we use the slotId + * from the nody of the xlrec, to allow for cases where MaxBackends + * is larger than can fit in the xlrec header. + */ + proc = SlotIdGetRecoveryProc(xlrec->slotId); + + if (!preparedXact) + { + /* + * Double check everything to make sure there's no mistakes + * before we update the proc array. This test can be true in + * a number of normal running situations, it could also be + * a bug, which we test for last. + */ + if (xid != proc->xid) + { + /* + * If proc->xid is invalid that can be normal if this is + * the first time we've seen the xid. + */ + if (TransactionIdIsValid(proc->xid)) + { + /* + * There was a pre-existing xid in the slot. This can + * occur because of FATAL errors that don't write + * abort records. So it is an "implied abort". + * We need to remove any locks held by the failed + * transaction. We don't do that here, since we + * will be dropping all locks on this slot very soon + * anyway. + * + * If the correct xid exists in a different Recovery + * Proc then we have a bug related to slot usage. + */ + if (XidInRecoveryProcs(xid) && !preparedXact) + { + ProcArrayDisplay(LOG); + elog(FATAL, "abort accessed the wrong slot " + "xid %d slot %d proc->xid %d prep %s", + xid, xlrec->slotId, proc->xid, + (preparedXact ? "t" : "f")); + } + else + elog(trace_recovery(DEBUG3), + "implied abort of transaction id %d", + proc->xid); + } + } + } + + /* + * If requested, update the flat files for DB and Auth Files. + * These acquire AccessExclusiveLocks which will be released soon + * after we mark the commit in clog. These *must* be the last + * locks we take before updating clog to prevent deadlocks, and + * we also want to keep the window between this action and marking + * the commit as small as possible. + * + * XXXR does this handle relcache correctly, probably not. + */ + if (XactCompletionUpdateDBFile(xlrec)) + { + if (XactCompletionUpdateAuthFile(xlrec)) + BuildFlatFiles(false, true, false); + else + BuildFlatFiles(true, true, false); + } + } + + /* Mark the transaction committed in pg_clog */ + TransactionIdCommitTree(xid, xlrec->nsubxacts, sub_xids); + + if (InArchiveRecovery) + { + /* + * We must mark clog before we update the ProcArray. Only update + * if we have already initialised the state and we have previously + * added an xid to the proc. We need no lock to check xid since it + * is controlled by Startup process. It's possible for xids to + * appear that haven't been seen before. We don't need to check + * UnobservedXids because in the normal case this will already have + * happened, but there are cases where they might sneak through. 
+ * Leave these for the periodic cleanup by XACT_RUNNING_XACT records. + */ + if (TransactionIdIsValid(latestObservedXid) && + TransactionIdIsValid(proc->xid) && !preparedXact) + { + if (XactCompletionHasUnMarkedSubxids(xlrec)) + ProcArrayEndTransaction(proc, max_xid, xlrec->nsubxacts, sub_xids); + else + ProcArrayEndTransaction(proc, max_xid, 0, NULL); + } + + /* + * Send any cache invalidations attached to the commit. We must + * maintain the same order of invalidation then release locks + * as occurs in RecordTransactionCommit. + */ + if (xlrec->nmsgs > 0) + { + int offset = MinSizeOfXactCommit + + (xlrec->nsubxacts * sizeof(TransactionId)) + + (xlrec->nrels * sizeof(RelFileFork)); + SharedInvalidationMessage *msgs = (SharedInvalidationMessage *) + (((char *) xlrec) + offset); + + SendSharedInvalidMessages(msgs, xlrec->nmsgs); + } + + /* + * Release locks, if any. + */ + RelationReleaseRecoveryLocks(xlrec->slotId); + } + + /* Make sure nextXid is beyond any XID mentioned in the record */ if (TransactionIdFollowsOrEquals(max_xid, ShmemVariableCache->nextXid)) { ShmemVariableCache->nextXid = max_xid; + ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid; TransactionIdAdvance(ShmemVariableCache->nextXid); } *************** *** 4248,4275 **** } } static void ! xact_redo_abort(xl_xact_abort *xlrec, TransactionId xid) { TransactionId *sub_xids; TransactionId max_xid; int i; - /* Mark the transaction aborted in pg_clog */ - sub_xids = (TransactionId *) &(xlrec->xnodes[xlrec->nrels]); - TransactionIdAbortTree(xid, xlrec->nsubxacts, sub_xids); - /* Make sure nextXid is beyond any XID mentioned in the record */ max_xid = xid; for (i = 0; i < xlrec->nsubxacts; i++) { if (TransactionIdPrecedes(max_xid, sub_xids[i])) max_xid = sub_xids[i]; } if (TransactionIdFollowsOrEquals(max_xid, ShmemVariableCache->nextXid)) { ShmemVariableCache->nextXid = max_xid; TransactionIdAdvance(ShmemVariableCache->nextXid); } --- 4975,5119 ---- } } + /* + * Be careful with the order of execution, as with xact_redo_commit(). + * The two functions are similar but differ in key places. + */ static void ! xact_redo_abort(xl_xact_abort *xlrec, TransactionId xid, bool preparedXact) { + PGPROC *proc = NULL; TransactionId *sub_xids; TransactionId max_xid; int i; /* Make sure nextXid is beyond any XID mentioned in the record */ max_xid = xid; + sub_xids = (TransactionId *) &(xlrec->xnodes[xlrec->nrels]); + + /* + * Find the highest xid and remove unobserved xids if required. + */ for (i = 0; i < xlrec->nsubxacts; i++) { if (TransactionIdPrecedes(max_xid, sub_xids[i])) max_xid = sub_xids[i]; } + + if (InArchiveRecovery) + { + /* + * If we've just observed some new xids on the commit record + * make sure they're visible before we update clog. + */ + if (XactCompletionHasUnMarkedSubxids(xlrec)) + { + if (!TransactionIdIsValid(latestObservedXid)) + latestObservedXid = xid; + + if (TransactionIdPrecedes(latestObservedXid, max_xid)) + { + TransactionId next_expected_xid = latestObservedXid; + + TransactionIdAdvance(next_expected_xid); + if (TransactionIdPrecedes(next_expected_xid, max_xid)) + { + LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); + UnobservedTransactionsAddXids(next_expected_xid, max_xid); + LWLockRelease(ProcArrayLock); + } + latestObservedXid = max_xid; + } + } + + /* + * Even though there is a slotId on the xlrec header we use the slotId + * from the nody of the xlrec, to allow for cases where MaxBackends + * is larger than can fit in the xlrec header. 
+ */ + proc = SlotIdGetRecoveryProc(xlrec->slotId); + + if (!preparedXact) + { + /* + * Double check everything to make sure there's no mistakes + * before we update the proc array. This test can be true in + * a number of normal running situations, it could also be + * a bug, which we test for last. + */ + if (xid != proc->xid) + { + /* + * If proc->xid is invalid that can be normal if this is + * the first time we've seen the xid. + */ + if (TransactionIdIsValid(proc->xid)) + { + /* + * There was a pre-existing xid in the slot. This can + * occur because of FATAL errors that don't write + * abort records. So it is an "implied abort". + * We need to remove any locks held by the failed + * transaction. We don't do that here, since we + * will be dropping all locks on this slot very soon + * anyway. + * + * If the correct xid exists in a different Recovery + * Proc then we have a bug related to slot usage. + */ + if (XidInRecoveryProcs(xid) && !preparedXact) + { + ProcArrayDisplay(LOG); + elog(FATAL, "abort accessed the wrong slot " + "xid %d slot %d proc->xid %d prep %s", + xid, xlrec->slotId, proc->xid, + (preparedXact ? "t" : "f")); + } + else + elog(trace_recovery(DEBUG3), + "implied abort of transaction id %d", + proc->xid); + } + } + } + } + + /* Mark the transaction aborted in pg_clog */ + TransactionIdAbortTree(xid, xlrec->nsubxacts, sub_xids); + + if (InArchiveRecovery) + { + /* + * We must mark clog before we update the ProcArray. Only update + * if we have already initialised the state and we have previously + * added an xid to the proc. We need no lock to check xid since it + * is controlled by Startup process. It's possible for xids to + * appear that haven't been seen before. We don't need to check + * UnobservedXids because in the normal case this will already have + * happened, but there are cases where they might sneak through. + * Leave these for the periodic cleanup by XACT_RUNNING_XACT records. + */ + if (TransactionIdIsValid(latestObservedXid) && + TransactionIdIsValid(proc->xid) && !preparedXact) + { + if (XactCompletionHasUnMarkedSubxids(xlrec)) + ProcArrayEndTransaction(proc, max_xid, xlrec->nsubxacts, sub_xids); + else + ProcArrayEndTransaction(proc, max_xid, 0, NULL); + } + + /* + * Release locks, if any. There are no invalidations to send. + */ + RelationReleaseRecoveryLocks(xlrec->slotId); + } + + /* Make sure nextXid is beyond any XID mentioned in the record */ if (TransactionIdFollowsOrEquals(max_xid, ShmemVariableCache->nextXid)) { ShmemVariableCache->nextXid = max_xid; + ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid; TransactionIdAdvance(ShmemVariableCache->nextXid); } *************** *** 4295,4307 **** { xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(record); ! xact_redo_commit(xlrec, record->xl_xid); } else if (info == XLOG_XACT_ABORT) { xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(record); ! xact_redo_abort(xlrec, record->xl_xid); } else if (info == XLOG_XACT_PREPARE) { --- 5139,5155 ---- { xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(record); ! xact_redo_commit(xlrec, record->xl_xid, false); } else if (info == XLOG_XACT_ABORT) { xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(record); ! Assert(!XactCompletionUpdateDBFile(xlrec) && ! !XactCompletionUpdateAuthFile(xlrec) && ! !XactCompletionRelcacheInitFileInval(xlrec)); ! ! 
xact_redo_abort(xlrec, record->xl_xid, false); } else if (info == XLOG_XACT_PREPARE) { *************** *** 4313,4328 **** { xl_xact_commit_prepared *xlrec = (xl_xact_commit_prepared *) XLogRecGetData(record); ! xact_redo_commit(&xlrec->crec, xlrec->xid); RemoveTwoPhaseFile(xlrec->xid, false); } else if (info == XLOG_XACT_ABORT_PREPARED) { xl_xact_abort_prepared *xlrec = (xl_xact_abort_prepared *) XLogRecGetData(record); ! xact_redo_abort(&xlrec->arec, xlrec->xid); RemoveTwoPhaseFile(xlrec->xid, false); } else elog(PANIC, "xact_redo: unknown op code %u", info); } --- 5161,5204 ---- { xl_xact_commit_prepared *xlrec = (xl_xact_commit_prepared *) XLogRecGetData(record); ! xact_redo_commit(&xlrec->crec, xlrec->xid, true); RemoveTwoPhaseFile(xlrec->xid, false); } else if (info == XLOG_XACT_ABORT_PREPARED) { xl_xact_abort_prepared *xlrec = (xl_xact_abort_prepared *) XLogRecGetData(record); ! xact_redo_abort(&xlrec->arec, xlrec->xid, true); RemoveTwoPhaseFile(xlrec->xid, false); } + else if (info == XLOG_XACT_ASSIGNMENT) + { + /* + * This is a no-op since RecordKnownAssignedTransactionIds() + * already did all the work on this record for us. + */ + return; + } + else if (info == XLOG_XACT_RUNNING_XACTS) + { + xl_xact_running_xacts *xlrec = (xl_xact_running_xacts *) XLogRecGetData(record); + + /* + * Initialise if we have a valid snapshot to work with + */ + if (TransactionIdIsValid(xlrec->latestRunningXid) && + (!TransactionIdIsValid(latestObservedXid) || + TransactionIdPrecedes(latestObservedXid, xlrec->latestRunningXid))) + { + latestObservedXid = xlrec->latestRunningXid; + ShmemVariableCache->latestCompletedXid = xlrec->latestCompletedXid; + elog(trace_recovery(DEBUG1), + "initial snapshot created; latestObservedXid = %d latestCompletedXid = %d", + latestObservedXid, xlrec->latestCompletedXid); + } + + ProcArrayUpdateRecoveryTransactions(lsn, xlrec); + } else elog(PANIC, "xact_redo: unknown op code %u", info); } *************** *** 4335,4341 **** appendStringInfoString(buf, timestamptz_to_str(xlrec->xact_time)); if (xlrec->nrels > 0) { ! appendStringInfo(buf, "; rels:"); for (i = 0; i < xlrec->nrels; i++) { RelFileNode rnode = xlrec->xnodes[i].rnode; --- 5211,5217 ---- appendStringInfoString(buf, timestamptz_to_str(xlrec->xact_time)); if (xlrec->nrels > 0) { ! appendStringInfo(buf, "; %d rels:", xlrec->nrels); for (i = 0; i < xlrec->nrels; i++) { RelFileNode rnode = xlrec->xnodes[i].rnode; *************** *** 4349,4360 **** if (xlrec->nsubxacts > 0) { TransactionId *xacts = (TransactionId *) ! &xlrec->xnodes[xlrec->nrels]; ! ! appendStringInfo(buf, "; subxacts:"); for (i = 0; i < xlrec->nsubxacts; i++) appendStringInfo(buf, " %u", xacts[i]); } } static void --- 5225,5256 ---- if (xlrec->nsubxacts > 0) { TransactionId *xacts = (TransactionId *) ! &xlrec->xnodes[xlrec->nrels]; ! appendStringInfo(buf, "; %d subxacts:", xlrec->nsubxacts); for (i = 0; i < xlrec->nsubxacts; i++) appendStringInfo(buf, " %u", xacts[i]); } + if (xlrec->nmsgs > 0) + { + /* yeh, really... 
*/
+ int offset = MinSizeOfXactCommit +
+ (xlrec->nsubxacts * sizeof(TransactionId)) +
+ (xlrec->nrels * sizeof(RelFileFork));
+ SharedInvalidationMessage *msgs = (SharedInvalidationMessage *)
+ (((char *) xlrec) + offset);
+ appendStringInfo(buf, "; %d inval msgs:", xlrec->nmsgs);
+ for (i = 0; i < xlrec->nmsgs; i++)
+ {
+ SharedInvalidationMessage *msg = msgs + i;
+
+ if (msg->id >= 0)
+ appendStringInfo(buf, "catcache id%d ", msg->id);
+ else if (msg->id == SHAREDINVALRELCACHE_ID)
+ appendStringInfo(buf, "relcache ");
+ else if (msg->id == SHAREDINVALSMGR_ID)
+ appendStringInfo(buf, "smgr ");
+ }
+ }
 }

 static void
***************
*** 4387,4392 ****
--- 5283,5326 ----
 }
 }

+ static void
+ xact_desc_running_xacts(StringInfo buf, xl_xact_running_xacts *xlrec)
+ {
+ int xid_index,
+ subxid_index;
+ TransactionId *subxip = (TransactionId *) &(xlrec->xrun[xlrec->xcnt]);
+
+ appendStringInfo(buf, "nxids %u nsubxids %u latestRunningXid %d",
+ xlrec->xcnt,
+ xlrec->subxcnt,
+ xlrec->latestRunningXid);
+
+ for (xid_index = 0; xid_index < xlrec->xcnt; xid_index++)
+ {
+ RunningXact *rxact = (RunningXact *) xlrec->xrun;
+
+ appendStringInfo(buf, "; xid %d pid %d backend %d db %d role %d "
+ "vacflag %u nsubxids %u offset %d overflowed %s",
+ rxact[xid_index].xid,
+ rxact[xid_index].pid,
+ rxact[xid_index].slotId,
+ rxact[xid_index].databaseId,
+ rxact[xid_index].roleId,
+ rxact[xid_index].vacuumFlags,
+ rxact[xid_index].nsubxids,
+ rxact[xid_index].subx_offset,
+ (rxact[xid_index].overflowed ? "t" : "f"));
+
+ if (rxact[xid_index].nsubxids > 0)
+ {
+ appendStringInfo(buf, "; subxacts: ");
+ for (subxid_index = 0; subxid_index < rxact[xid_index].nsubxids; subxid_index++)
+ appendStringInfo(buf, " %u",
+ subxip[subxid_index + rxact[xid_index].subx_offset]);
+ }
+ }
+ }
+
 void
 xact_desc(StringInfo buf, uint8 xl_info, char *rec)
 {
***************
*** 4424,4429 ****
--- 5358,5393 ----
 appendStringInfo(buf, "abort %u: ", xlrec->xid);
 xact_desc_abort(buf, &xlrec->arec);
 }
+ else if (info == XLOG_XACT_ASSIGNMENT)
+ {
+ xl_xact_assignment *xlrec = (xl_xact_assignment *) rec;
+
+ /* ignore the main xid, it may be Invalid and misleading */
+ appendStringInfo(buf, "assignment: xid %u slotid %d",
+ xlrec->xassign, xlrec->slotId);
+ }
+ else if (info == XLOG_XACT_RUNNING_XACTS)
+ {
+ xl_xact_running_xacts *xlrec = (xl_xact_running_xacts *) rec;
+
+ appendStringInfo(buf, "running xacts: ");
+ xact_desc_running_xacts(buf, xlrec);
+ }
 else
 appendStringInfo(buf, "UNKNOWN");
 }
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.321
diff -c -r1.321 xlog.c
*** src/backend/access/transam/xlog.c 31 Oct 2008 15:04:59 -0000 1.321
--- src/backend/access/transam/xlog.c 1 Nov 2008 15:36:19 -0000
***************
*** 43,54 ****
--- 43,56 ----
 #include "storage/ipc.h"
 #include "storage/pmsignal.h"
 #include "storage/procarray.h"
+ #include "storage/sinval.h"
 #include "storage/smgr.h"
 #include "storage/spin.h"
 #include
"utils/builtins.h" #include "utils/guc.h" #include "utils/ps_status.h" + #define WAL_DEBUG /* File path names (all relative to $PGDATA) */ #define BACKUP_LABEL_FILE "backup_label" *************** *** 68,74 **** int sync_method = DEFAULT_SYNC_METHOD; #ifdef WAL_DEBUG ! bool XLOG_DEBUG = false; #endif /* --- 70,78 ---- int sync_method = DEFAULT_SYNC_METHOD; #ifdef WAL_DEBUG ! bool XLOG_DEBUG_FLUSH = false; ! bool XLOG_DEBUG_BGFLUSH = false; ! bool XLOG_DEBUG_REDO = true; #endif /* *************** *** 113,119 **** /* * ThisTimeLineID will be same in all backends --- it identifies current ! * WAL timeline for the database system. */ TimeLineID ThisTimeLineID = 0; --- 117,124 ---- /* * ThisTimeLineID will be same in all backends --- it identifies current ! * WAL timeline for the database system. Zero is always a bug, so we ! * start with that to allow us to spot any errors. */ TimeLineID ThisTimeLineID = 0; *************** *** 121,146 **** bool InRecovery = false; /* Are we recovering using offline XLOG archives? */ ! static bool InArchiveRecovery = false; /* Was the last xlog file restored from archive, or local? */ static bool restoredFromArchive = false; /* options taken from recovery.conf */ static char *recoveryRestoreCommand = NULL; ! static bool recoveryTarget = false; static bool recoveryTargetExact = false; static bool recoveryTargetInclusive = true; static bool recoveryLogRestartpoints = false; static TransactionId recoveryTargetXid; static TimestampTz recoveryTargetTime; static TimestampTz recoveryLastXTime = 0; /* if recoveryStopsHere returns true, it saves actual stop xid/time here */ static TransactionId recoveryStopXid; static TimestampTz recoveryStopTime; static bool recoveryStopAfter; /* * During normal operation, the only timeline we care about is ThisTimeLineID. * During recovery, however, things are more complicated. To simplify life --- 126,175 ---- bool InRecovery = false; /* Are we recovering using offline XLOG archives? */ ! bool InArchiveRecovery = false; ! ! /* Local copy of shared RecoveryProcessingMode state */ ! static bool LocalRecoveryProcessingMode = true; ! static bool knownProcessingMode = false; /* Was the last xlog file restored from archive, or local? */ static bool restoredFromArchive = false; + /* recovery target modes */ + #define RECOVERY_TARGET_NONE 0 + #define RECOVERY_TARGET_PAUSE_ALL 1 + #define RECOVERY_TARGET_PAUSE_CLEANUP 2 + #define RECOVERY_TARGET_PAUSE_XID 3 + #define RECOVERY_TARGET_PAUSE_TIME 4 + #define RECOVERY_TARGET_ADVANCE 5 + #define RECOVERY_TARGET_STOP_IMMEDIATE 6 + #define RECOVERY_TARGET_STOP_XID 7 + #define RECOVERY_TARGET_STOP_TIME 8 + /* options taken from recovery.conf */ static char *recoveryRestoreCommand = NULL; ! static int recoveryTargetMode = RECOVERY_TARGET_NONE; static bool recoveryTargetExact = false; static bool recoveryTargetInclusive = true; static bool recoveryLogRestartpoints = false; static TransactionId recoveryTargetXid; static TimestampTz recoveryTargetTime; + static int recoveryTargetAdvance = 0; + + #define DEFAULT_MAX_STANDBY_DELAY 300 + int maxStandbyDelay = DEFAULT_MAX_STANDBY_DELAY; + static TimestampTz recoveryLastXTime = 0; + static TransactionId recoveryLastXid = InvalidTransactionId; /* if recoveryStopsHere returns true, it saves actual stop xid/time here */ static TransactionId recoveryStopXid; static TimestampTz recoveryStopTime; static bool recoveryStopAfter; + /* is the database proven consistent yet? 
*/
+ bool reachedSafeStartPoint = false;
+
 /*
 * During normal operation, the only timeline we care about is ThisTimeLineID.
 * During recovery, however, things are more complicated. To simplify life
***************
*** 240,249 ****
 * ControlFileLock: must be held to read/update control file or create
 * new log file.
 *
! * CheckpointLock: must be held to do a checkpoint (ensures only one
! * checkpointer at a time; currently, with all checkpoints done by the
! * bgwriter, this is just pro forma).
! *
 *----------
 */
--- 269,298 ----
 * ControlFileLock: must be held to read/update control file or create
 * new log file.
 *
! * CheckpointLock: must be held to do a checkpoint or restartpoint, ensuring
! * we get just one of those at any time. In 8.4+ recovery, both startup and
! * bgwriter processes may take restartpoints, so this locking must be strict
! * to ensure there are no mistakes.
! *
! * In 8.4 we progress through a number of states at startup. Initially, the
! * postmaster is in PM_STARTUP state and spawns the Startup process. We then
! * progress until the database is in a consistent state, then if we are in
! * InArchiveRecovery we go into PM_RECOVERY state. The bgwriter then starts
! * up and takes over responsibility for performing restartpoints. We then
! * progress until the end of recovery when we enter PM_RUN state upon
! * termination of the Startup process. In summary:
! *
! * PM_STARTUP state: Startup process performs restartpoints
! * PM_RECOVERY state: bgwriter process performs restartpoints
! * PM_RUN state: bgwriter process performs checkpoints
! *
! * These transitions are fairly delicate, with many things that need to
! * happen at the same time in order to change state successfully throughout
! * the system. Changing PM_STARTUP to PM_RECOVERY only occurs when we can
! * prove the database is in a consistent state. Changing from PM_RECOVERY
! * to PM_RUN happens whenever recovery ends, which can be forced upon us
! * externally or can occur because of damage to, or termination of, the WAL
! * sequence.
 *----------
 */
***************
*** 285,295 ****
--- 334,351 ----
 /*
 * Total shared-memory state for XLOG.
+ *
+ * This small structure is accessed by many backends, so we take care to
+ * pad out the parts of the structure so they can be accessed by separate
+ * CPUs without causing false-sharing cache flushes. Padding is generous
+ * to allow for a wide variety of CPU architectures.
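+ *
+ * As a rough sketch of the idiom (illustrative field names only, not
+ * the actual layout below), each independently-locked group of fields
+ * is followed by a pad array that rounds the group up to the spacing
+ * value, so the group sits in its own cache lines:
+ *
+ *     slock_t lck_a;
+ *     int     counter_a;
+ *     char    pad_a[XLOGCTL_BUFFER_SPACING - sizeof(slock_t) - sizeof(int)];
+ *     slock_t lck_b;
+ *     int     counter_b;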
*/ + #define XLOGCTL_BUFFER_SPACING 128 typedef struct XLogCtlData { /* Protected by WALInsertLock: */ XLogCtlInsert Insert; + char InsertPadding[XLOGCTL_BUFFER_SPACING - sizeof(XLogCtlInsert)]; /* Protected by info_lck: */ XLogwrtRqst LogwrtRqst; *************** *** 297,305 **** --- 353,368 ---- uint32 ckptXidEpoch; /* nextXID & epoch of latest checkpoint */ TransactionId ckptXid; XLogRecPtr asyncCommitLSN; /* LSN of newest async commit */ + /* add data structure padding for above info_lck declarations */ + char InfoPadding[XLOGCTL_BUFFER_SPACING - sizeof(XLogwrtRqst) + - sizeof(XLogwrtResult) + - sizeof(uint32) + - sizeof(TransactionId) + - sizeof(XLogRecPtr)]; /* Protected by WALWriteLock: */ XLogCtlWrite Write; + char WritePadding[XLOGCTL_BUFFER_SPACING - sizeof(XLogCtlWrite)]; /* * These values do not change after startup, although the pointed-to pages *************** *** 311,316 **** --- 374,410 ---- int XLogCacheBlck; /* highest allocated xlog buffer index */ TimeLineID ThisTimeLineID; + /* + * IsRecoveryProcessingMode shows whether the postmaster is in a + * postmaster state earlier than PM_RUN, or not. This is a globally + * accessible state to allow EXEC_BACKEND case. + * + * We also retain a local state variable InRecovery. InRecovery=true + * means the code is being executed by Startup process and therefore + * always during Recovery Processing Mode. This allows us to identify + * code executed *during* Recovery Processing Mode but not necessarily + * by Startup process itself. + * + * Protected by mode_lck + */ + bool SharedRecoveryProcessingMode; + slock_t mode_lck; + + /* + * recovery target control information + * + * Protected by info_lck + */ + int recoveryTargetMode; + TransactionId recoveryTargetXid; + TimestampTz recoveryTargetTime; + int recoveryTargetAdvance; + + TimestampTz recoveryLastXTime; + TransactionId recoveryLastXid; + + char InfoLockPadding[XLOGCTL_BUFFER_SPACING]; + slock_t info_lck; /* locks shared variables shown above */ } XLogCtlData; *************** *** 397,404 **** --- 491,500 ---- static void readRecoveryCommandFile(void); static void exitArchiveRecovery(TimeLineID endTLI, uint32 endLogId, uint32 endLogSeg); + static void exitRecovery(void); static bool recoveryStopsHere(XLogRecord *record, bool *includeThis); static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags); + static XLogRecPtr GetRedoLocationForCheckpoint(void); static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites, XLogRecPtr *lsn, BkpBlock *bkpb); *************** *** 473,478 **** --- 569,576 ---- XLogRecData dtbuf_rdt1[XLR_MAX_BKP_BLOCKS]; XLogRecData dtbuf_rdt2[XLR_MAX_BKP_BLOCKS]; XLogRecData dtbuf_rdt3[XLR_MAX_BKP_BLOCKS]; + TransactionId xl_xid2 = InvalidTransactionId; + uint16 xl_info2 = 0; pg_crc32 rdata_crc; uint32 len, write_len; *************** *** 480,485 **** --- 578,591 ---- bool updrqst; bool doPageWrites; bool isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH); + bool isRecoveryEnd = (rmid == RM_XLOG_ID && + (info == XLOG_RECOVERY_END || + info == XLOG_CHECKPOINT_ONLINE)); + + /* cross-check on whether we should be here or not */ + if (IsRecoveryProcessingMode() && !isRecoveryEnd) + elog(FATAL, "cannot make new WAL entries during recovery " + "(RMgrId = %d info = %d)", rmid, info); /* info's high bits are reserved for use by me */ if (info & XLR_INFO_MASK) *************** *** 628,633 **** --- 734,744 ---- if (len == 0 && !isLogSwitch) elog(PANIC, "invalid xlog record length %u", len); + /* + * Get standby information before we do lock and critical 
section. + */ + GetStandbyInfoForTransaction(rmid, info, rdata, &xl_xid2, &xl_info2); + START_CRIT_SECTION(); /* Now wait to get insert lock */ *************** *** 816,821 **** --- 927,934 ---- record->xl_len = len; /* doesn't include backup blocks */ record->xl_info = info; record->xl_rmid = rmid; + record->xl_xid2 = xl_xid2; + record->xl_info2 = xl_info2; /* Now we can finish computing the record's CRC */ COMP_CRC32(rdata_crc, (char *) record + sizeof(pg_crc32), *************** *** 823,847 **** FIN_CRC32(rdata_crc); record->xl_crc = rdata_crc; - #ifdef WAL_DEBUG - if (XLOG_DEBUG) - { - StringInfoData buf; - - initStringInfo(&buf); - appendStringInfo(&buf, "INSERT @ %X/%X: ", - RecPtr.xlogid, RecPtr.xrecoff); - xlog_outrec(&buf, record); - if (rdata->data != NULL) - { - appendStringInfo(&buf, " - "); - RmgrTable[record->xl_rmid].rm_desc(&buf, record->xl_info, rdata->data); - } - elog(LOG, "%s", buf.data); - pfree(buf.data); - } - #endif - /* Record begin of record in appropriate places */ ProcLastRecPtr = RecPtr; Insert->PrevRecord = RecPtr; --- 936,941 ---- *************** *** 1720,1727 **** XLogRecPtr WriteRqstPtr; XLogwrtRqst WriteRqst; ! /* Disabled during REDO */ ! if (InRedo) return; /* Quick exit if already known flushed */ --- 1814,1820 ---- XLogRecPtr WriteRqstPtr; XLogwrtRqst WriteRqst; ! if (IsRecoveryProcessingMode()) return; /* Quick exit if already known flushed */ *************** *** 1729,1735 **** return; #ifdef WAL_DEBUG ! if (XLOG_DEBUG) elog(LOG, "xlog flush request %X/%X; write %X/%X; flush %X/%X", record.xlogid, record.xrecoff, LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff, --- 1822,1828 ---- return; #ifdef WAL_DEBUG ! if (XLOG_DEBUG_FLUSH) elog(LOG, "xlog flush request %X/%X; write %X/%X; flush %X/%X", record.xlogid, record.xrecoff, LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff, *************** *** 1809,1817 **** * the bad page is encountered again during recovery then we would be * unable to restart the database at all! (This scenario has actually * happened in the field several times with 7.1 releases. Note that we ! * cannot get here while InRedo is true, but if the bad page is brought in ! * and marked dirty during recovery then CreateCheckPoint will try to ! * flush it at the end of recovery.) * * The current approach is to ERROR under normal conditions, but only * WARNING during recovery, so that the system can be brought up even if --- 1902,1910 ---- * the bad page is encountered again during recovery then we would be * unable to restart the database at all! (This scenario has actually * happened in the field several times with 7.1 releases. Note that we ! * cannot get here while IsRecoveryProcessingMode(), but if the bad page is ! * brought in and marked dirty during recovery then if a checkpoint were ! * performed at the end of recovery it will try to flush it. * * The current approach is to ERROR under normal conditions, but only * WARNING during recovery, so that the system can be brought up even if *************** *** 1821,1827 **** * and so we will not force a restart for a bad LSN on a data page. */ if (XLByteLT(LogwrtResult.Flush, record)) ! elog(InRecovery ? WARNING : ERROR, "xlog flush request %X/%X is not satisfied --- flushed only to %X/%X", record.xlogid, record.xrecoff, LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff); --- 1914,1920 ---- * and so we will not force a restart for a bad LSN on a data page. */ if (XLByteLT(LogwrtResult.Flush, record)) ! 
elog(ERROR, "xlog flush request %X/%X is not satisfied --- flushed only to %X/%X",
 record.xlogid, record.xrecoff,
 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
***************
*** 1879,1885 ****
 return;

 #ifdef WAL_DEBUG
! if (XLOG_DEBUG)
 elog(LOG, "xlog bg flush request %X/%X; write %X/%X; flush %X/%X",
 WriteRqstPtr.xlogid, WriteRqstPtr.xrecoff,
 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
--- 1972,1978 ----
 return;

 #ifdef WAL_DEBUG
! if (XLOG_DEBUG_BGFLUSH)
 elog(LOG, "xlog bg flush request %X/%X; write %X/%X; flush %X/%X",
 WriteRqstPtr.xlogid, WriteRqstPtr.xrecoff,
 LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
***************
*** 2094,2100 ****
 unlink(tmppath);
 }

! elog(DEBUG2, "done creating and filling new WAL file");

 /* Set flag to tell caller there was no existent file */
 *use_existent = false;
--- 2187,2194 ----
 unlink(tmppath);
 }

! XLogFileName(tmppath, ThisTimeLineID, log, seg);
! elog(DEBUG2, "done creating and filling new WAL file %s", tmppath);

 /* Set flag to tell caller there was no existent file */
 *use_existent = false;
***************
*** 2400,2405 ****
--- 2494,2521 ----
 xlogfname);
 set_ps_display(activitymsg, false);

+ /*
+ * Calculate and write out a new safeStartPoint. This defines
+ * the latest LSN that might appear on-disk while we apply
+ * the WAL records in this file. If we crash during recovery
+ * we must reach this point again before we can prove
+ * database consistency. Not a restartpoint! Restart points
+ * define where we should start recovery from, if we crash.
+ */
+ if (InArchiveRecovery)
+ {
+ uint32 nextLog = log;
+ uint32 nextSeg = seg;
+
+ NextLogSeg(nextLog, nextSeg);
+
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ ControlFile->minSafeStartPoint.xlogid = nextLog;
+ ControlFile->minSafeStartPoint.xrecoff = nextSeg * XLogSegSize;
+ UpdateControlFile();
+ LWLockRelease(ControlFileLock);
+ }
+
 return fd;
 }

 if (errno != ENOENT) /* unexpected failure? */
***************
*** 2866,2871 ****
--- 2982,3007 ----
 }

 /*
+ * RecordIsCleanupRecord() determines whether or not the record
+ * will remove rows from data blocks. This is important because
+ * applying these records could affect the validity of MVCC snapshots,
+ * so there are various controls over replaying such records.
+ */
+ static bool
+ RecordIsCleanupRecord(XLogRecord *record)
+ {
+ RmgrId rmid = record->xl_rmid;
+ /* uint8 info = record->xl_info & ~XLR_INFO_MASK; */
+
+ /*
+ * if ((rmid == RM_HEAP2_ID) ||
+ *     (rmid == RM_BTREE_ID && btree_needs_cleanup_lock(info)))
+ */
+ if (rmid == RM_HEAP2_ID)
+ return true;
+
+ return false;
+ }
+
+ /*
 * Restore the backup blocks present in an XLOG record, if any.
 *
 * We assume all of the record has been read into memory at *record.
***************
*** 2887,2892 ****
--- 3023,3037 ----
 BkpBlock bkpb;
 char *blk;
 int i;
+ int lockmode;
+
+ /*
+ * What kind of lock do we need to apply the backup blocks?
+ */
+ if (RecordIsCleanupRecord(record))
+ lockmode = BUFFER_LOCK_CLEANUP;
+ else
+ lockmode = BUFFER_LOCK_EXCLUSIVE;

 blk = (char *) XLogRecGetData(record) + record->xl_len;
 for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
***************
*** 2898,2904 ****
 blk += sizeof(BkpBlock);

 buffer = XLogReadBufferExtended(bkpb.node, bkpb.fork, bkpb.block,
! RBM_ZERO);
 Assert(BufferIsValid(buffer));
 page = (Page) BufferGetPage(buffer);
--- 3043,3049 ----
 blk += sizeof(BkpBlock);

 buffer = XLogReadBufferExtended(bkpb.node, bkpb.fork, bkpb.block,
!
RBM_ZERO, lockmode); Assert(BufferIsValid(buffer)); page = (Page) BufferGetPage(buffer); *************** *** 4228,4233 **** --- 4373,4379 ---- XLogCtl->XLogCacheBlck = XLOGbuffers - 1; XLogCtl->Insert.currpage = (XLogPageHeader) (XLogCtl->pages); SpinLockInit(&XLogCtl->info_lck); + SpinLockInit(&XLogCtl->mode_lck); /* * If we are not in bootstrap mode, pg_control should already exist. Read *************** *** 4311,4316 **** --- 4457,4463 ---- record->xl_prev.xlogid = 0; record->xl_prev.xrecoff = 0; record->xl_xid = InvalidTransactionId; + record->xl_xid2 = InvalidTransactionId; record->xl_tot_len = SizeOfXLogRecord + sizeof(checkPoint); record->xl_len = sizeof(checkPoint); record->xl_info = XLOG_CHECKPOINT_SHUTDOWN; *************** *** 4494,4500 **** ereport(LOG, (errmsg("recovery_target_xid = %u", recoveryTargetXid))); ! recoveryTarget = true; recoveryTargetExact = true; } else if (strcmp(tok1, "recovery_target_time") == 0) --- 4641,4647 ---- ereport(LOG, (errmsg("recovery_target_xid = %u", recoveryTargetXid))); ! recoveryTargetMode = RECOVERY_TARGET_STOP_XID; recoveryTargetExact = true; } else if (strcmp(tok1, "recovery_target_time") == 0) *************** *** 4505,4511 **** */ if (recoveryTargetExact) continue; ! recoveryTarget = true; recoveryTargetExact = false; /* --- 4652,4658 ---- */ if (recoveryTargetExact) continue; ! recoveryTargetMode = RECOVERY_TARGET_STOP_TIME; recoveryTargetExact = false; /* *************** *** 4544,4549 **** --- 4691,4716 ---- ereport(LOG, (errmsg("log_restartpoints = %s", tok2))); } + else if (strcmp(tok1, "max_standby_delay") == 0) + { + errno = 0; + maxStandbyDelay = (TransactionId) strtoul(tok2, NULL, 0); + if (errno == EINVAL || errno == ERANGE) + ereport(FATAL, + (errmsg("max_standby_delay is not a valid number: \"%s\"", + tok2))); + /* + * 2E6 seconds is about 23 days. Allows us to measure delay in + * milliseconds. + */ + if (maxStandbyDelay > INT_MAX || maxStandbyDelay < 0) + ereport(FATAL, + (errmsg("max_standby_delay must be between 0 (wait forever) and 2 000 000 secs"))); + + ereport(LOG, + (errmsg("max_standby_delay = %u", + maxStandbyDelay))); + } else ereport(FATAL, (errmsg("unrecognized recovery parameter \"%s\"", *************** *** 4678,4700 **** unlink(recoveryPath); /* ignore any error */ /* ! * Rename the config file out of the way, so that we don't accidentally ! * re-enter archive recovery mode in a subsequent crash. */ - unlink(RECOVERY_COMMAND_DONE); - if (rename(RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE) != 0) - ereport(FATAL, - (errcode_for_file_access(), - errmsg("could not rename file \"%s\" to \"%s\": %m", - RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE))); ereport(LOG, (errmsg("archive recovery complete"))); } /* ! * For point-in-time recovery, this function decides whether we want to ! * stop applying the XLOG at or after the current record. * * Returns TRUE if we are stopping, FALSE otherwise. On TRUE return, * *includeThis is set TRUE if we should apply this record before stopping. --- 4845,4901 ---- unlink(recoveryPath); /* ignore any error */ /* ! * As of 8.4 we no longer rename the recovery.conf file out of the ! * way until after we have performed a full checkpoint. This ensures ! * that any crash between now and the end of the checkpoint does not ! * attempt to restart from a WAL file that is no longer available to us. ! * As soon as we remove recovery.conf we lose our recovery_command and ! * cannot reaccess WAL files from the archive. 
*/ ereport(LOG, (errmsg("archive recovery complete"))); } + #ifdef DEBUG_RECOVERY_CONTROL + static void + LogRecoveryTargetModeInfo(void) + { + int lrecoveryTargetMode; + TransactionId lrecoveryTargetXid; + TimestampTz lrecoveryTargetTime; + int lrecoveryTargetAdvance; + + TimestampTz lrecoveryLastXTime; + TransactionId lrecoveryLastXid; + + { + /* use volatile pointer to prevent code rearrangement */ + volatile XLogCtlData *xlogctl = XLogCtl; + + SpinLockAcquire(&xlogctl->info_lck); + + lrecoveryTargetMode = xlogctl->recoveryTargetMode; + lrecoveryTargetXid = xlogctl->recoveryTargetXid; + lrecoveryTargetTime = xlogctl->recoveryTargetTime; + lrecoveryTargetAdvance = xlogctl->recoveryTargetAdvance; + lrecoveryLastXTime = xlogctl->recoveryLastXTime; + lrecoveryLastXid = xlogctl->recoveryLastXid; + + SpinLockRelease(&xlogctl->info_lck); + } + + elog(LOG, "mode %d xid %u time %s adv %d", + lrecoveryTargetMode, + lrecoveryTargetXid, + timestamptz_to_str(lrecoveryTargetTime), + lrecoveryTargetAdvance); + } + #endif + /* ! * For archive recovery, this function decides whether we want to ! * pause or stop applying the XLOG at or after the current record. * * Returns TRUE if we are stopping, FALSE otherwise. On TRUE return, * *includeThis is set TRUE if we should apply this record before stopping. *************** *** 4708,4778 **** recoveryStopsHere(XLogRecord *record, bool *includeThis) { bool stopsHere; ! uint8 record_info; TimestampTz recordXtime; ! /* We only consider stopping at COMMIT or ABORT records */ ! if (record->xl_rmid != RM_XACT_ID) ! return false; ! record_info = record->xl_info & ~XLR_INFO_MASK; ! if (record_info == XLOG_XACT_COMMIT) { ! xl_xact_commit *recordXactCommitData; ! recordXactCommitData = (xl_xact_commit *) XLogRecGetData(record); ! recordXtime = recordXactCommitData->xact_time; ! } ! else if (record_info == XLOG_XACT_ABORT) ! { ! xl_xact_abort *recordXactAbortData; ! recordXactAbortData = (xl_xact_abort *) XLogRecGetData(record); ! recordXtime = recordXactAbortData->xact_time; ! } ! else ! return false; ! /* Do we have a PITR target at all? */ ! if (!recoveryTarget) ! { ! recoveryLastXTime = recordXtime; ! return false; } ! if (recoveryTargetExact) { /* ! * there can be only one transaction end record with this exact ! * transactionid ! * ! * when testing for an xid, we MUST test for equality only, since ! * transactions are numbered in the order they start, not the order ! * they complete. A higher numbered xid will complete before you about ! * 50% of the time... ! */ ! stopsHere = (record->xl_xid == recoveryTargetXid); ! if (stopsHere) ! *includeThis = recoveryTargetInclusive; ! } ! else ! { /* ! * there can be many transactions that share the same commit time, so ! * we stop after the last one, if we are inclusive, or stop at the ! * first one if we are exclusive */ ! if (recoveryTargetInclusive) ! stopsHere = (recordXtime > recoveryTargetTime); ! else ! stopsHere = (recordXtime >= recoveryTargetTime); ! if (stopsHere) ! *includeThis = false; } if (stopsHere) { recoveryStopXid = record->xl_xid; ! recoveryStopTime = recordXtime; recoveryStopAfter = *includeThis; if (record_info == XLOG_XACT_COMMIT) --- 4909,5150 ---- recoveryStopsHere(XLogRecord *record, bool *includeThis) { bool stopsHere; ! bool pauseHere = false; ! bool paused = false; ! uint8 record_info = 0; /* valid iff (is_xact_completion_record) */ TimestampTz recordXtime; ! bool is_xact_completion_record = false; ! /* We only consider stopping at COMMIT or ABORT records */ ! 
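!
! /*
! * Note that the loop below re-reads the shared recoveryTargetMode on
! * each iteration, so pause, continue and stop requests issued through
! * the pg_recovery_* functions take effect between records and while
! * we are paused.
! */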
if (record->xl_rmid == RM_XACT_ID) { ! record_info = record->xl_info & ~XLR_INFO_MASK; ! if (record_info == XLOG_XACT_COMMIT) ! { ! xl_xact_commit *recordXactCommitData; ! recordXactCommitData = (xl_xact_commit *) XLogRecGetData(record); ! recordXtime = recordXactCommitData->xact_time; ! is_xact_completion_record = true; ! } ! else if (record_info == XLOG_XACT_ABORT) ! { ! xl_xact_abort *recordXactAbortData; ! recordXactAbortData = (xl_xact_abort *) XLogRecGetData(record); ! recordXtime = recordXactAbortData->xact_time; ! is_xact_completion_record = true; ! } ! /* Remember the most recent COMMIT/ABORT time for logging purposes */ ! if (is_xact_completion_record) ! { ! recoveryLastXTime = recordXtime; ! recoveryLastXid = record->xl_xid; ! } } ! do { + int prevRecoveryTargetMode = recoveryTargetMode; + /* ! * Let's see if user has updated our recoveryTargetMode. ! */ ! { ! /* use volatile pointer to prevent code rearrangement */ ! volatile XLogCtlData *xlogctl = XLogCtl; ! ! SpinLockAcquire(&xlogctl->info_lck); ! recoveryTargetMode = xlogctl->recoveryTargetMode; ! if (recoveryTargetMode != RECOVERY_TARGET_NONE) ! { ! recoveryTargetXid = xlogctl->recoveryTargetXid; ! recoveryTargetTime = xlogctl->recoveryTargetTime; ! recoveryTargetAdvance = xlogctl->recoveryTargetAdvance; ! } ! if (is_xact_completion_record) ! { ! xlogctl->recoveryLastXTime = recordXtime; ! xlogctl->recoveryLastXid = record->xl_xid; ! } ! SpinLockRelease(&xlogctl->info_lck); ! } ! ! /* Decide how to act on any pause target */ ! switch (recoveryTargetMode) ! { ! case RECOVERY_TARGET_NONE: ! /* ! * If we aren't paused and we're not looking to stop, ! * just exit out quickly and get on with recovery. ! */ ! if (paused) ! ereport(LOG, ! (errmsg("recovery restarting"))); ! return false; ! ! case RECOVERY_TARGET_PAUSE_ALL: ! pauseHere = true; ! break; ! ! case RECOVERY_TARGET_ADVANCE: ! if (paused) ! { ! if (recoveryTargetAdvance > 0) ! return false; ! } ! else if (recoveryTargetAdvance-- <= 0) ! pauseHere = true; ! break; ! ! case RECOVERY_TARGET_STOP_IMMEDIATE: ! case RECOVERY_TARGET_STOP_XID: ! case RECOVERY_TARGET_STOP_TIME: ! paused = false; ! break; ! ! /* ! * If we're paused, and mode has changed reset to allow new settings ! * to apply and maybe allow us to continue. ! */ ! if (paused && prevRecoveryTargetMode != recoveryTargetMode) ! paused = false; ! ! case RECOVERY_TARGET_PAUSE_CLEANUP: ! /* ! * Advance until we see a cleanup record. ! */ ! if (RecordIsCleanupRecord(record)) ! pauseHere = true; ! break; ! ! case RECOVERY_TARGET_PAUSE_XID: ! /* ! * there can be only one transaction end record with this exact ! * transactionid ! * ! * when testing for an xid, we MUST test for equality only, since ! * transactions are numbered in the order they start, not the order ! * they complete. A higher numbered xid will complete before you about ! * 50% of the time... ! */ ! if (is_xact_completion_record) ! pauseHere = (record->xl_xid == recoveryTargetXid); ! break; ! ! case RECOVERY_TARGET_PAUSE_TIME: ! /* ! * there can be many transactions that share the same commit time, so ! * we pause after the last one, if we are inclusive, or pause at the ! * first one if we are exclusive ! */ ! if (is_xact_completion_record) ! { ! if (recoveryTargetInclusive) ! pauseHere = (recoveryLastXTime > recoveryTargetTime); ! else ! pauseHere = (recoveryLastXTime >= recoveryTargetTime); ! } ! break; ! ! default: ! ereport(WARNING, ! (errmsg("unknown recovery mode %d, continuing recovery", ! recoveryTargetMode))); ! return false; ! } ! ! 
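!
! /*
! * If we have just reached a pause target, report it once and mark
! * ourselves paused; the tail of the loop then sleeps briefly and
! * re-reads the shared target mode until it is changed or cleared.
! */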
if (pauseHere && !paused) ! { ! if (is_xact_completion_record) ! { ! if (record_info == XLOG_XACT_COMMIT) ! ereport(LOG, ! (errmsg("recovery pausing before commit of transaction %u, time %s", ! record->xl_xid, ! timestamptz_to_str(recoveryLastXTime)))); ! else ! ereport(LOG, ! (errmsg("recovery pausing before abort of transaction %u, time %s", ! record->xl_xid, ! timestamptz_to_str(recoveryLastXTime)))); ! } ! else ! ereport(LOG, ! (errmsg("recovery pausing; last completed transaction %u, time %s", ! recoveryLastXid, ! timestamptz_to_str(recoveryLastXTime)))); ! ! set_ps_display("recovery paused", false); ! ! paused = true; ! } ! /* ! * Pause for a while before rechecking mode at top of loop. */ ! if (paused) ! pg_usleep(200000L); ! ! /* ! * We leave the loop at the bottom only if our recovery mode is ! * set (or has been recently reset) to one of the stop options. ! */ ! } while (paused); ! ! /* ! * Decide how to act if stop target mode set. We run this separately from ! * pause to allow user to reset their stop target while paused. ! */ ! switch (recoveryTargetMode) ! { ! case RECOVERY_TARGET_STOP_IMMEDIATE: ! ereport(LOG, ! (errmsg("recovery stopping immediately"))); ! return true; ! ! case RECOVERY_TARGET_STOP_XID: ! /* ! * there can be only one transaction end record with this exact ! * transactionid ! * ! * when testing for an xid, we MUST test for equality only, since ! * transactions are numbered in the order they start, not the order ! * they complete. A higher numbered xid will complete before you about ! * 50% of the time... ! */ ! if (is_xact_completion_record) ! { ! stopsHere = (record->xl_xid == recoveryTargetXid); ! if (stopsHere) ! *includeThis = recoveryTargetInclusive; ! } ! break; ! ! case RECOVERY_TARGET_STOP_TIME: ! /* ! * there can be many transactions that share the same commit time, so ! * we stop after the last one, if we are inclusive, or stop at the ! * first one if we are exclusive ! */ ! if (is_xact_completion_record) ! { ! if (recoveryTargetInclusive) ! stopsHere = (recoveryLastXTime > recoveryTargetTime); ! else ! stopsHere = (recoveryLastXTime >= recoveryTargetTime); ! if (stopsHere) ! *includeThis = false; ! } ! break; } if (stopsHere) { + Assert(is_xact_completion_record); recoveryStopXid = record->xl_xid; ! recoveryStopTime = recoveryLastXTime; recoveryStopAfter = *includeThis; if (record_info == XLOG_XACT_COMMIT) *************** *** 4801,4890 **** recoveryStopXid, timestamptz_to_str(recoveryStopTime)))); } - - if (recoveryStopAfter) - recoveryLastXTime = recordXtime; } - else - recoveryLastXTime = recordXtime; return stopsHere; } /* ! * This must be called ONCE during postmaster or standalone-backend startup */ ! void ! StartupXLOG(void) { ! XLogCtlInsert *Insert; ! CheckPoint checkPoint; ! bool wasShutdown; ! bool reachedStopPoint = false; ! bool haveBackupLabel = false; ! XLogRecPtr RecPtr, ! LastRec, ! checkPointLoc, ! minRecoveryLoc, ! EndOfLog; ! uint32 endLogId; ! uint32 endLogSeg; ! XLogRecord *record; ! uint32 freespace; ! TransactionId oldestActiveXID; ! /* ! * Read control file and check XLOG status looks valid. ! * ! * Note: in most control paths, *ControlFile is already valid and we need ! * not do ReadControlFile() here, but might as well do it to be sure. ! */ ! ReadControlFile(); ! if (ControlFile->state < DB_SHUTDOWNED || ! ControlFile->state > DB_IN_PRODUCTION || ! !XRecOffIsValid(ControlFile->checkPoint.xrecoff)) ! ereport(FATAL, ! (errmsg("control file contains invalid data"))); ! if (ControlFile->state == DB_SHUTDOWNED) ! 
ereport(LOG, ! (errmsg("database system was shut down at %s", ! str_time(ControlFile->time)))); ! else if (ControlFile->state == DB_SHUTDOWNING) ! ereport(LOG, ! (errmsg("database system shutdown was interrupted; last known up at %s", ! str_time(ControlFile->time)))); ! else if (ControlFile->state == DB_IN_CRASH_RECOVERY) ! ereport(LOG, ! (errmsg("database system was interrupted while in recovery at %s", ! str_time(ControlFile->time)), ! errhint("This probably means that some data is corrupted and" ! " you will have to use the last backup for recovery."))); ! else if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY) ! ereport(LOG, ! (errmsg("database system was interrupted while in recovery at log time %s", ! str_time(ControlFile->checkPointCopy.time)), ! errhint("If this has occurred more than once some data might be corrupted" ! " and you might need to choose an earlier recovery target."))); ! else if (ControlFile->state == DB_IN_PRODUCTION) ! ereport(LOG, ! (errmsg("database system was interrupted; last known up at %s", ! str_time(ControlFile->time)))); ! /* This is just to allow attaching to startup process with a debugger */ ! #ifdef XLOG_REPLAY_DELAY ! if (ControlFile->state != DB_SHUTDOWNED) ! pg_usleep(60000000L); ! #endif ! /* ! * Initialize on the assumption we want to recover to the same timeline ! * that's active according to pg_control. ! */ ! recoveryTargetTLI = ControlFile->checkPointCopy.ThisTimeLineID; ! /* * Check for recovery control file, and if so set up state for offline * recovery */ --- 5173,5449 ---- recoveryStopXid, timestamptz_to_str(recoveryStopTime)))); } } return stopsHere; } /* ! * Utility function used by various user functions to set the recovery ! * target mode. This allows user control over the progress of recovery. */ ! static void ! SetRecoveryTargetMode(int mode, TransactionId xid, TimestampTz ts, int advance) { ! if (!superuser()) ! ereport(ERROR, ! (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), ! errmsg("must be superuser to control recovery"))); ! if (!IsRecoveryProcessingMode()) ! ereport(ERROR, ! (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), ! errmsg("recovery is not in progress"), ! errhint("WAL control functions can only be executed during recovery."))); ! { ! /* use volatile pointer to prevent code rearrangement */ ! volatile XLogCtlData *xlogctl = XLogCtl; ! SpinLockAcquire(&xlogctl->info_lck); ! xlogctl->recoveryTargetMode = mode; ! if (mode == RECOVERY_TARGET_STOP_XID || ! mode == RECOVERY_TARGET_PAUSE_XID) ! xlogctl->recoveryTargetXid = xid; ! else if (mode == RECOVERY_TARGET_STOP_TIME || ! mode == RECOVERY_TARGET_PAUSE_TIME) ! xlogctl->recoveryTargetTime = ts; ! else if (mode == RECOVERY_TARGET_ADVANCE) ! xlogctl->recoveryTargetAdvance = advance; ! SpinLockRelease(&xlogctl->info_lck); ! } ! return; ! } ! ! /* ! * Forces recovery mode to reset to unfrozen. ! * Returns void. ! */ ! Datum ! pg_recovery_continue(PG_FUNCTION_ARGS) ! { ! SetRecoveryTargetMode(RECOVERY_TARGET_NONE, InvalidTransactionId, 0, 0); ! ! PG_RETURN_VOID(); ! } ! ! /* ! * Pause recovery immediately. Stays paused until asked to play again. ! * Returns void. ! */ ! Datum ! pg_recovery_pause(PG_FUNCTION_ARGS) ! { ! SetRecoveryTargetMode(RECOVERY_TARGET_PAUSE_ALL, InvalidTransactionId, 0, 0); ! ! PG_RETURN_VOID(); ! } ! ! /* ! * Pause recovery at the next cleanup record. Stays paused until asked to ! * play again. ! */ ! Datum ! pg_recovery_pause_cleanup(PG_FUNCTION_ARGS) ! { ! SetRecoveryTargetMode(RECOVERY_TARGET_PAUSE_CLEANUP, InvalidTransactionId, 0, 0); ! ! 
PG_RETURN_VOID(); ! } ! ! /* ! * Pause recovery at stated xid, if ever seen. Once paused, stays paused ! * until asked to play again. ! */ ! Datum ! pg_recovery_pause_xid(PG_FUNCTION_ARGS) ! { ! int xidi = PG_GETARG_INT32(0); ! TransactionId xid = (TransactionId) xidi; ! ! if (xid < 3) ! elog(ERROR, "cannot specify special values for transaction id"); ! ! SetRecoveryTargetMode(RECOVERY_TARGET_PAUSE_XID, xid, 0, 0); ! ! PG_RETURN_VOID(); ! } ! ! /* ! * Pause recovery at stated timestamp, if ever reached. Once paused, stays paused ! * until asked to play again. ! */ ! Datum ! pg_recovery_pause_time(PG_FUNCTION_ARGS) ! { ! TimestampTz ts = PG_GETARG_TIMESTAMPTZ(0); ! ! SetRecoveryTargetMode(RECOVERY_TARGET_PAUSE_TIME, InvalidTransactionId, ts, 0); ! ! PG_RETURN_VOID(); ! } ! ! /* ! * If paused, advance N records. ! */ ! Datum ! pg_recovery_advance(PG_FUNCTION_ARGS) ! { ! int adv = PG_GETARG_INT32(0); ! ! if (adv < 1) ! elog(ERROR, "recovery advance must be greater than or equal to 1"); ! ! SetRecoveryTargetMode(RECOVERY_TARGET_ADVANCE, InvalidTransactionId, 0, adv); ! ! PG_RETURN_VOID(); ! } ! ! /* ! * Forces recovery to stop now if paused, or at end of next record if playing. ! */ ! Datum ! pg_recovery_stop(PG_FUNCTION_ARGS) ! { ! SetRecoveryTargetMode(RECOVERY_TARGET_STOP_IMMEDIATE, InvalidTransactionId, 0, 0); ! ! PG_RETURN_VOID(); ! } ! ! /* ! * Returns bool with current recovery mode ! */ ! Datum ! pg_is_in_recovery(PG_FUNCTION_ARGS) ! { ! PG_RETURN_BOOL(IsRecoveryProcessingMode()); ! } ! ! /* ! * Returns timestamp of last completed transaction ! */ ! Datum ! pg_last_completed_xact_timestamp(PG_FUNCTION_ARGS) ! { ! PG_RETURN_TIMESTAMPTZ(recoveryLastXTime); ! } ! ! /* ! * Returns delay in milliseconds, or -1 if delay too large ! */ ! int ! GetLatestReplicationDelay(void) ! { ! long delay_secs; ! int delay_usecs; ! int delay; ! TimestampTz currTz = GetCurrentTimestamp(); ! ! TimestampDifference(recoveryLastXTime, currTz, ! &delay_secs, &delay_usecs); ! ! /* ! * If delay is very large we probably aren't looking at ! * a replication situation at all, just a recover from backup. ! * So return a special value instead. ! */ ! if (delay_secs > (long)(INT_MAX / 1000)) ! delay = -1; ! else ! delay = (int)(delay_secs * 1000) + (delay_usecs / 1000); ! ! return delay; ! } ! ! /* ! * Returns xid of last completed transaction ! */ ! Datum ! pg_last_completed_xid(PG_FUNCTION_ARGS) ! { ! PG_RETURN_INT32(recoveryLastXid); ! } ! ! /* ! * This must be called ONCE during postmaster or standalone-backend startup ! */ ! void ! StartupXLOG(void) ! { ! XLogCtlInsert *Insert; ! CheckPoint checkPoint; ! bool wasShutdown; ! bool reachedStopPoint = false; ! bool performedRecovery = false; ! bool haveBackupLabel = false; ! XLogRecPtr RecPtr, ! LastRec, ! checkPointLoc, ! minRecoveryLoc, ! EndOfLog; ! uint32 endLogId; ! uint32 endLogSeg; ! XLogRecord *record; ! uint32 freespace; ! TransactionId oldestActiveXID; ! ! XLogCtl->SharedRecoveryProcessingMode = true; ! ! /* ! * Read control file and check XLOG status looks valid. ! * ! * Note: in most control paths, *ControlFile is already valid and we need ! * not do ReadControlFile() here, but might as well do it to be sure. ! */ ! ReadControlFile(); ! ! if (ControlFile->state < DB_SHUTDOWNED || ! ControlFile->state > DB_IN_PRODUCTION || ! !XRecOffIsValid(ControlFile->checkPoint.xrecoff)) ! ereport(FATAL, ! (errmsg("control file contains invalid data"))); ! ! if (ControlFile->state == DB_SHUTDOWNED) ! ereport(LOG, ! (errmsg("database system was shut down at %s", ! 
str_time(ControlFile->time)))); ! else if (ControlFile->state == DB_SHUTDOWNING) ! ereport(LOG, ! (errmsg("database system shutdown was interrupted; last known up at %s", ! str_time(ControlFile->time)))); ! else if (ControlFile->state == DB_IN_CRASH_RECOVERY) ! ereport(LOG, ! (errmsg("database system was interrupted while in recovery at %s", ! str_time(ControlFile->time)), ! errhint("This probably means that some data is corrupted and" ! " you will have to use the last backup for recovery."))); ! else if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY) ! ereport(LOG, ! (errmsg("database system was interrupted while in recovery at log time %s", ! str_time(ControlFile->checkPointCopy.time)), ! errhint("If this has occurred more than once some data might be corrupted" ! " and you might need to choose an earlier recovery target."))); ! else if (ControlFile->state == DB_IN_PRODUCTION) ! ereport(LOG, ! (errmsg("database system was interrupted; last known up at %s", ! str_time(ControlFile->time)))); ! ! /* This is just to allow attaching to startup process with a debugger */ ! #ifdef XLOG_REPLAY_DELAY ! if (ControlFile->state != DB_SHUTDOWNED) ! pg_usleep(60000000L); ! #endif ! ! /* ! * Initialize on the assumption we want to recover to the same timeline ! * that's active according to pg_control. ! */ ! recoveryTargetTLI = ControlFile->checkPointCopy.ThisTimeLineID; ! ! /* * Check for recovery control file, and if so set up state for offline * recovery */ *************** *** 5046,5054 **** --- 5605,5619 ---- if (minRecoveryLoc.xlogid != 0 || minRecoveryLoc.xrecoff != 0) ControlFile->minRecoveryPoint = minRecoveryLoc; ControlFile->time = (pg_time_t) time(NULL); + /* No need to hold ControlFileLock yet, we aren't up far enough */ UpdateControlFile(); /* + * Reset pgstat data, because it may be invalid after recovery. + */ + pgstat_reset_all(); + + /* * If there was a backup label file, it's done its job and the info * has now been propagated into pg_control. We must get rid of the * label file so that if we crash during recovery, we'll pick up at *************** *** 5105,5111 **** do { #ifdef WAL_DEBUG ! if (XLOG_DEBUG) { StringInfoData buf; --- 5670,5681 ---- do { #ifdef WAL_DEBUG ! int loglevel = DEBUG3; ! ! if (XLogRecIsFirstUseOfXid(record) || rmid == RM_XACT_ID) ! loglevel = DEBUG2; ! ! if (loglevel >= trace_recovery_messages) { StringInfoData buf; *************** *** 5148,5153 **** --- 5718,5743 ---- TransactionIdAdvance(ShmemVariableCache->nextXid); } + if (InArchiveRecovery) + { + /* + * Make sure the incoming transaction is emulated as running + * prior to allowing any changes made by it to touch data. + */ + if (XLogRecIsFirstUseOfXid(record)) + RecordKnownAssignedTransactionIds(EndRecPtr, record); + + /* + * Wait, kill or otherwise resolve any conflicts between + * incoming cleanup records and user queries. This is the + * main barrier that allows MVCC to work correctly when + * running standby servers. Only need to do this if there + * is a possibility that users may be active. + */ + if (reachedSafeStartPoint && RecordIsCleanupRecord(record)) + XactResolveRedoVisibilityConflicts(EndRecPtr, record); + } + if (record->xl_info & XLR_BKP_BLOCK_MASK) RestoreBkpBlocks(record, EndRecPtr); *************** *** 5158,5163 **** --- 5748,5784 ---- LastRec = ReadRecPtr; + /* + * Can we signal Postmaster to enter consistent recovery mode? + * + * There are two points in the log that we must pass. 
The first
+ * is minRecoveryPoint, which is the LSN at the time the
+ * base backup was taken that we are about to roll forward from.
+ * If recovery has ever crashed or been stopped there is also a
+ * second point: minSafeStartPoint, which is the latest LSN that
+ * recovery could have reached prior to the crash.
+ *
+ * We must also have assembled sufficient information about
+ * transaction state to allow valid snapshots to be taken.
+ */
+ if (!reachedSafeStartPoint &&
+ IsRunningXactDataIsValid() &&
+ XLByteLE(ControlFile->minSafeStartPoint, EndRecPtr) &&
+ XLByteLE(ControlFile->minRecoveryPoint, EndRecPtr))
+ {
+ reachedSafeStartPoint = true;
+ if (InArchiveRecovery)
+ {
+ ereport(LOG,
+ (errmsg("database has now reached consistent state at %X/%X",
+ EndRecPtr.xlogid, EndRecPtr.xrecoff)));
+ if (IsUnderPostmaster)
+ SendPostmasterSignal(PMSIGNAL_RECOVERY_START);
+ InitRecoveryTransactionEnvironment();
+ StartCleanupDelayStats();
+ }
+ }
+
 record = ReadRecord(NULL, LOG);
 } while (record != NULL && recoveryContinue);
***************
*** 5179,5184 ****
--- 5800,5806 ----
 /* there are no WAL records following the checkpoint */
 ereport(LOG,
 (errmsg("redo is not required")));
+ reachedSafeStartPoint = true;
 }
 }
***************
*** 5192,5207 ****
 /*
 * Complain if we did not roll forward far enough to render the backup
! * dump consistent.
 */
! if (XLByteLT(EndOfLog, ControlFile->minRecoveryPoint))
 {
 if (reachedStopPoint) /* stopped because of stop request */
 ereport(FATAL,
 (errmsg("requested recovery stop point is before end time of backup dump")));
 else /* ran off end of WAL */
 ereport(FATAL,
! (errmsg("WAL ends before end time of backup dump")));
 }

 /*
--- 5814,5829 ----
 /*
 * Complain if we did not roll forward far enough to render the backup
! * dump consistent and start safely.
 */
! if (InArchiveRecovery && !reachedSafeStartPoint)
 {
 if (reachedStopPoint) /* stopped because of stop request */
 ereport(FATAL,
 (errmsg("requested recovery stop point is before end time of backup dump")));
 else /* ran off end of WAL */
 ereport(FATAL,
! (errmsg("end of WAL reached before end time of backup dump")));
 }

 /*
***************
*** 5316,5354 ****
 XLogCheckInvalidPages();

 /*
! * Reset pgstat data, because it may be invalid after recovery.
 */
! pgstat_reset_all();
!
! /*
! * Perform a checkpoint to update all our recovery activity to disk.
! *
! * Note that we write a shutdown checkpoint rather than an on-line
! * one. This is not particularly critical, but since we may be
! * assigning a new TLI, using a shutdown checkpoint allows us to have
! * the rule that TLI only changes in shutdown checkpoints, which
! * allows some extra error checking in xlog_redo.
! */
! CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 }

- /*
- * Preallocate additional log files, if wanted.
- */
- PreallocXlogFiles(EndOfLog);
-
- /*
- * Okay, we're officially UP.
- */
- InRecovery = false;
-
- ControlFile->state = DB_IN_PRODUCTION;
- ControlFile->time = (pg_time_t) time(NULL);
- UpdateControlFile();
-
- /* start the archive_timeout timer running */
- XLogCtl->Write.lastSegSwitchTime = ControlFile->time;
-
 /* initialize shared-memory copy of latest checkpoint XID/epoch */
 XLogCtl->ckptXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
 XLogCtl->ckptXid = ControlFile->checkPointCopy.nextXid;
--- 5938,5951 ----
 XLogCheckInvalidPages();

 /*
! * Finally exit recovery and mark that in WAL. Pre-8.4 we wrote
! * a shutdown checkpoint here, but we ask bgwriter to do that now.
 */
! exitRecovery();
!
performedRecovery = true; } /* initialize shared-memory copy of latest checkpoint XID/epoch */ XLogCtl->ckptXidEpoch = ControlFile->checkPointCopy.nextXidEpoch; XLogCtl->ckptXid = ControlFile->checkPointCopy.nextXid; *************** *** 5357,5362 **** --- 5954,5963 ---- ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid; TransactionIdRetreat(ShmemVariableCache->latestCompletedXid); + /* Shutdown the recovery environment. Must be in this order */ + ProcArrayClearRecoveryTransactions(); + RelationClearRecoveryLocks(); + /* Start up the commit log and related stuff, too */ StartupCLOG(); StartupSUBTRANS(oldestActiveXID); *************** *** 5382,5387 **** --- 5983,6084 ---- readRecordBuf = NULL; readRecordBufSize = 0; } + + /* + * Prior to 8.4 we wrote a Shutdown Checkpoint at the end of recovery. + * This could add minutes to the startup time, so we want bgwriter + * to perform it. This then frees the Startup process to complete so we can + * allow transactions and WAL inserts. We still write a checkpoint, but + * it will be an online checkpoint. Online checkpoints have a redo + * location that can be prior to the actual checkpoint record. So we want + * to derive that redo location *before* we let anybody else write WAL, + * otherwise we might miss some WAL records if we crash. + */ + if (performedRecovery) + { + XLogRecPtr redo; + + /* + * We must grab the pointer before anybody writes WAL + */ + redo = GetRedoLocationForCheckpoint(); + + /* + * Set up information for the bgwriter, but if it is not active + * for whatever reason, perform the checkpoint ourselves. + */ + if (SetRedoLocationForArchiveCheckpoint(redo)) + { + /* + * Okay, we can come up now. Allow others to write WAL. + */ + XLogCtl->SharedRecoveryProcessingMode = false; + elog(trace_recovery(DEBUG1), "WAL inserts enabled"); + + /* + * Now request checkpoint from bgwriter. + */ + RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_IMMEDIATE); + } + else + { + /* + * Startup process performs the checkpoint, but defers + * the change in processing mode until afterwards. + */ + CreateCheckPoint(CHECKPOINT_FORCE | CHECKPOINT_IMMEDIATE); + } + } + else + { + /* + * No recovery, so lets just get on with it. + */ + LWLockAcquire(ControlFileLock, LW_EXCLUSIVE); + ControlFile->state = DB_IN_PRODUCTION; + ControlFile->time = (pg_time_t) time(NULL); + UpdateControlFile(); + LWLockRelease(ControlFileLock); + } + + /* + * Okay, we can come up now. Allow others to write WAL. + */ + XLogCtl->SharedRecoveryProcessingMode = false; + + /* start the archive_timeout timer running */ + XLogCtl->Write.lastSegSwitchTime = (pg_time_t) time(NULL); + } + + /* + * IsRecoveryProcessingMode() + * + * Fast test for whether we're still in recovery or not. We test the shared + * state each time only until we leave recovery mode. After that we never + * look again, relying upon the settings of our local state variables. This + * is designed to avoid the need for a separate initialisation step. 
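+ *
+ * A typical caller simply tests the result before attempting an action
+ * that is not allowed during recovery, for example (sketch only,
+ * mirroring the cross-check in XLogInsert above):
+ *
+ *     if (IsRecoveryProcessingMode())
+ *         elog(FATAL, "cannot make new WAL entries during recovery");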
+ */ + bool + IsRecoveryProcessingMode(void) + { + if (knownProcessingMode && !LocalRecoveryProcessingMode) + return false; + + { + /* use volatile pointer to prevent code rearrangement */ + volatile XLogCtlData *xlogctl = XLogCtl; + + if (xlogctl == NULL) + return false; + + SpinLockAcquire(&xlogctl->mode_lck); + LocalRecoveryProcessingMode = XLogCtl->SharedRecoveryProcessingMode; + SpinLockRelease(&xlogctl->mode_lck); + } + + knownProcessingMode = true; + + return LocalRecoveryProcessingMode; } /* *************** *** 5639,5658 **** static void LogCheckpointStart(int flags) { ! elog(LOG, "checkpoint starting:%s%s%s%s%s%s", ! (flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "", ! (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "", ! (flags & CHECKPOINT_FORCE) ? " force" : "", ! (flags & CHECKPOINT_WAIT) ? " wait" : "", ! (flags & CHECKPOINT_CAUSE_XLOG) ? " xlog" : "", ! (flags & CHECKPOINT_CAUSE_TIME) ? " time" : ""); } /* * Log end of a checkpoint. */ static void ! LogCheckpointEnd(void) { long write_secs, sync_secs, --- 6336,6359 ---- static void LogCheckpointStart(int flags) { ! if (flags & CHECKPOINT_RESTARTPOINT) ! elog(LOG, "restartpoint starting:%s", ! (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : ""); ! else ! elog(LOG, "checkpoint starting:%s%s%s%s%s%s", ! (flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "", ! (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "", ! (flags & CHECKPOINT_FORCE) ? " force" : "", ! (flags & CHECKPOINT_WAIT) ? " wait" : "", ! (flags & CHECKPOINT_CAUSE_XLOG) ? " xlog" : "", ! (flags & CHECKPOINT_CAUSE_TIME) ? " time" : ""); } /* * Log end of a checkpoint. */ static void ! LogCheckpointEnd(int flags) { long write_secs, sync_secs, *************** *** 5675,5691 **** CheckpointStats.ckpt_sync_end_t, &sync_secs, &sync_usecs); ! elog(LOG, "checkpoint complete: wrote %d buffers (%.1f%%); " ! "%d transaction log file(s) added, %d removed, %d recycled; " ! "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s", ! CheckpointStats.ckpt_bufs_written, ! (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers, ! CheckpointStats.ckpt_segs_added, ! CheckpointStats.ckpt_segs_removed, ! CheckpointStats.ckpt_segs_recycled, ! write_secs, write_usecs / 1000, ! sync_secs, sync_usecs / 1000, ! total_secs, total_usecs / 1000); } /* --- 6376,6401 ---- CheckpointStats.ckpt_sync_end_t, &sync_secs, &sync_usecs); ! if (flags & CHECKPOINT_RESTARTPOINT) ! elog(LOG, "restartpoint complete: wrote %d buffers (%.1f%%); " ! "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s", ! CheckpointStats.ckpt_bufs_written, ! (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers, ! write_secs, write_usecs / 1000, ! sync_secs, sync_usecs / 1000, ! total_secs, total_usecs / 1000); ! else ! elog(LOG, "checkpoint complete: wrote %d buffers (%.1f%%); " ! "%d transaction log file(s) added, %d removed, %d recycled; " ! "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s", ! CheckpointStats.ckpt_bufs_written, ! (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers, ! CheckpointStats.ckpt_segs_added, ! CheckpointStats.ckpt_segs_removed, ! CheckpointStats.ckpt_segs_recycled, ! write_secs, write_usecs / 1000, ! sync_secs, sync_usecs / 1000, ! total_secs, total_usecs / 1000); } /* *************** *** 5710,5726 **** XLogRecPtr recptr; XLogCtlInsert *Insert = &XLogCtl->Insert; XLogRecData rdata; - uint32 freespace; uint32 _logId; uint32 _logSeg; TransactionId *inCommitXids; int nInCommit; /* * Acquire CheckpointLock to ensure only one checkpoint happens at a time. ! 
* (This is just pro forma, since in the present system structure there is ! * only one process that is allowed to issue checkpoints at any given ! * time.) */ LWLockAcquire(CheckpointLock, LW_EXCLUSIVE); --- 6420,6435 ---- XLogRecPtr recptr; XLogCtlInsert *Insert = &XLogCtl->Insert; XLogRecData rdata; uint32 _logId; uint32 _logSeg; TransactionId *inCommitXids; int nInCommit; + bool leavingArchiveRecovery = false; /* * Acquire CheckpointLock to ensure only one checkpoint happens at a time. ! * That shouldn't be happening, but checkpoints are an important aspect ! * of our resilience, so we take no chances. */ LWLockAcquire(CheckpointLock, LW_EXCLUSIVE); *************** *** 5735,5749 **** --- 6444,6467 ---- CheckpointStats.ckpt_start_t = GetCurrentTimestamp(); /* + * Find out if this is the first checkpoint after archive recovery. + */ + LWLockAcquire(ControlFileLock, LW_EXCLUSIVE); + leavingArchiveRecovery = (ControlFile->state == DB_IN_ARCHIVE_RECOVERY); + LWLockRelease(ControlFileLock); + + /* * Use a critical section to force system panic if we have trouble. */ START_CRIT_SECTION(); if (shutdown) { + LWLockAcquire(ControlFileLock, LW_EXCLUSIVE); ControlFile->state = DB_SHUTDOWNING; ControlFile->time = (pg_time_t) time(NULL); UpdateControlFile(); + LWLockRelease(ControlFileLock); } /* *************** *** 5799,5848 **** } } ! /* ! * Compute new REDO record ptr = location of next XLOG record. ! * ! * NB: this is NOT necessarily where the checkpoint record itself will be, ! * since other backends may insert more XLOG records while we're off doing ! * the buffer flush work. Those XLOG records are logically after the ! * checkpoint, even though physically before it. Got that? ! */ ! freespace = INSERT_FREESPACE(Insert); ! if (freespace < SizeOfXLogRecord) ! { ! (void) AdvanceXLInsertBuffer(false); ! /* OK to ignore update return flag, since we will do flush anyway */ ! freespace = INSERT_FREESPACE(Insert); ! } ! INSERT_RECPTR(checkPoint.redo, Insert, Insert->curridx); ! ! /* ! * Here we update the shared RedoRecPtr for future XLogInsert calls; this ! * must be done while holding the insert lock AND the info_lck. ! * ! * Note: if we fail to complete the checkpoint, RedoRecPtr will be left ! * pointing past where it really needs to point. This is okay; the only ! * consequence is that XLogInsert might back up whole buffers that it ! * didn't really need to. We can't postpone advancing RedoRecPtr because ! * XLogInserts that happen while we are dumping buffers must assume that ! * their buffer changes are not included in the checkpoint. ! */ { ! /* use volatile pointer to prevent code rearrangement */ ! volatile XLogCtlData *xlogctl = XLogCtl; ! SpinLockAcquire(&xlogctl->info_lck); ! RedoRecPtr = xlogctl->Insert.RedoRecPtr = checkPoint.redo; ! SpinLockRelease(&xlogctl->info_lck); } /* - * Now we can release WAL insert lock, allowing other xacts to proceed - * while we are flushing disk buffers. - */ - LWLockRelease(WALInsertLock); - - /* * If enabled, log checkpoint start. We postpone this until now so as not * to log anything if we decided to skip the checkpoint. */ --- 6517,6544 ---- } } ! if (leavingArchiveRecovery) ! checkPoint.redo = GetRedoLocationForArchiveCheckpoint(); ! else { ! /* ! * Compute new REDO record ptr = location of next XLOG record. ! * ! * NB: this is NOT necessarily where the checkpoint record itself will be, ! * since other backends may insert more XLOG records while we're off doing ! * the buffer flush work. Those XLOG records are logically after the ! 
* checkpoint, even though physically before it. Got that? ! */ ! checkPoint.redo = GetRedoLocationForCheckpoint(); ! /* ! * Now we can release WAL insert lock, allowing other xacts to proceed ! * while we are flushing disk buffers. ! */ ! LWLockRelease(WALInsertLock); } /* * If enabled, log checkpoint start. We postpone this until now so as not * to log anything if we decided to skip the checkpoint. */ *************** *** 5949,5959 **** XLByteToSeg(ControlFile->checkPointCopy.redo, _logId, _logSeg); /* ! * Update the control file. */ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE); if (shutdown) ControlFile->state = DB_SHUTDOWNED; ControlFile->prevCheckPoint = ControlFile->checkPoint; ControlFile->checkPoint = ProcLastRecPtr; ControlFile->checkPointCopy = checkPoint; --- 6645,6662 ---- XLByteToSeg(ControlFile->checkPointCopy.redo, _logId, _logSeg); /* ! * Update the control file. In 8.4, this routine becomes the primary ! * point for recording changes of state in the control file at the ! * end of recovery. Postmaster state already shows us being in ! * normal running mode, but it is only after this point that we ! * are completely free of reperforming a recovery if we crash. Note ! * that this is executed by bgwriter after the death of Startup process. */ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE); if (shutdown) ControlFile->state = DB_SHUTDOWNED; + else + ControlFile->state = DB_IN_PRODUCTION; ControlFile->prevCheckPoint = ControlFile->checkPoint; ControlFile->checkPoint = ProcLastRecPtr; ControlFile->checkPointCopy = checkPoint; *************** *** 5961,5966 **** --- 6664,6684 ---- UpdateControlFile(); LWLockRelease(ControlFileLock); + if (leavingArchiveRecovery) + { + /* + * Rename the config file out of the way, so that we don't accidentally + * re-enter archive recovery mode in a subsequent crash. Prior to + * 8.4 this step was performed at end of exitArchiveRecovery(). + */ + unlink(RECOVERY_COMMAND_DONE); + if (rename(RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE) != 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not rename file \"%s\" to \"%s\": %m", + RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE))); + } + /* Update shared-memory copy of checkpoint XID/epoch */ { /* use volatile pointer to prevent code rearrangement */ *************** *** 6004,6020 **** * Truncate pg_subtrans if possible. We can throw away all data before * the oldest XMIN of any running transaction. No future transaction will * attempt to reference any pg_subtrans entry older than that (see Asserts ! * in subtrans.c). During recovery, though, we mustn't do this because ! * StartupSUBTRANS hasn't been called yet. */ ! if (!InRecovery) TruncateSUBTRANS(GetOldestXmin(true, false)); /* All real work is done, but log before releasing lock. */ if (log_checkpoints) ! LogCheckpointEnd(); LWLockRelease(CheckpointLock); } /* --- 6722,6793 ---- * Truncate pg_subtrans if possible. We can throw away all data before * the oldest XMIN of any running transaction. No future transaction will * attempt to reference any pg_subtrans entry older than that (see Asserts ! * in subtrans.c). */ ! if (!shutdown) TruncateSUBTRANS(GetOldestXmin(true, false)); /* All real work is done, but log before releasing lock. */ if (log_checkpoints) ! LogCheckpointEnd(flags); LWLockRelease(CheckpointLock); + + /* + * Take a snapshot of running transactions and write this to WAL. + * This allows us to reconstruct the state of running transactions + * during archive recovery, if required. 
+ * + * If we are shutting down, or Startup process is completing crash + * recovery we don't need to write running xact data. + */ + if (!shutdown && !IsRecoveryProcessingMode()) + LogCurrentRunningXacts(); + } + + /* + * GetRedoLocationForCheckpoint() + * + * When !IsRecoveryProcessingMode() this must be called while holding + * WALInsertLock(). + */ + static XLogRecPtr + GetRedoLocationForCheckpoint() + { + XLogCtlInsert *Insert = &XLogCtl->Insert; + uint32 freespace; + XLogRecPtr redo; + + freespace = INSERT_FREESPACE(Insert); + if (freespace < SizeOfXLogRecord) + { + (void) AdvanceXLInsertBuffer(false); + /* OK to ignore update return flag, since we will do flush anyway */ + freespace = INSERT_FREESPACE(Insert); + } + INSERT_RECPTR(redo, Insert, Insert->curridx); + + /* + * Here we update the shared RedoRecPtr for future XLogInsert calls; this + * must be done while holding the insert lock AND the info_lck. + * + * Note: if we fail to complete the checkpoint, RedoRecPtr will be left + * pointing past where it really needs to point. This is okay; the only + * consequence is that XLogInsert might back up whole buffers that it + * didn't really need to. We can't postpone advancing RedoRecPtr because + * XLogInserts that happen while we are dumping buffers must assume that + * their buffer changes are not included in the checkpoint. + */ + { + /* use volatile pointer to prevent code rearrangement */ + volatile XLogCtlData *xlogctl = XLogCtl; + + SpinLockAcquire(&xlogctl->info_lck); + RedoRecPtr = xlogctl->Insert.RedoRecPtr = redo; + SpinLockRelease(&xlogctl->info_lck); + } + + return redo; } /* *************** *** 6073,6079 **** if (RmgrTable[rmid].rm_safe_restartpoint != NULL) if (!(RmgrTable[rmid].rm_safe_restartpoint())) { ! elog(DEBUG2, "RM %d not safe to record restart point at %X/%X", rmid, checkPoint->redo.xlogid, checkPoint->redo.xrecoff); --- 6846,6852 ---- if (RmgrTable[rmid].rm_safe_restartpoint != NULL) if (!(RmgrTable[rmid].rm_safe_restartpoint())) { ! elog(trace_recovery(DEBUG2), "RM %d not safe to record restart point at %X/%X", rmid, checkPoint->redo.xlogid, checkPoint->redo.xrecoff); *************** *** 6081,6111 **** } } /* ! * OK, force data out to disk */ ! CheckPointGuts(checkPoint->redo, CHECKPOINT_IMMEDIATE); /* ! * Update pg_control so that any subsequent crash will restart from this ! * checkpoint. Note: ReadRecPtr gives the XLOG address of the checkpoint ! * record itself. */ - ControlFile->prevCheckPoint = ControlFile->checkPoint; - ControlFile->checkPoint = ReadRecPtr; - ControlFile->checkPointCopy = *checkPoint; - ControlFile->time = (pg_time_t) time(NULL); - UpdateControlFile(); ereport((recoveryLogRestartpoints ? LOG : DEBUG2), ! (errmsg("recovery restart point at %X/%X", ! checkPoint->redo.xlogid, checkPoint->redo.xrecoff))); ! if (recoveryLastXTime) ! ereport((recoveryLogRestartpoints ? LOG : DEBUG2), ! (errmsg("last completed transaction was at log time %s", ! timestamptz_to_str(recoveryLastXTime)))); ! } /* * Write a NEXTOID log record */ --- 6854,6926 ---- } } + RequestRestartPoint(ReadRecPtr, checkPoint, reachedSafeStartPoint); + } + + /* + * As of 8.4, RestartPoints are always created by the bgwriter + * once we have reachedSafeStartPoint. We use bgwriter's shared memory + * area wherever we call it from, to keep better code structure. + */ + void + CreateRestartPoint(const XLogRecPtr ReadPtr, const CheckPoint *restartPoint, int flags) + { + if (recoveryLogRestartpoints || log_checkpoints) + { + /* + * Prepare to accumulate statistics. 
+ */ + + MemSet(&CheckpointStats, 0, sizeof(CheckpointStats)); + CheckpointStats.ckpt_start_t = GetCurrentTimestamp(); + + LogCheckpointStart(CHECKPOINT_RESTARTPOINT | flags); + } + + /* + * Acquire CheckpointLock to ensure only one restartpoint happens at a time. + * We rely on this lock to ensure that the startup process doesn't exit + * Recovery while we are half way through a restartpoint. + */ + LWLockAcquire(CheckpointLock, LW_EXCLUSIVE); + + CheckPointGuts(restartPoint->redo, CHECKPOINT_RESTARTPOINT | flags); + /* ! * Update pg_control, using current time */ ! LWLockAcquire(ControlFileLock, LW_EXCLUSIVE); ! ControlFile->prevCheckPoint = ControlFile->checkPoint; ! ControlFile->checkPoint = ReadPtr; ! ControlFile->checkPointCopy = *restartPoint; ! ControlFile->time = (pg_time_t) time(NULL); ! UpdateControlFile(); ! LWLockRelease(ControlFileLock); /* ! * Currently, there is no need to truncate pg_subtrans during recovery. ! * If we did do that, we will need to have called StartupSUBTRANS() ! * already and then TruncateSUBTRANS() would go here. */ + /* All real work is done, but log before releasing lock. */ + if (recoveryLogRestartpoints || log_checkpoints) + LogCheckpointEnd(CHECKPOINT_RESTARTPOINT); + ereport((recoveryLogRestartpoints ? LOG : DEBUG2), ! (errmsg("recovery restart point at %X/%X", ! restartPoint->redo.xlogid, restartPoint->redo.xrecoff))); + ReportCleanupDelayStats(); + + if (recoveryLastXTime) + ereport((recoveryLogRestartpoints ? LOG : DEBUG2), + (errmsg("last completed transaction was at log time %s", + timestamptz_to_str(recoveryLastXTime)))); + + LWLockRelease(CheckpointLock); + } + /* * Write a NEXTOID log record */ *************** *** 6168,6174 **** } /* ! * XLOG resource manager's routines */ void xlog_redo(XLogRecPtr lsn, XLogRecord *record) --- 6983,7045 ---- } /* ! * exitRecovery() ! * ! * Exit recovery state and write a XLOG_RECOVERY_END record. This is the ! * only record type that can record a change of timelineID. We assume ! * caller has already set ThisTimeLineID, if appropriate. ! */ ! static void ! exitRecovery(void) ! { ! XLogRecData rdata; ! ! rdata.buffer = InvalidBuffer; ! rdata.data = (char *) (&ThisTimeLineID); ! rdata.len = sizeof(TimeLineID); ! rdata.next = NULL; ! ! /* ! * If a restartpoint is in progress, we will not be able to successfully ! * acquire CheckpointLock. If bgwriter's restartpoint is still in progress then send ! * a second signal to nudge bgwriter to go faster so we can avoid delay. ! * Then wait for lock, so we know the restartpoint has completed. We do ! * this because we don't want to interrupt the restartpoint half way ! * through, which might leave us in a mess and we want to be robust. We're ! * going to checkpoint soon anyway, so it's not wasted effort. ! */ ! if (LWLockConditionalAcquire(CheckpointLock, LW_EXCLUSIVE)) ! LWLockRelease(CheckpointLock); ! else ! { ! RequestRestartPointCompletion(); ! ereport(trace_recovery(DEBUG1), ! (errmsg("startup process waiting for restartpoint to complete"))); ! LWLockAcquire(CheckpointLock, LW_EXCLUSIVE); ! LWLockRelease(CheckpointLock); ! } ! ! /* ! * This is the only type of WAL message that can be inserted during ! * recovery. This ensures that we don't allow others to get access ! * until after we have changed state. ! */ ! (void) XLogInsert(RM_XLOG_ID, XLOG_RECOVERY_END, &rdata); ! ! /* ! * We don't XLogFlush() here otherwise we'll end up zeroing the WAL ! * file ourselves. So just let bgwriter's forthcoming checkpoint do ! * that for us. ! */ ! ! InRecovery = false; ! } ! ! /* ! 
* XLOG resource manager's routines. ! * ! * Definitions of message info are in include/catalog/pg_control.h, ! * though not all messages relate to control file processing. */ void xlog_redo(XLogRecPtr lsn, XLogRecord *record) *************** *** 6198,6224 **** MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset); /* ControlFile->checkPointCopy always tracks the latest ckpt XID */ ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch; ControlFile->checkPointCopy.nextXid = checkPoint.nextXid; ! /* ! * TLI may change in a shutdown checkpoint, but it shouldn't decrease */ - if (checkPoint.ThisTimeLineID != ThisTimeLineID) - { - if (checkPoint.ThisTimeLineID < ThisTimeLineID || - !list_member_int(expectedTLIs, - (int) checkPoint.ThisTimeLineID)) - ereport(PANIC, - (errmsg("unexpected timeline ID %u (after %u) in checkpoint record", - checkPoint.ThisTimeLineID, ThisTimeLineID))); - /* Following WAL records should be run with new TLI */ - ThisTimeLineID = checkPoint.ThisTimeLineID; - } RecoveryRestartPoint(&checkPoint); } else if (info == XLOG_CHECKPOINT_ONLINE) { CheckPoint checkPoint; --- 7069,7120 ---- MultiXactSetNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset); + /* We know nothing was running on the master at this point */ + ProcArrayClearRecoveryTransactions(); + RelationClearRecoveryLocks(); + /* ControlFile->checkPointCopy always tracks the latest ckpt XID */ ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch; ControlFile->checkPointCopy.nextXid = checkPoint.nextXid; ! /* ! * TLI no longer changes at shutdown checkpoint, since as of 8.4, ! * shutdown checkpoints only occur at shutdown. Much less confusing. */ RecoveryRestartPoint(&checkPoint); } + else if (info == XLOG_RECOVERY_END) + { + TimeLineID tli; + + memcpy(&tli, XLogRecGetData(record), sizeof(TimeLineID)); + + /* We know nothing was running on the master at this point */ + ProcArrayClearRecoveryTransactions(); + RelationClearRecoveryLocks(); + + /* + * TLI may change when recovery ends, but it shouldn't decrease. + * + * This is the only WAL record that can tell us to change timelineID + * while we process WAL records. + * + * We can *choose* to stop recovery at any point, generating a + * new timelineID which is recorded using this record type. + */ + if (tli != ThisTimeLineID) + { + if (tli < ThisTimeLineID || + !list_member_int(expectedTLIs, + (int) tli)) + ereport(PANIC, + (errmsg("unexpected timeline ID %u (after %u) at recovery end record", + tli, ThisTimeLineID))); + /* Following WAL records should be run with new TLI */ + ThisTimeLineID = tli; + } + } else if (info == XLOG_CHECKPOINT_ONLINE) { CheckPoint checkPoint; *************** *** 6240,6246 **** ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch; ControlFile->checkPointCopy.nextXid = checkPoint.nextXid; ! /* TLI should not change in an on-line checkpoint */ if (checkPoint.ThisTimeLineID != ThisTimeLineID) ereport(PANIC, (errmsg("unexpected timeline ID %u (should be %u) in checkpoint record", --- 7136,7142 ---- ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch; ControlFile->checkPointCopy.nextXid = checkPoint.nextXid; ! 
/* TLI must not change at a checkpoint */ if (checkPoint.ThisTimeLineID != ThisTimeLineID) ereport(PANIC, (errmsg("unexpected timeline ID %u (should be %u) in checkpoint record", *************** *** 6308,6313 **** --- 7204,7216 ---- record->xl_prev.xlogid, record->xl_prev.xrecoff, record->xl_xid); + appendStringInfo(buf, "; pxid %u %s %s len %u slot %d", + record->xl_xid2, + (XLogRecIsFirstXidRecord(record) ? "t" : "f"), + (XLogRecIsFirstSubXidRecord(record) ? "t" : "f"), + record->xl_len, + XLogRecGetSlotId(record)); + for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++) { if (record->xl_info & XLR_SET_BKP_BLOCK(i)) *************** *** 6476,6481 **** --- 7379,7390 ---- errhint("archive_command must be defined before " "online backups can be made safely."))); + if (IsRecoveryProcessingMode()) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("recovery is in progress"), + errhint("WAL control functions cannot be executed during recovery."))); + backupidstr = text_to_cstring(backupid); /* *************** *** 6639,6644 **** --- 7548,7559 ---- errmsg("WAL archiving is not active"), errhint("archive_mode must be enabled at server start."))); + if (IsRecoveryProcessingMode()) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("recovery is in progress"), + errhint("WAL control functions cannot be executed during recovery."))); + /* * OK to clear forcePageWrites */ *************** *** 6790,6795 **** --- 7705,7716 ---- (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), (errmsg("must be superuser to switch transaction log files")))); + if (IsRecoveryProcessingMode()) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("recovery is in progress"), + errhint("WAL control functions cannot be executed during recovery."))); + switchpoint = RequestXLogSwitch(); /* *************** *** 6812,6817 **** --- 7733,7744 ---- { char location[MAXFNAMELEN]; + if (IsRecoveryProcessingMode()) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("recovery is in progress"), + errhint("WAL control functions cannot be executed during recovery."))); + /* Make sure we have an up-to-date local LogwrtResult */ { /* use volatile pointer to prevent code rearrangement */ *************** *** 6839,6844 **** --- 7766,7777 ---- XLogRecPtr current_recptr; char location[MAXFNAMELEN]; + if (IsRecoveryProcessingMode()) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("recovery is in progress"), + errhint("WAL control functions cannot be executed during recovery."))); + /* * Get the current end-of-WAL position ... shared lock is sufficient */ Index: src/backend/access/transam/xlogutils.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/xlogutils.c,v retrieving revision 1.60 diff -c -r1.60 xlogutils.c *** src/backend/access/transam/xlogutils.c 31 Oct 2008 15:04:59 -0000 1.60 --- src/backend/access/transam/xlogutils.c 1 Nov 2008 15:40:31 -0000 *************** *** 212,218 **** XLogReadBuffer(RelFileNode rnode, BlockNumber blkno, bool init) { return XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno, ! init ? RBM_ZERO : RBM_NORMAL); } /* --- 212,225 ---- XLogReadBuffer(RelFileNode rnode, BlockNumber blkno, bool init) { return XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno, ! init ? RBM_ZERO : RBM_NORMAL, BUFFER_LOCK_EXCLUSIVE); ! } ! ! Buffer ! XLogReadBufferForCleanup(RelFileNode rnode, BlockNumber blkno, bool init) ! { ! 
return XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno, ! init ? RBM_ZERO : RBM_NORMAL, BUFFER_LOCK_CLEANUP); } /* *************** *** 239,245 **** */ Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum, ! BlockNumber blkno, ReadBufferMode mode) { BlockNumber lastblock; Buffer buffer; --- 246,252 ---- */ Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum, ! BlockNumber blkno, ReadBufferMode mode, int lockmode) { BlockNumber lastblock; Buffer buffer; *************** *** 291,297 **** Assert(BufferGetBlockNumber(buffer) == blkno); } ! LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); if (mode == RBM_NORMAL) { --- 298,309 ---- Assert(BufferGetBlockNumber(buffer) == blkno); } ! if (lockmode == BUFFER_LOCK_EXCLUSIVE) ! LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); ! else if (lockmode == BUFFER_LOCK_CLEANUP) ! LockBufferForCleanup(buffer); ! else ! elog(FATAL, "Invalid buffer lock mode %d", lockmode); if (mode == RBM_NORMAL) { Index: src/backend/bootstrap/bootstrap.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/bootstrap/bootstrap.c,v retrieving revision 1.246 diff -c -r1.246 bootstrap.c *** src/backend/bootstrap/bootstrap.c 30 Sep 2008 10:52:11 -0000 1.246 --- src/backend/bootstrap/bootstrap.c 1 Nov 2008 14:49:38 -0000 *************** *** 35,40 **** --- 35,41 ---- #include "storage/bufmgr.h" #include "storage/ipc.h" #include "storage/proc.h" + #include "storage/sinvaladt.h" #include "tcop/tcopprot.h" #include "utils/builtins.h" #include "utils/flatfiles.h" *************** *** 418,424 **** case StartupProcess: bootstrap_signals(); StartupXLOG(); ! BuildFlatFiles(false); proc_exit(0); /* startup done */ case BgWriterProcess: --- 419,425 ---- case StartupProcess: bootstrap_signals(); StartupXLOG(); ! BuildFlatFiles(false, true, true); proc_exit(0); /* startup done */ case BgWriterProcess: Index: src/backend/commands/discard.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/commands/discard.c,v retrieving revision 1.4 diff -c -r1.4 discard.c *** src/backend/commands/discard.c 1 Jan 2008 19:45:49 -0000 1.4 --- src/backend/commands/discard.c 1 Nov 2008 14:49:38 -0000 *************** *** 65,71 **** ResetAllOptions(); DropAllPreparedStatements(); PortalHashTableDeleteAll(); ! Async_UnlistenAll(); ResetPlanCache(); ResetTempTableNamespace(); } --- 65,72 ---- ResetAllOptions(); DropAllPreparedStatements(); PortalHashTableDeleteAll(); ! if (!IsRecoveryProcessingMode()) ! Async_UnlistenAll(); ResetPlanCache(); ResetTempTableNamespace(); } Index: src/backend/commands/indexcmds.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/commands/indexcmds.c,v retrieving revision 1.180 diff -c -r1.180 indexcmds.c *** src/backend/commands/indexcmds.c 13 Oct 2008 16:25:19 -0000 1.180 --- src/backend/commands/indexcmds.c 1 Nov 2008 14:49:38 -0000 *************** *** 648,654 **** * Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not * check for that. */ ! old_snapshots = GetCurrentVirtualXIDs(snapshot->xmax, false, PROC_IS_AUTOVACUUM | PROC_IN_VACUUM); while (VirtualTransactionIdIsValid(*old_snapshots)) --- 648,654 ---- * Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not * check for that. */ ! 
old_snapshots = GetCurrentVirtualXIDs(snapshot->xmax, MyDatabaseId, PROC_IS_AUTOVACUUM | PROC_IN_VACUUM); while (VirtualTransactionIdIsValid(*old_snapshots)) Index: src/backend/commands/lockcmds.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/commands/lockcmds.c,v retrieving revision 1.19 diff -c -r1.19 lockcmds.c *** src/backend/commands/lockcmds.c 8 Sep 2008 00:47:40 -0000 1.19 --- src/backend/commands/lockcmds.c 1 Nov 2008 14:49:38 -0000 *************** *** 49,54 **** --- 49,66 ---- */ reloid = RangeVarGetRelid(relation, false); + /* + * During recovery we only accept these variations: + * + * LOCK TABLE foo -- parser translates as AccessExclusiveLock request + * LOCK TABLE foo IN AccessShareLock MODE + * LOCK TABLE foo IN AccessExclusiveLock MODE + */ + if (IsRecoveryProcessingMode() && + !(lockstmt->mode == AccessShareLock || + lockstmt->mode == AccessExclusiveLock)) + PreventCommandDuringRecovery(); + if (lockstmt->mode == AccessShareLock) aclresult = pg_class_aclcheck(reloid, GetUserId(), ACL_SELECT); Index: src/backend/commands/sequence.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/commands/sequence.c,v retrieving revision 1.154 diff -c -r1.154 sequence.c *** src/backend/commands/sequence.c 13 Jul 2008 20:45:47 -0000 1.154 --- src/backend/commands/sequence.c 1 Nov 2008 14:49:38 -0000 *************** *** 457,462 **** --- 457,464 ---- rescnt = 0; bool logit = false; + PreventCommandDuringRecovery(); + /* open and AccessShareLock sequence */ init_sequence(relid, &elm, &seqrel); Index: src/backend/commands/vacuum.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/commands/vacuum.c,v retrieving revision 1.379 diff -c -r1.379 vacuum.c *** src/backend/commands/vacuum.c 31 Oct 2008 15:05:00 -0000 1.379 --- src/backend/commands/vacuum.c 1 Nov 2008 14:49:38 -0000 *************** *** 138,143 **** --- 138,144 ---- /* vtlinks array for tuple chain following - sorted by new_tid */ int num_vtlinks; VTupleLink vtlinks; + TransactionId latestRemovedXid; } VRelStats; /*---------------------------------------------------------------------- *************** *** 221,227 **** static void repair_frag(VRelStats *vacrelstats, Relation onerel, VacPageList vacuum_pages, VacPageList fraged_pages, int nindexes, Relation *Irel); ! static void move_chain_tuple(Relation rel, Buffer old_buf, Page old_page, HeapTuple old_tup, Buffer dst_buf, Page dst_page, VacPage dst_vacpage, ExecContext ec, ItemPointer ctid, bool cleanVpd); --- 222,228 ---- static void repair_frag(VRelStats *vacrelstats, Relation onerel, VacPageList vacuum_pages, VacPageList fraged_pages, int nindexes, Relation *Irel); ! static void move_chain_tuple(VRelStats *vacrelstats, Relation rel, Buffer old_buf, Page old_page, HeapTuple old_tup, Buffer dst_buf, Page dst_page, VacPage dst_vacpage, ExecContext ec, ItemPointer ctid, bool cleanVpd); *************** *** 234,240 **** int num_moved); static void vacuum_heap(VRelStats *vacrelstats, Relation onerel, VacPageList vacpagelist); ! 
static void vacuum_page(Relation onerel, Buffer buffer, VacPage vacpage); static void vacuum_index(VacPageList vacpagelist, Relation indrel, double num_tuples, int keep_tuples); static void scan_index(Relation indrel, double num_tuples); --- 235,241 ---- int num_moved); static void vacuum_heap(VRelStats *vacrelstats, Relation onerel, VacPageList vacpagelist); ! static void vacuum_page(VRelStats *vacrelstats, Relation onerel, Buffer buffer, VacPage vacpage); static void vacuum_index(VacPageList vacpagelist, Relation indrel, double num_tuples, int keep_tuples); static void scan_index(Relation indrel, double num_tuples); *************** *** 1220,1225 **** --- 1221,1227 ---- vacrelstats->rel_tuples = 0; vacrelstats->rel_indexed_tuples = 0; vacrelstats->hasindex = false; + vacrelstats->latestRemovedXid = InvalidTransactionId; /* scan the heap */ vacuum_pages.num_pages = fraged_pages.num_pages = 0; *************** *** 1623,1628 **** --- 1625,1633 ---- { ItemId lpp; + HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data, + &vacrelstats->latestRemovedXid); + /* * Here we are building a temporary copy of the page with dead * tuples removed. Below we will apply *************** *** 1936,1942 **** /* there are dead tuples on this page - clean them */ Assert(!isempty); LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); ! vacuum_page(onerel, buf, last_vacuum_page); LockBuffer(buf, BUFFER_LOCK_UNLOCK); } else --- 1941,1947 ---- /* there are dead tuples on this page - clean them */ Assert(!isempty); LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); ! vacuum_page(vacrelstats, onerel, buf, last_vacuum_page); LockBuffer(buf, BUFFER_LOCK_UNLOCK); } else *************** *** 2425,2431 **** tuple.t_data = (HeapTupleHeader) PageGetItem(Cpage, Citemid); tuple_len = tuple.t_len = ItemIdGetLength(Citemid); ! move_chain_tuple(onerel, Cbuf, Cpage, &tuple, dst_buffer, dst_page, destvacpage, &ec, &Ctid, vtmove[ti].cleanVpd); --- 2430,2436 ---- tuple.t_data = (HeapTupleHeader) PageGetItem(Cpage, Citemid); tuple_len = tuple.t_len = ItemIdGetLength(Citemid); ! move_chain_tuple(vacrelstats, onerel, Cbuf, Cpage, &tuple, dst_buffer, dst_page, destvacpage, &ec, &Ctid, vtmove[ti].cleanVpd); *************** *** 2511,2517 **** dst_page = BufferGetPage(dst_buffer); /* if this page was not used before - clean it */ if (!PageIsEmpty(dst_page) && dst_vacpage->offsets_used == 0) ! vacuum_page(onerel, dst_buffer, dst_vacpage); } else LockBuffer(dst_buffer, BUFFER_LOCK_EXCLUSIVE); --- 2516,2522 ---- dst_page = BufferGetPage(dst_buffer); /* if this page was not used before - clean it */ if (!PageIsEmpty(dst_page) && dst_vacpage->offsets_used == 0) ! vacuum_page(vacrelstats, onerel, dst_buffer, dst_vacpage); } else LockBuffer(dst_buffer, BUFFER_LOCK_EXCLUSIVE); *************** *** 2688,2694 **** LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); page = BufferGetPage(buf); if (!PageIsEmpty(page)) ! vacuum_page(onerel, buf, *curpage); UnlockReleaseBuffer(buf); } } --- 2693,2699 ---- LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); page = BufferGetPage(buf); if (!PageIsEmpty(page)) ! vacuum_page(vacrelstats, onerel, buf, *curpage); UnlockReleaseBuffer(buf); } } *************** *** 2824,2830 **** recptr = log_heap_clean(onerel, buf, NULL, 0, NULL, 0, unused, uncnt, ! false); PageSetLSN(page, recptr); PageSetTLI(page, ThisTimeLineID); } --- 2829,2835 ---- recptr = log_heap_clean(onerel, buf, NULL, 0, NULL, 0, unused, uncnt, ! 
vacrelstats->latestRemovedXid, false); PageSetLSN(page, recptr); PageSetTLI(page, ThisTimeLineID); } *************** *** 2871,2877 **** * already too long and almost unreadable. */ static void ! move_chain_tuple(Relation rel, Buffer old_buf, Page old_page, HeapTuple old_tup, Buffer dst_buf, Page dst_page, VacPage dst_vacpage, ExecContext ec, ItemPointer ctid, bool cleanVpd) --- 2876,2882 ---- * already too long and almost unreadable. */ static void ! move_chain_tuple(VRelStats *vacrelstats, Relation rel, Buffer old_buf, Page old_page, HeapTuple old_tup, Buffer dst_buf, Page dst_page, VacPage dst_vacpage, ExecContext ec, ItemPointer ctid, bool cleanVpd) *************** *** 2927,2933 **** int sv_offsets_used = dst_vacpage->offsets_used; dst_vacpage->offsets_used = 0; ! vacuum_page(rel, dst_buf, dst_vacpage); dst_vacpage->offsets_used = sv_offsets_used; } --- 2932,2938 ---- int sv_offsets_used = dst_vacpage->offsets_used; dst_vacpage->offsets_used = 0; ! vacuum_page(vacrelstats, rel, dst_buf, dst_vacpage); dst_vacpage->offsets_used = sv_offsets_used; } *************** *** 3225,3231 **** buf = ReadBufferExtended(onerel, MAIN_FORKNUM, (*vacpage)->blkno, RBM_NORMAL, vac_strategy); LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); ! vacuum_page(onerel, buf, *vacpage); UnlockReleaseBuffer(buf); } } --- 3230,3236 ---- buf = ReadBufferExtended(onerel, MAIN_FORKNUM, (*vacpage)->blkno, RBM_NORMAL, vac_strategy); LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); ! vacuum_page(vacrelstats, onerel, buf, *vacpage); UnlockReleaseBuffer(buf); } } *************** *** 3252,3258 **** * Caller must hold pin and lock on buffer. */ static void ! vacuum_page(Relation onerel, Buffer buffer, VacPage vacpage) { Page page = BufferGetPage(buffer); int i; --- 3257,3263 ---- * Caller must hold pin and lock on buffer. */ static void ! vacuum_page(VRelStats *vacrelstats, Relation onerel, Buffer buffer, VacPage vacpage) { Page page = BufferGetPage(buffer); int i; *************** *** 3281,3287 **** recptr = log_heap_clean(onerel, buffer, NULL, 0, NULL, 0, vacpage->offsets, vacpage->offsets_free, ! false); PageSetLSN(page, recptr); PageSetTLI(page, ThisTimeLineID); } --- 3286,3292 ---- recptr = log_heap_clean(onerel, buffer, NULL, 0, NULL, 0, vacpage->offsets, vacpage->offsets_free, ! vacrelstats->latestRemovedXid, false); PageSetLSN(page, recptr); PageSetTLI(page, ThisTimeLineID); } Index: src/backend/commands/vacuumlazy.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/commands/vacuumlazy.c,v retrieving revision 1.109 diff -c -r1.109 vacuumlazy.c *** src/backend/commands/vacuumlazy.c 31 Oct 2008 15:05:00 -0000 1.109 --- src/backend/commands/vacuumlazy.c 1 Nov 2008 14:49:38 -0000 *************** *** 87,92 **** --- 87,93 ---- int max_dead_tuples; /* # slots allocated in array */ ItemPointer dead_tuples; /* array of ItemPointerData */ int num_index_scans; + TransactionId latestRemovedXid; } LVRelStats; *************** *** 217,222 **** --- 218,253 ---- } } + /* + * For Hot Standby we need to know the highest transaction id that will + * be removed by any change. VACUUM proceeds in a number of passes so + * we need to consider how each pass operates. The first pass runs + * heap_page_prune(), which can issue XLOG_HEAP2_CLEAN records as it + * progresses - these will have a latestRemovedXid on each record. + * In many cases this removes all of the tuples to be removed. + * Then we look at tuples to be removed, but do not actually remove them + * until phase three. 
However, index records for those rows are removed + * in phase two and index blocks do not have MVCC information attached. + * So before we can allow removal of *any* index tuples we need to issue + * a WAL record indicating what the latestRemovedXid will be at the end + * of phase three. This then allows Hot Standby queries to block at the + * correct place, i.e. before phase two, rather than during phase three + * as we issue more XLOG_HEAP2_CLEAN records. If we need to run multiple + * phase two/three because of memory constraints we need to issue multiple + * log records also. + */ + static void + vacuum_log_cleanup_info(Relation rel, LVRelStats *vacrelstats) + { + /* + * No need to log changes for temp tables, they do not contain + * data visible on the standby server. + */ + if (rel->rd_istemp) + return; + + (void) log_heap_cleanup_info(rel->rd_node, vacrelstats->latestRemovedXid); + } /* * lazy_scan_heap() -- scan an open heap relation *************** *** 264,269 **** --- 295,301 ---- nblocks = RelationGetNumberOfBlocks(onerel); vacrelstats->rel_pages = nblocks; vacrelstats->nonempty_pages = 0; + vacrelstats->latestRemovedXid = InvalidTransactionId; lazy_space_alloc(vacrelstats, nblocks); *************** *** 289,294 **** --- 321,329 ---- if ((vacrelstats->max_dead_tuples - vacrelstats->num_dead_tuples) < MaxHeapTuplesPerPage && vacrelstats->num_dead_tuples > 0) { + /* Log cleanup info before we touch indexes */ + vacuum_log_cleanup_info(onerel, vacrelstats); + /* Remove index entries */ for (i = 0; i < nindexes; i++) lazy_vacuum_index(Irel[i], *************** *** 473,478 **** --- 508,515 ---- if (tupgone) { lazy_record_dead_tuple(vacrelstats, &(tuple.t_self)); + HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data, + &vacrelstats->latestRemovedXid); tups_vacuumed += 1; } else *************** *** 551,556 **** --- 588,596 ---- /* XXX put a threshold on min number of tuples here? */ if (vacrelstats->num_dead_tuples > 0) { + /* Log cleanup info before we touch indexes */ + vacuum_log_cleanup_info(onerel, vacrelstats); + /* Remove index entries */ for (i = 0; i < nindexes; i++) lazy_vacuum_index(Irel[i], *************** *** 688,694 **** recptr = log_heap_clean(onerel, buffer, NULL, 0, NULL, 0, unused, uncnt, ! false); PageSetLSN(page, recptr); PageSetTLI(page, ThisTimeLineID); } --- 728,734 ---- recptr = log_heap_clean(onerel, buffer, NULL, 0, NULL, 0, unused, uncnt, ! vacrelstats->latestRemovedXid, false); PageSetLSN(page, recptr); PageSetTLI(page, ThisTimeLineID); } Index: src/backend/postmaster/bgwriter.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/postmaster/bgwriter.c,v retrieving revision 1.53 diff -c -r1.53 bgwriter.c *** src/backend/postmaster/bgwriter.c 14 Oct 2008 08:06:39 -0000 1.53 --- src/backend/postmaster/bgwriter.c 1 Nov 2008 14:49:38 -0000 *************** *** 49,54 **** --- 49,55 ---- #include #include "access/xlog_internal.h" + #include "catalog/pg_control.h" #include "libpq/pqsignal.h" #include "miscadmin.h" #include "pgstat.h" *************** *** 129,134 **** --- 130,142 ---- int ckpt_flags; /* checkpoint flags, as defined in xlog.h */ + /* + * When the Startup process wants bgwriter to perform a restartpoint, it + * sets these fields so that we can update the control file afterwards. 
+ */ + XLogRecPtr ReadPtr; /* Requested log pointer */ + CheckPoint restartPoint; /* restartPoint data for ControlFile */ + uint32 num_backend_writes; /* counts non-bgwriter buffer writes */ int num_requests; /* current # of requests */ *************** *** 165,171 **** /* these values are valid when ckpt_active is true: */ static pg_time_t ckpt_start_time; ! static XLogRecPtr ckpt_start_recptr; static double ckpt_cached_elapsed; static pg_time_t last_checkpoint_time; --- 173,179 ---- /* these values are valid when ckpt_active is true: */ static pg_time_t ckpt_start_time; ! static XLogRecPtr ckpt_start_recptr; /* not used if IsRecoveryProcessingMode */ static double ckpt_cached_elapsed; static pg_time_t last_checkpoint_time; *************** *** 197,202 **** --- 205,211 ---- { sigjmp_buf local_sigjmp_buf; MemoryContext bgwriter_context; + bool BgWriterRecoveryMode; BgWriterShmem->bgwriter_pid = MyProcPid; am_bg_writer = true; *************** *** 355,370 **** */ PG_SETMASK(&UnBlockSig); /* * Loop forever */ for (;;) { - bool do_checkpoint = false; - int flags = 0; - pg_time_t now; - int elapsed_secs; - /* * Emergency bailout if postmaster has died. This is to avoid the * necessity for manual cleanup of all postmaster children. --- 364,380 ---- */ PG_SETMASK(&UnBlockSig); + BgWriterRecoveryMode = IsRecoveryProcessingMode(); + + if (BgWriterRecoveryMode) + elog(DEBUG1, "bgwriter starting during recovery, pid = %u", + BgWriterShmem->bgwriter_pid); + /* * Loop forever */ for (;;) { /* * Emergency bailout if postmaster has died. This is to avoid the * necessity for manual cleanup of all postmaster children. *************** *** 372,499 **** if (!PostmasterIsAlive(true)) exit(1); - /* - * Process any requests or signals received recently. - */ - AbsorbFsyncRequests(); - if (got_SIGHUP) { got_SIGHUP = false; ProcessConfigFile(PGC_SIGHUP); } - if (checkpoint_requested) - { - checkpoint_requested = false; - do_checkpoint = true; - BgWriterStats.m_requested_checkpoints++; - } - if (shutdown_requested) - { - /* - * From here on, elog(ERROR) should end with exit(1), not send - * control back to the sigsetjmp block above - */ - ExitOnAnyError = true; - /* Close down the database */ - ShutdownXLOG(0, 0); - /* Normal exit from the bgwriter is here */ - proc_exit(0); /* done */ - } ! /* ! * Force a checkpoint if too much time has elapsed since the last one. ! * Note that we count a timed checkpoint in stats only when this ! * occurs without an external request, but we set the CAUSE_TIME flag ! * bit even if there is also an external request. ! */ ! now = (pg_time_t) time(NULL); ! elapsed_secs = now - last_checkpoint_time; ! if (elapsed_secs >= CheckPointTimeout) { ! if (!do_checkpoint) ! BgWriterStats.m_timed_checkpoints++; ! do_checkpoint = true; ! flags |= CHECKPOINT_CAUSE_TIME; } ! ! /* ! * Do a checkpoint if requested, otherwise do one cycle of ! * dirty-buffer writing. ! */ ! if (do_checkpoint) { ! /* use volatile pointer to prevent code rearrangement */ ! volatile BgWriterShmemStruct *bgs = BgWriterShmem; ! ! /* ! * Atomically fetch the request flags to figure out what kind of a ! * checkpoint we should perform, and increase the started-counter ! * to acknowledge that we've started a new checkpoint. ! */ ! SpinLockAcquire(&bgs->ckpt_lck); ! flags |= bgs->ckpt_flags; ! bgs->ckpt_flags = 0; ! bgs->ckpt_started++; ! SpinLockRelease(&bgs->ckpt_lck); /* ! * We will warn if (a) too soon since last checkpoint (whatever ! * caused it) and (b) somebody set the CHECKPOINT_CAUSE_XLOG flag ! 
* since the last checkpoint start. Note in particular that this ! * implementation will not generate warnings caused by ! * CheckPointTimeout < CheckPointWarning. */ ! if ((flags & CHECKPOINT_CAUSE_XLOG) && ! elapsed_secs < CheckPointWarning) ! ereport(LOG, ! (errmsg("checkpoints are occurring too frequently (%d seconds apart)", ! elapsed_secs), ! errhint("Consider increasing the configuration parameter \"checkpoint_segments\"."))); ! /* ! * Initialize bgwriter-private variables used during checkpoint. ! */ ! ckpt_active = true; ! ckpt_start_recptr = GetInsertRecPtr(); ! ckpt_start_time = now; ! ckpt_cached_elapsed = 0; /* ! * Do the checkpoint. */ ! CreateCheckPoint(flags); /* ! * After any checkpoint, close all smgr files. This is so we ! * won't hang onto smgr references to deleted files indefinitely. */ ! smgrcloseall(); ! /* ! * Indicate checkpoint completion to any waiting backends. ! */ ! SpinLockAcquire(&bgs->ckpt_lck); ! bgs->ckpt_done = bgs->ckpt_started; ! SpinLockRelease(&bgs->ckpt_lck); ! ckpt_active = false; ! ! /* ! * Note we record the checkpoint start time not end time as ! * last_checkpoint_time. This is so that time-driven checkpoints ! * happen at a predictable spacing. ! */ ! last_checkpoint_time = now; } - else - BgBufferSync(); - - /* Check for archive_timeout and switch xlog files if necessary. */ - CheckArchiveTimeout(); - - /* Nap for the configured time. */ - BgWriterNap(); } } --- 382,595 ---- if (!PostmasterIsAlive(true)) exit(1); if (got_SIGHUP) { got_SIGHUP = false; ProcessConfigFile(PGC_SIGHUP); } ! if (BgWriterRecoveryMode) { ! if (shutdown_requested) ! { ! /* ! * From here on, elog(ERROR) should end with exit(1), not send ! * control back to the sigsetjmp block above ! */ ! ExitOnAnyError = true; ! /* Normal exit from the bgwriter is here */ ! proc_exit(0); /* done */ ! } ! ! if (!IsRecoveryProcessingMode()) ! { ! elog(DEBUG2, "bgwriter changing from recovery to normal mode"); ! ! InitXLOGAccess(); ! BgWriterRecoveryMode = false; ! ! /* ! * Start time-driven events from now ! */ ! last_checkpoint_time = last_xlog_switch_time = (pg_time_t) time(NULL); ! ! /* ! * Notice that we do *not* act on a checkpoint_requested ! * state at this point. We have changed mode, so we wish to ! * perform a checkpoint not a restartpoint. ! */ ! continue; ! } ! ! if (checkpoint_requested) ! { ! XLogRecPtr ReadPtr; ! CheckPoint restartPoint; ! ! checkpoint_requested = false; ! ! /* ! * Initialize bgwriter-private variables used during checkpoint. ! */ ! ckpt_active = true; ! ckpt_start_time = (pg_time_t) time(NULL); ! ckpt_cached_elapsed = 0; ! ! /* ! * Get the requested values from shared memory that the ! * Startup process has put there for us. ! */ ! SpinLockAcquire(&BgWriterShmem->ckpt_lck); ! ReadPtr = BgWriterShmem->ReadPtr; ! memcpy(&restartPoint, &BgWriterShmem->restartPoint, sizeof(CheckPoint)); ! SpinLockRelease(&BgWriterShmem->ckpt_lck); ! ! /* Use smoothed writes, until interrupted if ever */ ! CreateRestartPoint(ReadPtr, &restartPoint, 0); ! ! /* ! * After any checkpoint, close all smgr files. This is so we ! * won't hang onto smgr references to deleted files indefinitely. ! */ ! smgrcloseall(); ! ! ckpt_active = false; ! checkpoint_requested = false; ! } ! else ! { ! /* Clean buffers dirtied by recovery */ ! BgBufferSync(); ! ! /* Nap for the configured time. */ ! BgWriterNap(); ! } } ! else /* Normal processing */ { ! bool do_checkpoint = false; ! int flags = 0; ! pg_time_t now; ! int elapsed_secs; /* ! * Process any requests or signals received recently. */ ! 
AbsorbFsyncRequests(); ! if (checkpoint_requested) ! { ! checkpoint_requested = false; ! do_checkpoint = true; ! BgWriterStats.m_requested_checkpoints++; ! } ! if (shutdown_requested) ! { ! /* ! * From here on, elog(ERROR) should end with exit(1), not send ! * control back to the sigsetjmp block above ! */ ! ExitOnAnyError = true; ! /* Close down the database */ ! ShutdownXLOG(0, 0); ! /* Normal exit from the bgwriter is here */ ! proc_exit(0); /* done */ ! } /* ! * Force a checkpoint if too much time has elapsed since the last one. ! * Note that we count a timed checkpoint in stats only when this ! * occurs without an external request, but we set the CAUSE_TIME flag ! * bit even if there is also an external request. */ ! now = (pg_time_t) time(NULL); ! elapsed_secs = now - last_checkpoint_time; ! if (elapsed_secs >= CheckPointTimeout) ! { ! if (!do_checkpoint) ! BgWriterStats.m_timed_checkpoints++; ! do_checkpoint = true; ! flags |= CHECKPOINT_CAUSE_TIME; ! } /* ! * Do a checkpoint if requested, otherwise do one cycle of ! * dirty-buffer writing. */ ! if (do_checkpoint) ! { ! /* use volatile pointer to prevent code rearrangement */ ! volatile BgWriterShmemStruct *bgs = BgWriterShmem; ! ! /* ! * Atomically fetch the request flags to figure out what kind of a ! * checkpoint we should perform, and increase the started-counter ! * to acknowledge that we've started a new checkpoint. ! */ ! SpinLockAcquire(&bgs->ckpt_lck); ! flags |= bgs->ckpt_flags; ! bgs->ckpt_flags = 0; ! bgs->ckpt_started++; ! SpinLockRelease(&bgs->ckpt_lck); ! ! /* ! * We will warn if (a) too soon since last checkpoint (whatever ! * caused it) and (b) somebody set the CHECKPOINT_CAUSE_XLOG flag ! * since the last checkpoint start. Note in particular that this ! * implementation will not generate warnings caused by ! * CheckPointTimeout < CheckPointWarning. ! */ ! if ((flags & CHECKPOINT_CAUSE_XLOG) && ! elapsed_secs < CheckPointWarning) ! ereport(LOG, ! (errmsg("checkpoints are occurring too frequently (%d seconds apart)", ! elapsed_secs), ! errhint("Consider increasing the configuration parameter \"checkpoint_segments\"."))); ! ! /* ! * Initialize bgwriter-private variables used during checkpoint. ! */ ! ckpt_active = true; ! ckpt_start_recptr = GetInsertRecPtr(); ! ckpt_start_time = now; ! ckpt_cached_elapsed = 0; ! ! /* ! * Do the checkpoint. ! */ ! CreateCheckPoint(flags); ! ! /* ! * After any checkpoint, close all smgr files. This is so we ! * won't hang onto smgr references to deleted files indefinitely. ! */ ! smgrcloseall(); ! ! /* ! * Indicate checkpoint completion to any waiting backends. ! */ ! SpinLockAcquire(&bgs->ckpt_lck); ! bgs->ckpt_done = bgs->ckpt_started; ! SpinLockRelease(&bgs->ckpt_lck); ! ! ckpt_active = false; ! ! /* ! * Note we record the checkpoint start time not end time as ! * last_checkpoint_time. This is so that time-driven checkpoints ! * happen at a predictable spacing. ! */ ! last_checkpoint_time = now; ! } ! else ! BgBufferSync(); ! /* Check for archive_timeout and switch xlog files if necessary. */ ! CheckArchiveTimeout(); ! /* Nap for the configured time. */ ! BgWriterNap(); } } } *************** *** 586,592 **** (ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested)) break; pg_usleep(1000000L); ! AbsorbFsyncRequests(); udelay -= 1000000L; } --- 682,689 ---- (ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested)) break; pg_usleep(1000000L); ! if (!IsRecoveryProcessingMode()) ! 
AbsorbFsyncRequests(); udelay -= 1000000L; } *************** *** 640,645 **** --- 737,755 ---- if (!am_bg_writer) return; + /* Perform minimal duties during recovery and skip wait if requested */ + if (IsRecoveryProcessingMode()) + { + BgBufferSync(); + + if (!shutdown_requested && + !checkpoint_requested && + IsCheckpointOnSchedule(progress)) + BgWriterNap(); + + return; + } + /* * Perform the usual bgwriter duties and take a nap, unless we're behind * schedule, in which case we just try to catch up as quickly as possible. *************** *** 714,729 **** * However, it's good enough for our purposes, we're only calculating an * estimate anyway. */ ! recptr = GetInsertRecPtr(); ! elapsed_xlogs = ! (((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile + ! ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) / ! CheckPointSegments; ! ! if (progress < elapsed_xlogs) { ! ckpt_cached_elapsed = elapsed_xlogs; ! return false; } /* --- 824,842 ---- * However, it's good enough for our purposes, we're only calculating an * estimate anyway. */ ! if (!IsRecoveryProcessingMode()) { ! recptr = GetInsertRecPtr(); ! elapsed_xlogs = ! (((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile + ! ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) / ! CheckPointSegments; ! ! if (progress < elapsed_xlogs) ! { ! ckpt_cached_elapsed = elapsed_xlogs; ! return false; ! } } /* *************** *** 965,970 **** --- 1078,1156 ---- } /* + * Always runs in Startup process (see xlog.c) + */ + void + RequestRestartPoint(const XLogRecPtr ReadPtr, const CheckPoint *restartPoint, bool sendToBGWriter) + { + /* + * Should we just do it ourselves? + */ + if (!IsPostmasterEnvironment || !sendToBGWriter) + { + CreateRestartPoint(ReadPtr, restartPoint, CHECKPOINT_IMMEDIATE); + return; + } + + /* + * Push requested values into shared memory, then signal to request restartpoint. + */ + if (BgWriterShmem->bgwriter_pid == 0) + elog(LOG, "could not request restartpoint because bgwriter not running"); + + SpinLockAcquire(&BgWriterShmem->ckpt_lck); + BgWriterShmem->ReadPtr = ReadPtr; + memcpy(&BgWriterShmem->restartPoint, restartPoint, sizeof(CheckPoint)); + SpinLockRelease(&BgWriterShmem->ckpt_lck); + + if (kill(BgWriterShmem->bgwriter_pid, SIGINT) != 0) + elog(LOG, "could not signal for restartpoint: %m"); + } + + /* + * Sends another checkpoint request signal to bgwriter, which causes it + * to avoid smoothed writes and continue processing as if it had been + * called with CHECKPOINT_IMMEDIATE. This is used at the end of recovery. + */ + void + RequestRestartPointCompletion(void) + { + if (BgWriterShmem->bgwriter_pid != 0 && + kill(BgWriterShmem->bgwriter_pid, SIGINT) != 0) + elog(LOG, "could not signal for restartpoint immediate: %m"); + } + + XLogRecPtr + GetRedoLocationForArchiveCheckpoint(void) + { + XLogRecPtr redo; + + SpinLockAcquire(&BgWriterShmem->ckpt_lck); + redo = BgWriterShmem->ReadPtr; + SpinLockRelease(&BgWriterShmem->ckpt_lck); + + return redo; + } + + /* + * Store the information needed for a checkpoint at the end of recovery. + * Returns true if bgwriter can perform checkpoint, or false if bgwriter + * not active or otherwise unable to comply. 
+ */ + bool + SetRedoLocationForArchiveCheckpoint(XLogRecPtr redo) + { + SpinLockAcquire(&BgWriterShmem->ckpt_lck); + BgWriterShmem->ReadPtr = redo; + SpinLockRelease(&BgWriterShmem->ckpt_lck); + + if (BgWriterShmem->bgwriter_pid == 0 || !IsPostmasterEnvironment) + return false; + + return true; + } + + /* * ForwardFsyncRequest * Forward a file-fsync request from a backend to the bgwriter * Index: src/backend/postmaster/postmaster.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/postmaster/postmaster.c,v retrieving revision 1.566 diff -c -r1.566 postmaster.c *** src/backend/postmaster/postmaster.c 28 Oct 2008 12:10:43 -0000 1.566 --- src/backend/postmaster/postmaster.c 1 Nov 2008 14:49:38 -0000 *************** *** 230,237 **** * We use a simple state machine to control startup, shutdown, and * crash recovery (which is rather like shutdown followed by startup). * ! * Normal child backends can only be launched when we are in PM_RUN state. ! * (We also allow it in PM_WAIT_BACKUP state, but only for superusers.) * In other states we handle connection requests by launching "dead_end" * child processes, which will simply send the client an error message and * quit. (We track these in the BackendList so that we can know when they --- 230,239 ---- * We use a simple state machine to control startup, shutdown, and * crash recovery (which is rather like shutdown followed by startup). * ! * Normal child backends can only be launched when we are in PM_RUN or ! * PM_RECOVERY state. Any transaction started in PM_RECOVERY state will ! * be read-only for the whole of its life. (We also allow launch of normal ! * child backends in PM_WAIT_BACKUP state, but only for superusers.) * In other states we handle connection requests by launching "dead_end" * child processes, which will simply send the client an error message and * quit. (We track these in the BackendList so that we can know when they *************** *** 254,259 **** --- 256,266 ---- { PM_INIT, /* postmaster starting */ PM_STARTUP, /* waiting for startup subprocess */ + PM_RECOVERY, /* consistent recovery mode; state only + * entered for archive and streaming recovery, + * and only after the point where the + * all data is in consistent state. + */ PM_RUN, /* normal "database is alive" state */ PM_WAIT_BACKUP, /* waiting for online backup mode to end */ PM_WAIT_BACKENDS, /* waiting for live backends to exit */ *************** *** 1302,1308 **** * state that prevents it, start one. It doesn't matter if this * fails, we'll just try again later. */ ! if (BgWriterPID == 0 && pmState == PM_RUN) BgWriterPID = StartBackgroundWriter(); /* --- 1309,1315 ---- * state that prevents it, start one. It doesn't matter if this * fails, we'll just try again later. */ ! 
if (BgWriterPID == 0 && (pmState == PM_RUN || pmState == PM_RECOVERY)) BgWriterPID = StartBackgroundWriter(); /* *************** *** 1651,1661 **** (errcode(ERRCODE_CANNOT_CONNECT_NOW), errmsg("the database system is shutting down"))); break; - case CAC_RECOVERY: - ereport(FATAL, - (errcode(ERRCODE_CANNOT_CONNECT_NOW), - errmsg("the database system is in recovery mode"))); - break; case CAC_TOOMANY: ereport(FATAL, (errcode(ERRCODE_TOO_MANY_CONNECTIONS), --- 1658,1663 ---- *************** *** 1664,1669 **** --- 1666,1672 ---- case CAC_WAITBACKUP: /* OK for now, will check in InitPostgres */ break; + case CAC_RECOVERY: case CAC_OK: break; } *************** *** 1982,1991 **** ereport(LOG, (errmsg("received smart shutdown request"))); ! if (pmState == PM_RUN) { /* autovacuum workers are told to shut down immediately */ ! SignalAutovacWorkers(SIGTERM); /* and the autovac launcher too */ if (AutoVacPID != 0) signal_child(AutoVacPID, SIGTERM); --- 1985,1995 ---- ereport(LOG, (errmsg("received smart shutdown request"))); ! if (pmState == PM_RUN || pmState == PM_RECOVERY) { /* autovacuum workers are told to shut down immediately */ ! if (pmState == PM_RUN) ! SignalAutovacWorkers(SIGTERM); /* and the autovac launcher too */ if (AutoVacPID != 0) signal_child(AutoVacPID, SIGTERM); *************** *** 2019,2025 **** if (StartupPID != 0) signal_child(StartupPID, SIGTERM); ! if (pmState == PM_RUN || pmState == PM_WAIT_BACKUP) { ereport(LOG, (errmsg("aborting any active transactions"))); --- 2023,2029 ---- if (StartupPID != 0) signal_child(StartupPID, SIGTERM); ! if (pmState == PM_RUN || pmState == PM_RECOVERY || pmState == PM_WAIT_BACKUP) { ereport(LOG, (errmsg("aborting any active transactions"))); *************** *** 2115,2122 **** */ if (pid == StartupPID) { StartupPID = 0; ! Assert(pmState == PM_STARTUP); /* FATAL exit of startup is treated as catastrophic */ if (!EXIT_STATUS_0(exitstatus)) --- 2119,2129 ---- */ if (pid == StartupPID) { + bool leavingRecovery = (pmState == PM_RECOVERY); + StartupPID = 0; ! Assert(pmState == PM_STARTUP || pmState == PM_RECOVERY || ! pmState == PM_WAIT_BACKUP || pmState == PM_WAIT_BACKENDS); /* FATAL exit of startup is treated as catastrophic */ if (!EXIT_STATUS_0(exitstatus)) *************** *** 2124,2130 **** LogChildExit(LOG, _("startup process"), pid, exitstatus); ereport(LOG, ! (errmsg("aborting startup due to startup process failure"))); ExitPostmaster(1); } --- 2131,2137 ---- LogChildExit(LOG, _("startup process"), pid, exitstatus); ereport(LOG, ! (errmsg("aborting startup due to startup process failure"))); ExitPostmaster(1); } *************** *** 2157,2166 **** load_role(); /* ! * Crank up the background writer. It doesn't matter if this ! * fails, we'll just try again later. */ ! Assert(BgWriterPID == 0); BgWriterPID = StartBackgroundWriter(); /* --- 2164,2173 ---- load_role(); /* ! * Check whether we need to start background writer, if not ! * already running. */ ! if (BgWriterPID == 0) BgWriterPID = StartBackgroundWriter(); /* *************** *** 2177,2184 **** PgStatPID = pgstat_start(); /* at this point we are really open for business */ ! ereport(LOG, ! (errmsg("database system is ready to accept connections"))); continue; } --- 2184,2195 ---- PgStatPID = pgstat_start(); /* at this point we are really open for business */ ! if (leavingRecovery) ! ereport(LOG, ! (errmsg("database can now be accessed with read and write transactions"))); ! else ! ereport(LOG, ! 
(errmsg("database system is ready to accept connections"))); continue; } *************** *** 2898,2904 **** bn->pid = pid; bn->cancel_key = MyCancelKey; bn->is_autovacuum = false; ! bn->dead_end = (port->canAcceptConnections != CAC_OK && port->canAcceptConnections != CAC_WAITBACKUP); DLAddHead(BackendList, DLNewElem(bn)); #ifdef EXEC_BACKEND --- 2909,2916 ---- bn->pid = pid; bn->cancel_key = MyCancelKey; bn->is_autovacuum = false; ! bn->dead_end = (!(port->canAcceptConnections == CAC_RECOVERY || ! port->canAcceptConnections == CAC_OK) && port->canAcceptConnections != CAC_WAITBACKUP); DLAddHead(BackendList, DLNewElem(bn)); #ifdef EXEC_BACKEND *************** *** 3845,3850 **** --- 3857,3909 ---- PG_SETMASK(&BlockSig); + if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_START)) + { + Assert(pmState == PM_STARTUP); + + /* + * Go to shutdown mode if a shutdown request was pending. + */ + if (Shutdown > NoShutdown) + { + pmState = PM_WAIT_BACKENDS; + /* PostmasterStateMachine logic does the rest */ + } + else + { + /* + * Startup process has entered recovery + */ + pmState = PM_RECOVERY; + + /* + * Load the flat authorization file into postmaster's cache. The + * startup process won't have recomputed this from the database yet, + * so we it may change following recovery. + */ + load_role(); + + /* + * Crank up the background writer. It doesn't matter if this + * fails, we'll just try again later. + */ + Assert(BgWriterPID == 0); + BgWriterPID = StartBackgroundWriter(); + + /* + * Likewise, start other special children as needed. + */ + Assert(PgStatPID == 0); + PgStatPID = pgstat_start(); + + /* We can now accept read-only connections */ + ereport(LOG, + (errmsg("database system is ready to accept connections"))); + ereport(LOG, + (errmsg("database can now be accessed with read only transactions"))); + } + } + if (CheckPostmasterSignal(PMSIGNAL_PASSWORD_CHANGE)) { /* Index: src/backend/storage/buffer/README =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/storage/buffer/README,v retrieving revision 1.14 diff -c -r1.14 README *** src/backend/storage/buffer/README 21 Mar 2008 13:23:28 -0000 1.14 --- src/backend/storage/buffer/README 1 Nov 2008 14:49:38 -0000 *************** *** 264,266 **** --- 264,275 ---- This ensures that the page image transferred to disk is reasonably consistent. We might miss a hint-bit update or two but that isn't a problem, for the same reasons mentioned under buffer access rules. + + As of 8.4, background writer starts during recovery mode when there is + some form of potentially extended recovery to perform. It performs an + identical service to normal processing, except that checkpoints it + writes are technically restartpoints. Flushing outstanding WAL for dirty + buffers is also skipped, though there shouldn't ever be new WAL entries + at that time in any case. We could choose to start background writer + immediately but we hold off until we can prove the database is in a + consistent state so that postmaster has a single, clean state change. 
Index: src/backend/storage/buffer/bufmgr.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/storage/buffer/bufmgr.c,v retrieving revision 1.240 diff -c -r1.240 bufmgr.c *** src/backend/storage/buffer/bufmgr.c 31 Oct 2008 15:05:00 -0000 1.240 --- src/backend/storage/buffer/bufmgr.c 1 Nov 2008 14:49:38 -0000 *************** *** 70,76 **** /* local state for LockBufferForCleanup */ static volatile BufferDesc *PinCountWaitBuf = NULL; ! static Buffer ReadBuffer_common(SMgrRelation reln, bool isLocalBuf, ForkNumber forkNum, BlockNumber blockNum, --- 70,78 ---- /* local state for LockBufferForCleanup */ static volatile BufferDesc *PinCountWaitBuf = NULL; ! static long CleanupWaitSecs = 0; ! static int CleanupWaitUSecs = 0; ! static bool CleanupWaitStats = false; static Buffer ReadBuffer_common(SMgrRelation reln, bool isLocalBuf, ForkNumber forkNum, BlockNumber blockNum, *************** *** 2324,2329 **** --- 2326,2378 ---- } /* + * On standby servers only the Startup process applies Cleanup. As a result + * a single buffer pin can be enough to effectively halt recovery for short + * periods. We need special instrumentation to monitor this so we can judge + * whether additional measures are required to control the negative effects. + */ + void + StartCleanupDelayStats(void) + { + CleanupWaitSecs = 0; + CleanupWaitUSecs = 0; + CleanupWaitStats = true; + } + + void + EndCleanupDelayStats(void) + { + CleanupWaitStats = false; + } + + /* + * Called by Startup process whenever we request restartpoint + */ + void + ReportCleanupDelayStats(void) + { + elog(trace_recovery(DEBUG2), "cleanup wait total=%ld.%03d s", + CleanupWaitSecs, CleanupWaitUSecs / 1000); + } + + static void + CleanupDelayStats(TimestampTz start_ts, TimestampTz end_ts) + { + long wait_secs; + int wait_usecs; + + TimestampDifference(start_ts, end_ts, &wait_secs, &wait_usecs); + + CleanupWaitSecs +=wait_secs; + CleanupWaitUSecs +=wait_usecs; + if (CleanupWaitUSecs > 999999) + { + CleanupWaitSecs += 1; + CleanupWaitUSecs -= 1000000; + } + } + + /* * LockBufferForCleanup - lock a buffer in preparation for deleting items * * Items may be deleted from a disk page only when the caller (a) holds an *************** *** 2366,2371 **** --- 2415,2422 ---- for (;;) { + TimestampTz start_ts = 0; + /* Try to acquire lock */ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); LockBufHdr(bufHdr); *************** *** 2388,2396 **** --- 2439,2452 ---- PinCountWaitBuf = bufHdr; UnlockBufHdr(bufHdr); LockBuffer(buffer, BUFFER_LOCK_UNLOCK); + if (CleanupWaitStats) + start_ts = GetCurrentTimestamp(); /* Wait to be signaled by UnpinBuffer() */ ProcWaitForSignal(); PinCountWaitBuf = NULL; + if (CleanupWaitStats) + CleanupDelayStats(start_ts, GetCurrentTimestamp()); + /* Loop back and try again */ } } Index: src/backend/storage/freespace/freespace.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/storage/freespace/freespace.c,v retrieving revision 1.66 diff -c -r1.66 freespace.c *** src/backend/storage/freespace/freespace.c 31 Oct 2008 19:40:27 -0000 1.66 --- src/backend/storage/freespace/freespace.c 1 Nov 2008 15:42:15 -0000 *************** *** 222,228 **** blkno = fsm_logical_to_physical(addr); /* If the page doesn't exist already, extend */ ! 
buf = XLogReadBufferExtended(rnode, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR); page = BufferGetPage(buf); if (PageIsNew(page)) PageInit(page, BLCKSZ, 0); --- 222,229 ---- blkno = fsm_logical_to_physical(addr); /* If the page doesn't exist already, extend */ ! buf = XLogReadBufferExtended(rnode, FSM_FORKNUM, blkno, ! RBM_ZERO_ON_ERROR, BUFFER_LOCK_CLEANUP); page = BufferGetPage(buf); if (PageIsNew(page)) PageInit(page, BLCKSZ, 0); *************** *** 822,828 **** * pages. */ buf = XLogReadBufferExtended(xlrec->node, FSM_FORKNUM, fsmblk, ! RBM_ZERO_ON_ERROR); if (BufferIsValid(buf)) { Page page = BufferGetPage(buf); --- 823,829 ---- * pages. */ buf = XLogReadBufferExtended(xlrec->node, FSM_FORKNUM, fsmblk, ! RBM_ZERO_ON_ERROR, BUFFER_LOCK_CLEANUP); if (BufferIsValid(buf)) { Page page = BufferGetPage(buf); Index: src/backend/storage/ipc/procarray.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/storage/ipc/procarray.c,v retrieving revision 1.46 diff -c -r1.46 procarray.c *** src/backend/storage/ipc/procarray.c 4 Aug 2008 18:03:46 -0000 1.46 --- src/backend/storage/ipc/procarray.c 1 Nov 2008 14:49:38 -0000 *************** *** 17,22 **** --- 17,37 ---- * as are the myProcLocks lists. They can be distinguished from regular * backend PGPROCs at need by checking for pid == 0. * + * The process array now also includes PGPROC structures representing + * transactions being recovered. The xid and subxids fields of these are valid, + * though few other fields are. They can be distinguished from regular backend + * PGPROCs by checking for pid == 0. The proc array also has an + * secondary array of UnobservedXids representing transactions that are + * known to be running on the master but for which we do not yet know the + * slotId, so cannot be assigned to the correct recovery proc. We infer + * the existence of UnobservedXids by watching the sequence of arriving + * xids. This is very important because if we leave those xids out of the + * the snapshot then they will appear to be already complete. Later, when + * they have actually completed this could lead to confusion as to whether + * those xids are visible or not, blowing a huge hole in MVCC. We need 'em. + * We go to extreme lengths to ensure that the number of UnobservedXids is + * both bounded and realistically manageable. There are simpler designs, + * but they lead to unbounded worst case behaviour, so we sweat. * * Portions Copyright (c) 1996-2008, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California *************** *** 33,56 **** #include "access/subtrans.h" #include "access/transam.h" ! #include "access/xact.h" #include "access/twophase.h" #include "miscadmin.h" #include "storage/procarray.h" #include "utils/snapmgr.h" /* Our shared memory area */ typedef struct ProcArrayStruct { int numProcs; /* number of valid procs entries */ ! int maxProcs; /* allocated size of procs array */ /* * We declare procs[] as 1 entry because C wants a fixed-size array, but * actually it is maxProcs entries long. */ PGPROC *procs[1]; /* VARIABLE LENGTH ARRAY */ } ProcArrayStruct; static ProcArrayStruct *procArray; --- 48,86 ---- #include "access/subtrans.h" #include "access/transam.h" ! 
#include "access/xlog.h" #include "access/twophase.h" #include "miscadmin.h" + #include "storage/proc.h" #include "storage/procarray.h" #include "utils/snapmgr.h" + static RunningXactsData CurrentRunningXactsData; + + /* Handy constant for an invalid xlog recptr */ + static const XLogRecPtr InvalidXLogRecPtr = {0, 0}; + + void ProcArrayDisplay(int trace_level); + /* Our shared memory area */ typedef struct ProcArrayStruct { int numProcs; /* number of valid procs entries */ ! int maxProcs; /* allocated size of total procs array */ ! ! int maxRecoveryProcs; /* number of allocated recovery procs */ ! ! int numUnobservedXids; /* number of valid unobserved xids */ ! int maxUnobservedXids; /* allocated size of unobserved array */ /* * We declare procs[] as 1 entry because C wants a fixed-size array, but * actually it is maxProcs entries long. */ PGPROC *procs[1]; /* VARIABLE LENGTH ARRAY */ + + /* ARRAY OF UNOBSERVED TRANSACTION XIDs FOLLOWS */ } ProcArrayStruct; static ProcArrayStruct *procArray; *************** *** 100,107 **** Size size; size = offsetof(ProcArrayStruct, procs); ! size = add_size(size, mul_size(sizeof(PGPROC *), ! add_size(MaxBackends, max_prepared_xacts))); return size; } --- 130,148 ---- Size size; size = offsetof(ProcArrayStruct, procs); ! ! /* Normal processing */ ! /* MyProc slots */ ! size = add_size(size, mul_size(sizeof(PGPROC *), MaxBackends)); ! size = add_size(size, mul_size(sizeof(PGPROC *), max_prepared_xacts)); ! ! /* Recovery processing */ ! ! /* Recovery Procs */ ! size = add_size(size, mul_size(sizeof(PGPROC *), MaxBackends)); ! /* UnobservedXids */ ! size = add_size(size, mul_size(sizeof(TransactionId), MaxBackends)); ! size = add_size(size, mul_size(sizeof(TransactionId), MaxBackends)); return size; } *************** *** 123,130 **** --- 164,203 ---- /* * We're the first - initialize. */ + /* Normal processing */ procArray->numProcs = 0; procArray->maxProcs = MaxBackends + max_prepared_xacts; + + /* Recovery processing */ + procArray->maxRecoveryProcs = MaxBackends; + procArray->maxProcs += procArray->maxRecoveryProcs; + + procArray->maxUnobservedXids = 2 * MaxBackends; + procArray->numUnobservedXids = 0; + + if (!IsUnderPostmaster) + { + int i; + + /* + * Create and add the Procs for recovery emulation. + * + * We do this now, so that we can identify which Recovery Proc + * goes with each normal backend. Normal procs were allocated + * first so we can use the slotId of the *proc* to look up + * the Recovery Proc in the *procarray*. Recovery Procs never + * move around in the procarray, whereas normal procs do. + * e.g. Proc with slotId=7 is always associated with procarray[7] + * for recovery processing. see also + */ + for (i = 0; i < procArray->maxRecoveryProcs; i++) + { + PGPROC *RecoveryProc = InitRecoveryProcess(); + + ProcArrayAdd(RecoveryProc); + } + elog(DEBUG3, "Added %d Recovery Procs", i); + } } } *************** *** 213,218 **** --- 286,338 ---- elog(LOG, "failed to find proc %p in ProcArray", proc); } + /* + * ProcArrayStartRecoveryTransaction + * + * Update Recovery Proc to show transaction is complete. There is no + * locking here. It is either handled by caller, or potentially + * ignored (see comments for GetNewTransactionId()). + * + * In recovery we supply an LSN also, to ensure we can tell which of + * several inputs is the latest information on the state of the proc. 
+ * + * There is no ProcArrayStartNormalTransaction, that is handled by + * GetNewTransactionId in varsup.c + */ + void + ProcArrayStartRecoveryTransaction(PGPROC *proc, TransactionId xid, XLogRecPtr lsn, bool isSubXact) + { + elog(trace_recovery(DEBUG4), + "start recovery xid = %d lsn = %X/%X %s", + xid, lsn.xlogid, lsn.xrecoff, (isSubXact ? "(SUB)" : "")); + /* + * Use volatile pointer to prevent code rearrangement; other backends + * could be examining my subxids info concurrently, and we don't want + * them to see an invalid intermediate state, such as incrementing + * nxids before filling the array entry. Note we are assuming that + * TransactionId and int fetch/store are atomic. + */ + { + volatile PGPROC *myproc = proc; + + proc->lsn = lsn; + + if (!isSubXact) + myproc->xid = xid; + else + { + int nxids = myproc->subxids.nxids; + + if (nxids < PGPROC_MAX_CACHED_SUBXIDS) + { + myproc->subxids.xids[nxids] = xid; + myproc->subxids.nxids = nxids + 1; + } + else + myproc->subxids.overflowed = true; + } + } + } /* * ProcArrayEndTransaction -- mark a transaction as no longer running *************** *** 220,226 **** * This is used interchangeably for commit and abort cases. The transaction * commit/abort must already be reported to WAL and pg_clog. * ! * proc is currently always MyProc, but we pass it explicitly for flexibility. * latestXid is the latest Xid among the transaction's main XID and * subtransactions, or InvalidTransactionId if it has no XID. (We must ask * the caller to pass latestXid, instead of computing it from the PGPROC's --- 340,348 ---- * This is used interchangeably for commit and abort cases. The transaction * commit/abort must already be reported to WAL and pg_clog. * ! * In normal running proc is currently always MyProc, but in recovery we pass ! * one of the recovery procs. ! * * latestXid is the latest Xid among the transaction's main XID and * subtransactions, or InvalidTransactionId if it has no XID. (We must ask * the caller to pass latestXid, instead of computing it from the PGPROC's *************** *** 228,234 **** * incomplete.) */ void ! ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid) { if (TransactionIdIsValid(latestXid)) { --- 350,357 ---- * incomplete.) */ void ! ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid, ! int nsubxids, TransactionId *subxids) { if (TransactionIdIsValid(latestXid)) { *************** *** 253,258 **** --- 376,402 ---- proc->subxids.nxids = 0; proc->subxids.overflowed = false; + /* + * Check that any subtransactions are removed from UnobservedXids. + * We include the subxids array so that they can be removed atomically + * from UnobservedXids at the same time as we zero the main xid on + * the Recovery proc. + */ + if (nsubxids > 0) + { + int i; + + Assert(subxids != NULL); + + /* + * Ignore any failure to find the xids - this avoids complex + * bookkeeping solely to account for rare strangeness that + * would add too much overhead to be worth the cost. 
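+ * Hence missing_is_error is passed as false in the loop below.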
+ */ + for (i = 0; i < nsubxids; i++) + UnobservedTransactionsRemoveXid(subxids[i], false); + } + /* Also advance global latestCompletedXid while holding the lock */ if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid, latestXid)) *************** *** 301,306 **** --- 445,451 ---- proc->xid = InvalidTransactionId; proc->lxid = InvalidLocalTransactionId; proc->xmin = InvalidTransactionId; + proc->lsn = InvalidXLogRecPtr; /* redundant, but just in case */ proc->vacuumFlags &= ~PROC_VACUUM_STATE_MASK; *************** *** 311,316 **** --- 456,602 ---- proc->subxids.overflowed = false; } + /* + * ProcArrayClearRecoveryTransactions + * + * Called during recovery when we see a Shutdown checkpoint or EndRecovery + * record, or at the end of recovery processing. + */ + void + ProcArrayClearRecoveryTransactions(void) + { + ProcArrayStruct *arrayP = procArray; + int index; + + LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); + + /* + * Reset Recovery Procs + */ + for (index = 0; index < arrayP->maxRecoveryProcs; index++) + { + PGPROC *RecoveryProc = arrayP->procs[index]; + + ProcArrayClearTransaction(RecoveryProc); + } + + /* + * Clear the UnobservedXids also + */ + UnobservedTransactionsClearXids(); + + LWLockRelease(ProcArrayLock); + } + + /* debug support functions for recovery processing */ + bool + XidInRecoveryProcs(TransactionId xid) + { + ProcArrayStruct *arrayP = procArray; + int index; + + for (index = 0; index < arrayP->maxRecoveryProcs; index++) + { + PGPROC *RecoveryProc = arrayP->procs[index]; + + if (RecoveryProc->xid == xid) + return true; + } + return false; + } + + void + ProcArrayDisplay(int trace_level) + { + ProcArrayStruct *arrayP = procArray; + int index; + + LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); + + for (index = 0; index < arrayP->maxRecoveryProcs; index++) + { + PGPROC *RecoveryProc = arrayP->procs[index]; + + if (TransactionIdIsValid(RecoveryProc->xid)) + elog(trace_level, + "proc %d proc->xid %d proc->lsn %X/%X", index, RecoveryProc->xid, + RecoveryProc->lsn.xlogid, RecoveryProc->lsn.xrecoff); + } + + UnobservedTransactionsDisplay(trace_level); + + LWLockRelease(ProcArrayLock); + } + + /* + * Use the data about running transactions on master to either create the + * initial state of the Recovery Procs, or maintain correctness of their + * state. This is almost the opposite of GetSnapshotData(). + * + * Only used during recovery. Notice the signature is very similar to a + * _redo function. + */ + void + ProcArrayUpdateRecoveryTransactions(XLogRecPtr lsn, xl_xact_running_xacts *xlrec) + { + PGPROC *proc; + int xid_index; + TransactionId *subxip = (TransactionId *) &(xlrec->xrun[xlrec->xcnt]); + + LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); + + for (xid_index = 0; xid_index < xlrec->xcnt; xid_index++) + { + RunningXact *rxact = (RunningXact *) xlrec->xrun; + + proc = SlotIdGetRecoveryProc(rxact[xid_index].slotId); + + elog(trace_recovery(DEBUG2), + "running xact proc->lsn %X/%X lsn %X/%X proc->xid %d xid %d", + proc->lsn.xlogid, proc->lsn.xrecoff, + lsn.xlogid, lsn.xrecoff, proc->xid, rxact[xid_index].xid); + /* + * If our state information is later for this proc, then + * overwrite it. It's possible for a commit and possibly + * a new transaction record to have arrived in WAL in between + * us doing GetRunningTransactionData() and grabbing the + * WALInsertLock, so we musn't assume we know best always. 
+ */ + if (XLByteLT(proc->lsn, lsn)) + { + proc->lsn = lsn; + proc->xid = rxact[xid_index].xid; + /* proc-> pid stays 0 for Recovery Procs */ + /* proc->slotId should never be touched */ + proc->databaseId = rxact[xid_index].databaseId; + proc->roleId = rxact[xid_index].roleId; + proc->vacuumFlags = rxact[xid_index].vacuumFlags; + + proc->subxids.nxids = rxact[xid_index].nsubxids; + proc->subxids.overflowed = rxact[xid_index].overflowed; + + memcpy(proc->subxids.xids, subxip, + rxact[xid_index].nsubxids * sizeof(TransactionId)); + } + } + + /* + * We could look for Recovery Procs that weren't mentioned, but thats + * a lot of work for little benefit. We opt for a simple and cheap + * alternative: left prune the UnobservedXids array up to latestRunningXid. + * This is correct because at the time we take this snapshot, all + * completed transactions prior to latestRunningXid will be marked in + * WAL. So we won't ever see a WAL record for them again. + * + * We can't clear the array completely because race conditions allow + * things to slip through sometimes. + */ + UnobservedTransactionsPruneXids(xlrec->latestRunningXid); + + LWLockRelease(ProcArrayLock); + + ProcArrayDisplay(trace_recovery(DEBUG5)); + } /* * TransactionIdIsInProgress -- is given transaction running in some backend *************** *** 655,661 **** * but since PGPROC has only a limited cache area for subxact XIDs, full * information may not be available. If we find any overflowed subxid arrays, * we have to mark the snapshot's subxid data as overflowed, and extra work ! * will need to be done to determine what's running (see XidInMVCCSnapshot() * in tqual.c). * * We also update the following backend-global variables: --- 941,947 ---- * but since PGPROC has only a limited cache area for subxact XIDs, full * information may not be available. If we find any overflowed subxid arrays, * we have to mark the snapshot's subxid data as overflowed, and extra work ! * *may* need to be done to determine what's running (see XidInMVCCSnapshot() * in tqual.c). * * We also update the following backend-global variables: *************** *** 680,685 **** --- 966,972 ---- int index; int count = 0; int subcount = 0; + bool suboverflowed = false; Assert(snapshot != NULL); *************** *** 706,713 **** (errcode(ERRCODE_OUT_OF_MEMORY), errmsg("out of memory"))); Assert(snapshot->subxip == NULL); snapshot->subxip = (TransactionId *) ! malloc(arrayP->maxProcs * PGPROC_MAX_CACHED_SUBXIDS * sizeof(TransactionId)); if (snapshot->subxip == NULL) ereport(ERROR, (errcode(ERRCODE_OUT_OF_MEMORY), --- 993,1001 ---- (errcode(ERRCODE_OUT_OF_MEMORY), errmsg("out of memory"))); Assert(snapshot->subxip == NULL); + #define maxNumSubXids (arrayP->maxProcs * PGPROC_MAX_CACHED_SUBXIDS) snapshot->subxip = (TransactionId *) ! malloc(maxNumSubXids * sizeof(TransactionId)); if (snapshot->subxip == NULL) ereport(ERROR, (errcode(ERRCODE_OUT_OF_MEMORY), *************** *** 771,781 **** } /* ! * Save subtransaction XIDs if possible (if we've already overflowed, ! * there's no point). Note that the subxact XIDs must be later than ! * their parent, so no need to check them against xmin. We could ! * filter against xmax, but it seems better not to do that much work ! * while holding the ProcArrayLock. * * The other backend can add more subxids concurrently, but cannot * remove any. Hence it's important to fetch nxids just once. Should --- 1059,1069 ---- } /* ! * Save subtransaction XIDs, whether or not we have overflowed. ! 
* Note that the subxact XIDs must be later than their parent, so no ! * need to check them against xmin. We could filter against xmax, ! * but it seems better not to do that much work while holding the ! * ProcArrayLock. * * The other backend can add more subxids concurrently, but cannot * remove any. Hence it's important to fetch nxids just once. Should *************** *** 784,806 **** * * Again, our own XIDs are not included in the snapshot. */ ! if (subcount >= 0 && proc != MyProc) ! { ! if (proc->subxids.overflowed) ! subcount = -1; /* overflowed */ ! else { int nxids = proc->subxids.nxids; if (nxids > 0) { memcpy(snapshot->subxip + subcount, (void *) proc->subxids.xids, nxids * sizeof(TransactionId)); subcount += nxids; } } } } if (!TransactionIdIsValid(MyProc->xmin)) --- 1072,1147 ---- * * Again, our own XIDs are not included in the snapshot. */ ! if (proc != MyProc) { int nxids = proc->subxids.nxids; if (nxids > 0) { + if (proc->subxids.overflowed) + suboverflowed = true; + memcpy(snapshot->subxip + subcount, (void *) proc->subxids.xids, nxids * sizeof(TransactionId)); subcount += nxids; } + } } + + /* + * Also check for unobserved xids. There is no need for us to specify + * only if IsRecoveryProcessingMode(), since the list will always be + * empty when normal processing begins and the test will be optimised + * to nearly nothing very quickly. + */ + for (index = 0; index < arrayP->numUnobservedXids; index++) + { + volatile TransactionId *UnobservedXids; + TransactionId xid; + + UnobservedXids = (TransactionId *) &(arrayP->procs[arrayP->maxProcs]); + + /* Fetch xid just once - see GetNewTransactionId */ + xid = UnobservedXids[index]; + + /* + * If there are no more visible xids, we're done. This works + * because UnobservedXids is maintained in strict ascending order. + */ + if (!TransactionIdIsNormal(xid) || TransactionIdPrecedes(xid, xmax)) + break; + + /* + * Typically, there will be space in the snapshot. We know that the + * unobserved xids are being run by one of the procs marked with + * an xid of InvalidTransactionId, so we will have ignored that above, + * and the xidcache for that proc will have been empty also. + * + * We put the unobserved xid anywhere in the snapshot. The xid might + * be a top-level or it might be a subtransaction, but it won't + * change the answer to XidInMVCCSnapshot() whichever it is. That's + * just as well, since we don't know which it is, by definition. + */ + if (count < arrayP->maxProcs) + snapshot->xip[count++] = xid; + else + { + /* + * Store unobserved xids in the subxid cache instead. + */ + snapshot->subxip[subcount++] = xid; + } + + /* + * We don't really need xmin during recovery, but lets derive + * it anyway for consistency. It is possible that an unobserved + * xid could be xmin if there is contention between long-lived + * transactions. + */ + if (TransactionIdPrecedes(xid, xmin)) + xmin = xid; } if (!TransactionIdIsValid(MyProc->xmin)) *************** *** 824,829 **** --- 1165,1171 ---- snapshot->xmax = xmax; snapshot->xcnt = count; snapshot->subxcnt = subcount; + snapshot->suboverflowed = suboverflowed; snapshot->curcid = GetCurrentCommandId(false); *************** *** 839,844 **** --- 1181,1413 ---- } /* + * GetRunningTransactionData -- returns information about running transactions. + * + * Similar to GetSnapshotData but returning more information. We include + * all PGPROCs with an assigned TransactionId, even VACUUM processes. We + * include slotId and databaseId for each PGPROC. 
We also keep track + * of which subtransactions go with each PGPROC, information which is lost + * when we GetSnapshotData. + * + * This is never executed when IsRecoveryMode() so there is no need to look + * at UnobservedXids. + * + * We don't worry about updating other counters, we want to keep this as + * simple as possible and leave GetSnapshotData() as the primary code for + * that bookkeeping. + */ + RunningTransactions + GetRunningTransactionData(void) + { + ProcArrayStruct *arrayP = procArray; + RunningTransactions CurrentRunningXacts = (RunningTransactions) &CurrentRunningXactsData; + RunningXact *rxact; + TransactionId *subxip; + TransactionId latestRunningXid = InvalidTransactionId; + TransactionId prev_latestRunningXid = InvalidTransactionId; + TransactionId latestCompletedXid; + int numAttempts = 0; + int index; + int count = 0; + int subcount = 0; + bool suboverflowed = false; + + /* + * Allocating space for maxProcs xids is usually overkill; numProcs would + * be sufficient. But it seems better to do the malloc while not holding + * the lock, so we can't look at numProcs. Likewise, we allocate much + * more subxip storage than is probably needed. + * + * Should only be allocated for bgwriter, since only ever executed + * during checkpoints. + */ + if (CurrentRunningXacts->xrun == NULL) + { + /* + * First call + */ + CurrentRunningXacts->xrun = (RunningXact *) + malloc(arrayP->maxProcs * sizeof(RunningXact)); + if (CurrentRunningXacts->xrun == NULL) + ereport(ERROR, + (errcode(ERRCODE_OUT_OF_MEMORY), + errmsg("out of memory"))); + Assert(CurrentRunningXacts->subxip == NULL); + CurrentRunningXacts->subxip = (TransactionId *) + malloc(maxNumSubXids * sizeof(TransactionId)); + if (CurrentRunningXacts->subxip == NULL) + ereport(ERROR, + (errcode(ERRCODE_OUT_OF_MEMORY), + errmsg("out of memory"))); + } + + rxact = CurrentRunningXacts->xrun; + subxip = CurrentRunningXacts->subxip; + + /* + * Loop until we get a valid snapshot. See exit conditions below. + */ + for (;;) + { + count = 0; + subcount = 0; + suboverflowed = false; + + LWLockAcquire(ProcArrayLock, LW_SHARED); + + latestCompletedXid = ShmemVariableCache->latestCompletedXid; + + /* + * Spin over procArray checking xid, and subxids. Shared lock is enough + * because new transactions don't use locks at all, so LW_EXCLUSIVE + * wouldn't be enough to prevent them, so don't bother. + */ + for (index = 0; index < arrayP->numProcs; index++) + { + volatile PGPROC *proc = arrayP->procs[index]; + TransactionId xid; + int nxids; + + /* Fetch xid just once - see GetNewTransactionId */ + xid = proc->xid; + + /* + * We store all xids, even XIDs >= xmax and our own XID, if any. + * But we don't store transactions that don't have a TransactionId + * yet because they will not show as running on a standby server. + */ + if (!TransactionIdIsValid(xid)) + continue; + + rxact[count].xid = xid; + rxact[count].slotId = proc->slotId; + rxact[count].databaseId = proc->databaseId; + rxact[count].roleId = proc->roleId; + rxact[count].vacuumFlags = proc->vacuumFlags; + + if (TransactionIdPrecedes(latestRunningXid, xid)) + latestRunningXid = xid; + + /* + * Save subtransaction XIDs. + * + * The other backend can add more subxids concurrently, but cannot + * remove any. Hence it's important to fetch nxids just once. Should + * be safe to use memcpy, though. (We needn't worry about missing any + * xids added concurrently, because they must postdate xmax.) + * + * Again, our own XIDs *are* included in the snapshot. 
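+ * This is the opposite of GetSnapshotData(), which leaves our own XIDs out.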
+ */ + nxids = proc->subxids.nxids; + + if (nxids > 0) + { + TransactionId *subxids = (TransactionId *) proc->subxids.xids; + + rxact[count].subx_offset = subcount; + + memcpy(subxip + subcount, + (void *) proc->subxids.xids, + nxids * sizeof(TransactionId)); + subcount += nxids; + + if (proc->subxids.overflowed) + { + rxact[count].overflowed = true; + suboverflowed = true; + } + else if (TransactionIdPrecedes(latestRunningXid, subxids[nxids - 1])) + latestRunningXid = subxids[nxids - 1]; + } + + rxact[count].nsubxids = nxids; + + count++; + } + + LWLockRelease(ProcArrayLock); + + /* + * If there's no procs with TransactionIds allocated we need to + * find what the last xid assigned was. This takes and releases + * XidGenLock, but that shouldn't cause contention in this case. + * We could do this as well if the snapshot overflowed, but in + * that case we think that XidGenLock might be high, so we punt. + * + * By the time we do this, another proc may have incremented the + * nextxid, so we must rescan the procarray to check whether + * there are either new running transactions or the counter is + * the same as before. If transactions appear and disappear + * faster than we can do this, we're in trouble. So spin for at + * a few 3 attempts before giving up. + * + * We do it this way to avoid needing to grab XidGenLock in all + * cases, which is hardly ever actually required. + */ + if (count > 0) + break; + else + { + #define MAX_SNAPSHOT_ATTEMPTS 3 + if (numAttempts >= MAX_SNAPSHOT_ATTEMPTS) + { + latestRunningXid = InvalidTransactionId; + break; + } + + latestRunningXid = ReadNewTransactionId(); + TransactionIdRetreat(latestRunningXid); + + if (prev_latestRunningXid == latestRunningXid) + break; + + prev_latestRunningXid = latestRunningXid; + numAttempts++; + } + } + + CurrentRunningXacts->xcnt = count; + CurrentRunningXacts->subxcnt = subcount; + CurrentRunningXacts->latestCompletedXid = latestCompletedXid; + if (!suboverflowed) + CurrentRunningXacts->latestRunningXid = latestRunningXid; + else + CurrentRunningXacts->latestRunningXid = InvalidTransactionId; + + #ifdef RUNNING_XACT_DEBUG + elog(trace_recovery(DEBUG3), + "logging running xacts xcnt %d subxcnt %d latestCompletedXid %d latestRunningXid %d", + CurrentRunningXacts->xcnt, + CurrentRunningXacts->subxcnt, + CurrentRunningXacts->latestCompletedXid, + CurrentRunningXacts->latestRunningXid); + + for (index = 0; index < CurrentRunningXacts->xcnt; index++) + { + int j; + elog(trace_recovery(DEBUG3), + "xid %d pid %d backend %d db %d role %d nsubxids %d offset %d vf %u, overflow %s", + CurrentRunningXacts->xrun[index].xid, + CurrentRunningXacts->xrun[index].pid, + CurrentRunningXacts->xrun[index].slotId, + CurrentRunningXacts->xrun[index].databaseId, + CurrentRunningXacts->xrun[index].roleId, + CurrentRunningXacts->xrun[index].nsubxids, + CurrentRunningXacts->xrun[index].subx_offset, + CurrentRunningXacts->xrun[index].vacuumFlags, + CurrentRunningXacts->xrun[index].overflowed ? 
"t" : "f"); + for (j = 0; j < CurrentRunningXacts->xrun[index].nsubxids; j++) + elog(trace_recovery(DEBUG3), + "subxid offset %d j %d xid %d", + CurrentRunningXacts->xrun[index].subx_offset, j, + CurrentRunningXacts->subxip[j + CurrentRunningXacts->xrun[index].subx_offset]); + } + #endif + + return CurrentRunningXacts; + } + + /* * GetTransactionsInCommit -- Get the XIDs of transactions that are committing * * Constructs an array of XIDs of transactions that are currently in commit *************** *** 1024,1036 **** * The array is palloc'd and is terminated with an invalid VXID. * * If limitXmin is not InvalidTransactionId, we skip any backends ! * with xmin >= limitXmin. If allDbs is false, we skip backends attached * to other databases. If excludeVacuum isn't zero, we skip processes for * which (excludeVacuum & vacuumFlags) is not zero. Also, our own process * is always skipped. */ VirtualTransactionId * ! GetCurrentVirtualXIDs(TransactionId limitXmin, bool allDbs, int excludeVacuum) { VirtualTransactionId *vxids; ProcArrayStruct *arrayP = procArray; --- 1593,1605 ---- * The array is palloc'd and is terminated with an invalid VXID. * * If limitXmin is not InvalidTransactionId, we skip any backends ! * with xmin >= limitXmin. If dbOid is non-zero we skip backends attached * to other databases. If excludeVacuum isn't zero, we skip processes for * which (excludeVacuum & vacuumFlags) is not zero. Also, our own process * is always skipped. */ VirtualTransactionId * ! GetCurrentVirtualXIDs(TransactionId limitXmin, Oid dbOid, int excludeVacuum) { VirtualTransactionId *vxids; ProcArrayStruct *arrayP = procArray; *************** *** 1053,1059 **** if (excludeVacuum & proc->vacuumFlags) continue; ! if (allDbs || proc->databaseId == MyDatabaseId) { /* Fetch xmin just once - might change on us? */ TransactionId pxmin = proc->xmin; --- 1622,1628 ---- if (excludeVacuum & proc->vacuumFlags) continue; ! if (dbOid == 0 || proc->databaseId == dbOid) { /* Fetch xmin just once - might change on us? */ TransactionId pxmin = proc->xmin; *************** *** 1083,1088 **** --- 1652,1712 ---- return vxids; } + int + VirtualTransactionIdGetPid(VirtualTransactionId vxid) + { + ProcArrayStruct *arrayP = procArray; + int result = 0; + int index; + + if (!VirtualTransactionIdIsValid(vxid)) + return 0; + + LWLockAcquire(ProcArrayLock, LW_SHARED); + + for (index = 0; index < arrayP->numProcs; index++) + { + VirtualTransactionId procvxid; + PGPROC *proc = arrayP->procs[index]; + + GET_VXID_FROM_PGPROC(procvxid, *proc); + + if (procvxid.backendId == vxid.backendId && + procvxid.localTransactionId == vxid.localTransactionId) + { + result = proc->pid; + break; + } + } + + LWLockRelease(ProcArrayLock); + + return result; + } + + /* + * SlotIdGetRecoveryProc -- get a PGPROC for a given SlotId + * + * Run during recovery to identify which PGPROC to access. + * Throws ERROR if not found, or we pass an invalid value. + * + * see comments in CreateSharedProcArray() + */ + PGPROC * + SlotIdGetRecoveryProc(int slotId) + { + if (slotId < 0 || slotId > MaxBackends) + elog(ERROR, "invalid slotId %d", slotId); + + Assert(procArray->procs[slotId] != NULL); + + /* + * No need to acquire ProcArrayLock to identify proc, we just + * use the slotId as an array offset directly, since we assigned + * these at start. 
+ */ + return procArray->procs[slotId]; + } /* * CountActiveBackends --- count backends (other than myself) that are in *************** *** 1367,1369 **** --- 1991,2195 ---- } #endif /* XIDCACHE_DEBUG */ + + /* ---------------------------------------------- + * UnobservedTransactions sub-module + * ---------------------------------------------- + * + * All functions must be called holding ProcArrayLock. + */ + + /* + * Add unobserved xids to end of UnobservedXids array + */ + void + UnobservedTransactionsAddXids(TransactionId firstXid, TransactionId lastXid) + { + TransactionId ixid = firstXid; + int index = procArray->numUnobservedXids; + TransactionId *UnobservedXids; + + UnobservedXids = (TransactionId *) &(procArray->procs[procArray->maxProcs]); + + Assert(TransactionIdIsNormal(firstXid)); + Assert(TransactionIdIsNormal(lastXid)); + Assert(TransactionIdPrecedes(firstXid, lastXid)); + + /* + * UnobservedXids is maintained as a ascending list of xids, with no gaps. + * Incoming xids are always higher than previous entries, so we just add + * them directly to the end of the array. + */ + while (ixid != lastXid) + { + /* + * check to see if we have space to store more UnobservedXids + */ + if (index >= procArray->maxUnobservedXids) + { + UnobservedTransactionsDisplay(WARNING); + elog(FATAL, "No more room in UnobservedXids array"); + } + + /* + * append ixid to UnobservedXids + */ + Assert(!TransactionIdIsValid(UnobservedXids[index])); + Assert(index == 0 || TransactionIdPrecedes(UnobservedXids[index - 1], ixid)); + + elog(trace_recovery(DEBUG4), "Adding UnobservedXid %d", ixid); + UnobservedXids[index] = ixid; + index++; + + TransactionIdAdvance(ixid); + } + + procArray->numUnobservedXids = index; + } + + /* + * Remove one unobserved xid from anywhere on UnobservedXids array. + * If xid has already been pruned away, no need to report as missing. + */ + void + UnobservedTransactionsRemoveXid(TransactionId xid, bool missing_is_error) + { + int index; + bool found = false; + TransactionId *UnobservedXids; + + UnobservedXids = (TransactionId *) &(procArray->procs[procArray->maxProcs]); + + elog(trace_recovery(DEBUG4), "Remove UnobservedXid = %d", xid); + + /* + * If we haven't initialised array yet, or if we've already cleared it + * ignore this and get on with it. If it's missing after this it is an + * ERROR if removal is requested and the value isn't present. + */ + if (procArray->numUnobservedXids == 0 || + (procArray->numUnobservedXids > 0 && + TransactionIdPrecedes(xid, UnobservedXids[0]))) + return; + + /* + * XXX we could use bsearch, if this has significant overhead. + */ + for (index = 0; index < procArray->numUnobservedXids; index++) + { + if (!found) + { + if (UnobservedXids[index] == xid) + found = true; + } + else + { + UnobservedXids[index - 1] = UnobservedXids[index]; + } + } + + if (found) + UnobservedXids[--procArray->numUnobservedXids] = InvalidTransactionId; + + if (!found && missing_is_error) + { + UnobservedTransactionsDisplay(LOG); + elog(ERROR, "could not remove unobserved xid = %d", xid); + } + } + + /* + * Prune array up to a particular limit. This frequently means clearing the + * whole array, but we don't attempt to optimise for that at present. 
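+ * For example, pruning {100, 103, 104, 110} with limitXid = 104 removes
+ * the first three entries and moves 110 to the front of the array.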
+ */ + void + UnobservedTransactionsPruneXids(TransactionId limitXid) + { + int index; + int pruneUpToThisIndex = 0; + TransactionId *UnobservedXids; + + UnobservedXids = (TransactionId *) &(procArray->procs[procArray->maxProcs]); + + elog(trace_recovery(DEBUG4), "Prune UnobservedXids up to %d", limitXid); + + for (index = 0; index < procArray->numUnobservedXids; index++) + { + if (TransactionIdFollowsOrEquals(limitXid, UnobservedXids[index])) + pruneUpToThisIndex = index + 1; + else + { + /* + * Anything to delete? + */ + if (pruneUpToThisIndex == 0) + return; + + /* + * Move unpruned values to start of array + */ + UnobservedXids[index - pruneUpToThisIndex] = UnobservedXids[index]; + UnobservedXids[index] = 0; + } + } + + procArray->numUnobservedXids -= pruneUpToThisIndex; + } + + /* + * Clear the whole array. + */ + void + UnobservedTransactionsClearXids(void) + { + int index; + TransactionId *UnobservedXids; + + elog(trace_recovery(DEBUG4), "Clear UnobservedXids"); + + UnobservedXids = (TransactionId *) &(procArray->procs[procArray->maxProcs]); + + #ifdef USE_ASSERT_CHECKING + /* + * UnobservedTransactionsAddXids() asserts that array will be empty + * when we add new values. so it must be zeroes here each time. + */ + for (index = 0; index < procArray->numUnobservedXids; index++) + { + UnobservedXids[index] = 0; + } + #endif + + procArray->numUnobservedXids = 0; + } + + void + UnobservedTransactionsDisplay(int trace_level) + { + int index; + TransactionId *UnobservedXids; + + UnobservedXids = (TransactionId *) &(procArray->procs[procArray->maxProcs]); + + for (index = 0; index < procArray->maxUnobservedXids; index++) + { + elog(trace_level, "%d unobserved[%d] = %d ", + procArray->numUnobservedXids, index, UnobservedXids[index]); + } + } + + bool + XidInUnobservedTransactions(TransactionId xid) + { + int index; + TransactionId *UnobservedXids; + + UnobservedXids = (TransactionId *) &(procArray->procs[procArray->maxProcs]); + + for (index = 0; index < procArray->numUnobservedXids; index++) + { + if (UnobservedXids[index] == xid) + return true; + } + return false; + } Index: src/backend/storage/ipc/sinvaladt.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/storage/ipc/sinvaladt.c,v retrieving revision 1.74 diff -c -r1.74 sinvaladt.c *** src/backend/storage/ipc/sinvaladt.c 18 Jul 2008 14:45:48 -0000 1.74 --- src/backend/storage/ipc/sinvaladt.c 1 Nov 2008 14:49:38 -0000 *************** *** 142,147 **** --- 142,148 ---- int nextMsgNum; /* next message number to read */ bool resetState; /* backend needs to reset its state */ bool signaled; /* backend has been sent catchup signal */ + bool sendOnly; /* backend only sends, never receives */ /* * Next LocalTransactionId to use for each idle backend slot. We keep *************** *** 248,254 **** * Initialize a new backend to operate on the sinval buffer */ void ! SharedInvalBackendInit(void) { int index; ProcState *stateP = NULL; --- 249,255 ---- * Initialize a new backend to operate on the sinval buffer */ void ! SharedInvalBackendInit(bool sendOnly) { int index; ProcState *stateP = NULL; *************** *** 307,312 **** --- 308,314 ---- stateP->nextMsgNum = segP->maxMsgNum; stateP->resetState = false; stateP->signaled = false; + stateP->sendOnly = sendOnly; LWLockRelease(SInvalWriteLock); *************** *** 578,584 **** /* * Recompute minMsgNum = minimum of all backends' nextMsgNum, identify * the furthest-back backend that needs signaling (if any), and reset ! 
* any backends that are too far back. */ min = segP->maxMsgNum; minsig = min - SIG_THRESHOLD; --- 580,588 ---- /* * Recompute minMsgNum = minimum of all backends' nextMsgNum, identify * the furthest-back backend that needs signaling (if any), and reset ! * any backends that are too far back. Note that because we ignore ! * sendOnly backends here it is possible for them to keep sending ! * messages without a problem even when they are the only active backend. */ min = segP->maxMsgNum; minsig = min - SIG_THRESHOLD; *************** *** 590,596 **** int n = stateP->nextMsgNum; /* Ignore if inactive or already in reset state */ ! if (stateP->procPid == 0 || stateP->resetState) continue; /* --- 594,600 ---- int n = stateP->nextMsgNum; /* Ignore if inactive or already in reset state */ ! if (stateP->procPid == 0 || stateP->resetState || stateP->sendOnly) continue; /* Index: src/backend/storage/lmgr/lock.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/storage/lmgr/lock.c,v retrieving revision 1.184 diff -c -r1.184 lock.c *** src/backend/storage/lmgr/lock.c 1 Aug 2008 13:16:09 -0000 1.184 --- src/backend/storage/lmgr/lock.c 1 Nov 2008 14:49:38 -0000 *************** *** 38,43 **** --- 38,44 ---- #include "miscadmin.h" #include "pg_trace.h" #include "pgstat.h" + #include "storage/sinval.h" #include "utils/memutils.h" #include "utils/ps_status.h" #include "utils/resowner.h" *************** *** 490,495 **** --- 491,505 ---- if (lockmode <= 0 || lockmode > lockMethodTable->numLockModes) elog(ERROR, "unrecognized lock mode: %d", lockmode); + if (IsRecoveryProcessingMode() && + locktag->locktag_type == LOCKTAG_OBJECT && + lockmode > AccessShareLock) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("cannot acquire lockmode %s on database objects while recovery is in progress", + lockMethodTable->lockModeNames[lockmode]), + errhint("Only AccessShareLock can be acquired on database objects during recovery."))); + #ifdef LOCK_DEBUG if (LOCK_DEBUG_ENABLED(locktag)) elog(LOG, "LockAcquire: lock [%u,%u] %s", *************** *** 817,822 **** --- 827,866 ---- LWLockRelease(partitionLock); + /* + * We made it all the way here. We've got the lock and we've got + * it for the first time in this transaction. So now it's time + * to send a WAL message so that standby servers can see this event, + * if its an AccessExclusiveLock on a relation. + */ + if (!InRecovery && lockmode >= AccessExclusiveLock && + locktag->locktag_type == LOCKTAG_RELATION) + { + XLogRecData rdata; + xl_rel_lock xlrec; + + START_CRIT_SECTION(); + + xlrec.slotId = MyProc->slotId; + + /* + * Decode the locktag back to the original values, to avoid + * sending lots of empty bytes with every message. 
See + * lock.h to check how a locktag is defined for LOCKTAG_RELATION + */ + xlrec.dbOid = locktag->locktag_field1; + xlrec.relOid = locktag->locktag_field2; + + rdata.data = (char *) (&xlrec); + rdata.len = sizeof(xl_rel_lock); + rdata.buffer = InvalidBuffer; + rdata.next = NULL; + + (void) XLogInsert(RM_RELATION_ID, XLOG_RELATION_LOCK, &rdata); + + END_CRIT_SECTION(); + } + return LOCKACQUIRE_OK; } Index: src/backend/storage/lmgr/proc.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/storage/lmgr/proc.c,v retrieving revision 1.201 diff -c -r1.201 proc.c *** src/backend/storage/lmgr/proc.c 9 Jun 2008 18:23:05 -0000 1.201 --- src/backend/storage/lmgr/proc.c 1 Nov 2008 14:49:38 -0000 *************** *** 103,108 **** --- 103,110 ---- size = add_size(size, mul_size(NUM_AUXILIARY_PROCS, sizeof(PGPROC))); /* MyProcs, including autovacuum */ size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC))); + /* RecoveryProcs, including recovery actions by autovacuum */ + size = add_size(size, mul_size(MaxBackends, sizeof(PGPROC))); /* ProcStructLock */ size = add_size(size, sizeof(slock_t)); *************** *** 152,157 **** --- 154,160 ---- PGPROC *procs; int i; bool found; + int slotId = 0; /* Create the ProcGlobal shared structure */ ProcGlobal = (PROC_HDR *) *************** *** 188,193 **** --- 191,197 ---- { PGSemaphoreCreate(&(procs[i].sem)); procs[i].links.next = ProcGlobal->freeProcs; + procs[i].slotId = slotId++; /* once set, never changed */ ProcGlobal->freeProcs = MAKE_OFFSET(&procs[i]); } *************** *** 201,209 **** --- 205,234 ---- { PGSemaphoreCreate(&(procs[i].sem)); procs[i].links.next = ProcGlobal->autovacFreeProcs; + procs[i].slotId = slotId++; /* once set, never changed */ ProcGlobal->autovacFreeProcs = MAKE_OFFSET(&procs[i]); } + /* + * Create enough Recovery Procs so there is a shadow proc for every + * normal proc. Recovery procs don't need semaphores because they + * aren't actually performing any work, they are just ghosts with + * enough substance to store enough information to make them look + * real to anyone requesting a snapshot from the procarray. + */ + procs = (PGPROC *) ShmemAlloc((MaxBackends) * sizeof(PGPROC)); + if (!procs) + ereport(FATAL, + (errcode(ERRCODE_OUT_OF_MEMORY), + errmsg("out of shared memory"))); + MemSet(procs, 0, MaxBackends * sizeof(PGPROC)); + for (i = 0; i < MaxBackends; i++) + { + procs[i].links.next = ProcGlobal->freeProcs; + procs[i].slotId = -1; + ProcGlobal->freeProcs = MAKE_OFFSET(&procs[i]); + } + MemSet(AuxiliaryProcs, 0, NUM_AUXILIARY_PROCS * sizeof(PGPROC)); for (i = 0; i < NUM_AUXILIARY_PROCS; i++) { *************** *** 278,284 **** /* * Initialize all fields of MyProc, except for the semaphore which was ! * prepared for us by InitProcGlobal. */ SHMQueueElemInit(&(MyProc->links)); MyProc->waitStatus = STATUS_OK; --- 303,310 ---- /* * Initialize all fields of MyProc, except for the semaphore which was ! * prepared for us by InitProcGlobal. Never, ever, change the slotId. ! * Recovery snapshot processing relies completely on this never changing. 
*/ SHMQueueElemInit(&(MyProc->links)); MyProc->waitStatus = STATUS_OK; *************** *** 322,327 **** --- 348,435 ---- } /* + * InitRecoveryProcess -- initialize a per-master process data structure + * for use when emulating transactions in recovery + */ + PGPROC * + InitRecoveryProcess(void) + { + /* use volatile pointer to prevent code rearrangement */ + volatile PROC_HDR *procglobal = ProcGlobal; + SHMEM_OFFSET myOffset; + PGPROC *ThisProc = NULL; + + /* + * ProcGlobal should be set up already (if we are a backend, we inherit + * this by fork() or EXEC_BACKEND mechanism from the postmaster). + */ + if (procglobal == NULL) + elog(PANIC, "proc header uninitialized"); + + /* + * Try to get a proc struct from the free list. If this fails, we must be + * out of PGPROC structures (not to mention semaphores). + */ + SpinLockAcquire(ProcStructLock); + + myOffset = procglobal->freeProcs; + + if (myOffset != INVALID_OFFSET) + { + ThisProc = (PGPROC *) MAKE_PTR(myOffset); + procglobal->freeProcs = ThisProc->links.next; + SpinLockRelease(ProcStructLock); + } + else + { + /* + * Should never reach here if shared memory is allocated correctly. + */ + SpinLockRelease(ProcStructLock); + elog(FATAL, "too many procs - could not create recovery proc"); + } + + /* + * xid will be set later as WAL records arrive for this recovery proc + */ + ThisProc->xid = InvalidTransactionId; + + /* + * The backendid of the recovery proc stays at InvalidBackendId. There + * is a direct 1:1 correspondence between a master backendid and this + * proc, but that same backendid may also be in use during recovery, + * so if we set this field we would have duplicate backendids. + */ + ThisProc->backendId = InvalidBackendId; + + /* + * The following are not used in recovery + */ + ThisProc->pid = 0; + + SHMQueueElemInit(&(ThisProc->links)); + ThisProc->waitStatus = STATUS_OK; + ThisProc->lxid = InvalidLocalTransactionId; + ThisProc->xmin = InvalidTransactionId; + ThisProc->databaseId = InvalidOid; + ThisProc->roleId = InvalidOid; + ThisProc->inCommit = false; + ThisProc->vacuumFlags = 0; + ThisProc->lwWaiting = false; + ThisProc->lwExclusive = false; + ThisProc->lwWaitLink = NULL; + ThisProc->waitLock = NULL; + ThisProc->waitProcLock = NULL; + + /* + * There is little else to do. The recovery proc is never used to + * acquire buffers, nor will we ever acquire LWlocks using the proc. + * Deadlock checker is not active during recovery. + */ + return ThisProc; + } + + /* * InitProcessPhase2 -- make MyProc visible in the shared ProcArray. * * This is separate from InitProcess because we can't acquire LWLocks until Index: src/backend/tcop/postgres.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/tcop/postgres.c,v retrieving revision 1.557 diff -c -r1.557 postgres.c *** src/backend/tcop/postgres.c 30 Sep 2008 10:52:13 -0000 1.557 --- src/backend/tcop/postgres.c 1 Nov 2008 14:49:38 -0000 *************** *** 3261,3267 **** * We have to build the flat file for pg_database, but not for the * user and group tables, since we won't try to do authentication. */ ! BuildFlatFiles(true); } /* --- 3261,3267 ---- * We have to build the flat file for pg_database, but not for the * user and group tables, since we won't try to do authentication. */ ! 
BuildFlatFiles(true, false, false); } /* Index: src/backend/tcop/utility.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/tcop/utility.c,v retrieving revision 1.299 diff -c -r1.299 utility.c *** src/backend/tcop/utility.c 10 Oct 2008 13:48:05 -0000 1.299 --- src/backend/tcop/utility.c 1 Nov 2008 14:49:38 -0000 *************** *** 296,301 **** --- 296,302 ---- break; case TRANS_STMT_PREPARE: + PreventCommandDuringRecovery(); if (!PrepareTransactionBlock(stmt->gid)) { /* report unsuccessful commit in completionTag */ *************** *** 305,315 **** --- 306,318 ---- break; case TRANS_STMT_COMMIT_PREPARED: + PreventCommandDuringRecovery(); PreventTransactionChain(isTopLevel, "COMMIT PREPARED"); FinishPreparedTransaction(stmt->gid, true); break; case TRANS_STMT_ROLLBACK_PREPARED: + PreventCommandDuringRecovery(); PreventTransactionChain(isTopLevel, "ROLLBACK PREPARED"); FinishPreparedTransaction(stmt->gid, false); break; *************** *** 631,636 **** --- 634,640 ---- break; case T_GrantStmt: + PreventCommandDuringRecovery(); ExecuteGrantStmt((GrantStmt *) parsetree); break; *************** *** 801,806 **** --- 805,811 ---- case T_NotifyStmt: { NotifyStmt *stmt = (NotifyStmt *) parsetree; + PreventCommandDuringRecovery(); Async_Notify(stmt->conditionname); } *************** *** 809,814 **** --- 814,820 ---- case T_ListenStmt: { ListenStmt *stmt = (ListenStmt *) parsetree; + PreventCommandDuringRecovery(); Async_Listen(stmt->conditionname); } *************** *** 817,822 **** --- 823,829 ---- case T_UnlistenStmt: { UnlistenStmt *stmt = (UnlistenStmt *) parsetree; + PreventCommandDuringRecovery(); if (stmt->conditionname) Async_Unlisten(stmt->conditionname); *************** *** 836,845 **** --- 843,854 ---- break; case T_ClusterStmt: + PreventCommandDuringRecovery(); cluster((ClusterStmt *) parsetree, isTopLevel); break; case T_VacuumStmt: + PreventCommandDuringRecovery(); vacuum((VacuumStmt *) parsetree, InvalidOid, true, NULL, false, isTopLevel); break; *************** *** 950,961 **** --- 959,972 ---- ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), errmsg("must be superuser to do CHECKPOINT"))); + PreventCommandDuringRecovery(); RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT); break; case T_ReindexStmt: { ReindexStmt *stmt = (ReindexStmt *) parsetree; + PreventCommandDuringRecovery(); switch (stmt->kind) { *************** *** 2386,2388 **** --- 2397,2408 ---- return lev; } + + void + PreventCommandDuringRecovery(void) + { + if (IsRecoveryProcessingMode()) + ereport(ERROR, + (errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION), + errmsg("cannot be run until recovery completes"))); + } Index: src/backend/utils/adt/txid.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/utils/adt/txid.c,v retrieving revision 1.7 diff -c -r1.7 txid.c *** src/backend/utils/adt/txid.c 12 May 2008 20:02:02 -0000 1.7 --- src/backend/utils/adt/txid.c 1 Nov 2008 14:49:38 -0000 *************** *** 338,343 **** --- 338,349 ---- txid val; TxidEpoch state; + if (IsRecoveryProcessingMode()) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("cannot assign txid while recovery is in progress"), + errhint("only read only queries can execute during recovery"))); + load_xid_epoch(&state); val = convert_xid(GetTopTransactionId(), &state); Index: src/backend/utils/cache/inval.c 
=================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/utils/cache/inval.c,v retrieving revision 1.87 diff -c -r1.87 inval.c *** src/backend/utils/cache/inval.c 9 Sep 2008 18:58:08 -0000 1.87 --- src/backend/utils/cache/inval.c 1 Nov 2008 14:49:38 -0000 *************** *** 86,95 **** --- 86,99 ---- */ #include "postgres.h" + #include + #include "access/twophase_rmgr.h" #include "access/xact.h" #include "catalog/catalog.h" #include "miscadmin.h" + #include "storage/lmgr.h" + #include "storage/procarray.h" #include "storage/sinval.h" #include "storage/smgr.h" #include "utils/inval.h" *************** *** 155,160 **** --- 159,172 ---- static TransInvalidationInfo *transInvalInfo = NULL; + static SharedInvalidationMessage *SharedInvalidMessagesArray; + static int numSharedInvalidMessagesArray; + static int maxSharedInvalidMessagesArray; + + static List *RecoveryLockList; + static MemoryContext RelationLockContext; + + /* * Dynamically-registered callback functions. Current implementation * assumes there won't be very many of these at once; could improve if needed. *************** *** 741,746 **** --- 753,760 ---- MemoryContextAllocZero(TopTransactionContext, sizeof(TransInvalidationInfo)); transInvalInfo->my_level = GetCurrentTransactionNestLevel(); + SharedInvalidMessagesArray = NULL; + numSharedInvalidMessagesArray = 0; } /* *************** *** 851,856 **** --- 865,987 ---- } } + static void + MakeSharedInvalidMessagesArray(const SharedInvalidationMessage *msgs, int n) + { + /* + * Initialise array first time through in each commit + */ + if (SharedInvalidMessagesArray == NULL) + { + maxSharedInvalidMessagesArray = FIRSTCHUNKSIZE; + numSharedInvalidMessagesArray = 0; + + /* + * Although this is being palloc'd we don't actually free it directly. + * We're so close to EOXact that we now we're going to lose it anyhow. + */ + SharedInvalidMessagesArray = palloc(maxSharedInvalidMessagesArray + * sizeof(SharedInvalidationMessage)); + } + + if ((numSharedInvalidMessagesArray + n) > maxSharedInvalidMessagesArray) + { + while ((numSharedInvalidMessagesArray + n) > maxSharedInvalidMessagesArray) + maxSharedInvalidMessagesArray *= 2; + + SharedInvalidMessagesArray = repalloc(SharedInvalidMessagesArray, + maxSharedInvalidMessagesArray + * sizeof(SharedInvalidationMessage)); + } + + /* + * Append the next chunk onto the array + */ + memcpy(SharedInvalidMessagesArray + numSharedInvalidMessagesArray, + msgs, n * sizeof(SharedInvalidationMessage)); + numSharedInvalidMessagesArray += n; + } + + /* + * xactGetCommittedInvalidationMessages() is executed by + * RecordTransactionCommit() to add invalidation messages onto the + * commit record. This applies only to commit message types, never to + * abort records. Must always run before AtEOXact_Inval(), since that + * removes the data we need to see. + * + * Remember that this runs before we have officially committed, so we + * must not do anything here to change what might occur *if* we should + * fail between here and the actual commit. + * + * Note that transactional validation does *not* write a invalidation + * WAL message using XLOG_RELATION_INVAL messages. Those are only used + * by non-transactional invalidation. see comments in + * EndNonTransactionalInvalidation(). 
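+ *
+ * The array handed back is normally palloc'd in CurTransactionContext
+ * (see MakeSharedInvalidMessagesArray()), so it is released with the
+ * transaction rather than pfree'd explicitly.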
+ * + * see also xact_redo_commit() and xact_desc_commit() + */ + int + xactGetCommittedInvalidationMessages(SharedInvalidationMessage **msgs, + bool *RelcacheInitFileInval) + { + MemoryContext oldcontext; + + /* Must be at top of stack */ + Assert(transInvalInfo != NULL && transInvalInfo->parent == NULL); + + /* + * Relcache init file invalidation requires processing both before and + * after we send the SI messages. However, we need not do anything + * unless we committed. + */ + if (transInvalInfo->RelcacheInitFileInval) + *RelcacheInitFileInval = true; + else + *RelcacheInitFileInval = false; + + /* + * Walk through TransInvalidationInfo to collect all the messages + * into a single contiguous array of invalidation messages. It must + * be contiguous so we can copy directly into WAL message. Maintain the + * order that they would be processed in by AtEOXact_Inval(), to ensure + * emulated behaviour in redo is as similar as possible to original. + * We want the same bugs, if any, not new ones. + */ + oldcontext = MemoryContextSwitchTo(CurTransactionContext); + + ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs, + MakeSharedInvalidMessagesArray); + ProcessInvalidationMessagesMulti(&transInvalInfo->PriorCmdInvalidMsgs, + MakeSharedInvalidMessagesArray); + MemoryContextSwitchTo(oldcontext); + + #ifdef STANDBY_INVAL_DEBUG + if (numSharedInvalidMessagesArray > 0) + { + int i; + + elog(LOG, "numSharedInvalidMessagesArray = %d", numSharedInvalidMessagesArray); + + Assert(SharedInvalidMessagesArray != NULL); + + for (i = 0; i < numSharedInvalidMessagesArray; i++) + { + SharedInvalidationMessage *msg = SharedInvalidMessagesArray + i; + + if (msg->id >= 0) + elog(LOG, "catcache id %d", msg->id); + else if (msg->id == SHAREDINVALRELCACHE_ID) + elog(LOG, "relcache id %d", msg->id); + else if (msg->id == SHAREDINVALSMGR_ID) + elog(LOG, "smgr cache id %d", msg->id); + } + } + #endif + + *msgs = SharedInvalidMessagesArray; + + return numSharedInvalidMessagesArray; + } /* * AtEOXact_Inval *************** *** 1041,1046 **** --- 1172,1213 ---- Assert(transInvalInfo->CurrentCmdInvalidMsgs.cclist == NULL); Assert(transInvalInfo->CurrentCmdInvalidMsgs.rclist == NULL); Assert(transInvalInfo->RelcacheInitFileInval == false); + + SharedInvalidMessagesArray = NULL; + numSharedInvalidMessagesArray = 0; + } + + /* + * General function to log the SharedInvalidMessagesArray. Only current + * caller is EndNonTransactionalInvalidation(), but that may change. + */ + static void + LogSharedInvalidMessagesArray(void) + { + XLogRecData rdata[2]; + xl_rel_inval xlrec; + + if (numSharedInvalidMessagesArray == 0) + return; + + START_CRIT_SECTION(); + + xlrec.nmsgs = numSharedInvalidMessagesArray; + + rdata[0].data = (char *) (&xlrec); + rdata[0].len = MinSizeOfRelationInval; + rdata[0].buffer = InvalidBuffer; + + rdata[0].next = &(rdata[1]); + rdata[1].data = (char *) SharedInvalidMessagesArray; + rdata[1].len = numSharedInvalidMessagesArray * + sizeof(SharedInvalidationMessage); + rdata[1].buffer = InvalidBuffer; + rdata[1].next = NULL; + + (void) XLogInsert(RM_RELATION_ID, XLOG_RELATION_INVAL, rdata); + + END_CRIT_SECTION(); } /* *************** *** 1081,1087 **** --- 1248,1274 ---- ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs, SendSharedInvalidMessages); + /* + * Write invalidation messages to WAL. This is not required for + * recovery, it is only required for standby servers. It's fairly + * low overhead so don't worry. 
This allows us to trigger inval + * messages on the standby as soon as we see these records. + * see relation_redo_inval() + * + * Note that transactional validation uses an array attached to + * a WAL commit record, so these messages are rare. + */ + ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs, + MakeSharedInvalidMessagesArray); + LogSharedInvalidMessagesArray(); + /* Clean up and release memory */ + + /* XXX: some questions and thoughts here: + * not sure where/how to allocate memory correctly in this case + * and how to free it afterwards. Think some more on this. + */ + for (chunk = transInvalInfo->CurrentCmdInvalidMsgs.cclist; chunk != NULL; chunk = next) *************** *** 1235,1237 **** --- 1422,1818 ---- ++relcache_callback_count; } + + /* + * ----------------------------------------------------- + * Standby wait timers and backend cancel logic + * ----------------------------------------------------- + */ + + static void + InitStandbyDelayTimers(int *currentDelay_ms, int *standbyWait_ms) + { + *currentDelay_ms = GetLatestReplicationDelay(); + + /* + * If replication delay is enormously huge, just treat that as + * zero and work up from there. This prevents us from acting + * foolishly when replaying old log files. + */ + if (*currentDelay_ms < 0) + *currentDelay_ms = 0; + + #define STANDBY_INITIAL_WAIT_MS 1 + *standbyWait_ms = STANDBY_INITIAL_WAIT_MS; + } + + /* + * Standby wait logic for XactResolveRedoVisibilityConflicts(). + * We wait here for a while then return. If we decide wecan't wait any + * more then we return true, if we can wait some more return false. + */ + static bool + WaitExceedsMaxStandbyDelay(int *currentDelay_ms, int *standbyWait_ms) + { + int maxStandbyDelay_ms = maxStandbyDelay * 1000; + + /* + * If the server is already further behind than we would + * like then no need to wait or do more complex logic. + * max_standby_delay = 0 means wait for ever, if necessary + */ + if (maxStandbyDelay >= 0 && + *currentDelay_ms > maxStandbyDelay_ms) + return true; + + /* + * Sleep, then do bookkeeping. + */ + pg_usleep(*standbyWait_ms * 1000L); + *currentDelay_ms += *standbyWait_ms; + + /* + * Progressively increase the sleep times. + */ + *standbyWait_ms *= 2; + if (*standbyWait_ms > 1000) + *standbyWait_ms = 1000; + + /* + * Re-test our exit criteria + */ + if (maxStandbyDelay >= 0 && + *currentDelay_ms > maxStandbyDelay_ms) + return true; + + return false; + } + + void + ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist, + char *reason) + { + int standbyWait_ms; + int currentDelay_ms; + bool logged; + int wontDieWait = 1; + + InitStandbyDelayTimers(¤tDelay_ms, &standbyWait_ms); + logged = false; + + while (VirtualTransactionIdIsValid(*waitlist)) + { + /* + * log that we have been waiting for a while now... + */ + if (!logged && standbyWait_ms > 500) + { + elog(trace_recovery(DEBUG5), + "virtual transaction %u/%u is blocking %s", + waitlist->backendId, + waitlist->localTransactionId, + reason); + logged = true; + } + + if (ConditionalVirtualXactLockTableWait(*waitlist)) + { + waitlist++; + InitStandbyDelayTimers(¤tDelay_ms, &standbyWait_ms); + logged = false; + } + else if (WaitExceedsMaxStandbyDelay(¤tDelay_ms, + &standbyWait_ms)) + { + /* + * Now find out who to throw out of the balloon. + */ + int pid; + + Assert(VirtualTransactionIdIsValid(*waitlist)); + pid = VirtualTransactionIdGetPid(*waitlist); + + /* + * Kill the pid if it's still here. If not, that's what we wanted + * so ignore any errors. 
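/*
 * [Editor's illustrative aside: not part of the patch.] The two helpers
 * above (InitStandbyDelayTimers and WaitExceedsMaxStandbyDelay) implement a
 * capped exponential backoff: sleep 1 ms, then 2 ms, 4 ms, ... up to 1
 * second per step, until the accumulated delay exceeds the configured
 * maximum. A stand-alone version of the same wait loop, with no PostgreSQL
 * dependencies (the function and its parameters are hypothetical):
 */
#include <stdbool.h>

/* Returns total milliseconds waited; a negative max_delay_ms means wait
 * indefinitely in this sketch. condition_cleared() and wait_ms() stand in
 * for the lock test and pg_usleep() respectively. */
static int
backoff_wait(int max_delay_ms,
			 bool (*condition_cleared) (void),
			 void (*wait_ms) (int))
{
	int			waited_ms = 0;
	int			step_ms = 1;	/* cf. STANDBY_INITIAL_WAIT_MS */

	while (!condition_cleared())
	{
		if (max_delay_ms >= 0 && waited_ms > max_delay_ms)
			break;				/* caller must now cancel the blocker */

		wait_ms(step_ms);
		waited_ms += step_ms;

		step_ms *= 2;			/* progressively longer sleeps */
		if (step_ms > 1000)
			step_ms = 1000;		/* cap individual sleeps at 1 second */
	}
	return waited_ms;
}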
+ */ + if (pid != 0) + { + elog(LOG, + "recovery cancels activity of virtual transaction %u/%u pid %d " + "because it blocks %s (current delay now %d secs)", + waitlist->backendId, + waitlist->localTransactionId, + pid, reason, + currentDelay_ms / 1000); + kill(pid, SIGINT); + + /* wait awhile for it to die */ + pg_usleep(wontDieWait * 5000L); + wontDieWait *= 2; + } + } + } + } + + /* + * Locking in Recovery Mode + * + * All locks are held by the Startup process using a single virtual + * transaction. This implementation is both simpler and, in some senses, + * more correct. The locks held mean "some original transaction held + * this lock, so query access is not allowed at this time". So the Startup + * process is the proxy by which the original locks are implemented. + * + * We only keep track of AccessExclusiveLocks, which are only ever held by + * one transaction on one relation. So we don't worry too much about keeping + * track of which xid holds which lock, we just track which slot holds the + * lock. This makes this scheme self-cleaning in case lock holders die + * without leaving a trace in the WAL. + * + * We keep a single dynamically expandable locks list in local memory. + * List elements use type xl_rel_lock, since the WAL record type exactly + * matches the information that we need to keep track of. + * + * We use session locks rather than normal locks so we don't need owners. + */ + + /* called by relation_redo_lock() */ + static void + RelationAddRecoveryLock(xl_rel_lock *lockRequest) + { + xl_rel_lock *newlock; + LOCKTAG locktag; + MemoryContext old_context; + + elog(trace_recovery(DEBUG4), + "adding recovery lock: slot %d db %d rel %d", + lockRequest->slotId, lockRequest->dbOid, lockRequest->relOid); + + Assert(OidIsValid(lockRequest->dbOid) && OidIsValid(lockRequest->relOid)); + + if (RelationLockContext == NULL) + RelationLockContext = AllocSetContextCreate(TopMemoryContext, + "RelationLocks", + ALLOCSET_DEFAULT_MINSIZE, + ALLOCSET_DEFAULT_INITSIZE, + ALLOCSET_DEFAULT_MAXSIZE); + + old_context = MemoryContextSwitchTo(RelationLockContext); + newlock = palloc(sizeof(xl_rel_lock)); + MemoryContextSwitchTo(old_context); + + newlock->slotId = lockRequest->slotId; + newlock->dbOid = lockRequest->dbOid; + newlock->relOid = lockRequest->relOid; + RecoveryLockList = lappend(RecoveryLockList, newlock); + + /* + * Attempt to acquire the lock as requested. + */ + SET_LOCKTAG_RELATION(locktag, newlock->dbOid, newlock->relOid); + + /* + * Wait for the lock to clear, or kill anyone in our way. Not a + * completely foolproof way of getting the lock, but we cannot + * afford to sit and wait for the lock indefinitely. This is + * one reason to reduce the strength of various locks in 8.4. + */ + while (LockAcquire(&locktag, AccessExclusiveLock, true, true) + == LOCKACQUIRE_NOT_AVAIL) + { + VirtualTransactionId *old_lockholders; + + old_lockholders = GetLockConflicts(&locktag, AccessExclusiveLock); + ResolveRecoveryConflictWithVirtualXIDs(old_lockholders, + "exclusive locks"); + } + } + + /* + * Called during xact_redo_commit() and xact_redo_abort() when InArchiveRecovery + * to remove any AccessExclusiveLocks requested by a transaction. + * + * Remove all locks for this slotId from the RecoveryLockList.
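/*
 * [Editor's illustrative sketch: not part of the patch.] This extract does
 * not show where the primary writes XLOG_RELATION_LOCK records; the sketch
 * below only illustrates what such an insert could look like, reusing the
 * xl_rel_lock layout and the XLogInsert() pattern already used by
 * LogSharedInvalidMessagesArray() above. The function name is hypothetical.
 */
#include "postgres.h"
#include "access/xlog.h"		/* XLogRecData, XLogInsert() */
#include "access/rmgr.h"		/* RM_RELATION_ID */
#include "storage/sinval.h"		/* xl_rel_lock, XLOG_RELATION_LOCK */

static void
LogAccessExclusiveLock(int slotId, Oid dbOid, Oid relOid)
{
	xl_rel_lock xlrec;
	XLogRecData rdata;

	xlrec.slotId = slotId;
	xlrec.dbOid = dbOid;
	xlrec.relOid = relOid;

	rdata.data = (char *) &xlrec;
	rdata.len = sizeof(xl_rel_lock);
	rdata.buffer = InvalidBuffer;
	rdata.next = NULL;

	(void) XLogInsert(RM_RELATION_ID, XLOG_RELATION_LOCK, &rdata);
}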
+ */ + void + RelationReleaseRecoveryLocks(int slotId) + { + ListCell *l; + LOCKTAG locktag; + List *deletionList = NIL; + + elog(trace_recovery(DEBUG4), + "removing recovery locks: slot %d", slotId); + + /* + * Release all matching locks and identify list elements to remove + */ + foreach(l, RecoveryLockList) + { + xl_rel_lock *lock = (xl_rel_lock *) lfirst(l); + + elog(trace_recovery(DEBUG4), + "releasing recovery lock: slot %d db %d rel %d", + lock->slotId, lock->dbOid, lock->relOid); + + if (lock->slotId == slotId) + { + SET_LOCKTAG_RELATION(locktag, lock->dbOid, lock->relOid); + if (!LockRelease(&locktag, AccessExclusiveLock, true)) + elog(trace_recovery(LOG), + "RecoveryLockList contains entry for lock " + "no longer recorded by lock manager " + "slot %d database %d relation %d", + lock->slotId, lock->dbOid, lock->relOid); + deletionList = lappend(deletionList, lock); + } + } + + /* + * Now remove the elements from RecoveryLockList. We can't navigate + * the list at the same time as deleting multiple elements from it. + */ + foreach(l, deletionList) + { + xl_rel_lock *lock = (xl_rel_lock *) lfirst(l); + + elog(trace_recovery(DEBUG4), + "removing recovery lock from list: slot %d db %d rel %d", + lock->slotId, lock->dbOid, lock->relOid); + + RecoveryLockList = list_delete_ptr(RecoveryLockList, lock); + pfree(lock); + } + } + + /* + * Called at end of recovery and when we see a shutdown checkpoint. + */ + void + RelationClearRecoveryLocks(void) + { + ListCell *l; + LOCKTAG locktag; + + elog(LOG, "clearing recovery locks"); + + foreach(l, RecoveryLockList) + { + xl_rel_lock *lock = (xl_rel_lock *) lfirst(l); + + SET_LOCKTAG_RELATION(locktag, lock->dbOid, lock->relOid); + if (!LockRelease(&locktag, AccessExclusiveLock, true)) + elog(trace_recovery(LOG), + "RecoveryLockList contains entry for lock " + "no longer recorded by lock manager " + "slot %d database %d relation %d", + lock->slotId, lock->dbOid, lock->relOid); + RecoveryLockList = list_delete_ptr(RecoveryLockList, lock); + pfree(lock); + } + } + + /* + * -------------------------------------------------- + * Recovery handling for Rmgr RM_RELATION_ID + * -------------------------------------------------- + */ + + /* + * Redo for relation lock messages + */ + static void + relation_redo_lock(xl_rel_lock *xlrec) + { + RelationAddRecoveryLock(xlrec); + } + + /* + * Redo for relation invalidation messages + */ + static void + relation_redo_inval(xl_rel_inval *xlrec) + { + SharedInvalidationMessage *msgs = &(xlrec->msgs[0]); + int nmsgs = xlrec->nmsgs; + + Assert(nmsgs > 0); /* else we should not have written a record */ + + /* + * Smack them straight onto the queue and we're done. This is safe + * because the only writer of these messages is non-transactional + * invalidation. 
+ */ + SendSharedInvalidMessages(msgs, nmsgs); + } + + void + relation_redo(XLogRecPtr lsn, XLogRecord *record) + { + uint8 info = record->xl_info & ~XLR_INFO_MASK; + + if (info == XLOG_RELATION_INVAL) + { + xl_rel_inval *xlrec = (xl_rel_inval *) XLogRecGetData(record); + + relation_redo_inval(xlrec); + } + else if (info == XLOG_RELATION_LOCK) + { + xl_rel_lock *xlrec = (xl_rel_lock *) XLogRecGetData(record); + + relation_redo_lock(xlrec); + } + else + elog(PANIC, "relation_redo: unknown op code %u", info); + } + + static void + relation_desc_inval(StringInfo buf, xl_rel_inval *xlrec) + { + SharedInvalidationMessage *msgs = &(xlrec->msgs[0]); + int nmsgs = xlrec->nmsgs; + + appendStringInfo(buf, "nmsgs %d;", nmsgs); + + if (nmsgs > 0) + { + int i; + + for (i = 0; i < nmsgs; i++) + { + SharedInvalidationMessage *msg = msgs + i; + + if (msg->id >= 0) + appendStringInfo(buf, "catcache id %d", msg->id); + else if (msg->id == SHAREDINVALRELCACHE_ID) + appendStringInfo(buf, "relcache "); + else if (msg->id == SHAREDINVALSMGR_ID) + appendStringInfo(buf, "smgr "); + } + } + } + + void + relation_desc(StringInfo buf, uint8 xl_info, char *rec) + { + uint8 info = xl_info & ~XLR_INFO_MASK; + + if (info == XLOG_RELATION_INVAL) + { + xl_rel_inval *xlrec = (xl_rel_inval *) rec; + + appendStringInfo(buf, "inval: "); + relation_desc_inval(buf, xlrec); + } + else if (info == XLOG_RELATION_LOCK) + { + xl_rel_lock *xlrec = (xl_rel_lock *) rec; + + appendStringInfo(buf, "exclusive relation lock: slot %d db %d rel %d", + xlrec->slotId, xlrec->dbOid, xlrec->relOid); + } + else + appendStringInfo(buf, "UNKNOWN"); + } Index: src/backend/utils/error/elog.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/utils/error/elog.c,v retrieving revision 1.209 diff -c -r1.209 elog.c *** src/backend/utils/error/elog.c 27 Oct 2008 19:37:21 -0000 1.209 --- src/backend/utils/error/elog.c 1 Nov 2008 14:49:38 -0000 *************** *** 2579,2581 **** --- 2579,2598 ---- return false; } + + /* + * If trace_recovery_messages is set to make this visible, then show as LOG, + * else display as whatever level is set. It may still be shown, but only + * if log_min_messages is set lower than trace_recovery_messages. + * + * Intention is to keep this for at least the whole of the 8.4 production + * release, so we can more easily diagnose production problems in the field. + */ + int + trace_recovery(int trace_level) + { + if (trace_level >= trace_recovery_messages) + return LOG; + + return trace_level; + } Index: src/backend/utils/init/flatfiles.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/utils/init/flatfiles.c,v retrieving revision 1.35 diff -c -r1.35 flatfiles.c *** src/backend/utils/init/flatfiles.c 12 Jun 2008 09:12:31 -0000 1.35 --- src/backend/utils/init/flatfiles.c 1 Nov 2008 14:49:38 -0000 *************** *** 678,686 **** /* * This routine is called once during database startup, after completing * WAL replay if needed. Its purpose is to sync the flat files with the ! * current state of the database tables. This is particularly important ! * during PITR operation, since the flat files will come from the ! * base backup which may be far out of sync with the current state. * * In theory we could skip rebuilding the flat files if no WAL replay * occurred, but it seems best to just do it always. 
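/*
 * [Editor's illustrative aside: not part of the patch.] Recovery-side code
 * uses the trace_recovery() helper added to elog.c above to pick the
 * effective log level, so a message written as DEBUG2 is promoted to LOG
 * whenever trace_recovery_messages is set at or below DEBUG2, without
 * raising log_min_messages globally. The helper function below is
 * hypothetical; elog() and trace_recovery() are real.
 */
#include "postgres.h"
#include "miscadmin.h"			/* trace_recovery(), trace_recovery_messages */

static void
report_recovery_progress(const char *phase)
{
	elog(trace_recovery(DEBUG2),
		 "recovery progress: %s", phase);
}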
We have to --- 678,687 ---- /* * This routine is called once during database startup, after completing * WAL replay if needed. Its purpose is to sync the flat files with the ! * current state of the database tables. ! * ! * In 8.4 we also run this during xact_redo_commit() if the transaction ! * wrote a new database or auth flat file. * * In theory we could skip rebuilding the flat files if no WAL replay * occurred, but it seems best to just do it always. We have to *************** *** 696,702 **** * something corrupt in the authid/authmem catalogs. */ void ! BuildFlatFiles(bool database_only) { ResourceOwner owner; RelFileNode rnode; --- 697,703 ---- * something corrupt in the authid/authmem catalogs. */ void ! BuildFlatFiles(bool database_only, bool acquire_locks, bool release_locks) { ResourceOwner owner; RelFileNode rnode; *************** *** 713,723 **** rnode.dbNode = 0; rnode.relNode = DatabaseRelationId; /* * We don't have any hope of running a real relcache, but we can use the * same fake-relcache facility that WAL replay uses. - * - * No locking is needed because no one else is alive yet. */ rel_db = CreateFakeRelcacheEntry(rnode); write_database_file(rel_db, true); --- 714,736 ---- rnode.dbNode = 0; rnode.relNode = DatabaseRelationId; + if (!acquire_locks && release_locks) + elog(FATAL, "BuildFlatFiles called with invalid parameters"); + + if (acquire_locks) + { + #ifdef HAVE_RECOVERY_LOCKING + LockSharedObject(DatabaseRelationId, InvalidOid, 0, + AccessExclusiveLock); + + LockSharedObject(AuthIdRelationId, InvalidOid, 0, + AccessExclusiveLock); + #endif + } + /* * We don't have any hope of running a real relcache, but we can use the * same fake-relcache facility that WAL replay uses. */ rel_db = CreateFakeRelcacheEntry(rnode); write_database_file(rel_db, true); *************** *** 744,749 **** --- 757,778 ---- CurrentResourceOwner = NULL; ResourceOwnerDelete(owner); + + /* + * If we don't release locks it is because we presume that all + * locks will be released by the end of xact_redo_commit(). + */ + if (release_locks) + { + #ifdef HAVE_RECOVERY_LOCKING + XXXR change these to lock releases + LockSharedObject(DatabaseRelationId, InvalidOid, 0, + AccessExclusiveLock); + + LockSharedObject(AuthIdRelationId, InvalidOid, 0, + AccessExclusiveLock); + #endif + } } *************** *** 859,864 **** --- 888,907 ---- ForceSyncCommit(); } + /* + * Exported to allow transaction commit to set flags to perform this in redo + */ + bool + AtEOXact_Database_FlatFile_Update_Needed(void) + { + return TransactionIdIsValid(database_file_update_subid); + } + + bool + AtEOXact_Auth_FlatFile_Update_Needed(void) + { + return TransactionIdIsValid(auth_file_update_subid); + } /* * This routine is called during transaction prepare. Index: src/backend/utils/init/postinit.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/utils/init/postinit.c,v retrieving revision 1.186 diff -c -r1.186 postinit.c *** src/backend/utils/init/postinit.c 23 Sep 2008 09:20:36 -0000 1.186 --- src/backend/utils/init/postinit.c 1 Nov 2008 14:49:38 -0000 *************** *** 440,446 **** */ MyBackendId = InvalidBackendId; ! SharedInvalBackendInit(); if (MyBackendId > MaxBackends || MyBackendId <= 0) elog(FATAL, "bad backend id: %d", MyBackendId); --- 440,446 ---- */ MyBackendId = InvalidBackendId; ! 
SharedInvalBackendInit(false); if (MyBackendId > MaxBackends || MyBackendId <= 0) elog(FATAL, "bad backend id: %d", MyBackendId); *************** *** 489,497 **** --- 489,503 ---- * Start a new transaction here before first access to db, and get a * snapshot. We don't have a use for the snapshot itself, but we're * interested in the secondary effect that it sets RecentGlobalXmin. + * If we are connecting during recovery, make sure the initial + * transaction is read only and force all subsequent transactions + * that way also. */ if (!bootstrap) { + if (IsRecoveryProcessingMode()) + SetConfigOption("default_transaction_read_only", "true", + PGC_POSTMASTER, PGC_S_OVERRIDE); StartTransactionCommand(); (void) GetTransactionSnapshot(); } *************** *** 515,521 **** */ if (!bootstrap) LockSharedObject(DatabaseRelationId, MyDatabaseId, 0, ! RowExclusiveLock); /* * Recheck the flat file copy of pg_database to make sure the target --- 521,527 ---- */ if (!bootstrap) LockSharedObject(DatabaseRelationId, MyDatabaseId, 0, ! (IsRecoveryProcessingMode() ? AccessShareLock : RowExclusiveLock)); /* * Recheck the flat file copy of pg_database to make sure the target Index: src/backend/utils/misc/guc.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/utils/misc/guc.c,v retrieving revision 1.475 diff -c -r1.475 guc.c *** src/backend/utils/misc/guc.c 6 Oct 2008 13:05:36 -0000 1.475 --- src/backend/utils/misc/guc.c 1 Nov 2008 14:49:38 -0000 *************** *** 114,119 **** --- 114,121 ---- extern bool synchronize_seqscans; extern bool fullPageWrites; + int trace_recovery_messages = DEBUG1; + #ifdef TRACE_SORT extern bool trace_sort; #endif *************** *** 2588,2593 **** --- 2590,2605 ---- }, { + {"trace_recovery_messages", PGC_SUSET, LOGGING_WHEN, + gettext_noop("Sets the message levels that are logged during recovery."), + gettext_noop("Each level includes all the levels that follow it. The later" + " the level, the fewer messages are sent.") + }, + &trace_recovery_messages, + DEBUG1, server_message_level_options, NULL, NULL + }, + + { {"track_functions", PGC_SUSET, STATS_COLLECTOR, gettext_noop("Collects function-level statistics on database activity."), NULL Index: src/backend/utils/time/tqual.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/utils/time/tqual.c,v retrieving revision 1.110 diff -c -r1.110 tqual.c *** src/backend/utils/time/tqual.c 26 Mar 2008 16:20:47 -0000 1.110 --- src/backend/utils/time/tqual.c 1 Nov 2008 14:49:38 -0000 *************** *** 86,92 **** SetHintBits(HeapTupleHeader tuple, Buffer buffer, uint16 infomask, TransactionId xid) { ! if (TransactionIdIsValid(xid)) { /* NB: xid must be known committed here! */ XLogRecPtr commitLSN = TransactionIdGetCommitLSN(xid); --- 86,92 ---- SetHintBits(HeapTupleHeader tuple, Buffer buffer, uint16 infomask, TransactionId xid) { ! if (!IsRecoveryProcessingMode() && TransactionIdIsValid(xid)) { /* NB: xid must be known committed here! */ XLogRecPtr commitLSN = TransactionIdGetCommitLSN(xid); *************** *** 1238,1263 **** return true; /* ! * If the snapshot contains full subxact data, the fastest way to check ! * things is just to compare the given XID against both subxact XIDs and ! * top-level XIDs. If the snapshot overflowed, we have to use pg_subtrans ! * to convert a subxact XID to its parent XID, but then we need only look ! * at top-level XIDs not subxacts. 
*/ - if (snapshot->subxcnt >= 0) - { - /* full data, so search subxip */ - int32 j; ! for (j = 0; j < snapshot->subxcnt; j++) ! { ! if (TransactionIdEquals(xid, snapshot->subxip[j])) return true; } ! /* not there, fall through to search xip[] */ ! } ! else { /* overflowed, so convert xid to top-level */ xid = SubTransGetTopmostTransaction(xid); --- 1238,1289 ---- return true; /* ! * Our strategy for checking xids changed in 8.4. Prior to 8.4 ! * we either checked the subxid cache on the snapshot or we ! * checked subtrans. That was much more efficient than just using ! * subtrans but it has some problems. First, as soon as *any* ! * transaction had more than 64 transactions we forced *all* ! * snapshots to check against subtrans, giving a sharp modal ! * change in behaviour. Second because we either checked subtrans ! * or the snapshot, we were forced to place entries in subtrans ! * in case the snapshot later overflowed, even if we never ! * actually checked subtrans. ! * ! * In 8.4 we improve on that scheme in a number of ways. As before ! * we check subtrans if the snapshot has overflowed. We *also* ! * check the subxid cache. This has two benefits: first the ! * behaviour degrades gracefully when the cache overflows, so we ! * retain much of its benefit if it has only just overflowed. ! * Second, a transaction doesn't need to insert entries into ! * subtrans until its own personal subxid cache overflows. This ! * means entries into subtrans become significantly rarer, ! * perhaps less than 1% of the previous insert rate, giving ! * considerable benefit for transactions using only a few ! * subtransactions. ! * ! * This behaviour is also necessary for allowing snapshots to work ! * correctly on a standby server. By this subtle change of behaviour ! * we can now utilise the subxid cache to store "unobserved xids" ! * of which we can infer their existence from watching the ! * arrival sequence of newly observed transactionids in the WAL. */ ! /* ! * First, compare the given XID against cached subxact XIDs. ! */ ! for (i = 0; i < snapshot->subxcnt; i++) ! { ! if (TransactionIdEquals(xid, snapshot->subxip[i])) return true; } ! /* ! * If the snapshot overflowed and we haven't already located the xid ! * we also have to consult pg_subtrans. We use subtrans to convert a ! * subxact XID to its parent XID, so that we can then check the status ! * of the top-level TransactionId. ! */ ! if (snapshot->suboverflowed) { /* overflowed, so convert xid to top-level */ xid = SubTransGetTopmostTransaction(xid); *************** *** 1270,1275 **** --- 1296,1305 ---- return false; } + /* + * By now xid is either not present, or a top-level xid. So now + * we just need to check the main transaction ids. 
+ */ for (i = 0; i < snapshot->xcnt; i++) { if (TransactionIdEquals(xid, snapshot->xip[i])) Index: src/bin/pg_controldata/pg_controldata.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/bin/pg_controldata/pg_controldata.c,v retrieving revision 1.41 diff -c -r1.41 pg_controldata.c *** src/bin/pg_controldata/pg_controldata.c 24 Sep 2008 08:59:42 -0000 1.41 --- src/bin/pg_controldata/pg_controldata.c 1 Nov 2008 14:49:38 -0000 *************** *** 197,202 **** --- 197,205 ---- printf(_("Minimum recovery ending location: %X/%X\n"), ControlFile.minRecoveryPoint.xlogid, ControlFile.minRecoveryPoint.xrecoff); + printf(_("Minimum safe starting location: %X/%X\n"), + ControlFile.minSafeStartPoint.xlogid, + ControlFile.minSafeStartPoint.xrecoff); printf(_("Maximum data alignment: %u\n"), ControlFile.maxAlign); /* we don't print floatFormat since can't say much useful about it */ Index: src/bin/pg_resetxlog/pg_resetxlog.c =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/bin/pg_resetxlog/pg_resetxlog.c,v retrieving revision 1.68 diff -c -r1.68 pg_resetxlog.c *** src/bin/pg_resetxlog/pg_resetxlog.c 24 Sep 2008 09:00:44 -0000 1.68 --- src/bin/pg_resetxlog/pg_resetxlog.c 1 Nov 2008 14:49:38 -0000 *************** *** 595,600 **** --- 595,602 ---- ControlFile.prevCheckPoint.xrecoff = 0; ControlFile.minRecoveryPoint.xlogid = 0; ControlFile.minRecoveryPoint.xrecoff = 0; + ControlFile.minSafeStartPoint.xlogid = 0; + ControlFile.minSafeStartPoint.xrecoff = 0; /* Now we can force the recorded xlog seg size to the right thing. */ ControlFile.xlog_seg_size = XLogSegSize; Index: src/include/miscadmin.h =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/miscadmin.h,v retrieving revision 1.203 diff -c -r1.203 miscadmin.h *** src/include/miscadmin.h 9 Oct 2008 17:24:05 -0000 1.203 --- src/include/miscadmin.h 1 Nov 2008 14:49:38 -0000 *************** *** 221,226 **** --- 221,232 ---- /* in tcop/postgres.c */ extern void check_stack_depth(void); + /* in tcop/utility.c */ + extern void PreventCommandDuringRecovery(void); + + /* in utils/misc/guc.c */ + extern int trace_recovery_messages; + int trace_recovery(int trace_level); /***************************************************************************** * pdir.h -- * Index: src/include/access/heapam.h =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/access/heapam.h,v retrieving revision 1.139 diff -c -r1.139 heapam.h *** src/include/access/heapam.h 8 Oct 2008 01:14:44 -0000 1.139 --- src/include/access/heapam.h 1 Nov 2008 14:49:38 -0000 *************** *** 121,131 **** extern XLogRecPtr log_heap_move(Relation reln, Buffer oldbuf, ItemPointerData from, Buffer newbuf, HeapTuple newtup); extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer, OffsetNumber *redirected, int nredirected, OffsetNumber *nowdead, int ndead, OffsetNumber *nowunused, int nunused, ! 
bool redirect_move); extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid, OffsetNumber *offsets, int offcnt); --- 121,133 ---- extern XLogRecPtr log_heap_move(Relation reln, Buffer oldbuf, ItemPointerData from, Buffer newbuf, HeapTuple newtup); + extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode, + TransactionId latestRemovedXid); extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer, OffsetNumber *redirected, int nredirected, OffsetNumber *nowdead, int ndead, OffsetNumber *nowunused, int nunused, ! TransactionId latestRemovedXid, bool redirect_move); extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid, OffsetNumber *offsets, int offcnt); Index: src/include/access/htup.h =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/access/htup.h,v retrieving revision 1.102 diff -c -r1.102 htup.h *** src/include/access/htup.h 28 Oct 2008 15:51:03 -0000 1.102 --- src/include/access/htup.h 1 Nov 2008 14:49:38 -0000 *************** *** 580,585 **** --- 580,586 ---- #define XLOG_HEAP2_FREEZE 0x00 #define XLOG_HEAP2_CLEAN 0x10 #define XLOG_HEAP2_CLEAN_MOVE 0x20 + #define XLOG_HEAP2_CLEANUP_INFO 0x30 /* * All what we need to find changed tuple *************** *** 664,669 **** --- 665,671 ---- { RelFileNode node; BlockNumber block; + TransactionId latestRemovedXid; uint16 nredirected; uint16 ndead; /* OFFSET NUMBERS FOLLOW */ *************** *** 671,676 **** --- 673,691 ---- #define SizeOfHeapClean (offsetof(xl_heap_clean, ndead) + sizeof(uint16)) + /* + * Cleanup_info is required in some cases during a lazy VACUUM. + * Used for reporting the results of HeapTupleHeaderAdvanceLatestRemovedXid() + * see vacuumlazy.c for full explanation + */ + typedef struct xl_heap_cleanup_info + { + RelFileNode node; + TransactionId latestRemovedXid; + } xl_heap_cleanup_info; + + #define SizeOfHeapCleanupInfo (sizeof(xl_heap_cleanup_info)) + /* This is for replacing a page's contents in toto */ /* NB: this is used for indexes as well as heaps */ typedef struct xl_heap_newpage *************** *** 714,719 **** --- 729,737 ---- #define SizeOfHeapFreeze (offsetof(xl_heap_freeze, cutoff_xid) + sizeof(TransactionId)) + extern void HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple, + TransactionId *latestRemovedXid); + /* HeapTupleHeader functions implemented in utils/time/combocid.c */ extern CommandId HeapTupleHeaderGetCmin(HeapTupleHeader tup); extern CommandId HeapTupleHeaderGetCmax(HeapTupleHeader tup); Index: src/include/access/rmgr.h =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/access/rmgr.h,v retrieving revision 1.18 diff -c -r1.18 rmgr.h *** src/include/access/rmgr.h 30 Sep 2008 10:52:13 -0000 1.18 --- src/include/access/rmgr.h 1 Nov 2008 14:49:38 -0000 *************** *** 24,29 **** --- 24,30 ---- #define RM_TBLSPC_ID 5 #define RM_MULTIXACT_ID 6 #define RM_FREESPACE_ID 7 + #define RM_RELATION_ID 8 #define RM_HEAP2_ID 9 #define RM_HEAP_ID 10 #define RM_BTREE_ID 11 Index: src/include/access/xact.h =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/access/xact.h,v retrieving revision 1.95 diff -c -r1.95 xact.h *** src/include/access/xact.h 11 Aug 2008 11:05:11 -0000 1.95 --- src/include/access/xact.h 1 Nov 2008 14:49:38 -0000 *************** *** 17,22 **** --- 17,23 ---- #include "access/xlog.h" #include 
"nodes/pg_list.h" #include "storage/relfilenode.h" + #include "utils/snapshot.h" #include "utils/timestamp.h" *************** *** 84,111 **** --- 85,156 ---- #define XLOG_XACT_ABORT 0x20 #define XLOG_XACT_COMMIT_PREPARED 0x30 #define XLOG_XACT_ABORT_PREPARED 0x40 + #define XLOG_XACT_ASSIGNMENT 0x50 + #define XLOG_XACT_RUNNING_XACTS 0x60 + /* 0x70 can also be used, if required */ + + typedef struct xl_xact_assignment + { + TransactionId xassign; /* assigned xid */ + TransactionId xparent; /* assigned xids parent, if any */ + bool isSubXact; /* is a subtransaction */ + int slotId; /* slotId in procarray */ + } xl_xact_assignment; + + /* + * xl_xact_running_xacts is in utils/snapshot.h so it can be passed + * around to the same places as snapshots. Not snapmgr.h + */ typedef struct xl_xact_commit { TimestampTz xact_time; /* time of commit */ + int slotId; /* slotId in procarray */ + uint xinfo; /* info flags */ int nrels; /* number of RelFileForks */ int nsubxacts; /* number of subtransaction XIDs */ + int nmsgs; /* number of shared inval msgs */ /* Array of RelFileFork(s) to drop at commit */ RelFileFork xnodes[1]; /* VARIABLE LENGTH ARRAY */ /* ARRAY OF COMMITTED SUBTRANSACTION XIDs FOLLOWS */ + /* ARRAY OF SHARED INVALIDATION MESSAGES FOLLOWS */ } xl_xact_commit; #define MinSizeOfXactCommit offsetof(xl_xact_commit, xnodes) + /* + * These flags are set in the xinfo fields of transaction + * completion WAL records. They indicate a number of actions + * that need to occur when emulating transaction completion. + * They are named XactCompletion... to differentiate them from + * EOXact... routines which run at the end of the original + * transaction completion. + */ + #define XACT_COMPLETION_UNMARKED_SUBXIDS 0x01 + + /* These next states only occur on commit record types */ + #define XACT_COMPLETION_UPDATE_DB_FILE 0x02 + #define XACT_COMPLETION_UPDATE_AUTH_FILE 0x04 + #define XACT_COMPLETION_UPDATE_RELCACHE_FILE 0x08 + + /* Access macros for above flags */ + #define XactCompletionHasUnMarkedSubxids(xlrec) ((xlrec)->xinfo & XACT_COMPLETION_UNMARKED_SUBXIDS) + #define XactCompletionUpdateDBFile(xlrec) ((xlrec)->xinfo & XACT_COMPLETION_UPDATE_DB_FILE) + #define XactCompletionUpdateAuthFile(xlrec) ((xlrec)->xinfo & XACT_COMPLETION_UPDATE_AUTH_FILE) + #define XactCompletionRelcacheInitFileInval(xlrec) ((xlrec)->xinfo & XACT_COMPLETION_UPDATE_RELCACHE_FILE) + typedef struct xl_xact_abort { TimestampTz xact_time; /* time of abort */ + int slotId; /* slotId in procarray */ + uint xinfo; /* info flags */ int nrels; /* number of RelFileForks */ int nsubxacts; /* number of subtransaction XIDs */ /* Array of RelFileFork(s) to drop at abort */ RelFileFork xnodes[1]; /* VARIABLE LENGTH ARRAY */ /* ARRAY OF ABORTED SUBTRANSACTION XIDs FOLLOWS */ } xl_xact_abort; + /* Note the intentional lack of an invalidation message array c.f. 
commit */ #define MinSizeOfXactAbort offsetof(xl_xact_abort, xnodes) *************** *** 185,190 **** --- 230,246 ---- extern int xactGetCommittedChildren(TransactionId **ptr); + extern void LogCurrentRunningXacts(void); + extern bool IsRunningXactDataIsValid(void); + extern void GetStandbyInfoForTransaction(RmgrId rmid, uint8 info, + XLogRecData *rdata, + TransactionId *xid2, + uint16 *info2); + + extern void InitRecoveryTransactionEnvironment(void); + extern void XactResolveRedoVisibilityConflicts(XLogRecPtr lsn, XLogRecord *record); + extern void RecordKnownAssignedTransactionIds(XLogRecPtr lsn, XLogRecord *record); + extern void xact_redo(XLogRecPtr lsn, XLogRecord *record); extern void xact_desc(StringInfo buf, uint8 xl_info, char *rec); Index: src/include/access/xlog.h =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/access/xlog.h,v retrieving revision 1.88 diff -c -r1.88 xlog.h *** src/include/access/xlog.h 12 May 2008 08:35:05 -0000 1.88 --- src/include/access/xlog.h 1 Nov 2008 14:49:38 -0000 *************** *** 46,55 **** TransactionId xl_xid; /* xact id */ uint32 xl_tot_len; /* total len of entire record */ uint32 xl_len; /* total len of rmgr data */ ! uint8 xl_info; /* flag bits, see below */ RmgrId xl_rmid; /* resource manager for this record */ ! /* Depending on MAXALIGN, there are either 2 or 6 wasted bytes here */ /* ACTUAL LOG DATA FOLLOWS AT END OF STRUCT */ --- 46,57 ---- TransactionId xl_xid; /* xact id */ uint32 xl_tot_len; /* total len of entire record */ uint32 xl_len; /* total len of rmgr data */ ! uint8 xl_info; /* flag bits, see below (XLR_ entries) */ RmgrId xl_rmid; /* resource manager for this record */ + uint16 xl_info2; /* more flag bits, see below (XLR2_ entries) */ + TransactionId xl_xid2; /* parent_xid if XLR2_FIRST_SUBXID_RECORD is set */ ! /* Above structure has 8 byte alignment */ /* ACTUAL LOG DATA FOLLOWS AT END OF STRUCT */ *************** *** 85,90 **** --- 87,125 ---- */ #define XLR_BKP_REMOVABLE 0x01 + /* + * XLOG uses only high 4 bits of xl_info2. + * + * Other 12 bits are the slotId, allowing up to XLOG_MAX_SLOT_ID + * slotIds in the WAL record. This doesn't prevent having more than + * that number of backends, it just means all backends with a slotId + * higher than XLOG_MAX_SLOT_ID need to write a specific WAL record + * during AssignTransactionId() + */ + #define XLR2_INFO2_MASK 0x0FFF + #define XLOG_MAX_SLOT_ID 4096 + /* + * xl_info2 records + */ + #define XLR2_INVALID_SLOT_ID 0x8000 + #define XLR2_FIRST_XID_RECORD 0x4000 + #define XLR2_FIRST_SUBXID_RECORD 0x2000 + #define XLR2_MARK_SUBTRANS 0x1000 + + #define XLR2_XID_MASK 0x6000 + + #define XLogRecGetSlotId(record) \ + ( \ + ((record)->xl_info2 & XLR2_INVALID_SLOT_ID) ? \ + -1 : \ + (int)((record)->xl_info2 & XLR2_INFO2_MASK) \ + ) + + #define XLogRecIsFirstXidRecord(record) ((record)->xl_info2 & XLR2_FIRST_XID_RECORD) + #define XLogRecIsFirstSubXidRecord(record) ((record)->xl_info2 & XLR2_FIRST_SUBXID_RECORD) + #define XLogRecIsFirstUseOfXid(record) ((record)->xl_info2 & XLR2_XID_MASK) + #define XLogRecMustMarkSubtrans(record) ((record)->xl_info2 & XLR2_MARK_SUBTRANS) + /* Sync methods */ #define SYNC_METHOD_FSYNC 0 #define SYNC_METHOD_FDATASYNC 1 *************** *** 133,139 **** } XLogRecData; extern TimeLineID ThisTimeLineID; /* current TLI */ ! 
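/*
 * [Editor's illustrative sketch: not part of the patch.] During redo, the
 * slotId carried in xl_info2 can be read back with the macros defined
 * earlier in this xlog.h hunk; -1 means XLR2_INVALID_SLOT_ID was set and
 * the record carries no usable slot. The function below is hypothetical
 * and only shows how the macros combine.
 */
#include "postgres.h"
#include "access/xlog.h"		/* XLogRecord, XLogRecGetSlotId() et al. */

static void
example_note_record_slot(XLogRecord *record)
{
	int			slotId = XLogRecGetSlotId(record);

	if (slotId < 0)
		return;					/* no slotId carried on this record */

	if (XLogRecIsFirstUseOfXid(record))
	{
		/*
		 * First WAL record for this xid or subxid: the startup process can
		 * associate record->xl_xid with the recovery proc in this slot,
		 * e.g. via SlotIdGetRecoveryProc(slotId) from procarray.h.
		 */
	}
}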
extern bool InRecovery; extern XLogRecPtr XactLastRecEnd; /* these variables are GUC parameters related to XLOG */ --- 168,182 ---- } XLogRecData; extern TimeLineID ThisTimeLineID; /* current TLI */ ! /* ! * Prior to 8.4, all activity during recovery were carried out by Startup ! * process. This local variable continues to be used in many parts of the ! * code to indicate actions taken by RecoveryManagers. Other processes who ! * potentially perform work during recovery should check ! * IsRecoveryProcessingMode(), see XLogCtl notes in xlog.c ! */ ! extern bool InRecovery; ! extern bool InArchiveRecovery; extern XLogRecPtr XactLastRecEnd; /* these variables are GUC parameters related to XLOG */ *************** *** 143,148 **** --- 186,192 ---- extern char *XLogArchiveCommand; extern int XLogArchiveTimeout; extern bool log_checkpoints; + extern int maxStandbyDelay; #define XLogArchivingActive() (XLogArchiveMode) #define XLogArchiveCommandSet() (XLogArchiveCommand[0] != '\0') *************** *** 166,171 **** --- 210,216 ---- /* These indicate the cause of a checkpoint request */ #define CHECKPOINT_CAUSE_XLOG 0x0010 /* XLOG consumption */ #define CHECKPOINT_CAUSE_TIME 0x0020 /* Elapsed time */ + #define CHECKPOINT_RESTARTPOINT 0x0040 /* Restartpoint during recovery */ /* Checkpoint statistics */ typedef struct CheckpointStatsData *************** *** 197,202 **** --- 242,250 ---- extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record); extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec); + extern bool IsRecoveryProcessingMode(void); + extern int GetLatestReplicationDelay(void); + extern void UpdateControlFile(void); extern Size XLOGShmemSize(void); extern void XLOGShmemInit(void); Index: src/include/access/xlog_internal.h =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/access/xlog_internal.h,v retrieving revision 1.24 diff -c -r1.24 xlog_internal.h *** src/include/access/xlog_internal.h 11 Aug 2008 11:05:11 -0000 1.24 --- src/include/access/xlog_internal.h 1 Nov 2008 14:49:38 -0000 *************** *** 17,22 **** --- 17,23 ---- #define XLOG_INTERNAL_H #include "access/xlog.h" + #include "catalog/pg_control.h" #include "fmgr.h" #include "pgtime.h" #include "storage/block.h" *************** *** 71,77 **** /* * Each page of XLOG file has a header like this: */ ! #define XLOG_PAGE_MAGIC 0xD063 /* can be used as WAL version indicator */ typedef struct XLogPageHeaderData { --- 72,78 ---- /* * Each page of XLOG file has a header like this: */ ! #define XLOG_PAGE_MAGIC 0x5352 /* can be used as WAL version indicator */ typedef struct XLogPageHeaderData { *************** *** 245,250 **** --- 246,254 ---- extern pg_time_t GetLastSegSwitchTime(void); extern XLogRecPtr RequestXLogSwitch(void); + extern void CreateRestartPoint(const XLogRecPtr ReadPtr, + const CheckPoint *restartPoint, int flags); + /* * These aren't in xlog.h because I'd rather not include fmgr.h there. 
*/ *************** *** 255,259 **** --- 259,273 ---- extern Datum pg_current_xlog_insert_location(PG_FUNCTION_ARGS); extern Datum pg_xlogfile_name_offset(PG_FUNCTION_ARGS); extern Datum pg_xlogfile_name(PG_FUNCTION_ARGS); + extern Datum pg_recovery_continue(PG_FUNCTION_ARGS); + extern Datum pg_recovery_pause(PG_FUNCTION_ARGS); + extern Datum pg_recovery_pause_cleanup(PG_FUNCTION_ARGS); + extern Datum pg_recovery_pause_xid(PG_FUNCTION_ARGS); + extern Datum pg_recovery_pause_time(PG_FUNCTION_ARGS); + extern Datum pg_recovery_advance(PG_FUNCTION_ARGS); + extern Datum pg_recovery_stop(PG_FUNCTION_ARGS); + extern Datum pg_is_in_recovery(PG_FUNCTION_ARGS); + extern Datum pg_last_completed_xact_timestamp(PG_FUNCTION_ARGS); + extern Datum pg_last_completed_xid(PG_FUNCTION_ARGS); #endif /* XLOG_INTERNAL_H */ Index: src/include/access/xlogutils.h =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/access/xlogutils.h,v retrieving revision 1.27 diff -c -r1.27 xlogutils.h *** src/include/access/xlogutils.h 31 Oct 2008 15:05:00 -0000 1.27 --- src/include/access/xlogutils.h 1 Nov 2008 15:11:29 -0000 *************** *** 26,33 **** BlockNumber nblocks); extern Buffer XLogReadBuffer(RelFileNode rnode, BlockNumber blkno, bool init); extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum, ! BlockNumber blkno, ReadBufferMode mode); extern Relation CreateFakeRelcacheEntry(RelFileNode rnode); extern void FreeFakeRelcacheEntry(Relation fakerel); --- 26,34 ---- BlockNumber nblocks); extern Buffer XLogReadBuffer(RelFileNode rnode, BlockNumber blkno, bool init); + extern Buffer XLogReadBufferForCleanup(RelFileNode rnode, BlockNumber blkno, bool init); extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum, ! BlockNumber blkno, ReadBufferMode mode, int lockmode); extern Relation CreateFakeRelcacheEntry(RelFileNode rnode); extern void FreeFakeRelcacheEntry(Relation fakerel); Index: src/include/catalog/pg_control.h =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/catalog/pg_control.h,v retrieving revision 1.42 diff -c -r1.42 pg_control.h *** src/include/catalog/pg_control.h 23 Sep 2008 09:20:39 -0000 1.42 --- src/include/catalog/pg_control.h 1 Nov 2008 14:49:38 -0000 *************** *** 21,27 **** /* Version identifier for this pg_control format */ ! #define PG_CONTROL_VERSION 843 /* * Body of CheckPoint XLOG records. This is declared here because we keep --- 21,28 ---- /* Version identifier for this pg_control format */ ! #define PG_CONTROL_VERSION 847 ! // xxx change me /* * Body of CheckPoint XLOG records. This is declared here because we keep *************** *** 46,52 **** #define XLOG_NOOP 0x20 #define XLOG_NEXTOID 0x30 #define XLOG_SWITCH 0x40 ! /* System status indicator */ typedef enum DBState --- 47,58 ---- #define XLOG_NOOP 0x20 #define XLOG_NEXTOID 0x30 #define XLOG_SWITCH 0x40 ! /* ! * Prior to 8.4 we wrote a shutdown checkpoint when recovery completed. ! * Now we write an XLOG_RECOVERY_END record, which helps differentiate ! * between a checkpoint-at-shutdown and the startup case. ! */ ! #define XLOG_RECOVERY_END 0x50 /* System status indicator */ typedef enum DBState *************** *** 101,107 **** --- 107,118 ---- CheckPoint checkPointCopy; /* copy of last check point record */ + /* + * Next two sound very similar, yet are distinct and necessary. + * Check comments in xlog.c for a full explanation not easily repeated. 
+ */ XLogRecPtr minRecoveryPoint; /* must replay xlog to here */ + XLogRecPtr minSafeStartPoint; /* safe point after recovery crashes */ /* * This data is used to check for hardware-architecture compatibility of Index: src/include/catalog/pg_proc.h =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/catalog/pg_proc.h,v retrieving revision 1.520 diff -c -r1.520 pg_proc.h *** src/include/catalog/pg_proc.h 14 Oct 2008 17:12:33 -0000 1.520 --- src/include/catalog/pg_proc.h 1 Nov 2008 14:49:38 -0000 *************** *** 3199,3204 **** --- 3199,3226 ---- DATA(insert OID = 2851 ( pg_xlogfile_name PGNSP PGUID 12 1 0 0 f f t f i 1 25 "25" _null_ _null_ _null_ pg_xlogfile_name _null_ _null_ _null_ )); DESCR("xlog filename, given an xlog location"); + DATA(insert OID = 3101 ( pg_recovery_continue PGNSP PGUID 12 1 0 0 f f t f v 0 2278 "" _null_ _null_ _null_ pg_recovery_continue _null_ _null_ _null_ )); + DESCR("if recovery is paused, continue with recovery"); + DATA(insert OID = 3102 ( pg_recovery_pause PGNSP PGUID 12 1 0 0 f f t f v 0 2278 "" _null_ _null_ _null_ pg_recovery_pause _null_ _null_ _null_ )); + DESCR("pause recovery until recovery target reset"); + DATA(insert OID = 3103 ( pg_recovery_pause_cleanup PGNSP PGUID 12 1 0 0 f f t f v 0 2278 "" _null_ _null_ _null_ pg_recovery_pause_cleanup _null_ _null_ _null_ )); + DESCR("continue recovery until cleanup record arrives, then pause recovery"); + DATA(insert OID = 3104 ( pg_recovery_pause_xid PGNSP PGUID 12 1 0 0 f f t f v 1 2278 "23" _null_ _null_ _null_ pg_recovery_pause_xid _null_ _null_ _null_ )); + DESCR("continue recovery until specified xid completes, if ever seen, then pause recovery"); + DATA(insert OID = 3105 ( pg_recovery_pause_time PGNSP PGUID 12 1 0 0 f f t f v 1 2278 "1184" _null_ _null_ _null_ pg_recovery_pause_time _null_ _null_ _null_ )); + DESCR("continue recovery until a transaction with specified timestamp completes, if ever seen, then pause recovery"); + DATA(insert OID = 3106 ( pg_recovery_advance PGNSP PGUID 12 1 0 0 f f t f v 1 2278 "23" _null_ _null_ _null_ pg_recovery_advance _null_ _null_ _null_ )); + DESCR("continue recovery exactly specified number of records, then pause recovery"); + DATA(insert OID = 3107 ( pg_recovery_stop PGNSP PGUID 12 1 0 0 f f t f v 0 2278 "" _null_ _null_ _null_ pg_recovery_stop _null_ _null_ _null_ )); + DESCR("stop recovery immediately"); + + DATA(insert OID = 3110 ( pg_is_in_recovery PGNSP PGUID 12 1 0 0 f f t f v 0 16 "" _null_ _null_ _null_ pg_is_in_recovery _null_ _null_ _null_ )); + DESCR("true if server is in recovery"); + DATA(insert OID = 3111 ( pg_last_completed_xact_timestamp PGNSP PGUID 12 1 0 0 f f t f v 0 1184 "" _null_ _null_ _null_ pg_last_completed_xact_timestamp _null_ _null_ _null_ )); + DESCR("timestamp of last commit or abort record that arrived during recovery, if any"); + DATA(insert OID = 3112 ( pg_last_completed_xid PGNSP PGUID 12 1 0 0 f f t f v 0 28 "" _null_ _null_ _null_ pg_last_completed_xid _null_ _null_ _null_ )); + DESCR("xid of last commit or abort record that arrived during recovery, if any"); + DATA(insert OID = 2621 ( pg_reload_conf PGNSP PGUID 12 1 0 0 f f t f v 0 16 "" _null_ _null_ _null_ pg_reload_conf _null_ _null_ _null_ )); DESCR("reload configuration files"); DATA(insert OID = 2622 ( pg_rotate_logfile PGNSP PGUID 12 1 0 0 f f t f v 0 16 "" _null_ _null_ _null_ pg_rotate_logfile _null_ _null_ _null_ )); Index: src/include/postmaster/bgwriter.h 
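/*
 * [Editor's illustrative sketch: not part of the patch.] The C bodies of
 * the recovery SQL functions registered in pg_proc.h above are not included
 * in this extract; given the declarations in xlog_internal.h and the new
 * IsRecoveryProcessingMode() API, the simplest of them plausibly reduces to
 * something like this:
 */
#include "postgres.h"
#include "fmgr.h"
#include "access/xlog.h"		/* IsRecoveryProcessingMode() */

Datum
pg_is_in_recovery(PG_FUNCTION_ARGS)
{
	PG_RETURN_BOOL(IsRecoveryProcessingMode());
}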
=================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/postmaster/bgwriter.h,v retrieving revision 1.12 diff -c -r1.12 bgwriter.h *** src/include/postmaster/bgwriter.h 11 Aug 2008 11:05:11 -0000 1.12 --- src/include/postmaster/bgwriter.h 1 Nov 2008 14:49:38 -0000 *************** *** 12,17 **** --- 12,18 ---- #ifndef _BGWRITER_H #define _BGWRITER_H + #include "catalog/pg_control.h" #include "storage/block.h" #include "storage/relfilenode.h" *************** *** 25,30 **** --- 26,36 ---- extern void BackgroundWriterMain(void); extern void RequestCheckpoint(int flags); + extern void RequestRestartPoint(const XLogRecPtr ReadPtr, const CheckPoint *restartPoint, bool sendToBGWriter); + extern void RequestRestartPointCompletion(void); + extern XLogRecPtr GetRedoLocationForArchiveCheckpoint(void); + extern bool SetRedoLocationForArchiveCheckpoint(XLogRecPtr redo); + extern void CheckpointWriteDelay(int flags, double progress); extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, Index: src/include/storage/bufmgr.h =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/storage/bufmgr.h,v retrieving revision 1.116 diff -c -r1.116 bufmgr.h *** src/include/storage/bufmgr.h 31 Oct 2008 15:05:00 -0000 1.116 --- src/include/storage/bufmgr.h 1 Nov 2008 15:03:05 -0000 *************** *** 66,71 **** --- 66,74 ---- #define BUFFER_LOCK_SHARE 1 #define BUFFER_LOCK_EXCLUSIVE 2 + /* Not used by LockBuffer, but is used by XLogReadBuffer... */ + #define BUFFER_LOCK_CLEANUP 3 + /* * These routines are beaten on quite heavily, hence the macroization. */ *************** *** 197,202 **** --- 200,209 ---- extern void LockBufferForCleanup(Buffer buffer); extern bool ConditionalLockBufferForCleanup(Buffer buffer); + extern void StartCleanupDelayStats(void); + extern void EndCleanupDelayStats(void); + extern void ReportCleanupDelayStats(void); + extern void AbortBufferIO(void); extern void BufmgrCommit(void); Index: src/include/storage/pmsignal.h =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/storage/pmsignal.h,v retrieving revision 1.20 diff -c -r1.20 pmsignal.h *** src/include/storage/pmsignal.h 19 Jun 2008 21:32:56 -0000 1.20 --- src/include/storage/pmsignal.h 1 Nov 2008 14:49:38 -0000 *************** *** 22,27 **** --- 22,28 ---- */ typedef enum { + PMSIGNAL_RECOVERY_START, /* move to PM_RECOVERY state */ PMSIGNAL_PASSWORD_CHANGE, /* pg_auth file has changed */ PMSIGNAL_WAKEN_ARCHIVER, /* send a NOTIFY signal to xlog archiver */ PMSIGNAL_ROTATE_LOGFILE, /* send SIGUSR1 to syslogger to rotate logfile */ Index: src/include/storage/proc.h =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/storage/proc.h,v retrieving revision 1.106 diff -c -r1.106 proc.h *** src/include/storage/proc.h 15 Apr 2008 20:28:47 -0000 1.106 --- src/include/storage/proc.h 1 Nov 2008 14:49:38 -0000 *************** *** 14,19 **** --- 14,20 ---- #ifndef _PROC_H_ #define _PROC_H_ + #include "access/xlog.h" #include "storage/lock.h" #include "storage/pg_sema.h" *************** *** 93,98 **** --- 94,112 ---- uint8 vacuumFlags; /* vacuum-related flags, see above */ + /* + * Next two fields exist to allow procs to be used during recovery + * for managing snapshot data for standby servers. 
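/*
 * [Editor's illustrative sketch: not part of the patch, and the intended
 * call sites for the cleanup-delay statistics functions added to bufmgr.h
 * above are not shown in this extract. Their names suggest bracketing a
 * wait for a buffer cleanup lock during replay, roughly as below; treat the
 * placement and the wrapper function as assumptions.]
 */
#include "postgres.h"
#include "storage/bufmgr.h"

static void
example_acquire_cleanup_lock(Buffer buffer)
{
	StartCleanupDelayStats();
	LockBufferForCleanup(buffer);	/* may block until readers drop their pins */
	EndCleanupDelayStats();

	/* ... later, ReportCleanupDelayStats() can summarise the accumulated waits ... */
}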
The lsn allows + * us to disambiguate any incoming information so we always respect + * the latest info. The slotId is a very important concept. It allows + * us to gracefully handle the situation where a backend dies with a + * FATAL error before it can write a WAL record. In that case we use + * the slotId to prove that the old transaction is dead because only + * one TransactionId can ever exist on one slotId at any one time. + */ + XLogRecPtr lsn; /* Last LSN which maintained state of Recovery Proc */ + int slotId; /* slot number in procarray, *never* changes once set */ + /* Info about LWLock the process is currently waiting for, if any. */ bool lwWaiting; /* true if waiting for an LW lock */ bool lwExclusive; /* true if waiting for exclusive access */ *************** *** 157,162 **** --- 171,177 ---- extern Size ProcGlobalShmemSize(void); extern void InitProcGlobal(void); extern void InitProcess(void); + extern PGPROC *InitRecoveryProcess(void); extern void InitProcessPhase2(void); extern void InitAuxiliaryProcess(void); extern bool HaveNFreeProcs(int n); Index: src/include/storage/procarray.h =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/storage/procarray.h,v retrieving revision 1.23 diff -c -r1.23 procarray.h *** src/include/storage/procarray.h 4 Aug 2008 18:03:46 -0000 1.23 --- src/include/storage/procarray.h 1 Nov 2008 14:49:38 -0000 *************** *** 14,19 **** --- 14,20 ---- #ifndef PROCARRAY_H #define PROCARRAY_H + #include "access/xact.h" #include "storage/lock.h" #include "utils/snapshot.h" *************** *** 23,31 **** extern void ProcArrayAdd(PGPROC *proc); extern void ProcArrayRemove(PGPROC *proc, TransactionId latestXid); ! extern void ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid); extern void ProcArrayClearTransaction(PGPROC *proc); extern Snapshot GetSnapshotData(Snapshot snapshot); extern bool TransactionIdIsInProgress(TransactionId xid); --- 24,41 ---- extern void ProcArrayAdd(PGPROC *proc); extern void ProcArrayRemove(PGPROC *proc, TransactionId latestXid); ! extern void ProcArrayStartRecoveryTransaction(PGPROC *proc, TransactionId xid, ! XLogRecPtr lsn, bool isSubXact); ! extern void ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid, ! int nsubxids, TransactionId *subxids); extern void ProcArrayClearTransaction(PGPROC *proc); + extern void ProcArrayClearRecoveryTransactions(void); + extern bool XidInRecoveryProcs(TransactionId xid); + extern void ProcArrayDisplay(int trace_level); + extern void ProcArrayUpdateRecoveryTransactions(XLogRecPtr lsn, + xl_xact_running_xacts *xlrec); + extern RunningTransactions GetRunningTransactionData(void); extern Snapshot GetSnapshotData(Snapshot snapshot); extern bool TransactionIdIsInProgress(TransactionId xid); *************** *** 39,46 **** extern int BackendXidGetPid(TransactionId xid); extern bool IsBackendPid(int pid); ! extern VirtualTransactionId *GetCurrentVirtualXIDs(TransactionId limitXmin, ! bool allDbs, int excludeVacuum); extern int CountActiveBackends(void); extern int CountDBBackends(Oid databaseid); extern int CountUserBackends(Oid roleid); --- 49,59 ---- extern int BackendXidGetPid(TransactionId xid); extern bool IsBackendPid(int pid); ! extern VirtualTransactionId *GetCurrentVirtualXIDs(TransactionId limitXmin, ! Oid dbOid, int excludeVacuum); ! extern int VirtualTransactionIdGetPid(VirtualTransactionId vxid); ! extern PGPROC *SlotIdGetRecoveryProc(int slotid); ! 
extern int CountActiveBackends(void); extern int CountDBBackends(Oid databaseid); extern int CountUserBackends(Oid roleid); *************** *** 51,54 **** --- 64,77 ---- int nxids, const TransactionId *xids, TransactionId latestXid); + /* Primitives for UnobservedXids array handling for standby */ + extern void UnobservedTransactionsAddXids(TransactionId firstXid, + TransactionId lastXid); + extern void UnobservedTransactionsRemoveXid(TransactionId xid, + bool missing_is_error); + extern void UnobservedTransactionsPruneXids(TransactionId limitXid); + extern void UnobservedTransactionsClearXids(void); + extern void UnobservedTransactionsDisplay(int trace_level); + extern bool XidInUnobservedTransactions(TransactionId xid); + #endif /* PROCARRAY_H */ Index: src/include/storage/sinval.h =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/storage/sinval.h,v retrieving revision 1.48 diff -c -r1.48 sinval.h *** src/include/storage/sinval.h 19 Jun 2008 21:32:56 -0000 1.48 --- src/include/storage/sinval.h 1 Nov 2008 14:49:38 -0000 *************** *** 89,94 **** --- 89,132 ---- void (*invalFunction) (SharedInvalidationMessage *msg), void (*resetFunction) (void)); + extern int xactGetCommittedInvalidationMessages(SharedInvalidationMessage **msgs, + bool *RelcacheInitFileInval); + + /* + * Relation Rmgr (RM_RELATION_ID) + * + * Relation recovery manager exists to allow locks and certain kinds of + * invalidation message to be passed across to a standby server. + */ + + extern void RelationReleaseRecoveryLocks(int slotId); + extern void RelationClearRecoveryLocks(void); + + /* Recovery handlers for the Relation Rmgr (RM_RELATION_ID) */ + extern void relation_redo(XLogRecPtr lsn, XLogRecord *record); + extern void relation_desc(StringInfo buf, uint8 xl_info, char *rec); + + /* + * XLOG message types + */ + #define XLOG_RELATION_INVAL 0x00 + #define XLOG_RELATION_LOCK 0x10 + + typedef struct xl_rel_inval + { + int nmsgs; /* number of shared inval msgs */ + SharedInvalidationMessage msgs[1]; /* VARIABLE LENGTH ARRAY */ + } xl_rel_inval; + + #define MinSizeOfRelationInval offsetof(xl_rel_inval, msgs) + + typedef struct xl_rel_lock + { + int slotId; + Oid dbOid; + Oid relOid; + } xl_rel_lock; + /* signal handler for catchup events (SIGUSR1) */ extern void CatchupInterruptHandler(SIGNAL_ARGS); Index: src/include/storage/sinvaladt.h =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/storage/sinvaladt.h,v retrieving revision 1.49 diff -c -r1.49 sinvaladt.h *** src/include/storage/sinvaladt.h 1 Jul 2008 02:09:34 -0000 1.49 --- src/include/storage/sinvaladt.h 1 Nov 2008 14:49:38 -0000 *************** *** 29,35 **** */ extern Size SInvalShmemSize(void); extern void CreateSharedInvalidationState(void); ! extern void SharedInvalBackendInit(void); extern bool BackendIdIsActive(int backendID); extern void SIInsertDataEntries(const SharedInvalidationMessage *data, int n); --- 29,35 ---- */ extern Size SInvalShmemSize(void); extern void CreateSharedInvalidationState(void); ! 
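/*
 * [Editor's illustrative sketch: not part of the patch. The call sites for
 * the UnobservedTransactions primitives declared above are not shown in
 * this extract; the tqual.c comments describe inferring "unobserved xids"
 * from the arrival order of xids in WAL, so replay-side usage presumably
 * looks roughly like this. The function name, the local latestObservedXid
 * variable, and the inclusive/exclusive boundary of the added range are all
 * assumptions here.]
 */
#include "postgres.h"
#include "access/transam.h"
#include "storage/procarray.h"

static TransactionId latestObservedXid = InvalidTransactionId;

static void
example_observe_xid(TransactionId seenXid)
{
	if (!TransactionIdIsValid(latestObservedXid) ||
		TransactionIdFollows(seenXid, latestObservedXid))
	{
		TransactionId next = latestObservedXid;

		/* xids skipped over in the WAL stream may still be running */
		if (TransactionIdIsValid(next))
		{
			TransactionIdAdvance(next);
			if (TransactionIdPrecedes(next, seenXid))
				UnobservedTransactionsAddXids(next, seenXid);	/* range bounds assumed */
		}
		latestObservedXid = seenXid;
	}

	/* once a record for an xid is seen, it is no longer unobserved */
	UnobservedTransactionsRemoveXid(seenXid, false);
}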
extern void SharedInvalBackendInit(bool sendOnly); extern bool BackendIdIsActive(int backendID); extern void SIInsertDataEntries(const SharedInvalidationMessage *data, int n); Index: src/include/utils/flatfiles.h =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/utils/flatfiles.h,v retrieving revision 1.6 diff -c -r1.6 flatfiles.h *** src/include/utils/flatfiles.h 15 Oct 2005 02:49:46 -0000 1.6 --- src/include/utils/flatfiles.h 1 Nov 2008 14:49:38 -0000 *************** *** 19,25 **** extern char *database_getflatfilename(void); extern char *auth_getflatfilename(void); ! extern void BuildFlatFiles(bool database_only); extern void AtPrepare_UpdateFlatFiles(void); extern void AtEOXact_UpdateFlatFiles(bool isCommit); --- 19,25 ---- extern char *database_getflatfilename(void); extern char *auth_getflatfilename(void); ! extern void BuildFlatFiles(bool database_only, bool acquire_locks, bool release_locks); extern void AtPrepare_UpdateFlatFiles(void); extern void AtEOXact_UpdateFlatFiles(bool isCommit); *************** *** 27,32 **** --- 27,39 ---- SubTransactionId mySubid, SubTransactionId parentSubid); + /* + * Called by RecordTransactionCommit to allow it to set xinfo flags + * on the commit record. Used for standby invalidation of flat files. + */ + extern bool AtEOXact_Database_FlatFile_Update_Needed(void); + extern bool AtEOXact_Auth_FlatFile_Update_Needed(void); + extern Datum flatfile_update_trigger(PG_FUNCTION_ARGS); extern void flatfile_twophase_postcommit(TransactionId xid, uint16 info, Index: src/include/utils/inval.h =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/utils/inval.h,v retrieving revision 1.44 diff -c -r1.44 inval.h *** src/include/utils/inval.h 9 Sep 2008 18:58:09 -0000 1.44 --- src/include/utils/inval.h 1 Nov 2008 14:49:38 -0000 *************** *** 15,20 **** --- 15,21 ---- #define INVAL_H #include "access/htup.h" + #include "storage/lock.h" #include "utils/relcache.h" *************** *** 60,63 **** --- 61,67 ---- extern void inval_twophase_postcommit(TransactionId xid, uint16 info, void *recdata, uint32 len); + extern void ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist, + char *reason); + #endif /* INVAL_H */ Index: src/include/utils/snapshot.h =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/utils/snapshot.h,v retrieving revision 1.3 diff -c -r1.3 snapshot.h *** src/include/utils/snapshot.h 12 May 2008 20:02:02 -0000 1.3 --- src/include/utils/snapshot.h 1 Nov 2008 14:49:38 -0000 *************** *** 49,55 **** uint32 xcnt; /* # of xact ids in xip[] */ TransactionId *xip; /* array of xact IDs in progress */ /* note: all ids in xip[] satisfy xmin <= xip[i] < xmax */ ! int32 subxcnt; /* # of xact ids in subxip[], -1 if overflow */ TransactionId *subxip; /* array of subxact IDs in progress */ /* --- 49,65 ---- uint32 xcnt; /* # of xact ids in xip[] */ TransactionId *xip; /* array of xact IDs in progress */ /* note: all ids in xip[] satisfy xmin <= xip[i] < xmax */ ! ! /* ! * Prior to 8.4 we represented an overflowed subxid cache by setting ! * subxcnt to -1. In 8.4+ we keep the count and the overflow indication ! * separate, because when checking the xids in the snapshot we check ! * *both* the subxid cache and subtrans if the subxid cache has overflowed. ! * So we still need the count even when the cache has overflowed. ! * We do this to allow unobserved xids to be placed into the snapshot !
* even when the snapshot has overflowed. It is also a performance gain. ! */ ! uint32 subxcnt; /* # of xact ids in subxip[] */ ! bool suboverflowed; /* true means at least one subxid cache overflowed */ TransactionId *subxip; /* array of subxact IDs in progress */ /* *************** *** 63,68 **** --- 73,148 ---- } SnapshotData; /* + * Declarations for GetRunningTransactionData(). Similar to snapshots, but + * not quite the same. This has nothing at all to do with visibility on this + * server, so it is completely separate from snapmgr.c and snapmgr.h. + * This data is important for creating the initial snapshot state on a + * standby server. We need much more information than a normal snapshot + * provides, hence we use a dedicated data structure. This data + * is written to WAL as a separate record immediately after each + * checkpoint. That means that wherever we start a standby from, we will + * almost immediately see the data we need to begin executing queries. + */ + typedef struct RunningXact + { + /* Items matching PGPROC entries */ + TransactionId xid; /* xact ID in progress */ + int pid; /* backend's process id, or 0 */ + int slotId; /* backend's slotId */ + Oid databaseId; /* OID of database this backend is using */ + Oid roleId; /* OID of role using this backend */ + uint8 vacuumFlags; /* vacuum-related flags, see above */ + + /* Items matching XidCache */ + bool overflowed; + int nsubxids; /* # of subxact ids for this xact only */ + + /* Additional info */ + uint32 subx_offset; /* array offset of start of subxip, + * zero if nsubxids == 0 + */ + } RunningXact; + + typedef struct RunningXactsData + { + uint32 xcnt; /* # of xact ids in xrun[] */ + uint32 subxcnt; /* total # of xact ids in subxip[] */ + TransactionId latestRunningXid; /* Initial setting of LatestObservedXid */ + TransactionId latestCompletedXid; + + RunningXact *xrun; /* array of RunningXact structs */ + + /* + * subxip is held as a single contiguous array, so no space is wasted, + * and it helps the data fit into one XLogRecord. We keep track + * of which subxids go with each top-level xid by recording the start + * offset in each RunningXact struct. + */ + TransactionId *subxip; /* array of subxact IDs in progress */ + + } RunningXactsData; + + typedef RunningXactsData *RunningTransactions; + + /* + * When we write running xact data to WAL, we use this structure. + */ + typedef struct xl_xact_running_xacts + { + int xcnt; /* # of xact ids in xrun[] */ + int subxcnt; /* # of xact ids in subxip[] */ + TransactionId latestRunningXid; /* Initial setting of LatestObservedXid */ + TransactionId latestCompletedXid; + + /* Array of RunningXact(s) */ + RunningXact xrun[1]; /* VARIABLE LENGTH ARRAY */ + + /* ARRAY OF RUNNING SUBTRANSACTION XIDs FOLLOWS */ + } xl_xact_running_xacts; + + #define MinSizeOfXactRunningXacts offsetof(xl_xact_running_xacts, xrun) + + /* * Result codes for HeapTupleSatisfiesUpdate. This should really be in * tqual.h, but we want to avoid including that file elsewhere. */ Index: src/test/regress/parallel_schedule =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/test/regress/parallel_schedule,v retrieving revision 1.50 diff -c -r1.50 parallel_schedule *** src/test/regress/parallel_schedule 31 Oct 2008 09:17:16 -0000 1.50 --- src/test/regress/parallel_schedule 1 Nov 2008 14:49:38 -0000 *************** *** 67,75 **** ignore: random # ---------- ! # Another group of parallel tests # ---------- !
test: select_into select_distinct select_distinct_on select_implicit select_having subselect union case join aggregates transactions random portals arrays btree_index hash_index update namespace prepared_xacts delete test: privileges test: misc --- 67,75 ---- ignore: random # ---------- ! # Another group of parallel tests (prepared_xacts removed) # ---------- ! test: select_into select_distinct select_distinct_on select_implicit select_having subselect union case join aggregates transactions random portals arrays btree_index hash_index update namespace delete test: privileges test: misc Index: src/test/regress/serial_schedule =================================================================== RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/test/regress/serial_schedule,v retrieving revision 1.47 diff -c -r1.47 serial_schedule *** src/test/regress/serial_schedule 31 Oct 2008 09:17:16 -0000 1.47 --- src/test/regress/serial_schedule 1 Nov 2008 14:49:38 -0000 *************** *** 84,90 **** test: update test: delete test: namespace ! test: prepared_xacts test: privileges test: misc test: select_views --- 84,90 ---- test: update test: delete test: namespace ! #test: prepared_xacts test: privileges test: misc test: select_views
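The contiguous subxip layout used by RunningXactsData and xl_xact_running_xacts (snapshot.h, above) is easiest to see in code. The sketch below is illustrative only and is not part of the patch: the function name DisplayRunningTransactions and the elog() reporting are assumptions, but the field usage follows the structure definitions, pairing each top-level xact with its subxids via subx_offset and nsubxids.

/*
 * Illustrative sketch: walk a RunningTransactions structure, reporting each
 * top-level xid together with its subtransaction xids.
 */
#include "postgres.h"
#include "utils/snapshot.h"

static void
DisplayRunningTransactions(RunningTransactions running)
{
	uint32		i;

	for (i = 0; i < running->xcnt; i++)
	{
		RunningXact *rx = &running->xrun[i];
		int			j;

		elog(DEBUG2, "running xact %u (slotId %d, db %u, nsubxids %d)",
			 rx->xid, rx->slotId, rx->databaseId, rx->nsubxids);

		/*
		 * The subxids for this xact are held contiguously in
		 * running->subxip, starting at rx->subx_offset; nsubxids says how
		 * many of them belong to this xact.
		 */
		for (j = 0; j < rx->nsubxids; j++)
			elog(DEBUG2, "  subxact %u",
				 running->subxip[rx->subx_offset + j]);
	}
}

Holding all subxids in one array avoids per-xact arrays and keeps the whole running-xacts record small enough to write as a single XLogRecord; the per-xact offset is what still makes per-transaction lookups possible.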
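The snapshot.h comments above explain why subxcnt and suboverflowed are now tracked separately. Below is a minimal sketch of the kind of reader-side check this enables; the helper name XidIsInSnapshotSubxids is invented here, while SubTransGetTopmostTransaction() is the existing pg_subtrans API. This is not the patch's actual visibility code.

/*
 * Sketch only: with suboverflowed separate from subxcnt, a visibility check
 * can search whatever subxids the snapshot does hold and fall back to
 * pg_subtrans only when the cache overflowed.
 */
#include "postgres.h"
#include "access/subtrans.h"
#include "access/transam.h"
#include "utils/snapshot.h"

static bool
XidIsInSnapshotSubxids(TransactionId xid, Snapshot snapshot)
{
	uint32		i;

	/* First search the subxid entries the snapshot does contain. */
	for (i = 0; i < snapshot->subxcnt; i++)
	{
		if (TransactionIdEquals(xid, snapshot->subxip[i]))
			return true;
	}

	/*
	 * If the subxid cache overflowed, subxip[] may be incomplete, so map the
	 * xid to its top-level parent via pg_subtrans and check it against xip[].
	 */
	if (snapshot->suboverflowed)
	{
		TransactionId parent = SubTransGetTopmostTransaction(xid);

		for (i = 0; i < snapshot->xcnt; i++)
		{
			if (TransactionIdEquals(parent, snapshot->xip[i]))
				return true;
		}
	}

	return false;
}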
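The UnobservedXids primitives added to procarray.h are easiest to understand in terms of how a standby's replay loop might drive them. The sketch below is an assumption about intended usage, not code from the patch: the function name RecordKnownAssignedXid, the static latestObservedXid variable, and the inclusive bounds passed to UnobservedTransactionsAddXids() are all illustrative.

/*
 * Illustrative sketch: called for each xid seen while replaying WAL records
 * on the standby.
 */
#include "postgres.h"
#include "access/transam.h"
#include "storage/procarray.h"

static TransactionId latestObservedXid = InvalidTransactionId;

static void
RecordKnownAssignedXid(TransactionId xid)
{
	if (!TransactionIdIsValid(latestObservedXid))
		latestObservedXid = xid;

	if (TransactionIdFollows(xid, latestObservedXid))
	{
		TransactionId firstGap = latestObservedXid;

		TransactionIdAdvance(firstGap);

		/*
		 * Xids between the last observed xid and this one were assigned to
		 * transactions that have not yet written any WAL of their own, so
		 * standby snapshots must still treat them as running.  (Whether the
		 * second argument is inclusive is an assumption here.)
		 */
		if (TransactionIdPrecedes(firstGap, xid))
			UnobservedTransactionsAddXids(firstGap, xid);

		latestObservedXid = xid;
	}

	/* This xid has now been observed, so it is no longer unobserved. */
	UnobservedTransactionsRemoveXid(xid, false);
}

Presumably UnobservedTransactionsPruneXids() then discards entries once an xl_xact_running_xacts record shows they can no longer be running, and XidInUnobservedTransactions() is what snapshot construction consults; both are inferences from the declarations rather than statements about the implementation.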