Thread: In-place persistence change of a relation
Hello. This is a thread for an alternative solution to wal_level=none
[*1] for bulk data loading.

*1: https://www.postgresql.org/message-id/TYAPR01MB29901EBE5A3ACCE55BA99186FE320%40TYAPR01MB2990.jpnprd01.prod.outlook.com

At Tue, 10 Nov 2020 09:33:12 -0500, Stephen Frost <sfrost@snowman.net> wrote in
> Greetings,
>
> * Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:
> > For fuel(?) of the discussion, I tried a very-quick PoC for in-place
> > ALTER TABLE SET LOGGED/UNLOGGED and resulted as attached. After some
> > trials of several ways, I drifted to the following way after poking
> > several ways.
> >
> > 1. Flip BM_PERMANENT of active buffers
> > 2. adding/removing init fork
> > 3. sync files,
> > 4. Flip pg_class.relpersistence.
> >
> > It always skips table copy in the SET UNLOGGED case, and only when
> > wal_level=minimal in the SET LOGGED case.  Crash recovery seems
> > working by some brief testing by hand.
>
> Somehow missed that this patch more-or-less does what I was referring to
> down-thread, but I did want to mention that it looks like it's missing a
> necessary FlushRelationBuffers() call before the sync, otherwise there
> could be dirty buffers for the relation that's being set to LOGGED (with
> wal_level=minimal), which wouldn't be good.  See the comments above
> smgrimmedsync().

Right. Thanks. However, since SetRelFileNodeBuffersPersistence(), called
just above, already scans shared buffers, I don't want to call
FlushRelationBuffers() separately. Instead, I added buffer flushing to
SetRelFileNodeBuffersPersistence().

FWIW this is a revised version of the PoC, which has some known
problems.

- Flipping of buffer persistence is not WAL-logged, nor can it be
  safely rolled back. (It might be better to drop buffers.)

- This version handles indexes but does not yet handle toast relations.

- tableAMs are supposed to support this feature (but I'm not sure it's
  worth allowing them not to do so).

> > Of course, I haven't performed intensive test on it.
>
> Reading through the thread, it didn't seem very clear, but we should
> definitely make sure that it does the right thing on replicas when going
> between unlogged and logged (and between logged and unlogged too), of
> course.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index dcaea7135f..0c6ce70484 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -613,6 +613,27 @@ heapam_relation_set_new_filenode(Relation rel,
     smgrclose(srel);
 }
 
+static void
+heapam_relation_set_persistence(Relation rel, char persistence)
+{
+    Assert(rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT ||
+           rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED);
+
+    Assert (rel->rd_rel->relpersistence != persistence);
+
+    if (persistence == RELPERSISTENCE_UNLOGGED)
+    {
+        Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
+               rel->rd_rel->relkind == RELKIND_MATVIEW ||
+               rel->rd_rel->relkind == RELKIND_TOASTVALUE);
+
+        RelationCreateInitFork(rel->rd_node, false);
+    }
+    else
+        RelationDropInitFork(rel->rd_node);
+}
+
+
 static void
 heapam_relation_nontransactional_truncate(Relation rel)
 {
@@ -2540,6 +2561,7 @@ static const TableAmRoutine heapam_methods = {
     .compute_xid_horizon_for_tuples = heap_compute_xid_horizon_for_tuples,
 
     .relation_set_new_filenode = heapam_relation_set_new_filenode,
+    .relation_set_persistence = heapam_relation_set_persistence,
     .relation_nontransactional_truncate = heapam_relation_nontransactional_truncate,
     .relation_copy_data = heapam_relation_copy_data,
     .relation_copy_for_cluster = heapam_relation_copy_for_cluster,
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index a7c0cb1bc3..8397002613 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,14 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
                          xlrec->blkno, xlrec->flags);
         pfree(path);
     }
+    else if (info == XLOG_SMGR_UNLINK)
+    {
+        xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+        char       *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+        appendStringInfoString(buf, path);
+        pfree(path);
+    }
 }
 
 const char *
@@ -55,6 +63,9 @@ smgr_identify(uint8 info)
         case XLOG_SMGR_TRUNCATE:
             id = "TRUNCATE";
             break;
+        case XLOG_SMGR_UNLINK:
+            id = "UNLINK";
+            break;
     }
 
     return id;
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index d538f25726..ac5aea3d38 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -60,6 +60,8 @@ int wal_skip_threshold = 2048; /* in kilobytes */
 typedef struct PendingRelDelete
 {
     RelFileNode relnode;        /* relation that may need to be deleted */
+    bool        deleteinitfork; /* delete only init fork if true */
+    bool        createinitfork; /* create init fork if true */
     BackendId   backend;        /* InvalidBackendId if not a temp rel */
     bool        atCommit;       /* T=delete at commit; F=delete at abort */
     int         nestLevel;      /* xact nesting level of request */
@@ -153,6 +155,8 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
     pending = (PendingRelDelete *)
         MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
     pending->relnode = rnode;
+    pending->deleteinitfork = false;
+    pending->createinitfork = false;
     pending->backend = backend;
     pending->atCommit = false;  /* delete if abort */
     pending->nestLevel = GetCurrentTransactionNestLevel();
@@ -168,6 +172,95 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence)
     return srel;
 }
 
+/*
+ * RelationCreateInitFork
+ *      Create physical storage for a relation.
+ *
+ * Create the underlying disk file storage for the relation. This only
+ * creates the main fork; additional forks are created lazily by the
+ * modules that need them.
+ *
+ * This function is transactional. The creation is WAL-logged, and if the
+ * transaction aborts later on, the storage will be destroyed.
+ */
+void
+RelationCreateInitFork(RelFileNode rnode, bool isRedo)
+{
+    PendingRelDelete *pending;
+    SMgrRelation srel;
+    PendingRelDelete *prev;
+    PendingRelDelete *next;
+
+    prev = NULL;
+    for (pending = pendingDeletes; pending != NULL; pending = next)
+    {
+        next = pending->next;
+        if (RelFileNodeEquals(rnode, pending->relnode) &&
+            pending->deleteinitfork && pending->atCommit)
+        {
+            /* unlink and delete list entry */
+            if (prev)
+                prev->next = next;
+            else
+                pendingDeletes = next;
+            pfree(pending);
+            return;
+        }
+        else
+        {
+            /* unrelated entry, don't touch it */
+            prev = pending;
+        }
+    }
+    srel = smgropen(rnode, InvalidBackendId);
+    smgrcreate(srel, INIT_FORKNUM, isRedo);
+    if (!isRedo)
+        log_smgrcreate(&rnode, INIT_FORKNUM);
+    smgrimmedsync(srel, INIT_FORKNUM);
+
+    /* Add the relation to the list of stuff to delete at abort */
+    pending = (PendingRelDelete *)
+        MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+    pending->relnode = rnode;
+    pending->deleteinitfork = true;
+    pending->createinitfork = false;
+    pending->backend = InvalidBackendId;
+    pending->atCommit = false;  /* delete if abort */
+    pending->nestLevel = GetCurrentTransactionNestLevel();
+    pending->next = pendingDeletes;
+    pendingDeletes = pending;
+}
+
+void
+RelationDropInitFork(RelFileNode rnode)
+{
+    PendingRelDelete *pending;
+    PendingRelDelete *next;
+
+    for (pending = pendingDeletes; pending != NULL; pending = next)
+    {
+        next = pending->next;
+        if (RelFileNodeEquals(rnode, pending->relnode) &&
+            pending->deleteinitfork && pending->atCommit)
+        {
+            /* We're done. */
+            return;
+        }
+    }
+
+    /* Add the relation to the list of stuff to delete at abort */
+    pending = (PendingRelDelete *)
+        MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+    pending->relnode = rnode;
+    pending->deleteinitfork = true;
+    pending->createinitfork = false;
+    pending->backend = InvalidBackendId;
+    pending->atCommit = true;   /* create if abort */
+    pending->nestLevel = GetCurrentTransactionNestLevel();
+    pending->next = pendingDeletes;
+    pendingDeletes = pending;
+}
+
 /*
  * Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
  */
@@ -187,6 +280,25 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum)
     XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
 }
 
+/*
+ * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
+ */
+void
+log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum)
+{
+    xl_smgr_unlink xlrec;
+
+    /*
+     * Make an XLOG entry reporting the file unlink.
+     */
+    xlrec.rnode = *rnode;
+    xlrec.forkNum = forkNum;
+
+    XLogBeginInsert();
+    XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+    XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE);
+}
+
 /*
  * RelationDropStorage
  *      Schedule unlinking of physical storage at transaction commit.
@@ -200,6 +312,8 @@ RelationDropStorage(Relation rel)
     pending = (PendingRelDelete *)
         MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
     pending->relnode = rel->rd_node;
+    pending->createinitfork = false;
+    pending->deleteinitfork = false;
     pending->backend = rel->rd_backend;
     pending->atCommit = true;   /* delete if commit */
     pending->nestLevel = GetCurrentTransactionNestLevel();
@@ -626,19 +740,27 @@ smgrDoPendingDeletes(bool isCommit)
 
                 srel = smgropen(pending->relnode, pending->backend);
 
-                /* allocate the initial array, or extend it, if needed */
-                if (maxrels == 0)
+                if (pending->deleteinitfork)
                 {
-                    maxrels = 8;
-                    srels = palloc(sizeof(SMgrRelation) * maxrels);
+                    log_smgrunlink(&pending->relnode, INIT_FORKNUM);
+                    smgrunlink(srel, INIT_FORKNUM, false);
                 }
-                else if (maxrels <= nrels)
+                else
                 {
-                    maxrels *= 2;
-                    srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
-                }
+                    /* allocate the initial array, or extend it, if needed */
+                    if (maxrels == 0)
+                    {
+                        maxrels = 8;
+                        srels = palloc(sizeof(SMgrRelation) * maxrels);
+                    }
+                    else if (maxrels <= nrels)
+                    {
+                        maxrels *= 2;
+                        srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
+                    }
 
-                srels[nrels++] = srel;
+                    srels[nrels++] = srel;
+                }
             }
             /* must explicitly free the list entry */
             pfree(pending);
@@ -917,6 +1039,14 @@ smgr_redo(XLogReaderState *record)
         reln = smgropen(xlrec->rnode, InvalidBackendId);
         smgrcreate(reln, xlrec->forkNum, true);
     }
+    else if (info == XLOG_SMGR_UNLINK)
+    {
+        xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record);
+        SMgrRelation reln;
+
+        reln = smgropen(xlrec->rnode, InvalidBackendId);
+        smgrunlink(reln, xlrec->forkNum, true);
+    }
     else if (info == XLOG_SMGR_TRUNCATE)
     {
         xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index e3cfaf8b07..e358174b01 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4918,6 +4918,137 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
     return newcmd;
 }
 
+static bool
+try_inplace_persistence_change(AlteredTableInfo *tab, char persistence,
+                               LOCKMODE lockmode)
+{
+    Relation    rel;
+    Relation    classRel;
+    HeapTuple   tuple,
+                newtuple;
+    Datum       new_val[Natts_pg_class];
+    bool        new_null[Natts_pg_class],
+                new_repl[Natts_pg_class];
+    int         i;
+    List       *relids;
+    ListCell   *lc_oid;
+
+    Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE);
+    Assert(lockmode == AccessExclusiveLock);
+
+    /*
+     * Under the following condition, we need to call ATRewriteTable, which
+     * cannot be false in the AT_REWRITE_ALTER_PERSISTENCE case.
+     */
+    Assert(tab->constraints == NULL && tab->partition_constraint == NULL &&
+           tab->newvals == NULL && !tab->verify_new_notnull);
+
+    /*
+     * When wal_level is replica or higher we need that the initial state of
+     * the relation be recoverable from WAL. When wal_level >= replica
+     * switching to PERMANENT needs to emit the WAL records to reconstruct the
+     * current data. This could be done by writing XLOG_FPI for all pages but
+     * it is not obvious that that is performant than normal rewriting.
+     * Otherwise what we need for the relation data is just establishing
+     * initial state on storage and no need of WAL to reconstruct it.
+     */
+    if (tab->newrelpersistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded())
+        return false;
+
+    rel = table_open(tab->relid, lockmode);
+
+    Assert(rel->rd_rel->relpersistence != persistence);
+
+    elog(DEBUG1, "perform in-place persistence change");
+
+    RelationOpenSmgr(rel);
+
+    /* Change persistence then flush-out buffers of the relation */
+
+    /* Get the list of index OIDs for this relation */
+    relids = RelationGetIndexList(rel);
+    relids = lcons_oid(rel->rd_id, relids);
+
+    table_close(rel, lockmode);
+
+    /* Done change on storage. Update catalog including indexes. */
+    /* add the heap oid to the relation ID list */
+
+    classRel = table_open(RelationRelationId, RowExclusiveLock);
+
+    foreach (lc_oid, relids)
+    {
+        Oid         reloid = lfirst_oid(lc_oid);
+        Relation    r = relation_open(reloid, lockmode);
+
+        RelationOpenSmgr(r);
+
+        if (persistence == RELPERSISTENCE_UNLOGGED)
+        {
+            RelationCreateInitFork(r->rd_node, false);
+
+            if (r->rd_rel->relkind == RELKIND_INDEX ||
+                r->rd_rel->relkind == RELKIND_PARTITIONED_INDEX)
+                r->rd_indam->ambuildempty(r);
+            else
+            {
+                Assert(r->rd_rel->relkind == RELKIND_RELATION ||
+                       r->rd_rel->relkind == RELKIND_MATVIEW ||
+                       r->rd_rel->relkind == RELKIND_TOASTVALUE);
+            }
+        }
+        else
+            RelationDropInitFork(r->rd_node);
+
+        table_close(r, NoLock);
+
+        /*
+         * This relation is now WAL-logged. Sync all files immediately to
+         * establish the initial state on storage.
+         */
+        if (persistence == RELPERSISTENCE_PERMANENT)
+        {
+            for (i = 0 ; i < MAX_FORKNUM ; i++)
+            {
+                if (smgrexists(r->rd_smgr, i))
+                    smgrimmedsync(r->rd_smgr, i);
+            }
+        }
+
+
+        tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+        if (!HeapTupleIsValid(tuple))
+            elog(ERROR, "cache lookup failed for relation %u", reloid);
+
+        memset(new_val, 0, sizeof(new_val));
+        memset(new_null, false, sizeof(new_null));
+        memset(new_repl, false, sizeof(new_repl));
+
+        new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence);
+        new_null[Anum_pg_class_relpersistence - 1] = false;
+        new_repl[Anum_pg_class_relpersistence - 1] = true;
+
+        newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel),
+                                     new_val, new_null, new_repl);
+
+        CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple);
+        heap_freetuple(newtuple);
+    }
+
+    foreach (lc_oid, relids)
+    {
+        Oid         reloid = lfirst_oid(lc_oid);
+        Relation    r = relation_open(reloid, lockmode);
+
+        RelationOpenSmgr(r);
+        SetRelationBuffersPersistence(r, persistence == RELPERSISTENCE_PERMANENT);
+        table_close(r, NoLock);
+    }
+    table_close(classRel, RowExclusiveLock);
+
+    return true;
+}
+
 /*
  * ATRewriteTables: ALTER TABLE phase 3
  */
@@ -5038,45 +5169,51 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
                                      tab->relid,
                                      tab->rewrite);
 
-            /*
-             * Create transient table that will receive the modified data.
-             *
-             * Ensure it is marked correctly as logged or unlogged.  We have
-             * to do this here so that buffers for the new relfilenode will
-             * have the right persistence set, and at the same time ensure
-             * that the original filenode's buffers will get read in with the
-             * correct setting (i.e. the original one).  Otherwise a rollback
-             * after the rewrite would possibly result with buffers for the
-             * original filenode having the wrong persistence setting.
-             *
-             * NB: This relies on swap_relation_files() also swapping the
-             * persistence. That wouldn't work for pg_class, but that can't be
-             * unlogged anyway.
-             */
-            OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, persistence,
-                                       lockmode);
+            if (tab->rewrite != AT_REWRITE_ALTER_PERSISTENCE ||
+                !try_inplace_persistence_change(tab, persistence, lockmode))
+            {
+                /*
+                 * Create transient table that will receive the modified data.
+                 *
+                 * Ensure it is marked correctly as logged or unlogged.  We
+                 * have to do this here so that buffers for the new relfilenode
+                 * will have the right persistence set, and at the same time
+                 * ensure that the original filenode's buffers will get read in
+                 * with the correct setting (i.e. the original one).  Otherwise
+                 * a rollback after the rewrite would possibly result with
+                 * buffers for the original filenode having the wrong
+                 * persistence setting.
+                 *
+                 * NB: This relies on swap_relation_files() also swapping the
+                 * persistence. That wouldn't work for pg_class, but that can't
+                 * be unlogged anyway.
+                 */
+                OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, persistence,
+                                           lockmode);
 
-            /*
-             * Copy the heap data into the new table with the desired
-             * modifications, and test the current data within the table
-             * against new constraints generated by ALTER TABLE commands.
-             */
-            ATRewriteTable(tab, OIDNewHeap, lockmode);
+                /*
+                 * Copy the heap data into the new table with the desired
+                 * modifications, and test the current data within the table
+                 * against new constraints generated by ALTER TABLE commands.
+                 */
+                ATRewriteTable(tab, OIDNewHeap, lockmode);
 
-            /*
-             * Swap the physical files of the old and new heaps, then rebuild
-             * indexes and discard the old heap.  We can use RecentXmin for
-             * the table's new relfrozenxid because we rewrote all the tuples
-             * in ATRewriteTable, so no older Xid remains in the table.  Also,
-             * we never try to swap toast tables by content, since we have no
-             * interest in letting this code work on system catalogs.
-             */
-            finish_heap_swap(tab->relid, OIDNewHeap,
-                             false, false, true,
-                             !OidIsValid(tab->newTableSpace),
-                             RecentXmin,
-                             ReadNextMultiXactId(),
-                             persistence);
+                /*
+                 * Swap the physical files of the old and new heaps, then
+                 * rebuild indexes and discard the old heap.  We can use
+                 * RecentXmin for the table's new relfrozenxid because we
+                 * rewrote all the tuples in ATRewriteTable, so no older Xid
+                 * remains in the table.  Also, we never try to swap toast
+                 * tables by content, since we have no interest in letting this
+                 * code work on system catalogs.
+                 */
+                finish_heap_swap(tab->relid, OIDNewHeap,
+                                 false, false, true,
+                                 !OidIsValid(tab->newTableSpace),
+                                 RecentXmin,
+                                 ReadNextMultiXactId(),
+                                 persistence);
+            }
         }
         else
         {
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ad0d1a9abc..c71e1a5f92 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3033,6 +3033,80 @@ DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
     }
 }
 
+/* ---------------------------------------------------------------------
+ *      SetRelFileNodeBuffersPersistence
+ *
+ *      This function changes the persistence of all buffer pages of a relation
+ *      then writes all dirty pages of the relation out to disk when switching
+ *      to PERMANENT. (or more accurately, out to kernel disk buffers),
+ *      ensuring that the kernel has an up-to-date view of the relation.
+ *
+ *      Generally, the caller should be holding AccessExclusiveLock on the
+ *      target relation to ensure that no other backend is busy dirtying
+ *      more blocks of the relation; the effects can't be expected to last
+ *      after the lock is released.
+ *
+ *      XXX currently it sequentially searches the buffer pool, should be
+ *      changed to more clever ways of searching.  This routine is not
+ *      used in any performance-critical code paths, so it's not worth
+ *      adding additional overhead to normal paths to make it go faster;
+ *      but see also DropRelFileNodeBuffers.
+ * --------------------------------------------------------------------
+ */
+void
+SetRelationBuffersPersistence(Relation rel, bool permanent)
+{
+    int         i;
+    RelFileNodeBackend rnode = rel->rd_smgr->smgr_rnode;
+
+    Assert (!RelFileNodeBackendIsTemp(rnode));
+
+    for (i = 0; i < NBuffers; i++)
+    {
+        BufferDesc *bufHdr = GetBufferDescriptor(i);
+        uint32      buf_state;
+
+        if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+            continue;
+
+        ReservePrivateRefCountEntry();
+
+        buf_state = LockBufHdr(bufHdr);
+
+        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+        {
+            ereport(LOG, (errmsg ("#%d: %d", i, (buf_state & BM_PERMANENT) == 0), errhidestmt(true)));
+            if (permanent)
+            {
+                Assert ((buf_state & BM_PERMANENT) == 0);
+                buf_state |= BM_PERMANENT;
+                pg_atomic_write_u32(&bufHdr->state, buf_state);
+
+                /* we flush this buffer when switching to PERMANENT */
+                if ((buf_state & (BM_VALID | BM_DIRTY)) ==
+                    (BM_VALID | BM_DIRTY))
+                {
+                    PinBuffer_Locked(bufHdr);
+                    LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+                                  LW_SHARED);
+                    FlushBuffer(bufHdr, rel->rd_smgr);
+                    LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+                    UnpinBuffer(bufHdr, true);
+                }
+                else
+                    UnlockBufHdr(bufHdr, buf_state);
+            }
+            else
+            {
+                Assert ((buf_state & BM_PERMANENT) != 0);
+                buf_state &= ~BM_PERMANENT;
+                UnlockBufHdr(bufHdr, buf_state);
+            }
+            ereport(LOG, (errmsg ("#%d: -> %d", i, (buf_state & BM_PERMANENT) == 0), errhidestmt(true)));
+        }
+    }
+}
+
 /* ---------------------------------------------------------------------
  *      DropRelFileNodesAllBuffers
  *
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index dcc09df0c7..5eb9e97b3d 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -645,6 +645,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
     smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
 }
 
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+    smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
+}
+
 /*
  *  AtEOXact_SMgr
  *
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 387eb34a61..1d19278a18 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -451,6 +451,15 @@ typedef struct TableAmRoutine
                                               TransactionId *freezeXid,
                                               MultiXactId *minmulti);
 
+    /*
+     * This callback needs to switch persistence of the relation between
+     * RELPERSISTENCE_PERMANENT and RELPERSISTENCE_UNLOGGED. Actual change on
+     * storage is performed elsewhere.
+     *
+     * See also table_relation_set_persistence().
+     */
+    void        (*relation_set_persistence) (Relation rel, char persistence);
+
     /*
      * This callback needs to remove all contents from `rel`'s current
      * relfilenode. No provisions for transactional behaviour need to be made.
@@ -1404,6 +1413,18 @@ table_relation_set_new_filenode(Relation rel,
                                                freezeXid, minmulti);
 }
 
+/*
+ * Switch storage persistence between RELPERSISTENCE_PERMANENT and
+ * RELPERSISTENCE_UNLOGGED.
+ *
+ * This is used during in-place persistence switching
+ */
+static inline void
+table_relation_set_persistence(Relation rel, char persistence)
+{
+    rel->rd_tableam->relation_set_persistence(rel, persistence);
+}
+
 /*
  * Remove all table contents from `rel`, in a non-transactional manner.
  * Non-transactional meaning that there's no need to support rollbacks. This
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 30c38e0ca6..43d2eb0fb4 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -23,6 +23,8 @@
 extern int  wal_skip_threshold;
 
 extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence);
+extern void RelationCreateInitFork(RelFileNode rel, bool isRedo);
+extern void RelationDropInitFork(RelFileNode rel);
 extern void RelationDropStorage(Relation rel);
 extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
 extern void RelationPreTruncate(Relation rel);
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index 7b21cab2e0..73ad2ae89e 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -29,6 +29,7 @@
 /* XLOG gives us high 4 bits */
 #define XLOG_SMGR_CREATE    0x10
 #define XLOG_SMGR_TRUNCATE  0x20
+#define XLOG_SMGR_UNLINK    0x30
 
 typedef struct xl_smgr_create
 {
@@ -36,6 +37,12 @@ typedef struct xl_smgr_create
     ForkNumber  forkNum;
 } xl_smgr_create;
 
+typedef struct xl_smgr_unlink
+{
+    RelFileNode rnode;
+    ForkNumber  forkNum;
+} xl_smgr_unlink;
+
 /* flags for xl_smgr_truncate */
 #define SMGR_TRUNCATE_HEAP      0x0001
 #define SMGR_TRUNCATE_VM        0x0002
@@ -51,6 +58,7 @@ typedef struct xl_smgr_truncate
 } xl_smgr_truncate;
 
 extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum);
 
 extern void smgr_redo(XLogReaderState *record);
 extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ee91b8fa26..f65a273999 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -205,6 +205,7 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels)
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
                                    int nforks, BlockNumber *firstDelBlock);
+extern void SetRelationBuffersPersistence(Relation rnode, bool permanent);
 extern void DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes);
 extern void DropDatabaseBuffers(Oid dbid);
 
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index f28a842401..5d74631006 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -86,6 +86,7 @@ extern void smgrclose(SMgrRelation reln);
 extern void smgrcloseall(void);
 extern void smgrclosenode(RelFileNodeBackend rnode);
 extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
 extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
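The pending-list handling in RelationCreateInitFork() in the patch above
can be looked at in isolation: if a "drop the init fork at commit" entry
is already queued for the relation, creating the init fork again simply
cancels that entry instead of touching storage. The following is only a
standalone sketch of that idea. All names (SketchPending, sketch_push,
sketch_cancel_pending_drop, sketch_length) are hypothetical stand-ins and
not PostgreSQL's definitions, and malloc/free replace the backend's
memory contexts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for PendingRelDelete; not the real struct. */
typedef struct SketchPending
{
    unsigned    relnode;        /* relation the entry refers to */
    bool        init_fork_only; /* entry affects only the init fork */
    bool        at_commit;      /* true: act at commit; false: at abort */
    struct SketchPending *next;
} SketchPending;

/* Push a new entry at the head of the list and return the new head. */
SketchPending *
sketch_push(SketchPending *head, unsigned relnode,
            bool init_fork_only, bool at_commit)
{
    SketchPending *p = malloc(sizeof(*p));

    if (p == NULL)
        abort();
    p->relnode = relnode;
    p->init_fork_only = init_fork_only;
    p->at_commit = at_commit;
    p->next = head;
    return p;
}

/*
 * Cancel a queued "drop init fork at commit" entry for relnode, if any.
 * Returns true when an entry was removed, in which case the caller has
 * nothing left to create.
 */
bool
sketch_cancel_pending_drop(SketchPending **head, unsigned relnode)
{
    SketchPending **link;

    for (link = head; *link != NULL; link = &(*link)->next)
    {
        SketchPending *p = *link;

        if (p->relnode == relnode && p->init_fork_only && p->at_commit)
        {
            *link = p->next;    /* unlink the entry, then free it */
            free(p);
            return true;
        }
    }
    return false;
}

/* Count the entries in the list. */
size_t
sketch_length(const SketchPending *head)
{
    size_t      n = 0;

    for (; head != NULL; head = head->next)
        n++;
    return n;
}
```

Using a pointer-to-link for the walk avoids the separate prev pointer
that the patch carries; that is a stylistic choice of this sketch, not
something the patch does.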
Hi,

I suggest outlining what you are trying to achieve here. Starting a new
thread and expecting people to dig through another thread to infer what
you are actually trying to achieve isn't great.

FWIW, I'm *extremely* doubtful it's worth adding features that depend on
a PGC_POSTMASTER wal_level=minimal being used. Which this does, as far as
I understand. If somebody added support for dynamically adapting
wal_level (e.g. wal_level=auto, that increases wal_level to
replica/logical depending on the presence of replication slots), it'd
perhaps be different.

On 2020-11-11 17:33:17 +0900, Kyotaro Horiguchi wrote:
> FWIW this is a revised version of the PoC, which has some known
> problems.
>
> - Flipping of buffer persistence is not WAL-logged, nor can it be
>   safely rolled back. (It might be better to drop buffers.)

That's obviously a no-go. I think you might be able to address this if
you accept that the command cannot be run in a transaction (like
CONCURRENTLY). Then you can first do the catalog changes, change the
persistence level, and commit.

Greetings,

Andres Freund
At Wed, 11 Nov 2020 14:18:04 -0800, Andres Freund <andres@anarazel.de> wrote in
> Hi,
>
> I suggest outlining what you are trying to achieve here. Starting a new
> thread and expecting people to dig through another thread to infer what
> you are actually trying to achieve isn't great.

Agreed. I'll post that. Thanks.

> FWIW, I'm *extremely* doubtful it's worth adding features that depend on
> a PGC_POSTMASTER wal_level=minimal being used. Which this does, as far as
> I understand. If somebody added support for dynamically adapting
> wal_level (e.g. wal_level=auto, that increases wal_level to
> replica/logical depending on the presence of replication slots), it'd
> perhaps be different.

Yes, this depends on wal_level=minimal for switching from UNLOGGED to
LOGGED. That is similar to the wal_level=minimal optimization for
COPY/INSERT into tables created in the same transaction, and it extends
that optimization to COPY/INSERT into existing tables, which seems worth
doing.

Switching to LOGGED needs to emit the initial state to WAL... Hmm.. I
came to think that even in that case skipping the table copy reduces I/O
significantly, even though FPI WAL is emitted.

> On 2020-11-11 17:33:17 +0900, Kyotaro Horiguchi wrote:
> > FWIW this is a revised version of the PoC, which has some known
> > problems.
> >
> > - Flipping of buffer persistence is not WAL-logged, nor can it be
> >   safely rolled back. (It might be better to drop buffers.)
>
> That's obviously a no-go. I think you might be able to address this if
> you accept that the command cannot be run in a transaction (like
> CONCURRENTLY). Then you can first do the catalog changes, change the
> persistence level, and commit.

Of course. The next version reverts the persistence change at abort.

Thanks!

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
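To make "reverts the persistence change at abort" concrete, here is a
minimal standalone sketch of the idea: flip a permanence bit on every
buffer that belongs to a relation, remember the old value, and restore
it if the transaction aborts. Everything in it (SketchBuffer,
SketchPendingOp, SKETCH_BM_PERMANENT and the sketch_* functions) is a
hypothetical stand-in for illustration. It ignores locking, flushing and
WAL, and it is not how the next patch version implements this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit standing in for BM_PERMANENT. */
#define SKETCH_BM_PERMANENT (1u << 0)

/* Hypothetical, much-reduced buffer header. */
typedef struct SketchBuffer
{
    uint32_t    relnode;        /* relation the cached page belongs to */
    uint32_t    state;          /* flag bits */
} SketchBuffer;

/* What to undo if the transaction aborts. */
typedef struct SketchPendingOp
{
    uint32_t    relnode;
    bool        restore_permanent;  /* persistence to restore on abort */
    bool        valid;
} SketchPendingOp;

/* Set or clear the permanence bit on all buffers of one relation. */
int
sketch_set_persistence(SketchBuffer *bufs, size_t nbufs,
                       uint32_t relnode, bool permanent)
{
    int         changed = 0;

    for (size_t i = 0; i < nbufs; i++)
    {
        if (bufs[i].relnode != relnode)
            continue;
        if (permanent)
            bufs[i].state |= SKETCH_BM_PERMANENT;
        else
            bufs[i].state &= ~SKETCH_BM_PERMANENT;
        changed++;
    }
    return changed;
}

/* Flip persistence now and record how to undo it. */
void
sketch_begin_change(SketchBuffer *bufs, size_t nbufs, SketchPendingOp *op,
                    uint32_t relnode, bool permanent)
{
    op->relnode = relnode;
    op->restore_permanent = !permanent;
    op->valid = true;
    sketch_set_persistence(bufs, nbufs, relnode, permanent);
}

/* At transaction end: keep the change on commit, revert it on abort. */
void
sketch_end_xact(SketchBuffer *bufs, size_t nbufs, SketchPendingOp *op,
                bool is_commit)
{
    if (op->valid && !is_commit)
        sketch_set_persistence(bufs, nbufs, op->relnode,
                               op->restore_permanent);
    op->valid = false;
}
```

Calling sketch_begin_change() followed by sketch_end_xact(..., false)
leaves the buffers of the relation in their original state, while
buffers of other relations are never touched.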
Hello.

Before posting the next version, I'd like to explain what this patch
is.

1. The Issue

Bulk data loading is a time-consuming, I/O-intensive task. Many DBAs
want it to be faster, even at the cost of an increased risk of data
loss. wal_level=minimal is one answer to such a request: data loading
into a table that was created in the current transaction omits
WAL-logging, and the table is synced at commit.

However, that optimization doesn't help when data is loaded into
existing tables. There are quite a few cases where data is loaded into
tables that already contain a lot of data, and those cases get no
benefit from the optimization.

Another possible solution for bulk data loading is UNLOGGED tables. But
when we switch a table between LOGGED and UNLOGGED, all of the table
content is copied to a newly created heap, which is costly.

2. Proposed Solutions

Two solutions have been discussed on this mailing list. One is
wal_level = none (*1), which omits almost all WAL-logging. The other is
extending the existing optimization to the ALTER TABLE SET
LOGGED/UNLOGGED cases, which is to be discussed in this new thread.

3. In-place Persistence Change

So the attached is a PoC patch of the latter solution. When we want to
change table persistence in-place, basically we need to do the
following steps (the table is exclusively locked):

(1) Flip the BM_PERMANENT flag of all shared buffer blocks for the heap.

(2) Create or delete the init fork for the existing heap.

(3) Flush all buffers of the relation to the file system.

(4) Sync heap files.

(5) Make catalog changes.

4. Transactionality

Steps 1, 2 and 5 above need to be abortable. Step 5 is rolled back by
existing infrastructure, and rollback of steps 1 and 2 is achieved by
piggybacking on the pendingDeletes mechanism.

5. Replication

Furthermore, those changes ought to be replicable to standbys. Catalog
changes are replicated as usual.

On-the-fly creation of the init fork leads to recovery mess.
Even though it is removed at abort, if the server crashes before the
transaction ends, the file is left behind and corrupts the database in
the next recovery. I sought a way to create the init fork in
smgrPendingDelete, but that needs relcache, and relcache is not
available that late in commit. Finally, I introduced a fifth fork kind,
"INITTMP" (_itmp), only to signal that the init file is not committed.
I don't like that way, but it seems to work fine...

6. SQL Command

The second file in the patchset adds a syntax that changes the
persistence of all tables in a tablespace.

ALTER TABLE ALL IN TABLESPACE <tsp> SET LOGGED/UNLOGGED [ NOWAIT ];

7. Testing

I tried to write a TAP test for this, but IPC::Run::harness (or
interactive_psql) doesn't seem to work for me. I'm not sure what exactly
is happening, but pty redirection doesn't work.

  $in = "ls\n"; $out = "";
  run ["/usr/bin/bash"], \$in, \$out;
  print $out;

works but

  $in = "ls\n"; $out = "";
  run ["/usr/bin/bash"], '<pty<', \$in, '>pty>', \$out;
  print $out;

doesn't respond.

The patch is attached.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

From 05d1971d0f4f0f42899f5d6857892128487eeb40 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v3 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs a heap rewrite, which causes a large amount
of file I/O. This patch makes the command run without a heap rewrite.
In addition to that, SET LOGGED while wal_level > minimal emits WAL
using XLOG_FPI instead of a massive number of HEAP_INSERTs, which should
be smaller.
--- src/backend/access/rmgrdesc/smgrdesc.c | 23 ++ src/backend/catalog/storage.c | 355 +++++++++++++++++++++++-- src/backend/commands/tablecmds.c | 217 ++++++++++++--- src/backend/storage/buffer/bufmgr.c | 88 ++++++ src/backend/storage/file/reinit.c | 206 ++++++++------ src/backend/storage/smgr/smgr.c | 6 + src/common/relpath.c | 3 +- src/include/catalog/storage.h | 2 + src/include/catalog/storage_xlog.h | 16 ++ src/include/common/relpath.h | 5 +- src/include/storage/bufmgr.h | 4 + src/include/storage/smgr.h | 1 + 12 files changed, 784 insertions(+), 142 deletions(-) diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c index a7c0cb1bc3..097dacfee6 100644 --- a/src/backend/access/rmgrdesc/smgrdesc.c +++ b/src/backend/access/rmgrdesc/smgrdesc.c @@ -40,6 +40,23 @@ smgr_desc(StringInfo buf, XLogReaderState *record) xlrec->blkno, xlrec->flags); pfree(path); } + else if (info == XLOG_SMGR_UNLINK) + { + xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec; + char *path = relpathperm(xlrec->rnode, xlrec->forkNum); + + appendStringInfoString(buf, path); + pfree(path); + } + else if (info == XLOG_SMGR_BUFPERSISTENCE) + { + xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec; + char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM); + + appendStringInfoString(buf, path); + appendStringInfo(buf, " persistence %d", xlrec->persistence); + pfree(path); + } } const char * @@ -55,6 +72,12 @@ smgr_identify(uint8 info) case XLOG_SMGR_TRUNCATE: id = "TRUNCATE"; break; + case XLOG_SMGR_UNLINK: + id = "UNLINK"; + break; + case XLOG_SMGR_BUFPERSISTENCE: + id = "BUFPERSISTENCE"; + break; } return id; diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c index d538f25726..0f1649758f 100644 --- a/src/backend/catalog/storage.c +++ b/src/backend/catalog/storage.c @@ -19,6 +19,7 @@ #include "postgres.h" +#include "access/amapi.h" #include "access/parallel.h" #include "access/visibilitymap.h" #include "access/xact.h" @@ -57,9 
+58,19 @@ int wal_skip_threshold = 2048; /* in kilobytes */ * but I'm being paranoid. */ + +/* This is bit-map, not ordianal numbers */ +#define PDOP_DELETE 0x00 +#define PDOP_UNLINK_FORK 0x01 +#define PDOP_SET_PERSISTENCE 0x02 + + typedef struct PendingRelDelete { RelFileNode relnode; /* relation that may need to be deleted */ + int op; /* operation mask */ + bool bufpersistence; /* buffer persistence to set */ + int unlink_forknum; /* forknum to unlink */ BackendId backend; /* InvalidBackendId if not a temp rel */ bool atCommit; /* T=delete at commit; F=delete at abort */ int nestLevel; /* xact nesting level of request */ @@ -153,6 +164,7 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) pending = (PendingRelDelete *) MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); pending->relnode = rnode; + pending->op = PDOP_DELETE; pending->backend = backend; pending->atCommit = false; /* delete if abort */ pending->nestLevel = GetCurrentTransactionNestLevel(); @@ -168,6 +180,209 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) return srel; } +/* + * RelationCreateInitFork + * Create physical storage for the init fork of a relation. + * + * Create the init fork for the relation. + * + * This function is transactional. The creation is WAL-logged, and if the + * transaction aborts later on, the init fork will be removed. + */ +void +RelationCreateInitFork(Relation rel) +{ + RelFileNode rnode = rel->rd_node; + PendingRelDelete *pending; + SMgrRelation srel; + PendingRelDelete *prev; + PendingRelDelete *next; + bool create = true; + + /* switch buffer persistence */ + SetRelationBuffersPersistence(rel->rd_smgr, false, false); + + /* + * If we have entries for init-fork operation of this relation, that means + * that we have already registered pending sync entries to drop preexisting + * init fork since before the current transaction started. This function + * reverts that change just by removing the entries. 
+ */ + prev = NULL; + for (pending = pendingDeletes; pending != NULL; pending = next) + { + next = pending->next; + if (RelFileNodeEquals(rnode, pending->relnode) && + pending->op != PDOP_DELETE) + { + if (prev) + prev->next = next; + else + pendingDeletes = next; + pfree(pending); + + create = false; + } + else + { + /* unrelated entry, don't touch it */ + prev = pending; + } + } + + if (!create) + return; + + /* We don't have existing init fork, create it. */ + srel = smgropen(rnode, InvalidBackendId); + smgrcreate(srel, INIT_FORKNUM, false); + + /* + * index-init fork needs further initialization. ambuildempty shoud do + * WAL-log and file sync by itself but otherwise we do that by myself. + */ + if (rel->rd_rel->relkind == RELKIND_INDEX) + rel->rd_indam->ambuildempty(rel); + else + { + log_smgrcreate(&rnode, INIT_FORKNUM); + smgrimmedsync(srel, INIT_FORKNUM); + } + + /* + * We have created the init fork. If server crashes before the current + * transaction ends the init fork left alone corrupts data while recovery. + * The inittmp fork works as the sentinel to identify that situaton. 
+ */ + smgrcreate(srel, INITTMP_FORKNUM, false); + log_smgrcreate(&rnode, INITTMP_FORKNUM); + smgrimmedsync(srel, INITTMP_FORKNUM); + + /* drop this init fork file at abort and revert persistence */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_FORK | PDOP_SET_PERSISTENCE; + pending->unlink_forknum = INIT_FORKNUM; + pending->bufpersistence = true; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + + /* drop inittmp fork at abort */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_FORK; + pending->unlink_forknum = INITTMP_FORKNUM; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + + /* drop inittmp fork at commit*/ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_FORK; + pending->unlink_forknum = INITTMP_FORKNUM; + pending->backend = InvalidBackendId; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; +} + +/* + * RelationDropInitFork + * Delete physical storage for the init fork of a relation. + * + * Register pending-delete of the init fork. The real deletion is performed by + * smgrDoPendingDeletes at commit. + * + * This function is transactional. If the transaction aborts later on, the + * deletion doesn't happen. 
+ */ +void +RelationDropInitFork(Relation rel) +{ + RelFileNode rnode = rel->rd_node; + PendingRelDelete *pending; + PendingRelDelete *prev; + PendingRelDelete *next; + bool inxact_created = false; + + /* switch buffer persistence */ + SetRelationBuffersPersistence(rel->rd_smgr, true, false); + + /* + * If we have entries for init-fork operation of this relation, that means + * that we have created the init fork in the current transaction. We + * immediately remove the init and inittmp forks immediately in that case. + * Otherwise just reister pending-delete for the existing init fork. + */ + prev = NULL; + for (pending = pendingDeletes; pending != NULL; pending = next) + { + next = pending->next; + if (RelFileNodeEquals(rnode, pending->relnode) && + pending->op != PDOP_DELETE) + { + /* unlink list entry */ + if (prev) + prev->next = next; + else + pendingDeletes = next; + pfree(pending); + + inxact_created = true; + } + else + { + /* unrelated entry, don't touch it */ + prev = pending; + } + } + + if (inxact_created) + { + SMgrRelation srel = smgropen(rnode, InvalidBackendId); + smgrclose(srel); + log_smgrunlink(&rnode, INIT_FORKNUM); + smgrunlink(srel, INIT_FORKNUM, false); + log_smgrunlink(&rnode, INITTMP_FORKNUM); + smgrunlink(srel, INITTMP_FORKNUM, false); + return; + } + + /* register drop of this init fork file at commit */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_FORK; + pending->unlink_forknum = INIT_FORKNUM; + pending->backend = InvalidBackendId; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + + /* revert buffer-persistence changes at abort */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_SET_PERSISTENCE; + pending->bufpersistence = false; + 
pending->backend = InvalidBackendId; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; +} + /* * Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL. */ @@ -187,6 +402,44 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum) XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE); } +/* + * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL. + */ +void +log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum) +{ + xl_smgr_unlink xlrec; + + /* + * Make an XLOG entry reporting the file unlink. + */ + xlrec.rnode = *rnode; + xlrec.forkNum = forkNum; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE); +} + +/* + * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL. + */ +void +log_smgrbufpersistence(const RelFileNode *rnode, bool persistence) +{ + xl_smgr_bufpersistence xlrec; + + /* + * Make an XLOG entry reporting the file unlink. + */ + xlrec.rnode = *rnode; + xlrec.persistence = persistence; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE); +} + /* * RelationDropStorage * Schedule unlinking of physical storage at transaction commit. 
@@ -200,6 +453,7 @@ RelationDropStorage(Relation rel) pending = (PendingRelDelete *) MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); pending->relnode = rel->rd_node; + pending->op = PDOP_DELETE; pending->backend = rel->rd_backend; pending->atCommit = true; /* delete if commit */ pending->nestLevel = GetCurrentTransactionNestLevel(); @@ -606,43 +860,68 @@ smgrDoPendingDeletes(bool isCommit) prev = NULL; for (pending = pendingDeletes; pending != NULL; pending = next) { + SMgrRelation srel; + next = pending->next; if (pending->nestLevel < nestLevel) { /* outer-level entries should not be processed yet */ prev = pending; + continue; } + + /* unlink list entry first, so we don't retry on failure */ + if (prev) + prev->next = next; else + pendingDeletes = next; + + if (pending->atCommit != isCommit) { - /* unlink list entry first, so we don't retry on failure */ - if (prev) - prev->next = next; - else - pendingDeletes = next; - /* do deletion if called for */ - if (pending->atCommit == isCommit) - { - SMgrRelation srel; - - srel = smgropen(pending->relnode, pending->backend); - - /* allocate the initial array, or extend it, if needed */ - if (maxrels == 0) - { - maxrels = 8; - srels = palloc(sizeof(SMgrRelation) * maxrels); - } - else if (maxrels <= nrels) - { - maxrels *= 2; - srels = repalloc(srels, sizeof(SMgrRelation) * maxrels); - } - - srels[nrels++] = srel; - } /* must explicitly free the list entry */ pfree(pending); /* prev does not change */ + continue; + } + + srel = smgropen(pending->relnode, pending->backend); + + if (pending->op != PDOP_DELETE) + { + if (pending->op & PDOP_UNLINK_FORK) + { + BlockNumber block = 0; + RelFileNodeBackend rbnode; + + rbnode.node = pending->relnode; + rbnode.backend = InvalidBackendId; + + DropRelFileNodeBuffers(rbnode, &pending->unlink_forknum, 1, + &block); + smgrclose(srel); + log_smgrunlink(&pending->relnode, pending->unlink_forknum); + smgrunlink(srel, pending->unlink_forknum, false); + } + + if (pending->op 
& PDOP_SET_PERSISTENCE) + SetRelationBuffersPersistence(srel, pending->bufpersistence, + false); + } + else + { + /* allocate the initial array, or extend it, if needed */ + if (maxrels == 0) + { + maxrels = 8; + srels = palloc(sizeof(SMgrRelation) * maxrels); + } + else if (maxrels <= nrels) + { + maxrels *= 2; + srels = repalloc(srels, sizeof(SMgrRelation) * maxrels); + } + + srels[nrels++] = srel; } } @@ -824,7 +1103,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr) for (pending = pendingDeletes; pending != NULL; pending = pending->next) { if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit - && pending->backend == InvalidBackendId) + && pending->backend == InvalidBackendId && + pending->op == PDOP_DELETE) nrels++; } if (nrels == 0) @@ -837,7 +1117,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr) for (pending = pendingDeletes; pending != NULL; pending = pending->next) { if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit - && pending->backend == InvalidBackendId) + && pending->backend == InvalidBackendId && + pending->op == PDOP_DELETE) { *rptr = pending->relnode; rptr++; @@ -917,6 +1198,15 @@ smgr_redo(XLogReaderState *record) reln = smgropen(xlrec->rnode, InvalidBackendId); smgrcreate(reln, xlrec->forkNum, true); } + else if (info == XLOG_SMGR_UNLINK) + { + xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record); + SMgrRelation reln; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + smgrclose(reln); + smgrunlink(reln, xlrec->forkNum, true); + } else if (info == XLOG_SMGR_TRUNCATE) { xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record); @@ -1005,6 +1295,15 @@ smgr_redo(XLogReaderState *record) FreeFakeRelcacheEntry(rel); } + else if (info == XLOG_SMGR_BUFPERSISTENCE) + { + xl_smgr_bufpersistence *xlrec = + (xl_smgr_bufpersistence *) XLogRecGetData(record); + SMgrRelation reln; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + SetRelationBuffersPersistence(reln, 
xlrec->persistence, true); + } else elog(PANIC, "smgr_redo: unknown op code %u", info); } diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index e3cfaf8b07..29f786142a 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -4916,6 +4916,142 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel, tab->afterStmts = list_concat(tab->afterStmts, afterStmts); return newcmd; +} + +/* + * RelationChangePersistence: do in-place persistence change of a relation + */ +static void +RelationChangePersistence(AlteredTableInfo *tab, char persistence, + LOCKMODE lockmode) +{ + Relation rel; + Relation classRel; + HeapTuple tuple, + newtuple; + Datum new_val[Natts_pg_class]; + bool new_null[Natts_pg_class], + new_repl[Natts_pg_class]; + int i; + List *relids; + ListCell *lc_oid; + + Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE); + Assert(lockmode == AccessExclusiveLock); + + /* + * Under the following condition, we need to call ATRewriteTable, which + * cannot be false in the AT_REWRITE_ALTER_PERSISTENCE case. + */ + Assert(tab->constraints == NULL && tab->partition_constraint == NULL && + tab->newvals == NULL && !tab->verify_new_notnull); + + rel = table_open(tab->relid, lockmode); + + Assert(rel->rd_rel->relpersistence != persistence); + + elog(DEBUG1, "perform im-place persistnce change"); + + RelationOpenSmgr(rel); + + /* + * First we collect all relations that we need to change persistence. 
+ */ + + /* Collect OIDs of indexes and toast relations */ + relids = RelationGetIndexList(rel); + relids = lcons_oid(rel->rd_id, relids); + + /* Add toast relation if any */ + if (OidIsValid(rel->rd_rel->reltoastrelid)) + { + List *toastidx; + Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode); + + RelationOpenSmgr(toastrel); + relids = lappend_oid(relids, rel->rd_rel->reltoastrelid); + toastidx = RelationGetIndexList(toastrel); + relids = list_concat(relids, toastidx); + pfree(toastidx); + table_close(toastrel, NoLock); + } + + table_close(rel, lockmode); + + /* Make changes in storage */ + classRel = table_open(RelationRelationId, RowExclusiveLock); + + foreach (lc_oid, relids) + { + Oid reloid = lfirst_oid(lc_oid); + Relation r = relation_open(reloid, lockmode); + + RelationOpenSmgr(r); + + /* Create or drop init fork */ + if (persistence == RELPERSISTENCE_UNLOGGED) + RelationCreateInitFork(r); + else + RelationDropInitFork(r); + + /* + * When this relation gets WAL-logged, immediately sync all files but + * initfork to establish the initial state on storage. Buffers have + * alredy flushed out by RelationCreate(Drop)InitFork called just + * above. Initfork should have been synced as needed. 
+ */ + if (persistence == RELPERSISTENCE_PERMANENT) + { + for (i = 0 ; i < INIT_FORKNUM ; i++) + { + if (smgrexists(r->rd_smgr, i)) + smgrimmedsync(r->rd_smgr, i); + } + } + + /* Update catalog */ + tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid)); + if (!HeapTupleIsValid(tuple)) + elog(ERROR, "cache lookup failed for relation %u", reloid); + + memset(new_val, 0, sizeof(new_val)); + memset(new_null, false, sizeof(new_null)); + memset(new_repl, false, sizeof(new_repl)); + + new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence); + new_null[Anum_pg_class_relpersistence - 1] = false; + new_repl[Anum_pg_class_relpersistence - 1] = true; + + newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel), + new_val, new_null, new_repl); + + CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple); + heap_freetuple(newtuple); + + /* + * While wal_level >= replica, switching to LOGGED requires the + * relation content to be WAL-logged to recovery the table. + */ + if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded()) + { + ForkNumber fork; + + for (fork = 0; fork < INIT_FORKNUM ; fork++) + { + if (smgrexists(r->rd_smgr, fork)) + log_newpage_range(r, fork, + 0, smgrnblocks(r->rd_smgr, fork), false); + } + } + + table_close(r, NoLock); + } + + table_close(classRel, NoLock); + + + + } /* @@ -5038,45 +5174,52 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode, tab->relid, tab->rewrite); - /* - * Create transient table that will receive the modified data. - * - * Ensure it is marked correctly as logged or unlogged. We have - * to do this here so that buffers for the new relfilenode will - * have the right persistence set, and at the same time ensure - * that the original filenode's buffers will get read in with the - * correct setting (i.e. the original one). Otherwise a rollback - * after the rewrite would possibly result with buffers for the - * original filenode having the wrong persistence setting. 
- * - * NB: This relies on swap_relation_files() also swapping the - * persistence. That wouldn't work for pg_class, but that can't be - * unlogged anyway. - */ - OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, persistence, - lockmode); + if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE) + RelationChangePersistence(tab, persistence, lockmode); + else + { + /* + * Create transient table that will receive the modified data. + * + * Ensure it is marked correctly as logged or unlogged. We + * have to do this here so that buffers for the new relfilenode + * will have the right persistence set, and at the same time + * ensure that the original filenode's buffers will get read in + * with the correct setting (i.e. the original one). Otherwise + * a rollback after the rewrite would possibly result with + * buffers for the original filenode having the wrong + * persistence setting. + * + * NB: This relies on swap_relation_files() also swapping the + * persistence. That wouldn't work for pg_class, but that can't + * be unlogged anyway. + */ + OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, persistence, + lockmode); - /* - * Copy the heap data into the new table with the desired - * modifications, and test the current data within the table - * against new constraints generated by ALTER TABLE commands. - */ - ATRewriteTable(tab, OIDNewHeap, lockmode); + /* + * Copy the heap data into the new table with the desired + * modifications, and test the current data within the table + * against new constraints generated by ALTER TABLE commands. + */ + ATRewriteTable(tab, OIDNewHeap, lockmode); - /* - * Swap the physical files of the old and new heaps, then rebuild - * indexes and discard the old heap. We can use RecentXmin for - * the table's new relfrozenxid because we rewrote all the tuples - * in ATRewriteTable, so no older Xid remains in the table. 
Also, - * we never try to swap toast tables by content, since we have no - * interest in letting this code work on system catalogs. - */ - finish_heap_swap(tab->relid, OIDNewHeap, - false, false, true, - !OidIsValid(tab->newTableSpace), - RecentXmin, - ReadNextMultiXactId(), - persistence); + /* + * Swap the physical files of the old and new heaps, then + * rebuild indexes and discard the old heap. We can use + * RecentXmin for the table's new relfrozenxid because we + * rewrote all the tuples in ATRewriteTable, so no older Xid + * remains in the table. Also, we never try to swap toast + * tables by content, since we have no interest in letting this + * code work on system catalogs. + */ + finish_heap_swap(tab->relid, OIDNewHeap, + false, false, true, + !OidIsValid(tab->newTableSpace), + RecentXmin, + ReadNextMultiXactId(), + persistence); + } } else { diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index ad0d1a9abc..ddd0133cdf 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -37,6 +37,7 @@ #include "access/xlog.h" #include "catalog/catalog.h" #include "catalog/storage.h" +#include "catalog/storage_xlog.h" #include "executor/instrument.h" #include "lib/binaryheap.h" #include "miscadmin.h" @@ -3033,6 +3034,93 @@ DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum, } } +/* --------------------------------------------------------------------- + * SetRelFileNodeBuffersPersistence + * + * This function changes the persistence of all buffer pages of a relation + * then writes all dirty pages of the relation out to disk when switching + * to PERMANENT. (or more accurately, out to kernel disk buffers), + * ensuring that the kernel has an up-to-date view of the relation. 
+ * + * Generally, the caller should be holding AccessExclusiveLock on the + * target relation to ensure that no other backend is busy dirtying + * more blocks of the relation; the effects can't be expected to last + * after the lock is released. + * + * XXX currently it sequentially searches the buffer pool, should be + * changed to more clever ways of searching. This routine is not + * used in any performance-critical code paths, so it's not worth + * adding additional overhead to normal paths to make it go faster; + * but see also DropRelFileNodeBuffers. + * -------------------------------------------------------------------- + */ +void +SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo) +{ + int i; + RelFileNodeBackend rnode = srel->smgr_rnode; + + Assert (!RelFileNodeBackendIsTemp(rnode)); + + if (!isRedo) + log_smgrbufpersistence(&srel->smgr_rnode.node, permanent); + + ResourceOwnerEnlargeBuffers(CurrentResourceOwner); + + for (i = 0; i < NBuffers; i++) + { + BufferDesc *bufHdr = GetBufferDescriptor(i); + uint32 buf_state; + + if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node)) + continue; + + ReservePrivateRefCountEntry(); + + buf_state = LockBufHdr(bufHdr); + + if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node)) + { + UnlockBufHdr(bufHdr, buf_state); + continue; + } + + if (permanent) + { + /* Init fork is being dropped, drop buffers for it. 
*/ + if (bufHdr->tag.forkNum == INIT_FORKNUM) + { + InvalidateBuffer(bufHdr); + continue; + } + + buf_state |= BM_PERMANENT; + pg_atomic_write_u32(&bufHdr->state, buf_state); + + /* we flush this buffer when swithing to PERMANENT */ + if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY)) + { + PinBuffer_Locked(bufHdr); + LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), + LW_SHARED); + FlushBuffer(bufHdr, srel); + LWLockRelease(BufferDescriptorGetContentLock(bufHdr)); + UnpinBuffer(bufHdr, true); + } + else + UnlockBufHdr(bufHdr, buf_state); + } + else + { + /* init fork is always BM_PERMANENT. See BufferAlloc */ + if (bufHdr->tag.forkNum != INIT_FORKNUM) + buf_state &= ~BM_PERMANENT; + + UnlockBufHdr(bufHdr, buf_state); + } + } +} + /* --------------------------------------------------------------------- * DropRelFileNodesAllBuffers * diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c index 0c2094f766..6524262a74 100644 --- a/src/backend/storage/file/reinit.c +++ b/src/backend/storage/file/reinit.c @@ -31,6 +31,7 @@ static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, typedef struct { char oid[OIDCHARS + 1]; + bool dirty; } unlogged_relation_entry; /* @@ -151,6 +152,8 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) DIR *dbspace_dir; struct dirent *de; char rm_path[MAXPGPATH * 2]; + HTAB *hash; + HASHCTL ctl; /* Caller must specify at least one operation. */ Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0); @@ -160,62 +163,73 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) * the files with init forks. Then, we go through again and nuke * everything with the same OID except the init fork. 
*/ + + /* + * It's possible that someone could create a ton of unlogged relations + * in the same database & tablespace, so we'd better use a hash table + * rather than an array or linked list to keep track of which files + * need to be reset. Otherwise, this cleanup operation would be + * O(n^2). + */ + memset(&ctl, 0, sizeof(ctl)); + ctl.keysize = sizeof(unlogged_relation_entry); + ctl.entrysize = sizeof(unlogged_relation_entry); + hash = hash_create("unlogged hash", 32, &ctl, HASH_ELEM); + + /* Scan the directory. */ + dbspace_dir = AllocateDir(dbspacedirname); + while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) + { + ForkNumber forkNum; + int oidchars; + bool found; + unlogged_relation_entry key; + unlogged_relation_entry *ent; + + /* Skip anything that doesn't look like a relation data file. */ + if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, + &forkNum)) + continue; + + /* Also skip it unless this is the init fork. */ + if (forkNum != INIT_FORKNUM && forkNum != INITTMP_FORKNUM) + continue; + + /* + * Put the OID portion of the name into the hash table, if it + * isn't already. + */ + memset(key.oid, 0, sizeof(key.oid)); + memcpy(key.oid, de->d_name, oidchars); + ent = hash_search(hash, &key, HASH_ENTER, &found); + + if (!found) + ent->dirty = 0; + + /* + * If we have the inittmp fork, the transaction that created the + * corresponding init file was not committed nor aborted. Mark this + * init fork as dirty so that we can clean up them properly. + */ + if (forkNum == INITTMP_FORKNUM) + ent->dirty = true; + } + + /* Done with the first pass. */ + FreeDir(dbspace_dir); + + /* + * If we didn't find any init forks, there's no point in continuing; + * we can bail out now. 
+ */ + if (hash_get_num_entries(hash) == 0) + { + hash_destroy(hash); + return; + } + if ((op & UNLOGGED_RELATION_CLEANUP) != 0) { - HTAB *hash; - HASHCTL ctl; - - /* - * It's possible that someone could create a ton of unlogged relations - * in the same database & tablespace, so we'd better use a hash table - * rather than an array or linked list to keep track of which files - * need to be reset. Otherwise, this cleanup operation would be - * O(n^2). - */ - memset(&ctl, 0, sizeof(ctl)); - ctl.keysize = sizeof(unlogged_relation_entry); - ctl.entrysize = sizeof(unlogged_relation_entry); - hash = hash_create("unlogged hash", 32, &ctl, HASH_ELEM); - - /* Scan the directory. */ - dbspace_dir = AllocateDir(dbspacedirname); - while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) - { - ForkNumber forkNum; - int oidchars; - unlogged_relation_entry ent; - - /* Skip anything that doesn't look like a relation data file. */ - if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) - continue; - - /* Also skip it unless this is the init fork. */ - if (forkNum != INIT_FORKNUM) - continue; - - /* - * Put the OID portion of the name into the hash table, if it - * isn't already. - */ - memset(ent.oid, 0, sizeof(ent.oid)); - memcpy(ent.oid, de->d_name, oidchars); - hash_search(hash, &ent, HASH_ENTER, NULL); - } - - /* Done with the first pass. */ - FreeDir(dbspace_dir); - - /* - * If we didn't find any init forks, there's no point in continuing; - * we can bail out now. - */ - if (hash_get_num_entries(hash) == 0) - { - hash_destroy(hash); - return; - } - /* * Now, make a second pass and remove anything that matches. */ @@ -224,39 +238,48 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) { ForkNumber forkNum; int oidchars; - bool found; - unlogged_relation_entry ent; + unlogged_relation_entry key; + unlogged_relation_entry *ent; /* Skip anything that doesn't look like a relation data file. 
*/ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, &forkNum)) continue; - /* We never remove the init fork. */ - if (forkNum == INIT_FORKNUM) - continue; - /* * See whether the OID portion of the name shows up in the hash * table. */ - memset(ent.oid, 0, sizeof(ent.oid)); - memcpy(ent.oid, de->d_name, oidchars); - hash_search(hash, &ent, HASH_FIND, &found); + memset(key.oid, 0, sizeof(key.oid)); + memcpy(key.oid, de->d_name, oidchars); + ent = hash_search(hash, &key, HASH_FIND, NULL); - /* If so, nuke it! */ - if (found) + /* Don't remove files if corresponding init fork is not found */ + if (!ent) + continue; + + if (!ent->dirty) + { + /* Don't remove clean init file */ + if (forkNum == INIT_FORKNUM) + continue; + }else { - snprintf(rm_path, sizeof(rm_path), "%s/%s", - dbspacedirname, de->d_name); - if (unlink(rm_path) < 0) - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not remove file \"%s\": %m", - rm_path))); - else - elog(DEBUG2, "unlinked file \"%s\"", rm_path); + /* Remove dirty init file, together with inittmp file */ + if (forkNum != INIT_FORKNUM && forkNum != INITTMP_FORKNUM) + continue; } + + /* so, nuke it! */ + snprintf(rm_path, sizeof(rm_path), "%s/%s", + dbspacedirname, de->d_name); + if (unlink(rm_path) < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not remove file \"%s\": %m", + rm_path))); + else + elog(DEBUG2, "unlinked file \"%s\"", rm_path); } /* Cleanup is complete. */ @@ -273,6 +296,9 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) */ if ((op & UNLOGGED_RELATION_INIT) != 0) { + unlogged_relation_entry key; + unlogged_relation_entry *ent; + /* Scan the directory. */ dbspace_dir = AllocateDir(dbspacedirname); while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) @@ -288,6 +314,38 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) &forkNum)) continue; + /* + * See whether the OID portion of the name shows up in the hash + * table. 
+ */ + memset(key.oid, 0, sizeof(key.oid)); + memcpy(key.oid, de->d_name, oidchars); + ent = hash_search(hash, &key, HASH_FIND, NULL); + + /* Don't init file that doesn't have the init fork. */ + if (!ent) + continue; + + if (ent->dirty && + (forkNum == INIT_FORKNUM || forkNum == INITTMP_FORKNUM)) + { + /* + * The init file is dirty. The files has been removed once at + * cleanup time but recovery can create them again. Remove both + * INIT and INITTMP files. + */ + snprintf(rm_path, sizeof(rm_path), "%s/%s", + dbspacedirname, de->d_name); + if (unlink(rm_path) < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not remove file \"%s\": %m", + rm_path))); + else + elog(DEBUG2, "unlinked file \"%s\"", rm_path); + continue; + } + /* Also skip it unless this is the init fork. */ if (forkNum != INIT_FORKNUM) continue; diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c index dcc09df0c7..5eb9e97b3d 100644 --- a/src/backend/storage/smgr/smgr.c +++ b/src/backend/storage/smgr/smgr.c @@ -645,6 +645,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum) smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum); } +void +smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo); +} + /* * AtEOXact_SMgr * diff --git a/src/common/relpath.c b/src/common/relpath.c index ad733d1363..2a5e5fa990 100644 --- a/src/common/relpath.c +++ b/src/common/relpath.c @@ -34,7 +34,8 @@ const char *const forkNames[] = { "main", /* MAIN_FORKNUM */ "fsm", /* FSM_FORKNUM */ "vm", /* VISIBILITYMAP_FORKNUM */ - "init" /* INIT_FORKNUM */ + "init", /* INIT_FORKNUM */ + "itmp" /* INITTMP_FORKNUM */ }; StaticAssertDecl(lengthof(forkNames) == (MAX_FORKNUM + 1), diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h index 30c38e0ca6..c2259cd7e3 100644 --- a/src/include/catalog/storage.h +++ b/src/include/catalog/storage.h @@ -23,6 +23,8 @@ extern int 
wal_skip_threshold; extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence); +extern void RelationCreateInitFork(Relation rel); +extern void RelationDropInitFork(Relation rel); extern void RelationDropStorage(Relation rel); extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit); extern void RelationPreTruncate(Relation rel); diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h index 7b21cab2e0..d48b5288ce 100644 --- a/src/include/catalog/storage_xlog.h +++ b/src/include/catalog/storage_xlog.h @@ -29,6 +29,8 @@ /* XLOG gives us high 4 bits */ #define XLOG_SMGR_CREATE 0x10 #define XLOG_SMGR_TRUNCATE 0x20 +#define XLOG_SMGR_UNLINK 0x30 +#define XLOG_SMGR_BUFPERSISTENCE 0x40 typedef struct xl_smgr_create { @@ -36,6 +38,18 @@ typedef struct xl_smgr_create ForkNumber forkNum; } xl_smgr_create; +typedef struct xl_smgr_unlink +{ + RelFileNode rnode; + ForkNumber forkNum; +} xl_smgr_unlink; + +typedef struct xl_smgr_bufpersistence +{ + RelFileNode rnode; + bool persistence; +} xl_smgr_bufpersistence; + /* flags for xl_smgr_truncate */ #define SMGR_TRUNCATE_HEAP 0x0001 #define SMGR_TRUNCATE_VM 0x0002 @@ -51,6 +65,8 @@ typedef struct xl_smgr_truncate } xl_smgr_truncate; extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum); +extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum); +extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence); extern void smgr_redo(XLogReaderState *record); extern void smgr_desc(StringInfo buf, XLogReaderState *record); diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h index 869cabcc0d..f6e1a74a38 100644 --- a/src/include/common/relpath.h +++ b/src/include/common/relpath.h @@ -43,7 +43,8 @@ typedef enum ForkNumber MAIN_FORKNUM = 0, FSM_FORKNUM, VISIBILITYMAP_FORKNUM, - INIT_FORKNUM + INIT_FORKNUM, + INITTMP_FORKNUM /* * NOTE: if you add a new fork, change MAX_FORKNUM and possibly @@ -52,7 +53,7 
@@ typedef enum ForkNumber */ } ForkNumber; -#define MAX_FORKNUM INIT_FORKNUM +#define MAX_FORKNUM INITTMP_FORKNUM #define FORKNAMECHARS 4 /* max chars for a fork name */ diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h index ee91b8fa26..e2496ed1c8 100644 --- a/src/include/storage/bufmgr.h +++ b/src/include/storage/bufmgr.h @@ -168,6 +168,8 @@ extern PGDLLIMPORT int32 *LocalRefCount; */ #define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer)) +struct SmgrRelationData; + /* * prototypes for functions in bufmgr.c */ @@ -205,6 +207,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels) extern void FlushDatabaseBuffers(Oid dbid); extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum, int nforks, BlockNumber *firstDelBlock); +extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel, + bool permanent, bool isRedo); extern void DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes); extern void DropDatabaseBuffers(Oid dbid); diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h index f28a842401..5d74631006 100644 --- a/src/include/storage/smgr.h +++ b/src/include/storage/smgr.h @@ -86,6 +86,7 @@ extern void smgrclose(SMgrRelation reln); extern void smgrcloseall(void); extern void smgrclosenode(RelFileNodeBackend rnode); extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo); +extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo); extern void smgrdosyncall(SMgrRelation *rels, int nrels); extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo); extern void smgrextend(SMgrRelation reln, ForkNumber forknum, -- 2.18.4 From 5ce0551b9685dcd742bdcdf610ac80424327a9b5 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Wed, 11 Nov 2020 23:21:09 +0900 Subject: [PATCH v3 2/2] New command ALTER TABLE ALL IN TABLESPACE SET LOGGED/UNLOGGED To ease invoking ALTER TABLE SET 
LOGGED/UNLOGGED, this command changes relation persistence of all tables in the specified tablespace. --- src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++++++++++ src/backend/nodes/copyfuncs.c | 16 ++++ src/backend/nodes/equalfuncs.c | 15 ++++ src/backend/parser/gram.y | 20 +++++ src/backend/tcop/utility.c | 11 +++ src/include/commands/tablecmds.h | 2 + src/include/nodes/nodes.h | 1 + src/include/nodes/parsenodes.h | 9 ++ 8 files changed, 214 insertions(+) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 29f786142a..ec2a45357b 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -13665,6 +13665,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt) return new_tablespaceoid; } +/* + * Alter Table ALL ... SET LOGGED/UNLOGGED + * + * Allows a user to change persistence of all objects in a given tablespace in + * the current database. Objects can be chosen based on the owner of the + * object also, to allow users to change persistene only their objects. The + * main permissions handling is done by the lower-level change persistence + * function. + * + * All to-be-modified objects are locked first. If NOWAIT is specified and the + * lock can't be acquired then we ereport(ERROR). + */ +void +AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt) +{ + List *relations = NIL; + ListCell *l; + ScanKeyData key[1]; + Relation rel; + TableScanDesc scan; + HeapTuple tuple; + Oid tablespaceoid; + List *role_oids = roleSpecsToIds(NIL); + + /* Ensure we were not asked to change something we can't */ + if (stmt->objtype != OBJECT_TABLE) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("only tables can be specified"))); + + /* Get the tablespace OID */ + tablespaceoid = get_tablespace_oid(stmt->tablespacename, false); + + /* + * Now that the checks are done, check if we should set either to + * InvalidOid because it is our database's default tablespace. 
+ */ + if (tablespaceoid == MyDatabaseTableSpace) + tablespaceoid = InvalidOid; + + /* + * Walk the list of objects in the tablespace to pick up them. This will + * only find objects in our database, of course. + */ + ScanKeyInit(&key[0], + Anum_pg_class_reltablespace, + BTEqualStrategyNumber, F_OIDEQ, + ObjectIdGetDatum(tablespaceoid)); + + rel = table_open(RelationRelationId, AccessShareLock); + scan = table_beginscan_catalog(rel, 1, key); + while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL) + { + Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple); + Oid relOid = relForm->oid; + + /* + * Do not pick-up objects in pg_catalog as part of this, if an admin + * really wishes to do so, they can issue the individual ALTER + * commands directly. + * + * Also, explicitly avoid any shared tables, temp tables, or TOAST + * (TOAST will be changed with the main table). + */ + if (IsCatalogNamespace(relForm->relnamespace) || + relForm->relisshared || + isAnyTempNamespace(relForm->relnamespace) || + IsToastNamespace(relForm->relnamespace)) + continue; + + /* Only pick up the object type requested */ + if (relForm->relkind != RELKIND_RELATION) + continue; + + /* Check if we are only picking-up objects owned by certain roles */ + if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner)) + continue; + + /* + * Handle permissions-checking here since we are locking the tables + * and also to avoid doing a bunch of work only to fail part-way. Note + * that permissions will also be checked by AlterTableInternal(). + * + * Caller must be considered an owner on the table of which we're going + * to change persistence. 
+ */ + if (!pg_class_ownercheck(relOid, GetUserId())) + aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)), + NameStr(relForm->relname)); + + if (stmt->nowait && + !ConditionalLockRelationOid(relOid, AccessExclusiveLock)) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_IN_USE), + errmsg("aborting because lock on relation \"%s.%s\" is not available", + get_namespace_name(relForm->relnamespace), + NameStr(relForm->relname)))); + else + LockRelationOid(relOid, AccessExclusiveLock); + + /* + * Add to our list of objects of which we're going to change + * persistence. + */ + relations = lappend_oid(relations, relOid); + } + + table_endscan(scan); + table_close(rel, AccessShareLock); + + if (relations == NIL) + ereport(NOTICE, + (errcode(ERRCODE_NO_DATA_FOUND), + errmsg("no matching relations in tablespace \"%s\" found", + tablespaceoid == InvalidOid ? "(database default)" : + get_tablespace_name(tablespaceoid)))); + + /* + * Everything is locked, loop through and change persistence of all of the + * relations. 
+ */ + foreach(l, relations) + { + List *cmds = NIL; + AlterTableCmd *cmd = makeNode(AlterTableCmd); + + if (stmt->logged) + cmd->subtype = AT_SetLogged; + else + cmd->subtype = AT_SetUnLogged; + + cmds = lappend(cmds, cmd); + + EventTriggerAlterTableStart((Node *) stmt); + /* OID is set by AlterTableInternal */ + AlterTableInternal(lfirst_oid(l), cmds, false); + EventTriggerAlterTableEnd(); + } +} + static void index_copy_data(Relation rel, RelFileNode newrnode) { diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c index 3031c52991..7bb8fc767b 100644 --- a/src/backend/nodes/copyfuncs.c +++ b/src/backend/nodes/copyfuncs.c @@ -4120,6 +4120,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from) return newnode; } +static AlterTableSetLoggedAllStmt * +_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from) +{ + AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt); + + COPY_STRING_FIELD(tablespacename); + COPY_SCALAR_FIELD(objtype); + COPY_SCALAR_FIELD(logged); + COPY_SCALAR_FIELD(nowait); + + return newnode; +} + static CreateExtensionStmt * _copyCreateExtensionStmt(const CreateExtensionStmt *from) { @@ -5419,6 +5432,9 @@ copyObjectImpl(const void *from) case T_AlterTableMoveAllStmt: retval = _copyAlterTableMoveAllStmt(from); break; + case T_AlterTableSetLoggedAllStmt: + retval = _copyAlterTableSetLoggedAllStmt(from); + break; case T_CreateExtensionStmt: retval = _copyCreateExtensionStmt(from); break; diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c index 9aa853748d..55ab3d7039 100644 --- a/src/backend/nodes/equalfuncs.c +++ b/src/backend/nodes/equalfuncs.c @@ -1856,6 +1856,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a, return true; } +static bool +_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a, + const AlterTableSetLoggedAllStmt *b) +{ + COMPARE_STRING_FIELD(tablespacename); + COMPARE_SCALAR_FIELD(objtype); + 
COMPARE_SCALAR_FIELD(logged); + COMPARE_SCALAR_FIELD(nowait); + + return true; +} + static bool _equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b) { @@ -3474,6 +3486,9 @@ equal(const void *a, const void *b) case T_AlterTableMoveAllStmt: retval = _equalAlterTableMoveAllStmt(a, b); break; + case T_AlterTableSetLoggedAllStmt: + retval = _equalAlterTableSetLoggedAllStmt(a, b); + break; case T_CreateExtensionStmt: retval = _equalCreateExtensionStmt(a, b); break; diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index 051f1f1d49..08da69e32f 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -1893,6 +1893,26 @@ AlterTableStmt: n->nowait = $13; $$ = (Node *)n; } + | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->logged = true; + n->nowait = $9; + $$ = (Node *)n; + } + | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->logged = false; + n->nowait = $9; + $$ = (Node *)n; + } | ALTER INDEX qualified_name alter_table_cmds { AlterTableStmt *n = makeNode(AlterTableStmt); diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c index f398027fa6..8066e7a607 100644 --- a/src/backend/tcop/utility.c +++ b/src/backend/tcop/utility.c @@ -160,6 +160,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree) case T_AlterTSConfigurationStmt: case T_AlterTSDictionaryStmt: case T_AlterTableMoveAllStmt: + case T_AlterTableSetLoggedAllStmt: case T_AlterTableSpaceOptionsStmt: case T_AlterTableStmt: case T_AlterTypeStmt: @@ -1733,6 +1734,12 @@ ProcessUtilitySlow(ParseState *pstate, commandCollected = true; break; + case T_AlterTableSetLoggedAllStmt: + AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree); + 
/* commands are stashed in AlterTableSetLoggedAll */ + commandCollected = true; + break; + case T_DropStmt: ExecDropStmt((DropStmt *) parsetree, isTopLevel); /* no commands stashed for DROP */ @@ -2616,6 +2623,10 @@ CreateCommandTag(Node *parsetree) tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype); break; + case T_AlterTableSetLoggedAllStmt: + tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype); + break; + case T_AlterTableStmt: tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype); break; diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h index c1581ad178..206de61154 100644 --- a/src/include/commands/tablecmds.h +++ b/src/include/commands/tablecmds.h @@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse); extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt); +extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt); + extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt, Oid *oldschema); diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h index 7ddd8c011b..74bf050b67 100644 --- a/src/include/nodes/nodes.h +++ b/src/include/nodes/nodes.h @@ -424,6 +424,7 @@ typedef enum NodeTag T_AlterCollationStmt, T_CallStmt, T_AlterStatsStmt, + T_AlterTableSetLoggedAllStmt, /* * TAGS FOR PARSE TREE NODES (parsenodes.h) diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h index 7ef9b0eac0..f5b4976ae1 100644 --- a/src/include/nodes/parsenodes.h +++ b/src/include/nodes/parsenodes.h @@ -2235,6 +2235,15 @@ typedef struct AlterTableMoveAllStmt bool nowait; } AlterTableMoveAllStmt; +typedef struct AlterTableSetLoggedAllStmt +{ + NodeTag type; + char *tablespacename; + ObjectType objtype; /* Object type to move */ + bool logged; + bool nowait; +} AlterTableSetLoggedAllStmt; + /* ---------------------- * Create/Alter Extension Statements * ---------------------- -- 2.18.4
Hi Horiguchi-san,

Thank you for making a patch so quickly. I've started looking at it.

What makes you think this is a PoC? Is it the missing documentation and test cases? If there's something you think doesn't work or that you are concerned about, can you share it?

Do you know the reason why the data copy was done before? It may be odd for me to ask this, but I think I saw someone refer to a past discussion saying that eliminating the data copy is difficult due to some processing at commit. I can't find it.

(1)
@@ -168,6 +168,8 @@ extern PGDLLIMPORT int32 *LocalRefCount;
  */
 #define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer))

+struct SmgrRelationData;

This declaration is already in the file:

/* forward declared, to avoid having to expose buf_internals.h here */
struct WritebackContext;

/* forward declared, to avoid including smgr.h here */
struct SMgrRelationData;

Regards
Takayuki Tsunakawa
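As a side note on (1): C struct tags are case-sensitive, so `struct SmgrRelationData` and the existing `struct SMgrRelationData` would name two unrelated incomplete types. The forward-declaration idiom itself can be sketched standalone as below. The names `DemoRelationData` and `demo_fork_count` are hypothetical and are not taken from bufmgr.h or smgr.h.

```c
/*
 * Forward declaration: the struct is an incomplete type here, which is
 * enough to declare functions that take a pointer to it. This is how a
 * header can avoid including the header that defines the struct.
 */
struct DemoRelationData;

/* Prototype that only needs the incomplete type. */
int demo_fork_count(const struct DemoRelationData *rel);

/* The full definition can live elsewhere (here: later in the same file). */
struct DemoRelationData
{
	int			nforks;
};

/* Code that dereferences the pointer needs the full definition in scope. */
int
demo_fork_count(const struct DemoRelationData *rel)
{
	return rel->nforks;
}
```

A second forward declaration with a differently spelled tag would compile without complaint, but it would not refer to the same type, which is presumably why the extra line in the patch is not needed.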
Hello, Tsunakawa-San

> Do you know the reason why data copy was done before? And, it may be
> odd for me to ask this, but I think I saw someone referred to the past
> discussion that eliminating data copy is difficult due to some processing at
> commit. I can't find it.

I can share two sources from hackers threads on why eliminating the data copy is difficult.

Tom's remark, and the context for copying the relation's data:
https://www.postgresql.org/message-id/flat/31724.1394163360%40sss.pgh.pa.us#31724.1394163360@sss.pgh.pa.us

Amit-San quoted this thread and mentioned that point in another thread:
https://www.postgresql.org/message-id/CAA4eK1%2BHDqS%2B1fhs5Jf9o4ZujQT%3DXBZ6sU0kOuEh2hqQAC%2Bt%3Dw%40mail.gmail.com

Best,
Takamichi Osumi
At Fri, 13 Nov 2020 06:43:13 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in
> Hi Horiguchi-san,
>
>
> Thank you for making a patch so quickly. I've started looking at it.
>
> What makes you think this is a PoC? Documentation and test cases? If there's something you think that doesn't work or are concerned about, can you share it?

The latest version is heavily revised and has many more comments, so it might have left the PoC state. The necessity of documentation is doubtful, since this patch doesn't change user-facing behavior other than speed. Some tests are required, especially from the recovery and replication perspectives, but I haven't been able to write them. (One of the tests needs to cause a crash while a transaction is running.)

> Do you know the reason why data copy was done before? And, it may be odd for me to ask this, but I think I saw someone referred to the past discussion that eliminating data copy is difficult due to some processing at commit. I can't find it.

My guess is that it was just because that is simpler with respect to rollback and code sharing, and maybe no one has complained that SET LOGGED/UNLOGGED takes longer than required or expected.

The current implementation is simple. It is enough to discard the old or the new relfilenode depending on whether the current transaction ends with commit or abort. Tweaking a relfilenode that is in use introduces some skew in several places. I used the pendingDelete mechanism in a somewhat more complicated way, violated an abstraction (I think calling AM routines from storage.c is not good), and even introduced a new fork kind only to mark an init fork as "not committed yet". There might be a better way, but I haven't found it.

(The patch scans all shared buffer blocks for each relation).
> (1)
> @@ -168,6 +168,8 @@ extern PGDLLIMPORT int32 *LocalRefCount;
>   */
>  #define BufferGetPage(buffer) ((Page)BufferGetBlock(buffer))
>
> +struct SmgrRelationData;
>
> This declaration is already in the file:
>
> /* forward declared, to avoid having to expose buf_internals.h here */
> struct WritebackContext;
>
> /* forward declared, to avoid including smgr.h here */
> struct SMgrRelationData;

Hmmm. Nice catch. I will fix it in the next version.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 13 Nov 2020 07:15:41 +0000, "osumi.takamichi@fujitsu.com" <osumi.takamichi@fujitsu.com> wrote in
> Hello, Tsunakawa-San

Thanks for sharing it!

> > Do you know the reason why data copy was done before? And, it may be
> > odd for me to ask this, but I think I saw someone referred to the past
> > discussion that eliminating data copy is difficult due to some processing at
> > commit. I can't find it.
> I can share 2 sources why to eliminate the data copy is difficult in hackers thread.
>
> Tom's remark and the context to copy relation's data.
> https://www.postgresql.org/message-id/flat/31724.1394163360%40sss.pgh.pa.us#31724.1394163360@sss.pgh.pa.us

https://www.postgresql.org/message-id/CA+Tgmob44LNwwU73N1aJsGQyzQ61SdhKJRC_89wCm0+aLg=x2Q@mail.gmail.com

> No, not really. The issue is more around what happens if we crash
> part way through. At crash recovery time, the system catalogs are not
> available, because the database isn't consistent yet and, anyway, the
> startup process can't be bound to a database, let alone every database
> that might contain unlogged tables. So the sentinel that's used to
> decide whether to flush the contents of a table or index is the
> presence or absence of an _init fork, which the startup process
> obviously can see just fine. The _init fork also tells us what to
> stick in the relation when we reset it; for a table, we can just reset
> to an empty file, but that's not legal for indexes, so the _init fork
> contains a pre-initialized empty index that we can just copy over.
>
> Now, to make an unlogged table logged, you've got to at some stage
> remove those _init forks. But this is not a transactional operation.
> If you remove the _init forks and then the transaction rolls back,
> you've left the system an inconsistent state. If you postpone the
> removal until commit time, then you have a problem if it fails,

It's true. That is the cause of the headache.

> particularly if it works for the first file but fails for the second.
> And if you crash at any point before you've fsync'd the containing
> directory, you have no idea which files will still be on disk after a
> hard reboot.

This is not an issue in this patch *except* in the case where removal of the init fork fails but the subsequent removal of the inittmp fork succeeds. Another idea is to add a "not-yet-committed" property to a fork. I added a new fork type to keep the patch simple, but I could go that way if that is an issue.

> Amit-San quoted this thread and mentioned that point in another thread.
> https://www.postgresql.org/message-id/CAA4eK1%2BHDqS%2B1fhs5Jf9o4ZujQT%3DXBZ6sU0kOuEh2hqQAC%2Bt%3Dw%40mail.gmail.com

This sounds like a slightly different discussion. Making part of a table UNLOGGED looks far more difficult to me.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
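To make the sentinel idea concrete, here is a minimal standalone sketch of the per-file decision that the cleanup pass of ResetUnloggedRelationsInDbspaceDir() in the posted patch appears to make. The helper name `should_unlink_at_cleanup` and the `Demo*` enum are hypothetical. The meaning of "dirty" (an inittmp fork was found next to the init fork) is an assumption based on reading the diff, not something confirmed in the thread.

```c
#include <stdbool.h>

/* Fork kinds, mirroring the order of ForkNumber after the patch. */
typedef enum DemoFork
{
	DEMO_MAIN_FORK = 0,
	DEMO_FSM_FORK,
	DEMO_VM_FORK,
	DEMO_INIT_FORK,
	DEMO_INITTMP_FORK
} DemoFork;

/*
 * Decide whether one file should be unlinked during the cleanup pass.
 *
 * has_init: an init fork exists for this relfilenode
 * dirty:    (assumption) an inittmp fork also exists, i.e. the init fork
 *           belongs to a transaction that did not finish
 */
bool
should_unlink_at_cleanup(DemoFork fork, bool has_init, bool dirty)
{
	/* No init fork: not an unlogged relation, leave every file alone. */
	if (!has_init)
		return false;

	/* Clean init fork: remove the data forks but keep the init fork. */
	if (!dirty)
		return fork != DEMO_INIT_FORK;

	/* Dirty init fork: remove only the init and inittmp files. */
	return fork == DEMO_INIT_FORK || fork == DEMO_INITTMP_FORK;
}
```

In the dirty case the main fork is kept, which matches the intent that an interrupted switch to UNLOGGED should leave the relation's data in place.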
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> > No, not really. The issue is more around what happens if we crash
> > part way through. At crash recovery time, the system catalogs are not
> > available, because the database isn't consistent yet and, anyway, the
> > startup process can't be bound to a database, let alone every database
> > that might contain unlogged tables. So the sentinel that's used to
> > decide whether to flush the contents of a table or index is the
> > presence or absence of an _init fork, which the startup process
> > obviously can see just fine. The _init fork also tells us what to
> > stick in the relation when we reset it; for a table, we can just reset
> > to an empty file, but that's not legal for indexes, so the _init fork
> > contains a pre-initialized empty index that we can just copy over.
> >
> > Now, to make an unlogged table logged, you've got to at some stage
> > remove those _init forks. But this is not a transactional operation.
> > If you remove the _init forks and then the transaction rolls back,
> > you've left the system an inconsistent state. If you postpone the
> > removal until commit time, then you have a problem if it fails,
>
> It's true. That is the cause of the headache.
...
> The current implementation is simple. It is enough to discard the old or
> the new relfilenode depending on whether the current transaction ends with
> commit or abort. Tweaking a relfilenode that is in use introduces some skew
> in several places. I used the pendingDelete mechanism in a somewhat more
> complicated way, violated an abstraction (I think calling AM routines from
> storage.c is not good), and even introduced a new fork kind only to
> mark an init fork as "not committed yet". There might be a better way,
> but I haven't found it.

I have no alternative idea yet, either. I agree that we want to avoid them, especially introducing the inittmp fork... Anyway, below are the rest of my review comments for 0001. I want to review 0002 when we have decided to go with 0001.
(2) XLOG_SMGR_UNLINK seems to necessitate modification of the following comments: [src/include/catalog/storage_xlog.h] /* * Declarations for smgr-related XLOG records * * Note: we log file creation and truncation here, but logging of deletion * actions is handled by xact.c, because it is part of transaction commit. */ [src/backend/access/transam/README] 3. Deleting a table, which requires an unlink() that could fail. Our approach here is to WAL-log the operation first, but to treat failure of the actual unlink() call as a warning rather than error condition. Again, this can leave an orphan file behind, but that's cheap compared to the alternatives. Since we can't actually do the unlink() until after we've committed the DROP TABLE transaction, throwing an error would be out of the question anyway. (It may be worth noting that the WAL entry about the file deletion is actually part of the commit record for the dropping transaction.) (3) +/* This is bit-map, not ordianal numbers */ There seems to be no comments using "bit-map". "Flags for ..." can be seen here and there. (4) Some wrong spellings: + /* we flush this buffer when swithing to PERMANENT */ swithing -> switching + * alredy flushed out by RelationCreate(Drop)InitFork called just alredy -> already + * relation content to be WAL-logged to recovery the table. recovery -> recover + * The inittmp fork works as the sentinel to identify that situaton. situaton -> situation (5) + table_close(classRel, NoLock); + + + + } These empty lines can be deleted. (6) +/* + * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL. + */ +void +log_smgrbufpersistence(const RelFileNode *rnode, bool persistence) ... + * Make an XLOG entry reporting the file unlink. Not unlink but buffer persistence? (7) + /* + * index-init fork needs further initialization. ambuildempty shoud do + * WAL-log and file sync by itself but otherwise we do that by myself. 
+	 */
+	if (rel->rd_rel->relkind == RELKIND_INDEX)
+		rel->rd_indam->ambuildempty(rel);
+	else
+	{
+		log_smgrcreate(&rnode, INIT_FORKNUM);
+		smgrimmedsync(srel, INIT_FORKNUM);
+	}
+
+	/*
+	 * We have created the init fork. If server crashes before the current
+	 * transaction ends the init fork left alone corrupts data while recovery.
+	 * The inittmp fork works as the sentinel to identify that situaton.
+	 */
+	smgrcreate(srel, INITTMP_FORKNUM, false);
+	log_smgrcreate(&rnode, INITTMP_FORKNUM);
+	smgrimmedsync(srel, INITTMP_FORKNUM);

If the server crashes between these two steps, only the init fork exists. Is it correct to create the inittmp fork first?


(8)
+	if (inxact_created)
+	{
+		SMgrRelation srel = smgropen(rnode, InvalidBackendId);
+		smgrclose(srel);
+		log_smgrunlink(&rnode, INIT_FORKNUM);
+		smgrunlink(srel, INIT_FORKNUM, false);
+		log_smgrunlink(&rnode, INITTMP_FORKNUM);
+		smgrunlink(srel, INITTMP_FORKNUM, false);
+		return;
+	}

smgrclose() should be called just before return.
Isn't it necessary here to revert the buffer persistence state change?


(9)
+void
+smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+	smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
+}

Maybe it's better to restore smgrdounlinkfork(), which was removed in an older release. That function includes dropping shared buffers, which can clean up the shared buffers that may be cached by this transaction.


(10)
[RelationDropInitFork]
+	/* revert buffer-persistence changes at abort */
+	pending = (PendingRelDelete *)
+		MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
+	pending->relnode = rnode;
+	pending->op = PDOP_SET_PERSISTENCE;
+	pending->bufpersistence = false;
+	pending->backend = InvalidBackendId;
+	pending->atCommit = true;
+	pending->nestLevel = GetCurrentTransactionNestLevel();
+	pending->next = pendingDeletes;
+	pendingDeletes = pending;
+}

bufpersistence should be true.


(11)
+	BlockNumber block = 0;
...
+ DropRelFileNodeBuffers(rbnode, &pending->unlink_forknum, 1, + &block); "block" is unnecessary and 0 can be passed directly. (12) - && pending->backend == InvalidBackendId) + && pending->backend == InvalidBackendId && + pending->op == PDOP_DELETE) nrels++; It's better to put && at the beginning of the line to follow the existing code here. (13) + table_close(rel, lockmode); lockmode should be NoLock to retain the lock until transaction completion. (14) + ctl.keysize = sizeof(unlogged_relation_entry); + ctl.entrysize = sizeof(unlogged_relation_entry); + hash = hash_create("unlogged hash", 32, &ctl, HASH_ELEM); ... + memset(key.oid, 0, sizeof(key.oid)); + memcpy(key.oid, de->d_name, oidchars); + ent = hash_search(hash, &key, HASH_FIND, NULL); keysize should be the oid member of the struct. Regards Takayuki Tsunakawa
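The ordering question in (7) can be checked with a toy model. The sketch below assumes that recovery treats "init fork present, inittmp fork absent" as a committed unlogged relation whose data should be reset; that reading, and all `Demo*`/function names, are hypothetical and only serve the illustration.

```c
#include <stdbool.h>

/* On-disk state of one relfilenode as seen by recovery (toy model). */
typedef struct DemoForks
{
	bool		init;		/* init fork exists */
	bool		inittmp;	/* inittmp (sentinel) fork exists */
} DemoForks;

/*
 * Simulate creating both forks and crashing after steps_done steps
 * (0, 1 or 2). If sentinel_first is true, inittmp is created before init.
 */
DemoForks
crash_after(int steps_done, bool sentinel_first)
{
	DemoForks	f = {false, false};

	for (int step = 0; step < steps_done && step < 2; step++)
	{
		bool		create_sentinel = sentinel_first ? (step == 0) : (step == 1);

		if (create_sentinel)
			f.inittmp = true;
		else
			f.init = true;
	}
	return f;
}

/*
 * Under the stated assumption, recovery would reset the relation's data
 * if it sees an init fork that is not accompanied by the sentinel.
 */
bool
looks_committed_unlogged(DemoForks f)
{
	return f.init && !f.inittmp;
}
```

With the init fork created first, a crash after one step leaves exactly the state that is misread as committed. With the sentinel created first, no crash point produces it.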
Thanks for the comment! Sorry for the late reply. At Fri, 4 Dec 2020 07:49:22 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in > From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> > > > No, not really. The issue is more around what happens if we crash > > > part way through. At crash recovery time, the system catalogs are not > > > available, because the database isn't consistent yet and, anyway, the > > > startup process can't be bound to a database, let alone every database > > > that might contain unlogged tables. So the sentinel that's used to > > > decide whether to flush the contents of a table or index is the > > > presence or absence of an _init fork, which the startup process > > > obviously can see just fine. The _init fork also tells us what to > > > stick in the relation when we reset it; for a table, we can just reset > > > to an empty file, but that's not legal for indexes, so the _init fork > > > contains a pre-initialized empty index that we can just copy over. > > > > > > Now, to make an unlogged table logged, you've got to at some stage > > > remove those _init forks. But this is not a transactional operation. > > > If you remove the _init forks and then the transaction rolls back, > > > you've left the system an inconsistent state. If you postpone the > > > removal until commit time, then you have a problem if it fails, > > > > It's true. That are the cause of headache. > ... > > The current implement is simple. It's enough to just discard old or > > new relfilenode according to the current transaction ends with commit > > or abort. Tweaking of relfilenode under use leads-in some skews in > > some places. I used pendingDelete mechanism a bit complexified way > > and a violated an abstraction (I think, calling AM-routines from > > storage.c is not good.) and even introduce a new fork kind only to > > mark a init fork as "not committed yet". There might be better way, > > but I haven't find it. 
>
> I have no alternative idea yet, too. I agree that we want to avoid them, especially introducing inittmp fork... Anyway, below are the rest of my review comments for 0001. I want to review 0002 when we have decided to go with 0001.
>
>
> (2)
> XLOG_SMGR_UNLINK seems to necessitate modification of the following comments:
>
> [src/include/catalog/storage_xlog.h]
> /*
>  * Declarations for smgr-related XLOG records
>  *
>  * Note: we log file creation and truncation here, but logging of deletion
>  * actions is handled by xact.c, because it is part of transaction commit.
>  */

Sure. Rewrote it.

> [src/backend/access/transam/README]
> 3. Deleting a table, which requires an unlink() that could fail.
>
> Our approach here is to WAL-log the operation first, but to treat failure
> of the actual unlink() call as a warning rather than error condition.
> Again, this can leave an orphan file behind, but that's cheap compared to
> the alternatives. Since we can't actually do the unlink() until after
> we've committed the DROP TABLE transaction, throwing an error would be out
> of the question anyway. (It may be worth noting that the WAL entry about
> the file deletion is actually part of the commit record for the dropping
> transaction.)

Mmm. I didn't touch the DROP TABLE (RelationDropStorage) path, but I added a brief description of the INITTMP fork to the file.

====
The INITTMP fork file
--------------------------------

An INITTMP fork is created when a new relation file is created, to mark
that the relfilenode needs to be cleaned up at recovery time. The file is
removed at transaction end but is left behind when the process crashes
before the transaction ends. In contrast to 4 above, failure to remove an
INITTMP file will lead to data loss, in which case the server will shut
down.
====

> (3)
> +/* This is bit-map, not ordianal numbers */
>
> There seems to be no comments using "bit-map". "Flags for ..." can be seen here and there.
I removed the comment and used (1 << n) notation to show that fact instead.

> (4)
> Some wrong spellings:
>
> swithing -> switching
> alredy -> already
> recovery -> recover
> situaton -> situation

Oops! Fixed them.

> (5)
> +	table_close(classRel, NoLock);
> +
> +
> +
> +
>  }
>
> These empty lines can be deleted.

s/can/should/ :p. Fixed.

>
> (6)
> +/*
> + * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL.
> + */
> +void
> +log_smgrbufpersistence(const RelFileNode *rnode, bool persistence)
> ...
> +	 * Make an XLOG entry reporting the file unlink.
>
> Not unlink but buffer persistence?

Silly copy-pasto. Fixed.

> (7)
> +	/*
> +	 * index-init fork needs further initialization. ambuildempty shoud do
> +	 * WAL-log and file sync by itself but otherwise we do that by myself.
> +	 */
> +	if (rel->rd_rel->relkind == RELKIND_INDEX)
> +		rel->rd_indam->ambuildempty(rel);
> +	else
> +	{
> +		log_smgrcreate(&rnode, INIT_FORKNUM);
> +		smgrimmedsync(srel, INIT_FORKNUM);
> +	}
> +
> +	/*
> +	 * We have created the init fork. If server crashes before the current
> +	 * transaction ends the init fork left alone corrupts data while recovery.
> +	 * The inittmp fork works as the sentinel to identify that situaton.
> +	 */
> +	smgrcreate(srel, INITTMP_FORKNUM, false);
> +	log_smgrcreate(&rnode, INITTMP_FORKNUM);
> +	smgrimmedsync(srel, INITTMP_FORKNUM);
>
> If the server crashes between these two steps, only the init fork exists. Is it correct to create the inittmp fork first?

Right. I changed it that way, and did the same with the new code added to RelationCreateStorage.

> (8)
> +	if (inxact_created)
> +	{
> +		SMgrRelation srel = smgropen(rnode, InvalidBackendId);
> +		smgrclose(srel);
> +		log_smgrunlink(&rnode, INIT_FORKNUM);
> +		smgrunlink(srel, INIT_FORKNUM, false);
> +		log_smgrunlink(&rnode, INITTMP_FORKNUM);
> +		smgrunlink(srel, INITTMP_FORKNUM, false);
> +		return;
> +	}
>
> smgrclose() should be called just before return.
> Isn't it necessary here to revert buffer persistence state change?

Mmm. It's a thinko. I was confused with the case of close/unlink. Fixed all instances of the same.

> (9)
> +void
> +smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo)
> +{
> +	smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo);
> +}
>
> Maybe it's better to restore smgrdounlinkfork() that was removed in the older release. That function includes dropping shared buffers, which can clean up the shared buffers that may be cached by this transaction.

INITFORK/INITTMP forks cannot be loaded into shared buffers, so there is no use in dropping buffers. I added a comment like that.

| /*
|  * INIT/INITTMP forks never be loaded to shared buffer so no point in
|  * dropping buffers for these files.
|  */
| log_smgrunlink(&rnode, INIT_FORKNUM);

I removed DropRelFileNodeBuffers from the PDOP_UNLINK_FORK branch in smgrDoPendingDeletes and added an assertion and a comment instead.

| /* other forks needs to drop buffers */
| Assert(pending->unlink_forknum == INIT_FORKNUM ||
|        pending->unlink_forknum == INITTMP_FORKNUM);
|
| log_smgrunlink(&pending->relnode, pending->unlink_forknum);
| smgrunlink(srel, pending->unlink_forknum, false);

> (10)
> [RelationDropInitFork]
> +	/* revert buffer-persistence changes at abort */
> +	pending = (PendingRelDelete *)
> +		MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete));
> +	pending->relnode = rnode;
> +	pending->op = PDOP_SET_PERSISTENCE;
> +	pending->bufpersistence = false;
> +	pending->backend = InvalidBackendId;
> +	pending->atCommit = true;
> +	pending->nestLevel = GetCurrentTransactionNestLevel();
> +	pending->next = pendingDeletes;
> +	pendingDeletes = pending;
> +}
>
> bufpersistence should be true.

RelationDropInitFork() changes the relation persistence to "persistent", so it should be reverted to "non-persistent (= false)" at abort. (I agree that the function name is somewhat confusing...)

> (11)
> +	BlockNumber block = 0;
> ...
> +	DropRelFileNodeBuffers(rbnode, &pending->unlink_forknum, 1,
> +						   &block);
>
> "block" is unnecessary and 0 can be passed directly.

I removed the entire function call. But I don't think you're right here.

| DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
|                        int nforks, BlockNumber *firstDelBlock)

Doesn't just passing 0 lead to SEGV?

> (12)
> -	&& pending->backend == InvalidBackendId)
> +	&& pending->backend == InvalidBackendId &&
> +	pending->op == PDOP_DELETE)
>  	nrels++;
>
> It's better to put && at the beginning of the line to follow the existing code here.

It's terrible.. Fixed.

> (13)
> +	table_close(rel, lockmode);
>
> lockmode should be NoLock to retain the lock until transaction completion.

I tried to recall the reason for that, but didn't come up with anything. Fixed.

> (14)
> +	ctl.keysize = sizeof(unlogged_relation_entry);
> +	ctl.entrysize = sizeof(unlogged_relation_entry);
> +	hash = hash_create("unlogged hash", 32, &ctl, HASH_ELEM);
> ...
> +	memset(key.oid, 0, sizeof(key.oid));
> +	memcpy(key.oid, de->d_name, oidchars);
> +	ent = hash_search(hash, &key, HASH_FIND, NULL);
>
> keysize should be the oid member of the struct.

It's not a problem in practice since the first member is the oid. It seems I had thought to do something more there, but I no longer recall what it was, and in the first place the key should be just Oid in the context above. Fixed.

The patch is attached to the next message.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
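
For readers following items (3) and (9), the (1 << n) operation mask and the way a pending entry is dispatched at transaction end can be sketched roughly as below. This is only a standalone illustration under assumptions: the PDOP_* values are the ones from the patch, but PendingOp, Actions and decide_actions() are hypothetical names made up for this sketch and do not exist in PostgreSQL or in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Operation mask values as defined in the patch; 0 means "delete the relation". */
#define PDOP_DELETE          (0)
#define PDOP_UNLINK_FORK     (1 << 0)
#define PDOP_SET_PERSISTENCE (1 << 1)

/* Hypothetical, trimmed-down stand-in for PendingRelDelete. */
typedef struct PendingOp
{
	int		op;			/* operation mask */
	bool	atCommit;	/* true = act at commit, false = act at abort */
} PendingOp;

/* Hypothetical result type: which actions an entry triggers. */
typedef struct Actions
{
	bool	delete_relation;
	bool	unlink_fork;
	bool	set_persistence;
} Actions;

/*
 * Decide what to do with one pending entry at transaction end.
 * An entry is acted on only when its atCommit flag matches the outcome.
 * PDOP_DELETE is the value 0, so it has to be tested with ==, not with &.
 */
static Actions
decide_actions(const PendingOp *p, bool isCommit)
{
	Actions		a = {false, false, false};

	assert(p != NULL);

	if (p->atCommit != isCommit)
		return a;				/* entry belongs to the other outcome */

	if (p->op == PDOP_DELETE)
	{
		a.delete_relation = true;
		return a;
	}

	/* The remaining bits are independent and may be combined. */
	if (p->op & PDOP_UNLINK_FORK)
		a.unlink_fork = true;
	if (p->op & PDOP_SET_PERSISTENCE)
		a.set_persistence = true;

	return a;
}
```

An entry registered with PDOP_UNLINK_FORK | PDOP_SET_PERSISTENCE and atCommit = false, as RelationCreateInitFork does for the init fork, would trigger both the unlink and the persistence revert at abort and nothing at commit.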

Hello.

At Thu, 24 Dec 2020 17:02:20 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> The patch is attached to the next message.

The reason for separating this message is that I modified the patch so that it could solve another issue.

There's a complaint about orphan files after a crash. [1]

1: https://www.postgresql.org/message-id/16771-cbef7d97ba93f4b9@postgresql.org

That is, the case where a relation file is left alone after a server crash that happened before the end of the transaction that created the relation. As I read this, I noticed this feature can solve the issue with a small change.

This version gets changes in RelationCreateStorage and smgrDoPendingDeletes.

Previously the inittmp fork was created only along with an init fork. This version always creates one when a relation storage file is created. As a result, ResetUnloggedRelationsInDbspaceDir removes all forks if the inittmp fork of a logged relation is found. Now that pendingDeletes can contain multiple entries for the same relation, it has been modified not to close the same smgr multiple times.

- It might be better to split 0001 into two pieces.

- The function name ResetUnloggedRelationsInDbspaceDir no longer represents the function correctly.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

From dbe9ef477df8570b0b0def2b5f089b0001aa2eab Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v2 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data rewriting, currently it runs heap rewrite which causes large amount of file I/O. This patch makes the command run without heap rewrite. Addition to that, SET LOGGED while wal_level > minimal emits WAL using XLOG_FPI instead of massive number of HEAP_INSERT's, which should be smaller.

Also this allows for the cleanup of files left behind in the crash of the transaction that created it.
--- src/backend/access/rmgrdesc/smgrdesc.c | 23 ++ src/backend/access/transam/README | 10 + src/backend/catalog/storage.c | 394 +++++++++++++++++++++++-- src/backend/commands/tablecmds.c | 213 ++++++++++--- src/backend/storage/buffer/bufmgr.c | 88 ++++++ src/backend/storage/file/reinit.c | 164 +++++----- src/backend/storage/smgr/md.c | 4 +- src/backend/storage/smgr/smgr.c | 6 + src/common/relpath.c | 3 +- src/include/catalog/storage.h | 2 + src/include/catalog/storage_xlog.h | 22 +- src/include/common/relpath.h | 5 +- src/include/storage/bufmgr.h | 2 + src/include/storage/smgr.h | 1 + 14 files changed, 800 insertions(+), 137 deletions(-) diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c index a7c0cb1bc3..097dacfee6 100644 --- a/src/backend/access/rmgrdesc/smgrdesc.c +++ b/src/backend/access/rmgrdesc/smgrdesc.c @@ -40,6 +40,23 @@ smgr_desc(StringInfo buf, XLogReaderState *record) xlrec->blkno, xlrec->flags); pfree(path); } + else if (info == XLOG_SMGR_UNLINK) + { + xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec; + char *path = relpathperm(xlrec->rnode, xlrec->forkNum); + + appendStringInfoString(buf, path); + pfree(path); + } + else if (info == XLOG_SMGR_BUFPERSISTENCE) + { + xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec; + char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM); + + appendStringInfoString(buf, path); + appendStringInfo(buf, " persistence %d", xlrec->persistence); + pfree(path); + } } const char * @@ -55,6 +72,12 @@ smgr_identify(uint8 info) case XLOG_SMGR_TRUNCATE: id = "TRUNCATE"; break; + case XLOG_SMGR_UNLINK: + id = "UNLINK"; + break; + case XLOG_SMGR_BUFPERSISTENCE: + id = "BUFPERSISTENCE"; + break; } return id; diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README index 1edc8180c1..51616b2458 100644 --- a/src/backend/access/transam/README +++ b/src/backend/access/transam/README @@ -724,6 +724,16 @@ we must panic and abort recovery. 
The DBA will have to manually clean up then restart recovery. This is part of the reason for not writing a WAL entry until we've successfully done the original action. +The INITTMP fork file +-------------------------------- + +An INITTMP fork is created when new relation file is created to mark +the relfilenode needs to be cleaned up at recovery time. The file is +removed at transaction end but is left when the process crashes before +the transaction ends. In contrast to 4 above, failure to remove an +INITTMP file will lead to data loss, in which case the server will +shut down. + Skipping WAL for New RelFileNode -------------------------------- diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c index d538f25726..f4dddbad55 100644 --- a/src/backend/catalog/storage.c +++ b/src/backend/catalog/storage.c @@ -19,6 +19,7 @@ #include "postgres.h" +#include "access/amapi.h" #include "access/parallel.h" #include "access/visibilitymap.h" #include "access/xact.h" @@ -57,9 +58,16 @@ int wal_skip_threshold = 2048; /* in kilobytes */ * but I'm being paranoid. */ +#define PDOP_DELETE (0) +#define PDOP_UNLINK_FORK (1 << 0) +#define PDOP_SET_PERSISTENCE (1 << 1) + typedef struct PendingRelDelete { RelFileNode relnode; /* relation that may need to be deleted */ + int op; /* operation mask */ + bool bufpersistence; /* buffer persistence to set */ + int unlink_forknum; /* forknum to unlink */ BackendId backend; /* InvalidBackendId if not a temp rel */ bool atCommit; /* T=delete at commit; F=delete at abort */ int nestLevel; /* xact nesting level of request */ @@ -143,7 +151,17 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) return NULL; /* placate compiler */ } + /* + * We are going to create a new storage file. If server crashes before the + * current transaction ends the file needs to be cleaned up but there's no + * clue to the orphan files. The inittmp fork works as the sentinel to + * identify that situation. 
+ */ srel = smgropen(rnode, backend); + smgrcreate(srel, INITTMP_FORKNUM, false); + log_smgrcreate(&rnode, INITTMP_FORKNUM); + smgrimmedsync(srel, INITTMP_FORKNUM); + smgrcreate(srel, MAIN_FORKNUM, false); if (needs_wal) @@ -153,12 +171,37 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) pending = (PendingRelDelete *) MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); pending->relnode = rnode; + pending->op = PDOP_DELETE; pending->backend = backend; pending->atCommit = false; /* delete if abort */ pending->nestLevel = GetCurrentTransactionNestLevel(); pending->next = pendingDeletes; pendingDeletes = pending; + /* drop inittmp fork at abort */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_FORK; + pending->unlink_forknum = INITTMP_FORKNUM; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + + /* drop inittmp fork at commit */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_FORK; + pending->unlink_forknum = INITTMP_FORKNUM; + pending->backend = InvalidBackendId; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded()) { Assert(backend == InvalidBackendId); @@ -168,6 +211,215 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) return srel; } +/* + * RelationCreateInitFork + * Create physical storage for the init fork of a relation. + * + * Create the init fork for the relation. + * + * This function is transactional. The creation is WAL-logged, and if the + * transaction aborts later on, the init fork will be removed. 
+ */ +void +RelationCreateInitFork(Relation rel) +{ + RelFileNode rnode = rel->rd_node; + PendingRelDelete *pending; + SMgrRelation srel; + PendingRelDelete *prev; + PendingRelDelete *next; + bool create = true; + + /* switch buffer persistence */ + SetRelationBuffersPersistence(rel->rd_smgr, false, false); + + /* + * If we have entries for init-fork operation of this relation, that means + * that we have already registered pending sync entries to drop preexisting + * init fork since before the current transaction started. This function + * reverts that change just by removing the entries. + */ + prev = NULL; + for (pending = pendingDeletes; pending != NULL; pending = next) + { + next = pending->next; + if (RelFileNodeEquals(rnode, pending->relnode) && + pending->op != PDOP_DELETE) + { + if (prev) + prev->next = next; + else + pendingDeletes = next; + pfree(pending); + + create = false; + } + else + { + /* unrelated entry, don't touch it */ + prev = pending; + } + } + + if (!create) + return; + + /* + * We are going to create the init fork. If server crashes before the + * current transaction ends the init fork left alone corrupts data while + * recovery. The inittmp fork works as the sentinel to identify that + * situation. + */ + srel = smgropen(rnode, InvalidBackendId); + smgrcreate(srel, INITTMP_FORKNUM, false); + log_smgrcreate(&rnode, INITTMP_FORKNUM); + smgrimmedsync(srel, INITTMP_FORKNUM); + + /* We don't have existing init fork, create it. */ + smgrcreate(srel, INIT_FORKNUM, false); + + /* + * index-init fork needs further initialization. ambuildempty shoud do + * WAL-log and file sync by itself but otherwise we do that by myself. 
+ */ + if (rel->rd_rel->relkind == RELKIND_INDEX) + rel->rd_indam->ambuildempty(rel); + else + { + log_smgrcreate(&rnode, INIT_FORKNUM); + smgrimmedsync(srel, INIT_FORKNUM); + } + + /* drop this init fork file at abort and revert persistence */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_FORK | PDOP_SET_PERSISTENCE; + pending->unlink_forknum = INIT_FORKNUM; + pending->bufpersistence = true; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + + /* drop inittmp fork at abort */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_FORK; + pending->unlink_forknum = INITTMP_FORKNUM; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + + /* drop inittmp fork at commit */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_FORK; + pending->unlink_forknum = INITTMP_FORKNUM; + pending->backend = InvalidBackendId; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; +} + +/* + * RelationDropInitFork + * Delete physical storage for the init fork of a relation. + * + * Register pending-delete of the init fork. The real deletion is performed by + * smgrDoPendingDeletes at commit. + * + * This function is transactional. If the transaction aborts later on, the + * deletion doesn't happen. 
+ */ +void +RelationDropInitFork(Relation rel) +{ + RelFileNode rnode = rel->rd_node; + PendingRelDelete *pending; + PendingRelDelete *prev; + PendingRelDelete *next; + bool inxact_created = false; + + /* switch buffer persistence */ + SetRelationBuffersPersistence(rel->rd_smgr, true, false); + + /* + * If we have entries for init-fork operation of this relation, that means + * that we have created the init fork in the current transaction. We + * immediately remove the init and inittmp forks immediately in that case. + * Otherwise just reister pending-delete for the existing init fork. + */ + prev = NULL; + for (pending = pendingDeletes; pending != NULL; pending = next) + { + next = pending->next; + if (RelFileNodeEquals(rnode, pending->relnode) && + pending->op != PDOP_DELETE) + { + /* unlink list entry */ + if (prev) + prev->next = next; + else + pendingDeletes = next; + pfree(pending); + + inxact_created = true; + } + else + { + /* unrelated entry, don't touch it */ + prev = pending; + } + } + + if (inxact_created) + { + SMgrRelation srel = smgropen(rnode, InvalidBackendId); + + /* + * INIT/INITTMP forks never be loaded to shared buffer so no point in + * dropping buffers for these files. 
+ */ + log_smgrunlink(&rnode, INIT_FORKNUM); + smgrunlink(srel, INIT_FORKNUM, false); + log_smgrunlink(&rnode, INITTMP_FORKNUM); + smgrunlink(srel, INITTMP_FORKNUM, false); + smgrclose(srel); + return; + } + + /* register drop of this init fork file at commit */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_FORK; + pending->unlink_forknum = INIT_FORKNUM; + pending->backend = InvalidBackendId; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + + /* revert buffer-persistence changes at abort */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_SET_PERSISTENCE; + pending->bufpersistence = false; + pending->backend = InvalidBackendId; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; +} + /* * Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL. */ @@ -187,6 +439,44 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum) XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE); } +/* + * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL. + */ +void +log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum) +{ + xl_smgr_unlink xlrec; + + /* + * Make an XLOG entry reporting the file unlink. + */ + xlrec.rnode = *rnode; + xlrec.forkNum = forkNum; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE); +} + +/* + * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL. + */ +void +log_smgrbufpersistence(const RelFileNode *rnode, bool persistence) +{ + xl_smgr_bufpersistence xlrec; + + /* + * Make an XLOG entry reporting the change of buffer persistence. 
+ */ + xlrec.rnode = *rnode; + xlrec.persistence = persistence; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE); +} + /* * RelationDropStorage * Schedule unlinking of physical storage at transaction commit. @@ -200,6 +490,7 @@ RelationDropStorage(Relation rel) pending = (PendingRelDelete *) MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); pending->relnode = rel->rd_node; + pending->op = PDOP_DELETE; pending->backend = rel->rd_backend; pending->atCommit = true; /* delete if commit */ pending->nestLevel = GetCurrentTransactionNestLevel(); @@ -606,43 +897,70 @@ smgrDoPendingDeletes(bool isCommit) prev = NULL; for (pending = pendingDeletes; pending != NULL; pending = next) { + SMgrRelation srel; + next = pending->next; if (pending->nestLevel < nestLevel) { /* outer-level entries should not be processed yet */ prev = pending; + continue; } + + /* unlink list entry first, so we don't retry on failure */ + if (prev) + prev->next = next; else + pendingDeletes = next; + + if (pending->atCommit != isCommit) { - /* unlink list entry first, so we don't retry on failure */ - if (prev) - prev->next = next; - else - pendingDeletes = next; - /* do deletion if called for */ - if (pending->atCommit == isCommit) - { - SMgrRelation srel; - - srel = smgropen(pending->relnode, pending->backend); - - /* allocate the initial array, or extend it, if needed */ - if (maxrels == 0) - { - maxrels = 8; - srels = palloc(sizeof(SMgrRelation) * maxrels); - } - else if (maxrels <= nrels) - { - maxrels *= 2; - srels = repalloc(srels, sizeof(SMgrRelation) * maxrels); - } - - srels[nrels++] = srel; - } /* must explicitly free the list entry */ pfree(pending); /* prev does not change */ + continue; + } + + srel = smgropen(pending->relnode, pending->backend); + + if (pending->op != PDOP_DELETE) + { + if (pending->op & PDOP_UNLINK_FORK) + { + BlockNumber block = 0; + RelFileNodeBackend 
rbnode; + + rbnode.node = pending->relnode; + rbnode.backend = InvalidBackendId; + + /* other forks needs to drop buffers */ + Assert(pending->unlink_forknum == INIT_FORKNUM || + pending->unlink_forknum == INITTMP_FORKNUM); + + log_smgrunlink(&pending->relnode, pending->unlink_forknum); + smgrunlink(srel, pending->unlink_forknum, false); + smgrclose(srel); + } + + if (pending->op & PDOP_SET_PERSISTENCE) + SetRelationBuffersPersistence(srel, pending->bufpersistence, + false); + } + else + { + /* allocate the initial array, or extend it, if needed */ + if (maxrels == 0) + { + maxrels = 8; + srels = palloc(sizeof(SMgrRelation) * maxrels); + } + else if (maxrels <= nrels) + { + maxrels *= 2; + srels = repalloc(srels, sizeof(SMgrRelation) * maxrels); + } + + srels[nrels++] = srel; } } @@ -824,7 +1142,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr) for (pending = pendingDeletes; pending != NULL; pending = pending->next) { if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit - && pending->backend == InvalidBackendId) + && pending->backend == InvalidBackendId + && pending->op == PDOP_DELETE) nrels++; } if (nrels == 0) @@ -837,7 +1156,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr) for (pending = pendingDeletes; pending != NULL; pending = pending->next) { if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit - && pending->backend == InvalidBackendId) + && pending->backend == InvalidBackendId && + pending->op == PDOP_DELETE) { *rptr = pending->relnode; rptr++; @@ -917,6 +1237,15 @@ smgr_redo(XLogReaderState *record) reln = smgropen(xlrec->rnode, InvalidBackendId); smgrcreate(reln, xlrec->forkNum, true); } + else if (info == XLOG_SMGR_UNLINK) + { + xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record); + SMgrRelation reln; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + smgrunlink(reln, xlrec->forkNum, true); + smgrclose(reln); + } else if (info == XLOG_SMGR_TRUNCATE) { xl_smgr_truncate *xlrec = 
(xl_smgr_truncate *) XLogRecGetData(record); @@ -1005,6 +1334,15 @@ smgr_redo(XLogReaderState *record) FreeFakeRelcacheEntry(rel); } + else if (info == XLOG_SMGR_BUFPERSISTENCE) + { + xl_smgr_bufpersistence *xlrec = + (xl_smgr_bufpersistence *) XLogRecGetData(record); + SMgrRelation reln; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + SetRelationBuffersPersistence(reln, xlrec->persistence, true); + } else elog(PANIC, "smgr_redo: unknown op code %u", info); } diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 1fa9f19f08..45be633d9f 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -4917,6 +4917,138 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel, return newcmd; } +/* + * RelationChangePersistence: do in-place persistence change of a relation + */ +static void +RelationChangePersistence(AlteredTableInfo *tab, char persistence, + LOCKMODE lockmode) +{ + Relation rel; + Relation classRel; + HeapTuple tuple, + newtuple; + Datum new_val[Natts_pg_class]; + bool new_null[Natts_pg_class], + new_repl[Natts_pg_class]; + int i; + List *relids; + ListCell *lc_oid; + + Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE); + Assert(lockmode == AccessExclusiveLock); + + /* + * Under the following condition, we need to call ATRewriteTable, which + * cannot be false in the AT_REWRITE_ALTER_PERSISTENCE case. + */ + Assert(tab->constraints == NULL && tab->partition_constraint == NULL && + tab->newvals == NULL && !tab->verify_new_notnull); + + rel = table_open(tab->relid, lockmode); + + Assert(rel->rd_rel->relpersistence != persistence); + + elog(DEBUG1, "perform im-place persistnce change"); + + RelationOpenSmgr(rel); + + /* + * First we collect all relations that we need to change persistence. 
+ */ + + /* Collect OIDs of indexes and toast relations */ + relids = RelationGetIndexList(rel); + relids = lcons_oid(rel->rd_id, relids); + + /* Add toast relation if any */ + if (OidIsValid(rel->rd_rel->reltoastrelid)) + { + List *toastidx; + Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode); + + RelationOpenSmgr(toastrel); + relids = lappend_oid(relids, rel->rd_rel->reltoastrelid); + toastidx = RelationGetIndexList(toastrel); + relids = list_concat(relids, toastidx); + pfree(toastidx); + table_close(toastrel, NoLock); + } + + table_close(rel, NoLock); + + /* Make changes in storage */ + classRel = table_open(RelationRelationId, RowExclusiveLock); + + foreach (lc_oid, relids) + { + Oid reloid = lfirst_oid(lc_oid); + Relation r = relation_open(reloid, lockmode); + + RelationOpenSmgr(r); + + /* Create or drop init fork */ + if (persistence == RELPERSISTENCE_UNLOGGED) + RelationCreateInitFork(r); + else + RelationDropInitFork(r); + + /* + * When this relation gets WAL-logged, immediately sync all files but + * initfork to establish the initial state on storage. Buffers have + * already flushed out by RelationCreate(Drop)InitFork called just + * above. Initfork should have been synced as needed. 
+ */ + if (persistence == RELPERSISTENCE_PERMANENT) + { + for (i = 0 ; i < INIT_FORKNUM ; i++) + { + if (smgrexists(r->rd_smgr, i)) + smgrimmedsync(r->rd_smgr, i); + } + } + + /* Update catalog */ + tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid)); + if (!HeapTupleIsValid(tuple)) + elog(ERROR, "cache lookup failed for relation %u", reloid); + + memset(new_val, 0, sizeof(new_val)); + memset(new_null, false, sizeof(new_null)); + memset(new_repl, false, sizeof(new_repl)); + + new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence); + new_null[Anum_pg_class_relpersistence - 1] = false; + new_repl[Anum_pg_class_relpersistence - 1] = true; + + newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel), + new_val, new_null, new_repl); + + CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple); + heap_freetuple(newtuple); + + /* + * While wal_level >= replica, switching to LOGGED requires the + * relation content to be WAL-logged to recover the table. + */ + if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded()) + { + ForkNumber fork; + + for (fork = 0; fork < INIT_FORKNUM ; fork++) + { + if (smgrexists(r->rd_smgr, fork)) + log_newpage_range(r, fork, + 0, smgrnblocks(r->rd_smgr, fork), false); + } + } + + table_close(r, NoLock); + } + + table_close(classRel, NoLock); +} + /* * ATRewriteTables: ALTER TABLE phase 3 */ @@ -5037,45 +5169,52 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode, tab->relid, tab->rewrite); - /* - * Create transient table that will receive the modified data. - * - * Ensure it is marked correctly as logged or unlogged. We have - * to do this here so that buffers for the new relfilenode will - * have the right persistence set, and at the same time ensure - * that the original filenode's buffers will get read in with the - * correct setting (i.e. the original one). 
Otherwise a rollback - * after the rewrite would possibly result with buffers for the - * original filenode having the wrong persistence setting. - * - * NB: This relies on swap_relation_files() also swapping the - * persistence. That wouldn't work for pg_class, but that can't be - * unlogged anyway. - */ - OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, persistence, - lockmode); + if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE) + RelationChangePersistence(tab, persistence, lockmode); + else + { + /* + * Create transient table that will receive the modified data. + * + * Ensure it is marked correctly as logged or unlogged. We + * have to do this here so that buffers for the new relfilenode + * will have the right persistence set, and at the same time + * ensure that the original filenode's buffers will get read in + * with the correct setting (i.e. the original one). Otherwise + * a rollback after the rewrite would possibly result with + * buffers for the original filenode having the wrong + * persistence setting. + * + * NB: This relies on swap_relation_files() also swapping the + * persistence. That wouldn't work for pg_class, but that can't + * be unlogged anyway. + */ + OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, persistence, + lockmode); - /* - * Copy the heap data into the new table with the desired - * modifications, and test the current data within the table - * against new constraints generated by ALTER TABLE commands. - */ - ATRewriteTable(tab, OIDNewHeap, lockmode); + /* + * Copy the heap data into the new table with the desired + * modifications, and test the current data within the table + * against new constraints generated by ALTER TABLE commands. + */ + ATRewriteTable(tab, OIDNewHeap, lockmode); - /* - * Swap the physical files of the old and new heaps, then rebuild - * indexes and discard the old heap. 
We can use RecentXmin for - * the table's new relfrozenxid because we rewrote all the tuples - * in ATRewriteTable, so no older Xid remains in the table. Also, - * we never try to swap toast tables by content, since we have no - * interest in letting this code work on system catalogs. - */ - finish_heap_swap(tab->relid, OIDNewHeap, - false, false, true, - !OidIsValid(tab->newTableSpace), - RecentXmin, - ReadNextMultiXactId(), - persistence); + /* + * Swap the physical files of the old and new heaps, then + * rebuild indexes and discard the old heap. We can use + * RecentXmin for the table's new relfrozenxid because we + * rewrote all the tuples in ATRewriteTable, so no older Xid + * remains in the table. Also, we never try to swap toast + * tables by content, since we have no interest in letting this + * code work on system catalogs. + */ + finish_heap_swap(tab->relid, OIDNewHeap, + false, false, true, + !OidIsValid(tab->newTableSpace), + RecentXmin, + ReadNextMultiXactId(), + persistence); + } } else { diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index c5e8707151..6ff46fb86d 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -37,6 +37,7 @@ #include "access/xlog.h" #include "catalog/catalog.h" #include "catalog/storage.h" +#include "catalog/storage_xlog.h" #include "executor/instrument.h" #include "lib/binaryheap.h" #include "miscadmin.h" @@ -3032,6 +3033,93 @@ DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum, } } +/* --------------------------------------------------------------------- + * SetRelFileNodeBuffersPersistence + * + * This function changes the persistence of all buffer pages of a relation + * then writes all dirty pages of the relation out to disk when switching + * to PERMANENT. (or more accurately, out to kernel disk buffers), + * ensuring that the kernel has an up-to-date view of the relation. 
+ * + * Generally, the caller should be holding AccessExclusiveLock on the + * target relation to ensure that no other backend is busy dirtying + * more blocks of the relation; the effects can't be expected to last + * after the lock is released. + * + * XXX currently it sequentially searches the buffer pool, should be + * changed to more clever ways of searching. This routine is not + * used in any performance-critical code paths, so it's not worth + * adding additional overhead to normal paths to make it go faster; + * but see also DropRelFileNodeBuffers. + * -------------------------------------------------------------------- + */ +void +SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo) +{ + int i; + RelFileNodeBackend rnode = srel->smgr_rnode; + + Assert (!RelFileNodeBackendIsTemp(rnode)); + + if (!isRedo) + log_smgrbufpersistence(&srel->smgr_rnode.node, permanent); + + ResourceOwnerEnlargeBuffers(CurrentResourceOwner); + + for (i = 0; i < NBuffers; i++) + { + BufferDesc *bufHdr = GetBufferDescriptor(i); + uint32 buf_state; + + if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node)) + continue; + + ReservePrivateRefCountEntry(); + + buf_state = LockBufHdr(bufHdr); + + if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node)) + { + UnlockBufHdr(bufHdr, buf_state); + continue; + } + + if (permanent) + { + /* Init fork is being dropped, drop buffers for it. 
*/ + if (bufHdr->tag.forkNum == INIT_FORKNUM) + { + InvalidateBuffer(bufHdr); + continue; + } + + buf_state |= BM_PERMANENT; + pg_atomic_write_u32(&bufHdr->state, buf_state); + + /* we flush this buffer when switching to PERMANENT */ + if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY)) + { + PinBuffer_Locked(bufHdr); + LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), + LW_SHARED); + FlushBuffer(bufHdr, srel); + LWLockRelease(BufferDescriptorGetContentLock(bufHdr)); + UnpinBuffer(bufHdr, true); + } + else + UnlockBufHdr(bufHdr, buf_state); + } + else + { + /* init fork is always BM_PERMANENT. See BufferAlloc */ + if (bufHdr->tag.forkNum != INIT_FORKNUM) + buf_state &= ~BM_PERMANENT; + + UnlockBufHdr(bufHdr, buf_state); + } + } +} + /* --------------------------------------------------------------------- * DropRelFileNodesAllBuffers * diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c index 8700f7f19a..80a1e61408 100644 --- a/src/backend/storage/file/reinit.c +++ b/src/backend/storage/file/reinit.c @@ -31,7 +31,8 @@ static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, typedef struct { Oid reloid; /* hash key */ -} unlogged_relation_entry; + bool dirty; /* to be removed */ +} relfile_entry; /* * Reset unlogged relations from before the last restart. @@ -151,6 +152,8 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) DIR *dbspace_dir; struct dirent *de; char rm_path[MAXPGPATH * 2]; + HTAB *hash; + HASHCTL ctl; /* Caller must specify at least one operation. */ Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0); @@ -160,88 +163,86 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) * the files with init forks. Then, we go through again and nuke * everything with the same OID except the init fork. 
*/ + + /* + * It's possible that someone could create a ton of unlogged relations + * in the same database & tablespace, so we'd better use a hash table + * rather than an array or linked list to keep track of which files + * need to be reset. Otherwise, this cleanup operation would be + * O(n^2). + */ + memset(&ctl, 0, sizeof(ctl)); + ctl.keysize = sizeof(Oid); + ctl.entrysize = sizeof(relfile_entry); + hash = hash_create("inittmp hash", 32, &ctl, HASH_ELEM | HASH_BLOBS); + + /* Collect inttmp forks in the directory. */ + dbspace_dir = AllocateDir(dbspacedirname); + while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) + { + int oidchars; + ForkNumber forkNum; + + /* Skip anything that doesn't look like a relation data file. */ + if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, + &forkNum)) + continue; + + /* Record init and inittmp forks */ + if (forkNum == INIT_FORKNUM || forkNum == INITTMP_FORKNUM) + { + Oid key; + relfile_entry *ent; + bool found; + + /* + * Record the relfilenode. If it has INITTMP fork, the all files + * needs to be cleaned up. Otherwise the relfilenode is cleaned up + * according to the unloggedness. + */ + key = atooid(de->d_name); + ent = hash_search(hash, &key, HASH_ENTER, &found); + + if (!found) + ent->dirty = false; + + if (forkNum == INITTMP_FORKNUM) + ent->dirty = true; + } + } + + /* Done with the first pass. */ + FreeDir(dbspace_dir); + if ((op & UNLOGGED_RELATION_CLEANUP) != 0) { - HTAB *hash; - HASHCTL ctl; - - /* - * It's possible that someone could create a ton of unlogged relations - * in the same database & tablespace, so we'd better use a hash table - * rather than an array or linked list to keep track of which files - * need to be reset. Otherwise, this cleanup operation would be - * O(n^2). 
- */ - ctl.keysize = sizeof(Oid); - ctl.entrysize = sizeof(unlogged_relation_entry); - ctl.hcxt = CurrentMemoryContext; - hash = hash_create("unlogged relation OIDs", 32, &ctl, - HASH_ELEM | HASH_BLOBS | HASH_CONTEXT); - - /* Scan the directory. */ - dbspace_dir = AllocateDir(dbspacedirname); - while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) - { - ForkNumber forkNum; - int oidchars; - unlogged_relation_entry ent; - - /* Skip anything that doesn't look like a relation data file. */ - if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) - continue; - - /* Also skip it unless this is the init fork. */ - if (forkNum != INIT_FORKNUM) - continue; - - /* - * Put the OID portion of the name into the hash table, if it - * isn't already. - */ - ent.reloid = atooid(de->d_name); - (void) hash_search(hash, &ent, HASH_ENTER, NULL); - } - - /* Done with the first pass. */ - FreeDir(dbspace_dir); - - /* - * If we didn't find any init forks, there's no point in continuing; - * we can bail out now. - */ - if (hash_get_num_entries(hash) == 0) - { - hash_destroy(hash); - return; - } - /* * Now, make a second pass and remove anything that matches. */ dbspace_dir = AllocateDir(dbspacedirname); while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) { - ForkNumber forkNum; - int oidchars; - unlogged_relation_entry ent; + ForkNumber forkNum; + int oidchars; + Oid key; + relfile_entry *ent; /* Skip anything that doesn't look like a relation data file. */ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, &forkNum)) continue; - /* We never remove the init fork. */ - if (forkNum == INIT_FORKNUM) - continue; - /* * See whether the OID portion of the name shows up in the hash * table. If so, nuke it! 
*/ - ent.reloid = atooid(de->d_name); - if (hash_search(hash, &ent, HASH_FIND, NULL)) + key = atooid(de->d_name); + ent = hash_search(hash, &key, HASH_FIND, NULL); + + /* we don't remove clean init file */ + if (ent && (ent->dirty || forkNum != INIT_FORKNUM)) { + /* so, nuke it! */ snprintf(rm_path, sizeof(rm_path), "%s/%s", dbspacedirname, de->d_name); if (unlink(rm_path) < 0) @@ -250,13 +251,12 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) errmsg("could not remove file \"%s\": %m", rm_path))); else - elog(DEBUG2, "unlinked file \"%s\"", rm_path); + elog(LOG, "unlinked file \"%s\"", rm_path); } } /* Cleanup is complete. */ FreeDir(dbspace_dir); - hash_destroy(hash); } /* @@ -277,12 +277,42 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) char oidbuf[OIDCHARS + 1]; char srcpath[MAXPGPATH * 2]; char dstpath[MAXPGPATH]; + Oid key; + relfile_entry *ent; /* Skip anything that doesn't look like a relation data file. */ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, &forkNum)) continue; + /* + * See whether the OID portion of the name shows up in the hash + * table. + */ + key = atooid(de->d_name); + ent = hash_search(hash, &key, HASH_FIND, NULL); + + /* we don't remove clean init file */ + if (ent && (ent->dirty || forkNum != INIT_FORKNUM)) + { + /* + * The file is dirty. It shoudl have been removed once at + * cleanup time but recovery can create them again. Remove + * them. + */ + snprintf(rm_path, sizeof(rm_path), "%s/%s", + dbspacedirname, de->d_name); + if (unlink(rm_path) < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not remove file \"%s\": %m", + rm_path))); + else + elog(LOG, "unlinked file \"%s\"", rm_path); + + continue; + } + /* Also skip it unless this is the init fork. 
*/ if (forkNum != INIT_FORKNUM) continue; @@ -351,6 +381,8 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) */ fsync_fname(dbspacedirname, true); } + + hash_destroy(hash); } /* diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index 9889ad6ad8..32dad72ed3 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -338,8 +338,10 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo) if (ret == 0 || errno != ENOENT) { ret = unlink(path); + + /* failure of removing inittmp fork leads to a data loss. */ if (ret < 0 && errno != ENOENT) - ereport(WARNING, + ereport((forkNum != INITTMP_FORKNUM ? WARNING : ERROR), (errcode_for_file_access(), errmsg("could not remove file \"%s\": %m", path))); } diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c index 072bdd118f..2a1d87dc33 100644 --- a/src/backend/storage/smgr/smgr.c +++ b/src/backend/storage/smgr/smgr.c @@ -644,6 +644,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum) smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum); } +void +smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo); +} + /* * AtEOXact_SMgr * diff --git a/src/common/relpath.c b/src/common/relpath.c index ad733d1363..2a5e5fa990 100644 --- a/src/common/relpath.c +++ b/src/common/relpath.c @@ -34,7 +34,8 @@ const char *const forkNames[] = { "main", /* MAIN_FORKNUM */ "fsm", /* FSM_FORKNUM */ "vm", /* VISIBILITYMAP_FORKNUM */ - "init" /* INIT_FORKNUM */ + "init", /* INIT_FORKNUM */ + "itmp" /* INITTMP_FORKNUM */ }; StaticAssertDecl(lengthof(forkNames) == (MAX_FORKNUM + 1), diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h index 30c38e0ca6..c2259cd7e3 100644 --- a/src/include/catalog/storage.h +++ b/src/include/catalog/storage.h @@ -23,6 +23,8 @@ extern int wal_skip_threshold; extern SMgrRelation 
RelationCreateStorage(RelFileNode rnode, char relpersistence); +extern void RelationCreateInitFork(Relation rel); +extern void RelationDropInitFork(Relation rel); extern void RelationDropStorage(Relation rel); extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit); extern void RelationPreTruncate(Relation rel); diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h index 7b21cab2e0..dcf1e605c0 100644 --- a/src/include/catalog/storage_xlog.h +++ b/src/include/catalog/storage_xlog.h @@ -22,13 +22,17 @@ /* * Declarations for smgr-related XLOG records * - * Note: we log file creation and truncation here, but logging of deletion - * actions is handled by xact.c, because it is part of transaction commit. + * Note: we log file creation, truncation, deletion and persistence change + * here. logging of deletion actions is mainly handled by xact.c, because it is + * part of transaction commit, but we log deletions happens outside of a + * transaction. */ /* XLOG gives us high 4 bits */ #define XLOG_SMGR_CREATE 0x10 #define XLOG_SMGR_TRUNCATE 0x20 +#define XLOG_SMGR_UNLINK 0x30 +#define XLOG_SMGR_BUFPERSISTENCE 0x40 typedef struct xl_smgr_create { @@ -36,6 +40,18 @@ typedef struct xl_smgr_create ForkNumber forkNum; } xl_smgr_create; +typedef struct xl_smgr_unlink +{ + RelFileNode rnode; + ForkNumber forkNum; +} xl_smgr_unlink; + +typedef struct xl_smgr_bufpersistence +{ + RelFileNode rnode; + bool persistence; +} xl_smgr_bufpersistence; + /* flags for xl_smgr_truncate */ #define SMGR_TRUNCATE_HEAP 0x0001 #define SMGR_TRUNCATE_VM 0x0002 @@ -51,6 +67,8 @@ typedef struct xl_smgr_truncate } xl_smgr_truncate; extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum); +extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum); +extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence); extern void smgr_redo(XLogReaderState *record); extern void smgr_desc(StringInfo buf, XLogReaderState 
*record); diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h index 869cabcc0d..f6e1a74a38 100644 --- a/src/include/common/relpath.h +++ b/src/include/common/relpath.h @@ -43,7 +43,8 @@ typedef enum ForkNumber MAIN_FORKNUM = 0, FSM_FORKNUM, VISIBILITYMAP_FORKNUM, - INIT_FORKNUM + INIT_FORKNUM, + INITTMP_FORKNUM /* * NOTE: if you add a new fork, change MAX_FORKNUM and possibly @@ -52,7 +53,7 @@ typedef enum ForkNumber */ } ForkNumber; -#define MAX_FORKNUM INIT_FORKNUM +#define MAX_FORKNUM INITTMP_FORKNUM #define FORKNAMECHARS 4 /* max chars for a fork name */ diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h index ee91b8fa26..9697449938 100644 --- a/src/include/storage/bufmgr.h +++ b/src/include/storage/bufmgr.h @@ -205,6 +205,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels) extern void FlushDatabaseBuffers(Oid dbid); extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum, int nforks, BlockNumber *firstDelBlock); +extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel, + bool permanent, bool isRedo); extern void DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes); extern void DropDatabaseBuffers(Oid dbid); diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h index f28a842401..5d74631006 100644 --- a/src/include/storage/smgr.h +++ b/src/include/storage/smgr.h @@ -86,6 +86,7 @@ extern void smgrclose(SMgrRelation reln); extern void smgrcloseall(void); extern void smgrclosenode(RelFileNodeBackend rnode); extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo); +extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo); extern void smgrdosyncall(SMgrRelation *rels, int nrels); extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo); extern void smgrextend(SMgrRelation reln, ForkNumber forknum, -- 2.27.0 From 421e0652fe94753921ad382e27da4010ce5db520 Mon Sep 17 00:00:00 
2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Wed, 11 Nov 2020 23:21:09 +0900 Subject: [PATCH v2 2/2] New command ALTER TABLE ALL IN TABLESPACE SET LOGGED/UNLOGGED To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes relation persistence of all tables in the specified tablespace. --- src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++++++++++ src/backend/nodes/copyfuncs.c | 16 ++++ src/backend/nodes/equalfuncs.c | 15 ++++ src/backend/parser/gram.y | 20 +++++ src/backend/tcop/utility.c | 11 +++ src/include/commands/tablecmds.h | 2 + src/include/nodes/nodes.h | 1 + src/include/nodes/parsenodes.h | 9 ++ 8 files changed, 214 insertions(+) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 45be633d9f..002749094b 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -13663,6 +13663,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt) return new_tablespaceoid; } +/* + * Alter Table ALL ... SET LOGGED/UNLOGGED + * + * Allows a user to change persistence of all objects in a given tablespace in + * the current database. Objects can be chosen based on the owner of the + * object also, to allow users to change the persistence of only their own + * objects. The main permissions handling is done by the lower-level change + * persistence function. + * + * All to-be-modified objects are locked first. If NOWAIT is specified and the + * lock can't be acquired then we ereport(ERROR). 
+ */ +void +AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt) +{ + List *relations = NIL; + ListCell *l; + ScanKeyData key[1]; + Relation rel; + TableScanDesc scan; + HeapTuple tuple; + Oid tablespaceoid; + List *role_oids = roleSpecsToIds(NIL); + + /* Ensure we were not asked to change something we can't */ + if (stmt->objtype != OBJECT_TABLE) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("only tables can be specified"))); + + /* Get the tablespace OID */ + tablespaceoid = get_tablespace_oid(stmt->tablespacename, false); + + /* + * Now that the checks are done, check if we should set either to + * InvalidOid because it is our database's default tablespace. + */ + if (tablespaceoid == MyDatabaseTableSpace) + tablespaceoid = InvalidOid; + + /* + * Walk the list of objects in the tablespace to pick up them. This will + * only find objects in our database, of course. + */ + ScanKeyInit(&key[0], + Anum_pg_class_reltablespace, + BTEqualStrategyNumber, F_OIDEQ, + ObjectIdGetDatum(tablespaceoid)); + + rel = table_open(RelationRelationId, AccessShareLock); + scan = table_beginscan_catalog(rel, 1, key); + while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL) + { + Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple); + Oid relOid = relForm->oid; + + /* + * Do not pick-up objects in pg_catalog as part of this, if an admin + * really wishes to do so, they can issue the individual ALTER + * commands directly. + * + * Also, explicitly avoid any shared tables, temp tables, or TOAST + * (TOAST will be changed with the main table). 
+ */ + if (IsCatalogNamespace(relForm->relnamespace) || + relForm->relisshared || + isAnyTempNamespace(relForm->relnamespace) || + IsToastNamespace(relForm->relnamespace)) + continue; + + /* Only pick up the object type requested */ + if (relForm->relkind != RELKIND_RELATION) + continue; + + /* Check if we are only picking-up objects owned by certain roles */ + if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner)) + continue; + + /* + * Handle permissions-checking here since we are locking the tables + * and also to avoid doing a bunch of work only to fail part-way. Note + * that permissions will also be checked by AlterTableInternal(). + * + * Caller must be considered an owner on the table of which we're going + * to change persistence. + */ + if (!pg_class_ownercheck(relOid, GetUserId())) + aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)), + NameStr(relForm->relname)); + + if (stmt->nowait && + !ConditionalLockRelationOid(relOid, AccessExclusiveLock)) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_IN_USE), + errmsg("aborting because lock on relation \"%s.%s\" is not available", + get_namespace_name(relForm->relnamespace), + NameStr(relForm->relname)))); + else + LockRelationOid(relOid, AccessExclusiveLock); + + /* + * Add to our list of objects of which we're going to change + * persistence. + */ + relations = lappend_oid(relations, relOid); + } + + table_endscan(scan); + table_close(rel, AccessShareLock); + + if (relations == NIL) + ereport(NOTICE, + (errcode(ERRCODE_NO_DATA_FOUND), + errmsg("no matching relations in tablespace \"%s\" found", + tablespaceoid == InvalidOid ? "(database default)" : + get_tablespace_name(tablespaceoid)))); + + /* + * Everything is locked, loop through and change persistence of all of the + * relations. 
+ */ + foreach(l, relations) + { + List *cmds = NIL; + AlterTableCmd *cmd = makeNode(AlterTableCmd); + + if (stmt->logged) + cmd->subtype = AT_SetLogged; + else + cmd->subtype = AT_SetUnLogged; + + cmds = lappend(cmds, cmd); + + EventTriggerAlterTableStart((Node *) stmt); + /* OID is set by AlterTableInternal */ + AlterTableInternal(lfirst_oid(l), cmds, false); + EventTriggerAlterTableEnd(); + } +} + static void index_copy_data(Relation rel, RelFileNode newrnode) { diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c index 70f8b718e0..222b81724a 100644 --- a/src/backend/nodes/copyfuncs.c +++ b/src/backend/nodes/copyfuncs.c @@ -4124,6 +4124,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from) return newnode; } +static AlterTableSetLoggedAllStmt * +_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from) +{ + AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt); + + COPY_STRING_FIELD(tablespacename); + COPY_SCALAR_FIELD(objtype); + COPY_SCALAR_FIELD(logged); + COPY_SCALAR_FIELD(nowait); + + return newnode; +} + static CreateExtensionStmt * _copyCreateExtensionStmt(const CreateExtensionStmt *from) { @@ -5424,6 +5437,9 @@ copyObjectImpl(const void *from) case T_AlterTableMoveAllStmt: retval = _copyAlterTableMoveAllStmt(from); break; + case T_AlterTableSetLoggedAllStmt: + retval = _copyAlterTableSetLoggedAllStmt(from); + break; case T_CreateExtensionStmt: retval = _copyCreateExtensionStmt(from); break; diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c index 541e0e6b48..898f78d899 100644 --- a/src/backend/nodes/equalfuncs.c +++ b/src/backend/nodes/equalfuncs.c @@ -1860,6 +1860,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a, return true; } +static bool +_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a, + const AlterTableSetLoggedAllStmt *b) +{ + COMPARE_STRING_FIELD(tablespacename); + COMPARE_SCALAR_FIELD(objtype); + 
COMPARE_SCALAR_FIELD(logged); + COMPARE_SCALAR_FIELD(nowait); + + return true; +} + static bool _equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b) { @@ -3479,6 +3491,9 @@ equal(const void *a, const void *b) case T_AlterTableMoveAllStmt: retval = _equalAlterTableMoveAllStmt(a, b); break; + case T_AlterTableSetLoggedAllStmt: + retval = _equalAlterTableSetLoggedAllStmt(a, b); + break; case T_CreateExtensionStmt: retval = _equalCreateExtensionStmt(a, b); break; diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index 8f341ac006..afc4ff0447 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -1885,6 +1885,26 @@ AlterTableStmt: n->nowait = $13; $$ = (Node *)n; } + | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->logged = true; + n->nowait = $9; + $$ = (Node *)n; + } + | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->logged = false; + n->nowait = $9; + $$ = (Node *)n; + } | ALTER INDEX qualified_name alter_table_cmds { AlterTableStmt *n = makeNode(AlterTableStmt); diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c index a42ead7d69..f866b8cab2 100644 --- a/src/backend/tcop/utility.c +++ b/src/backend/tcop/utility.c @@ -161,6 +161,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree) case T_AlterTSConfigurationStmt: case T_AlterTSDictionaryStmt: case T_AlterTableMoveAllStmt: + case T_AlterTableSetLoggedAllStmt: case T_AlterTableSpaceOptionsStmt: case T_AlterTableStmt: case T_AlterTypeStmt: @@ -1732,6 +1733,12 @@ ProcessUtilitySlow(ParseState *pstate, commandCollected = true; break; + case T_AlterTableSetLoggedAllStmt: + AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree); + 
/* commands are stashed in AlterTableSetLoggedAll */ + commandCollected = true; + break; + case T_DropStmt: ExecDropStmt((DropStmt *) parsetree, isTopLevel); /* no commands stashed for DROP */ @@ -2615,6 +2622,10 @@ CreateCommandTag(Node *parsetree) tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype); break; + case T_AlterTableSetLoggedAllStmt: + tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype); + break; + case T_AlterTableStmt: tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype); break; diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h index c1581ad178..206de61154 100644 --- a/src/include/commands/tablecmds.h +++ b/src/include/commands/tablecmds.h @@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse); extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt); +extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt); + extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt, Oid *oldschema); diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h index 3684f87a88..7fb6437973 100644 --- a/src/include/nodes/nodes.h +++ b/src/include/nodes/nodes.h @@ -423,6 +423,7 @@ typedef enum NodeTag T_AlterCollationStmt, T_CallStmt, T_AlterStatsStmt, + T_AlterTableSetLoggedAllStmt, /* * TAGS FOR PARSE TREE NODES (parsenodes.h) diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h index 48a79a7657..5d549b2476 100644 --- a/src/include/nodes/parsenodes.h +++ b/src/include/nodes/parsenodes.h @@ -2234,6 +2234,15 @@ typedef struct AlterTableMoveAllStmt bool nowait; } AlterTableMoveAllStmt; +typedef struct AlterTableSetLoggedAllStmt +{ + NodeTag type; + char *tablespacename; + ObjectType objtype; /* Object type to move */ + bool logged; + bool nowait; +} AlterTableSetLoggedAllStmt; + /* ---------------------- * Create/Alter Extension Statements * ---------------------- -- 2.27.0
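For anyone skimming the reinit.c hunk in the patch above, the cleanup rule it applies can be sketched as a small standalone C fragment. This is only an illustration under stated assumptions, not code from the patch: the `sketch_*` names and `SketchEntry` are hypothetical, a linear array stands in for the hash table the patch uses, and file-name parsing is simplified (segment suffixes such as ".1" are not handled). The rule itself is taken from the patch: a relfilenode is recorded if it has an init or itmp fork, it is marked dirty if an itmp fork exists, and a file is removed when its relfilenode is recorded and either the entry is dirty or the file is not the init fork.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Fork kinds this sketch distinguishes (illustrative names only). */
typedef enum
{
    SK_MAIN, SK_OTHER, SK_INIT, SK_ITMP
} SketchFork;

typedef struct
{
    unsigned long oid;      /* relfilenode taken from the file name */
    bool          dirty;    /* true once an "itmp" fork has been seen */
} SketchEntry;

/* Parse "<digits>" or "<digits>_<fork>"; return false for anything else. */
static bool
sketch_parse(const char *name, unsigned long *oid, SketchFork *fork)
{
    char *end;

    if (name[0] < '0' || name[0] > '9')
        return false;
    *oid = strtoul(name, &end, 10);
    if (*end == '\0')
        *fork = SK_MAIN;
    else if (strcmp(end, "_init") == 0)
        *fork = SK_INIT;
    else if (strcmp(end, "_itmp") == 0)
        *fork = SK_ITMP;
    else if (strcmp(end, "_fsm") == 0 || strcmp(end, "_vm") == 0)
        *fork = SK_OTHER;
    else
        return false;
    return true;
}

/* Linear lookup; the patch uses a hash table to avoid O(n^2) behavior. */
static SketchEntry *
sketch_find(SketchEntry *entries, int n, unsigned long oid)
{
    for (int i = 0; i < n; i++)
        if (entries[i].oid == oid)
            return &entries[i];
    return NULL;
}

/* First pass: record every relfilenode that has an init or itmp fork. */
int
sketch_collect(const char **names, int nnames, SketchEntry *entries, int cap)
{
    int n = 0;

    for (int i = 0; i < nnames; i++)
    {
        unsigned long oid;
        SketchFork    fork;
        SketchEntry  *ent;

        if (!sketch_parse(names[i], &oid, &fork))
            continue;
        if (fork != SK_INIT && fork != SK_ITMP)
            continue;

        ent = sketch_find(entries, n, oid);
        if (ent == NULL)
        {
            assert(n < cap);
            ent = &entries[n++];
            ent->oid = oid;
            ent->dirty = false;
        }
        if (fork == SK_ITMP)
            ent->dirty = true;
    }
    return n;
}

/* Second pass: would this file be unlinked during cleanup? */
bool
sketch_should_remove(SketchEntry *entries, int n, const char *name)
{
    unsigned long oid;
    SketchFork    fork;
    SketchEntry  *ent;

    if (!sketch_parse(name, &oid, &fork))
        return false;
    ent = sketch_find(entries, n, oid);

    /* A clean init fork is kept; everything else of a recorded node goes. */
    return ent != NULL && (ent->dirty || fork != SK_INIT);
}
```

With a directory holding "16384", "16384_fsm", "16384_init", "16386", "16386_init" and "16386_itmp", this sketch keeps only "16384_init": relfilenode 16384 behaves like an ordinary unlogged relation, while 16386 carries the itmp marker and so loses every fork, including its init fork.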
At Fri, 25 Dec 2020 09:12:52 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> Hello.
>
> At Thu, 24 Dec 2020 17:02:20 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > The patch is attached to the next message.
>
> The reason for separating this message is that I modified this so that
> it could solve another issue.
>
> There's a complaint about orphan files after a crash. [1]
>
> 1: https://www.postgresql.org/message-id/16771-cbef7d97ba93f4b9@postgresql.org
>
> That is, the case where a relation file is left behind after a server
> crash that happened before the end of the transaction that created
> the relation. As I read this, I noticed this feature can solve the
> issue with a small change.
>
> This version changes RelationCreateStorage and
> smgrDoPendingDeletes.
>
> Previously the inittmp fork was created only along with an init fork. This
> version always creates one when a relation storage file is created. As
> a result, ResetUnloggedRelationsInDbspaceDir removes all forks if the
> inittmp fork of a logged relation is found. Now that pendingDeletes
> can contain multiple entries for the same relation, it has been
> modified not to close the same smgr multiple times.
>
> - It might be better to split 0001 into two pieces.
>
> - The function name ResetUnloggedRelationsInDbspaceDir no longer
>   represents the function correctly.

As pointed out by Robert in another thread [1], the persistence of (at least) GiST indexes cannot be flipped in-place because fake LSNs are incompatible with real ones. In this version, RelationChangePersistence() is changed not to choose the in-place method for indexes other than btree. The method seems to be usable with all kinds of indexes other than GiST, but at the moment it applies only to btrees.

1: https://www.postgresql.org/message-id/CA+TgmoZEZ5RONS49C7mEpjhjndqMQtVrz_LCQUkpRWdmRevDnQ@mail.gmail.com

regards.
-- Kyotaro Horiguchi NTT Open Source Software Center From 1d47e7872d1e7ef18007f752e55cec9772373cc9 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Wed, 11 Nov 2020 21:51:11 +0900 Subject: [PATCH v3 1/2] In-place table persistence change Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data rewriting, currently it runs heap rewrite which causes large amount of file I/O. This patch makes the command run without heap rewrite. Addition to that, SET LOGGED while wal_level > minimal emits WAL using XLOG_FPI instead of massive number of HEAP_INSERT's, which should be smaller. Also this allows for the cleanup of files left behind in the crash of the transaction that created it. --- src/backend/access/rmgrdesc/smgrdesc.c | 23 ++ src/backend/access/transam/README | 10 + src/backend/catalog/storage.c | 420 +++++++++++++++++++++++-- src/backend/commands/tablecmds.c | 246 ++++++++++++--- src/backend/storage/buffer/bufmgr.c | 88 ++++++ src/backend/storage/file/reinit.c | 162 ++++++---- src/backend/storage/smgr/md.c | 4 +- src/backend/storage/smgr/smgr.c | 6 + src/common/relpath.c | 3 +- src/include/catalog/storage.h | 2 + src/include/catalog/storage_xlog.h | 22 +- src/include/common/relpath.h | 5 +- src/include/storage/bufmgr.h | 2 + src/include/storage/smgr.h | 1 + 14 files changed, 854 insertions(+), 140 deletions(-) diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c index 7755553d57..2c109b8ca4 100644 --- a/src/backend/access/rmgrdesc/smgrdesc.c +++ b/src/backend/access/rmgrdesc/smgrdesc.c @@ -40,6 +40,23 @@ smgr_desc(StringInfo buf, XLogReaderState *record) xlrec->blkno, xlrec->flags); pfree(path); } + else if (info == XLOG_SMGR_UNLINK) + { + xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec; + char *path = relpathperm(xlrec->rnode, xlrec->forkNum); + + appendStringInfoString(buf, path); + pfree(path); + } + else if (info == XLOG_SMGR_BUFPERSISTENCE) + { + xl_smgr_bufpersistence *xlrec = 
(xl_smgr_bufpersistence *) rec; + char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM); + + appendStringInfoString(buf, path); + appendStringInfo(buf, " persistence %d", xlrec->persistence); + pfree(path); + } } const char * @@ -55,6 +72,12 @@ smgr_identify(uint8 info) case XLOG_SMGR_TRUNCATE: id = "TRUNCATE"; break; + case XLOG_SMGR_UNLINK: + id = "UNLINK"; + break; + case XLOG_SMGR_BUFPERSISTENCE: + id = "BUFPERSISTENCE"; + break; } return id; diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README index 1edc8180c1..51616b2458 100644 --- a/src/backend/access/transam/README +++ b/src/backend/access/transam/README @@ -724,6 +724,16 @@ we must panic and abort recovery. The DBA will have to manually clean up then restart recovery. This is part of the reason for not writing a WAL entry until we've successfully done the original action. +The INITTMP fork file +-------------------------------- + +An INITTMP fork is created whenever a new relation file is created, to +mark that the relfilenode needs to be cleaned up at recovery time. The +file is removed at transaction end but is left behind if the process +crashes before the transaction ends. In contrast to 4 above, failure to +remove an INITTMP file will lead to data loss, in which case the server +will shut down. 
+ Skipping WAL for New RelFileNode -------------------------------- diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c index cba7a9ada0..bd9680583b 100644 --- a/src/backend/catalog/storage.c +++ b/src/backend/catalog/storage.c @@ -19,6 +19,7 @@ #include "postgres.h" +#include "access/amapi.h" #include "access/parallel.h" #include "access/visibilitymap.h" #include "access/xact.h" @@ -27,6 +28,7 @@ #include "access/xlogutils.h" #include "catalog/storage.h" #include "catalog/storage_xlog.h" +#include "common/hashfn.h" #include "miscadmin.h" #include "storage/freespace.h" #include "storage/smgr.h" @@ -57,9 +59,16 @@ int wal_skip_threshold = 2048; /* in kilobytes */ * but I'm being paranoid. */ +#define PDOP_DELETE (0) +#define PDOP_UNLINK_FORK (1 << 0) +#define PDOP_SET_PERSISTENCE (1 << 1) + typedef struct PendingRelDelete { RelFileNode relnode; /* relation that may need to be deleted */ + int op; /* operation mask */ + bool bufpersistence; /* buffer persistence to set */ + int unlink_forknum; /* forknum to unlink */ BackendId backend; /* InvalidBackendId if not a temp rel */ bool atCommit; /* T=delete at commit; F=delete at abort */ int nestLevel; /* xact nesting level of request */ @@ -75,6 +84,24 @@ typedef struct PendingRelSync static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */ HTAB *pendingSyncHash = NULL; +typedef struct SRelHashEntry +{ + SMgrRelation srel; + char status; /* for simplehash use */ +} SRelHashEntry; + +/* define hashtable for workarea for pending deletes */ +#define SH_PREFIX srelhash +#define SH_ELEMENT_TYPE SRelHashEntry +#define SH_KEY_TYPE SMgrRelation +#define SH_KEY srel +#define SH_HASH_KEY(tb, key) \ + hash_bytes((unsigned char *)&key, sizeof(SMgrRelation)) +#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(SMgrRelation)) == 0) +#define SH_SCOPE static inline +#define SH_DEFINE +#define SH_DECLARE +#include "lib/simplehash.h" /* * AddPendingSync @@ -143,7 +170,17 @@ 
RelationCreateStorage(RelFileNode rnode, char relpersistence) return NULL; /* placate compiler */ } + /* + * We are going to create a new storage file. If server crashes before the + * current transaction ends the file needs to be cleaned up but there's no + * clue to the orphan files. The inittmp fork works as the sentinel to + * identify that situation. + */ srel = smgropen(rnode, backend); + smgrcreate(srel, INITTMP_FORKNUM, false); + log_smgrcreate(&rnode, INITTMP_FORKNUM); + smgrimmedsync(srel, INITTMP_FORKNUM); + smgrcreate(srel, MAIN_FORKNUM, false); if (needs_wal) @@ -153,12 +190,25 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) pending = (PendingRelDelete *) MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); pending->relnode = rnode; + pending->op = PDOP_DELETE; pending->backend = backend; pending->atCommit = false; /* delete if abort */ pending->nestLevel = GetCurrentTransactionNestLevel(); pending->next = pendingDeletes; pendingDeletes = pending; + /* drop inittmp fork at commit */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_FORK; + pending->unlink_forknum = INITTMP_FORKNUM; + pending->backend = backend; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded()) { Assert(backend == InvalidBackendId); @@ -168,6 +218,215 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) return srel; } +/* + * RelationCreateInitFork + * Create physical storage for the init fork of a relation. + * + * Create the init fork for the relation. + * + * This function is transactional. The creation is WAL-logged, and if the + * transaction aborts later on, the init fork will be removed. 
+ */ +void +RelationCreateInitFork(Relation rel) +{ + RelFileNode rnode = rel->rd_node; + PendingRelDelete *pending; + SMgrRelation srel; + PendingRelDelete *prev; + PendingRelDelete *next; + bool create = true; + + /* switch buffer persistence */ + SetRelationBuffersPersistence(rel->rd_smgr, false, false); + + /* + * If we have entries for init-fork operation of this relation, that means + * that we have already registered pending-delete entries to drop an init + * fork that existed before the current transaction started. This function + * reverts that change just by removing the entries. + */ + prev = NULL; + for (pending = pendingDeletes; pending != NULL; pending = next) + { + next = pending->next; + if (RelFileNodeEquals(rnode, pending->relnode) && + pending->op != PDOP_DELETE) + { + if (prev) + prev->next = next; + else + pendingDeletes = next; + pfree(pending); + + create = false; + } + else + { + /* unrelated entry, don't touch it */ + prev = pending; + } + } + + if (!create) + return; + + /* + * We are going to create the init fork. If the server crashes before the + * current transaction ends, the init fork left behind corrupts data during + * recovery. The inittmp fork works as the sentinel to identify that + * situation. + */ + srel = smgropen(rnode, InvalidBackendId); + smgrcreate(srel, INITTMP_FORKNUM, false); + log_smgrcreate(&rnode, INITTMP_FORKNUM); + smgrimmedsync(srel, INITTMP_FORKNUM); + + /* We don't have existing init fork, create it. */ + smgrcreate(srel, INIT_FORKNUM, false); + + /* + * index-init fork needs further initialization. ambuildempty should do + * WAL-log and file sync by itself; otherwise we do that ourselves.
+ */ + if (rel->rd_rel->relkind == RELKIND_INDEX) + rel->rd_indam->ambuildempty(rel); + else + { + log_smgrcreate(&rnode, INIT_FORKNUM); + smgrimmedsync(srel, INIT_FORKNUM); + } + + /* drop this init fork file at abort and revert persistence */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_FORK | PDOP_SET_PERSISTENCE; + pending->unlink_forknum = INIT_FORKNUM; + pending->bufpersistence = true; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + + /* drop inittmp fork at abort */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_FORK; + pending->unlink_forknum = INITTMP_FORKNUM; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + + /* drop inittmp fork at commit */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_FORK; + pending->unlink_forknum = INITTMP_FORKNUM; + pending->backend = InvalidBackendId; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; +} + +/* + * RelationDropInitFork + * Delete physical storage for the init fork of a relation. + * + * Register pending-delete of the init fork. The real deletion is performed by + * smgrDoPendingDeletes at commit. + * + * This function is transactional. If the transaction aborts later on, the + * deletion doesn't happen. 
+ */ +void +RelationDropInitFork(Relation rel) +{ + RelFileNode rnode = rel->rd_node; + PendingRelDelete *pending; + PendingRelDelete *prev; + PendingRelDelete *next; + bool inxact_created = false; + + /* switch buffer persistence */ + SetRelationBuffersPersistence(rel->rd_smgr, true, false); + + /* + * If we have entries for init-fork operation of this relation, that means + * that we have created the init fork in the current transaction. We + * remove the init and inittmp forks immediately in that case. + * Otherwise just register pending-delete for the existing init fork. + */ + prev = NULL; + for (pending = pendingDeletes; pending != NULL; pending = next) + { + next = pending->next; + if (RelFileNodeEquals(rnode, pending->relnode) && + pending->op != PDOP_DELETE) + { + /* unlink list entry */ + if (prev) + prev->next = next; + else + pendingDeletes = next; + pfree(pending); + + inxact_created = true; + } + else + { + /* unrelated entry, don't touch it */ + prev = pending; + } + } + + if (inxact_created) + { + SMgrRelation srel = smgropen(rnode, InvalidBackendId); + + /* + * INIT/INITTMP forks are never loaded into shared buffers, so there is + * no point in dropping buffers for these files.
+ */ + log_smgrunlink(&rnode, INIT_FORKNUM); + smgrunlink(srel, INIT_FORKNUM, false); + log_smgrunlink(&rnode, INITTMP_FORKNUM); + smgrunlink(srel, INITTMP_FORKNUM, false); + smgrclose(srel); + return; + } + + /* register drop of this init fork file at commit */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_FORK; + pending->unlink_forknum = INIT_FORKNUM; + pending->backend = InvalidBackendId; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + + /* revert buffer-persistence changes at abort */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_SET_PERSISTENCE; + pending->bufpersistence = false; + pending->backend = InvalidBackendId; + pending->atCommit = false; /* applied only at abort */ + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; +} + /* * Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL. */ @@ -187,6 +446,44 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum) XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE); } +/* + * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL. + */ +void +log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum) +{ + xl_smgr_unlink xlrec; + + /* + * Make an XLOG entry reporting the file unlink. + */ + xlrec.rnode = *rnode; + xlrec.forkNum = forkNum; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE); +} + +/* + * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL. + */ +void +log_smgrbufpersistence(const RelFileNode *rnode, bool persistence) +{ + xl_smgr_bufpersistence xlrec; + + /* + * Make an XLOG entry reporting the change of buffer persistence.
+ */ + xlrec.rnode = *rnode; + xlrec.persistence = persistence; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE); +} + /* * RelationDropStorage * Schedule unlinking of physical storage at transaction commit. @@ -200,6 +497,7 @@ RelationDropStorage(Relation rel) pending = (PendingRelDelete *) MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); pending->relnode = rel->rd_node; + pending->op = PDOP_DELETE; pending->backend = rel->rd_backend; pending->atCommit = true; /* delete if commit */ pending->nestLevel = GetCurrentTransactionNestLevel(); @@ -602,59 +900,97 @@ smgrDoPendingDeletes(bool isCommit) int nrels = 0, maxrels = 0; SMgrRelation *srels = NULL; + srelhash_hash *close_srels = NULL; + bool found; + prev = NULL; for (pending = pendingDeletes; pending != NULL; pending = next) { + SMgrRelation srel; + next = pending->next; if (pending->nestLevel < nestLevel) { /* outer-level entries should not be processed yet */ prev = pending; + continue; } + + /* unlink list entry first, so we don't retry on failure */ + if (prev) + prev->next = next; else + pendingDeletes = next; + + if (pending->atCommit != isCommit) { - /* unlink list entry first, so we don't retry on failure */ - if (prev) - prev->next = next; - else - pendingDeletes = next; - /* do deletion if called for */ - if (pending->atCommit == isCommit) - { - SMgrRelation srel; - - srel = smgropen(pending->relnode, pending->backend); - - /* allocate the initial array, or extend it, if needed */ - if (maxrels == 0) - { - maxrels = 8; - srels = palloc(sizeof(SMgrRelation) * maxrels); - } - else if (maxrels <= nrels) - { - maxrels *= 2; - srels = repalloc(srels, sizeof(SMgrRelation) * maxrels); - } - - srels[nrels++] = srel; - } /* must explicitly free the list entry */ pfree(pending); /* prev does not change */ + continue; + } + + if (close_srels == NULL) + close_srels = 
srelhash_create(CurrentMemoryContext, 32, NULL); + + srel = smgropen(pending->relnode, pending->backend); + + /* Uniquify the smgr relations */ + srelhash_insert(close_srels, srel, &found); + + if (pending->op != PDOP_DELETE) + { + if (pending->op & PDOP_UNLINK_FORK) + { + /* other forks would need their buffers dropped */ + Assert(pending->unlink_forknum == INIT_FORKNUM || + pending->unlink_forknum == INITTMP_FORKNUM); + + log_smgrunlink(&pending->relnode, pending->unlink_forknum); + smgrunlink(srel, pending->unlink_forknum, false); + + } + + if (pending->op & PDOP_SET_PERSISTENCE) + SetRelationBuffersPersistence(srel, pending->bufpersistence, + false); + } + else + { + /* allocate the initial array, or extend it, if needed */ + if (maxrels == 0) + { + maxrels = 8; + srels = palloc(sizeof(SMgrRelation) * maxrels); + } + else if (maxrels <= nrels) + { + maxrels *= 2; + srels = repalloc(srels, sizeof(SMgrRelation) * maxrels); + } + + srels[nrels++] = srel; } } if (nrels > 0) { smgrdounlinkall(srels, nrels, false); - - for (int i = 0; i < nrels; i++) - smgrclose(srels[i]); - pfree(srels); } + + if (close_srels) + { + srelhash_iterator i; + SRelHashEntry *ent; + + /* close smgr relations */ + srelhash_start_iterate(close_srels, &i); + while ((ent = srelhash_iterate(close_srels, &i)) != NULL) + smgrclose(ent->srel); + srelhash_destroy(close_srels); + } } /* @@ -824,7 +1160,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr) for (pending = pendingDeletes; pending != NULL; pending = pending->next) { if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit - && pending->backend == InvalidBackendId) + && pending->backend == InvalidBackendId + && pending->op == PDOP_DELETE) nrels++; } if (nrels == 0) @@ -837,7 +1174,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr) for (pending = pendingDeletes; pending != NULL; pending = pending->next) { if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit - && pending->backend == InvalidBackendId)
+ && pending->backend == InvalidBackendId && + pending->op == PDOP_DELETE) { *rptr = pending->relnode; rptr++; @@ -917,6 +1255,15 @@ smgr_redo(XLogReaderState *record) reln = smgropen(xlrec->rnode, InvalidBackendId); smgrcreate(reln, xlrec->forkNum, true); } + else if (info == XLOG_SMGR_UNLINK) + { + xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record); + SMgrRelation reln; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + smgrunlink(reln, xlrec->forkNum, true); + smgrclose(reln); + } else if (info == XLOG_SMGR_TRUNCATE) { xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record); @@ -1005,6 +1352,15 @@ smgr_redo(XLogReaderState *record) FreeFakeRelcacheEntry(rel); } + else if (info == XLOG_SMGR_BUFPERSISTENCE) + { + xl_smgr_bufpersistence *xlrec = + (xl_smgr_bufpersistence *) XLogRecGetData(record); + SMgrRelation reln; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + SetRelationBuffersPersistence(reln, xlrec->persistence, true); + } else elog(PANIC, "smgr_redo: unknown op code %u", info); } diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 993da56d43..37a15d31ee 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -50,6 +50,7 @@ #include "commands/defrem.h" #include "commands/event_trigger.h" #include "commands/policy.h" +#include "commands/progress.h" #include "commands/sequence.h" #include "commands/tablecmds.h" #include "commands/tablespace.h" @@ -4917,6 +4918,170 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel, return newcmd; } +/* + * RelationChangePersistence: do in-place persistence change of a relation + */ +static void +RelationChangePersistence(AlteredTableInfo *tab, char persistence, + LOCKMODE lockmode) +{ + Relation rel; + Relation classRel; + HeapTuple tuple, + newtuple; + Datum new_val[Natts_pg_class]; + bool new_null[Natts_pg_class], + new_repl[Natts_pg_class]; + int i; + List *relids; + ListCell *lc_oid; + + 
Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE); + Assert(lockmode == AccessExclusiveLock); + + /* + * If any of the following were set, we would need to call ATRewriteTable. + * None of them can be set in the AT_REWRITE_ALTER_PERSISTENCE case. + */ + Assert(tab->constraints == NULL && tab->partition_constraint == NULL && + tab->newvals == NULL && !tab->verify_new_notnull); + + rel = table_open(tab->relid, lockmode); + + Assert(rel->rd_rel->relpersistence != persistence); + + elog(DEBUG1, "perform in-place persistence change"); + + RelationOpenSmgr(rel); + + /* + * First we collect all relations whose persistence we need to change. + */ + + /* Collect OIDs of indexes and toast relations */ + relids = RelationGetIndexList(rel); + relids = lcons_oid(rel->rd_id, relids); + + /* Add toast relation if any */ + if (OidIsValid(rel->rd_rel->reltoastrelid)) + { + List *toastidx; + Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode); + + RelationOpenSmgr(toastrel); + relids = lappend_oid(relids, rel->rd_rel->reltoastrelid); + toastidx = RelationGetIndexList(toastrel); + relids = list_concat(relids, toastidx); + pfree(toastidx); + table_close(toastrel, NoLock); + } + + table_close(rel, NoLock); + + /* Make changes in storage */ + classRel = table_open(RelationRelationId, RowExclusiveLock); + + foreach (lc_oid, relids) + { + Oid reloid = lfirst_oid(lc_oid); + Relation r = relation_open(reloid, lockmode); + + /* + * Some access methods do not accept in-place persistence change. For + * example, GiST uses page LSNs to figure out whether a block has + * changed, where UNLOGGED GiST indexes use fake LSNs that are + * incompatible with real LSNs used for LOGGED ones. + * + * XXXX: We don't bother allowing in-place persistence change for index + * methods other than btree for now.
+ */ + if (r->rd_rel->relkind == RELKIND_INDEX && + r->rd_rel->relam != BTREE_AM_OID) + { + int reindex_flags; + + /* reindex doesn't allow concurrent use of the index */ + table_close(r, NoLock); + + reindex_flags = + REINDEX_REL_SUPPRESS_INDEX_USE | + REINDEX_REL_CHECK_CONSTRAINTS; + + /* Set the same persistence as the parent relation. */ + if (persistence == RELPERSISTENCE_UNLOGGED) + reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED; + else + reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT; + + reindex_index(reloid, reindex_flags, persistence, 0); + + continue; + } + + RelationOpenSmgr(r); + + /* Create or drop init fork */ + if (persistence == RELPERSISTENCE_UNLOGGED) + RelationCreateInitFork(r); + else + RelationDropInitFork(r); + + /* + * When this relation gets WAL-logged, immediately sync all files but + * the init fork to establish the initial state on storage. Buffers have + * already been flushed out by RelationCreate(Drop)InitFork called just + * above. The init fork should have been synced as needed.
+ */ + if (persistence == RELPERSISTENCE_PERMANENT) + { + for (i = 0 ; i < INIT_FORKNUM ; i++) + { + if (smgrexists(r->rd_smgr, i)) + smgrimmedsync(r->rd_smgr, i); + } + } + + /* Update catalog */ + tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid)); + if (!HeapTupleIsValid(tuple)) + elog(ERROR, "cache lookup failed for relation %u", reloid); + + memset(new_val, 0, sizeof(new_val)); + memset(new_null, false, sizeof(new_null)); + memset(new_repl, false, sizeof(new_repl)); + + new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence); + new_null[Anum_pg_class_relpersistence - 1] = false; + new_repl[Anum_pg_class_relpersistence - 1] = true; + + newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel), + new_val, new_null, new_repl); + + CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple); + heap_freetuple(newtuple); + + /* + * While wal_level >= replica, switching to LOGGED requires the + * relation content to be WAL-logged to recover the table. + */ + if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded()) + { + ForkNumber fork; + + for (fork = 0; fork < INIT_FORKNUM ; fork++) + { + if (smgrexists(r->rd_smgr, fork)) + log_newpage_range(r, fork, + 0, smgrnblocks(r->rd_smgr, fork), false); + } + } + + table_close(r, NoLock); + } + + table_close(classRel, NoLock); +} + /* * ATRewriteTables: ALTER TABLE phase 3 */ @@ -5037,45 +5202,52 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode, tab->relid, tab->rewrite); - /* - * Create transient table that will receive the modified data. - * - * Ensure it is marked correctly as logged or unlogged. We have - * to do this here so that buffers for the new relfilenode will - * have the right persistence set, and at the same time ensure - * that the original filenode's buffers will get read in with the - * correct setting (i.e. the original one). 
Otherwise a rollback - * after the rewrite would possibly result with buffers for the - * original filenode having the wrong persistence setting. - * - * NB: This relies on swap_relation_files() also swapping the - * persistence. That wouldn't work for pg_class, but that can't be - * unlogged anyway. - */ - OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, persistence, - lockmode); + if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE) + RelationChangePersistence(tab, persistence, lockmode); + else + { + /* + * Create transient table that will receive the modified data. + * + * Ensure it is marked correctly as logged or unlogged. We + * have to do this here so that buffers for the new relfilenode + * will have the right persistence set, and at the same time + * ensure that the original filenode's buffers will get read in + * with the correct setting (i.e. the original one). Otherwise + * a rollback after the rewrite would possibly result with + * buffers for the original filenode having the wrong + * persistence setting. + * + * NB: This relies on swap_relation_files() also swapping the + * persistence. That wouldn't work for pg_class, but that can't + * be unlogged anyway. + */ + OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, persistence, + lockmode); - /* - * Copy the heap data into the new table with the desired - * modifications, and test the current data within the table - * against new constraints generated by ALTER TABLE commands. - */ - ATRewriteTable(tab, OIDNewHeap, lockmode); + /* + * Copy the heap data into the new table with the desired + * modifications, and test the current data within the table + * against new constraints generated by ALTER TABLE commands. + */ + ATRewriteTable(tab, OIDNewHeap, lockmode); - /* - * Swap the physical files of the old and new heaps, then rebuild - * indexes and discard the old heap. 
We can use RecentXmin for - * the table's new relfrozenxid because we rewrote all the tuples - * in ATRewriteTable, so no older Xid remains in the table. Also, - * we never try to swap toast tables by content, since we have no - * interest in letting this code work on system catalogs. - */ - finish_heap_swap(tab->relid, OIDNewHeap, - false, false, true, - !OidIsValid(tab->newTableSpace), - RecentXmin, - ReadNextMultiXactId(), - persistence); + /* + * Swap the physical files of the old and new heaps, then + * rebuild indexes and discard the old heap. We can use + * RecentXmin for the table's new relfrozenxid because we + * rewrote all the tuples in ATRewriteTable, so no older Xid + * remains in the table. Also, we never try to swap toast + * tables by content, since we have no interest in letting this + * code work on system catalogs. + */ + finish_heap_swap(tab->relid, OIDNewHeap, + false, false, true, + !OidIsValid(tab->newTableSpace), + RecentXmin, + ReadNextMultiXactId(), + persistence); + } } else { diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 71b5852224..b730b4417c 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -37,6 +37,7 @@ #include "access/xlog.h" #include "catalog/catalog.h" #include "catalog/storage.h" +#include "catalog/storage_xlog.h" #include "executor/instrument.h" #include "lib/binaryheap.h" #include "miscadmin.h" @@ -3032,6 +3033,93 @@ DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum, } } +/* --------------------------------------------------------------------- + * SetRelFileNodeBuffersPersistence + * + * This function changes the persistence of all buffer pages of a relation + * then writes all dirty pages of the relation out to disk when switching + * to PERMANENT. (or more accurately, out to kernel disk buffers), + * ensuring that the kernel has an up-to-date view of the relation. 
+ * + * Generally, the caller should be holding AccessExclusiveLock on the + * target relation to ensure that no other backend is busy dirtying + * more blocks of the relation; the effects can't be expected to last + * after the lock is released. + * + * XXX currently it sequentially searches the buffer pool, should be + * changed to more clever ways of searching. This routine is not + * used in any performance-critical code paths, so it's not worth + * adding additional overhead to normal paths to make it go faster; + * but see also DropRelFileNodeBuffers. + * -------------------------------------------------------------------- + */ +void +SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo) +{ + int i; + RelFileNodeBackend rnode = srel->smgr_rnode; + + Assert (!RelFileNodeBackendIsTemp(rnode)); + + if (!isRedo) + log_smgrbufpersistence(&srel->smgr_rnode.node, permanent); + + ResourceOwnerEnlargeBuffers(CurrentResourceOwner); + + for (i = 0; i < NBuffers; i++) + { + BufferDesc *bufHdr = GetBufferDescriptor(i); + uint32 buf_state; + + if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node)) + continue; + + ReservePrivateRefCountEntry(); + + buf_state = LockBufHdr(bufHdr); + + if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node)) + { + UnlockBufHdr(bufHdr, buf_state); + continue; + } + + if (permanent) + { + /* Init fork is being dropped, drop buffers for it. 
*/ + if (bufHdr->tag.forkNum == INIT_FORKNUM) + { + InvalidateBuffer(bufHdr); + continue; + } + + buf_state |= BM_PERMANENT; + pg_atomic_write_u32(&bufHdr->state, buf_state); + + /* we flush this buffer when switching to PERMANENT */ + if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY)) + { + PinBuffer_Locked(bufHdr); + LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), + LW_SHARED); + FlushBuffer(bufHdr, srel); + LWLockRelease(BufferDescriptorGetContentLock(bufHdr)); + UnpinBuffer(bufHdr, true); + } + else + UnlockBufHdr(bufHdr, buf_state); + } + else + { + /* init fork is always BM_PERMANENT. See BufferAlloc */ + if (bufHdr->tag.forkNum != INIT_FORKNUM) + buf_state &= ~BM_PERMANENT; + + UnlockBufHdr(bufHdr, buf_state); + } + } +} + /* --------------------------------------------------------------------- * DropRelFileNodesAllBuffers * diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c index 40c758d789..adcb54b0fa 100644 --- a/src/backend/storage/file/reinit.c +++ b/src/backend/storage/file/reinit.c @@ -31,7 +31,8 @@ static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, typedef struct { Oid reloid; /* hash key */ -} unlogged_relation_entry; + bool dirty; /* to be removed */ +} relfile_entry; /* * Reset unlogged relations from before the last restart. @@ -151,6 +152,8 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) DIR *dbspace_dir; struct dirent *de; char rm_path[MAXPGPATH * 2]; + HTAB *hash; + HASHCTL ctl; /* Caller must specify at least one operation. */ Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0); @@ -160,88 +163,86 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) * the files with init forks. Then, we go through again and nuke * everything with the same OID except the init fork. 
*/ + + /* + * It's possible that someone could create a ton of unlogged relations + * in the same database & tablespace, so we'd better use a hash table + * rather than an array or linked list to keep track of which files + * need to be reset. Otherwise, this cleanup operation would be + * O(n^2). + */ + memset(&ctl, 0, sizeof(ctl)); + ctl.keysize = sizeof(Oid); + ctl.entrysize = sizeof(relfile_entry); + hash = hash_create("inittmp hash", 32, &ctl, HASH_ELEM | HASH_BLOBS); + + /* Collect init and inittmp forks in the directory. */ + dbspace_dir = AllocateDir(dbspacedirname); + while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) + { + int oidchars; + ForkNumber forkNum; + + /* Skip anything that doesn't look like a relation data file. */ + if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, + &forkNum)) + continue; + + /* Record init and inittmp forks */ + if (forkNum == INIT_FORKNUM || forkNum == INITTMP_FORKNUM) + { + Oid key; + relfile_entry *ent; + bool found; + + /* + * Record the relfilenode. If it has an INITTMP fork, all of its + * files need to be cleaned up. Otherwise the relfilenode is cleaned + * up according to its unloggedness. + */ + key = atooid(de->d_name); + ent = hash_search(hash, &key, HASH_ENTER, &found); + + if (!found) + ent->dirty = false; + + if (forkNum == INITTMP_FORKNUM) + ent->dirty = true; + } + } + + /* Done with the first pass. */ + FreeDir(dbspace_dir); + if ((op & UNLOGGED_RELATION_CLEANUP) != 0) { - HTAB *hash; - HASHCTL ctl; - - /* - * It's possible that someone could create a ton of unlogged relations - * in the same database & tablespace, so we'd better use a hash table - * rather than an array or linked list to keep track of which files - * need to be reset. Otherwise, this cleanup operation would be - * O(n^2).
- */ - ctl.keysize = sizeof(Oid); - ctl.entrysize = sizeof(unlogged_relation_entry); - ctl.hcxt = CurrentMemoryContext; - hash = hash_create("unlogged relation OIDs", 32, &ctl, - HASH_ELEM | HASH_BLOBS | HASH_CONTEXT); - - /* Scan the directory. */ - dbspace_dir = AllocateDir(dbspacedirname); - while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) - { - ForkNumber forkNum; - int oidchars; - unlogged_relation_entry ent; - - /* Skip anything that doesn't look like a relation data file. */ - if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) - continue; - - /* Also skip it unless this is the init fork. */ - if (forkNum != INIT_FORKNUM) - continue; - - /* - * Put the OID portion of the name into the hash table, if it - * isn't already. - */ - ent.reloid = atooid(de->d_name); - (void) hash_search(hash, &ent, HASH_ENTER, NULL); - } - - /* Done with the first pass. */ - FreeDir(dbspace_dir); - - /* - * If we didn't find any init forks, there's no point in continuing; - * we can bail out now. - */ - if (hash_get_num_entries(hash) == 0) - { - hash_destroy(hash); - return; - } - /* * Now, make a second pass and remove anything that matches. */ dbspace_dir = AllocateDir(dbspacedirname); while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) { - ForkNumber forkNum; - int oidchars; - unlogged_relation_entry ent; + ForkNumber forkNum; + int oidchars; + Oid key; + relfile_entry *ent; /* Skip anything that doesn't look like a relation data file. */ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, &forkNum)) continue; - /* We never remove the init fork. */ - if (forkNum == INIT_FORKNUM) - continue; - /* * See whether the OID portion of the name shows up in the hash * table. If so, nuke it! 
*/ - ent.reloid = atooid(de->d_name); - if (hash_search(hash, &ent, HASH_FIND, NULL)) + key = atooid(de->d_name); + ent = hash_search(hash, &key, HASH_FIND, NULL); + + /* we don't remove a clean init file */ + if (ent && (ent->dirty || forkNum != INIT_FORKNUM)) + { + /* so, nuke it! */ snprintf(rm_path, sizeof(rm_path), "%s/%s", dbspacedirname, de->d_name); if (unlink(rm_path) < 0) @@ -256,7 +257,6 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) /* Cleanup is complete. */ FreeDir(dbspace_dir); - hash_destroy(hash); } /* @@ -277,12 +277,42 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) char oidbuf[OIDCHARS + 1]; char srcpath[MAXPGPATH * 2]; char dstpath[MAXPGPATH]; + Oid key; + relfile_entry *ent; /* Skip anything that doesn't look like a relation data file. */ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, &forkNum)) continue; + /* + * See whether the OID portion of the name shows up in the hash + * table. + */ + key = atooid(de->d_name); + ent = hash_search(hash, &key, HASH_FIND, NULL); + + /* we don't remove a clean init file */ + if (ent && (ent->dirty || forkNum != INIT_FORKNUM)) + { + /* + * The file is dirty. It should have been removed at + * cleanup time, but recovery can create it again. Remove + * it. + */ + snprintf(rm_path, sizeof(rm_path), "%s/%s", + dbspacedirname, de->d_name); + if (unlink(rm_path) < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not remove file \"%s\": %m", + rm_path))); + else + elog(DEBUG2, "unlinked file \"%s\"", rm_path); + + continue; + } + /* Also skip it unless this is the init fork.
*/ if (forkNum != INIT_FORKNUM) continue; @@ -351,6 +381,8 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) */ fsync_fname(dbspacedirname, true); } + + hash_destroy(hash); } /* diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index 0643d714fb..416fd859e6 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -338,8 +338,10 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo) if (ret == 0 || errno != ENOENT) { ret = unlink(path); + + /* failure of removing inittmp fork leads to a data loss. */ if (ret < 0 && errno != ENOENT) - ereport(WARNING, + ereport((forkNum != INITTMP_FORKNUM ? WARNING : ERROR), (errcode_for_file_access(), errmsg("could not remove file \"%s\": %m", path))); } diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c index 0f31ff3822..4102d3d59c 100644 --- a/src/backend/storage/smgr/smgr.c +++ b/src/backend/storage/smgr/smgr.c @@ -644,6 +644,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum) smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum); } +void +smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo); +} + /* * AtEOXact_SMgr * diff --git a/src/common/relpath.c b/src/common/relpath.c index 1f5c426ec0..2954cd9c24 100644 --- a/src/common/relpath.c +++ b/src/common/relpath.c @@ -34,7 +34,8 @@ const char *const forkNames[] = { "main", /* MAIN_FORKNUM */ "fsm", /* FSM_FORKNUM */ "vm", /* VISIBILITYMAP_FORKNUM */ - "init" /* INIT_FORKNUM */ + "init", /* INIT_FORKNUM */ + "itmp" /* INITTMP_FORKNUM */ }; StaticAssertDecl(lengthof(forkNames) == (MAX_FORKNUM + 1), diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h index 0ab32b44e9..382623159c 100644 --- a/src/include/catalog/storage.h +++ b/src/include/catalog/storage.h @@ -23,6 +23,8 @@ extern int wal_skip_threshold; extern SMgrRelation 
RelationCreateStorage(RelFileNode rnode, char relpersistence); +extern void RelationCreateInitFork(Relation rel); +extern void RelationDropInitFork(Relation rel); extern void RelationDropStorage(Relation rel); extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit); extern void RelationPreTruncate(Relation rel); diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h index f0814f1458..0fd0832a8b 100644 --- a/src/include/catalog/storage_xlog.h +++ b/src/include/catalog/storage_xlog.h @@ -22,13 +22,17 @@ /* * Declarations for smgr-related XLOG records * - * Note: we log file creation and truncation here, but logging of deletion - * actions is handled by xact.c, because it is part of transaction commit. + * Note: we log file creation, truncation, deletion and persistence change + * here. logging of deletion actions is mainly handled by xact.c, because it is + * part of transaction commit, but we log deletions happens outside of a + * transaction. */ /* XLOG gives us high 4 bits */ #define XLOG_SMGR_CREATE 0x10 #define XLOG_SMGR_TRUNCATE 0x20 +#define XLOG_SMGR_UNLINK 0x30 +#define XLOG_SMGR_BUFPERSISTENCE 0x40 typedef struct xl_smgr_create { @@ -36,6 +40,18 @@ typedef struct xl_smgr_create ForkNumber forkNum; } xl_smgr_create; +typedef struct xl_smgr_unlink +{ + RelFileNode rnode; + ForkNumber forkNum; +} xl_smgr_unlink; + +typedef struct xl_smgr_bufpersistence +{ + RelFileNode rnode; + bool persistence; +} xl_smgr_bufpersistence; + /* flags for xl_smgr_truncate */ #define SMGR_TRUNCATE_HEAP 0x0001 #define SMGR_TRUNCATE_VM 0x0002 @@ -51,6 +67,8 @@ typedef struct xl_smgr_truncate } xl_smgr_truncate; extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum); +extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum); +extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence); extern void smgr_redo(XLogReaderState *record); extern void smgr_desc(StringInfo buf, XLogReaderState 
*record); diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h index a44be11ca0..4305bdbe96 100644 --- a/src/include/common/relpath.h +++ b/src/include/common/relpath.h @@ -43,7 +43,8 @@ typedef enum ForkNumber MAIN_FORKNUM = 0, FSM_FORKNUM, VISIBILITYMAP_FORKNUM, - INIT_FORKNUM + INIT_FORKNUM, + INITTMP_FORKNUM /* * NOTE: if you add a new fork, change MAX_FORKNUM and possibly @@ -52,7 +53,7 @@ typedef enum ForkNumber */ } ForkNumber; -#define MAX_FORKNUM INIT_FORKNUM +#define MAX_FORKNUM INITTMP_FORKNUM #define FORKNAMECHARS 4 /* max chars for a fork name */ diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h index ff6cd0fc54..d9752a8317 100644 --- a/src/include/storage/bufmgr.h +++ b/src/include/storage/bufmgr.h @@ -205,6 +205,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels) extern void FlushDatabaseBuffers(Oid dbid); extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum, int nforks, BlockNumber *firstDelBlock); +extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel, + bool permanent, bool isRedo); extern void DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes); extern void DropDatabaseBuffers(Oid dbid); diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h index ebf4a199dc..8be17d9afc 100644 --- a/src/include/storage/smgr.h +++ b/src/include/storage/smgr.h @@ -86,6 +86,7 @@ extern void smgrclose(SMgrRelation reln); extern void smgrcloseall(void); extern void smgrclosenode(RelFileNodeBackend rnode); extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo); +extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo); extern void smgrdosyncall(SMgrRelation *rels, int nrels); extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo); extern void smgrextend(SMgrRelation reln, ForkNumber forknum, -- 2.27.0 From d5dfe5943ea790384faf431fc0bdfeff6efd49fd Mon Sep 17 00:00:00 
2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Wed, 11 Nov 2020 23:21:09 +0900 Subject: [PATCH v3 2/2] New command ALTER TABLE ALL IN TABLESPACE SET LOGGED/UNLOGGED To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes relation persistence of all tables in the specified tablespace. --- src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++++++++++ src/backend/nodes/copyfuncs.c | 16 ++++ src/backend/nodes/equalfuncs.c | 15 ++++ src/backend/parser/gram.y | 20 +++++ src/backend/tcop/utility.c | 11 +++ src/include/commands/tablecmds.h | 2 + src/include/nodes/nodes.h | 1 + src/include/nodes/parsenodes.h | 9 ++ 8 files changed, 214 insertions(+) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 37a15d31ee..2f65abb19b 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -13696,6 +13696,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt) return new_tablespaceoid; } +/* + * Alter Table ALL ... SET LOGGED/UNLOGGED + * + * Allows a user to change persistence of all objects in a given tablespace in + * the current database. Objects can be chosen based on the owner of the + * object also, to allow users to change persistene only their objects. The + * main permissions handling is done by the lower-level change persistence + * function. + * + * All to-be-modified objects are locked first. If NOWAIT is specified and the + * lock can't be acquired then we ereport(ERROR). 
+ */ +void +AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt) +{ + List *relations = NIL; + ListCell *l; + ScanKeyData key[1]; + Relation rel; + TableScanDesc scan; + HeapTuple tuple; + Oid tablespaceoid; + List *role_oids = roleSpecsToIds(NIL); + + /* Ensure we were not asked to change something we can't */ + if (stmt->objtype != OBJECT_TABLE) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("only tables can be specified"))); + + /* Get the tablespace OID */ + tablespaceoid = get_tablespace_oid(stmt->tablespacename, false); + + /* + * Now that the checks are done, check if we should set either to + * InvalidOid because it is our database's default tablespace. + */ + if (tablespaceoid == MyDatabaseTableSpace) + tablespaceoid = InvalidOid; + + /* + * Walk the list of objects in the tablespace to pick up them. This will + * only find objects in our database, of course. + */ + ScanKeyInit(&key[0], + Anum_pg_class_reltablespace, + BTEqualStrategyNumber, F_OIDEQ, + ObjectIdGetDatum(tablespaceoid)); + + rel = table_open(RelationRelationId, AccessShareLock); + scan = table_beginscan_catalog(rel, 1, key); + while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL) + { + Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple); + Oid relOid = relForm->oid; + + /* + * Do not pick-up objects in pg_catalog as part of this, if an admin + * really wishes to do so, they can issue the individual ALTER + * commands directly. + * + * Also, explicitly avoid any shared tables, temp tables, or TOAST + * (TOAST will be changed with the main table). 
+ */ + if (IsCatalogNamespace(relForm->relnamespace) || + relForm->relisshared || + isAnyTempNamespace(relForm->relnamespace) || + IsToastNamespace(relForm->relnamespace)) + continue; + + /* Only pick up the object type requested */ + if (relForm->relkind != RELKIND_RELATION) + continue; + + /* Check if we are only picking-up objects owned by certain roles */ + if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner)) + continue; + + /* + * Handle permissions-checking here since we are locking the tables + * and also to avoid doing a bunch of work only to fail part-way. Note + * that permissions will also be checked by AlterTableInternal(). + * + * Caller must be considered an owner on the table of which we're going + * to change persistence. + */ + if (!pg_class_ownercheck(relOid, GetUserId())) + aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)), + NameStr(relForm->relname)); + + if (stmt->nowait && + !ConditionalLockRelationOid(relOid, AccessExclusiveLock)) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_IN_USE), + errmsg("aborting because lock on relation \"%s.%s\" is not available", + get_namespace_name(relForm->relnamespace), + NameStr(relForm->relname)))); + else + LockRelationOid(relOid, AccessExclusiveLock); + + /* + * Add to our list of objects of which we're going to change + * persistence. + */ + relations = lappend_oid(relations, relOid); + } + + table_endscan(scan); + table_close(rel, AccessShareLock); + + if (relations == NIL) + ereport(NOTICE, + (errcode(ERRCODE_NO_DATA_FOUND), + errmsg("no matching relations in tablespace \"%s\" found", + tablespaceoid == InvalidOid ? "(database default)" : + get_tablespace_name(tablespaceoid)))); + + /* + * Everything is locked, loop through and change persistence of all of the + * relations. 
+ */ + foreach(l, relations) + { + List *cmds = NIL; + AlterTableCmd *cmd = makeNode(AlterTableCmd); + + if (stmt->logged) + cmd->subtype = AT_SetLogged; + else + cmd->subtype = AT_SetUnLogged; + + cmds = lappend(cmds, cmd); + + EventTriggerAlterTableStart((Node *) stmt); + /* OID is set by AlterTableInternal */ + AlterTableInternal(lfirst_oid(l), cmds, false); + EventTriggerAlterTableEnd(); + } +} + static void index_copy_data(Relation rel, RelFileNode newrnode) { diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c index ba3ccc712c..127da5151d 100644 --- a/src/backend/nodes/copyfuncs.c +++ b/src/backend/nodes/copyfuncs.c @@ -4138,6 +4138,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from) return newnode; } +static AlterTableSetLoggedAllStmt * +_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from) +{ + AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt); + + COPY_STRING_FIELD(tablespacename); + COPY_SCALAR_FIELD(objtype); + COPY_SCALAR_FIELD(logged); + COPY_SCALAR_FIELD(nowait); + + return newnode; +} + static CreateExtensionStmt * _copyCreateExtensionStmt(const CreateExtensionStmt *from) { @@ -5441,6 +5454,9 @@ copyObjectImpl(const void *from) case T_AlterTableMoveAllStmt: retval = _copyAlterTableMoveAllStmt(from); break; + case T_AlterTableSetLoggedAllStmt: + retval = _copyAlterTableSetLoggedAllStmt(from); + break; case T_CreateExtensionStmt: retval = _copyCreateExtensionStmt(from); break; diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c index a2ef853dc2..4f13a1762b 100644 --- a/src/backend/nodes/equalfuncs.c +++ b/src/backend/nodes/equalfuncs.c @@ -1872,6 +1872,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a, return true; } +static bool +_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a, + const AlterTableSetLoggedAllStmt *b) +{ + COMPARE_STRING_FIELD(tablespacename); + COMPARE_SCALAR_FIELD(objtype); + 
COMPARE_SCALAR_FIELD(logged); + COMPARE_SCALAR_FIELD(nowait); + + return true; +} + static bool _equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b) { @@ -3494,6 +3506,9 @@ equal(const void *a, const void *b) case T_AlterTableMoveAllStmt: retval = _equalAlterTableMoveAllStmt(a, b); break; + case T_AlterTableSetLoggedAllStmt: + retval = _equalAlterTableSetLoggedAllStmt(a, b); + break; case T_CreateExtensionStmt: retval = _equalCreateExtensionStmt(a, b); break; diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index 31c95443a5..2222fd8fe3 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -1934,6 +1934,26 @@ AlterTableStmt: n->nowait = $13; $$ = (Node *)n; } + | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->logged = true; + n->nowait = $9; + $$ = (Node *)n; + } + | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->logged = false; + n->nowait = $9; + $$ = (Node *)n; + } | ALTER INDEX qualified_name alter_table_cmds { AlterTableStmt *n = makeNode(AlterTableStmt); diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c index 53a511f1da..16606448bf 100644 --- a/src/backend/tcop/utility.c +++ b/src/backend/tcop/utility.c @@ -161,6 +161,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree) case T_AlterTSConfigurationStmt: case T_AlterTSDictionaryStmt: case T_AlterTableMoveAllStmt: + case T_AlterTableSetLoggedAllStmt: case T_AlterTableSpaceOptionsStmt: case T_AlterTableStmt: case T_AlterTypeStmt: @@ -1732,6 +1733,12 @@ ProcessUtilitySlow(ParseState *pstate, commandCollected = true; break; + case T_AlterTableSetLoggedAllStmt: + AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree); + 
/* commands are stashed in AlterTableSetLoggedAll */ + commandCollected = true; + break; + case T_DropStmt: ExecDropStmt((DropStmt *) parsetree, isTopLevel); /* no commands stashed for DROP */ @@ -2619,6 +2626,10 @@ CreateCommandTag(Node *parsetree) tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype); break; + case T_AlterTableSetLoggedAllStmt: + tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype); + break; + case T_AlterTableStmt: tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype); break; diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h index 08c463d3c4..646928466d 100644 --- a/src/include/commands/tablecmds.h +++ b/src/include/commands/tablecmds.h @@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse); extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt); +extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt); + extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt, Oid *oldschema); diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h index caed683ba9..16d91d3e1d 100644 --- a/src/include/nodes/nodes.h +++ b/src/include/nodes/nodes.h @@ -424,6 +424,7 @@ typedef enum NodeTag T_AlterCollationStmt, T_CallStmt, T_AlterStatsStmt, + T_AlterTableSetLoggedAllStmt, /* * TAGS FOR PARSE TREE NODES (parsenodes.h) diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h index dc2bb40926..c3eab6f1ab 100644 --- a/src/include/nodes/parsenodes.h +++ b/src/include/nodes/parsenodes.h @@ -2253,6 +2253,15 @@ typedef struct AlterTableMoveAllStmt bool nowait; } AlterTableMoveAllStmt; +typedef struct AlterTableSetLoggedAllStmt +{ + NodeTag type; + char *tablespacename; + ObjectType objtype; /* Object type to move */ + bool logged; + bool nowait; +} AlterTableSetLoggedAllStmt; + /* ---------------------- * Create/Alter Extension Statements * ---------------------- -- 2.27.0
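A side note on the fork changes in patch 1/2 above: adding INITTMP_FORKNUM means the ForkNumber enum, MAX_FORKNUM and forkNames[] have to be updated together, and the new name ("itmp") has to fit in FORKNAMECHARS. The following is only a minimal standalone sketch of that invariant, not PostgreSQL code. fork_from_name(), fork_names_fit() and unlink_failure_is_error() are hypothetical helpers made up for illustration, and InvalidForkNumber is assumed to be -1 as in relpath.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fork numbers as they look after the v3 patch (InvalidForkNumber assumed). */
typedef enum ForkNumber
{
    InvalidForkNumber = -1,
    MAIN_FORKNUM = 0,
    FSM_FORKNUM,
    VISIBILITYMAP_FORKNUM,
    INIT_FORKNUM,
    INITTMP_FORKNUM
} ForkNumber;

#define MAX_FORKNUM   INITTMP_FORKNUM
#define FORKNAMECHARS 4         /* max chars for a fork name */

#define lengthof(array) (sizeof(array) / sizeof((array)[0]))

static const char *const forkNames[] = {
    "main",                     /* MAIN_FORKNUM */
    "fsm",                      /* FSM_FORKNUM */
    "vm",                       /* VISIBILITYMAP_FORKNUM */
    "init",                     /* INIT_FORKNUM */
    "itmp"                      /* INITTMP_FORKNUM */
};

/* Compile-time check that the name table and the enum stay in sync. */
_Static_assert(lengthof(forkNames) == (MAX_FORKNUM + 1),
               "array length mismatch");

/* Hypothetical helper: map a fork name back to its number. */
ForkNumber
fork_from_name(const char *name)
{
    for (int f = 0; f <= MAX_FORKNUM; f++)
    {
        if (strcmp(forkNames[f], name) == 0)
            return (ForkNumber) f;
    }
    return InvalidForkNumber;
}

/* Hypothetical helper: returns 1 if every name fits in FORKNAMECHARS. */
int
fork_names_fit(void)
{
    for (int f = 0; f <= MAX_FORKNUM; f++)
    {
        if (strlen(forkNames[f]) > FORKNAMECHARS)
            return 0;
    }
    return 1;
}

/*
 * Hypothetical helper mirroring the md.c hunk: failing to unlink the
 * inittmp fork is reported as ERROR, any other fork as WARNING.
 */
int
unlink_failure_is_error(ForkNumber forkNum)
{
    return forkNum == INITTMP_FORKNUM;
}

#ifdef DEMO_MAIN
#include <stdio.h>

int
main(void)
{
    for (int f = 0; f <= MAX_FORKNUM; f++)
        printf("%d: %s\n", f, forkNames[f]);
    return fork_names_fit() ? 0 : 1;
}
#endif
```

Built with -DDEMO_MAIN (C11 or later) this just lists the fork names. The static assertion is what would catch a patch that bumps MAX_FORKNUM without extending forkNames[].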
At Fri, 08 Jan 2021 14:47:05 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> In this version, RelationChangePersistence() no longer chooses the
> in-place method for indexes other than btree. The method seems to be
> usable with all kinds of indexes other than GiST, but at the moment it
> applies only to btrees.
>
> 1: https://www.postgresql.org/message-id/CA+TgmoZEZ5RONS49C7mEpjhjndqMQtVrz_LCQUkpRWdmRevDnQ@mail.gmail.com

Hmm. This is not working correctly. I'll repost after fixing that.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 08 Jan 2021 17:52:21 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> At Fri, 08 Jan 2021 14:47:05 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > In this version, RelationChangePersistence() no longer chooses the
> > in-place method for indexes other than btree. The method seems to be
> > usable with all kinds of indexes other than GiST, but at the moment it
> > applies only to btrees.
> >
> > 1: https://www.postgresql.org/message-id/CA+TgmoZEZ5RONS49C7mEpjhjndqMQtVrz_LCQUkpRWdmRevDnQ@mail.gmail.com
>
> Hmm. This is not working correctly. I'll repost after fixing that.

I think I fixed the misbehavior. ResetUnloggedRelationsInDbspaceDir() handled file operations in the wrong order and with the wrong logic. It also needed to drop buffers and forget fsync requests.

I thought that the two cases this patch is expected to fix (orphaned relation files and uncommitted init files) could share the same "cleanup" fork, but that was wrong. I had to add one more fork to differentiate the case of SET UNLOGGED from the case of creating UNLOGGED tables...

Attached is a new version, which seems to work correctly but looks somewhat messy. I'll continue working.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

From 88e9374529cbd8f983f2c82baadea94b475e46dd Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Wed, 11 Nov 2020 21:51:11 +0900 Subject: [PATCH v4 1/2] In-place table persistence change Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data rewriting, it currently runs a heap rewrite, which causes a large amount of file I/O. This patch makes the command run without a heap rewrite. In addition, SET LOGGED while wal_level > minimal emits WAL using XLOG_FPI instead of a massive number of HEAP_INSERTs, which should be smaller. This also allows for the cleanup of files left behind when the transaction that created them crashes.
--- src/backend/access/rmgrdesc/smgrdesc.c | 23 ++ src/backend/access/transam/README | 8 + src/backend/access/transam/xlog.c | 17 + src/backend/catalog/storage.c | 436 +++++++++++++++++++++++-- src/backend/commands/tablecmds.c | 246 +++++++++++--- src/backend/storage/buffer/bufmgr.c | 88 +++++ src/backend/storage/file/reinit.c | 322 ++++++++++++------ src/backend/storage/smgr/md.c | 13 +- src/backend/storage/smgr/smgr.c | 6 + src/common/relpath.c | 4 +- src/include/catalog/storage.h | 2 + src/include/catalog/storage_xlog.h | 22 +- src/include/common/relpath.h | 6 +- src/include/storage/bufmgr.h | 2 + src/include/storage/md.h | 2 + src/include/storage/reinit.h | 3 +- src/include/storage/smgr.h | 1 + 17 files changed, 1034 insertions(+), 167 deletions(-) diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c index 7755553d57..2c109b8ca4 100644 --- a/src/backend/access/rmgrdesc/smgrdesc.c +++ b/src/backend/access/rmgrdesc/smgrdesc.c @@ -40,6 +40,23 @@ smgr_desc(StringInfo buf, XLogReaderState *record) xlrec->blkno, xlrec->flags); pfree(path); } + else if (info == XLOG_SMGR_UNLINK) + { + xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec; + char *path = relpathperm(xlrec->rnode, xlrec->forkNum); + + appendStringInfoString(buf, path); + pfree(path); + } + else if (info == XLOG_SMGR_BUFPERSISTENCE) + { + xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec; + char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM); + + appendStringInfoString(buf, path); + appendStringInfo(buf, " persistence %d", xlrec->persistence); + pfree(path); + } } const char * @@ -55,6 +72,12 @@ smgr_identify(uint8 info) case XLOG_SMGR_TRUNCATE: id = "TRUNCATE"; break; + case XLOG_SMGR_UNLINK: + id = "UNLINK"; + break; + case XLOG_SMGR_BUFPERSISTENCE: + id = "BUFPERSISTENCE"; + break; } return id; diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README index 1edc8180c1..547107a771 100644 --- a/src/backend/access/transam/README +++ 
b/src/backend/access/transam/README @@ -724,6 +724,14 @@ we must panic and abort recovery. The DBA will have to manually clean up then restart recovery. This is part of the reason for not writing a WAL entry until we've successfully done the original action. +The CLEANUP fork file +-------------------------------- + +An CLEANUP fork is created when a new relation file is created to mark +the relfilenode needs to be cleaned up at recovery time. In contrast +to 4 above, failure to remove an CLEANUP fork file will lead to data +loss, in which case the server will shut down. + Skipping WAL for New RelFileNode -------------------------------- diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index b18257c198..6dcbcbe387 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -40,6 +40,7 @@ #include "catalog/catversion.h" #include "catalog/pg_control.h" #include "catalog/pg_database.h" +#include "catalog/storage.h" #include "commands/progress.h" #include "commands/tablespace.h" #include "common/controldata_utils.h" @@ -4442,6 +4443,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode, { ereport(DEBUG1, (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery"))); + + /* cleanup garbage files left during crash recovery */ + ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER | + UNLOGGED_RELATION_CLEANUP); + + /* run rollback cleanup if any */ + smgrDoPendingDeletes(false); + InArchiveRecovery = true; if (StandbyModeRequested) StandbyMode = true; @@ -7455,6 +7464,14 @@ StartupXLOG(void) } } + /* cleanup garbage files left during crash recovery */ + if (!InArchiveRecovery) + ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER | + UNLOGGED_RELATION_CLEANUP); + + /* run rollback cleanup if any */ + smgrDoPendingDeletes(false); + /* Allow resource managers to do any required cleanup. 
*/ for (rmid = 0; rmid <= RM_MAX_ID; rmid++) { diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c index cba7a9ada0..c54d70747f 100644 --- a/src/backend/catalog/storage.c +++ b/src/backend/catalog/storage.c @@ -19,6 +19,7 @@ #include "postgres.h" +#include "access/amapi.h" #include "access/parallel.h" #include "access/visibilitymap.h" #include "access/xact.h" @@ -27,6 +28,7 @@ #include "access/xlogutils.h" #include "catalog/storage.h" #include "catalog/storage_xlog.h" +#include "common/hashfn.h" #include "miscadmin.h" #include "storage/freespace.h" #include "storage/smgr.h" @@ -57,9 +59,16 @@ int wal_skip_threshold = 2048; /* in kilobytes */ * but I'm being paranoid. */ +#define PDOP_DELETE (0) +#define PDOP_UNLINK_FORK (1 << 0) +#define PDOP_SET_PERSISTENCE (1 << 1) + typedef struct PendingRelDelete { RelFileNode relnode; /* relation that may need to be deleted */ + int op; /* operation mask */ + bool bufpersistence; /* buffer persistence to set */ + int unlink_forknum; /* forknum to unlink */ BackendId backend; /* InvalidBackendId if not a temp rel */ bool atCommit; /* T=delete at commit; F=delete at abort */ int nestLevel; /* xact nesting level of request */ @@ -75,6 +84,24 @@ typedef struct PendingRelSync static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */ HTAB *pendingSyncHash = NULL; +typedef struct SRelHashEntry +{ + SMgrRelation srel; + char status; /* for simplehash use */ +} SRelHashEntry; + +/* define hashtable for workarea for pending deletes */ +#define SH_PREFIX srelhash +#define SH_ELEMENT_TYPE SRelHashEntry +#define SH_KEY_TYPE SMgrRelation +#define SH_KEY srel +#define SH_HASH_KEY(tb, key) \ + hash_bytes((unsigned char *)&key, sizeof(SMgrRelation)) +#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(SMgrRelation)) == 0) +#define SH_SCOPE static inline +#define SH_DEFINE +#define SH_DECLARE +#include "lib/simplehash.h" /* * AddPendingSync @@ -143,7 +170,17 @@ RelationCreateStorage(RelFileNode rnode, char 
relpersistence) return NULL; /* placate compiler */ } + /* + * We are going to create a new storage file. If server crashes before the + * current transaction ends the file needs to be cleaned up but there's no + * clue to the orphan files. The cleanup fork works as the sentinel to + * identify that situation. + */ srel = smgropen(rnode, backend); + smgrcreate(srel, CLEANUP2_FORKNUM, false); + log_smgrcreate(&rnode, CLEANUP2_FORKNUM); + smgrimmedsync(srel, CLEANUP2_FORKNUM); + smgrcreate(srel, MAIN_FORKNUM, false); if (needs_wal) @@ -153,12 +190,25 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) pending = (PendingRelDelete *) MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); pending->relnode = rnode; + pending->op = PDOP_DELETE; pending->backend = backend; pending->atCommit = false; /* delete if abort */ pending->nestLevel = GetCurrentTransactionNestLevel(); pending->next = pendingDeletes; pendingDeletes = pending; + /* drop cleanup fork at commit */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_FORK; + pending->unlink_forknum = CLEANUP2_FORKNUM; + pending->backend = backend; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded()) { Assert(backend == InvalidBackendId); @@ -168,6 +218,218 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) return srel; } +/* + * RelationCreateInitFork + * Create physical storage for the init fork of a relation. + * + * Create the init fork for the relation. + * + * This function is transactional. The creation is WAL-logged, and if the + * transaction aborts later on, the init fork will be removed. 
+ */ +void +RelationCreateInitFork(Relation rel) +{ + RelFileNode rnode = rel->rd_node; + PendingRelDelete *pending; + SMgrRelation srel; + PendingRelDelete *prev; + PendingRelDelete *next; + bool create = true; + + /* switch buffer persistence */ + SetRelationBuffersPersistence(rel->rd_smgr, false, false); + + /* + * If we have entries for init-fork operation of this relation, that means + * that we have already registered pending delete entries to drop + * preexisting init fork since before the current transaction started. This + * function reverts that change just by removing the entries. + */ + prev = NULL; + for (pending = pendingDeletes; pending != NULL; pending = next) + { + next = pending->next; + if (RelFileNodeEquals(rnode, pending->relnode) && + pending->op != PDOP_DELETE && + ((pending->op & PDOP_UNLINK_FORK) != 0 && + pending->unlink_forknum == CLEANUP_FORKNUM)) + { + if (prev) + prev->next = next; + else + pendingDeletes = next; + pfree(pending); + + create = false; + } + else + { + /* unrelated entry, don't touch it */ + prev = pending; + } + } + + if (!create) + return; + + /* + * We are going to create the init fork. If server crashes before the + * current transaction ends the init fork left alone corrupts data while + * recovery. The cleanup fork works as the sentinel to identify that + * situation. + */ + srel = smgropen(rnode, InvalidBackendId); + smgrcreate(srel, CLEANUP_FORKNUM, false); + log_smgrcreate(&rnode, CLEANUP_FORKNUM); + smgrimmedsync(srel, CLEANUP_FORKNUM); + + /* We don't have existing init fork, create it. */ + smgrcreate(srel, INIT_FORKNUM, false); + + /* + * index-init fork needs further initialization. ambuildempty shoud do + * WAL-log and file sync by itself but otherwise we do that by myself. 
+ */ + if (rel->rd_rel->relkind == RELKIND_INDEX) + rel->rd_indam->ambuildempty(rel); + else + { + log_smgrcreate(&rnode, INIT_FORKNUM); + smgrimmedsync(srel, INIT_FORKNUM); + } + + /* drop this init fork file at abort and revert persistence */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_FORK | PDOP_SET_PERSISTENCE; + pending->unlink_forknum = INIT_FORKNUM; + pending->bufpersistence = true; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + + /* drop cleanup fork at abort */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_FORK; + pending->unlink_forknum = CLEANUP_FORKNUM; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + + /* drop cleanup fork at commit */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_FORK; + pending->unlink_forknum = CLEANUP_FORKNUM; + pending->backend = InvalidBackendId; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; +} + +/* + * RelationDropInitFork + * Delete physical storage for the init fork of a relation. + * + * Register pending-delete of the init fork. The real deletion is performed by + * smgrDoPendingDeletes at commit. + * + * This function is transactional. If the transaction aborts later on, the + * deletion doesn't happen. 
+ */ +void +RelationDropInitFork(Relation rel) +{ + RelFileNode rnode = rel->rd_node; + PendingRelDelete *pending; + PendingRelDelete *prev; + PendingRelDelete *next; + bool inxact_created = false; + + /* switch buffer persistence */ + SetRelationBuffersPersistence(rel->rd_smgr, true, false); + + /* + * If we have entries for init-fork operation of this relation, that means + * that we have created the init fork in the current transaction. We + * immediately remove the init and cleanup forks immediately in that case. + * Otherwise just reister pending-delete for the existing init fork. + */ + prev = NULL; + for (pending = pendingDeletes; pending != NULL; pending = next) + { + next = pending->next; + if (RelFileNodeEquals(rnode, pending->relnode) && + pending->op != PDOP_DELETE && + ((pending->op & PDOP_UNLINK_FORK) != 0 && + pending->unlink_forknum == CLEANUP_FORKNUM)) + { + /* unlink list entry */ + if (prev) + prev->next = next; + else + pendingDeletes = next; + pfree(pending); + + inxact_created = true; + } + else + { + /* unrelated entry, don't touch it */ + prev = pending; + } + } + + if (inxact_created) + { + SMgrRelation srel = smgropen(rnode, InvalidBackendId); + + /* + * INIT/CLEANUP forks never be loaded to shared buffer so no point in + * dropping buffers for these files. 
+ */ + log_smgrunlink(&rnode, INIT_FORKNUM); + smgrunlink(srel, INIT_FORKNUM, false); + log_smgrunlink(&rnode, CLEANUP_FORKNUM); + smgrunlink(srel, CLEANUP_FORKNUM, false); + return; + } + + /* register drop of this init fork file at commit */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_FORK; + pending->unlink_forknum = INIT_FORKNUM; + pending->backend = InvalidBackendId; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + + /* revert buffer-persistence changes at abort */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_SET_PERSISTENCE; + pending->bufpersistence = false; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; +} + /* * Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL. */ @@ -187,6 +449,44 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum) XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE); } +/* + * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL. + */ +void +log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum) +{ + xl_smgr_unlink xlrec; + + /* + * Make an XLOG entry reporting the file unlink. + */ + xlrec.rnode = *rnode; + xlrec.forkNum = forkNum; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE); +} + +/* + * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL. + */ +void +log_smgrbufpersistence(const RelFileNode *rnode, bool persistence) +{ + xl_smgr_bufpersistence xlrec; + + /* + * Make an XLOG entry reporting the change of buffer persistence. 
+ */ + xlrec.rnode = *rnode; + xlrec.persistence = persistence; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE); +} + /* * RelationDropStorage * Schedule unlinking of physical storage at transaction commit. @@ -200,6 +500,7 @@ RelationDropStorage(Relation rel) pending = (PendingRelDelete *) MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); pending->relnode = rel->rd_node; + pending->op = PDOP_DELETE; pending->backend = rel->rd_backend; pending->atCommit = true; /* delete if commit */ pending->nestLevel = GetCurrentTransactionNestLevel(); @@ -602,59 +903,97 @@ smgrDoPendingDeletes(bool isCommit) int nrels = 0, maxrels = 0; SMgrRelation *srels = NULL; + srelhash_hash *close_srels = NULL; + bool found; prev = NULL; for (pending = pendingDeletes; pending != NULL; pending = next) { + SMgrRelation srel; + next = pending->next; if (pending->nestLevel < nestLevel) { /* outer-level entries should not be processed yet */ prev = pending; + continue; } + + /* unlink list entry first, so we don't retry on failure */ + if (prev) + prev->next = next; else + pendingDeletes = next; + + if (pending->atCommit != isCommit) { - /* unlink list entry first, so we don't retry on failure */ - if (prev) - prev->next = next; - else - pendingDeletes = next; - /* do deletion if called for */ - if (pending->atCommit == isCommit) - { - SMgrRelation srel; - - srel = smgropen(pending->relnode, pending->backend); - - /* allocate the initial array, or extend it, if needed */ - if (maxrels == 0) - { - maxrels = 8; - srels = palloc(sizeof(SMgrRelation) * maxrels); - } - else if (maxrels <= nrels) - { - maxrels *= 2; - srels = repalloc(srels, sizeof(SMgrRelation) * maxrels); - } - - srels[nrels++] = srel; - } /* must explicitly free the list entry */ pfree(pending); /* prev does not change */ + continue; + } + + if (close_srels == NULL) + close_srels = 
srelhash_create(CurrentMemoryContext, 32, NULL); + + srel = smgropen(pending->relnode, pending->backend); + + /* Uniquify the smgr relations */ + srelhash_insert(close_srels, srel, &found); + + if (pending->op != PDOP_DELETE) + { + if (pending->op & PDOP_UNLINK_FORK) + { + /* other forks would need their buffers dropped */ + Assert(pending->unlink_forknum == INIT_FORKNUM || + pending->unlink_forknum == CLEANUP_FORKNUM || + pending->unlink_forknum == CLEANUP2_FORKNUM); + + log_smgrunlink(&pending->relnode, pending->unlink_forknum); + smgrunlink(srel, pending->unlink_forknum, false); + + } + + if (pending->op & PDOP_SET_PERSISTENCE) + SetRelationBuffersPersistence(srel, pending->bufpersistence, + InRecovery); + } + else + { + /* allocate the initial array, or extend it, if needed */ + if (maxrels == 0) + { + maxrels = 8; + srels = palloc(sizeof(SMgrRelation) * maxrels); + } + else if (maxrels <= nrels) + { + maxrels *= 2; + srels = repalloc(srels, sizeof(SMgrRelation) * maxrels); + } + + srels[nrels++] = srel; } } if (nrels > 0) { smgrdounlinkall(srels, nrels, false); - - for (int i = 0; i < nrels; i++) - smgrclose(srels[i]); - pfree(srels); } + + if (close_srels) + { + srelhash_iterator i; + SRelHashEntry *ent; + + /* close smgr relations */ + srelhash_start_iterate(close_srels, &i); + while ((ent = srelhash_iterate(close_srels, &i)) != NULL) + smgrclose(ent->srel); + srelhash_destroy(close_srels); + } } /* @@ -824,7 +1163,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr) for (pending = pendingDeletes; pending != NULL; pending = pending->next) { if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit - && pending->backend == InvalidBackendId) + && pending->backend == InvalidBackendId + && pending->op == PDOP_DELETE) nrels++; } if (nrels == 0) @@ -837,7 +1177,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr) for (pending = pendingDeletes; pending != NULL; pending = pending->next) { if (pending->nestLevel >= nestLevel && pending->atCommit == 
forCommit - && pending->backend == InvalidBackendId) + && pending->backend == InvalidBackendId && + pending->op == PDOP_DELETE) { *rptr = pending->relnode; rptr++; @@ -917,6 +1258,15 @@ smgr_redo(XLogReaderState *record) reln = smgropen(xlrec->rnode, InvalidBackendId); smgrcreate(reln, xlrec->forkNum, true); } + else if (info == XLOG_SMGR_UNLINK) + { + xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record); + SMgrRelation reln; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + smgrunlink(reln, xlrec->forkNum, true); + smgrclose(reln); + } else if (info == XLOG_SMGR_TRUNCATE) { xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record); @@ -1005,6 +1355,28 @@ smgr_redo(XLogReaderState *record) FreeFakeRelcacheEntry(rel); } + else if (info == XLOG_SMGR_BUFPERSISTENCE) + { + xl_smgr_bufpersistence *xlrec = + (xl_smgr_bufpersistence *) XLogRecGetData(record); + SMgrRelation reln; + PendingRelDelete *pending; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + SetRelationBuffersPersistence(reln, xlrec->persistence, true); + + /* revert buffer-persistence changes at abort */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = xlrec->rnode; + pending->op = PDOP_SET_PERSISTENCE; + pending->bufpersistence = !xlrec->persistence; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + } else elog(PANIC, "smgr_redo: unknown op code %u", info); } diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 993da56d43..37a15d31ee 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -50,6 +50,7 @@ #include "commands/defrem.h" #include "commands/event_trigger.h" #include "commands/policy.h" +#include "commands/progress.h" #include "commands/sequence.h" #include "commands/tablecmds.h" #include 
"commands/tablespace.h" @@ -4917,6 +4918,170 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel, return newcmd; } +/* + * RelationChangePersistence: do in-place persistence change of a relation + */ +static void +RelationChangePersistence(AlteredTableInfo *tab, char persistence, + LOCKMODE lockmode) +{ + Relation rel; + Relation classRel; + HeapTuple tuple, + newtuple; + Datum new_val[Natts_pg_class]; + bool new_null[Natts_pg_class], + new_repl[Natts_pg_class]; + int i; + List *relids; + ListCell *lc_oid; + + Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE); + Assert(lockmode == AccessExclusiveLock); + + /* + * If any of the following were set we would need to call ATRewriteTable, + * which cannot happen in the AT_REWRITE_ALTER_PERSISTENCE case. + */ + Assert(tab->constraints == NULL && tab->partition_constraint == NULL && + tab->newvals == NULL && !tab->verify_new_notnull); + + rel = table_open(tab->relid, lockmode); + + Assert(rel->rd_rel->relpersistence != persistence); + + elog(DEBUG1, "perform in-place persistence change"); + + RelationOpenSmgr(rel); + + /* + * First we collect all relations that we need to change persistence. 
+ */ + + /* Collect OIDs of indexes and toast relations */ + relids = RelationGetIndexList(rel); + relids = lcons_oid(rel->rd_id, relids); + + /* Add toast relation if any */ + if (OidIsValid(rel->rd_rel->reltoastrelid)) + { + List *toastidx; + Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode); + + RelationOpenSmgr(toastrel); + relids = lappend_oid(relids, rel->rd_rel->reltoastrelid); + toastidx = RelationGetIndexList(toastrel); + relids = list_concat(relids, toastidx); + pfree(toastidx); + table_close(toastrel, NoLock); + } + + table_close(rel, NoLock); + + /* Make changes in storage */ + classRel = table_open(RelationRelationId, RowExclusiveLock); + + foreach (lc_oid, relids) + { + Oid reloid = lfirst_oid(lc_oid); + Relation r = relation_open(reloid, lockmode); + + /* + * Some access methods do not accept in-place persistence change. For + * example, GiST uses page LSNs to figure out whether a block has + * changed, where UNLOGGED GiST indexes use fake LSNs that are + * incompatible with real LSNs used for LOGGED ones. + * + * XXXX: We don't bother allowing in-place persistence change for index + * methods other than btree for now. + */ + if (r->rd_rel->relkind == RELKIND_INDEX && + r->rd_rel->relam != BTREE_AM_OID) + { + int reindex_flags; + + /* reindex doesn't allow concurrent use of the index */ + table_close(r, NoLock); + + reindex_flags = + REINDEX_REL_SUPPRESS_INDEX_USE | + REINDEX_REL_CHECK_CONSTRAINTS; + + /* Set the same persistence with the parent relation. 
*/ + if (persistence == RELPERSISTENCE_UNLOGGED) + reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED; + else + reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT; + + reindex_index(reloid, reindex_flags, persistence, 0); + + continue; + } + + RelationOpenSmgr(r); + + /* Create or drop init fork */ + if (persistence == RELPERSISTENCE_UNLOGGED) + RelationCreateInitFork(r); + else + RelationDropInitFork(r); + + /* + * When this relation gets WAL-logged, immediately sync all files but + * the init fork to establish the initial state on storage. Buffers have + * already been flushed out by RelationCreate(Drop)InitFork, called just + * above. The init fork should have been synced as needed. + */ + if (persistence == RELPERSISTENCE_PERMANENT) + { + for (i = 0 ; i < INIT_FORKNUM ; i++) + { + if (smgrexists(r->rd_smgr, i)) + smgrimmedsync(r->rd_smgr, i); + } + } + + /* Update catalog */ + tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid)); + if (!HeapTupleIsValid(tuple)) + elog(ERROR, "cache lookup failed for relation %u", reloid); + + memset(new_val, 0, sizeof(new_val)); + memset(new_null, false, sizeof(new_null)); + memset(new_repl, false, sizeof(new_repl)); + + new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence); + new_null[Anum_pg_class_relpersistence - 1] = false; + new_repl[Anum_pg_class_relpersistence - 1] = true; + + newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel), + new_val, new_null, new_repl); + + CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple); + heap_freetuple(newtuple); + + /* + * While wal_level >= replica, switching to LOGGED requires the + * relation content to be WAL-logged to recover the table. 
+ */ + if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded()) + { + ForkNumber fork; + + for (fork = 0; fork < INIT_FORKNUM ; fork++) + { + if (smgrexists(r->rd_smgr, fork)) + log_newpage_range(r, fork, + 0, smgrnblocks(r->rd_smgr, fork), false); + } + } + + table_close(r, NoLock); + } + + table_close(classRel, NoLock); +} + /* * ATRewriteTables: ALTER TABLE phase 3 */ @@ -5037,45 +5202,52 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode, tab->relid, tab->rewrite); - /* - * Create transient table that will receive the modified data. - * - * Ensure it is marked correctly as logged or unlogged. We have - * to do this here so that buffers for the new relfilenode will - * have the right persistence set, and at the same time ensure - * that the original filenode's buffers will get read in with the - * correct setting (i.e. the original one). Otherwise a rollback - * after the rewrite would possibly result with buffers for the - * original filenode having the wrong persistence setting. - * - * NB: This relies on swap_relation_files() also swapping the - * persistence. That wouldn't work for pg_class, but that can't be - * unlogged anyway. - */ - OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, persistence, - lockmode); + if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE) + RelationChangePersistence(tab, persistence, lockmode); + else + { + /* + * Create transient table that will receive the modified data. + * + * Ensure it is marked correctly as logged or unlogged. We + * have to do this here so that buffers for the new relfilenode + * will have the right persistence set, and at the same time + * ensure that the original filenode's buffers will get read in + * with the correct setting (i.e. the original one). Otherwise + * a rollback after the rewrite would possibly result with + * buffers for the original filenode having the wrong + * persistence setting. 
+ * + * NB: This relies on swap_relation_files() also swapping the + * persistence. That wouldn't work for pg_class, but that can't + * be unlogged anyway. + */ + OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, persistence, + lockmode); - /* - * Copy the heap data into the new table with the desired - * modifications, and test the current data within the table - * against new constraints generated by ALTER TABLE commands. - */ - ATRewriteTable(tab, OIDNewHeap, lockmode); + /* + * Copy the heap data into the new table with the desired + * modifications, and test the current data within the table + * against new constraints generated by ALTER TABLE commands. + */ + ATRewriteTable(tab, OIDNewHeap, lockmode); - /* - * Swap the physical files of the old and new heaps, then rebuild - * indexes and discard the old heap. We can use RecentXmin for - * the table's new relfrozenxid because we rewrote all the tuples - * in ATRewriteTable, so no older Xid remains in the table. Also, - * we never try to swap toast tables by content, since we have no - * interest in letting this code work on system catalogs. - */ - finish_heap_swap(tab->relid, OIDNewHeap, - false, false, true, - !OidIsValid(tab->newTableSpace), - RecentXmin, - ReadNextMultiXactId(), - persistence); + /* + * Swap the physical files of the old and new heaps, then + * rebuild indexes and discard the old heap. We can use + * RecentXmin for the table's new relfrozenxid because we + * rewrote all the tuples in ATRewriteTable, so no older Xid + * remains in the table. Also, we never try to swap toast + * tables by content, since we have no interest in letting this + * code work on system catalogs. 
+ */ + finish_heap_swap(tab->relid, OIDNewHeap, + false, false, true, + !OidIsValid(tab->newTableSpace), + RecentXmin, + ReadNextMultiXactId(), + persistence); + } } else { diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 71b5852224..b730b4417c 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -37,6 +37,7 @@ #include "access/xlog.h" #include "catalog/catalog.h" #include "catalog/storage.h" +#include "catalog/storage_xlog.h" #include "executor/instrument.h" #include "lib/binaryheap.h" #include "miscadmin.h" @@ -3032,6 +3033,93 @@ DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum, } } +/* --------------------------------------------------------------------- + * SetRelationBuffersPersistence + * + * This function changes the persistence of all buffer pages of a relation + * and, when switching to PERMANENT, writes all dirty pages of the relation + * out to disk (or more accurately, out to kernel disk buffers), ensuring + * that the kernel has an up-to-date view of the relation. + * + * Generally, the caller should be holding AccessExclusiveLock on the + * target relation to ensure that no other backend is busy dirtying + * more blocks of the relation; the effects can't be expected to last + * after the lock is released. + * + * XXX currently it sequentially searches the buffer pool, should be + * changed to more clever ways of searching. This routine is not + * used in any performance-critical code paths, so it's not worth + * adding additional overhead to normal paths to make it go faster; + * but see also DropRelFileNodeBuffers. 
+ * -------------------------------------------------------------------- + */ +void +SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo) +{ + int i; + RelFileNodeBackend rnode = srel->smgr_rnode; + + Assert (!RelFileNodeBackendIsTemp(rnode)); + + if (!isRedo) + log_smgrbufpersistence(&srel->smgr_rnode.node, permanent); + + ResourceOwnerEnlargeBuffers(CurrentResourceOwner); + + for (i = 0; i < NBuffers; i++) + { + BufferDesc *bufHdr = GetBufferDescriptor(i); + uint32 buf_state; + + if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node)) + continue; + + ReservePrivateRefCountEntry(); + + buf_state = LockBufHdr(bufHdr); + + if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node)) + { + UnlockBufHdr(bufHdr, buf_state); + continue; + } + + if (permanent) + { + /* Init fork is being dropped, drop buffers for it. */ + if (bufHdr->tag.forkNum == INIT_FORKNUM) + { + InvalidateBuffer(bufHdr); + continue; + } + + buf_state |= BM_PERMANENT; + pg_atomic_write_u32(&bufHdr->state, buf_state); + + /* we flush this buffer when switching to PERMANENT */ + if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY)) + { + PinBuffer_Locked(bufHdr); + LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), + LW_SHARED); + FlushBuffer(bufHdr, srel); + LWLockRelease(BufferDescriptorGetContentLock(bufHdr)); + UnpinBuffer(bufHdr, true); + } + else + UnlockBufHdr(bufHdr, buf_state); + } + else + { + /* init fork is always BM_PERMANENT. 
See BufferAlloc */ + if (bufHdr->tag.forkNum != INIT_FORKNUM) + buf_state &= ~BM_PERMANENT; + + UnlockBufHdr(bufHdr, buf_state); + } + } +} + /* --------------------------------------------------------------------- * DropRelFileNodesAllBuffers * diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c index 40c758d789..b07709bc4f 100644 --- a/src/backend/storage/file/reinit.c +++ b/src/backend/storage/file/reinit.c @@ -16,29 +16,50 @@ #include <unistd.h> +#include "access/xlog.h" +#include "catalog/pg_tablespace_d.h" #include "common/relpath.h" +#include "storage/bufmgr.h" #include "storage/copydir.h" #include "storage/fd.h" +#include "storage/md.h" #include "storage/reinit.h" +#include "storage/smgr.h" #include "utils/hsearch.h" #include "utils/memutils.h" static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, - int op); + Oid tspid, int op); static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, - int op); + Oid tspid, Oid dbid, int op); typedef struct { Oid reloid; /* hash key */ -} unlogged_relation_entry; + bool has_init; /* has INIT fork */ + bool dirty_init; /* needs to remove INIT fork */ + bool dirty_all; /* needs to remove all forks */ +} relfile_entry; /* - * Reset unlogged relations from before the last restart. + * Clean up and reset relation files from before the last restart. * - * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any - * relation with an "init" fork, except for the "init" fork itself. + * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations + * depending on the existence of the "cleanup" forks. * + * If CLEANUP_FORKNUM (clup) is present, we remove the init fork of the same + * relation along with the clup fork. + * + * If CLEANUP2_FORKNUM (cln2) is present we remove the whole relation along + * with the cln2 fork. + * + * Otherwise, if the "init" fork is found. 
we remove all forks of any relation + * with the "init" fork, except for the "init" fork itself. + * + * + * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all + * relations that have the "cleanup" and/or the "init" forks. + * * If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main * fork. */ @@ -68,7 +89,7 @@ ResetUnloggedRelations(int op) /* * First process unlogged files in pg_default ($PGDATA/base) */ - ResetUnloggedRelationsInTablespaceDir("base", op); + ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op); /* * Cycle through directories for all non-default tablespaces. @@ -77,13 +98,19 @@ ResetUnloggedRelations(int op) while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL) { + Oid tspid; + if (strcmp(spc_de->d_name, ".") == 0 || strcmp(spc_de->d_name, "..") == 0) continue; snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s", spc_de->d_name, TABLESPACE_VERSION_DIRECTORY); - ResetUnloggedRelationsInTablespaceDir(temp_path, op); + + tspid = atooid(spc_de->d_name); + Assert(tspid != 0); + + ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op); } FreeDir(spc_dir); @@ -99,7 +126,8 @@ ResetUnloggedRelations(int op) * Process one tablespace directory for ResetUnloggedRelations */ static void -ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) +ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, + Oid tspid, int op) { DIR *ts_dir; struct dirent *de; @@ -126,6 +154,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) while ((de = ReadDir(ts_dir, tsdirname)) != NULL) { + Oid dbid; + /* * We're only interested in the per-database directories, which have * numeric names. Note that this code will also (properly) ignore "." 
@@ -136,7 +166,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) snprintf(dbspace_path, sizeof(dbspace_path), "%s/%s", tsdirname, de->d_name); - ResetUnloggedRelationsInDbspaceDir(dbspace_path, op); + dbid = atooid(de->d_name); + Assert(dbid != 0); + + ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op); } FreeDir(ts_dir); @@ -146,125 +179,232 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) * Process one per-dbspace directory for ResetUnloggedRelations */ static void -ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) +ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, + Oid tspid, Oid dbid, int op) { DIR *dbspace_dir; struct dirent *de; char rm_path[MAXPGPATH * 2]; + HTAB *hash; + HASHCTL ctl; /* Caller must specify at least one operation. */ - Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0); + Assert((op & (UNLOGGED_RELATION_CLEANUP | + UNLOGGED_RELATION_DROP_BUFFER | + UNLOGGED_RELATION_INIT)) != 0); /* * Cleanup is a two-pass operation. First, we go through and identify all * the files with init forks. Then, we go through again and nuke * everything with the same OID except the init fork. */ + + /* + * It's possible that someone could create a ton of unlogged relations + * in the same database & tablespace, so we'd better use a hash table + * rather than an array or linked list to keep track of which files + * need to be reset. Otherwise, this cleanup operation would be + * O(n^2). + */ + memset(&ctl, 0, sizeof(ctl)); + ctl.keysize = sizeof(Oid); + ctl.entrysize = sizeof(relfile_entry); + hash = hash_create("relfilenode cleanup hash", + 32, &ctl, HASH_ELEM | HASH_BLOBS); + + /* Collect INIT and CLEANUP forks in the directory. */ + dbspace_dir = AllocateDir(dbspacedirname); + while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) + { + int oidchars; + ForkNumber forkNum; + + /* Skip anything that doesn't look like a relation data file. 
*/ + if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, + &forkNum)) + continue; + + if (forkNum == INIT_FORKNUM || + forkNum == CLEANUP_FORKNUM || forkNum == CLEANUP2_FORKNUM) + { + Oid key; + relfile_entry *ent; + bool found; + + /* + * Record the relfilenode information. If it has the CLEANUP fork, + * the relfilenode is in a dirty state and needs cleanup. + */ + key = atooid(de->d_name); + ent = hash_search(hash, &key, HASH_ENTER, &found); + + if (!found) + { + ent->has_init = false; + ent->dirty_init = false; + ent->dirty_all = false; + } + + if (forkNum == CLEANUP_FORKNUM) + ent->dirty_init = true; + else if (forkNum == CLEANUP2_FORKNUM) + ent->dirty_all = true; + else + { + Assert(forkNum == INIT_FORKNUM); + ent->has_init = true; + } + } + } + + /* Done with the first pass. */ + FreeDir(dbspace_dir); + + /* nothing to do if we have neither init nor cleanup forks */ + if (hash_get_num_entries(hash) < 1) + { + hash_destroy(hash); + return; + } + + if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0) + { + /* + * When we come here after recovery, an smgr object for this file might + * have been created. In that case we need to drop all buffers then the + * smgr object before initializing the unlogged relation. This is safe + * as long as no other backends have accessed the relation before + * starting archive recovery. + */ + HASH_SEQ_STATUS status; + relfile_entry *ent; + SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8); + int maxrels = 8; + int nrels = 0; + RelFileNodeBackend *rnodes; + int i; + + Assert(!HotStandbyActive()); + + hash_seq_init(&status, hash); + while((ent = (relfile_entry *) hash_seq_search(&status)) != NULL) + { + RelFileNodeBackend rel; + + /* + * The relation is persistent and remains persistent. Don't + * drop the buffers for this relation. 
+ */ + if (ent->has_init && ent->dirty_init) + continue; + + if (maxrels <= nrels) + { + maxrels *= 2; + srels = repalloc(srels, sizeof(SMgrRelation) * maxrels); + } + + rel.backend = InvalidBackendId; + rel.node.spcNode = tspid; + rel.node.dbNode = dbid; + rel.node.relNode = ent->reloid; + + srels[nrels++] = smgropen(rel.node, InvalidBackendId); + } + + rnodes = palloc(sizeof(RelFileNodeBackend) * nrels); + + for (i = 0 ; i < nrels ; i++) + rnodes[i] = srels[i]->smgr_rnode; + + DropRelFileNodesAllBuffers(rnodes, nrels); + + for (i = 0 ; i < nrels ; i++) + smgrclose(srels[i]); + } + + /* + * Now, make a second pass and remove anything that matches. + */ if ((op & UNLOGGED_RELATION_CLEANUP) != 0) { - HTAB *hash; - HASHCTL ctl; - - /* - * It's possible that someone could create a ton of unlogged relations - * in the same database & tablespace, so we'd better use a hash table - * rather than an array or linked list to keep track of which files - * need to be reset. Otherwise, this cleanup operation would be - * O(n^2). - */ - ctl.keysize = sizeof(Oid); - ctl.entrysize = sizeof(unlogged_relation_entry); - ctl.hcxt = CurrentMemoryContext; - hash = hash_create("unlogged relation OIDs", 32, &ctl, - HASH_ELEM | HASH_BLOBS | HASH_CONTEXT); - - /* Scan the directory. */ - dbspace_dir = AllocateDir(dbspacedirname); - while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) - { - ForkNumber forkNum; - int oidchars; - unlogged_relation_entry ent; - - /* Skip anything that doesn't look like a relation data file. */ - if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) - continue; - - /* Also skip it unless this is the init fork. */ - if (forkNum != INIT_FORKNUM) - continue; - - /* - * Put the OID portion of the name into the hash table, if it - * isn't already. - */ - ent.reloid = atooid(de->d_name); - (void) hash_search(hash, &ent, HASH_ENTER, NULL); - } - - /* Done with the first pass. 
*/ - FreeDir(dbspace_dir); - - /* - * If we didn't find any init forks, there's no point in continuing; - * we can bail out now. - */ - if (hash_get_num_entries(hash) == 0) - { - hash_destroy(hash); - return; - } - - /* - * Now, make a second pass and remove anything that matches. - */ dbspace_dir = AllocateDir(dbspacedirname); while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) { - ForkNumber forkNum; - int oidchars; - unlogged_relation_entry ent; + ForkNumber forkNum; + int oidchars; + Oid key; + relfile_entry *ent; + RelFileNodeBackend rel; /* Skip anything that doesn't look like a relation data file. */ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, &forkNum)) continue; - /* We never remove the init fork. */ - if (forkNum == INIT_FORKNUM) - continue; - /* * See whether the OID portion of the name shows up in the hash * table. If so, nuke it! */ - ent.reloid = atooid(de->d_name); - if (hash_search(hash, &ent, HASH_FIND, NULL)) + key = atooid(de->d_name); + ent = hash_search(hash, &key, HASH_FIND, NULL); + + if (!ent) + continue; + + if (!ent->dirty_all) { - snprintf(rm_path, sizeof(rm_path), "%s/%s", - dbspacedirname, de->d_name); - if (unlink(rm_path) < 0) - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not remove file \"%s\": %m", - rm_path))); + /* clean permanent relations don't need cleanup */ + if (!ent->has_init) + continue; + + if (ent->dirty_init) + { + /* + * The crashed transaction did SET UNLOGGED. This relation + * is restored to a LOGGED relation. + */ + if (forkNum != INIT_FORKNUM && forkNum != CLEANUP_FORKNUM) + continue; + } else - elog(DEBUG2, "unlinked file \"%s\"", rm_path); + { + /* + * we don't remove the INIT fork of a non-dirty + * relfilenode + */ + if (forkNum == INIT_FORKNUM) + continue; + } } + + /* so, nuke it! 
*/ + snprintf(rm_path, sizeof(rm_path), "%s/%s", + dbspacedirname, de->d_name); + if (unlink(rm_path) < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not remove file \"%s\": %m", + rm_path))); + + rel.backend = InvalidBackendId; + rel.node.spcNode = tspid; + rel.node.dbNode = dbid; + rel.node.relNode = atooid(de->d_name); + + ForgetRelationForkSyncRequests(rel, forkNum); } /* Cleanup is complete. */ FreeDir(dbspace_dir); - hash_destroy(hash); } + hash_destroy(hash); + hash = NULL; + /* * Initialization happens after cleanup is complete: we copy each init - * fork file to the corresponding main fork file. Note that if we are - * asked to do both cleanup and init, we may never get here: if the - * cleanup code determines that there are no init forks in this dbspace, - * it will return before we get to this point. + * fork file to the corresponding main fork file. */ if ((op & UNLOGGED_RELATION_INIT) != 0) { diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index 0643d714fb..6b37195c52 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -338,8 +338,10 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo) if (ret == 0 || errno != ENOENT) { ret = unlink(path); + + /* failure of removing cleanup fork leads to a data loss. */ if (ret < 0 && errno != ENOENT) - ereport(WARNING, + ereport((forkNum != CLEANUP_FORKNUM ? 
WARNING : ERROR), (errcode_for_file_access(), errmsg("could not remove file \"%s\": %m", path))); } @@ -1024,6 +1026,15 @@ register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum, RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ ); } +/* + * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork + */ +void +ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, ForkNumber forknum) +{ + register_forget_request(rnode, forknum, 0); +} + /* * ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB */ diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c index 0f31ff3822..4102d3d59c 100644 --- a/src/backend/storage/smgr/smgr.c +++ b/src/backend/storage/smgr/smgr.c @@ -644,6 +644,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum) smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum); } +void +smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo); +} + /* * AtEOXact_SMgr * diff --git a/src/common/relpath.c b/src/common/relpath.c index 1f5c426ec0..479dcc248e 100644 --- a/src/common/relpath.c +++ b/src/common/relpath.c @@ -34,7 +34,9 @@ const char *const forkNames[] = { "main", /* MAIN_FORKNUM */ "fsm", /* FSM_FORKNUM */ "vm", /* VISIBILITYMAP_FORKNUM */ - "init" /* INIT_FORKNUM */ + "init", /* INIT_FORKNUM */ + "clup", /* CLEANUP_FORKNUM */ + "cln2" /* CLEANUP2_FORKNUM */ }; StaticAssertDecl(lengthof(forkNames) == (MAX_FORKNUM + 1), diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h index 0ab32b44e9..382623159c 100644 --- a/src/include/catalog/storage.h +++ b/src/include/catalog/storage.h @@ -23,6 +23,8 @@ extern int wal_skip_threshold; extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence); +extern void RelationCreateInitFork(Relation rel); +extern void RelationDropInitFork(Relation rel); extern void RelationDropStorage(Relation rel); 
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit); extern void RelationPreTruncate(Relation rel); diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h index f0814f1458..0fd0832a8b 100644 --- a/src/include/catalog/storage_xlog.h +++ b/src/include/catalog/storage_xlog.h @@ -22,13 +22,17 @@ /* * Declarations for smgr-related XLOG records * - * Note: we log file creation and truncation here, but logging of deletion - * actions is handled by xact.c, because it is part of transaction commit. + * Note: we log file creation, truncation, deletion and persistence change + * here. Logging of deletion actions is mainly handled by xact.c, because it + * is part of transaction commit, but we log deletions that happen outside of + * a transaction here. */ /* XLOG gives us high 4 bits */ #define XLOG_SMGR_CREATE 0x10 #define XLOG_SMGR_TRUNCATE 0x20 +#define XLOG_SMGR_UNLINK 0x30 +#define XLOG_SMGR_BUFPERSISTENCE 0x40 typedef struct xl_smgr_create { @@ -36,6 +40,18 @@ typedef struct xl_smgr_create ForkNumber forkNum; } xl_smgr_create; +typedef struct xl_smgr_unlink +{ + RelFileNode rnode; + ForkNumber forkNum; +} xl_smgr_unlink; + +typedef struct xl_smgr_bufpersistence +{ + RelFileNode rnode; + bool persistence; +} xl_smgr_bufpersistence; + /* flags for xl_smgr_truncate */ #define SMGR_TRUNCATE_HEAP 0x0001 #define SMGR_TRUNCATE_VM 0x0002 @@ -51,6 +67,8 @@ typedef struct xl_smgr_truncate } xl_smgr_truncate; extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum); +extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum); +extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence); extern void smgr_redo(XLogReaderState *record); extern void smgr_desc(StringInfo buf, XLogReaderState *record); diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h index a44be11ca0..040070aa2b 100644 --- a/src/include/common/relpath.h +++ b/src/include/common/relpath.h @@ -43,7 +43,9 @@ 
typedef enum ForkNumber MAIN_FORKNUM = 0, FSM_FORKNUM, VISIBILITYMAP_FORKNUM, - INIT_FORKNUM + INIT_FORKNUM, + CLEANUP_FORKNUM, + CLEANUP2_FORKNUM /* * NOTE: if you add a new fork, change MAX_FORKNUM and possibly @@ -52,7 +54,7 @@ typedef enum ForkNumber */ } ForkNumber; -#define MAX_FORKNUM INIT_FORKNUM +#define MAX_FORKNUM CLEANUP2_FORKNUM #define FORKNAMECHARS 4 /* max chars for a fork name */ diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h index ff6cd0fc54..d9752a8317 100644 --- a/src/include/storage/bufmgr.h +++ b/src/include/storage/bufmgr.h @@ -205,6 +205,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels) extern void FlushDatabaseBuffers(Oid dbid); extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum, int nforks, BlockNumber *firstDelBlock); +extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel, + bool permanent, bool isRedo); extern void DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes); extern void DropDatabaseBuffers(Oid dbid); diff --git a/src/include/storage/md.h b/src/include/storage/md.h index 752b440864..3cbbbf2edd 100644 --- a/src/include/storage/md.h +++ b/src/include/storage/md.h @@ -41,6 +41,8 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks); extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum); +extern void ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, + ForkNumber forknum); extern void ForgetDatabaseSyncRequests(Oid dbid); extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo); diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h index fad1e5c473..b969ba8e86 100644 --- a/src/include/storage/reinit.h +++ b/src/include/storage/reinit.h @@ -23,6 +23,7 @@ extern bool parse_filename_for_nontemp_relation(const char *name, int *oidchars, ForkNumber *fork); #define UNLOGGED_RELATION_CLEANUP 0x0001 -#define UNLOGGED_RELATION_INIT 
0x0002 +#define UNLOGGED_RELATION_DROP_BUFFER 0x0002 +#define UNLOGGED_RELATION_INIT 0x0004 #endif /* REINIT_H */ diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h index ebf4a199dc..8be17d9afc 100644 --- a/src/include/storage/smgr.h +++ b/src/include/storage/smgr.h @@ -86,6 +86,7 @@ extern void smgrclose(SMgrRelation reln); extern void smgrcloseall(void); extern void smgrclosenode(RelFileNodeBackend rnode); extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo); +extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo); extern void smgrdosyncall(SMgrRelation *rels, int nrels); extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo); extern void smgrextend(SMgrRelation reln, ForkNumber forknum, -- 2.27.0 From 70d300969fbd2aae6c66b36f6100d3d2516a0dab Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Wed, 11 Nov 2020 23:21:09 +0900 Subject: [PATCH v4 2/2] New command ALTER TABLE ALL IN TABLESPACE SET LOGGED/UNLOGGED To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes relation persistence of all tables in the specified tablespace. --- src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++++++++++ src/backend/nodes/copyfuncs.c | 16 ++++ src/backend/nodes/equalfuncs.c | 15 ++++ src/backend/parser/gram.y | 20 +++++ src/backend/tcop/utility.c | 11 +++ src/include/commands/tablecmds.h | 2 + src/include/nodes/nodes.h | 1 + src/include/nodes/parsenodes.h | 9 ++ 8 files changed, 214 insertions(+) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 37a15d31ee..2f65abb19b 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -13696,6 +13696,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt) return new_tablespaceoid; } +/* + * Alter Table ALL ... SET LOGGED/UNLOGGED + * + * Allows a user to change persistence of all objects in a given tablespace in + * the current database. 
Objects can be chosen based on the owner of the + * object also, to allow users to change persistene only their objects. The + * main permissions handling is done by the lower-level change persistence + * function. + * + * All to-be-modified objects are locked first. If NOWAIT is specified and the + * lock can't be acquired then we ereport(ERROR). + */ +void +AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt) +{ + List *relations = NIL; + ListCell *l; + ScanKeyData key[1]; + Relation rel; + TableScanDesc scan; + HeapTuple tuple; + Oid tablespaceoid; + List *role_oids = roleSpecsToIds(NIL); + + /* Ensure we were not asked to change something we can't */ + if (stmt->objtype != OBJECT_TABLE) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("only tables can be specified"))); + + /* Get the tablespace OID */ + tablespaceoid = get_tablespace_oid(stmt->tablespacename, false); + + /* + * Now that the checks are done, check if we should set either to + * InvalidOid because it is our database's default tablespace. + */ + if (tablespaceoid == MyDatabaseTableSpace) + tablespaceoid = InvalidOid; + + /* + * Walk the list of objects in the tablespace to pick up them. This will + * only find objects in our database, of course. + */ + ScanKeyInit(&key[0], + Anum_pg_class_reltablespace, + BTEqualStrategyNumber, F_OIDEQ, + ObjectIdGetDatum(tablespaceoid)); + + rel = table_open(RelationRelationId, AccessShareLock); + scan = table_beginscan_catalog(rel, 1, key); + while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL) + { + Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple); + Oid relOid = relForm->oid; + + /* + * Do not pick-up objects in pg_catalog as part of this, if an admin + * really wishes to do so, they can issue the individual ALTER + * commands directly. + * + * Also, explicitly avoid any shared tables, temp tables, or TOAST + * (TOAST will be changed with the main table). 
+ */ + if (IsCatalogNamespace(relForm->relnamespace) || + relForm->relisshared || + isAnyTempNamespace(relForm->relnamespace) || + IsToastNamespace(relForm->relnamespace)) + continue; + + /* Only pick up the object type requested */ + if (relForm->relkind != RELKIND_RELATION) + continue; + + /* Check if we are only picking-up objects owned by certain roles */ + if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner)) + continue; + + /* + * Handle permissions-checking here since we are locking the tables + * and also to avoid doing a bunch of work only to fail part-way. Note + * that permissions will also be checked by AlterTableInternal(). + * + * Caller must be considered an owner on the table of which we're going + * to change persistence. + */ + if (!pg_class_ownercheck(relOid, GetUserId())) + aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)), + NameStr(relForm->relname)); + + if (stmt->nowait && + !ConditionalLockRelationOid(relOid, AccessExclusiveLock)) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_IN_USE), + errmsg("aborting because lock on relation \"%s.%s\" is not available", + get_namespace_name(relForm->relnamespace), + NameStr(relForm->relname)))); + else + LockRelationOid(relOid, AccessExclusiveLock); + + /* + * Add to our list of objects of which we're going to change + * persistence. + */ + relations = lappend_oid(relations, relOid); + } + + table_endscan(scan); + table_close(rel, AccessShareLock); + + if (relations == NIL) + ereport(NOTICE, + (errcode(ERRCODE_NO_DATA_FOUND), + errmsg("no matching relations in tablespace \"%s\" found", + tablespaceoid == InvalidOid ? "(database default)" : + get_tablespace_name(tablespaceoid)))); + + /* + * Everything is locked, loop through and change persistence of all of the + * relations. 
+ */ + foreach(l, relations) + { + List *cmds = NIL; + AlterTableCmd *cmd = makeNode(AlterTableCmd); + + if (stmt->logged) + cmd->subtype = AT_SetLogged; + else + cmd->subtype = AT_SetUnLogged; + + cmds = lappend(cmds, cmd); + + EventTriggerAlterTableStart((Node *) stmt); + /* OID is set by AlterTableInternal */ + AlterTableInternal(lfirst_oid(l), cmds, false); + EventTriggerAlterTableEnd(); + } +} + static void index_copy_data(Relation rel, RelFileNode newrnode) { diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c index ba3ccc712c..127da5151d 100644 --- a/src/backend/nodes/copyfuncs.c +++ b/src/backend/nodes/copyfuncs.c @@ -4138,6 +4138,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from) return newnode; } +static AlterTableSetLoggedAllStmt * +_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from) +{ + AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt); + + COPY_STRING_FIELD(tablespacename); + COPY_SCALAR_FIELD(objtype); + COPY_SCALAR_FIELD(logged); + COPY_SCALAR_FIELD(nowait); + + return newnode; +} + static CreateExtensionStmt * _copyCreateExtensionStmt(const CreateExtensionStmt *from) { @@ -5441,6 +5454,9 @@ copyObjectImpl(const void *from) case T_AlterTableMoveAllStmt: retval = _copyAlterTableMoveAllStmt(from); break; + case T_AlterTableSetLoggedAllStmt: + retval = _copyAlterTableSetLoggedAllStmt(from); + break; case T_CreateExtensionStmt: retval = _copyCreateExtensionStmt(from); break; diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c index a2ef853dc2..4f13a1762b 100644 --- a/src/backend/nodes/equalfuncs.c +++ b/src/backend/nodes/equalfuncs.c @@ -1872,6 +1872,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a, return true; } +static bool +_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a, + const AlterTableSetLoggedAllStmt *b) +{ + COMPARE_STRING_FIELD(tablespacename); + COMPARE_SCALAR_FIELD(objtype); + 
COMPARE_SCALAR_FIELD(logged); + COMPARE_SCALAR_FIELD(nowait); + + return true; +} + static bool _equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b) { @@ -3494,6 +3506,9 @@ equal(const void *a, const void *b) case T_AlterTableMoveAllStmt: retval = _equalAlterTableMoveAllStmt(a, b); break; + case T_AlterTableSetLoggedAllStmt: + retval = _equalAlterTableSetLoggedAllStmt(a, b); + break; case T_CreateExtensionStmt: retval = _equalCreateExtensionStmt(a, b); break; diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index 31c95443a5..2222fd8fe3 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -1934,6 +1934,26 @@ AlterTableStmt: n->nowait = $13; $$ = (Node *)n; } + | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->logged = true; + n->nowait = $9; + $$ = (Node *)n; + } + | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->logged = false; + n->nowait = $9; + $$ = (Node *)n; + } | ALTER INDEX qualified_name alter_table_cmds { AlterTableStmt *n = makeNode(AlterTableStmt); diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c index 53a511f1da..16606448bf 100644 --- a/src/backend/tcop/utility.c +++ b/src/backend/tcop/utility.c @@ -161,6 +161,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree) case T_AlterTSConfigurationStmt: case T_AlterTSDictionaryStmt: case T_AlterTableMoveAllStmt: + case T_AlterTableSetLoggedAllStmt: case T_AlterTableSpaceOptionsStmt: case T_AlterTableStmt: case T_AlterTypeStmt: @@ -1732,6 +1733,12 @@ ProcessUtilitySlow(ParseState *pstate, commandCollected = true; break; + case T_AlterTableSetLoggedAllStmt: + AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree); + 
/* commands are stashed in AlterTableSetLoggedAll */ + commandCollected = true; + break; + case T_DropStmt: ExecDropStmt((DropStmt *) parsetree, isTopLevel); /* no commands stashed for DROP */ @@ -2619,6 +2626,10 @@ CreateCommandTag(Node *parsetree) tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype); break; + case T_AlterTableSetLoggedAllStmt: + tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype); + break; + case T_AlterTableStmt: tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype); break; diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h index 08c463d3c4..646928466d 100644 --- a/src/include/commands/tablecmds.h +++ b/src/include/commands/tablecmds.h @@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse); extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt); +extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt); + extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt, Oid *oldschema); diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h index caed683ba9..16d91d3e1d 100644 --- a/src/include/nodes/nodes.h +++ b/src/include/nodes/nodes.h @@ -424,6 +424,7 @@ typedef enum NodeTag T_AlterCollationStmt, T_CallStmt, T_AlterStatsStmt, + T_AlterTableSetLoggedAllStmt, /* * TAGS FOR PARSE TREE NODES (parsenodes.h) diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h index dc2bb40926..c3eab6f1ab 100644 --- a/src/include/nodes/parsenodes.h +++ b/src/include/nodes/parsenodes.h @@ -2253,6 +2253,15 @@ typedef struct AlterTableMoveAllStmt bool nowait; } AlterTableMoveAllStmt; +typedef struct AlterTableSetLoggedAllStmt +{ + NodeTag type; + char *tablespacename; + ObjectType objtype; /* Object type to move */ + bool logged; + bool nowait; +} AlterTableSetLoggedAllStmt; + /* ---------------------- * Create/Alter Extension Statements * ---------------------- -- 2.27.0
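For readers following patch 2/2, the control flow of AlterTableSetLoggedAll() can be sketched outside the backend as a toy model. This is only a minimal standalone sketch under stated assumptions, not the patch's code: ToyRel, toy_is_candidate() and toy_set_logged_all() are hypothetical names, and the lock handling merely imitates ConditionalLockRelationOid()/LockRelationOid() with a boolean flag. It shows the two phases described in the function's header comment: filter and lock every candidate first, then change persistence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a pg_class row (not a real struct). */
typedef struct ToyRel
{
	unsigned	oid;
	unsigned	tablespace;		/* 0 models InvalidOid (database default) */
	bool		is_catalog;		/* lives in pg_catalog */
	bool		is_shared;
	bool		is_temp;
	bool		is_toast;
	bool		is_plain_table;	/* models relkind == RELKIND_RELATION */
	bool		locked_by_other;	/* models a conflicting lock held elsewhere */
	bool		logged;			/* current persistence */
} ToyRel;

/* Same kind of filtering as the catalog scan in the patch. */
static bool
toy_is_candidate(const ToyRel *r, unsigned tablespace)
{
	if (r->tablespace != tablespace)
		return false;
	if (r->is_catalog || r->is_shared || r->is_temp || r->is_toast)
		return false;
	return r->is_plain_table;
}

/*
 * Returns the number of relations whose persistence changed, or -1 when
 * nowait is set and some candidate is locked (nothing is changed then).
 */
static int
toy_set_logged_all(ToyRel *rels, size_t nrels, unsigned tablespace,
				   bool logged, bool nowait)
{
	int			changed = 0;

	assert(rels != NULL || nrels == 0);

	/* Phase 1: "lock" every candidate before touching any of them. */
	for (size_t i = 0; i < nrels; i++)
	{
		if (!toy_is_candidate(&rels[i], tablespace))
			continue;
		if (rels[i].locked_by_other)
		{
			if (nowait)
				return -1;		/* models ereport(ERROR, ...) */
			rels[i].locked_by_other = false;	/* models waiting for the lock */
		}
	}

	/* Phase 2: everything is locked, now flip persistence. */
	for (size_t i = 0; i < nrels; i++)
	{
		if (!toy_is_candidate(&rels[i], tablespace))
			continue;
		if (rels[i].logged != logged)
		{
			rels[i].logged = logged;
			changed++;
		}
	}

	return changed;
}
```

In this model a NOWAIT failure is detected before any relation is modified. In the real backend the ERROR aborts the whole transaction anyway, so the sketch only illustrates the ordering, not the rollback machinery.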
At Tue, 12 Jan 2021 18:58:08 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> At Fri, 08 Jan 2021 17:52:21 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > At Fri, 08 Jan 2021 14:47:05 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > > In this version, RelationChangePersistence() is changed not to choose
> > > the in-place method for indexes other than btree. It seems to be usable
> > > with all kinds of indexes other than GiST, but at the moment it applies
> > > only to btrees.
> > >
> > > 1: https://www.postgresql.org/message-id/CA+TgmoZEZ5RONS49C7mEpjhjndqMQtVrz_LCQUkpRWdmRevDnQ@mail.gmail.com
> >
> > Hmm. This is not working correctly. I'll repost after fixing that.
>
> I think I fixed the misbehavior. ResetUnloggedRelationsInDbspaceDir()
> handled file operations in the wrong order and with the wrong logic.
> It also needed to drop buffers and forget fsync requests.
>
> I thought that the two cases that this patch is expected to fix
> (orphan relation files and uncommitted init files) could share the same
> "cleanup" fork, but that is wrong. I had to add one more fork to
> differentiate the cases of SET UNLOGGED and of creation of
> UNLOGGED tables...
>
> The attached is a new version, which seems to work correctly but looks
> somewhat messy. I'll continue working on it.

Commit bea449c635 conflicts with this patch because it changes the
definition of DropRelFileNodeBuffers. The change simplified this patch
a bit :p

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

From 5f785f181acdac18952f504ec45ce41f285c05bc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v5 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, it currently runs a heap rewrite, which causes a large amount
of file I/O. This patch makes the command run without a heap rewrite.
In addition, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of a massive number of HEAP_INSERT records, which should
be smaller. This also allows for the cleanup of files left behind when
the transaction that created them crashes.
---
 src/backend/access/rmgrdesc/smgrdesc.c | 23 ++
 src/backend/access/transam/README | 8 +
 src/backend/access/transam/xlog.c | 17 +
 src/backend/catalog/storage.c | 436 +++++++++++++++++++++++--
 src/backend/commands/tablecmds.c | 246 +++++++++++---
 src/backend/storage/buffer/bufmgr.c | 88 +++++
 src/backend/storage/file/reinit.c | 316 ++++++++++++------
 src/backend/storage/smgr/md.c | 13 +-
 src/backend/storage/smgr/smgr.c | 6 +
 src/common/relpath.c | 4 +-
 src/include/catalog/storage.h | 2 +
 src/include/catalog/storage_xlog.h | 22 +-
 src/include/common/relpath.h | 6 +-
 src/include/storage/bufmgr.h | 2 +
 src/include/storage/md.h | 2 +
 src/include/storage/reinit.h | 3 +-
 src/include/storage/smgr.h | 1 +
 17 files changed, 1028 insertions(+), 167 deletions(-)

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 7755553d57..2c109b8ca4 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,23 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 						 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_BUFPERSISTENCE)
+	{
+		xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+		char	   *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+		appendStringInfoString(buf, path);
+		appendStringInfo(buf, " persistence %d", xlrec->persistence);
+		pfree(path);
+	}
 }
 
 const char *
@@ -55,6 +72,12 @@ smgr_identify(uint8 info)
 		case XLOG_SMGR_TRUNCATE:
 			id = "TRUNCATE";
 			break;
+		case XLOG_SMGR_UNLINK:
+			id = "UNLINK";
+ break; + case XLOG_SMGR_BUFPERSISTENCE: + id = "BUFPERSISTENCE"; + break; } return id; diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README index 1edc8180c1..547107a771 100644 --- a/src/backend/access/transam/README +++ b/src/backend/access/transam/README @@ -724,6 +724,14 @@ we must panic and abort recovery. The DBA will have to manually clean up then restart recovery. This is part of the reason for not writing a WAL entry until we've successfully done the original action. +The CLEANUP fork file +-------------------------------- + +An CLEANUP fork is created when a new relation file is created to mark +the relfilenode needs to be cleaned up at recovery time. In contrast +to 4 above, failure to remove an CLEANUP fork file will lead to data +loss, in which case the server will shut down. + Skipping WAL for New RelFileNode -------------------------------- diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index b18257c198..6dcbcbe387 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -40,6 +40,7 @@ #include "catalog/catversion.h" #include "catalog/pg_control.h" #include "catalog/pg_database.h" +#include "catalog/storage.h" #include "commands/progress.h" #include "commands/tablespace.h" #include "common/controldata_utils.h" @@ -4442,6 +4443,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode, { ereport(DEBUG1, (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery"))); + + /* cleanup garbage files left during crash recovery */ + ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER | + UNLOGGED_RELATION_CLEANUP); + + /* run rollback cleanup if any */ + smgrDoPendingDeletes(false); + InArchiveRecovery = true; if (StandbyModeRequested) StandbyMode = true; @@ -7455,6 +7464,14 @@ StartupXLOG(void) } } + /* cleanup garbage files left during crash recovery */ + if (!InArchiveRecovery) + ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER | + 
UNLOGGED_RELATION_CLEANUP); + + /* run rollback cleanup if any */ + smgrDoPendingDeletes(false); + /* Allow resource managers to do any required cleanup. */ for (rmid = 0; rmid <= RM_MAX_ID; rmid++) { diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c index cba7a9ada0..c54d70747f 100644 --- a/src/backend/catalog/storage.c +++ b/src/backend/catalog/storage.c @@ -19,6 +19,7 @@ #include "postgres.h" +#include "access/amapi.h" #include "access/parallel.h" #include "access/visibilitymap.h" #include "access/xact.h" @@ -27,6 +28,7 @@ #include "access/xlogutils.h" #include "catalog/storage.h" #include "catalog/storage_xlog.h" +#include "common/hashfn.h" #include "miscadmin.h" #include "storage/freespace.h" #include "storage/smgr.h" @@ -57,9 +59,16 @@ int wal_skip_threshold = 2048; /* in kilobytes */ * but I'm being paranoid. */ +#define PDOP_DELETE (0) +#define PDOP_UNLINK_FORK (1 << 0) +#define PDOP_SET_PERSISTENCE (1 << 1) + typedef struct PendingRelDelete { RelFileNode relnode; /* relation that may need to be deleted */ + int op; /* operation mask */ + bool bufpersistence; /* buffer persistence to set */ + int unlink_forknum; /* forknum to unlink */ BackendId backend; /* InvalidBackendId if not a temp rel */ bool atCommit; /* T=delete at commit; F=delete at abort */ int nestLevel; /* xact nesting level of request */ @@ -75,6 +84,24 @@ typedef struct PendingRelSync static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */ HTAB *pendingSyncHash = NULL; +typedef struct SRelHashEntry +{ + SMgrRelation srel; + char status; /* for simplehash use */ +} SRelHashEntry; + +/* define hashtable for workarea for pending deletes */ +#define SH_PREFIX srelhash +#define SH_ELEMENT_TYPE SRelHashEntry +#define SH_KEY_TYPE SMgrRelation +#define SH_KEY srel +#define SH_HASH_KEY(tb, key) \ + hash_bytes((unsigned char *)&key, sizeof(SMgrRelation)) +#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(SMgrRelation)) == 0) +#define SH_SCOPE static inline 
+#define SH_DEFINE +#define SH_DECLARE +#include "lib/simplehash.h" /* * AddPendingSync @@ -143,7 +170,17 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) return NULL; /* placate compiler */ } + /* + * We are going to create a new storage file. If server crashes before the + * current transaction ends the file needs to be cleaned up but there's no + * clue to the orphan files. The cleanup fork works as the sentinel to + * identify that situation. + */ srel = smgropen(rnode, backend); + smgrcreate(srel, CLEANUP2_FORKNUM, false); + log_smgrcreate(&rnode, CLEANUP2_FORKNUM); + smgrimmedsync(srel, CLEANUP2_FORKNUM); + smgrcreate(srel, MAIN_FORKNUM, false); if (needs_wal) @@ -153,12 +190,25 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) pending = (PendingRelDelete *) MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); pending->relnode = rnode; + pending->op = PDOP_DELETE; pending->backend = backend; pending->atCommit = false; /* delete if abort */ pending->nestLevel = GetCurrentTransactionNestLevel(); pending->next = pendingDeletes; pendingDeletes = pending; + /* drop cleanup fork at commit */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_FORK; + pending->unlink_forknum = CLEANUP2_FORKNUM; + pending->backend = backend; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded()) { Assert(backend == InvalidBackendId); @@ -168,6 +218,218 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) return srel; } +/* + * RelationCreateInitFork + * Create physical storage for the init fork of a relation. + * + * Create the init fork for the relation. + * + * This function is transactional. 
The creation is WAL-logged, and if the + * transaction aborts later on, the init fork will be removed. + */ +void +RelationCreateInitFork(Relation rel) +{ + RelFileNode rnode = rel->rd_node; + PendingRelDelete *pending; + SMgrRelation srel; + PendingRelDelete *prev; + PendingRelDelete *next; + bool create = true; + + /* switch buffer persistence */ + SetRelationBuffersPersistence(rel->rd_smgr, false, false); + + /* + * If we have entries for init-fork operation of this relation, that means + * that we have already registered pending delete entries to drop + * preexisting init fork since before the current transaction started. This + * function reverts that change just by removing the entries. + */ + prev = NULL; + for (pending = pendingDeletes; pending != NULL; pending = next) + { + next = pending->next; + if (RelFileNodeEquals(rnode, pending->relnode) && + pending->op != PDOP_DELETE && + ((pending->op & PDOP_UNLINK_FORK) != 0 && + pending->unlink_forknum == CLEANUP_FORKNUM)) + { + if (prev) + prev->next = next; + else + pendingDeletes = next; + pfree(pending); + + create = false; + } + else + { + /* unrelated entry, don't touch it */ + prev = pending; + } + } + + if (!create) + return; + + /* + * We are going to create the init fork. If server crashes before the + * current transaction ends the init fork left alone corrupts data while + * recovery. The cleanup fork works as the sentinel to identify that + * situation. + */ + srel = smgropen(rnode, InvalidBackendId); + smgrcreate(srel, CLEANUP_FORKNUM, false); + log_smgrcreate(&rnode, CLEANUP_FORKNUM); + smgrimmedsync(srel, CLEANUP_FORKNUM); + + /* We don't have existing init fork, create it. */ + smgrcreate(srel, INIT_FORKNUM, false); + + /* + * index-init fork needs further initialization. ambuildempty shoud do + * WAL-log and file sync by itself but otherwise we do that by myself. 
+ */ + if (rel->rd_rel->relkind == RELKIND_INDEX) + rel->rd_indam->ambuildempty(rel); + else + { + log_smgrcreate(&rnode, INIT_FORKNUM); + smgrimmedsync(srel, INIT_FORKNUM); + } + + /* drop this init fork file at abort and revert persistence */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_FORK | PDOP_SET_PERSISTENCE; + pending->unlink_forknum = INIT_FORKNUM; + pending->bufpersistence = true; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + + /* drop cleanup fork at abort */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_FORK; + pending->unlink_forknum = CLEANUP_FORKNUM; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + + /* drop cleanup fork at commit */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_FORK; + pending->unlink_forknum = CLEANUP_FORKNUM; + pending->backend = InvalidBackendId; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; +} + +/* + * RelationDropInitFork + * Delete physical storage for the init fork of a relation. + * + * Register pending-delete of the init fork. The real deletion is performed by + * smgrDoPendingDeletes at commit. + * + * This function is transactional. If the transaction aborts later on, the + * deletion doesn't happen. 
+ */ +void +RelationDropInitFork(Relation rel) +{ + RelFileNode rnode = rel->rd_node; + PendingRelDelete *pending; + PendingRelDelete *prev; + PendingRelDelete *next; + bool inxact_created = false; + + /* switch buffer persistence */ + SetRelationBuffersPersistence(rel->rd_smgr, true, false); + + /* + * If we have entries for init-fork operation of this relation, that means + * that we have created the init fork in the current transaction. We + * immediately remove the init and cleanup forks immediately in that case. + * Otherwise just reister pending-delete for the existing init fork. + */ + prev = NULL; + for (pending = pendingDeletes; pending != NULL; pending = next) + { + next = pending->next; + if (RelFileNodeEquals(rnode, pending->relnode) && + pending->op != PDOP_DELETE && + ((pending->op & PDOP_UNLINK_FORK) != 0 && + pending->unlink_forknum == CLEANUP_FORKNUM)) + { + /* unlink list entry */ + if (prev) + prev->next = next; + else + pendingDeletes = next; + pfree(pending); + + inxact_created = true; + } + else + { + /* unrelated entry, don't touch it */ + prev = pending; + } + } + + if (inxact_created) + { + SMgrRelation srel = smgropen(rnode, InvalidBackendId); + + /* + * INIT/CLEANUP forks never be loaded to shared buffer so no point in + * dropping buffers for these files. 
+ */ + log_smgrunlink(&rnode, INIT_FORKNUM); + smgrunlink(srel, INIT_FORKNUM, false); + log_smgrunlink(&rnode, CLEANUP_FORKNUM); + smgrunlink(srel, CLEANUP_FORKNUM, false); + return; + } + + /* register drop of this init fork file at commit */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_FORK; + pending->unlink_forknum = INIT_FORKNUM; + pending->backend = InvalidBackendId; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + + /* revert buffer-persistence changes at abort */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_SET_PERSISTENCE; + pending->bufpersistence = false; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; +} + /* * Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL. */ @@ -187,6 +449,44 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum) XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE); } +/* + * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL. + */ +void +log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum) +{ + xl_smgr_unlink xlrec; + + /* + * Make an XLOG entry reporting the file unlink. + */ + xlrec.rnode = *rnode; + xlrec.forkNum = forkNum; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE); +} + +/* + * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL. + */ +void +log_smgrbufpersistence(const RelFileNode *rnode, bool persistence) +{ + xl_smgr_bufpersistence xlrec; + + /* + * Make an XLOG entry reporting the change of buffer persistence. 
+ */ + xlrec.rnode = *rnode; + xlrec.persistence = persistence; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE); +} + /* * RelationDropStorage * Schedule unlinking of physical storage at transaction commit. @@ -200,6 +500,7 @@ RelationDropStorage(Relation rel) pending = (PendingRelDelete *) MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); pending->relnode = rel->rd_node; + pending->op = PDOP_DELETE; pending->backend = rel->rd_backend; pending->atCommit = true; /* delete if commit */ pending->nestLevel = GetCurrentTransactionNestLevel(); @@ -602,59 +903,97 @@ smgrDoPendingDeletes(bool isCommit) int nrels = 0, maxrels = 0; SMgrRelation *srels = NULL; + srelhash_hash *close_srels = NULL; + bool found; prev = NULL; for (pending = pendingDeletes; pending != NULL; pending = next) { + SMgrRelation srel; + next = pending->next; if (pending->nestLevel < nestLevel) { /* outer-level entries should not be processed yet */ prev = pending; + continue; } + + /* unlink list entry first, so we don't retry on failure */ + if (prev) + prev->next = next; else + pendingDeletes = next; + + if (pending->atCommit != isCommit) { - /* unlink list entry first, so we don't retry on failure */ - if (prev) - prev->next = next; - else - pendingDeletes = next; - /* do deletion if called for */ - if (pending->atCommit == isCommit) - { - SMgrRelation srel; - - srel = smgropen(pending->relnode, pending->backend); - - /* allocate the initial array, or extend it, if needed */ - if (maxrels == 0) - { - maxrels = 8; - srels = palloc(sizeof(SMgrRelation) * maxrels); - } - else if (maxrels <= nrels) - { - maxrels *= 2; - srels = repalloc(srels, sizeof(SMgrRelation) * maxrels); - } - - srels[nrels++] = srel; - } /* must explicitly free the list entry */ pfree(pending); /* prev does not change */ + continue; + } + + if (close_srels == NULL) + close_srels = 
srelhash_create(CurrentMemoryContext, 32, NULL); + + srel = smgropen(pending->relnode, pending->backend); + + /* Uniquify the smgr relations */ + srelhash_insert(close_srels, srel, &found); + + if (pending->op != PDOP_DELETE) + { + if (pending->op & PDOP_UNLINK_FORK) + { + /* other forks needs to drop buffers */ + Assert(pending->unlink_forknum == INIT_FORKNUM || + pending->unlink_forknum == CLEANUP_FORKNUM || + pending->unlink_forknum == CLEANUP2_FORKNUM); + + log_smgrunlink(&pending->relnode, pending->unlink_forknum); + smgrunlink(srel, pending->unlink_forknum, false); + + } + + if (pending->op & PDOP_SET_PERSISTENCE) + SetRelationBuffersPersistence(srel, pending->bufpersistence, + InRecovery); + } + else + { + /* allocate the initial array, or extend it, if needed */ + if (maxrels == 0) + { + maxrels = 8; + srels = palloc(sizeof(SMgrRelation) * maxrels); + } + else if (maxrels <= nrels) + { + maxrels *= 2; + srels = repalloc(srels, sizeof(SMgrRelation) * maxrels); + } + + srels[nrels++] = srel; } } if (nrels > 0) { smgrdounlinkall(srels, nrels, false); - - for (int i = 0; i < nrels; i++) - smgrclose(srels[i]); - pfree(srels); } + + if (close_srels) + { + srelhash_iterator i; + SRelHashEntry *ent; + + /* close smgr relatoins */ + srelhash_start_iterate(close_srels, &i); + while ((ent = srelhash_iterate(close_srels, &i)) != NULL) + smgrclose(ent->srel); + srelhash_destroy(close_srels); + } } /* @@ -824,7 +1163,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr) for (pending = pendingDeletes; pending != NULL; pending = pending->next) { if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit - && pending->backend == InvalidBackendId) + && pending->backend == InvalidBackendId + && pending->op == PDOP_DELETE) nrels++; } if (nrels == 0) @@ -837,7 +1177,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr) for (pending = pendingDeletes; pending != NULL; pending = pending->next) { if (pending->nestLevel >= nestLevel && pending->atCommit == 
forCommit - && pending->backend == InvalidBackendId) + && pending->backend == InvalidBackendId && + pending->op == PDOP_DELETE) { *rptr = pending->relnode; rptr++; @@ -917,6 +1258,15 @@ smgr_redo(XLogReaderState *record) reln = smgropen(xlrec->rnode, InvalidBackendId); smgrcreate(reln, xlrec->forkNum, true); } + else if (info == XLOG_SMGR_UNLINK) + { + xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record); + SMgrRelation reln; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + smgrunlink(reln, xlrec->forkNum, true); + smgrclose(reln); + } else if (info == XLOG_SMGR_TRUNCATE) { xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record); @@ -1005,6 +1355,28 @@ smgr_redo(XLogReaderState *record) FreeFakeRelcacheEntry(rel); } + else if (info == XLOG_SMGR_BUFPERSISTENCE) + { + xl_smgr_bufpersistence *xlrec = + (xl_smgr_bufpersistence *) XLogRecGetData(record); + SMgrRelation reln; + PendingRelDelete *pending; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + SetRelationBuffersPersistence(reln, xlrec->persistence, true); + + /* revert buffer-persistence changes at abort */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = xlrec->rnode; + pending->op = PDOP_SET_PERSISTENCE; + pending->bufpersistence = !xlrec->persistence; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + } else elog(PANIC, "smgr_redo: unknown op code %u", info); } diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 993da56d43..37a15d31ee 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -50,6 +50,7 @@ #include "commands/defrem.h" #include "commands/event_trigger.h" #include "commands/policy.h" +#include "commands/progress.h" #include "commands/sequence.h" #include "commands/tablecmds.h" #include 
"commands/tablespace.h" @@ -4917,6 +4918,170 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel, return newcmd; } +/* + * RelationChangePersistence: do in-place persistence change of a relation + */ +static void +RelationChangePersistence(AlteredTableInfo *tab, char persistence, + LOCKMODE lockmode) +{ + Relation rel; + Relation classRel; + HeapTuple tuple, + newtuple; + Datum new_val[Natts_pg_class]; + bool new_null[Natts_pg_class], + new_repl[Natts_pg_class]; + int i; + List *relids; + ListCell *lc_oid; + + Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE); + Assert(lockmode == AccessExclusiveLock); + + /* + * Under the following condition, we need to call ATRewriteTable, which + * cannot be false in the AT_REWRITE_ALTER_PERSISTENCE case. + */ + Assert(tab->constraints == NULL && tab->partition_constraint == NULL && + tab->newvals == NULL && !tab->verify_new_notnull); + + rel = table_open(tab->relid, lockmode); + + Assert(rel->rd_rel->relpersistence != persistence); + + elog(DEBUG1, "perform in-place persistence change"); + + RelationOpenSmgr(rel); + + /* + * First we collect all relations whose persistence we need to change. 
+ */ + + /* Collect OIDs of indexes and toast relations */ + relids = RelationGetIndexList(rel); + relids = lcons_oid(rel->rd_id, relids); + + /* Add toast relation if any */ + if (OidIsValid(rel->rd_rel->reltoastrelid)) + { + List *toastidx; + Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode); + + RelationOpenSmgr(toastrel); + relids = lappend_oid(relids, rel->rd_rel->reltoastrelid); + toastidx = RelationGetIndexList(toastrel); + relids = list_concat(relids, toastidx); + pfree(toastidx); + table_close(toastrel, NoLock); + } + + table_close(rel, NoLock); + + /* Make changes in storage */ + classRel = table_open(RelationRelationId, RowExclusiveLock); + + foreach (lc_oid, relids) + { + Oid reloid = lfirst_oid(lc_oid); + Relation r = relation_open(reloid, lockmode); + + /* + * Some access methods do not accept in-place persistence change. For + * example, GiST uses page LSNs to figure out whether a block has + * changed, where UNLOGGED GiST indexes use fake LSNs that are + * incompatible with real LSNs used for LOGGED ones. + * + * XXXX: We don't bother allowing in-place persistence change for index + * methods other than btree for now. + */ + if (r->rd_rel->relkind == RELKIND_INDEX && + r->rd_rel->relam != BTREE_AM_OID) + { + int reindex_flags; + + /* reindex doesn't allow concurrent use of the index */ + table_close(r, NoLock); + + reindex_flags = + REINDEX_REL_SUPPRESS_INDEX_USE | + REINDEX_REL_CHECK_CONSTRAINTS; + + /* Set the same persistence with the parent relation. 
*/ + if (persistence == RELPERSISTENCE_UNLOGGED) + reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED; + else + reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT; + + reindex_index(reloid, reindex_flags, persistence, 0); + + continue; + } + + RelationOpenSmgr(r); + + /* Create or drop init fork */ + if (persistence == RELPERSISTENCE_UNLOGGED) + RelationCreateInitFork(r); + else + RelationDropInitFork(r); + + /* + * When this relation gets WAL-logged, immediately sync all files but + * initfork to establish the initial state on storage. Buffers have + * already been flushed out by RelationCreate(Drop)InitFork called just + * above. Initfork should have been synced as needed. + */ + if (persistence == RELPERSISTENCE_PERMANENT) + { + for (i = 0 ; i < INIT_FORKNUM ; i++) + { + if (smgrexists(r->rd_smgr, i)) + smgrimmedsync(r->rd_smgr, i); + } + } + + /* Update catalog */ + tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid)); + if (!HeapTupleIsValid(tuple)) + elog(ERROR, "cache lookup failed for relation %u", reloid); + + memset(new_val, 0, sizeof(new_val)); + memset(new_null, false, sizeof(new_null)); + memset(new_repl, false, sizeof(new_repl)); + + new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence); + new_null[Anum_pg_class_relpersistence - 1] = false; + new_repl[Anum_pg_class_relpersistence - 1] = true; + + newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel), + new_val, new_null, new_repl); + + CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple); + heap_freetuple(newtuple); + + /* + * While wal_level >= replica, switching to LOGGED requires the + * relation content to be WAL-logged to recover the table. 
+ */ + if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded()) + { + ForkNumber fork; + + for (fork = 0; fork < INIT_FORKNUM ; fork++) + { + if (smgrexists(r->rd_smgr, fork)) + log_newpage_range(r, fork, + 0, smgrnblocks(r->rd_smgr, fork), false); + } + } + + table_close(r, NoLock); + } + + table_close(classRel, NoLock); +} + /* * ATRewriteTables: ALTER TABLE phase 3 */ @@ -5037,45 +5202,52 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode, tab->relid, tab->rewrite); - /* - * Create transient table that will receive the modified data. - * - * Ensure it is marked correctly as logged or unlogged. We have - * to do this here so that buffers for the new relfilenode will - * have the right persistence set, and at the same time ensure - * that the original filenode's buffers will get read in with the - * correct setting (i.e. the original one). Otherwise a rollback - * after the rewrite would possibly result with buffers for the - * original filenode having the wrong persistence setting. - * - * NB: This relies on swap_relation_files() also swapping the - * persistence. That wouldn't work for pg_class, but that can't be - * unlogged anyway. - */ - OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, persistence, - lockmode); + if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE) + RelationChangePersistence(tab, persistence, lockmode); + else + { + /* + * Create transient table that will receive the modified data. + * + * Ensure it is marked correctly as logged or unlogged. We + * have to do this here so that buffers for the new relfilenode + * will have the right persistence set, and at the same time + * ensure that the original filenode's buffers will get read in + * with the correct setting (i.e. the original one). Otherwise + * a rollback after the rewrite would possibly result with + * buffers for the original filenode having the wrong + * persistence setting. 
+ * + * NB: This relies on swap_relation_files() also swapping the + * persistence. That wouldn't work for pg_class, but that can't + * be unlogged anyway. + */ + OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, persistence, + lockmode); - /* - * Copy the heap data into the new table with the desired - * modifications, and test the current data within the table - * against new constraints generated by ALTER TABLE commands. - */ - ATRewriteTable(tab, OIDNewHeap, lockmode); + /* + * Copy the heap data into the new table with the desired + * modifications, and test the current data within the table + * against new constraints generated by ALTER TABLE commands. + */ + ATRewriteTable(tab, OIDNewHeap, lockmode); - /* - * Swap the physical files of the old and new heaps, then rebuild - * indexes and discard the old heap. We can use RecentXmin for - * the table's new relfrozenxid because we rewrote all the tuples - * in ATRewriteTable, so no older Xid remains in the table. Also, - * we never try to swap toast tables by content, since we have no - * interest in letting this code work on system catalogs. - */ - finish_heap_swap(tab->relid, OIDNewHeap, - false, false, true, - !OidIsValid(tab->newTableSpace), - RecentXmin, - ReadNextMultiXactId(), - persistence); + /* + * Swap the physical files of the old and new heaps, then + * rebuild indexes and discard the old heap. We can use + * RecentXmin for the table's new relfrozenxid because we + * rewrote all the tuples in ATRewriteTable, so no older Xid + * remains in the table. Also, we never try to swap toast + * tables by content, since we have no interest in letting this + * code work on system catalogs. 
+ */ + finish_heap_swap(tab->relid, OIDNewHeap, + false, false, true, + !OidIsValid(tab->newTableSpace), + RecentXmin, + ReadNextMultiXactId(), + persistence); + } } else { diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 561c212092..eacbdc6447 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -37,6 +37,7 @@ #include "access/xlog.h" #include "catalog/catalog.h" #include "catalog/storage.h" +#include "catalog/storage_xlog.h" #include "executor/instrument.h" #include "lib/binaryheap.h" #include "miscadmin.h" @@ -3094,6 +3095,93 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum, } } +/* --------------------------------------------------------------------- + * SetRelationBuffersPersistence + * + * This function changes the persistence of all buffer pages of a relation + * and, when switching to PERMANENT, writes all dirty pages of the relation + * out to disk (or more accurately, out to kernel disk buffers), + * ensuring that the kernel has an up-to-date view of the relation. + * + * Generally, the caller should be holding AccessExclusiveLock on the + * target relation to ensure that no other backend is busy dirtying + * more blocks of the relation; the effects can't be expected to last + * after the lock is released. + * + * XXX currently it sequentially searches the buffer pool, should be + * changed to more clever ways of searching. This routine is not + * used in any performance-critical code paths, so it's not worth + * adding additional overhead to normal paths to make it go faster; + * but see also DropRelFileNodeBuffers. 
+ * -------------------------------------------------------------------- + */ +void +SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo) +{ + int i; + RelFileNodeBackend rnode = srel->smgr_rnode; + + Assert (!RelFileNodeBackendIsTemp(rnode)); + + if (!isRedo) + log_smgrbufpersistence(&srel->smgr_rnode.node, permanent); + + ResourceOwnerEnlargeBuffers(CurrentResourceOwner); + + for (i = 0; i < NBuffers; i++) + { + BufferDesc *bufHdr = GetBufferDescriptor(i); + uint32 buf_state; + + if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node)) + continue; + + ReservePrivateRefCountEntry(); + + buf_state = LockBufHdr(bufHdr); + + if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node)) + { + UnlockBufHdr(bufHdr, buf_state); + continue; + } + + if (permanent) + { + /* Init fork is being dropped, drop buffers for it. */ + if (bufHdr->tag.forkNum == INIT_FORKNUM) + { + InvalidateBuffer(bufHdr); + continue; + } + + buf_state |= BM_PERMANENT; + pg_atomic_write_u32(&bufHdr->state, buf_state); + + /* we flush this buffer when switching to PERMANENT */ + if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY)) + { + PinBuffer_Locked(bufHdr); + LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), + LW_SHARED); + FlushBuffer(bufHdr, srel); + LWLockRelease(BufferDescriptorGetContentLock(bufHdr)); + UnpinBuffer(bufHdr, true); + } + else + UnlockBufHdr(bufHdr, buf_state); + } + else + { + /* init fork is always BM_PERMANENT. 
See BufferAlloc */ + if (bufHdr->tag.forkNum != INIT_FORKNUM) + buf_state &= ~BM_PERMANENT; + + UnlockBufHdr(bufHdr, buf_state); + } + } +} + /* --------------------------------------------------------------------- * DropRelFileNodesAllBuffers * diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c index 40c758d789..0eac1956cc 100644 --- a/src/backend/storage/file/reinit.c +++ b/src/backend/storage/file/reinit.c @@ -16,29 +16,50 @@ #include <unistd.h> +#include "access/xlog.h" +#include "catalog/pg_tablespace_d.h" #include "common/relpath.h" +#include "storage/bufmgr.h" #include "storage/copydir.h" #include "storage/fd.h" +#include "storage/md.h" #include "storage/reinit.h" +#include "storage/smgr.h" #include "utils/hsearch.h" #include "utils/memutils.h" static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, - int op); + Oid tspid, int op); static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, - int op); + Oid tspid, Oid dbid, int op); typedef struct { Oid reloid; /* hash key */ -} unlogged_relation_entry; + bool has_init; /* has INIT fork */ + bool dirty_init; /* needs to remove INIT fork */ + bool dirty_all; /* needs to remove all forks */ +} relfile_entry; /* - * Reset unlogged relations from before the last restart. + * Clean up and reset relation files from before the last restart. * - * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any - * relation with an "init" fork, except for the "init" fork itself. + * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations + * depending on the existence of the "cleanup" forks. * + * If CLEANUP_FORKNUM (clup) is present, we remove the init fork of the same + * relation along with the clup fork. + * + * If CLEANUP2_FORKNUM (cln2) is present we remove the whole relation along + * with the cln2 fork. + * + * Otherwise, if the "init" fork is found. 
we remove all forks of any relation + * with the "init" fork, except for the "init" fork itself. + * + * + * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all + * relations that have the "cleanup" and/or the "init" forks. + * * If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main * fork. */ @@ -68,7 +89,7 @@ ResetUnloggedRelations(int op) /* * First process unlogged files in pg_default ($PGDATA/base) */ - ResetUnloggedRelationsInTablespaceDir("base", op); + ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op); /* * Cycle through directories for all non-default tablespaces. @@ -77,13 +98,19 @@ ResetUnloggedRelations(int op) while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL) { + Oid tspid; + if (strcmp(spc_de->d_name, ".") == 0 || strcmp(spc_de->d_name, "..") == 0) continue; snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s", spc_de->d_name, TABLESPACE_VERSION_DIRECTORY); - ResetUnloggedRelationsInTablespaceDir(temp_path, op); + + tspid = atooid(spc_de->d_name); + Assert(tspid != 0); + + ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op); } FreeDir(spc_dir); @@ -99,7 +126,8 @@ ResetUnloggedRelations(int op) * Process one tablespace directory for ResetUnloggedRelations */ static void -ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) +ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, + Oid tspid, int op) { DIR *ts_dir; struct dirent *de; @@ -126,6 +154,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) while ((de = ReadDir(ts_dir, tsdirname)) != NULL) { + Oid dbid; + /* * We're only interested in the per-database directories, which have * numeric names. Note that this code will also (properly) ignore "." 
@@ -136,7 +166,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) snprintf(dbspace_path, sizeof(dbspace_path), "%s/%s", tsdirname, de->d_name); - ResetUnloggedRelationsInDbspaceDir(dbspace_path, op); + dbid = atooid(de->d_name); + Assert(dbid != 0); + + ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op); } FreeDir(ts_dir); @@ -146,125 +179,226 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) * Process one per-dbspace directory for ResetUnloggedRelations */ static void -ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) +ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, + Oid tspid, Oid dbid, int op) { DIR *dbspace_dir; struct dirent *de; char rm_path[MAXPGPATH * 2]; + HTAB *hash; + HASHCTL ctl; /* Caller must specify at least one operation. */ - Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0); + Assert((op & (UNLOGGED_RELATION_CLEANUP | + UNLOGGED_RELATION_DROP_BUFFER | + UNLOGGED_RELATION_INIT)) != 0); /* * Cleanup is a two-pass operation. First, we go through and identify all * the files with init forks. Then, we go through again and nuke * everything with the same OID except the init fork. */ + + /* + * It's possible that someone could create a ton of unlogged relations + * in the same database & tablespace, so we'd better use a hash table + * rather than an array or linked list to keep track of which files + * need to be reset. Otherwise, this cleanup operation would be + * O(n^2). + */ + memset(&ctl, 0, sizeof(ctl)); + ctl.keysize = sizeof(Oid); + ctl.entrysize = sizeof(relfile_entry); + hash = hash_create("relfilenode cleanup hash", + 32, &ctl, HASH_ELEM | HASH_BLOBS); + + /* Collect INIT and CLEANUP forks in the directory. */ + dbspace_dir = AllocateDir(dbspacedirname); + while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) + { + int oidchars; + ForkNumber forkNum; + + /* Skip anything that doesn't look like a relation data file. 
*/ + if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, + &forkNum)) + continue; + + if (forkNum == INIT_FORKNUM || + forkNum == CLEANUP_FORKNUM || forkNum == CLEANUP2_FORKNUM) + { + Oid key; + relfile_entry *ent; + bool found; + + /* + * Record the relfilenode information. If it has the CLEANUP fork, + * the relfilenode is in dirty state, where clean up is needed. + */ + key = atooid(de->d_name); + ent = hash_search(hash, &key, HASH_ENTER, &found); + + if (!found) + { + ent->has_init = false; + ent->dirty_init = false; + ent->dirty_all = false; + } + + if (forkNum == CLEANUP_FORKNUM) + ent->dirty_init = true; + else if (forkNum == CLEANUP2_FORKNUM) + ent->dirty_all = true; + else + { + Assert(forkNum == INIT_FORKNUM); + ent->has_init = true; + } + } + } + + /* Done with the first pass. */ + FreeDir(dbspace_dir); + + /* nothing to do if we have neither init nor cleanup forks */ + if (hash_get_num_entries(hash) < 1) + { + hash_destroy(hash); + return; + } + + if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0) + { + /* + * When we come here after recovery, smgr object for this file might + * have been created. In that case we need to drop all buffers then the + * smgr object before initializing the unlogged relation. This is safe + * as long as no other backends have accessed the relation before + * starting archive recovery. + */ + HASH_SEQ_STATUS status; + relfile_entry *ent; + SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8); + int maxrels = 8; + int nrels = 0; + int i; + + Assert(!HotStandbyActive()); + + hash_seq_init(&status, hash); + while((ent = (relfile_entry *) hash_seq_search(&status)) != NULL) + { + RelFileNodeBackend rel; + + /* + * The relation is persistent and remains persistent. Don't + * drop the buffers for this relation. 
+ */ + if (ent->has_init && ent->dirty_init) + continue; + + if (maxrels <= nrels) + { + maxrels *= 2; + srels = repalloc(srels, sizeof(SMgrRelation) * maxrels); + } + + rel.backend = InvalidBackendId; + rel.node.spcNode = tspid; + rel.node.dbNode = dbid; + rel.node.relNode = ent->reloid; + + srels[nrels++] = smgropen(rel.node, InvalidBackendId); + } + + DropRelFileNodesAllBuffers(srels, nrels); + + for (i = 0 ; i < nrels ; i++) + smgrclose(srels[i]); + } + + /* + * Now, make a second pass and remove anything that matches. + */ if ((op & UNLOGGED_RELATION_CLEANUP) != 0) { - HTAB *hash; - HASHCTL ctl; - - /* - * It's possible that someone could create a ton of unlogged relations - * in the same database & tablespace, so we'd better use a hash table - * rather than an array or linked list to keep track of which files - * need to be reset. Otherwise, this cleanup operation would be - * O(n^2). - */ - ctl.keysize = sizeof(Oid); - ctl.entrysize = sizeof(unlogged_relation_entry); - ctl.hcxt = CurrentMemoryContext; - hash = hash_create("unlogged relation OIDs", 32, &ctl, - HASH_ELEM | HASH_BLOBS | HASH_CONTEXT); - - /* Scan the directory. */ - dbspace_dir = AllocateDir(dbspacedirname); - while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) - { - ForkNumber forkNum; - int oidchars; - unlogged_relation_entry ent; - - /* Skip anything that doesn't look like a relation data file. */ - if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) - continue; - - /* Also skip it unless this is the init fork. */ - if (forkNum != INIT_FORKNUM) - continue; - - /* - * Put the OID portion of the name into the hash table, if it - * isn't already. - */ - ent.reloid = atooid(de->d_name); - (void) hash_search(hash, &ent, HASH_ENTER, NULL); - } - - /* Done with the first pass. */ - FreeDir(dbspace_dir); - - /* - * If we didn't find any init forks, there's no point in continuing; - * we can bail out now. 
- */ - if (hash_get_num_entries(hash) == 0) - { - hash_destroy(hash); - return; - } - - /* - * Now, make a second pass and remove anything that matches. - */ dbspace_dir = AllocateDir(dbspacedirname); while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) { - ForkNumber forkNum; - int oidchars; - unlogged_relation_entry ent; + ForkNumber forkNum; + int oidchars; + Oid key; + relfile_entry *ent; + RelFileNodeBackend rel; /* Skip anything that doesn't look like a relation data file. */ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, &forkNum)) continue; - /* We never remove the init fork. */ - if (forkNum == INIT_FORKNUM) - continue; - /* * See whether the OID portion of the name shows up in the hash * table. If so, nuke it! */ - ent.reloid = atooid(de->d_name); - if (hash_search(hash, &ent, HASH_FIND, NULL)) + key = atooid(de->d_name); + ent = hash_search(hash, &key, HASH_FIND, NULL); + + if (!ent) + continue; + + if (!ent->dirty_all) { - snprintf(rm_path, sizeof(rm_path), "%s/%s", - dbspacedirname, de->d_name); - if (unlink(rm_path) < 0) - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not remove file \"%s\": %m", - rm_path))); + /* clean permanent relations don't need cleanup */ + if (!ent->has_init) + continue; + + if (ent->dirty_init) + { + /* + * The crashed transaction did SET UNLOGGED. This relation + * is restored to a LOGGED relation. + */ + if (forkNum != INIT_FORKNUM && forkNum != CLEANUP_FORKNUM) + continue; + } else - elog(DEBUG2, "unlinked file \"%s\"", rm_path); + { + /* + * we don't remove the INIT fork of a non-dirty + * relfilenode + */ + if (forkNum == INIT_FORKNUM) + continue; + } } + + /* so, nuke it! 
*/ + snprintf(rm_path, sizeof(rm_path), "%s/%s", + dbspacedirname, de->d_name); + if (unlink(rm_path) < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not remove file \"%s\": %m", + rm_path))); + + rel.backend = InvalidBackendId; + rel.node.spcNode = tspid; + rel.node.dbNode = dbid; + rel.node.relNode = atooid(de->d_name); + + ForgetRelationForkSyncRequests(rel, forkNum); } /* Cleanup is complete. */ FreeDir(dbspace_dir); - hash_destroy(hash); } + hash_destroy(hash); + hash = NULL; + /* * Initialization happens after cleanup is complete: we copy each init - * fork file to the corresponding main fork file. Note that if we are - * asked to do both cleanup and init, we may never get here: if the - * cleanup code determines that there are no init forks in this dbspace, - * it will return before we get to this point. + * fork file to the corresponding main fork file. */ if ((op & UNLOGGED_RELATION_INIT) != 0) { diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index 0643d714fb..6b37195c52 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -338,8 +338,10 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo) if (ret == 0 || errno != ENOENT) { ret = unlink(path); + + /* failure of removing cleanup fork leads to a data loss. */ if (ret < 0 && errno != ENOENT) - ereport(WARNING, + ereport((forkNum != CLEANUP_FORKNUM ? 
WARNING : ERROR), (errcode_for_file_access(), errmsg("could not remove file \"%s\": %m", path))); } @@ -1024,6 +1026,15 @@ register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum, RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ ); } +/* + * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork + */ +void +ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, ForkNumber forknum) +{ + register_forget_request(rnode, forknum, 0); +} + /* * ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB */ diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c index 4dc24649df..96480e321d 100644 --- a/src/backend/storage/smgr/smgr.c +++ b/src/backend/storage/smgr/smgr.c @@ -662,6 +662,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum) smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum); } +void +smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo); +} + /* * AtEOXact_SMgr * diff --git a/src/common/relpath.c b/src/common/relpath.c index 1f5c426ec0..479dcc248e 100644 --- a/src/common/relpath.c +++ b/src/common/relpath.c @@ -34,7 +34,9 @@ const char *const forkNames[] = { "main", /* MAIN_FORKNUM */ "fsm", /* FSM_FORKNUM */ "vm", /* VISIBILITYMAP_FORKNUM */ - "init" /* INIT_FORKNUM */ + "init", /* INIT_FORKNUM */ + "clup", /* CLEANUP_FORKNUM */ + "cln2" /* CLEANUP2_FORKNUM */ }; StaticAssertDecl(lengthof(forkNames) == (MAX_FORKNUM + 1), diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h index 0ab32b44e9..382623159c 100644 --- a/src/include/catalog/storage.h +++ b/src/include/catalog/storage.h @@ -23,6 +23,8 @@ extern int wal_skip_threshold; extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence); +extern void RelationCreateInitFork(Relation rel); +extern void RelationDropInitFork(Relation rel); extern void RelationDropStorage(Relation rel); 
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit); extern void RelationPreTruncate(Relation rel); diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h index f0814f1458..0fd0832a8b 100644 --- a/src/include/catalog/storage_xlog.h +++ b/src/include/catalog/storage_xlog.h @@ -22,13 +22,17 @@ /* * Declarations for smgr-related XLOG records * - * Note: we log file creation and truncation here, but logging of deletion - * actions is handled by xact.c, because it is part of transaction commit. + * Note: we log file creation, truncation, deletion and persistence change + * here. Logging of deletion actions is mainly handled by xact.c, because it is + * part of transaction commit, but we log deletions that happen outside of a + * transaction. */ /* XLOG gives us high 4 bits */ #define XLOG_SMGR_CREATE 0x10 #define XLOG_SMGR_TRUNCATE 0x20 +#define XLOG_SMGR_UNLINK 0x30 +#define XLOG_SMGR_BUFPERSISTENCE 0x40 typedef struct xl_smgr_create { @@ -36,6 +40,18 @@ typedef struct xl_smgr_create ForkNumber forkNum; } xl_smgr_create; +typedef struct xl_smgr_unlink +{ + RelFileNode rnode; + ForkNumber forkNum; +} xl_smgr_unlink; + +typedef struct xl_smgr_bufpersistence +{ + RelFileNode rnode; + bool persistence; +} xl_smgr_bufpersistence; + /* flags for xl_smgr_truncate */ #define SMGR_TRUNCATE_HEAP 0x0001 #define SMGR_TRUNCATE_VM 0x0002 @@ -51,6 +67,8 @@ typedef struct xl_smgr_truncate } xl_smgr_truncate; extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum); +extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum); +extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence); extern void smgr_redo(XLogReaderState *record); extern void smgr_desc(StringInfo buf, XLogReaderState *record); diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h index a44be11ca0..040070aa2b 100644 --- a/src/include/common/relpath.h +++ b/src/include/common/relpath.h @@ -43,7 +43,9 @@ 
typedef enum ForkNumber MAIN_FORKNUM = 0, FSM_FORKNUM, VISIBILITYMAP_FORKNUM, - INIT_FORKNUM + INIT_FORKNUM, + CLEANUP_FORKNUM, + CLEANUP2_FORKNUM /* * NOTE: if you add a new fork, change MAX_FORKNUM and possibly @@ -52,7 +54,7 @@ typedef enum ForkNumber */ } ForkNumber; -#define MAX_FORKNUM INIT_FORKNUM +#define MAX_FORKNUM CLEANUP2_FORKNUM #define FORKNAMECHARS 4 /* max chars for a fork name */ diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h index fb00fda6a7..ccb0a388f6 100644 --- a/src/include/storage/bufmgr.h +++ b/src/include/storage/bufmgr.h @@ -205,6 +205,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels) extern void FlushDatabaseBuffers(Oid dbid); extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum, int nforks, BlockNumber *firstDelBlock); +extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel, + bool permanent, bool isRedo); extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes); extern void DropDatabaseBuffers(Oid dbid); diff --git a/src/include/storage/md.h b/src/include/storage/md.h index 752b440864..3cbbbf2edd 100644 --- a/src/include/storage/md.h +++ b/src/include/storage/md.h @@ -41,6 +41,8 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks); extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum); +extern void ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, + ForkNumber forknum); extern void ForgetDatabaseSyncRequests(Oid dbid); extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo); diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h index fad1e5c473..b969ba8e86 100644 --- a/src/include/storage/reinit.h +++ b/src/include/storage/reinit.h @@ -23,6 +23,7 @@ extern bool parse_filename_for_nontemp_relation(const char *name, int *oidchars, ForkNumber *fork); #define UNLOGGED_RELATION_CLEANUP 0x0001 -#define 
UNLOGGED_RELATION_INIT 0x0002 +#define UNLOGGED_RELATION_DROP_BUFFER 0x0002 +#define UNLOGGED_RELATION_INIT 0x0004 #endif /* REINIT_H */ diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h index a6fbf7b6a6..1ac3e4a74a 100644 --- a/src/include/storage/smgr.h +++ b/src/include/storage/smgr.h @@ -86,6 +86,7 @@ extern void smgrclose(SMgrRelation reln); extern void smgrcloseall(void); extern void smgrclosenode(RelFileNodeBackend rnode); extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo); +extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo); extern void smgrdosyncall(SMgrRelation *rels, int nrels); extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo); extern void smgrextend(SMgrRelation reln, ForkNumber forknum, -- 2.27.0 From 89dbb62355befa7dde815030c95cf4902a8941f1 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Wed, 11 Nov 2020 23:21:09 +0900 Subject: [PATCH v5 2/2] New command ALTER TABLE ALL IN TABLESPACE SET LOGGED/UNLOGGED To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes relation persistence of all tables in the specified tablespace. --- src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++++++++++ src/backend/nodes/copyfuncs.c | 16 ++++ src/backend/nodes/equalfuncs.c | 15 ++++ src/backend/parser/gram.y | 20 +++++ src/backend/tcop/utility.c | 11 +++ src/include/commands/tablecmds.h | 2 + src/include/nodes/nodes.h | 1 + src/include/nodes/parsenodes.h | 9 ++ 8 files changed, 214 insertions(+) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 37a15d31ee..2f65abb19b 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -13696,6 +13696,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt) return new_tablespaceoid; } +/* + * Alter Table ALL ... 
SET LOGGED/UNLOGGED + * + * Allows a user to change persistence of all objects in a given tablespace in + * the current database. Objects can be chosen based on the owner of the + * object also, to allow users to change persistene only their objects. The + * main permissions handling is done by the lower-level change persistence + * function. + * + * All to-be-modified objects are locked first. If NOWAIT is specified and the + * lock can't be acquired then we ereport(ERROR). + */ +void +AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt) +{ + List *relations = NIL; + ListCell *l; + ScanKeyData key[1]; + Relation rel; + TableScanDesc scan; + HeapTuple tuple; + Oid tablespaceoid; + List *role_oids = roleSpecsToIds(NIL); + + /* Ensure we were not asked to change something we can't */ + if (stmt->objtype != OBJECT_TABLE) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("only tables can be specified"))); + + /* Get the tablespace OID */ + tablespaceoid = get_tablespace_oid(stmt->tablespacename, false); + + /* + * Now that the checks are done, check if we should set either to + * InvalidOid because it is our database's default tablespace. + */ + if (tablespaceoid == MyDatabaseTableSpace) + tablespaceoid = InvalidOid; + + /* + * Walk the list of objects in the tablespace to pick up them. This will + * only find objects in our database, of course. + */ + ScanKeyInit(&key[0], + Anum_pg_class_reltablespace, + BTEqualStrategyNumber, F_OIDEQ, + ObjectIdGetDatum(tablespaceoid)); + + rel = table_open(RelationRelationId, AccessShareLock); + scan = table_beginscan_catalog(rel, 1, key); + while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL) + { + Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple); + Oid relOid = relForm->oid; + + /* + * Do not pick-up objects in pg_catalog as part of this, if an admin + * really wishes to do so, they can issue the individual ALTER + * commands directly. 
+ * + * Also, explicitly avoid any shared tables, temp tables, or TOAST + * (TOAST will be changed with the main table). + */ + if (IsCatalogNamespace(relForm->relnamespace) || + relForm->relisshared || + isAnyTempNamespace(relForm->relnamespace) || + IsToastNamespace(relForm->relnamespace)) + continue; + + /* Only pick up the object type requested */ + if (relForm->relkind != RELKIND_RELATION) + continue; + + /* Check if we are only picking-up objects owned by certain roles */ + if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner)) + continue; + + /* + * Handle permissions-checking here since we are locking the tables + * and also to avoid doing a bunch of work only to fail part-way. Note + * that permissions will also be checked by AlterTableInternal(). + * + * Caller must be considered an owner on the table of which we're going + * to change persistence. + */ + if (!pg_class_ownercheck(relOid, GetUserId())) + aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)), + NameStr(relForm->relname)); + + if (stmt->nowait && + !ConditionalLockRelationOid(relOid, AccessExclusiveLock)) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_IN_USE), + errmsg("aborting because lock on relation \"%s.%s\" is not available", + get_namespace_name(relForm->relnamespace), + NameStr(relForm->relname)))); + else + LockRelationOid(relOid, AccessExclusiveLock); + + /* + * Add to our list of objects of which we're going to change + * persistence. + */ + relations = lappend_oid(relations, relOid); + } + + table_endscan(scan); + table_close(rel, AccessShareLock); + + if (relations == NIL) + ereport(NOTICE, + (errcode(ERRCODE_NO_DATA_FOUND), + errmsg("no matching relations in tablespace \"%s\" found", + tablespaceoid == InvalidOid ? "(database default)" : + get_tablespace_name(tablespaceoid)))); + + /* + * Everything is locked, loop through and change persistence of all of the + * relations. 
+ */ + foreach(l, relations) + { + List *cmds = NIL; + AlterTableCmd *cmd = makeNode(AlterTableCmd); + + if (stmt->logged) + cmd->subtype = AT_SetLogged; + else + cmd->subtype = AT_SetUnLogged; + + cmds = lappend(cmds, cmd); + + EventTriggerAlterTableStart((Node *) stmt); + /* OID is set by AlterTableInternal */ + AlterTableInternal(lfirst_oid(l), cmds, false); + EventTriggerAlterTableEnd(); + } +} + static void index_copy_data(Relation rel, RelFileNode newrnode) { diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c index ba3ccc712c..127da5151d 100644 --- a/src/backend/nodes/copyfuncs.c +++ b/src/backend/nodes/copyfuncs.c @@ -4138,6 +4138,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from) return newnode; } +static AlterTableSetLoggedAllStmt * +_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from) +{ + AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt); + + COPY_STRING_FIELD(tablespacename); + COPY_SCALAR_FIELD(objtype); + COPY_SCALAR_FIELD(logged); + COPY_SCALAR_FIELD(nowait); + + return newnode; +} + static CreateExtensionStmt * _copyCreateExtensionStmt(const CreateExtensionStmt *from) { @@ -5441,6 +5454,9 @@ copyObjectImpl(const void *from) case T_AlterTableMoveAllStmt: retval = _copyAlterTableMoveAllStmt(from); break; + case T_AlterTableSetLoggedAllStmt: + retval = _copyAlterTableSetLoggedAllStmt(from); + break; case T_CreateExtensionStmt: retval = _copyCreateExtensionStmt(from); break; diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c index a2ef853dc2..4f13a1762b 100644 --- a/src/backend/nodes/equalfuncs.c +++ b/src/backend/nodes/equalfuncs.c @@ -1872,6 +1872,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a, return true; } +static bool +_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a, + const AlterTableSetLoggedAllStmt *b) +{ + COMPARE_STRING_FIELD(tablespacename); + COMPARE_SCALAR_FIELD(objtype); + 
COMPARE_SCALAR_FIELD(logged); + COMPARE_SCALAR_FIELD(nowait); + + return true; +} + static bool _equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b) { @@ -3494,6 +3506,9 @@ equal(const void *a, const void *b) case T_AlterTableMoveAllStmt: retval = _equalAlterTableMoveAllStmt(a, b); break; + case T_AlterTableSetLoggedAllStmt: + retval = _equalAlterTableSetLoggedAllStmt(a, b); + break; case T_CreateExtensionStmt: retval = _equalCreateExtensionStmt(a, b); break; diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index 31c95443a5..2222fd8fe3 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -1934,6 +1934,26 @@ AlterTableStmt: n->nowait = $13; $$ = (Node *)n; } + | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->logged = true; + n->nowait = $9; + $$ = (Node *)n; + } + | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->logged = false; + n->nowait = $9; + $$ = (Node *)n; + } | ALTER INDEX qualified_name alter_table_cmds { AlterTableStmt *n = makeNode(AlterTableStmt); diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c index 53a511f1da..16606448bf 100644 --- a/src/backend/tcop/utility.c +++ b/src/backend/tcop/utility.c @@ -161,6 +161,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree) case T_AlterTSConfigurationStmt: case T_AlterTSDictionaryStmt: case T_AlterTableMoveAllStmt: + case T_AlterTableSetLoggedAllStmt: case T_AlterTableSpaceOptionsStmt: case T_AlterTableStmt: case T_AlterTypeStmt: @@ -1732,6 +1733,12 @@ ProcessUtilitySlow(ParseState *pstate, commandCollected = true; break; + case T_AlterTableSetLoggedAllStmt: + AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree); + 
/* commands are stashed in AlterTableSetLoggedAll */ + commandCollected = true; + break; + case T_DropStmt: ExecDropStmt((DropStmt *) parsetree, isTopLevel); /* no commands stashed for DROP */ @@ -2619,6 +2626,10 @@ CreateCommandTag(Node *parsetree) tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype); break; + case T_AlterTableSetLoggedAllStmt: + tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype); + break; + case T_AlterTableStmt: tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype); break; diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h index 08c463d3c4..646928466d 100644 --- a/src/include/commands/tablecmds.h +++ b/src/include/commands/tablecmds.h @@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse); extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt); +extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt); + extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt, Oid *oldschema); diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h index caed683ba9..16d91d3e1d 100644 --- a/src/include/nodes/nodes.h +++ b/src/include/nodes/nodes.h @@ -424,6 +424,7 @@ typedef enum NodeTag T_AlterCollationStmt, T_CallStmt, T_AlterStatsStmt, + T_AlterTableSetLoggedAllStmt, /* * TAGS FOR PARSE TREE NODES (parsenodes.h) diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h index dc2bb40926..c3eab6f1ab 100644 --- a/src/include/nodes/parsenodes.h +++ b/src/include/nodes/parsenodes.h @@ -2253,6 +2253,15 @@ typedef struct AlterTableMoveAllStmt bool nowait; } AlterTableMoveAllStmt; +typedef struct AlterTableSetLoggedAllStmt +{ + NodeTag type; + char *tablespacename; + ObjectType objtype; /* Object type to move */ + bool logged; + bool nowait; +} AlterTableSetLoggedAllStmt; + /* ---------------------- * Create/Alter Extension Statements * ---------------------- -- 2.27.0
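For readers skimming the 2/2 patch above, the selection rules that AlterTableSetLoggedAll() applies while scanning pg_class can be sketched in isolation. This is only an illustration, not code from the patch: RelInfo and rel_is_candidate() are hypothetical stand-ins for Form_pg_class and the checks done through IsCatalogNamespace(), relisshared, isAnyTempNamespace() and IsToastNamespace(), and the owner array models role_oids. Note that in the patch as posted role_oids comes from roleSpecsToIds(NIL), so the owner filter appears to be always inactive there.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the pg_class fields the scan loop inspects. */
typedef struct RelInfo
{
    unsigned int owner;             /* models relowner */
    char        relkind;            /* 'r' is an ordinary table */
    bool        in_pg_catalog;      /* models IsCatalogNamespace() */
    bool        is_shared;          /* models relisshared */
    bool        in_temp_namespace;  /* models isAnyTempNamespace() */
    bool        in_toast_namespace; /* models IsToastNamespace() */
} RelInfo;

/*
 * Return true if the relation would be added to the to-be-altered list.
 * owners/nowners model the optional role filter; nowners == 0 means that
 * no owner filter is in effect.
 */
static bool
rel_is_candidate(const RelInfo *rel, const unsigned int *owners, size_t nowners)
{
    size_t      i;

    /* skip catalog, shared, temp and TOAST relations */
    if (rel->in_pg_catalog || rel->is_shared ||
        rel->in_temp_namespace || rel->in_toast_namespace)
        return false;

    /* only ordinary tables are picked up */
    if (rel->relkind != 'r')
        return false;

    /* no owner filter: every remaining table qualifies */
    if (nowners == 0)
        return true;

    for (i = 0; i < nowners; i++)
    {
        if (owners[i] == rel->owner)
            return true;
    }
    return false;
}
```

In the patch itself, a relation that passes these checks is then ownership-checked and locked with AccessExclusiveLock (or the command errors out under NOWAIT) before any persistence change is attempted.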
(I'm not sure when the subject was broken..)

At Thu, 14 Jan 2021 17:32:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> Commit bea449c635 conflicts with this on the change of the definition
> of DropRelFileNodeBuffers.

The change simplified this patch by a bit :p

In this version, I got rid of the "CLEANUP FORK"s and added a new mechanism, "smgr marks". A mark file has the name of the corresponding fork file followed by ".u" (which means "uncommitted"). An uncommitted-marked main fork means the same as CLEANUP2_FORKNUM in the previous version, and an uncommitted-marked init fork means the same as CLEANUP_FORKNUM.

I noticed that the previous version of the patch still leaves an orphan main fork file after "BEGIN; CREATE TABLE x; ROLLBACK; (crash before checkpoint)", since the "mark" file (or CLEANUP2_FORKNUM) is removed at rollback. In this version the responsibility for removing the mark files is moved to SyncPostCheckpoint, where main fork files are actually removed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

From 27ea96d84dfc2f3e0d62c4b8f7d20cc30771cf86 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v6 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data rewriting, it currently runs a heap rewrite, which causes a large amount of file I/O. This patch makes the command run without a heap rewrite. In addition, SET LOGGED while wal_level > minimal emits WAL using XLOG_FPI instead of a massive number of HEAP_INSERTs, which should be smaller. This also allows cleanup of the files left behind when the transaction that created them crashes.
--- src/backend/access/rmgrdesc/smgrdesc.c | 52 +++ src/backend/access/transam/README | 8 + src/backend/access/transam/xlog.c | 17 + src/backend/catalog/storage.c | 520 +++++++++++++++++++++++-- src/backend/commands/tablecmds.c | 246 ++++++++++-- src/backend/replication/basebackup.c | 3 +- src/backend/storage/buffer/bufmgr.c | 88 +++++ src/backend/storage/file/fd.c | 4 +- src/backend/storage/file/reinit.c | 346 +++++++++++----- src/backend/storage/smgr/md.c | 92 ++++- src/backend/storage/smgr/smgr.c | 32 ++ src/backend/storage/sync/sync.c | 20 +- src/bin/pg_rewind/parsexlog.c | 24 ++ src/common/relpath.c | 47 ++- src/include/catalog/storage.h | 2 + src/include/catalog/storage_xlog.h | 42 +- src/include/common/relpath.h | 9 +- src/include/storage/bufmgr.h | 2 + src/include/storage/fd.h | 1 + src/include/storage/md.h | 8 +- src/include/storage/reinit.h | 10 +- src/include/storage/smgr.h | 17 + 22 files changed, 1384 insertions(+), 206 deletions(-) diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c index 7755553d57..d251f22207 100644 --- a/src/backend/access/rmgrdesc/smgrdesc.c +++ b/src/backend/access/rmgrdesc/smgrdesc.c @@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record) xlrec->blkno, xlrec->flags); pfree(path); } + else if (info == XLOG_SMGR_UNLINK) + { + xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec; + char *path = relpathperm(xlrec->rnode, xlrec->forkNum); + + appendStringInfoString(buf, path); + pfree(path); + } + else if (info == XLOG_SMGR_MARK) + { + xl_smgr_mark *xlrec = (xl_smgr_mark *) rec; + char *path = GetRelationPath(xlrec->rnode.dbNode, + xlrec->rnode.spcNode, + xlrec->rnode.relNode, + InvalidBackendId, + xlrec->forkNum, xlrec->mark); + char *action; + + switch (xlrec->action) + { + case XLOG_SMGR_MARK_CREATE: + action = "CREATE"; + break; + case XLOG_SMGR_MARK_UNLINK: + action = "DELETE"; + break; + default: + action = "<unknown action>"; + break; + } + + appendStringInfo(buf, "%s %s", 
action, path); + pfree(path); + } + else if (info == XLOG_SMGR_BUFPERSISTENCE) + { + xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec; + char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM); + + appendStringInfoString(buf, path); + appendStringInfo(buf, " persistence %d", xlrec->persistence); + pfree(path); + } } const char * @@ -55,6 +98,15 @@ smgr_identify(uint8 info) case XLOG_SMGR_TRUNCATE: id = "TRUNCATE"; break; + case XLOG_SMGR_UNLINK: + id = "UNLINK"; + break; + case XLOG_SMGR_MARK: + id = "MARK"; + break; + case XLOG_SMGR_BUFPERSISTENCE: + id = "BUFPERSISTENCE"; + break; } return id; diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README index 1edc8180c1..7cf77e4a02 100644 --- a/src/backend/access/transam/README +++ b/src/backend/access/transam/README @@ -724,6 +724,14 @@ we must panic and abort recovery. The DBA will have to manually clean up then restart recovery. This is part of the reason for not writing a WAL entry until we've successfully done the original action. +The Smgr MARK files +-------------------------------- + +A smgr mark files is created when a new relation file is created to +mark the relfilenode needs to be cleaned up at recovery time. In +contrast to 4 above, failure to remove smgr mark files will lead to +data loss, in which case the server will shut down. 
+ Skipping WAL for New RelFileNode -------------------------------- diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 6f8810e149..27bbe17395 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -40,6 +40,7 @@ #include "catalog/catversion.h" #include "catalog/pg_control.h" #include "catalog/pg_database.h" +#include "catalog/storage.h" #include "commands/progress.h" #include "commands/tablespace.h" #include "common/controldata_utils.h" @@ -4458,6 +4459,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode, { ereport(DEBUG1, (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery"))); + + /* cleanup garbage files left during crash recovery */ + ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER | + UNLOGGED_RELATION_CLEANUP); + + /* run rollback cleanup if any */ + smgrDoPendingDeletes(false); + InArchiveRecovery = true; if (StandbyModeRequested) StandbyMode = true; @@ -7577,6 +7586,14 @@ StartupXLOG(void) } } + /* cleanup garbage files left during crash recovery */ + if (!InArchiveRecovery) + ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER | + UNLOGGED_RELATION_CLEANUP); + + /* run rollback cleanup if any */ + smgrDoPendingDeletes(false); + /* Allow resource managers to do any required cleanup. 
*/ for (rmid = 0; rmid <= RM_MAX_ID; rmid++) { diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c index cba7a9ada0..7302a3fad4 100644 --- a/src/backend/catalog/storage.c +++ b/src/backend/catalog/storage.c @@ -19,6 +19,7 @@ #include "postgres.h" +#include "access/amapi.h" #include "access/parallel.h" #include "access/visibilitymap.h" #include "access/xact.h" @@ -27,6 +28,7 @@ #include "access/xlogutils.h" #include "catalog/storage.h" #include "catalog/storage_xlog.h" +#include "common/hashfn.h" #include "miscadmin.h" #include "storage/freespace.h" #include "storage/smgr.h" @@ -57,9 +59,18 @@ int wal_skip_threshold = 2048; /* in kilobytes */ * but I'm being paranoid. */ +#define PDOP_DELETE (1 << 0) +#define PDOP_UNLINK_FORK (1 << 1) +#define PDOP_UNLINK_MARK (1 << 2) +#define PDOP_SET_PERSISTENCE (1 << 3) + typedef struct PendingRelDelete { RelFileNode relnode; /* relation that may need to be deleted */ + int op; /* operation mask */ + bool bufpersistence; /* buffer persistence to set */ + int unlink_forknum; /* forknum to unlink */ + StorageMarks unlink_mark; /* mark to unlink */ BackendId backend; /* InvalidBackendId if not a temp rel */ bool atCommit; /* T=delete at commit; F=delete at abort */ int nestLevel; /* xact nesting level of request */ @@ -75,6 +86,24 @@ typedef struct PendingRelSync static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */ HTAB *pendingSyncHash = NULL; +typedef struct SRelHashEntry +{ + SMgrRelation srel; + char status; /* for simplehash use */ +} SRelHashEntry; + +/* define hashtable for workarea for pending deletes */ +#define SH_PREFIX srelhash +#define SH_ELEMENT_TYPE SRelHashEntry +#define SH_KEY_TYPE SMgrRelation +#define SH_KEY srel +#define SH_HASH_KEY(tb, key) \ + hash_bytes((unsigned char *)&key, sizeof(SMgrRelation)) +#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(SMgrRelation)) == 0) +#define SH_SCOPE static inline +#define SH_DEFINE +#define SH_DECLARE +#include "lib/simplehash.h" 
/* * AddPendingSync @@ -143,22 +172,48 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) return NULL; /* placate compiler */ } + /* + * We are going to create a new storage file. If server crashes before the + * current transaction ends the file needs to be cleaned up but there's no + * clue to the orphan files. The SMGR_MARK_UNCOMMITED mark file works as + * the signal of that situation. + */ srel = smgropen(rnode, backend); + log_smgrcreatemark(&rnode, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED); + smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false); smgrcreate(srel, MAIN_FORKNUM, false); if (needs_wal) log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM); - /* Add the relation to the list of stuff to delete at abort */ + /* + * Add the relation to the list of stuff to delete at abort. We don't + * remove the mark file at commit. It needs to persiste until the main fork + * file is actually deleted. See SyncPostCheckpoint. + */ pending = (PendingRelDelete *) MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); pending->relnode = rnode; + pending->op = PDOP_DELETE; pending->backend = backend; pending->atCommit = false; /* delete if abort */ pending->nestLevel = GetCurrentTransactionNestLevel(); pending->next = pendingDeletes; pendingDeletes = pending; + /* drop cleanup fork at commit */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_MARK; + pending->unlink_forknum = MAIN_FORKNUM; + pending->unlink_mark = SMGR_MARK_UNCOMMITTED; + pending->backend = backend; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded()) { Assert(backend == InvalidBackendId); @@ -168,6 +223,207 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) return srel; } +/* + * 
RelationCreateInitFork + * Create physical storage for the init fork of a relation. + * + * Create the init fork for the relation. + * + * This function is transactional. The creation is WAL-logged, and if the + * transaction aborts later on, the init fork will be removed. + */ +void +RelationCreateInitFork(Relation rel) +{ + RelFileNode rnode = rel->rd_node; + PendingRelDelete *pending; + SMgrRelation srel; + PendingRelDelete *prev; + PendingRelDelete *next; + bool create = true; + + /* switch buffer persistence */ + SetRelationBuffersPersistence(rel->rd_smgr, false, false); + + /* + * If we have entries for init-fork operation of this relation, that means + * that we have already registered pending delete entries to drop + * preexisting init fork since before the current transaction started. This + * function reverts that change just by removing the entries. + */ + prev = NULL; + for (pending = pendingDeletes; pending != NULL; pending = next) + { + next = pending->next; + if (RelFileNodeEquals(rnode, pending->relnode) && + (pending->op & PDOP_DELETE) == 0 && + (pending->unlink_forknum == INIT_FORKNUM || + (pending->op & PDOP_SET_PERSISTENCE) != 0)) + { + if (prev) + prev->next = next; + else + pendingDeletes = next; + pfree(pending); + + create = false; + } + else + { + /* unrelated entry, don't touch it */ + prev = pending; + } + } + + if (!create) + return; + + /* + * We are going to create the init fork. If server crashes before the + * current transaction ends the init fork left alone corrupts data while + * recovery. The cleanup fork works as the sentinel to identify that + * situation. + */ + srel = smgropen(rnode, InvalidBackendId); + log_smgrcreatemark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED); + smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false); + + /* We don't have existing init fork, create it. */ + smgrcreate(srel, INIT_FORKNUM, false); + + /* + * index-init fork needs further initialization. 
ambuildempty shoud do + * WAL-log and file sync by itself but otherwise we do that by ourselves. + */ + if (rel->rd_rel->relkind == RELKIND_INDEX) + rel->rd_indam->ambuildempty(rel); + else + { + log_smgrcreate(&rnode, INIT_FORKNUM); + smgrimmedsync(srel, INIT_FORKNUM); + } + + /* drop the init fork, mark file and revert persistence at abort */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_FORK | PDOP_UNLINK_MARK | PDOP_SET_PERSISTENCE; + pending->unlink_forknum = INIT_FORKNUM; + pending->unlink_mark = SMGR_MARK_UNCOMMITTED; + pending->bufpersistence = true; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + + /* drop mark file at commit */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_MARK; + pending->unlink_forknum = INIT_FORKNUM; + pending->unlink_mark = SMGR_MARK_UNCOMMITTED; + pending->backend = InvalidBackendId; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; +} + +/* + * RelationDropInitFork + * Delete physical storage for the init fork of a relation. + * + * Register pending-delete of the init fork. The real deletion is performed by + * smgrDoPendingDeletes at commit. + * + * This function is transactional. If the transaction aborts later on, the + * deletion doesn't happen. 
+ */ +void +RelationDropInitFork(Relation rel) +{ + RelFileNode rnode = rel->rd_node; + PendingRelDelete *pending; + PendingRelDelete *prev; + PendingRelDelete *next; + bool inxact_created = false; + + /* switch buffer persistence */ + SetRelationBuffersPersistence(rel->rd_smgr, true, false); + + /* + * If we have entries for init-fork operations of this relation, that means + * that we have created the init fork in the current transaction. We + * remove the init fork and mark file immediately in that case. Otherwise + * just reister pending-delete for the existing init fork. + */ + prev = NULL; + for (pending = pendingDeletes; pending != NULL; pending = next) + { + next = pending->next; + if (RelFileNodeEquals(rnode, pending->relnode) && + (pending->op & PDOP_DELETE) == 0 && + (pending->unlink_forknum == INIT_FORKNUM || + (pending->op & PDOP_SET_PERSISTENCE) != 0)) + { + /* unlink list entry */ + if (prev) + prev->next = next; + else + pendingDeletes = next; + pfree(pending); + + inxact_created = true; + } + else + { + /* unrelated entry, don't touch it */ + prev = pending; + } + } + + if (inxact_created) + { + SMgrRelation srel = smgropen(rnode, InvalidBackendId); + + /* + * INIT forks never be loaded to shared buffer so no point in dropping + * buffers for such files. 
+ */ + log_smgrunlinkmark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED); + smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false); + log_smgrunlink(&rnode, INIT_FORKNUM); + smgrunlink(srel, INIT_FORKNUM, false); + return; + } + + /* register drop of this init fork file at commit */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_FORK; + pending->unlink_forknum = INIT_FORKNUM; + pending->backend = InvalidBackendId; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + + /* revert buffer-persistence changes at abort */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_SET_PERSISTENCE; + pending->bufpersistence = false; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; +} + /* * Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL. */ @@ -187,6 +443,88 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum) XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE); } +/* + * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL. + */ +void +log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum) +{ + xl_smgr_unlink xlrec; + + /* + * Make an XLOG entry reporting the file unlink. + */ + xlrec.rnode = *rnode; + xlrec.forkNum = forkNum; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE); +} + +/* + * Perform XLogInsert of an XLOG_SMGR_CREATEMARK record to WAL. 
+ */ +void +log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum, + StorageMarks mark) +{ + xl_smgr_mark xlrec; + + /* + * Make an XLOG entry reporting the file creation. + */ + xlrec.rnode = *rnode; + xlrec.forkNum = forkNum; + xlrec.mark = mark; + xlrec.action = XLOG_SMGR_MARK_CREATE; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE); +} + +/* + * Perform XLogInsert of an XLOG_SMGR_UNLINKMARK record to WAL. + */ +void +log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum, + StorageMarks mark) +{ + xl_smgr_mark xlrec; + + /* + * Make an XLOG entry reporting the file creation. + */ + xlrec.rnode = *rnode; + xlrec.forkNum = forkNum; + xlrec.mark = mark; + xlrec.action = XLOG_SMGR_MARK_UNLINK; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE); +} + +/* + * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL. + */ +void +log_smgrbufpersistence(const RelFileNode *rnode, bool persistence) +{ + xl_smgr_bufpersistence xlrec; + + /* + * Make an XLOG entry reporting the change of buffer persistence. + */ + xlrec.rnode = *rnode; + xlrec.persistence = persistence; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE); +} + /* * RelationDropStorage * Schedule unlinking of physical storage at transaction commit. 
@@ -200,6 +538,7 @@ RelationDropStorage(Relation rel) pending = (PendingRelDelete *) MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); pending->relnode = rel->rd_node; + pending->op = PDOP_DELETE; pending->backend = rel->rd_backend; pending->atCommit = true; /* delete if commit */ pending->nestLevel = GetCurrentTransactionNestLevel(); @@ -602,59 +941,104 @@ smgrDoPendingDeletes(bool isCommit) int nrels = 0, maxrels = 0; SMgrRelation *srels = NULL; + srelhash_hash *close_srels = NULL; + bool found; prev = NULL; for (pending = pendingDeletes; pending != NULL; pending = next) { + SMgrRelation srel; + next = pending->next; if (pending->nestLevel < nestLevel) { /* outer-level entries should not be processed yet */ prev = pending; + continue; } + + /* unlink list entry first, so we don't retry on failure */ + if (prev) + prev->next = next; else + pendingDeletes = next; + + if (pending->atCommit != isCommit) { - /* unlink list entry first, so we don't retry on failure */ - if (prev) - prev->next = next; - else - pendingDeletes = next; - /* do deletion if called for */ - if (pending->atCommit == isCommit) - { - SMgrRelation srel; - - srel = smgropen(pending->relnode, pending->backend); - - /* allocate the initial array, or extend it, if needed */ - if (maxrels == 0) - { - maxrels = 8; - srels = palloc(sizeof(SMgrRelation) * maxrels); - } - else if (maxrels <= nrels) - { - maxrels *= 2; - srels = repalloc(srels, sizeof(SMgrRelation) * maxrels); - } - - srels[nrels++] = srel; - } /* must explicitly free the list entry */ pfree(pending); /* prev does not change */ + continue; } + + if (close_srels == NULL) + close_srels = srelhash_create(CurrentMemoryContext, 32, NULL); + + srel = smgropen(pending->relnode, pending->backend); + + /* Uniquify the smgr relations */ + srelhash_insert(close_srels, srel, &found); + + if (pending->op & PDOP_DELETE) + { + /* allocate the initial array, or extend it, if needed */ + if (maxrels == 0) + { + maxrels = 8; + srels = 
palloc(sizeof(SMgrRelation) * maxrels); + } + else if (maxrels <= nrels) + { + maxrels *= 2; + srels = repalloc(srels, sizeof(SMgrRelation) * maxrels); + } + + srels[nrels++] = srel; + } + + if (pending->op & PDOP_UNLINK_FORK) + { + /* other forks needs to drop buffers */ + Assert(pending->unlink_forknum == INIT_FORKNUM); + + /* Don't emit wal while recovery. */ + if (!InRecovery) + log_smgrunlink(&pending->relnode, pending->unlink_forknum); + smgrunlink(srel, pending->unlink_forknum, false); + } + + if (pending->op & PDOP_UNLINK_MARK) + { + if (!InRecovery) + log_smgrunlinkmark(&pending->relnode, + pending->unlink_forknum, + pending->unlink_mark); + smgrunlinkmark(srel, pending->unlink_forknum, + pending->unlink_mark, InRecovery); + } + + if (pending->op & PDOP_SET_PERSISTENCE) + SetRelationBuffersPersistence(srel, pending->bufpersistence, + InRecovery); } if (nrels > 0) { smgrdounlinkall(srels, nrels, false); - - for (int i = 0; i < nrels; i++) - smgrclose(srels[i]); - pfree(srels); } + + if (close_srels) + { + srelhash_iterator i; + SRelHashEntry *ent; + + /* close smgr relatoins */ + srelhash_start_iterate(close_srels, &i); + while ((ent = srelhash_iterate(close_srels, &i)) != NULL) + smgrclose(ent->srel); + srelhash_destroy(close_srels); + } } /* @@ -824,7 +1208,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr) for (pending = pendingDeletes; pending != NULL; pending = pending->next) { if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit - && pending->backend == InvalidBackendId) + && pending->backend == InvalidBackendId + && pending->op == PDOP_DELETE) nrels++; } if (nrels == 0) @@ -837,7 +1222,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr) for (pending = pendingDeletes; pending != NULL; pending = pending->next) { if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit - && pending->backend == InvalidBackendId) + && pending->backend == InvalidBackendId && + pending->op == PDOP_DELETE) { *rptr = 
pending->relnode; rptr++; @@ -917,6 +1303,15 @@ smgr_redo(XLogReaderState *record) reln = smgropen(xlrec->rnode, InvalidBackendId); smgrcreate(reln, xlrec->forkNum, true); } + else if (info == XLOG_SMGR_UNLINK) + { + xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record); + SMgrRelation reln; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + smgrunlink(reln, xlrec->forkNum, true); + smgrclose(reln); + } else if (info == XLOG_SMGR_TRUNCATE) { xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record); @@ -1005,6 +1400,65 @@ smgr_redo(XLogReaderState *record) FreeFakeRelcacheEntry(rel); } + else if (info == XLOG_SMGR_MARK) + { + xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record); + SMgrRelation reln; + PendingRelDelete *pending; + bool created = false; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + switch (xlrec->action) + { + case XLOG_SMGR_MARK_CREATE: + smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true); + created = true; + break; + case XLOG_SMGR_MARK_UNLINK: + smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true); + break; + default: + elog(ERROR, "unknown smgr_mark action \"%c\"", xlrec->mark); + } + + if (created) + { + /* revert mark file operation at abort */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = xlrec->rnode; + pending->op = PDOP_UNLINK_MARK; + pending->unlink_forknum = xlrec->forkNum; + pending->unlink_mark = xlrec->mark; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + } + } + else if (info == XLOG_SMGR_BUFPERSISTENCE) + { + xl_smgr_bufpersistence *xlrec = + (xl_smgr_bufpersistence *) XLogRecGetData(record); + SMgrRelation reln; + PendingRelDelete *pending; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + SetRelationBuffersPersistence(reln, xlrec->persistence, true); + + /* 
revert buffer-persistence changes at abort */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = xlrec->rnode; + pending->op = PDOP_SET_PERSISTENCE; + pending->bufpersistence = !xlrec->persistence; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + } else elog(PANIC, "smgr_redo: unknown op code %u", info); } diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 3349bcfaa7..4e2bceffda 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -51,6 +51,7 @@ #include "commands/defrem.h" #include "commands/event_trigger.h" #include "commands/policy.h" +#include "commands/progress.h" #include "commands/sequence.h" #include "commands/tablecmds.h" #include "commands/tablespace.h" @@ -5085,6 +5086,170 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel, return newcmd; } +/* + * RelationChangePersistence: do in-place persistence change of a relation + */ +static void +RelationChangePersistence(AlteredTableInfo *tab, char persistence, + LOCKMODE lockmode) +{ + Relation rel; + Relation classRel; + HeapTuple tuple, + newtuple; + Datum new_val[Natts_pg_class]; + bool new_null[Natts_pg_class], + new_repl[Natts_pg_class]; + int i; + List *relids; + ListCell *lc_oid; + + Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE); + Assert(lockmode == AccessExclusiveLock); + + /* + * Under the following condition, we need to call ATRewriteTable, which + * cannot be false in the AT_REWRITE_ALTER_PERSISTENCE case. 
+ */ + Assert(tab->constraints == NULL && tab->partition_constraint == NULL && + tab->newvals == NULL && !tab->verify_new_notnull); + + rel = table_open(tab->relid, lockmode); + + Assert(rel->rd_rel->relpersistence != persistence); + + elog(DEBUG1, "perform in-place persistence change"); + + RelationOpenSmgr(rel); + + /* + * First we collect all relations whose persistence we need to change. + */ + + /* Collect OIDs of indexes and toast relations */ + relids = RelationGetIndexList(rel); + relids = lcons_oid(rel->rd_id, relids); + + /* Add toast relation if any */ + if (OidIsValid(rel->rd_rel->reltoastrelid)) + { + List *toastidx; + Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode); + + RelationOpenSmgr(toastrel); + relids = lappend_oid(relids, rel->rd_rel->reltoastrelid); + toastidx = RelationGetIndexList(toastrel); + relids = list_concat(relids, toastidx); + pfree(toastidx); + table_close(toastrel, NoLock); + } + + table_close(rel, NoLock); + + /* Make changes in storage */ + classRel = table_open(RelationRelationId, RowExclusiveLock); + + foreach (lc_oid, relids) + { + Oid reloid = lfirst_oid(lc_oid); + Relation r = relation_open(reloid, lockmode); + + /* + * Some access methods do not accept in-place persistence change. For + * example, GiST uses page LSNs to figure out whether a block has + * changed, where UNLOGGED GiST indexes use fake LSNs that are + * incompatible with real LSNs used for LOGGED ones. + * + * XXXX: We don't bother allowing in-place persistence change for index + * methods other than btree for now. + */ + if (r->rd_rel->relkind == RELKIND_INDEX && + r->rd_rel->relam != BTREE_AM_OID) + { + int reindex_flags; + + /* reindex doesn't allow concurrent use of the index */ + table_close(r, NoLock); + + reindex_flags = + REINDEX_REL_SUPPRESS_INDEX_USE | + REINDEX_REL_CHECK_CONSTRAINTS; + + /* Set the same persistence as the parent relation. 
*/ + if (persistence == RELPERSISTENCE_UNLOGGED) + reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED; + else + reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT; + + reindex_index(reloid, reindex_flags, persistence, 0); + + continue; + } + + RelationOpenSmgr(r); + + /* Create or drop init fork */ + if (persistence == RELPERSISTENCE_UNLOGGED) + RelationCreateInitFork(r); + else + RelationDropInitFork(r); + + /* + * When this relation gets WAL-logged, immediately sync all files but + * initfork to establish the initial state on storage. Buffers have + * already flushed out by RelationCreate(Drop)InitFork called just + * above. Initfork should have been synced as needed. + */ + if (persistence == RELPERSISTENCE_PERMANENT) + { + for (i = 0 ; i < INIT_FORKNUM ; i++) + { + if (smgrexists(r->rd_smgr, i)) + smgrimmedsync(r->rd_smgr, i); + } + } + + /* Update catalog */ + tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid)); + if (!HeapTupleIsValid(tuple)) + elog(ERROR, "cache lookup failed for relation %u", reloid); + + memset(new_val, 0, sizeof(new_val)); + memset(new_null, false, sizeof(new_null)); + memset(new_repl, false, sizeof(new_repl)); + + new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence); + new_null[Anum_pg_class_relpersistence - 1] = false; + new_repl[Anum_pg_class_relpersistence - 1] = true; + + newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel), + new_val, new_null, new_repl); + + CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple); + heap_freetuple(newtuple); + + /* + * While wal_level >= replica, switching to LOGGED requires the + * relation content to be WAL-logged to recover the table. 
+ */ + if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded()) + { + ForkNumber fork; + + for (fork = 0; fork < INIT_FORKNUM ; fork++) + { + if (smgrexists(r->rd_smgr, fork)) + log_newpage_range(r, fork, + 0, smgrnblocks(r->rd_smgr, fork), false); + } + } + + table_close(r, NoLock); + } + + table_close(classRel, NoLock); +} + /* * ATRewriteTables: ALTER TABLE phase 3 */ @@ -5205,45 +5370,52 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode, tab->relid, tab->rewrite); - /* - * Create transient table that will receive the modified data. - * - * Ensure it is marked correctly as logged or unlogged. We have - * to do this here so that buffers for the new relfilenode will - * have the right persistence set, and at the same time ensure - * that the original filenode's buffers will get read in with the - * correct setting (i.e. the original one). Otherwise a rollback - * after the rewrite would possibly result with buffers for the - * original filenode having the wrong persistence setting. - * - * NB: This relies on swap_relation_files() also swapping the - * persistence. That wouldn't work for pg_class, but that can't be - * unlogged anyway. - */ - OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, persistence, - lockmode); + if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE) + RelationChangePersistence(tab, persistence, lockmode); + else + { + /* + * Create transient table that will receive the modified data. + * + * Ensure it is marked correctly as logged or unlogged. We + * have to do this here so that buffers for the new relfilenode + * will have the right persistence set, and at the same time + * ensure that the original filenode's buffers will get read in + * with the correct setting (i.e. the original one). Otherwise + * a rollback after the rewrite would possibly result with + * buffers for the original filenode having the wrong + * persistence setting. 
+ * + * NB: This relies on swap_relation_files() also swapping the + * persistence. That wouldn't work for pg_class, but that can't + * be unlogged anyway. + */ + OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, persistence, + lockmode); - /* - * Copy the heap data into the new table with the desired - * modifications, and test the current data within the table - * against new constraints generated by ALTER TABLE commands. - */ - ATRewriteTable(tab, OIDNewHeap, lockmode); + /* + * Copy the heap data into the new table with the desired + * modifications, and test the current data within the table + * against new constraints generated by ALTER TABLE commands. + */ + ATRewriteTable(tab, OIDNewHeap, lockmode); - /* - * Swap the physical files of the old and new heaps, then rebuild - * indexes and discard the old heap. We can use RecentXmin for - * the table's new relfrozenxid because we rewrote all the tuples - * in ATRewriteTable, so no older Xid remains in the table. Also, - * we never try to swap toast tables by content, since we have no - * interest in letting this code work on system catalogs. - */ - finish_heap_swap(tab->relid, OIDNewHeap, - false, false, true, - !OidIsValid(tab->newTableSpace), - RecentXmin, - ReadNextMultiXactId(), - persistence); + /* + * Swap the physical files of the old and new heaps, then + * rebuild indexes and discard the old heap. We can use + * RecentXmin for the table's new relfrozenxid because we + * rewrote all the tuples in ATRewriteTable, so no older Xid + * remains in the table. Also, we never try to swap toast + * tables by content, since we have no interest in letting this + * code work on system catalogs. 
+ */ + finish_heap_swap(tab->relid, OIDNewHeap, + false, false, true, + !OidIsValid(tab->newTableSpace), + RecentXmin, + ReadNextMultiXactId(), + persistence); + } } else { diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c index 56cd473f9f..bc5288de05 100644 --- a/src/backend/replication/basebackup.c +++ b/src/backend/replication/basebackup.c @@ -1255,6 +1255,7 @@ sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces, bool excludeFound; ForkNumber relForkNum; /* Type of fork if file is a relation */ int relOidChars; /* Chars in filename that are the rel oid */ + StorageMarks mark; /* Skip special stuff */ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0) @@ -1305,7 +1306,7 @@ sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces, /* Exclude all forks for unlogged tables except the init fork */ if (isDbDir && parse_filename_for_nontemp_relation(de->d_name, &relOidChars, - &relForkNum)) + &relForkNum, &mark)) { /* Never exclude init forks */ if (relForkNum != INIT_FORKNUM) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 852138f9c9..50674fd027 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -37,6 +37,7 @@ #include "access/xlog.h" #include "catalog/catalog.h" #include "catalog/storage.h" +#include "catalog/storage_xlog.h" #include "executor/instrument.h" #include "lib/binaryheap.h" #include "miscadmin.h" @@ -3100,6 +3101,93 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum, } } +/* --------------------------------------------------------------------- + * SetRelFileNodeBuffersPersistence + * + * This function changes the persistence of all buffer pages of a relation + * then writes all dirty pages of the relation out to disk when switching + * to PERMANENT. 
(or more accurately, out to kernel disk buffers), + * ensuring that the kernel has an up-to-date view of the relation. + * + * Generally, the caller should be holding AccessExclusiveLock on the + * target relation to ensure that no other backend is busy dirtying + * more blocks of the relation; the effects can't be expected to last + * after the lock is released. + * + * XXX currently it sequentially searches the buffer pool, should be + * changed to more clever ways of searching. This routine is not + * used in any performance-critical code paths, so it's not worth + * adding additional overhead to normal paths to make it go faster; + * but see also DropRelFileNodeBuffers. + * -------------------------------------------------------------------- + */ +void +SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo) +{ + int i; + RelFileNodeBackend rnode = srel->smgr_rnode; + + Assert (!RelFileNodeBackendIsTemp(rnode)); + + if (!isRedo) + log_smgrbufpersistence(&srel->smgr_rnode.node, permanent); + + ResourceOwnerEnlargeBuffers(CurrentResourceOwner); + + for (i = 0; i < NBuffers; i++) + { + BufferDesc *bufHdr = GetBufferDescriptor(i); + uint32 buf_state; + + if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node)) + continue; + + ReservePrivateRefCountEntry(); + + buf_state = LockBufHdr(bufHdr); + + if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node)) + { + UnlockBufHdr(bufHdr, buf_state); + continue; + } + + if (permanent) + { + /* Init fork is being dropped, drop buffers for it. 
*/ + if (bufHdr->tag.forkNum == INIT_FORKNUM) + { + InvalidateBuffer(bufHdr); + continue; + } + + buf_state |= BM_PERMANENT; + pg_atomic_write_u32(&bufHdr->state, buf_state); + + /* we flush this buffer when switching to PERMANENT */ + if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY)) + { + PinBuffer_Locked(bufHdr); + LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), + LW_SHARED); + FlushBuffer(bufHdr, srel); + LWLockRelease(BufferDescriptorGetContentLock(bufHdr)); + UnpinBuffer(bufHdr, true); + } + else + UnlockBufHdr(bufHdr, buf_state); + } + else + { + /* init fork is always BM_PERMANENT. See BufferAlloc */ + if (bufHdr->tag.forkNum != INIT_FORKNUM) + buf_state &= ~BM_PERMANENT; + + UnlockBufHdr(bufHdr, buf_state); + } + } +} + /* --------------------------------------------------------------------- * DropRelFileNodesAllBuffers * diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index 06b57ae71f..bdf6916d63 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -342,8 +342,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel); static void datadir_fsync_fname(const char *fname, bool isdir, int elevel); static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel); -static int fsync_parent_path(const char *fname, int elevel); - /* * pg_fsync --- do fsync with or without writethrough @@ -3647,7 +3645,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel) * This is aimed at making file operations persistent on disk in case of * an OS crash or power failure. 
*/ -static int +int fsync_parent_path(const char *fname, int elevel) { char parentpath[MAXPGPATH]; diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c index 40c758d789..f52d2ac199 100644 --- a/src/backend/storage/file/reinit.c +++ b/src/backend/storage/file/reinit.c @@ -16,29 +16,50 @@ #include <unistd.h> +#include "access/xlog.h" +#include "catalog/pg_tablespace_d.h" #include "common/relpath.h" +#include "storage/bufmgr.h" #include "storage/copydir.h" #include "storage/fd.h" +#include "storage/md.h" #include "storage/reinit.h" +#include "storage/smgr.h" #include "utils/hsearch.h" #include "utils/memutils.h" static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, - int op); + Oid tspid, int op); static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, - int op); + Oid tspid, Oid dbid, int op); typedef struct { Oid reloid; /* hash key */ -} unlogged_relation_entry; + bool has_init; /* has INIT fork */ + bool dirty_init; /* needs to remove INIT fork */ + bool dirty_all; /* needs to remove all forks */ +} relfile_entry; /* - * Reset unlogged relations from before the last restart. + * Clean up and reset relation files from before the last restart. * - * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any - * relation with an "init" fork, except for the "init" fork itself. + * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations + * depending on the existence of the "cleanup" forks. * + * If SMGR_MARK_UNCOMMITTED mark file for init fork is present, we remove the + * init fork along with the mark file. + * + * If SMGR_MARK_UNCOMMITTED mark file for main fork is present, we remove the + * whole relation along with the mark file. + * + * Otherwise, if the "init" fork is found, we remove all forks of any relation + * with the "init" fork, except for the "init" fork itself. 
+ * + * + * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all + * relations that have the "cleanup" and/or the "init" forks. + * * If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main * fork. */ @@ -68,7 +89,7 @@ ResetUnloggedRelations(int op) /* * First process unlogged files in pg_default ($PGDATA/base) */ - ResetUnloggedRelationsInTablespaceDir("base", op); + ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op); /* * Cycle through directories for all non-default tablespaces. @@ -77,13 +98,19 @@ ResetUnloggedRelations(int op) while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL) { + Oid tspid; + if (strcmp(spc_de->d_name, ".") == 0 || strcmp(spc_de->d_name, "..") == 0) continue; snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s", spc_de->d_name, TABLESPACE_VERSION_DIRECTORY); - ResetUnloggedRelationsInTablespaceDir(temp_path, op); + + tspid = atooid(spc_de->d_name); + Assert(tspid != 0); + + ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op); } FreeDir(spc_dir); @@ -99,7 +126,8 @@ ResetUnloggedRelations(int op) * Process one tablespace directory for ResetUnloggedRelations */ static void -ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) +ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, + Oid tspid, int op) { DIR *ts_dir; struct dirent *de; @@ -126,6 +154,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) while ((de = ReadDir(ts_dir, tsdirname)) != NULL) { + Oid dbid; + /* * We're only interested in the per-database directories, which have * numeric names. Note that this code will also (properly) ignore "." 
@@ -136,7 +166,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) snprintf(dbspace_path, sizeof(dbspace_path), "%s/%s", tsdirname, de->d_name); - ResetUnloggedRelationsInDbspaceDir(dbspace_path, op); + dbid = atooid(de->d_name); + Assert(dbid != 0); + + ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op); } FreeDir(ts_dir); @@ -146,125 +179,228 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) * Process one per-dbspace directory for ResetUnloggedRelations */ static void -ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) +ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, + Oid tspid, Oid dbid, int op) { DIR *dbspace_dir; struct dirent *de; char rm_path[MAXPGPATH * 2]; + HTAB *hash; + HASHCTL ctl; /* Caller must specify at least one operation. */ - Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0); + Assert((op & (UNLOGGED_RELATION_CLEANUP | + UNLOGGED_RELATION_DROP_BUFFER | + UNLOGGED_RELATION_INIT)) != 0); /* * Cleanup is a two-pass operation. First, we go through and identify all * the files with init forks. Then, we go through again and nuke * everything with the same OID except the init fork. */ + + /* + * It's possible that someone could create a ton of unlogged relations + * in the same database & tablespace, so we'd better use a hash table + * rather than an array or linked list to keep track of which files + * need to be reset. Otherwise, this cleanup operation would be + * O(n^2). + */ + memset(&ctl, 0, sizeof(ctl)); + ctl.keysize = sizeof(Oid); + ctl.entrysize = sizeof(relfile_entry); + hash = hash_create("relfilenode cleanup hash", + 32, &ctl, HASH_ELEM | HASH_BLOBS); + + /* Collect INIT fork and mark files in the directory. 
*/ + dbspace_dir = AllocateDir(dbspacedirname); + while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) + { + int oidchars; + ForkNumber forkNum; + StorageMarks mark; + + /* Skip anything that doesn't look like a relation data file. */ + if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, + &forkNum, &mark)) + continue; + + if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED) + { + Oid key; + relfile_entry *ent; + bool found; + + /* + * Record the relfilenode information. If it has + * SMGR_MARK_UNCOMMITTED mark files, the relfilenode is in dirty + * state, where clean up is needed. + */ + key = atooid(de->d_name); + ent = hash_search(hash, &key, HASH_ENTER, &found); + + if (!found) + { + ent->has_init = false; + ent->dirty_init = false; + ent->dirty_all = false; + } + + if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED) + ent->dirty_init = true; + else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED) + ent->dirty_all = true; + else + { + Assert(forkNum == INIT_FORKNUM); + ent->has_init = true; + } + } + } + + /* Done with the first pass. */ + FreeDir(dbspace_dir); + + /* nothing to do if we don't have init nor cleanup forks */ + if (hash_get_num_entries(hash) < 1) + { + hash_destroy(hash); + return; + } + + if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0) + { + /* + * When we come here after recovery, smgr object for this file might + * have been created. In that case we need to drop all buffers then the + * smgr object before initializing the unlogged relation. This is safe + * as far as no other backends have accessed the relation before + * starting archive recovery. 
+ */ + HASH_SEQ_STATUS status; + relfile_entry *ent; + SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8); + int maxrels = 8; + int nrels = 0; + int i; + + Assert(!HotStandbyActive()); + + hash_seq_init(&status, hash); + while((ent = (relfile_entry *) hash_seq_search(&status)) != NULL) + { + RelFileNodeBackend rel; + + /* + * The relation is persistent and stays remain persistent. Don't + * drop the buffers for this relation. + */ + if (ent->has_init && ent->dirty_init) + continue; + + if (maxrels <= nrels) + { + maxrels *= 2; + srels = repalloc(srels, sizeof(SMgrRelation) * maxrels); + } + + rel.backend = InvalidBackendId; + rel.node.spcNode = tspid; + rel.node.dbNode = dbid; + rel.node.relNode = ent->reloid; + + srels[nrels++] = smgropen(rel.node, InvalidBackendId); + } + + DropRelFileNodesAllBuffers(srels, nrels); + + for (i = 0 ; i < nrels ; i++) + smgrclose(srels[i]); + } + + /* + * Now, make a second pass and remove anything that matches. + */ if ((op & UNLOGGED_RELATION_CLEANUP) != 0) { - HTAB *hash; - HASHCTL ctl; - - /* - * It's possible that someone could create a ton of unlogged relations - * in the same database & tablespace, so we'd better use a hash table - * rather than an array or linked list to keep track of which files - * need to be reset. Otherwise, this cleanup operation would be - * O(n^2). - */ - ctl.keysize = sizeof(Oid); - ctl.entrysize = sizeof(unlogged_relation_entry); - ctl.hcxt = CurrentMemoryContext; - hash = hash_create("unlogged relation OIDs", 32, &ctl, - HASH_ELEM | HASH_BLOBS | HASH_CONTEXT); - - /* Scan the directory. */ dbspace_dir = AllocateDir(dbspacedirname); while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) { - ForkNumber forkNum; - int oidchars; - unlogged_relation_entry ent; + ForkNumber forkNum; + StorageMarks mark; + int oidchars; + Oid key; + relfile_entry *ent; + RelFileNodeBackend rel; /* Skip anything that doesn't look like a relation data file. 
*/ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) - continue; - - /* Also skip it unless this is the init fork. */ - if (forkNum != INIT_FORKNUM) - continue; - - /* - * Put the OID portion of the name into the hash table, if it - * isn't already. - */ - ent.reloid = atooid(de->d_name); - (void) hash_search(hash, &ent, HASH_ENTER, NULL); - } - - /* Done with the first pass. */ - FreeDir(dbspace_dir); - - /* - * If we didn't find any init forks, there's no point in continuing; - * we can bail out now. - */ - if (hash_get_num_entries(hash) == 0) - { - hash_destroy(hash); - return; - } - - /* - * Now, make a second pass and remove anything that matches. - */ - dbspace_dir = AllocateDir(dbspacedirname); - while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) - { - ForkNumber forkNum; - int oidchars; - unlogged_relation_entry ent; - - /* Skip anything that doesn't look like a relation data file. */ - if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) - continue; - - /* We never remove the init fork. */ - if (forkNum == INIT_FORKNUM) + &forkNum, &mark)) continue; /* * See whether the OID portion of the name shows up in the hash * table. If so, nuke it! */ - ent.reloid = atooid(de->d_name); - if (hash_search(hash, &ent, HASH_FIND, NULL)) + key = atooid(de->d_name); + ent = hash_search(hash, &key, HASH_FIND, NULL); + + if (!ent) + continue; + + if (!ent->dirty_all) { - snprintf(rm_path, sizeof(rm_path), "%s/%s", - dbspacedirname, de->d_name); - if (unlink(rm_path) < 0) - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not remove file \"%s\": %m", - rm_path))); + /* clean permanent relations don't need cleanup */ + if (!ent->has_init) + continue; + + if (ent->dirty_init) + { + /* + * The crashed transaction did SET UNLOGGED. This relation + * is restored to a LOGGED relation. 
+ */ + if (forkNum != INIT_FORKNUM) + continue; + } else - elog(DEBUG2, "unlinked file \"%s\"", rm_path); + { + /* + * we don't remove the INIT fork of a non-dirty + * relfilenode + */ + if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE) + continue; + } } + + /* so, nuke it! */ + snprintf(rm_path, sizeof(rm_path), "%s/%s", + dbspacedirname, de->d_name); + if (unlink(rm_path) < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not remove file \"%s\": %m", + rm_path))); + + rel.backend = InvalidBackendId; + rel.node.spcNode = tspid; + rel.node.dbNode = dbid; + rel.node.relNode = atooid(de->d_name); + + ForgetRelationForkSyncRequests(rel, forkNum); } /* Cleanup is complete. */ FreeDir(dbspace_dir); - hash_destroy(hash); } + hash_destroy(hash); + hash = NULL; + /* * Initialization happens after cleanup is complete: we copy each init - * fork file to the corresponding main fork file. Note that if we are - * asked to do both cleanup and init, we may never get here: if the - * cleanup code determines that there are no init forks in this dbspace, - * it will return before we get to this point. + * fork file to the corresponding main fork file. */ if ((op & UNLOGGED_RELATION_INIT) != 0) { @@ -273,6 +409,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) { ForkNumber forkNum; + StorageMarks mark; int oidchars; char oidbuf[OIDCHARS + 1]; char srcpath[MAXPGPATH * 2]; @@ -280,9 +417,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) /* Skip anything that doesn't look like a relation data file. */ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) + &forkNum, &mark)) continue; + Assert(mark == SMGR_MARK_NONE); + /* Also skip it unless this is the init fork. 
*/ if (forkNum != INIT_FORKNUM) continue; @@ -316,15 +455,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) { ForkNumber forkNum; + StorageMarks mark; int oidchars; char oidbuf[OIDCHARS + 1]; char mainpath[MAXPGPATH]; /* Skip anything that doesn't look like a relation data file. */ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) + &forkNum, &mark)) continue; + Assert(mark == SMGR_MARK_NONE); + /* Also skip it unless this is the init fork. */ if (forkNum != INIT_FORKNUM) continue; @@ -367,7 +509,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) */ bool parse_filename_for_nontemp_relation(const char *name, int *oidchars, - ForkNumber *fork) + ForkNumber *fork, StorageMarks *mark) { int pos; @@ -398,11 +540,19 @@ parse_filename_for_nontemp_relation(const char *name, int *oidchars, for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar) ; - if (segchar <= 1) - return false; - pos += segchar; + if (segchar > 1) + pos += segchar; } + /* mark file? */ + if (name[pos] == '.' && name[pos + 1] != 0) + { + *mark = name[pos + 1]; + pos += 2; + } + else + *mark = SMGR_MARK_NONE; + /* Now we should be at the end. */ if (name[pos] != '\0') return false; diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index 1e12cfad8e..87a777b307 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -139,7 +139,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno, BlockNumber blkno, bool skipFsync, int behavior); static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg); - +static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum, + StorageMarks mark); /* * mdinit() -- Initialize private state for magnetic disk storage manager. 
@@ -169,6 +170,80 @@ mdexists(SMgrRelation reln, ForkNumber forkNum) return (mdopenfork(reln, forkNum, EXTENSION_RETURN_NULL) != NULL); } +/* + * mdcreatemark() -- Create a mark file. + * + * If isRedo is true, it's okay for the file to exist already. + */ +void +mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark, + bool isRedo) +{ + char *path = markpath(reln->smgr_rnode, forkNum, mark); + int fd; + + /* See mdcreate for details. */ + TablespaceCreateDbspace(reln->smgr_rnode.node.spcNode, + reln->smgr_rnode.node.dbNode, + isRedo); + + fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL); + if (fd < 0 && (!isRedo || errno != EEXIST)) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not create mark file \"%s\": %m", path))); + + pg_fsync(fd); + close(fd); + + /* + * To guarantee that the creation of the file is persistent, fsync its + * parent directory. + */ + fsync_parent_path(path, ERROR); + pfree(path); +} + + +/* + * mdunlinkmark() -- Delete the mark file + * + * If isRedo is true, it's okay if the file is not found. + */ +void +mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark, + bool isRedo) +{ + char *path = markpath(reln->smgr_rnode, forkNum, mark); + + if (!isRedo || mdmarkexists(reln, forkNum, mark)) + durable_unlink(path, ERROR); + + pfree(path); +} + +/* + * mdmarkexists() -- Check if the file exists. + */ +static bool +mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark) +{ + char *path = markpath(reln->smgr_rnode, forkNum, mark); + int fd; + + fd = BasicOpenFile(path, O_RDONLY); + if (fd < 0 && errno != ENOENT) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not access mark file \"%s\": %m", path))); + pfree(path); + + if (fd < 0) + return false; + close(fd); + return true; +} + /* * mdcreate() -- Create a new relation on magnetic disk. 
* @@ -1024,6 +1099,15 @@ register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum, RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ ); } +/* + * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork + */ +void +ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, ForkNumber forknum) +{ + register_forget_request(rnode, forknum, 0); +} + /* * ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB */ @@ -1377,12 +1461,14 @@ mdsyncfiletag(const FileTag *ftag, char *path) * Return 0 on success, -1 on failure, with errno set. */ int -mdunlinkfiletag(const FileTag *ftag, char *path) +mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark) { char *p; /* Compute the path. */ - p = relpathperm(ftag->rnode, MAIN_FORKNUM); + p = GetRelationPath(ftag->rnode.dbNode, ftag->rnode.spcNode, + ftag->rnode.relNode, InvalidBackendId, MAIN_FORKNUM, + mark); strlcpy(path, p, MAXPGPATH); pfree(p); diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c index 4dc24649df..dd3496cf51 100644 --- a/src/backend/storage/smgr/smgr.c +++ b/src/backend/storage/smgr/smgr.c @@ -62,6 +62,10 @@ typedef struct f_smgr void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks); void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum); + void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); + void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); } f_smgr; static const f_smgr smgrsw[] = { @@ -82,6 +86,8 @@ static const f_smgr smgrsw[] = { .smgr_nblocks = mdnblocks, .smgr_truncate = mdtruncate, .smgr_immedsync = mdimmedsync, + .smgr_createmark = mdcreatemark, + .smgr_unlinkmark = mdunlinkmark, } }; @@ -335,6 +341,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo) smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo); } +/* + * smgrcreatemark() -- Create a mark 
file + */ +void +smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark, + bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo); +} + +/* + * smgrunlinkmark() -- Delete a mark file + */ +void +smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark, + bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo); +} + /* * smgrdosyncall() -- Immediately sync all forks of all given relations * @@ -662,6 +688,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum) smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum); } +void +smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo); +} + /* * AtEOXact_SMgr * diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c index 708215614d..a23c03ca3e 100644 --- a/src/backend/storage/sync/sync.c +++ b/src/backend/storage/sync/sync.c @@ -88,7 +88,8 @@ static CycleCtr checkpoint_cycle_ctr = 0; typedef struct SyncOps { int (*sync_syncfiletag) (const FileTag *ftag, char *path); - int (*sync_unlinkfiletag) (const FileTag *ftag, char *path); + int (*sync_unlinkfiletag) (const FileTag *ftag, char *path, + StorageMarks mark); bool (*sync_filetagmatches) (const FileTag *ftag, const FileTag *candidate); } SyncOps; @@ -216,7 +217,8 @@ SyncPostCheckpoint(void) /* Unlink the file */ if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag, - path) < 0) + path, + SMGR_MARK_NONE) < 0) { /* * There's a race condition, when the database is dropped at the @@ -230,6 +232,20 @@ SyncPostCheckpoint(void) (errcode_for_file_access(), errmsg("could not remove file \"%s\": %m", path))); } + else if (syncsw[entry->tag.handler].sync_unlinkfiletag( + &entry->tag, path, + SMGR_MARK_UNCOMMITTED) < 0) + { + /* + * And we may have SMGR_MARK_UNCOMMITTED file. Remove it if the + * fork files has been successfully removed. 
It's ok if the file + * does not exist. + */ + if (errno != ENOENT) + ereport(WARNING, + (errcode_for_file_access(), + errmsg("could not remove file \"%s\": %m", path))); + } /* And remove the list entry */ pendingUnlinks = list_delete_first(pendingUnlinks); diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c index 59ebac7d6a..db6b658489 100644 --- a/src/bin/pg_rewind/parsexlog.c +++ b/src/bin/pg_rewind/parsexlog.c @@ -407,6 +407,30 @@ extractPageInfo(XLogReaderState *record) * source system. */ } + else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK) + { + /* + * We can safely ignore these. When we compare the sizes later on, + * we'll notice that they differ, and copy the missing tail from + * source system. + */ + } + else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK) + { + /* + * We can safely ignore these. When we compare the sizes later on, + * we'll notice that they differ, and copy the missing tail from + * source system. + */ + } + else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE) + { + /* + * We can safely ignore these. When we compare the sizes later on, + * we'll notice that they differ, and copy the missing tail from + * source system. 
+ */ + } else if (rmid == RM_XACT_ID && ((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT || (rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED || diff --git a/src/common/relpath.c b/src/common/relpath.c index 1f5c426ec0..67f24890d6 100644 --- a/src/common/relpath.c +++ b/src/common/relpath.c @@ -139,9 +139,15 @@ GetDatabasePath(Oid dbNode, Oid spcNode) */ char * GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, - int backendId, ForkNumber forkNumber) + int backendId, ForkNumber forkNumber, char mark) { char *path; + char markstr[10]; + + if (mark == 0) + markstr[0] = 0; + else + snprintf(markstr, 10, ".%c", mark); if (spcNode == GLOBALTABLESPACE_OID) { @@ -149,10 +155,10 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, Assert(dbNode == 0); Assert(backendId == InvalidBackendId); if (forkNumber != MAIN_FORKNUM) - path = psprintf("global/%u_%s", - relNode, forkNames[forkNumber]); + path = psprintf("global/%u_%s%s", + relNode, forkNames[forkNumber], markstr); else - path = psprintf("global/%u", relNode); + path = psprintf("global/%u%s", relNode, markstr); } else if (spcNode == DEFAULTTABLESPACE_OID) { @@ -160,22 +166,22 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, if (backendId == InvalidBackendId) { if (forkNumber != MAIN_FORKNUM) - path = psprintf("base/%u/%u_%s", + path = psprintf("base/%u/%u_%s%s", dbNode, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("base/%u/%u", - dbNode, relNode); + path = psprintf("base/%u/%u%s", + dbNode, relNode, markstr); } else { if (forkNumber != MAIN_FORKNUM) - path = psprintf("base/%u/t%d_%u_%s", + path = psprintf("base/%u/t%d_%u_%s%s", dbNode, backendId, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("base/%u/t%d_%u", - dbNode, backendId, relNode); + path = psprintf("base/%u/t%d_%u%s", + dbNode, backendId, relNode, markstr); } } else @@ -184,27 +190,28 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, if 
(backendId == InvalidBackendId) { if (forkNumber != MAIN_FORKNUM) - path = psprintf("pg_tblspc/%u/%s/%u/%u_%s", + path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s", spcNode, TABLESPACE_VERSION_DIRECTORY, dbNode, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("pg_tblspc/%u/%s/%u/%u", + path = psprintf("pg_tblspc/%u/%s/%u/%u%s", spcNode, TABLESPACE_VERSION_DIRECTORY, - dbNode, relNode); + dbNode, relNode, markstr); } else { if (forkNumber != MAIN_FORKNUM) - path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s", + path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s", spcNode, TABLESPACE_VERSION_DIRECTORY, dbNode, backendId, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u", + path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s", spcNode, TABLESPACE_VERSION_DIRECTORY, - dbNode, backendId, relNode); + dbNode, backendId, relNode, markstr); } } + return path; } diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h index 0ab32b44e9..382623159c 100644 --- a/src/include/catalog/storage.h +++ b/src/include/catalog/storage.h @@ -23,6 +23,8 @@ extern int wal_skip_threshold; extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence); +extern void RelationCreateInitFork(Relation rel); +extern void RelationDropInitFork(Relation rel); extern void RelationDropStorage(Relation rel); extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit); extern void RelationPreTruncate(Relation rel); diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h index f0814f1458..12346ed7f6 100644 --- a/src/include/catalog/storage_xlog.h +++ b/src/include/catalog/storage_xlog.h @@ -18,17 +18,23 @@ #include "lib/stringinfo.h" #include "storage/block.h" #include "storage/relfilenode.h" +#include "storage/smgr.h" /* * Declarations for smgr-related XLOG records * - * Note: we log file creation and truncation here, but logging of 
deletion - * actions is handled by xact.c, because it is part of transaction commit. + * Note: we log file creation, truncation and buffer persistence change here, + * but logging of deletion actions is handled mainly by xact.c, because it is + * part of transaction commit in most cases. However, there's a case where + * init forks are deleted outside control of transaction. */ /* XLOG gives us high 4 bits */ #define XLOG_SMGR_CREATE 0x10 #define XLOG_SMGR_TRUNCATE 0x20 +#define XLOG_SMGR_UNLINK 0x30 +#define XLOG_SMGR_MARK 0x40 +#define XLOG_SMGR_BUFPERSISTENCE 0x50 typedef struct xl_smgr_create { @@ -36,6 +42,32 @@ typedef struct xl_smgr_create ForkNumber forkNum; } xl_smgr_create; +typedef struct xl_smgr_unlink +{ + RelFileNode rnode; + ForkNumber forkNum; +} xl_smgr_unlink; + +typedef enum smgr_mark_action +{ + XLOG_SMGR_MARK_CREATE = 'c', + XLOG_SMGR_MARK_UNLINK = 'u' +} smgr_mark_action; + +typedef struct xl_smgr_mark +{ + RelFileNode rnode; + ForkNumber forkNum; + StorageMarks mark; + smgr_mark_action action; +} xl_smgr_mark; + +typedef struct xl_smgr_bufpersistence +{ + RelFileNode rnode; + bool persistence; +} xl_smgr_bufpersistence; + /* flags for xl_smgr_truncate */ #define SMGR_TRUNCATE_HEAP 0x0001 #define SMGR_TRUNCATE_VM 0x0002 @@ -51,6 +83,12 @@ typedef struct xl_smgr_truncate } xl_smgr_truncate; extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum); +extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum); +extern void log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum, + StorageMarks mark); +extern void log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum, + StorageMarks mark); +extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence); extern void smgr_redo(XLogReaderState *record); extern void smgr_desc(StringInfo buf, XLogReaderState *record); diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h index a44be11ca0..106a5cf508 100644 --- 
a/src/include/common/relpath.h +++ b/src/include/common/relpath.h @@ -67,7 +67,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork); extern char *GetDatabasePath(Oid dbNode, Oid spcNode); extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, - int backendId, ForkNumber forkNumber); + int backendId, ForkNumber forkNumber, char mark); /* * Wrapper macros for GetRelationPath. Beware of multiple @@ -77,7 +77,7 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, /* First argument is a RelFileNode */ #define relpathbackend(rnode, backend, forknum) \ GetRelationPath((rnode).dbNode, (rnode).spcNode, (rnode).relNode, \ - backend, forknum) + backend, forknum, 0) /* First argument is a RelFileNode */ #define relpathperm(rnode, forknum) \ @@ -87,4 +87,9 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, #define relpath(rnode, forknum) \ relpathbackend((rnode).node, (rnode).backend, forknum) +/* First argument is a RelFileNodeBackend */ +#define markpath(rnode, forknum, mark) \ + GetRelationPath((rnode).node.dbNode, (rnode).node.spcNode, \ + (rnode).node.relNode, \ + (rnode).backend, forknum, mark) #endif /* RELPATH_H */ diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h index fb00fda6a7..ccb0a388f6 100644 --- a/src/include/storage/bufmgr.h +++ b/src/include/storage/bufmgr.h @@ -205,6 +205,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels) extern void FlushDatabaseBuffers(Oid dbid); extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum, int nforks, BlockNumber *firstDelBlock); +extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel, + bool permanent, bool isRedo); extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes); extern void DropDatabaseBuffers(Oid dbid); diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h index 328473bdc9..485c58e5f1 100644 --- 
a/src/include/storage/fd.h +++ b/src/include/storage/fd.h @@ -167,6 +167,7 @@ extern ssize_t pg_pwritev_with_retry(int fd, extern int pg_truncate(const char *path, off_t length); extern void fsync_fname(const char *fname, bool isdir); extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel); +extern int fsync_parent_path(const char *fname, int elevel); extern int durable_rename(const char *oldfile, const char *newfile, int loglevel); extern int durable_unlink(const char *fname, int loglevel); extern int durable_rename_excl(const char *oldfile, const char *newfile, int loglevel); diff --git a/src/include/storage/md.h b/src/include/storage/md.h index 752b440864..99620816b5 100644 --- a/src/include/storage/md.h +++ b/src/include/storage/md.h @@ -23,6 +23,10 @@ extern void mdinit(void); extern void mdopen(SMgrRelation reln); extern void mdclose(SMgrRelation reln, ForkNumber forknum); +extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); +extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo); extern bool mdexists(SMgrRelation reln, ForkNumber forknum); extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo); @@ -41,12 +45,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks); extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum); +extern void ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, + ForkNumber forknum); extern void ForgetDatabaseSyncRequests(Oid dbid); extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo); /* md sync callbacks */ extern int mdsyncfiletag(const FileTag *ftag, char *path); -extern int mdunlinkfiletag(const FileTag *ftag, char *path); +extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark); extern bool mdfiletagmatches(const FileTag 
*ftag, const FileTag *candidate); #endif /* MD_H */ diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h index fad1e5c473..e1f97e9b89 100644 --- a/src/include/storage/reinit.h +++ b/src/include/storage/reinit.h @@ -16,13 +16,15 @@ #define REINIT_H #include "common/relpath.h" - +#include "storage/smgr.h" extern void ResetUnloggedRelations(int op); -extern bool parse_filename_for_nontemp_relation(const char *name, - int *oidchars, ForkNumber *fork); +extern bool parse_filename_for_nontemp_relation(const char *name, int *oidchars, + ForkNumber *fork, + StorageMarks *mark); #define UNLOGGED_RELATION_CLEANUP 0x0001 -#define UNLOGGED_RELATION_INIT 0x0002 +#define UNLOGGED_RELATION_DROP_BUFFER 0x0002 +#define UNLOGGED_RELATION_INIT 0x0004 #endif /* REINIT_H */ diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h index a6fbf7b6a6..201ecace8a 100644 --- a/src/include/storage/smgr.h +++ b/src/include/storage/smgr.h @@ -18,6 +18,18 @@ #include "storage/block.h" #include "storage/relfilenode.h" +/* + * Storage marks is a file of which existence suggests something about a + * file. The name of such files is "<filename>.<mark>", where the mark is one + * of the values of StorageMarks. Since ".<digit>" means segment files so don't + * use digits for the mark character. + */ +typedef enum StorageMarks +{ + SMGR_MARK_NONE = 0, + SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */ +} StorageMarks; + /* * smgr.c maintains a table of SMgrRelation objects, which are essentially * cached file handles. 
An SMgrRelation is created (if not already present) @@ -85,7 +97,12 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln); extern void smgrclose(SMgrRelation reln); extern void smgrcloseall(void); extern void smgrclosenode(RelFileNodeBackend rnode); +extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); +extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo); +extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo); extern void smgrdosyncall(SMgrRelation *rels, int nrels); extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo); extern void smgrextend(SMgrRelation reln, ForkNumber forknum, -- 2.27.0 From 625bbc0e05a698aa2c19b5fba4947009358bd560 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Wed, 11 Nov 2020 23:21:09 +0900 Subject: [PATCH v6 2/2] New command ALTER TABLE ALL IN TABLESPACE SET LOGGED/UNLOGGED To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes relation persistence of all tables in the specified tablespace. --- src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++++++++++ src/backend/nodes/copyfuncs.c | 16 ++++ src/backend/nodes/equalfuncs.c | 15 ++++ src/backend/parser/gram.y | 20 +++++ src/backend/tcop/utility.c | 11 +++ src/include/commands/tablecmds.h | 2 + src/include/nodes/nodes.h | 1 + src/include/nodes/parsenodes.h | 9 ++ 8 files changed, 214 insertions(+) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 4e2bceffda..26bf8298e9 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -13843,6 +13843,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt) return new_tablespaceoid; } +/* + * Alter Table ALL ... 
SET LOGGED/UNLOGGED + * + * Allows a user to change persistence of all objects in a given tablespace in + * the current database. Objects can be chosen based on the owner of the + * object also, to allow users to change persistene only their objects. The + * main permissions handling is done by the lower-level change persistence + * function. + * + * All to-be-modified objects are locked first. If NOWAIT is specified and the + * lock can't be acquired then we ereport(ERROR). + */ +void +AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt) +{ + List *relations = NIL; + ListCell *l; + ScanKeyData key[1]; + Relation rel; + TableScanDesc scan; + HeapTuple tuple; + Oid tablespaceoid; + List *role_oids = roleSpecsToIds(NIL); + + /* Ensure we were not asked to change something we can't */ + if (stmt->objtype != OBJECT_TABLE) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("only tables can be specified"))); + + /* Get the tablespace OID */ + tablespaceoid = get_tablespace_oid(stmt->tablespacename, false); + + /* + * Now that the checks are done, check if we should set either to + * InvalidOid because it is our database's default tablespace. + */ + if (tablespaceoid == MyDatabaseTableSpace) + tablespaceoid = InvalidOid; + + /* + * Walk the list of objects in the tablespace to pick up them. This will + * only find objects in our database, of course. + */ + ScanKeyInit(&key[0], + Anum_pg_class_reltablespace, + BTEqualStrategyNumber, F_OIDEQ, + ObjectIdGetDatum(tablespaceoid)); + + rel = table_open(RelationRelationId, AccessShareLock); + scan = table_beginscan_catalog(rel, 1, key); + while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL) + { + Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple); + Oid relOid = relForm->oid; + + /* + * Do not pick-up objects in pg_catalog as part of this, if an admin + * really wishes to do so, they can issue the individual ALTER + * commands directly. 
+ * + * Also, explicitly avoid any shared tables, temp tables, or TOAST + * (TOAST will be changed with the main table). + */ + if (IsCatalogNamespace(relForm->relnamespace) || + relForm->relisshared || + isAnyTempNamespace(relForm->relnamespace) || + IsToastNamespace(relForm->relnamespace)) + continue; + + /* Only pick up the object type requested */ + if (relForm->relkind != RELKIND_RELATION) + continue; + + /* Check if we are only picking-up objects owned by certain roles */ + if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner)) + continue; + + /* + * Handle permissions-checking here since we are locking the tables + * and also to avoid doing a bunch of work only to fail part-way. Note + * that permissions will also be checked by AlterTableInternal(). + * + * Caller must be considered an owner on the table of which we're going + * to change persistence. + */ + if (!pg_class_ownercheck(relOid, GetUserId())) + aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)), + NameStr(relForm->relname)); + + if (stmt->nowait && + !ConditionalLockRelationOid(relOid, AccessExclusiveLock)) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_IN_USE), + errmsg("aborting because lock on relation \"%s.%s\" is not available", + get_namespace_name(relForm->relnamespace), + NameStr(relForm->relname)))); + else + LockRelationOid(relOid, AccessExclusiveLock); + + /* + * Add to our list of objects of which we're going to change + * persistence. + */ + relations = lappend_oid(relations, relOid); + } + + table_endscan(scan); + table_close(rel, AccessShareLock); + + if (relations == NIL) + ereport(NOTICE, + (errcode(ERRCODE_NO_DATA_FOUND), + errmsg("no matching relations in tablespace \"%s\" found", + tablespaceoid == InvalidOid ? "(database default)" : + get_tablespace_name(tablespaceoid)))); + + /* + * Everything is locked, loop through and change persistence of all of the + * relations. 
+ */ + foreach(l, relations) + { + List *cmds = NIL; + AlterTableCmd *cmd = makeNode(AlterTableCmd); + + if (stmt->logged) + cmd->subtype = AT_SetLogged; + else + cmd->subtype = AT_SetUnLogged; + + cmds = lappend(cmds, cmd); + + EventTriggerAlterTableStart((Node *) stmt); + /* OID is set by AlterTableInternal */ + AlterTableInternal(lfirst_oid(l), cmds, false); + EventTriggerAlterTableEnd(); + } +} + static void index_copy_data(Relation rel, RelFileNode newrnode) { diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c index 82d7cce5d5..3471b8e2cc 100644 --- a/src/backend/nodes/copyfuncs.c +++ b/src/backend/nodes/copyfuncs.c @@ -4197,6 +4197,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from) return newnode; } +static AlterTableSetLoggedAllStmt * +_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from) +{ + AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt); + + COPY_STRING_FIELD(tablespacename); + COPY_SCALAR_FIELD(objtype); + COPY_SCALAR_FIELD(logged); + COPY_SCALAR_FIELD(nowait); + + return newnode; +} + static CreateExtensionStmt * _copyCreateExtensionStmt(const CreateExtensionStmt *from) { @@ -5503,6 +5516,9 @@ copyObjectImpl(const void *from) case T_AlterTableMoveAllStmt: retval = _copyAlterTableMoveAllStmt(from); break; + case T_AlterTableSetLoggedAllStmt: + retval = _copyAlterTableSetLoggedAllStmt(from); + break; case T_CreateExtensionStmt: retval = _copyCreateExtensionStmt(from); break; diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c index 3e980c457c..d05aef4fde 100644 --- a/src/backend/nodes/equalfuncs.c +++ b/src/backend/nodes/equalfuncs.c @@ -1875,6 +1875,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a, return true; } +static bool +_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a, + const AlterTableSetLoggedAllStmt *b) +{ + COMPARE_STRING_FIELD(tablespacename); + COMPARE_SCALAR_FIELD(objtype); + 
COMPARE_SCALAR_FIELD(logged); + COMPARE_SCALAR_FIELD(nowait); + + return true; +} + static bool _equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b) { @@ -3528,6 +3540,9 @@ equal(const void *a, const void *b) case T_AlterTableMoveAllStmt: retval = _equalAlterTableMoveAllStmt(a, b); break; + case T_AlterTableSetLoggedAllStmt: + retval = _equalAlterTableSetLoggedAllStmt(a, b); + break; case T_CreateExtensionStmt: retval = _equalCreateExtensionStmt(a, b); break; diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index bc43641ffe..5c3fd1998e 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -1948,6 +1948,26 @@ AlterTableStmt: n->nowait = $13; $$ = (Node *)n; } + | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->logged = true; + n->nowait = $9; + $$ = (Node *)n; + } + | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->logged = false; + n->nowait = $9; + $$ = (Node *)n; + } | ALTER INDEX qualified_name alter_table_cmds { AlterTableStmt *n = makeNode(AlterTableStmt); diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c index 05bb698cf4..3c18312367 100644 --- a/src/backend/tcop/utility.c +++ b/src/backend/tcop/utility.c @@ -161,6 +161,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree) case T_AlterTSConfigurationStmt: case T_AlterTSDictionaryStmt: case T_AlterTableMoveAllStmt: + case T_AlterTableSetLoggedAllStmt: case T_AlterTableSpaceOptionsStmt: case T_AlterTableStmt: case T_AlterTypeStmt: @@ -1694,6 +1695,12 @@ ProcessUtilitySlow(ParseState *pstate, commandCollected = true; break; + case T_AlterTableSetLoggedAllStmt: + AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree); + 
/* commands are stashed in AlterTableSetLoggedAll */ + commandCollected = true; + break; + case T_DropStmt: ExecDropStmt((DropStmt *) parsetree, isTopLevel); /* no commands stashed for DROP */ @@ -2582,6 +2589,10 @@ CreateCommandTag(Node *parsetree) tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype); break; + case T_AlterTableSetLoggedAllStmt: + tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype); + break; + case T_AlterTableStmt: tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype); break; diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h index b3d30acc35..b4af2db6f0 100644 --- a/src/include/commands/tablecmds.h +++ b/src/include/commands/tablecmds.h @@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse); extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt); +extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt); + extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt, Oid *oldschema); diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h index e22df890ef..91dfc77978 100644 --- a/src/include/nodes/nodes.h +++ b/src/include/nodes/nodes.h @@ -427,6 +427,7 @@ typedef enum NodeTag T_AlterCollationStmt, T_CallStmt, T_AlterStatsStmt, + T_AlterTableSetLoggedAllStmt, /* * TAGS FOR PARSE TREE NODES (parsenodes.h) diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h index 68425eb2c0..b9b75dc45b 100644 --- a/src/include/nodes/parsenodes.h +++ b/src/include/nodes/parsenodes.h @@ -2293,6 +2293,15 @@ typedef struct AlterTableMoveAllStmt bool nowait; } AlterTableMoveAllStmt; +typedef struct AlterTableSetLoggedAllStmt +{ + NodeTag type; + char *tablespacename; + ObjectType objtype; /* Object type to move */ + bool logged; + bool nowait; +} AlterTableSetLoggedAllStmt; + /* ---------------------- * Create/Alter Extension Statements * ---------------------- -- 2.27.0
At Thu, 25 Mar 2021 14:08:05 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> (I'm not sure when the subject was broken..)
>
> At Thu, 14 Jan 2021 17:32:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > Commit bea449c635 conflicts with this on the change of the definition
> > of DropRelFileNodeBuffers. The change simplified this patch by a bit:p
>
> In this version, I got rid of the "CLEANUP FORK"s, and added a new
> system "Smgr marks". The mark files have the name of the
> corresponding fork file followed by ".u" (which means Uncommitted).
> "Uncommitted"-marked main fork means the same as the CLEANUP2_FORKNUM
> and uncommitted-marked init fork means the same as the CLEANUP_FORKNUM
> in the previous version.
>
> I noticed that the previous version of the patch still leaves an
> orphan main fork file after "BEGIN; CREATE TABLE x; ROLLBACK; (crash
> before checkpoint)" since the "mark" file (or CLEANUP2_FORKNUM) is
> removed at rollback. In this version the responsibility to remove the
> mark files is moved to SyncPostCheckpoint, where main fork files are
> actually removed.

For the record, I noticed that basebackup could be confused by the mark
files, but I haven't looked into that yet.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
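The naming rule described above (fork file name followed by ".u", with ".<digits>" reserved for segment files) can be illustrated with a small sketch. This is not the patch's GetRelationPath() or parse_filename_for_nontemp_relation(); the helper names build_mark_name and mark_of and the MARK_* constants are hypothetical stand-ins, and the parsing is deliberately simpler than the real code.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the patch's StorageMarks values. */
#define MARK_NONE        '\0'
#define MARK_UNCOMMITTED 'u'

/*
 * Append ".<mark>" to a fork file name, e.g. "16384_init" -> "16384_init.u".
 * With MARK_NONE the name is copied unchanged.
 * Returns false if the result does not fit into buf.
 */
static bool
build_mark_name(char *buf, size_t buflen, const char *forkfile, char mark)
{
    int     n;

    assert(buf != NULL && forkfile != NULL);

    if (mark == MARK_NONE)
        n = snprintf(buf, buflen, "%s", forkfile);
    else
        n = snprintf(buf, buflen, "%s.%c", forkfile, mark);

    return n >= 0 && (size_t) n < buflen;
}

/*
 * Return the mark character of a file name, or MARK_NONE.
 * A mark is a single non-digit character after the last '.';
 * a ".<digits>" suffix is a segment number, not a mark.
 */
static char
mark_of(const char *name)
{
    const char *dot = strrchr(name, '.');

    if (dot == NULL || dot[1] == '\0' || dot[2] != '\0')
        return MARK_NONE;
    if (isdigit((unsigned char) dot[1]))
        return MARK_NONE;
    return dot[1];
}
```

Keeping digits out of the mark alphabet is what lets a name such as "16384.1.u" be read unambiguously as segment 1 of relfilenode 16384 carrying the uncommitted mark.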
> Kyotaro wrote:
> > In this version, I got rid of the "CLEANUP FORK"s, and added a new
> > system "Smgr marks". The mark files have the name of the
> > corresponding fork file followed by ".u" (which means Uncommitted).
> > "Uncommitted"-marked main fork means the same as the CLEANUP2_FORKNUM
> > and uncommitted-marked init fork means the same as the CLEANUP_FORKNUM
> > in the previous version.
> >
> > I noticed that the previous version of the patch still leaves an
> > orphan main fork file after "BEGIN; CREATE TABLE x; ROLLBACK; (crash
> > before checkpoint)" since the "mark" file (or CLEANUP2_FORKNUM) is
> > removed at rollback. In this version the responsibility to remove the
> > mark files is moved to SyncPostCheckpoint, where main fork files are
> > actually removed.
>
> For the record, I noticed that basebackup could be confused by the mark files
> but I haven't looked into that yet.

Good morning Kyotaro,

The patch didn't apply cleanly (it's from March; some hunks were failing), so I've fixed it and the combined git format-patch is attached. It conflicted with the following commits:

b0483263dda - Add support for SET ACCESS METHOD in ALTER TABLE
7b565843a94 - Add call to object access hook at the end of table rewrite in ALTER TABLE
9ce346eabf3 - Report progress of startup operations that take a long time.
f10f0ae420 - Replace RelationOpenSmgr() with RelationGetSmgr().

I'm especially worried that I may have screwed up or forgotten something related to the last one (rd->rd_smgr changes), but I'm getting "All 210 tests passed".
Basic demonstration of this patch (with wal_level=minimal):

    create unlogged table t6 (id bigint, t text);
    -- produces 110GB table, takes ~5mins
    insert into t6 select nextval('s1'), repeat('A', 1000) from generate_series(1, 100000000);
    alter table t6 set logged;

    on baseline SET LOGGED takes: ~7min10s
    on patched SET LOGGED takes: 25s

So basically one can, thanks to this patch, use one's application (performing classic INSERTs/UPDATEs/DELETEs, so without the need to rewrite it to use COPY), perform what is effectively a batch upload, and then just switch the tables to LOGGED.

Some more intensive testing also looks good, assuming a table prepared to put pressure on WAL:

    create unlogged table t_unlogged (id bigint, t text) partition by hash (id);
    create unlogged table t_unlogged_h0 partition of t_unlogged FOR VALUES WITH (modulus 4, remainder 0);
    [..]
    create unlogged table t_unlogged_h3 partition of t_unlogged FOR VALUES WITH (modulus 4, remainder 3);

The workload would still be pretty heavy on LWLock/BufferContent, WALInsert and Lock/extend.

    t_logged.sql = insert into t_logged select nextval('s1'), repeat('A', 1000) from generate_series(1, 1000); # according to pg_stat_wal.wal_bytes generates ~1MB of WAL
    t_unlogged.sql = insert into t_unlogged select nextval('s1'), repeat('A', 1000) from generate_series(1, 1000); # according to pg_stat_wal.wal_bytes generates ~3kB of WAL

so using: pgbench -f <tabletypetest>.sql -T 30 -P 1 -c 32 -j 3 t

with synchronous_commit = ON (default):
    with t_logged.sql:   tps = 229 (lat avg = 138ms)
    with t_unlogged.sql: tps = 283 (lat avg = 112ms) # almost all on LWLock/WALWrite
with synchronous_commit = OFF:
    with t_logged.sql:   tps = 413 (lat avg = 77ms)
    with t_unlogged.sql: tps = 782 (lat avg = 40ms)

Afterwards, switching the unlogged ~16GB partitions takes 5s per partition.

As the thread didn't get a lot of traction, I've registered it in the current commitfest https://commitfest.postgresql.org/36/3461/ with you as the author and in 'Ready for review' state.
I think it behaves like an almost finished patch, and after reading all those discussions about this feature that go back more than 10 years, and the many failed efforts towards wal_level=noWAL, I think it would be nice to finally start getting some of it into core.

-Jakub Wartak.
On Fri, Dec 17, 2021 at 09:10:30AM +0000, Jakub Wartak wrote:
> I'm especially worried if I didn't screw up something/forgot something related to the last one (rd->rd_smgr changes), but I'm getting "All 210 tests passed".

> As the thread didn't get a lot of traction, I've registered it into current commitfest https://commitfest.postgresql.org/36/3461/ with You as the author and in 'Ready for review' state.
> I think it behaves as almost finished one and apparently after reading all those discussions that go back over 10years+ time span about this feature, and lot of failed effort towards wal_level=noWAL I think it would be nice to finally start getting some of that of it into the core.

The patch is failing:
http://cfbot.cputube.org/kyotaro-horiguchi.html
https://api.cirrus-ci.com/v1/artifact/task/5564333871595520/regress_diffs/src/bin/pg_upgrade/tmp_check/regress/regression.diffs

I think you ran "make check", but should run something like this:
make check-world -j8 >check-world.log 2>&1 && echo Success

--
Justin
> Justin wrote:
> On Fri, Dec 17, 2021 at 09:10:30AM +0000, Jakub Wartak wrote:
> > As the thread didn't get a lot of traction, I've registered it into current
> > commitfest https://commitfest.postgresql.org/36/3461/ with You as the
> > author and in 'Ready for review' state.
>
> The patch is failing:
[..]
> I think you ran "make check", but should run something like this:
> make check-world -j8 >check-world.log 2>&1 && echo Success

Hi Justin,

I've repeated the check-world run and it reports that everything is OK locally (also with --enable-cassert --enable-debug, at least on Amazon Linux 2), and installcheck with default parameters also seems to be OK.
I don't understand why the test farm reports errors for tests like "path, polygon, rowsecurity", e.g. on Linux/graviton2 and FreeBSD, but not on MacOS(?).
Could someone point me to where to start looking/fixing?

-J.
On Fri, Dec 17, 2021 at 02:33:25PM +0000, Jakub Wartak wrote:
> > Justin wrote:
> > On Fri, Dec 17, 2021 at 09:10:30AM +0000, Jakub Wartak wrote:
> > > As the thread didn't get a lot of traction, I've registered it into current
> > > commitfest https://commitfest.postgresql.org/36/3461/ with You as the
> > > author and in 'Ready for review' state.
> >
> > The patch is failing:
> [..]
> > I think you ran "make check", but should run something like this:
> > make check-world -j8 >check-world.log 2>&1 && echo Success
>
> Hi Justin,
>
> I've repeated the check-world and it says to me all is ok locally (also with --enable-cassert --enable-debug , at least on Amazon Linux 2) and also installcheck on default params seems to be ok
> I don't seem to understand why testfarm reports errors for tests like "path, polygon, rowsecurity" e.g. on Linux/graviton2 and FreeBSD while not on MacOS(?) .
> Could someone point to me where to start looking/fixing?

Since it says this, it looks a lot like a memory error such as a use-after-free, for example in fsync_parent_path():

 CREATE TABLE PATH_TBL (f1 path);
+ERROR: could not open file <....> Pacific": No such file or directory

I see at least this one is still failing, though:

time make -C src/test/recovery check
Hello, Jakub.

At Fri, 17 Dec 2021 09:10:30 +0000, Jakub Wartak <Jakub.Wartak@tomtom.com> wrote in
> the patch didn't apply clean (it's from March; some hunks were failing), so I've fixed it and the combined git format-patch is attached. It did conflict with the following:

Thanks for looking at this. Also thanks to Justin for finding the silly use-after-free bug. (Now I see that the regression test fails, and I'm not sure how I didn't find this before.)

> b0483263dda - Add support for SET ACCESS METHOD in ALTER TABLE
> 7b565843a94 - Add call to object access hook at the end of table rewrite in ALTER TABLE
> 9ce346eabf3 - Report progress of startup operations that take a long time.
> f10f0ae420 - Replace RelationOpenSmgr() with RelationGetSmgr().
>
> I'm especially worried if I didn't screw up something/forgot something related to the last one (rd->rd_smgr changes), but I'm getting "All 210 tests passed".

About the last one, all rel->rd_smgr accesses need to be replaced with RelationGetSmgr(). On the other hand, we can simply remove the RelationOpenSmgr() calls, since the target smgrrelation is guaranteed to be loaded by RelationGetSmgr().

The fix you made for RelationCreate/DropInitFork is correct and the changes you made would work, but I prefer that the code not be too permissive about unknown (or unexpected) states.

> Basic demonstration of this patch (with wal_level=minimal):
> create unlogged table t6 (id bigint, t text);
> -- produces 110GB table, takes ~5mins
> insert into t6 select nextval('s1'), repeat('A', 1000) from generate_series(1, 100000000);
> alter table t6 set logged;
> on baseline SET LOGGED takes: ~7min10s
> on patched SET LOGGED takes: 25s
>
> So basically one can - thanks to this patch - use his application (performing classic INSERTs/UPDATEs/DELETEs, so without the need to rewrite to use COPY) and perform literally batch upload and then just switch the tables to LOGGED.

This result is significant.
That operation eventually requires WAL writes, but I was not sure how much gain FPIs (or bulk WAL logging) give in comparison to operational WAL.

> Some more intensive testing also looks good, assuming table prepared to put pressure on WAL:
> create unlogged table t_unlogged (id bigint, t text) partition by hash (id);
> create unlogged table t_unlogged_h0 partition of t_unlogged FOR VALUES WITH (modulus 4, remainder 0);
> [..]
> create unlogged table t_unlogged_h3 partition of t_unlogged FOR VALUES WITH (modulus 4, remainder 3);
>
> Workload would still be pretty heavy on LWLock/BufferContent,WALInsert and Lock/extend .
> t_logged.sql = insert into t_logged select nextval('s1'), repeat('A', 1000) from generate_series(1, 1000); # according to pg_wal_stats.wal_bytes generates ~1MB of WAL
> t_unlogged.sql = insert into t_unlogged select nextval('s1'), repeat('A', 1000) from generate_series(1, 1000); # according to pg_wal_stats.wal_bytes generates ~3kB of WAL
>
> so using: pgbench -f <tabletypetest>.sql -T 30 -P 1 -c 32 -j 3 t
> with synchronous_commit =ON(default):
> with t_logged.sql: tps = 229 (lat avg = 138ms)
> with t_unlogged.sql tps = 283 (lat avg = 112ms) # almost all on LWLock/WALWrite
> with synchronous_commit =OFF:
> with t_logged.sql: tps = 413 (lat avg = 77ms)
> with t_unloged.sql: tps = 782 (lat avg = 40ms)
> Afterwards switching the unlogged ~16GB partitions takes 5s per partition.
>
> As the thread didn't get a lot of traction, I've registered it into current commitfest https://commitfest.postgresql.org/36/3461/ with You as the author and in 'Ready for review' state.
>
> I think it behaves as almost finished one and apparently after reading all those discussions that go back over 10years+ time span about this feature, and lot of failed effort towards wal_level=noWAL I think it would be nice to finally start getting some of that of it into the core.

Thanks for taking the performance benchmark.

I didn't register it myself for several reasons.

1. I'm not sure that we want to have the new mark files.

2. Aside from possible bugs, I'm not confident that the crash-safety of this patch is actually watertight. At least we need tests for some failure cases.

3. As mentioned in transam/README, failure to remove smgr mark files leads to an immediate shutdown. I'm not sure this behavior is acceptable.

4. Including the reasons above, this is not fully functional. For example, if we execute the following commands on the primary, the replica doesn't work correctly. (boom!)

   =# CREATE UNLOGGED TABLE t (a int);
   =# ALTER TABLE t SET LOGGED;

The following fixes are done in the attached v8.

- Rebased. Referring to Jakub and Justin's work, I replaced direct access to ->rd_smgr with RelationGetSmgr() and removed calls to RelationOpenSmgr(). I still keep the "ALTER TABLE ALL IN TABLESPACE SET LOGGED/UNLOGGED" statement part separate.

- Fixed RelationCreate/DropInitFork's behavior for non-target relations. (From Jakub's work).

- Fixed wording of some comments.

- On revisiting, I found a bug around recovery. If the logged-ness of a relation gets flipped repeatedly in a transaction, duplicate pending-delete entries are accumulated during recovery and work in a wrong way. smgr_redo now adds at most one entry per action.

- Issue 4 above is not fixed (yet).

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

From c665734c9e056e80a0d56281011b95e55ea14507 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v8 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data rewriting, currently it runs heap rewrite which causes large amount of file I/O. This patch makes the command run without heap rewrite. Addition to that, SET LOGGED while wal_level > minimal emits WAL using XLOG_FPI instead of massive number of HEAP_INSERT's, which should be smaller.
Also this allows for the cleanup of files left behind in the crash of the transaction that created it. --- src/backend/access/rmgrdesc/smgrdesc.c | 52 +++ src/backend/access/transam/README | 8 + src/backend/access/transam/xlog.c | 17 + src/backend/catalog/storage.c | 539 +++++++++++++++++++++++-- src/backend/commands/tablecmds.c | 245 +++++++++-- src/backend/replication/basebackup.c | 3 +- src/backend/storage/buffer/bufmgr.c | 88 ++++ src/backend/storage/file/fd.c | 4 +- src/backend/storage/file/reinit.c | 345 +++++++++++----- src/backend/storage/smgr/md.c | 93 ++++- src/backend/storage/smgr/smgr.c | 32 ++ src/backend/storage/sync/sync.c | 20 +- src/bin/pg_rewind/parsexlog.c | 24 ++ src/common/relpath.c | 47 ++- src/include/catalog/storage.h | 2 + src/include/catalog/storage_xlog.h | 42 +- src/include/common/relpath.h | 9 +- src/include/storage/bufmgr.h | 2 + src/include/storage/fd.h | 1 + src/include/storage/md.h | 8 +- src/include/storage/reinit.h | 10 +- src/include/storage/smgr.h | 17 + 22 files changed, 1401 insertions(+), 207 deletions(-) diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c index 7755553d57..d251f22207 100644 --- a/src/backend/access/rmgrdesc/smgrdesc.c +++ b/src/backend/access/rmgrdesc/smgrdesc.c @@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record) xlrec->blkno, xlrec->flags); pfree(path); } + else if (info == XLOG_SMGR_UNLINK) + { + xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec; + char *path = relpathperm(xlrec->rnode, xlrec->forkNum); + + appendStringInfoString(buf, path); + pfree(path); + } + else if (info == XLOG_SMGR_MARK) + { + xl_smgr_mark *xlrec = (xl_smgr_mark *) rec; + char *path = GetRelationPath(xlrec->rnode.dbNode, + xlrec->rnode.spcNode, + xlrec->rnode.relNode, + InvalidBackendId, + xlrec->forkNum, xlrec->mark); + char *action; + + switch (xlrec->action) + { + case XLOG_SMGR_MARK_CREATE: + action = "CREATE"; + break; + case XLOG_SMGR_MARK_UNLINK: + action = "DELETE"; + 
break; + default: + action = "<unknown action>"; + break; + } + + appendStringInfo(buf, "%s %s", action, path); + pfree(path); + } + else if (info == XLOG_SMGR_BUFPERSISTENCE) + { + xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec; + char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM); + + appendStringInfoString(buf, path); + appendStringInfo(buf, " persistence %d", xlrec->persistence); + pfree(path); + } } const char * @@ -55,6 +98,15 @@ smgr_identify(uint8 info) case XLOG_SMGR_TRUNCATE: id = "TRUNCATE"; break; + case XLOG_SMGR_UNLINK: + id = "UNLINK"; + break; + case XLOG_SMGR_MARK: + id = "MARK"; + break; + case XLOG_SMGR_BUFPERSISTENCE: + id = "BUFPERSISTENCE"; + break; } return id; diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README index 1edc8180c1..b344bbe511 100644 --- a/src/backend/access/transam/README +++ b/src/backend/access/transam/README @@ -724,6 +724,14 @@ we must panic and abort recovery. The DBA will have to manually clean up then restart recovery. This is part of the reason for not writing a WAL entry until we've successfully done the original action. +The Smgr MARK files +-------------------------------- + +An smgr mark file is created when a new relation file is created to +mark the relfilenode needs to be cleaned up at recovery time. In +contrast to the four actions above, failure to remove smgr mark files +will lead to data loss, in which case the server will shut down. 
+ Skipping WAL for New RelFileNode -------------------------------- diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 1e1fbe957f..59f4c2eacf 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -40,6 +40,7 @@ #include "catalog/catversion.h" #include "catalog/pg_control.h" #include "catalog/pg_database.h" +#include "catalog/storage.h" #include "commands/progress.h" #include "commands/tablespace.h" #include "common/controldata_utils.h" @@ -4564,6 +4565,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode, { ereport(DEBUG1, (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery"))); + + /* cleanup garbage files left during crash recovery */ + ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER | + UNLOGGED_RELATION_CLEANUP); + + /* run rollback cleanup if any */ + smgrDoPendingDeletes(false); + InArchiveRecovery = true; if (StandbyModeRequested) StandbyMode = true; @@ -7824,6 +7833,14 @@ StartupXLOG(void) } } + /* cleanup garbage files left during crash recovery */ + if (!InArchiveRecovery) + ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER | + UNLOGGED_RELATION_CLEANUP); + + /* run rollback cleanup if any */ + smgrDoPendingDeletes(false); + /* Allow resource managers to do any required cleanup. 
*/ for (rmid = 0; rmid <= RM_MAX_ID; rmid++) { diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c index c5ad28d71f..f2bcc12958 100644 --- a/src/backend/catalog/storage.c +++ b/src/backend/catalog/storage.c @@ -19,6 +19,7 @@ #include "postgres.h" +#include "access/amapi.h" #include "access/parallel.h" #include "access/visibilitymap.h" #include "access/xact.h" @@ -27,6 +28,7 @@ #include "access/xlogutils.h" #include "catalog/storage.h" #include "catalog/storage_xlog.h" +#include "common/hashfn.h" #include "miscadmin.h" #include "storage/freespace.h" #include "storage/smgr.h" @@ -57,9 +59,18 @@ int wal_skip_threshold = 2048; /* in kilobytes */ * but I'm being paranoid. */ +#define PDOP_DELETE (1 << 0) +#define PDOP_UNLINK_FORK (1 << 1) +#define PDOP_UNLINK_MARK (1 << 2) +#define PDOP_SET_PERSISTENCE (1 << 3) + typedef struct PendingRelDelete { RelFileNode relnode; /* relation that may need to be deleted */ + int op; /* operation mask */ + bool bufpersistence; /* buffer persistence to set */ + int unlink_forknum; /* forknum to unlink */ + StorageMarks unlink_mark; /* mark to unlink */ BackendId backend; /* InvalidBackendId if not a temp rel */ bool atCommit; /* T=delete at commit; F=delete at abort */ int nestLevel; /* xact nesting level of request */ @@ -75,6 +86,24 @@ typedef struct PendingRelSync static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */ HTAB *pendingSyncHash = NULL; +typedef struct SRelHashEntry +{ + SMgrRelation srel; + char status; /* for simplehash use */ +} SRelHashEntry; + +/* define hashtable for workarea for pending deletes */ +#define SH_PREFIX srelhash +#define SH_ELEMENT_TYPE SRelHashEntry +#define SH_KEY_TYPE SMgrRelation +#define SH_KEY srel +#define SH_HASH_KEY(tb, key) \ + hash_bytes((unsigned char *)&key, sizeof(SMgrRelation)) +#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(SMgrRelation)) == 0) +#define SH_SCOPE static inline +#define SH_DEFINE +#define SH_DECLARE +#include "lib/simplehash.h" 
/* * AddPendingSync @@ -143,22 +172,48 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) return NULL; /* placate compiler */ } + /* + * We are going to create a new storage file. If server crashes before the + * current transaction ends the file needs to be cleaned up but there's no + * clue to the orphan files. The SMGR_MARK_UNCOMMITED mark file works as + * the signal of that situation. + */ srel = smgropen(rnode, backend); + log_smgrcreatemark(&rnode, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED); + smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false); smgrcreate(srel, MAIN_FORKNUM, false); if (needs_wal) log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM); - /* Add the relation to the list of stuff to delete at abort */ + /* + * Add the relation to the list of stuff to delete at abort. We don't + * remove the mark file at commit. It needs to persists until the main fork + * file is actually deleted. See SyncPostCheckpoint. + */ pending = (PendingRelDelete *) MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); pending->relnode = rnode; + pending->op = PDOP_DELETE; pending->backend = backend; pending->atCommit = false; /* delete if abort */ pending->nestLevel = GetCurrentTransactionNestLevel(); pending->next = pendingDeletes; pendingDeletes = pending; + /* drop cleanup fork at commit */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_MARK; + pending->unlink_forknum = MAIN_FORKNUM; + pending->unlink_mark = SMGR_MARK_UNCOMMITTED; + pending->backend = backend; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded()) { Assert(backend == InvalidBackendId); @@ -168,6 +223,226 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) return srel; } +/* + * 
RelationCreateInitFork + * Create physical storage for the init fork of a relation. + * + * Create the init fork for the relation. + * + * This function is transactional. The creation is WAL-logged, and if the + * transaction aborts later on, the init fork will be removed. + */ +void +RelationCreateInitFork(Relation rel) +{ + RelFileNode rnode = rel->rd_node; + PendingRelDelete *pending; + SMgrRelation srel; + PendingRelDelete *prev; + PendingRelDelete *next; + bool create = true; + + /* switch buffer persistence */ + SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false); + + /* + * If we have entries for init-fork operations on this relation, that means + * that we have already registered pending delete entries to drop + * preexisting init-fork since before the current transaction started. This + * function reverts that change just by removing the entries. + */ + prev = NULL; + for (pending = pendingDeletes; pending != NULL; pending = next) + { + next = pending->next; + + /* + * We don't touch unrelated entries. Although init-fork related entries + * are not useful if the relation is created or dropped in this + * transaction, we don't bother to avoid registering entries for such + * relations here. + */ + if (!RelFileNodeEquals(rnode, pending->relnode) || + (pending->op & PDOP_DELETE) != 0 || + pending->unlink_forknum != INIT_FORKNUM) + { + prev = pending; + continue; + } + + /* make sure the entry is what we're expecting here */ + Assert(((pending->op & (PDOP_UNLINK_FORK|PDOP_UNLINK_MARK)) != 0 && + pending->unlink_forknum == INIT_FORKNUM) || + (pending->op & PDOP_SET_PERSISTENCE) != 0); + + /* unlink list entry */ + if (prev) + prev->next = next; + else + pendingDeletes = next; + pfree(pending); + + create = false; + } + + if (!create) + return; + + /* + * We are going to create an init fork. If server crashes before the + * current transaction ends the init fork left alone corrupts data while + * recovery. 
The mark file works as the sentinel to identify that + * situation. + */ + srel = smgropen(rnode, InvalidBackendId); + log_smgrcreatemark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED); + smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false); + + /* We don't have existing init fork, create it. */ + smgrcreate(srel, INIT_FORKNUM, false); + + /* + * index-init fork needs further initialization. ambuildempty shoud do + * WAL-log and file sync by itself but otherwise we do that by ourselves. + */ + if (rel->rd_rel->relkind == RELKIND_INDEX) + rel->rd_indam->ambuildempty(rel); + else + { + log_smgrcreate(&rnode, INIT_FORKNUM); + smgrimmedsync(srel, INIT_FORKNUM); + } + + /* drop the init fork, mark file and revert persistence at abort */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_FORK | PDOP_UNLINK_MARK | PDOP_SET_PERSISTENCE; + pending->unlink_forknum = INIT_FORKNUM; + pending->unlink_mark = SMGR_MARK_UNCOMMITTED; + pending->bufpersistence = true; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + + /* drop mark file at commit */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_MARK; + pending->unlink_forknum = INIT_FORKNUM; + pending->unlink_mark = SMGR_MARK_UNCOMMITTED; + pending->backend = InvalidBackendId; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; +} + +/* + * RelationDropInitFork + * Delete physical storage for the init fork of a relation. + * + * Register pending-delete of the init fork. The real deletion is performed by + * smgrDoPendingDeletes at commit. + * + * This function is transactional. 
If the transaction aborts later on, the + * deletion doesn't happen. + */ +void +RelationDropInitFork(Relation rel) +{ + RelFileNode rnode = rel->rd_node; + PendingRelDelete *pending; + PendingRelDelete *prev; + PendingRelDelete *next; + bool inxact_created = false; + + /* switch buffer persistence */ + SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false); + + /* + * If we have entries for init-fork operations of this relation, that means + * that we have created the init fork in the current transaction. We + * remove the init fork and mark file immediately in that case. Otherwise + * just register pending-delete for the existing init fork. + */ + prev = NULL; + for (pending = pendingDeletes; pending != NULL; pending = next) + { + next = pending->next; + + /* + * We don't touch unrelated entries. Although init-fork related entries + * are not useful if the relation is created or dropped in this + * transaction, we don't bother to avoid registering entries for such + * relations here. + */ + if (!RelFileNodeEquals(rnode, pending->relnode) || + (pending->op & PDOP_DELETE) != 0 || + pending->unlink_forknum != INIT_FORKNUM)) + { + prev = pending; + continue; + } + + /* make sure the entry is what we're expecting here */ + Assert(((pending->op & (PDOP_UNLINK_FORK|PDOP_UNLINK_MARK)) != 0 && + pending->unlink_forknum == INIT_FORKNUM) || + (pending->op & PDOP_SET_PERSISTENCE) != 0); + + /* unlink list entry */ + if (prev) + prev->next = next; + else + pendingDeletes = next; + pfree(pending); + + inxact_created = true; + } + + if (inxact_created) + { + SMgrRelation srel = smgropen(rnode, InvalidBackendId); + + /* + * INIT forks never be loaded to shared buffer so no point in dropping + * buffers for such files. 
+ */ + log_smgrunlinkmark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED); + smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false); + log_smgrunlink(&rnode, INIT_FORKNUM); + smgrunlink(srel, INIT_FORKNUM, false); + return; + } + + /* register drop of this init fork file at commit */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_FORK; + pending->unlink_forknum = INIT_FORKNUM; + pending->backend = InvalidBackendId; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + + /* revert buffer-persistence changes at abort */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_SET_PERSISTENCE; + pending->bufpersistence = false; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; +} + /* * Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL. */ @@ -187,6 +462,88 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum) XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE); } +/* + * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL. + */ +void +log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum) +{ + xl_smgr_unlink xlrec; + + /* + * Make an XLOG entry reporting the file unlink. + */ + xlrec.rnode = *rnode; + xlrec.forkNum = forkNum; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE); +} + +/* + * Perform XLogInsert of an XLOG_SMGR_CREATEMARK record to WAL. 
+ */ +void +log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum, + StorageMarks mark) +{ + xl_smgr_mark xlrec; + + /* + * Make an XLOG entry reporting the file creation. + */ + xlrec.rnode = *rnode; + xlrec.forkNum = forkNum; + xlrec.mark = mark; + xlrec.action = XLOG_SMGR_MARK_CREATE; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE); +} + +/* + * Perform XLogInsert of an XLOG_SMGR_UNLINKMARK record to WAL. + */ +void +log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum, + StorageMarks mark) +{ + xl_smgr_mark xlrec; + + /* + * Make an XLOG entry reporting the file creation. + */ + xlrec.rnode = *rnode; + xlrec.forkNum = forkNum; + xlrec.mark = mark; + xlrec.action = XLOG_SMGR_MARK_UNLINK; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE); +} + +/* + * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL. + */ +void +log_smgrbufpersistence(const RelFileNode *rnode, bool persistence) +{ + xl_smgr_bufpersistence xlrec; + + /* + * Make an XLOG entry reporting the change of buffer persistence. + */ + xlrec.rnode = *rnode; + xlrec.persistence = persistence; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE); +} + /* * RelationDropStorage * Schedule unlinking of physical storage at transaction commit. 
@@ -200,6 +557,7 @@ RelationDropStorage(Relation rel) pending = (PendingRelDelete *) MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); pending->relnode = rel->rd_node; + pending->op = PDOP_DELETE; pending->backend = rel->rd_backend; pending->atCommit = true; /* delete if commit */ pending->nestLevel = GetCurrentTransactionNestLevel(); @@ -618,59 +976,104 @@ smgrDoPendingDeletes(bool isCommit) int nrels = 0, maxrels = 0; SMgrRelation *srels = NULL; + srelhash_hash *close_srels = NULL; + bool found; prev = NULL; for (pending = pendingDeletes; pending != NULL; pending = next) { + SMgrRelation srel; + next = pending->next; if (pending->nestLevel < nestLevel) { /* outer-level entries should not be processed yet */ prev = pending; + continue; } + + /* unlink list entry first, so we don't retry on failure */ + if (prev) + prev->next = next; else + pendingDeletes = next; + + if (pending->atCommit != isCommit) { - /* unlink list entry first, so we don't retry on failure */ - if (prev) - prev->next = next; - else - pendingDeletes = next; - /* do deletion if called for */ - if (pending->atCommit == isCommit) - { - SMgrRelation srel; - - srel = smgropen(pending->relnode, pending->backend); - - /* allocate the initial array, or extend it, if needed */ - if (maxrels == 0) - { - maxrels = 8; - srels = palloc(sizeof(SMgrRelation) * maxrels); - } - else if (maxrels <= nrels) - { - maxrels *= 2; - srels = repalloc(srels, sizeof(SMgrRelation) * maxrels); - } - - srels[nrels++] = srel; - } /* must explicitly free the list entry */ pfree(pending); /* prev does not change */ + continue; } + + if (close_srels == NULL) + close_srels = srelhash_create(CurrentMemoryContext, 32, NULL); + + srel = smgropen(pending->relnode, pending->backend); + + /* Uniquify the smgr relations */ + srelhash_insert(close_srels, srel, &found); + + if (pending->op & PDOP_DELETE) + { + /* allocate the initial array, or extend it, if needed */ + if (maxrels == 0) + { + maxrels = 8; + srels = 
palloc(sizeof(SMgrRelation) * maxrels); + } + else if (maxrels <= nrels) + { + maxrels *= 2; + srels = repalloc(srels, sizeof(SMgrRelation) * maxrels); + } + + srels[nrels++] = srel; + } + + if (pending->op & PDOP_UNLINK_FORK) + { + /* other forks needs to drop buffers */ + Assert(pending->unlink_forknum == INIT_FORKNUM); + + /* Don't emit wal while recovery. */ + if (!InRecovery) + log_smgrunlink(&pending->relnode, pending->unlink_forknum); + smgrunlink(srel, pending->unlink_forknum, false); + } + + if (pending->op & PDOP_UNLINK_MARK) + { + if (!InRecovery) + log_smgrunlinkmark(&pending->relnode, + pending->unlink_forknum, + pending->unlink_mark); + smgrunlinkmark(srel, pending->unlink_forknum, + pending->unlink_mark, InRecovery); + } + + if (pending->op & PDOP_SET_PERSISTENCE) + SetRelationBuffersPersistence(srel, pending->bufpersistence, + InRecovery); } if (nrels > 0) { smgrdounlinkall(srels, nrels, false); - - for (int i = 0; i < nrels; i++) - smgrclose(srels[i]); - pfree(srels); } + + if (close_srels) + { + srelhash_iterator i; + SRelHashEntry *ent; + + /* close smgr relatoins */ + srelhash_start_iterate(close_srels, &i); + while ((ent = srelhash_iterate(close_srels, &i)) != NULL) + smgrclose(ent->srel); + srelhash_destroy(close_srels); + } } /* @@ -840,7 +1243,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr) for (pending = pendingDeletes; pending != NULL; pending = pending->next) { if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit - && pending->backend == InvalidBackendId) + && pending->backend == InvalidBackendId + && pending->op == PDOP_DELETE) nrels++; } if (nrels == 0) @@ -853,7 +1257,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr) for (pending = pendingDeletes; pending != NULL; pending = pending->next) { if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit - && pending->backend == InvalidBackendId) + && pending->backend == InvalidBackendId && + pending->op == PDOP_DELETE) { *rptr = 
pending->relnode; rptr++; @@ -933,6 +1338,15 @@ smgr_redo(XLogReaderState *record) reln = smgropen(xlrec->rnode, InvalidBackendId); smgrcreate(reln, xlrec->forkNum, true); } + else if (info == XLOG_SMGR_UNLINK) + { + xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record); + SMgrRelation reln; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + smgrunlink(reln, xlrec->forkNum, true); + smgrclose(reln); + } else if (info == XLOG_SMGR_TRUNCATE) { xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record); @@ -1021,6 +1435,65 @@ smgr_redo(XLogReaderState *record) FreeFakeRelcacheEntry(rel); } + else if (info == XLOG_SMGR_MARK) + { + xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record); + SMgrRelation reln; + PendingRelDelete *pending; + bool created = false; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + switch (xlrec->action) + { + case XLOG_SMGR_MARK_CREATE: + smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true); + created = true; + break; + case XLOG_SMGR_MARK_UNLINK: + smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true); + break; + default: + elog(ERROR, "unknown smgr_mark action \"%c\"", xlrec->mark); + } + + if (created) + { + /* revert mark file operation at abort */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = xlrec->rnode; + pending->op = PDOP_UNLINK_MARK; + pending->unlink_forknum = xlrec->forkNum; + pending->unlink_mark = xlrec->mark; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + } + } + else if (info == XLOG_SMGR_BUFPERSISTENCE) + { + xl_smgr_bufpersistence *xlrec = + (xl_smgr_bufpersistence *) XLogRecGetData(record); + SMgrRelation reln; + PendingRelDelete *pending; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + SetRelationBuffersPersistence(reln, xlrec->persistence, true); + + /* 
revert buffer-persistence changes at abort */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = xlrec->rnode; + pending->op = PDOP_SET_PERSISTENCE; + pending->bufpersistence = !xlrec->persistence; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + } else elog(PANIC, "smgr_redo: unknown op code %u", info); } diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index bf42587e38..afc77f0d98 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -52,6 +52,7 @@ #include "commands/defrem.h" #include "commands/event_trigger.h" #include "commands/policy.h" +#include "commands/progress.h" #include "commands/sequence.h" #include "commands/tablecmds.h" #include "commands/tablespace.h" @@ -5329,6 +5330,166 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel, return newcmd; } +/* + * RelationChangePersistence: do in-place persistence change of a relation + */ +static void +RelationChangePersistence(AlteredTableInfo *tab, char persistence, + LOCKMODE lockmode) +{ + Relation rel; + Relation classRel; + HeapTuple tuple, + newtuple; + Datum new_val[Natts_pg_class]; + bool new_null[Natts_pg_class], + new_repl[Natts_pg_class]; + int i; + List *relids; + ListCell *lc_oid; + + Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE); + Assert(lockmode == AccessExclusiveLock); + + /* + * Under the following condition, we need to call ATRewriteTable, which + * cannot be false in the AT_REWRITE_ALTER_PERSISTENCE case. 
+ */ + Assert(tab->constraints == NULL && tab->partition_constraint == NULL && + tab->newvals == NULL && !tab->verify_new_notnull); + + rel = table_open(tab->relid, lockmode); + + Assert(rel->rd_rel->relpersistence != persistence); + + elog(DEBUG1, "perform in-place persistence change"); + + /* + * First we collect all relations whose persistence we need to change. + */ + + /* Collect OIDs of indexes and toast relations */ + relids = RelationGetIndexList(rel); + relids = lcons_oid(rel->rd_id, relids); + + /* Add toast relation if any */ + if (OidIsValid(rel->rd_rel->reltoastrelid)) + { + List *toastidx; + Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode); + + relids = lappend_oid(relids, rel->rd_rel->reltoastrelid); + toastidx = RelationGetIndexList(toastrel); + relids = list_concat(relids, toastidx); + pfree(toastidx); + table_close(toastrel, NoLock); + } + + table_close(rel, NoLock); + + /* Make changes in storage */ + classRel = table_open(RelationRelationId, RowExclusiveLock); + + foreach (lc_oid, relids) + { + Oid reloid = lfirst_oid(lc_oid); + Relation r = relation_open(reloid, lockmode); + + /* + * Some access methods do not accept in-place persistence change. For + * example, GiST uses page LSNs to figure out whether a block has + * changed, where UNLOGGED GiST indexes use fake LSNs that are + * incompatible with real LSNs used for LOGGED ones. + * + * XXXX: We don't bother to allow in-place persistence change for index + * methods other than btree for now. + */ + if (r->rd_rel->relkind == RELKIND_INDEX && + r->rd_rel->relam != BTREE_AM_OID) + { + int reindex_flags; + + /* reindex doesn't allow concurrent use of the index */ + table_close(r, NoLock); + + reindex_flags = + REINDEX_REL_SUPPRESS_INDEX_USE | + REINDEX_REL_CHECK_CONSTRAINTS; + + /* Set the same persistence as the parent relation. 
*/ + if (persistence == RELPERSISTENCE_UNLOGGED) + reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED; + else + reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT; + + reindex_index(reloid, reindex_flags, persistence, 0); + + continue; + } + + /* Create or drop init fork */ + if (persistence == RELPERSISTENCE_UNLOGGED) + RelationCreateInitFork(r); + else + RelationDropInitFork(r); + + /* + * When this relation gets WAL-logged, immediately sync all files but + * initfork to establish the initial state on storage. Buffers have + * already flushed out by RelationCreate(Drop)InitFork called just + * above. Initfork should have been synced as needed. + */ + if (persistence == RELPERSISTENCE_PERMANENT) + { + for (i = 0 ; i < INIT_FORKNUM ; i++) + { + if (smgrexists(RelationGetSmgr(r), i)) + smgrimmedsync(RelationGetSmgr(r), i); + } + } + + /* Update catalog */ + tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid)); + if (!HeapTupleIsValid(tuple)) + elog(ERROR, "cache lookup failed for relation %u", reloid); + + memset(new_val, 0, sizeof(new_val)); + memset(new_null, false, sizeof(new_null)); + memset(new_repl, false, sizeof(new_repl)); + + new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence); + new_null[Anum_pg_class_relpersistence - 1] = false; + new_repl[Anum_pg_class_relpersistence - 1] = true; + + newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel), + new_val, new_null, new_repl); + + CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple); + heap_freetuple(newtuple); + + /* + * While wal_level >= replica, switching to LOGGED requires the + * relation content to be WAL-logged to recover the table. 
+ */ + if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded()) + { + ForkNumber fork; + + for (fork = 0; fork < INIT_FORKNUM ; fork++) + { + if (smgrexists(RelationGetSmgr(r), fork)) + log_newpage_range(r, fork, 0, + smgrnblocks(RelationGetSmgr(r), fork), + false); + } + } + + table_close(r, NoLock); + } + + table_close(classRel, NoLock); +} + /* * ATRewriteTables: ALTER TABLE phase 3 */ @@ -5459,47 +5620,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode, tab->relid, tab->rewrite); - /* - * Create transient table that will receive the modified data. - * - * Ensure it is marked correctly as logged or unlogged. We have - * to do this here so that buffers for the new relfilenode will - * have the right persistence set, and at the same time ensure - * that the original filenode's buffers will get read in with the - * correct setting (i.e. the original one). Otherwise a rollback - * after the rewrite would possibly result with buffers for the - * original filenode having the wrong persistence setting. - * - * NB: This relies on swap_relation_files() also swapping the - * persistence. That wouldn't work for pg_class, but that can't be - * unlogged anyway. - */ - OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod, - persistence, lockmode); + if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE) + RelationChangePersistence(tab, persistence, lockmode); + else + { + /* + * Create transient table that will receive the modified data. + * + * Ensure it is marked correctly as logged or unlogged. We + * have to do this here so that buffers for the new relfilenode + * will have the right persistence set, and at the same time + * ensure that the original filenode's buffers will get read in + * with the correct setting (i.e. the original one). Otherwise + * a rollback after the rewrite would possibly result with + * buffers for the original filenode having the wrong + * persistence setting. 
+ * + * NB: This relies on swap_relation_files() also swapping the + * persistence. That wouldn't work for pg_class, but that can't + * be unlogged anyway. + */ + OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, + NewAccessMethod, + persistence, lockmode); - /* - * Copy the heap data into the new table with the desired - * modifications, and test the current data within the table - * against new constraints generated by ALTER TABLE commands. - */ - ATRewriteTable(tab, OIDNewHeap, lockmode); + /* + * Copy the heap data into the new table with the desired + * modifications, and test the current data within the table + * against new constraints generated by ALTER TABLE commands. + */ + ATRewriteTable(tab, OIDNewHeap, lockmode); - /* - * Swap the physical files of the old and new heaps, then rebuild - * indexes and discard the old heap. We can use RecentXmin for - * the table's new relfrozenxid because we rewrote all the tuples - * in ATRewriteTable, so no older Xid remains in the table. Also, - * we never try to swap toast tables by content, since we have no - * interest in letting this code work on system catalogs. - */ - finish_heap_swap(tab->relid, OIDNewHeap, - false, false, true, - !OidIsValid(tab->newTableSpace), - RecentXmin, - ReadNextMultiXactId(), - persistence); + /* + * Swap the physical files of the old and new heaps, then + * rebuild indexes and discard the old heap. We can use + * RecentXmin for the table's new relfrozenxid because we + * rewrote all the tuples in ATRewriteTable, so no older Xid + * remains in the table. Also, we never try to swap toast + * tables by content, since we have no interest in letting this + * code work on system catalogs. 
+ */ + finish_heap_swap(tab->relid, OIDNewHeap, + false, false, true, + !OidIsValid(tab->newTableSpace), + RecentXmin, + ReadNextMultiXactId(), + persistence); - InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0); + InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0); + } } else { diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c index ec0485705d..45e1a5d817 100644 --- a/src/backend/replication/basebackup.c +++ b/src/backend/replication/basebackup.c @@ -1070,6 +1070,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly, bool excludeFound; ForkNumber relForkNum; /* Type of fork if file is a relation */ int relOidChars; /* Chars in filename that are the rel oid */ + StorageMarks mark; /* Skip special stuff */ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0) @@ -1120,7 +1121,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly, /* Exclude all forks for unlogged tables except the init fork */ if (isDbDir && parse_filename_for_nontemp_relation(de->d_name, &relOidChars, - &relForkNum)) + &relForkNum, &mark)) { /* Never exclude init forks */ if (relForkNum != INIT_FORKNUM) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index b4532948d3..dab74bf99a 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -37,6 +37,7 @@ #include "access/xlogutils.h" #include "catalog/catalog.h" #include "catalog/storage.h" +#include "catalog/storage_xlog.h" #include "executor/instrument.h" #include "lib/binaryheap.h" #include "miscadmin.h" @@ -3154,6 +3155,93 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum, } } +/* --------------------------------------------------------------------- + * SetRelFileNodeBuffersPersistence + * + * This function changes the persistence of all buffer pages of a relation + * then writes all dirty pages of the relation out to disk when switching + 
* to PERMANENT. (or more accurately, out to kernel disk buffers), + * ensuring that the kernel has an up-to-date view of the relation. + * + * Generally, the caller should be holding AccessExclusiveLock on the + * target relation to ensure that no other backend is busy dirtying + * more blocks of the relation; the effects can't be expected to last + * after the lock is released. + * + * XXX currently it sequentially searches the buffer pool, should be + * changed to more clever ways of searching. This routine is not + * used in any performance-critical code paths, so it's not worth + * adding additional overhead to normal paths to make it go faster; + * but see also DropRelFileNodeBuffers. + * -------------------------------------------------------------------- + */ +void +SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo) +{ + int i; + RelFileNodeBackend rnode = srel->smgr_rnode; + + Assert (!RelFileNodeBackendIsTemp(rnode)); + + if (!isRedo) + log_smgrbufpersistence(&srel->smgr_rnode.node, permanent); + + ResourceOwnerEnlargeBuffers(CurrentResourceOwner); + + for (i = 0; i < NBuffers; i++) + { + BufferDesc *bufHdr = GetBufferDescriptor(i); + uint32 buf_state; + + if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node)) + continue; + + ReservePrivateRefCountEntry(); + + buf_state = LockBufHdr(bufHdr); + + if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node)) + { + UnlockBufHdr(bufHdr, buf_state); + continue; + } + + if (permanent) + { + /* Init fork is being dropped, drop buffers for it. 
*/ + if (bufHdr->tag.forkNum == INIT_FORKNUM) + { + InvalidateBuffer(bufHdr); + continue; + } + + buf_state |= BM_PERMANENT; + pg_atomic_write_u32(&bufHdr->state, buf_state); + + /* we flush this buffer when switching to PERMANENT */ + if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY)) + { + PinBuffer_Locked(bufHdr); + LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), + LW_SHARED); + FlushBuffer(bufHdr, srel); + LWLockRelease(BufferDescriptorGetContentLock(bufHdr)); + UnpinBuffer(bufHdr, true); + } + else + UnlockBufHdr(bufHdr, buf_state); + } + else + { + /* init fork is always BM_PERMANENT. See BufferAlloc */ + if (bufHdr->tag.forkNum != INIT_FORKNUM) + buf_state &= ~BM_PERMANENT; + + UnlockBufHdr(bufHdr, buf_state); + } + } +} + /* --------------------------------------------------------------------- * DropRelFileNodesAllBuffers * diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index 263057841d..8487ae1f02 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -349,8 +349,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel); static void datadir_fsync_fname(const char *fname, bool isdir, int elevel); static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel); -static int fsync_parent_path(const char *fname, int elevel); - /* * pg_fsync --- do fsync with or without writethrough @@ -3759,7 +3757,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel) * This is aimed at making file operations persistent on disk in case of * an OS crash or power failure. 
*/ -static int +int fsync_parent_path(const char *fname, int elevel) { char parentpath[MAXPGPATH]; diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c index 0ae3fb6902..f8458a1e1e 100644 --- a/src/backend/storage/file/reinit.c +++ b/src/backend/storage/file/reinit.c @@ -16,29 +16,49 @@ #include <unistd.h> +#include "access/xlog.h" +#include "catalog/pg_tablespace_d.h" #include "common/relpath.h" #include "postmaster/startup.h" +#include "storage/bufmgr.h" #include "storage/copydir.h" #include "storage/fd.h" +#include "storage/md.h" #include "storage/reinit.h" +#include "storage/smgr.h" #include "utils/hsearch.h" #include "utils/memutils.h" static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, - int op); + Oid tspid, int op); static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, - int op); + Oid tspid, Oid dbid, int op); typedef struct { Oid reloid; /* hash key */ -} unlogged_relation_entry; + bool has_init; /* has INIT fork */ + bool dirty_init; /* needs to remove INIT fork */ + bool dirty_all; /* needs to remove all forks */ +} relfile_entry; /* - * Reset unlogged relations from before the last restart. + * Clean up and reset relation files from before the last restart. * - * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any - * relation with an "init" fork, except for the "init" fork itself. + * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations + * depending on the existence of the "cleanup" forks. + * + * If SMGR_MARK_UNCOMMITTED mark file for init fork is present, we remove the + * init fork along with the mark file. + * + * If SMGR_MARK_UNCOMMITTED mark file for main fork is present, we remove the + * whole relation along with the mark file. + * + * Otherwise, if the "init" fork is found, we remove all forks of any relation + * with the "init" fork, except for the "init" fork itself. 
+ * + * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all + * relations that have the "cleanup" and/or the "init" forks. * * If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main * fork. @@ -72,7 +92,7 @@ ResetUnloggedRelations(int op) /* * First process unlogged files in pg_default ($PGDATA/base) */ - ResetUnloggedRelationsInTablespaceDir("base", op); + ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op); /* * Cycle through directories for all non-default tablespaces. @@ -81,13 +101,19 @@ ResetUnloggedRelations(int op) while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL) { + Oid tspid; + if (strcmp(spc_de->d_name, ".") == 0 || strcmp(spc_de->d_name, "..") == 0) continue; snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s", spc_de->d_name, TABLESPACE_VERSION_DIRECTORY); - ResetUnloggedRelationsInTablespaceDir(temp_path, op); + + tspid = atooid(spc_de->d_name); + + Assert(tspid != 0); + ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op); } FreeDir(spc_dir); @@ -103,7 +129,8 @@ ResetUnloggedRelations(int op) * Process one tablespace directory for ResetUnloggedRelations */ static void -ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) +ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, + Oid tspid, int op) { DIR *ts_dir; struct dirent *de; @@ -130,6 +157,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) while ((de = ReadDir(ts_dir, tsdirname)) != NULL) { + Oid dbid; + /* * We're only interested in the per-database directories, which have * numeric names. Note that this code will also (properly) ignore "." 
@@ -148,7 +177,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) ereport_startup_progress("resetting unlogged relations (cleanup), elapsed time: %ld.%02d s, current path: %s", dbspace_path); - ResetUnloggedRelationsInDbspaceDir(dbspace_path, op); + dbid = atooid(de->d_name); + Assert(dbid != 0); + + ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op); } FreeDir(ts_dir); @@ -158,125 +190,228 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) * Process one per-dbspace directory for ResetUnloggedRelations */ static void -ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) +ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, + Oid tspid, Oid dbid, int op) { DIR *dbspace_dir; struct dirent *de; char rm_path[MAXPGPATH * 2]; + HTAB *hash; + HASHCTL ctl; /* Caller must specify at least one operation. */ - Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0); + Assert((op & (UNLOGGED_RELATION_CLEANUP | + UNLOGGED_RELATION_DROP_BUFFER | + UNLOGGED_RELATION_INIT)) != 0); /* * Cleanup is a two-pass operation. First, we go through and identify all * the files with init forks. Then, we go through again and nuke * everything with the same OID except the init fork. */ + + /* + * It's possible that someone could create a ton of unlogged relations + * in the same database & tablespace, so we'd better use a hash table + * rather than an array or linked list to keep track of which files + * need to be reset. Otherwise, this cleanup operation would be + * O(n^2). + */ + memset(&ctl, 0, sizeof(ctl)); + ctl.keysize = sizeof(Oid); + ctl.entrysize = sizeof(relfile_entry); + hash = hash_create("relfilenode cleanup hash", + 32, &ctl, HASH_ELEM | HASH_BLOBS); + + /* Collect INIT fork and mark files in the directory. 
*/ + dbspace_dir = AllocateDir(dbspacedirname); + while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) + { + int oidchars; + ForkNumber forkNum; + StorageMarks mark; + + /* Skip anything that doesn't look like a relation data file. */ + if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, + &forkNum, &mark)) + continue; + + if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED) + { + Oid key; + relfile_entry *ent; + bool found; + + /* + * Record the relfilenode information. If it has + * SMGR_MARK_UNCOMMITTED mark files, the relfilenode is in dirty + * state, where clean up is needed. + */ + key = atooid(de->d_name); + ent = hash_search(hash, &key, HASH_ENTER, &found); + + if (!found) + { + ent->has_init = false; + ent->dirty_init = false; + ent->dirty_all = false; + } + + if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED) + ent->dirty_init = true; + else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED) + ent->dirty_all = true; + else + { + Assert(forkNum == INIT_FORKNUM); + ent->has_init = true; + } + } + } + + /* Done with the first pass. */ + FreeDir(dbspace_dir); + + /* nothing to do if we don't have init nor cleanup forks */ + if (hash_get_num_entries(hash) < 1) + { + hash_destroy(hash); + return; + } + + if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0) + { + /* + * When we come here after recovery, smgr object for this file might + * have been created. In that case we need to drop all buffers then the + * smgr object before initializing the unlogged relation. This is safe + * as far as no other backends have accessed the relation before + * starting archive recovery. 
+ */ + HASH_SEQ_STATUS status; + relfile_entry *ent; + SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8); + int maxrels = 8; + int nrels = 0; + int i; + + Assert(!HotStandbyActive()); + + hash_seq_init(&status, hash); + while((ent = (relfile_entry *) hash_seq_search(&status)) != NULL) + { + RelFileNodeBackend rel; + + /* + * The relation is persistent and remains persistent. Don't + * drop the buffers for this relation. + */ + if (ent->has_init && ent->dirty_init) + continue; + + if (maxrels <= nrels) + { + maxrels *= 2; + srels = repalloc(srels, sizeof(SMgrRelation) * maxrels); + } + + rel.backend = InvalidBackendId; + rel.node.spcNode = tspid; + rel.node.dbNode = dbid; + rel.node.relNode = ent->reloid; + + srels[nrels++] = smgropen(rel.node, InvalidBackendId); + } + + DropRelFileNodesAllBuffers(srels, nrels); + + for (i = 0 ; i < nrels ; i++) + smgrclose(srels[i]); + } + + /* + * Now, make a second pass and remove anything that matches. + */ if ((op & UNLOGGED_RELATION_CLEANUP) != 0) { - HTAB *hash; - HASHCTL ctl; - - /* - * It's possible that someone could create a ton of unlogged relations - * in the same database & tablespace, so we'd better use a hash table - * rather than an array or linked list to keep track of which files - * need to be reset. Otherwise, this cleanup operation would be - * O(n^2). - */ - ctl.keysize = sizeof(Oid); - ctl.entrysize = sizeof(unlogged_relation_entry); - ctl.hcxt = CurrentMemoryContext; - hash = hash_create("unlogged relation OIDs", 32, &ctl, - HASH_ELEM | HASH_BLOBS | HASH_CONTEXT); - - /* Scan the directory. */ dbspace_dir = AllocateDir(dbspacedirname); while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) { - ForkNumber forkNum; - int oidchars; - unlogged_relation_entry ent; + ForkNumber forkNum; + StorageMarks mark; + int oidchars; + Oid key; + relfile_entry *ent; + RelFileNodeBackend rel; /* Skip anything that doesn't look like a relation data file. 
*/ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) - continue; - - /* Also skip it unless this is the init fork. */ - if (forkNum != INIT_FORKNUM) - continue; - - /* - * Put the OID portion of the name into the hash table, if it - * isn't already. - */ - ent.reloid = atooid(de->d_name); - (void) hash_search(hash, &ent, HASH_ENTER, NULL); - } - - /* Done with the first pass. */ - FreeDir(dbspace_dir); - - /* - * If we didn't find any init forks, there's no point in continuing; - * we can bail out now. - */ - if (hash_get_num_entries(hash) == 0) - { - hash_destroy(hash); - return; - } - - /* - * Now, make a second pass and remove anything that matches. - */ - dbspace_dir = AllocateDir(dbspacedirname); - while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) - { - ForkNumber forkNum; - int oidchars; - unlogged_relation_entry ent; - - /* Skip anything that doesn't look like a relation data file. */ - if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) - continue; - - /* We never remove the init fork. */ - if (forkNum == INIT_FORKNUM) + &forkNum, &mark)) continue; /* * See whether the OID portion of the name shows up in the hash * table. If so, nuke it! */ - ent.reloid = atooid(de->d_name); - if (hash_search(hash, &ent, HASH_FIND, NULL)) + key = atooid(de->d_name); + ent = hash_search(hash, &key, HASH_FIND, NULL); + + if (!ent) + continue; + + if (!ent->dirty_all) { - snprintf(rm_path, sizeof(rm_path), "%s/%s", - dbspacedirname, de->d_name); - if (unlink(rm_path) < 0) - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not remove file \"%s\": %m", - rm_path))); + /* clean permanent relations don't need cleanup */ + if (!ent->has_init) + continue; + + if (ent->dirty_init) + { + /* + * The crashed transaction did SET UNLOGGED. This relation + * is restored to a LOGGED relation. 
+ */ + if (forkNum != INIT_FORKNUM) + continue; + } else - elog(DEBUG2, "unlinked file \"%s\"", rm_path); + { + /* + * we don't remove the INIT fork of a non-dirty + * relfilenode + */ + if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE) + continue; + } } + + /* so, nuke it! */ + snprintf(rm_path, sizeof(rm_path), "%s/%s", + dbspacedirname, de->d_name); + if (unlink(rm_path) < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not remove file \"%s\": %m", + rm_path))); + + rel.backend = InvalidBackendId; + rel.node.spcNode = tspid; + rel.node.dbNode = dbid; + rel.node.relNode = atooid(de->d_name); + + ForgetRelationForkSyncRequests(rel, forkNum); } /* Cleanup is complete. */ FreeDir(dbspace_dir); - hash_destroy(hash); } + hash_destroy(hash); + hash = NULL; + /* * Initialization happens after cleanup is complete: we copy each init - * fork file to the corresponding main fork file. Note that if we are - * asked to do both cleanup and init, we may never get here: if the - * cleanup code determines that there are no init forks in this dbspace, - * it will return before we get to this point. + * fork file to the corresponding main fork file. */ if ((op & UNLOGGED_RELATION_INIT) != 0) { @@ -285,6 +420,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) { ForkNumber forkNum; + StorageMarks mark; int oidchars; char oidbuf[OIDCHARS + 1]; char srcpath[MAXPGPATH * 2]; @@ -292,9 +428,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) /* Skip anything that doesn't look like a relation data file. */ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) + &forkNum, &mark)) continue; + Assert(mark == SMGR_MARK_NONE); + /* Also skip it unless this is the init fork. 
*/ if (forkNum != INIT_FORKNUM) continue; @@ -328,15 +466,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) { ForkNumber forkNum; + StorageMarks mark; int oidchars; char oidbuf[OIDCHARS + 1]; char mainpath[MAXPGPATH]; /* Skip anything that doesn't look like a relation data file. */ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) + &forkNum, &mark)) continue; + Assert(mark == SMGR_MARK_NONE); + /* Also skip it unless this is the init fork. */ if (forkNum != INIT_FORKNUM) continue; @@ -379,7 +520,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) */ bool parse_filename_for_nontemp_relation(const char *name, int *oidchars, - ForkNumber *fork) + ForkNumber *fork, StorageMarks *mark) { int pos; @@ -410,11 +551,19 @@ parse_filename_for_nontemp_relation(const char *name, int *oidchars, for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar) ; - if (segchar <= 1) - return false; - pos += segchar; + if (segchar > 1) + pos += segchar; } + /* mark file? */ + if (name[pos] == '.' && name[pos + 1] != 0) + { + *mark = name[pos + 1]; + pos += 2; + } + else + *mark = SMGR_MARK_NONE; + /* Now we should be at the end. */ if (name[pos] != '\0') return false; diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index b4bca7eed6..580b74839f 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -139,7 +139,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno, BlockNumber blkno, bool skipFsync, int behavior); static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg); - +static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum, + StorageMarks mark); /* * mdinit() -- Initialize private state for magnetic disk storage manager. 
@@ -169,6 +170,81 @@ mdexists(SMgrRelation reln, ForkNumber forkNum) return (mdopenfork(reln, forkNum, EXTENSION_RETURN_NULL) != NULL); } +/* + * mdcreatemark() -- Create a mark file. + * + * If isRedo is true, it's okay for the file to exist already. + */ +void +mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark, + bool isRedo) +{ + char *path = markpath(reln->smgr_rnode, forkNum, mark); + int fd; + + /* See mdcreate for details. */ + TablespaceCreateDbspace(reln->smgr_rnode.node.spcNode, + reln->smgr_rnode.node.dbNode, + isRedo); + + fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL); + if (fd < 0 && (!isRedo || errno != EEXIST)) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not create mark file \"%s\": %m", path))); + + pg_fsync(fd); + close(fd); + + /* + * To guarantee that the creation of the file is persistent, fsync its + * parent directory. + */ + fsync_parent_path(path, ERROR); + + pfree(path); +} + + +/* + * mdunlinkmark() -- Delete the mark file + * + * If isRedo is true, it's okay if the file is not found. + */ +void +mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark, + bool isRedo) +{ + char *path = markpath(reln->smgr_rnode, forkNum, mark); + + if (!isRedo || mdmarkexists(reln, forkNum, mark)) + durable_unlink(path, ERROR); + + pfree(path); +} + +/* + * mdmarkexists() -- Check if the file exists. + */ +static bool +mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark) +{ + char *path = markpath(reln->smgr_rnode, forkNum, mark); + int fd; + + fd = BasicOpenFile(path, O_RDONLY); + if (fd < 0 && errno != ENOENT) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not access mark file \"%s\": %m", path))); + pfree(path); + + if (fd < 0) + return false; + + return true; +} + /* * mdcreate() -- Create a new relation on magnetic disk. 
* @@ -1025,6 +1101,15 @@ register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum, RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ ); } +/* + * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork + */ +void +ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, ForkNumber forknum) +{ + register_forget_request(rnode, forknum, 0); +} + /* * ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB */ @@ -1378,12 +1463,14 @@ mdsyncfiletag(const FileTag *ftag, char *path) * Return 0 on success, -1 on failure, with errno set. */ int -mdunlinkfiletag(const FileTag *ftag, char *path) +mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark) { char *p; /* Compute the path. */ - p = relpathperm(ftag->rnode, MAIN_FORKNUM); + p = GetRelationPath(ftag->rnode.dbNode, ftag->rnode.spcNode, + ftag->rnode.relNode, InvalidBackendId, MAIN_FORKNUM, + mark); strlcpy(path, p, MAXPGPATH); pfree(p); diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c index 0fcef4994b..110e64b0b2 100644 --- a/src/backend/storage/smgr/smgr.c +++ b/src/backend/storage/smgr/smgr.c @@ -62,6 +62,10 @@ typedef struct f_smgr void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks); void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum); + void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); + void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); } f_smgr; static const f_smgr smgrsw[] = { @@ -82,6 +86,8 @@ static const f_smgr smgrsw[] = { .smgr_nblocks = mdnblocks, .smgr_truncate = mdtruncate, .smgr_immedsync = mdimmedsync, + .smgr_createmark = mdcreatemark, + .smgr_unlinkmark = mdunlinkmark, } }; @@ -335,6 +341,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo) smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo); } +/* + * smgrcreatemark() -- Create a mark 
file + */ +void +smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark, + bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo); +} + +/* + * smgrunlinkmark() -- Delete a mark file + */ +void +smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark, + bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo); +} + /* * smgrdosyncall() -- Immediately sync all forks of all given relations * @@ -662,6 +688,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum) smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum); } +void +smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo); +} + /* * AtEOXact_SMgr * diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c index d4083e8a56..9563940d45 100644 --- a/src/backend/storage/sync/sync.c +++ b/src/backend/storage/sync/sync.c @@ -89,7 +89,8 @@ static CycleCtr checkpoint_cycle_ctr = 0; typedef struct SyncOps { int (*sync_syncfiletag) (const FileTag *ftag, char *path); - int (*sync_unlinkfiletag) (const FileTag *ftag, char *path); + int (*sync_unlinkfiletag) (const FileTag *ftag, char *path, + StorageMarks mark); bool (*sync_filetagmatches) (const FileTag *ftag, const FileTag *candidate); } SyncOps; @@ -222,7 +223,8 @@ SyncPostCheckpoint(void) /* Unlink the file */ if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag, - path) < 0) + path, + SMGR_MARK_NONE) < 0) { /* * There's a race condition, when the database is dropped at the @@ -236,6 +238,20 @@ SyncPostCheckpoint(void) (errcode_for_file_access(), errmsg("could not remove file \"%s\": %m", path))); } + else if (syncsw[entry->tag.handler].sync_unlinkfiletag( + &entry->tag, path, + SMGR_MARK_UNCOMMITTED) < 0) + { + /* + * And we may have SMGR_MARK_UNCOMMITTED file. Remove it if the + * fork files has been successfully removed. 
It's ok if the file + * does not exist. + */ + if (errno != ENOENT) + ereport(WARNING, + (errcode_for_file_access(), + errmsg("could not remove file \"%s\": %m", path))); + } /* Mark the list entry as canceled, just in case */ entry->canceled = true; diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c index 436df54120..dbc0da5da5 100644 --- a/src/bin/pg_rewind/parsexlog.c +++ b/src/bin/pg_rewind/parsexlog.c @@ -407,6 +407,30 @@ extractPageInfo(XLogReaderState *record) * source system. */ } + else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK) + { + /* + * We can safely ignore these. When we compare the sizes later on, + * we'll notice that they differ, and copy the missing tail from + * source system. + */ + } + else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK) + { + /* + * We can safely ignore these. When we compare the sizes later on, + * we'll notice that they differ, and copy the missing tail from + * source system. + */ + } + else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE) + { + /* + * We can safely ignore these. When we compare the sizes later on, + * we'll notice that they differ, and copy the missing tail from + * source system. 
+ */ + } else if (rmid == RM_XACT_ID && ((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT || (rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED || diff --git a/src/common/relpath.c b/src/common/relpath.c index 1f5c426ec0..67f24890d6 100644 --- a/src/common/relpath.c +++ b/src/common/relpath.c @@ -139,9 +139,15 @@ GetDatabasePath(Oid dbNode, Oid spcNode) */ char * GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, - int backendId, ForkNumber forkNumber) + int backendId, ForkNumber forkNumber, char mark) { char *path; + char markstr[10]; + + if (mark == 0) + markstr[0] = 0; + else + snprintf(markstr, 10, ".%c", mark); if (spcNode == GLOBALTABLESPACE_OID) { @@ -149,10 +155,10 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, Assert(dbNode == 0); Assert(backendId == InvalidBackendId); if (forkNumber != MAIN_FORKNUM) - path = psprintf("global/%u_%s", - relNode, forkNames[forkNumber]); + path = psprintf("global/%u_%s%s", + relNode, forkNames[forkNumber], markstr); else - path = psprintf("global/%u", relNode); + path = psprintf("global/%u%s", relNode, markstr); } else if (spcNode == DEFAULTTABLESPACE_OID) { @@ -160,22 +166,22 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, if (backendId == InvalidBackendId) { if (forkNumber != MAIN_FORKNUM) - path = psprintf("base/%u/%u_%s", + path = psprintf("base/%u/%u_%s%s", dbNode, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("base/%u/%u", - dbNode, relNode); + path = psprintf("base/%u/%u%s", + dbNode, relNode, markstr); } else { if (forkNumber != MAIN_FORKNUM) - path = psprintf("base/%u/t%d_%u_%s", + path = psprintf("base/%u/t%d_%u_%s%s", dbNode, backendId, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("base/%u/t%d_%u", - dbNode, backendId, relNode); + path = psprintf("base/%u/t%d_%u%s", + dbNode, backendId, relNode, markstr); } } else @@ -184,27 +190,28 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, if 
(backendId == InvalidBackendId) { if (forkNumber != MAIN_FORKNUM) - path = psprintf("pg_tblspc/%u/%s/%u/%u_%s", + path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s", spcNode, TABLESPACE_VERSION_DIRECTORY, dbNode, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("pg_tblspc/%u/%s/%u/%u", + path = psprintf("pg_tblspc/%u/%s/%u/%u%s", spcNode, TABLESPACE_VERSION_DIRECTORY, - dbNode, relNode); + dbNode, relNode, markstr); } else { if (forkNumber != MAIN_FORKNUM) - path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s", + path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s", spcNode, TABLESPACE_VERSION_DIRECTORY, dbNode, backendId, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u", + path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s", spcNode, TABLESPACE_VERSION_DIRECTORY, - dbNode, backendId, relNode); + dbNode, backendId, relNode, markstr); } } + return path; } diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h index 0ab32b44e9..382623159c 100644 --- a/src/include/catalog/storage.h +++ b/src/include/catalog/storage.h @@ -23,6 +23,8 @@ extern int wal_skip_threshold; extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence); +extern void RelationCreateInitFork(Relation rel); +extern void RelationDropInitFork(Relation rel); extern void RelationDropStorage(Relation rel); extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit); extern void RelationPreTruncate(Relation rel); diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h index f0814f1458..12346ed7f6 100644 --- a/src/include/catalog/storage_xlog.h +++ b/src/include/catalog/storage_xlog.h @@ -18,17 +18,23 @@ #include "lib/stringinfo.h" #include "storage/block.h" #include "storage/relfilenode.h" +#include "storage/smgr.h" /* * Declarations for smgr-related XLOG records * - * Note: we log file creation and truncation here, but logging of 
deletion - * actions is handled by xact.c, because it is part of transaction commit. + * Note: we log file creation, truncation and buffer persistence change here, + * but logging of deletion actions is handled mainly by xact.c, because it is + * part of transaction commit in most cases. However, there's a case where + * init forks are deleted outside control of transaction. */ /* XLOG gives us high 4 bits */ #define XLOG_SMGR_CREATE 0x10 #define XLOG_SMGR_TRUNCATE 0x20 +#define XLOG_SMGR_UNLINK 0x30 +#define XLOG_SMGR_MARK 0x40 +#define XLOG_SMGR_BUFPERSISTENCE 0x50 typedef struct xl_smgr_create { @@ -36,6 +42,32 @@ typedef struct xl_smgr_create ForkNumber forkNum; } xl_smgr_create; +typedef struct xl_smgr_unlink +{ + RelFileNode rnode; + ForkNumber forkNum; +} xl_smgr_unlink; + +typedef enum smgr_mark_action +{ + XLOG_SMGR_MARK_CREATE = 'c', + XLOG_SMGR_MARK_UNLINK = 'u' +} smgr_mark_action; + +typedef struct xl_smgr_mark +{ + RelFileNode rnode; + ForkNumber forkNum; + StorageMarks mark; + smgr_mark_action action; +} xl_smgr_mark; + +typedef struct xl_smgr_bufpersistence +{ + RelFileNode rnode; + bool persistence; +} xl_smgr_bufpersistence; + /* flags for xl_smgr_truncate */ #define SMGR_TRUNCATE_HEAP 0x0001 #define SMGR_TRUNCATE_VM 0x0002 @@ -51,6 +83,12 @@ typedef struct xl_smgr_truncate } xl_smgr_truncate; extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum); +extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum); +extern void log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum, + StorageMarks mark); +extern void log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum, + StorageMarks mark); +extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence); extern void smgr_redo(XLogReaderState *record); extern void smgr_desc(StringInfo buf, XLogReaderState *record); diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h index a44be11ca0..106a5cf508 100644 --- 
a/src/include/common/relpath.h +++ b/src/include/common/relpath.h @@ -67,7 +67,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork); extern char *GetDatabasePath(Oid dbNode, Oid spcNode); extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, - int backendId, ForkNumber forkNumber); + int backendId, ForkNumber forkNumber, char mark); /* * Wrapper macros for GetRelationPath. Beware of multiple @@ -77,7 +77,7 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, /* First argument is a RelFileNode */ #define relpathbackend(rnode, backend, forknum) \ GetRelationPath((rnode).dbNode, (rnode).spcNode, (rnode).relNode, \ - backend, forknum) + backend, forknum, 0) /* First argument is a RelFileNode */ #define relpathperm(rnode, forknum) \ @@ -87,4 +87,9 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, #define relpath(rnode, forknum) \ relpathbackend((rnode).node, (rnode).backend, forknum) +/* First argument is a RelFileNodeBackend */ +#define markpath(rnode, forknum, mark) \ + GetRelationPath((rnode).node.dbNode, (rnode).node.spcNode, \ + (rnode).node.relNode, \ + (rnode).backend, forknum, mark) #endif /* RELPATH_H */ diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h index cfce23ecbc..f5a7df87a4 100644 --- a/src/include/storage/bufmgr.h +++ b/src/include/storage/bufmgr.h @@ -206,6 +206,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels) extern void FlushDatabaseBuffers(Oid dbid); extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum, int nforks, BlockNumber *firstDelBlock); +extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel, + bool permanent, bool isRedo); extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes); extern void DropDatabaseBuffers(Oid dbid); diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h index 34602ae006..2dc0357ad5 100644 --- 
a/src/include/storage/fd.h +++ b/src/include/storage/fd.h @@ -185,6 +185,7 @@ extern ssize_t pg_pwritev_with_retry(int fd, extern int pg_truncate(const char *path, off_t length); extern void fsync_fname(const char *fname, bool isdir); extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel); +extern int fsync_parent_path(const char *fname, int elevel); extern int durable_rename(const char *oldfile, const char *newfile, int loglevel); extern int durable_unlink(const char *fname, int loglevel); extern int durable_rename_excl(const char *oldfile, const char *newfile, int loglevel); diff --git a/src/include/storage/md.h b/src/include/storage/md.h index 752b440864..99620816b5 100644 --- a/src/include/storage/md.h +++ b/src/include/storage/md.h @@ -23,6 +23,10 @@ extern void mdinit(void); extern void mdopen(SMgrRelation reln); extern void mdclose(SMgrRelation reln, ForkNumber forknum); +extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); +extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo); extern bool mdexists(SMgrRelation reln, ForkNumber forknum); extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo); @@ -41,12 +45,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks); extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum); +extern void ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, + ForkNumber forknum); extern void ForgetDatabaseSyncRequests(Oid dbid); extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo); /* md sync callbacks */ extern int mdsyncfiletag(const FileTag *ftag, char *path); -extern int mdunlinkfiletag(const FileTag *ftag, char *path); +extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark); extern bool mdfiletagmatches(const FileTag 
*ftag, const FileTag *candidate); #endif /* MD_H */ diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h index fad1e5c473..e1f97e9b89 100644 --- a/src/include/storage/reinit.h +++ b/src/include/storage/reinit.h @@ -16,13 +16,15 @@ #define REINIT_H #include "common/relpath.h" - +#include "storage/smgr.h" extern void ResetUnloggedRelations(int op); -extern bool parse_filename_for_nontemp_relation(const char *name, - int *oidchars, ForkNumber *fork); +extern bool parse_filename_for_nontemp_relation(const char *name, int *oidchars, + ForkNumber *fork, + StorageMarks *mark); #define UNLOGGED_RELATION_CLEANUP 0x0001 -#define UNLOGGED_RELATION_INIT 0x0002 +#define UNLOGGED_RELATION_DROP_BUFFER 0x0002 +#define UNLOGGED_RELATION_INIT 0x0004 #endif /* REINIT_H */ diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h index a6fbf7b6a6..201ecace8a 100644 --- a/src/include/storage/smgr.h +++ b/src/include/storage/smgr.h @@ -18,6 +18,18 @@ #include "storage/block.h" #include "storage/relfilenode.h" +/* + * Storage marks is a file of which existence suggests something about a + * file. The name of such files is "<filename>.<mark>", where the mark is one + * of the values of StorageMarks. Since ".<digit>" means segment files so don't + * use digits for the mark character. + */ +typedef enum StorageMarks +{ + SMGR_MARK_NONE = 0, + SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */ +} StorageMarks; + /* * smgr.c maintains a table of SMgrRelation objects, which are essentially * cached file handles. 
An SMgrRelation is created (if not already present) @@ -85,7 +97,12 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln); extern void smgrclose(SMgrRelation reln); extern void smgrcloseall(void); extern void smgrclosenode(RelFileNodeBackend rnode); +extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); +extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo); +extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo); extern void smgrdosyncall(SMgrRelation *rels, int nrels); extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo); extern void smgrextend(SMgrRelation reln, ForkNumber forknum, -- 2.27.0 From 2d74ca97ae66dff87a883e2efa60f02fb8c883c3 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Wed, 11 Nov 2020 23:21:09 +0900 Subject: [PATCH v8 2/2] New command ALTER TABLE ALL IN TABLESPACE SET LOGGED/UNLOGGED To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes relation persistence of all tables in the specified tablespace. --- src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++++++++++ src/backend/nodes/copyfuncs.c | 16 ++++ src/backend/nodes/equalfuncs.c | 15 ++++ src/backend/parser/gram.y | 20 +++++ src/backend/tcop/utility.c | 11 +++ src/include/commands/tablecmds.h | 2 + src/include/nodes/nodes.h | 1 + src/include/nodes/parsenodes.h | 9 ++ 8 files changed, 214 insertions(+) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index afc77f0d98..211ca3641a 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -14488,6 +14488,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt) return new_tablespaceoid; } +/* + * Alter Table ALL ... 
SET LOGGED/UNLOGGED + * + * Allows a user to change persistence of all objects in a given tablespace in + * the current database. Objects can be chosen based on the owner of the + * object also, to allow users to change persistene only their objects. The + * main permissions handling is done by the lower-level change persistence + * function. + * + * All to-be-modified objects are locked first. If NOWAIT is specified and the + * lock can't be acquired then we ereport(ERROR). + */ +void +AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt) +{ + List *relations = NIL; + ListCell *l; + ScanKeyData key[1]; + Relation rel; + TableScanDesc scan; + HeapTuple tuple; + Oid tablespaceoid; + List *role_oids = roleSpecsToIds(NIL); + + /* Ensure we were not asked to change something we can't */ + if (stmt->objtype != OBJECT_TABLE) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("only tables can be specified"))); + + /* Get the tablespace OID */ + tablespaceoid = get_tablespace_oid(stmt->tablespacename, false); + + /* + * Now that the checks are done, check if we should set either to + * InvalidOid because it is our database's default tablespace. + */ + if (tablespaceoid == MyDatabaseTableSpace) + tablespaceoid = InvalidOid; + + /* + * Walk the list of objects in the tablespace to pick up them. This will + * only find objects in our database, of course. + */ + ScanKeyInit(&key[0], + Anum_pg_class_reltablespace, + BTEqualStrategyNumber, F_OIDEQ, + ObjectIdGetDatum(tablespaceoid)); + + rel = table_open(RelationRelationId, AccessShareLock); + scan = table_beginscan_catalog(rel, 1, key); + while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL) + { + Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple); + Oid relOid = relForm->oid; + + /* + * Do not pick-up objects in pg_catalog as part of this, if an admin + * really wishes to do so, they can issue the individual ALTER + * commands directly. 
+ * + * Also, explicitly avoid any shared tables, temp tables, or TOAST + * (TOAST will be changed with the main table). + */ + if (IsCatalogNamespace(relForm->relnamespace) || + relForm->relisshared || + isAnyTempNamespace(relForm->relnamespace) || + IsToastNamespace(relForm->relnamespace)) + continue; + + /* Only pick up the object type requested */ + if (relForm->relkind != RELKIND_RELATION) + continue; + + /* Check if we are only picking-up objects owned by certain roles */ + if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner)) + continue; + + /* + * Handle permissions-checking here since we are locking the tables + * and also to avoid doing a bunch of work only to fail part-way. Note + * that permissions will also be checked by AlterTableInternal(). + * + * Caller must be considered an owner on the table of which we're going + * to change persistence. + */ + if (!pg_class_ownercheck(relOid, GetUserId())) + aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)), + NameStr(relForm->relname)); + + if (stmt->nowait && + !ConditionalLockRelationOid(relOid, AccessExclusiveLock)) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_IN_USE), + errmsg("aborting because lock on relation \"%s.%s\" is not available", + get_namespace_name(relForm->relnamespace), + NameStr(relForm->relname)))); + else + LockRelationOid(relOid, AccessExclusiveLock); + + /* + * Add to our list of objects of which we're going to change + * persistence. + */ + relations = lappend_oid(relations, relOid); + } + + table_endscan(scan); + table_close(rel, AccessShareLock); + + if (relations == NIL) + ereport(NOTICE, + (errcode(ERRCODE_NO_DATA_FOUND), + errmsg("no matching relations in tablespace \"%s\" found", + tablespaceoid == InvalidOid ? "(database default)" : + get_tablespace_name(tablespaceoid)))); + + /* + * Everything is locked, loop through and change persistence of all of the + * relations. 
+ */ + foreach(l, relations) + { + List *cmds = NIL; + AlterTableCmd *cmd = makeNode(AlterTableCmd); + + if (stmt->logged) + cmd->subtype = AT_SetLogged; + else + cmd->subtype = AT_SetUnLogged; + + cmds = lappend(cmds, cmd); + + EventTriggerAlterTableStart((Node *) stmt); + /* OID is set by AlterTableInternal */ + AlterTableInternal(lfirst_oid(l), cmds, false); + EventTriggerAlterTableEnd(); + } +} + static void index_copy_data(Relation rel, RelFileNode newrnode) { diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c index df0b747883..55e38cfe3f 100644 --- a/src/backend/nodes/copyfuncs.c +++ b/src/backend/nodes/copyfuncs.c @@ -4269,6 +4269,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from) return newnode; } +static AlterTableSetLoggedAllStmt * +_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from) +{ + AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt); + + COPY_STRING_FIELD(tablespacename); + COPY_SCALAR_FIELD(objtype); + COPY_SCALAR_FIELD(logged); + COPY_SCALAR_FIELD(nowait); + + return newnode; +} + static CreateExtensionStmt * _copyCreateExtensionStmt(const CreateExtensionStmt *from) { @@ -5622,6 +5635,9 @@ copyObjectImpl(const void *from) case T_AlterTableMoveAllStmt: retval = _copyAlterTableMoveAllStmt(from); break; + case T_AlterTableSetLoggedAllStmt: + retval = _copyAlterTableSetLoggedAllStmt(from); + break; case T_CreateExtensionStmt: retval = _copyCreateExtensionStmt(from); break; diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c index cb7ddd463c..a19b7874d7 100644 --- a/src/backend/nodes/equalfuncs.c +++ b/src/backend/nodes/equalfuncs.c @@ -1916,6 +1916,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a, return true; } +static bool +_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a, + const AlterTableSetLoggedAllStmt *b) +{ + COMPARE_STRING_FIELD(tablespacename); + COMPARE_SCALAR_FIELD(objtype); + 
COMPARE_SCALAR_FIELD(logged); + COMPARE_SCALAR_FIELD(nowait); + + return true; +} + static bool _equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b) { @@ -3625,6 +3637,9 @@ equal(const void *a, const void *b) case T_AlterTableMoveAllStmt: retval = _equalAlterTableMoveAllStmt(a, b); break; + case T_AlterTableSetLoggedAllStmt: + retval = _equalAlterTableSetLoggedAllStmt(a, b); + break; case T_CreateExtensionStmt: retval = _equalCreateExtensionStmt(a, b); break; diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index 3d4dd43e47..9823d57a54 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -1984,6 +1984,26 @@ AlterTableStmt: n->nowait = $13; $$ = (Node *)n; } + | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->logged = true; + n->nowait = $9; + $$ = (Node *)n; + } + | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->logged = false; + n->nowait = $9; + $$ = (Node *)n; + } | ALTER INDEX qualified_name alter_table_cmds { AlterTableStmt *n = makeNode(AlterTableStmt); diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c index 1fbc387d47..1483f9a475 100644 --- a/src/backend/tcop/utility.c +++ b/src/backend/tcop/utility.c @@ -162,6 +162,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree) case T_AlterTSConfigurationStmt: case T_AlterTSDictionaryStmt: case T_AlterTableMoveAllStmt: + case T_AlterTableSetLoggedAllStmt: case T_AlterTableSpaceOptionsStmt: case T_AlterTableStmt: case T_AlterTypeStmt: @@ -1747,6 +1748,12 @@ ProcessUtilitySlow(ParseState *pstate, commandCollected = true; break; + case T_AlterTableSetLoggedAllStmt: + AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree); + 
/* commands are stashed in AlterTableSetLoggedAll */ + commandCollected = true; + break; + case T_DropStmt: ExecDropStmt((DropStmt *) parsetree, isTopLevel); /* no commands stashed for DROP */ @@ -2669,6 +2676,10 @@ CreateCommandTag(Node *parsetree) tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype); break; + case T_AlterTableSetLoggedAllStmt: + tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype); + break; + case T_AlterTableStmt: tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype); break; diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h index 336549cc5f..714077ff4c 100644 --- a/src/include/commands/tablecmds.h +++ b/src/include/commands/tablecmds.h @@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse); extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt); +extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt); + extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt, Oid *oldschema); diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h index 7c657c1241..8860b2e548 100644 --- a/src/include/nodes/nodes.h +++ b/src/include/nodes/nodes.h @@ -428,6 +428,7 @@ typedef enum NodeTag T_AlterCollationStmt, T_CallStmt, T_AlterStatsStmt, + T_AlterTableSetLoggedAllStmt, /* * TAGS FOR PARSE TREE NODES (parsenodes.h) diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h index 4c5a8a39bf..c3e1bc66d1 100644 --- a/src/include/nodes/parsenodes.h +++ b/src/include/nodes/parsenodes.h @@ -2350,6 +2350,15 @@ typedef struct AlterTableMoveAllStmt bool nowait; } AlterTableMoveAllStmt; +typedef struct AlterTableSetLoggedAllStmt +{ + NodeTag type; + char *tablespacename; + ObjectType objtype; /* Object type to move */ + bool logged; + bool nowait; +} AlterTableSetLoggedAllStmt; + /* ---------------------- * Create/Alter Extension Statements * ---------------------- -- 2.27.0
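For readers following the relpath.c part of patch 1/2, the mark-file naming ("<filename>.<mark>") can be sketched in isolation as below. This is only an illustrative sketch, not the patch's code: `sketch_mark_path` and `enum sketch_mark` are hypothetical names, and only the default-tablespace, non-temp branch of GetRelationPath() is mirrored. The assertion reflects the smgr.h comment that ".<digit>" already means a segment file.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the patch's StorageMarks values. */
enum sketch_mark
{
	SKETCH_MARK_NONE = 0,
	SKETCH_MARK_UNCOMMITTED = 'u'	/* the file is not committed yet */
};

/*
 * Build "base/<db>/<rel>[_<fork>][.<mark>]" into buf.
 *
 * fork_name is NULL or "" for the main fork.  Returns what snprintf
 * returns, i.e. the length the full path needs.
 */
static int
sketch_mark_path(char *buf, size_t buflen, unsigned db, unsigned rel,
				 const char *fork_name, char mark)
{
	char		markstr[3] = "";

	/* ".<digit>" denotes a segment file, so a digit cannot be a mark */
	assert(!isdigit((unsigned char) mark));

	if (mark != 0)
		snprintf(markstr, sizeof(markstr), ".%c", mark);

	if (fork_name != NULL && fork_name[0] != '\0')
		return snprintf(buf, buflen, "base/%u/%u_%s%s",
						db, rel, fork_name, markstr);

	return snprintf(buf, buflen, "base/%u/%u%s", db, rel, markstr);
}
```

With database 5, relfilenode 16384, fork "init" and the 'u' mark, this yields "base/5/16384_init.u". With no fork name and no mark it yields the plain "base/5/16384".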
At Mon, 20 Dec 2021 15:28:23 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
>
> 4. Including the reasons above, this is not fully functional.
> For example, if we execute the following commands on the primary,
> the replica doesn't work correctly. (boom!)
>
> =# CREATE UNLOGGED TABLE t (a int);
> =# ALTER TABLE t SET LOGGED;
>

- The issue 4 above is not fixed (yet).

Not only in that case: RelationChangePersistence needs to emit a truncate record before the FPIs. If the primary crashes in the middle of the operation, the table contents will vanish along with the persistence change. That is the correct behavior.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

From b28163fd7b3527e69f5b76f252891f800d7ac98c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v9 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data rewriting, it currently runs a heap rewrite, which causes a large amount of file I/O. This patch makes the command run without a heap rewrite. In addition, SET LOGGED while wal_level > minimal emits WAL using XLOG_FPI instead of a massive number of HEAP_INSERTs, which should be smaller. This also allows cleanup of files left behind when the transaction that created them crashes.
--- src/backend/access/rmgrdesc/smgrdesc.c | 52 +++ src/backend/access/transam/README | 8 + src/backend/access/transam/xlog.c | 17 + src/backend/catalog/storage.c | 593 +++++++++++++++++++++++-- src/backend/commands/tablecmds.c | 256 +++++++++-- src/backend/replication/basebackup.c | 3 +- src/backend/storage/buffer/bufmgr.c | 88 ++++ src/backend/storage/file/fd.c | 4 +- src/backend/storage/file/reinit.c | 344 ++++++++++---- src/backend/storage/smgr/md.c | 93 +++- src/backend/storage/smgr/smgr.c | 32 ++ src/backend/storage/sync/sync.c | 20 +- src/bin/pg_rewind/parsexlog.c | 24 + src/common/relpath.c | 47 +- src/include/catalog/storage.h | 2 + src/include/catalog/storage_xlog.h | 42 +- src/include/common/relpath.h | 9 +- src/include/storage/bufmgr.h | 2 + src/include/storage/fd.h | 1 + src/include/storage/md.h | 8 +- src/include/storage/reinit.h | 10 +- src/include/storage/smgr.h | 17 + 22 files changed, 1465 insertions(+), 207 deletions(-) diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c index 7755553d57..d251f22207 100644 --- a/src/backend/access/rmgrdesc/smgrdesc.c +++ b/src/backend/access/rmgrdesc/smgrdesc.c @@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record) xlrec->blkno, xlrec->flags); pfree(path); } + else if (info == XLOG_SMGR_UNLINK) + { + xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec; + char *path = relpathperm(xlrec->rnode, xlrec->forkNum); + + appendStringInfoString(buf, path); + pfree(path); + } + else if (info == XLOG_SMGR_MARK) + { + xl_smgr_mark *xlrec = (xl_smgr_mark *) rec; + char *path = GetRelationPath(xlrec->rnode.dbNode, + xlrec->rnode.spcNode, + xlrec->rnode.relNode, + InvalidBackendId, + xlrec->forkNum, xlrec->mark); + char *action; + + switch (xlrec->action) + { + case XLOG_SMGR_MARK_CREATE: + action = "CREATE"; + break; + case XLOG_SMGR_MARK_UNLINK: + action = "DELETE"; + break; + default: + action = "<unknown action>"; + break; + } + + appendStringInfo(buf, "%s %s", action, 
path); + pfree(path); + } + else if (info == XLOG_SMGR_BUFPERSISTENCE) + { + xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec; + char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM); + + appendStringInfoString(buf, path); + appendStringInfo(buf, " persistence %d", xlrec->persistence); + pfree(path); + } } const char * @@ -55,6 +98,15 @@ smgr_identify(uint8 info) case XLOG_SMGR_TRUNCATE: id = "TRUNCATE"; break; + case XLOG_SMGR_UNLINK: + id = "UNLINK"; + break; + case XLOG_SMGR_MARK: + id = "MARK"; + break; + case XLOG_SMGR_BUFPERSISTENCE: + id = "BUFPERSISTENCE"; + break; } return id; diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README index 1edc8180c1..b344bbe511 100644 --- a/src/backend/access/transam/README +++ b/src/backend/access/transam/README @@ -724,6 +724,14 @@ we must panic and abort recovery. The DBA will have to manually clean up then restart recovery. This is part of the reason for not writing a WAL entry until we've successfully done the original action. +The Smgr MARK files +-------------------------------- + +An smgr mark file is created when a new relation file is created to +mark the relfilenode needs to be cleaned up at recovery time. In +contrast to the four actions above, failure to remove smgr mark files +will lead to data loss, in which case the server will shut down. 
+ Skipping WAL for New RelFileNode -------------------------------- diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 1e1fbe957f..59f4c2eacf 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -40,6 +40,7 @@ #include "catalog/catversion.h" #include "catalog/pg_control.h" #include "catalog/pg_database.h" +#include "catalog/storage.h" #include "commands/progress.h" #include "commands/tablespace.h" #include "common/controldata_utils.h" @@ -4564,6 +4565,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode, { ereport(DEBUG1, (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery"))); + + /* cleanup garbage files left during crash recovery */ + ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER | + UNLOGGED_RELATION_CLEANUP); + + /* run rollback cleanup if any */ + smgrDoPendingDeletes(false); + InArchiveRecovery = true; if (StandbyModeRequested) StandbyMode = true; @@ -7824,6 +7833,14 @@ StartupXLOG(void) } } + /* cleanup garbage files left during crash recovery */ + if (!InArchiveRecovery) + ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER | + UNLOGGED_RELATION_CLEANUP); + + /* run rollback cleanup if any */ + smgrDoPendingDeletes(false); + /* Allow resource managers to do any required cleanup. 
*/ for (rmid = 0; rmid <= RM_MAX_ID; rmid++) { diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c index c5ad28d71f..03fccc3c3b 100644 --- a/src/backend/catalog/storage.c +++ b/src/backend/catalog/storage.c @@ -19,6 +19,7 @@ #include "postgres.h" +#include "access/amapi.h" #include "access/parallel.h" #include "access/visibilitymap.h" #include "access/xact.h" @@ -27,6 +28,7 @@ #include "access/xlogutils.h" #include "catalog/storage.h" #include "catalog/storage_xlog.h" +#include "common/hashfn.h" #include "miscadmin.h" #include "storage/freespace.h" #include "storage/smgr.h" @@ -57,9 +59,18 @@ int wal_skip_threshold = 2048; /* in kilobytes */ * but I'm being paranoid. */ +#define PDOP_DELETE (1 << 0) +#define PDOP_UNLINK_FORK (1 << 1) +#define PDOP_UNLINK_MARK (1 << 2) +#define PDOP_SET_PERSISTENCE (1 << 3) + typedef struct PendingRelDelete { RelFileNode relnode; /* relation that may need to be deleted */ + int op; /* operation mask */ + bool bufpersistence; /* buffer persistence to set */ + int unlink_forknum; /* forknum to unlink */ + StorageMarks unlink_mark; /* mark to unlink */ BackendId backend; /* InvalidBackendId if not a temp rel */ bool atCommit; /* T=delete at commit; F=delete at abort */ int nestLevel; /* xact nesting level of request */ @@ -75,6 +86,24 @@ typedef struct PendingRelSync static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */ HTAB *pendingSyncHash = NULL; +typedef struct SRelHashEntry +{ + SMgrRelation srel; + char status; /* for simplehash use */ +} SRelHashEntry; + +/* define hashtable for workarea for pending deletes */ +#define SH_PREFIX srelhash +#define SH_ELEMENT_TYPE SRelHashEntry +#define SH_KEY_TYPE SMgrRelation +#define SH_KEY srel +#define SH_HASH_KEY(tb, key) \ + hash_bytes((unsigned char *)&key, sizeof(SMgrRelation)) +#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(SMgrRelation)) == 0) +#define SH_SCOPE static inline +#define SH_DEFINE +#define SH_DECLARE +#include "lib/simplehash.h" 
/* * AddPendingSync @@ -143,22 +172,47 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) return NULL; /* placate compiler */ } + /* + * We are going to create a new storage file. If the server crashes before + * the current transaction ends, the file needs to be cleaned up. The + * SMGR_MARK_UNCOMMITTED mark file serves as the signal for orphan files. + */ srel = smgropen(rnode, backend); + log_smgrcreatemark(&rnode, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED); + smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false); smgrcreate(srel, MAIN_FORKNUM, false); if (needs_wal) log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM); - /* Add the relation to the list of stuff to delete at abort */ + /* + * Add the relation to the list of stuff to delete at abort. We don't + * remove the mark file at commit. It needs to persist until the main fork + * file is actually deleted. See SyncPostCheckpoint. + */ pending = (PendingRelDelete *) MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); pending->relnode = rnode; + pending->op = PDOP_DELETE; pending->backend = backend; pending->atCommit = false; /* delete if abort */ pending->nestLevel = GetCurrentTransactionNestLevel(); pending->next = pendingDeletes; pendingDeletes = pending; + /* drop mark file at commit */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_MARK; + pending->unlink_forknum = MAIN_FORKNUM; + pending->unlink_mark = SMGR_MARK_UNCOMMITTED; + pending->backend = backend; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded()) { Assert(backend == InvalidBackendId); @@ -168,6 +222,226 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) return srel; } +/* + * RelationCreateInitFork + * Create physical storage for the 
init fork of a relation. + * + * Create the init fork for the relation. + * + * This function is transactional. The creation is WAL-logged, and if the + * transaction aborts later on, the init fork will be removed. + */ +void +RelationCreateInitFork(Relation rel) +{ + RelFileNode rnode = rel->rd_node; + PendingRelDelete *pending; + SMgrRelation srel; + PendingRelDelete *prev; + PendingRelDelete *next; + bool create = true; + + /* switch buffer persistence */ + SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false); + + /* + * If we have entries for init-fork operations on this relation, that means + * that we have already registered pending delete entries to drop an + * init-fork preexisting since before the current transaction started. This + * function reverts that change just by removing the entries. + */ + prev = NULL; + for (pending = pendingDeletes; pending != NULL; pending = next) + { + next = pending->next; + + /* + * We don't touch unrelated entries. Although init-fork related entries + * are not useful if the relation is created or dropped in this + * transaction, we don't bother to avoid registering entries for such + * relations here. + */ + if (!RelFileNodeEquals(rnode, pending->relnode) || + pending->unlink_forknum != INIT_FORKNUM || + (pending->op & PDOP_DELETE) != 0) + { + prev = pending; + continue; + } + + /* make sure the entry is what we're expecting here */ + Assert(((pending->op & (PDOP_UNLINK_FORK|PDOP_UNLINK_MARK)) != 0 && + pending->unlink_forknum == INIT_FORKNUM) || + (pending->op & PDOP_SET_PERSISTENCE) != 0); + + /* unlink list entry */ + if (prev) + prev->next = next; + else + pendingDeletes = next; + pfree(pending); + + create = false; + } + + if (!create) + return; + + /* + * We are going to create an init fork. If the server crashes before the + * current transaction ends, an init fork left behind corrupts data during + * recovery. The mark file works as the sentinel to identify that + * situation. 
+ */ + srel = smgropen(rnode, InvalidBackendId); + log_smgrcreatemark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED); + smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false); + + /* We don't have an existing init fork; create it. */ + smgrcreate(srel, INIT_FORKNUM, false); + + /* + * An index init fork needs further initialization. ambuildempty should do + * WAL-logging and file sync by itself, but otherwise we do that ourselves. + */ + if (rel->rd_rel->relkind == RELKIND_INDEX) + rel->rd_indam->ambuildempty(rel); + else + { + log_smgrcreate(&rnode, INIT_FORKNUM); + smgrimmedsync(srel, INIT_FORKNUM); + } + + /* drop the init fork, mark file and revert persistence at abort */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_FORK | PDOP_UNLINK_MARK | PDOP_SET_PERSISTENCE; + pending->unlink_forknum = INIT_FORKNUM; + pending->unlink_mark = SMGR_MARK_UNCOMMITTED; + pending->bufpersistence = true; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + + /* drop mark file at commit */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_MARK; + pending->unlink_forknum = INIT_FORKNUM; + pending->unlink_mark = SMGR_MARK_UNCOMMITTED; + pending->backend = InvalidBackendId; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; +} + +/* + * RelationDropInitFork + * Delete physical storage for the init fork of a relation. + * + * Register pending-delete of the init fork. The real deletion is performed by + * smgrDoPendingDeletes at commit. + * + * This function is transactional. If the transaction aborts later on, the + * deletion doesn't happen. 
+ */ +void +RelationDropInitFork(Relation rel) +{ + RelFileNode rnode = rel->rd_node; + PendingRelDelete *pending; + PendingRelDelete *prev; + PendingRelDelete *next; + bool inxact_created = false; + + /* switch buffer persistence */ + SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false); + + /* + * If we have entries for init-fork operations of this relation, that means + * that we have created the init fork in the current transaction. We + * remove the init fork and mark file immediately in that case. Otherwise + * just register pending-delete for the existing init fork. + */ + prev = NULL; + for (pending = pendingDeletes; pending != NULL; pending = next) + { + next = pending->next; + + /* + * We don't touch unrelated entries. Although init-fork related entries + * are not useful if the relation is created or dropped in this + * transaction, we don't bother to avoid registering entries for such + * relations here. + */ + if (!RelFileNodeEquals(rnode, pending->relnode) || + pending->unlink_forknum != INIT_FORKNUM || + (pending->op & PDOP_DELETE) != 0) + { + prev = pending; + continue; + } + + /* make sure the entry is what we're expecting here */ + Assert(((pending->op & (PDOP_UNLINK_FORK|PDOP_UNLINK_MARK)) != 0 && + pending->unlink_forknum == INIT_FORKNUM) || + (pending->op & PDOP_SET_PERSISTENCE) != 0); + + /* unlink list entry */ + if (prev) + prev->next = next; + else + pendingDeletes = next; + pfree(pending); + + inxact_created = true; + } + + if (inxact_created) + { + SMgrRelation srel = smgropen(rnode, InvalidBackendId); + + /* + * INIT forks are never loaded into shared buffers, so there is no point + * in dropping buffers for such files. 
+ */ + log_smgrunlinkmark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED); + smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false); + log_smgrunlink(&rnode, INIT_FORKNUM); + smgrunlink(srel, INIT_FORKNUM, false); + return; + } + + /* register drop of this init fork file at commit */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_UNLINK_FORK; + pending->unlink_forknum = INIT_FORKNUM; + pending->backend = InvalidBackendId; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + + /* revert buffer-persistence changes at abort */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->op = PDOP_SET_PERSISTENCE; + pending->bufpersistence = false; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; +} + /* * Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL. */ @@ -187,6 +461,88 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum) XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE); } +/* + * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL. + */ +void +log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum) +{ + xl_smgr_unlink xlrec; + + /* + * Make an XLOG entry reporting the file unlink. + */ + xlrec.rnode = *rnode; + xlrec.forkNum = forkNum; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE); +} + +/* + * Perform XLogInsert of an XLOG_SMGR_CREATEMARK record to WAL. 
+ */ +void +log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum, + StorageMarks mark) +{ + xl_smgr_mark xlrec; + + /* + * Make an XLOG entry reporting the mark file creation. + */ + xlrec.rnode = *rnode; + xlrec.forkNum = forkNum; + xlrec.mark = mark; + xlrec.action = XLOG_SMGR_MARK_CREATE; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE); +} + +/* + * Perform XLogInsert of an XLOG_SMGR_MARK record (action + * XLOG_SMGR_MARK_UNLINK) to WAL. + */ +void +log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum, + StorageMarks mark) +{ + xl_smgr_mark xlrec; + + /* + * Make an XLOG entry reporting the mark file removal. + */ + xlrec.rnode = *rnode; + xlrec.forkNum = forkNum; + xlrec.mark = mark; + xlrec.action = XLOG_SMGR_MARK_UNLINK; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE); +} + +/* + * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL. + */ +void +log_smgrbufpersistence(const RelFileNode *rnode, bool persistence) +{ + xl_smgr_bufpersistence xlrec; + + /* + * Make an XLOG entry reporting the change of buffer persistence. + */ + xlrec.rnode = *rnode; + xlrec.persistence = persistence; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE); +} + /* * RelationDropStorage * Schedule unlinking of physical storage at transaction commit. 
@@ -200,6 +556,7 @@ RelationDropStorage(Relation rel) pending = (PendingRelDelete *) MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); pending->relnode = rel->rd_node; + pending->op = PDOP_DELETE; pending->backend = rel->rd_backend; pending->atCommit = true; /* delete if commit */ pending->nestLevel = GetCurrentTransactionNestLevel(); @@ -618,59 +975,104 @@ smgrDoPendingDeletes(bool isCommit) int nrels = 0, maxrels = 0; SMgrRelation *srels = NULL; + srelhash_hash *close_srels = NULL; + bool found; prev = NULL; for (pending = pendingDeletes; pending != NULL; pending = next) { + SMgrRelation srel; + next = pending->next; if (pending->nestLevel < nestLevel) { /* outer-level entries should not be processed yet */ prev = pending; + continue; } + + /* unlink list entry first, so we don't retry on failure */ + if (prev) + prev->next = next; else + pendingDeletes = next; + + if (pending->atCommit != isCommit) { - /* unlink list entry first, so we don't retry on failure */ - if (prev) - prev->next = next; - else - pendingDeletes = next; - /* do deletion if called for */ - if (pending->atCommit == isCommit) - { - SMgrRelation srel; - - srel = smgropen(pending->relnode, pending->backend); - - /* allocate the initial array, or extend it, if needed */ - if (maxrels == 0) - { - maxrels = 8; - srels = palloc(sizeof(SMgrRelation) * maxrels); - } - else if (maxrels <= nrels) - { - maxrels *= 2; - srels = repalloc(srels, sizeof(SMgrRelation) * maxrels); - } - - srels[nrels++] = srel; - } /* must explicitly free the list entry */ pfree(pending); /* prev does not change */ + continue; } + + if (close_srels == NULL) + close_srels = srelhash_create(CurrentMemoryContext, 32, NULL); + + srel = smgropen(pending->relnode, pending->backend); + + /* Uniquify the smgr relations */ + srelhash_insert(close_srels, srel, &found); + + if (pending->op & PDOP_DELETE) + { + /* allocate the initial array, or extend it, if needed */ + if (maxrels == 0) + { + maxrels = 8; + srels = 
palloc(sizeof(SMgrRelation) * maxrels); + } + else if (maxrels <= nrels) + { + maxrels *= 2; + srels = repalloc(srels, sizeof(SMgrRelation) * maxrels); + } + + srels[nrels++] = srel; + } + + if (pending->op & PDOP_UNLINK_FORK) + { + /* other forks would need their buffers dropped */ + Assert(pending->unlink_forknum == INIT_FORKNUM); + + /* Don't emit WAL during recovery. */ + if (!InRecovery) + log_smgrunlink(&pending->relnode, pending->unlink_forknum); + smgrunlink(srel, pending->unlink_forknum, false); + } + + if (pending->op & PDOP_UNLINK_MARK) + { + if (!InRecovery) + log_smgrunlinkmark(&pending->relnode, + pending->unlink_forknum, + pending->unlink_mark); + smgrunlinkmark(srel, pending->unlink_forknum, + pending->unlink_mark, InRecovery); + } + + if (pending->op & PDOP_SET_PERSISTENCE) + SetRelationBuffersPersistence(srel, pending->bufpersistence, + InRecovery); } if (nrels > 0) { smgrdounlinkall(srels, nrels, false); - - for (int i = 0; i < nrels; i++) - smgrclose(srels[i]); - pfree(srels); } + + if (close_srels) + { + srelhash_iterator i; + SRelHashEntry *ent; + + /* close smgr relations */ + srelhash_start_iterate(close_srels, &i); + while ((ent = srelhash_iterate(close_srels, &i)) != NULL) + smgrclose(ent->srel); + srelhash_destroy(close_srels); + } } /* @@ -840,7 +1242,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr) for (pending = pendingDeletes; pending != NULL; pending = pending->next) { if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit - && pending->backend == InvalidBackendId) + && pending->backend == InvalidBackendId + && pending->op == PDOP_DELETE) nrels++; } if (nrels == 0) @@ -853,7 +1256,8 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr) for (pending = pendingDeletes; pending != NULL; pending = pending->next) { if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit - && pending->backend == InvalidBackendId) + && pending->backend == InvalidBackendId && + pending->op == PDOP_DELETE) { *rptr = 
pending->relnode; rptr++; @@ -933,6 +1337,15 @@ smgr_redo(XLogReaderState *record) reln = smgropen(xlrec->rnode, InvalidBackendId); smgrcreate(reln, xlrec->forkNum, true); } + else if (info == XLOG_SMGR_UNLINK) + { + xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record); + SMgrRelation reln; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + smgrunlink(reln, xlrec->forkNum, true); + smgrclose(reln); + } else if (info == XLOG_SMGR_TRUNCATE) { xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record); @@ -1021,6 +1434,120 @@ smgr_redo(XLogReaderState *record) FreeFakeRelcacheEntry(rel); } + else if (info == XLOG_SMGR_MARK) + { + xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record); + SMgrRelation reln; + PendingRelDelete *pending; + bool created = false; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + switch (xlrec->action) + { + case XLOG_SMGR_MARK_CREATE: + smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true); + created = true; + break; + case XLOG_SMGR_MARK_UNLINK: + smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true); + break; + default: + elog(ERROR, "unknown smgr_mark action \"%c\"", xlrec->action); + } + + if (created) + { + /* revert mark file operation at abort */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = xlrec->rnode; + pending->op = PDOP_UNLINK_MARK; + pending->unlink_forknum = xlrec->forkNum; + pending->unlink_mark = xlrec->mark; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + } + else + { + /* + * Delete pending action for this mark file if any. We should have + * at most one entry for this action. 
+ */ + PendingRelDelete *prev = NULL; + + for (pending = pendingDeletes; pending != NULL; + pending = pending->next) + { + if (RelFileNodeEquals(xlrec->rnode, pending->relnode) && + pending->unlink_forknum == xlrec->forkNum && + (pending->op & PDOP_UNLINK_MARK) != 0) + { + if (prev) + prev->next = pending->next; + else + pendingDeletes = pending->next; + pfree(pending); + break; + } + + prev = pending; + } + } + } + else if (info == XLOG_SMGR_BUFPERSISTENCE) + { + xl_smgr_bufpersistence *xlrec = + (xl_smgr_bufpersistence *) XLogRecGetData(record); + SMgrRelation reln; + PendingRelDelete *pending; + PendingRelDelete *prev = NULL; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + SetRelationBuffersPersistence(reln, xlrec->persistence, true); + + /* + * Delete pending action for persistence change if any. We should have + * at most one entry for this action. + */ + for (pending = pendingDeletes; pending != NULL; pending = pending->next) + { + if (RelFileNodeEquals(xlrec->rnode, pending->relnode) && + (pending->op & PDOP_SET_PERSISTENCE) != 0) + { + Assert (pending->bufpersistence == xlrec->persistence); + + if (prev) + prev->next = pending->next; + else + pendingDeletes = pending->next; + pfree(pending); + break; + } + + prev = pending; + } + + /* + * Revert buffer-persistence changes at abort if the relation is going + * to different persistence from before this transaction. 
+ */ + if (!pending) + { + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = xlrec->rnode; + pending->op = PDOP_SET_PERSISTENCE; + pending->bufpersistence = !xlrec->persistence; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + } + } else elog(PANIC, "smgr_redo: unknown op code %u", info); } diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index bf42587e38..0d9c801535 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -52,6 +52,7 @@ #include "commands/defrem.h" #include "commands/event_trigger.h" #include "commands/policy.h" +#include "commands/progress.h" #include "commands/sequence.h" #include "commands/tablecmds.h" #include "commands/tablespace.h" @@ -5329,6 +5330,177 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel, return newcmd; } +/* + * RelationChangePersistence: do in-place persistence change of a relation + */ +static void +RelationChangePersistence(AlteredTableInfo *tab, char persistence, + LOCKMODE lockmode) +{ + Relation rel; + Relation classRel; + HeapTuple tuple, + newtuple; + Datum new_val[Natts_pg_class]; + bool new_null[Natts_pg_class], + new_repl[Natts_pg_class]; + int i; + List *relids; + ListCell *lc_oid; + + Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE); + Assert(lockmode == AccessExclusiveLock); + + /* + * Under the following condition, we need to call ATRewriteTable, which + * cannot be false in the AT_REWRITE_ALTER_PERSISTENCE case. 
+ */ + Assert(tab->constraints == NULL && tab->partition_constraint == NULL && + tab->newvals == NULL && !tab->verify_new_notnull); + + rel = table_open(tab->relid, lockmode); + + Assert(rel->rd_rel->relpersistence != persistence); + + elog(DEBUG1, "perform in-place persistence change"); + + /* + * First we collect all relations that we need to change persistence. + */ + + /* Collect OIDs of indexes and toast relations */ + relids = RelationGetIndexList(rel); + relids = lcons_oid(rel->rd_id, relids); + + /* Add toast relation if any */ + if (OidIsValid(rel->rd_rel->reltoastrelid)) + { + List *toastidx; + Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode); + + relids = lappend_oid(relids, rel->rd_rel->reltoastrelid); + toastidx = RelationGetIndexList(toastrel); + relids = list_concat(relids, toastidx); + pfree(toastidx); + table_close(toastrel, NoLock); + } + + table_close(rel, NoLock); + + /* Make changes in storage */ + classRel = table_open(RelationRelationId, RowExclusiveLock); + + foreach (lc_oid, relids) + { + Oid reloid = lfirst_oid(lc_oid); + Relation r = relation_open(reloid, lockmode); + + /* + * Some access methods do not accept in-place persistence change. For + * example, GiST uses page LSNs to figure out whether a block has + * changed, where UNLOGGED GiST indexes use fake LSNs that are + * incompatible with real LSNs used for LOGGED ones. + * + * XXXX: We don't bother to allow in-place persistence change for index + * methods other than btree for now. + */ + if (r->rd_rel->relkind == RELKIND_INDEX && + r->rd_rel->relam != BTREE_AM_OID) + { + int reindex_flags; + + /* reindex doesn't allow concurrent use of the index */ + table_close(r, NoLock); + + reindex_flags = + REINDEX_REL_SUPPRESS_INDEX_USE | + REINDEX_REL_CHECK_CONSTRAINTS; + + /* Set the same persistence as the parent relation. 
*/ + if (persistence == RELPERSISTENCE_UNLOGGED) + reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED; + else + reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT; + + reindex_index(reloid, reindex_flags, persistence, 0); + + continue; + } + + /* Create or drop init fork */ + if (persistence == RELPERSISTENCE_UNLOGGED) + RelationCreateInitFork(r); + else + RelationDropInitFork(r); + + /* + * When this relation gets WAL-logged, immediately sync all files but + * initfork to establish the initial state on storage. Buffers have + * already been flushed out by RelationCreate(Drop)InitFork called just + * above. Initfork should have been synced as needed. + */ + if (persistence == RELPERSISTENCE_PERMANENT) + { + for (i = 0 ; i < INIT_FORKNUM ; i++) + { + if (smgrexists(RelationGetSmgr(r), i)) + smgrimmedsync(RelationGetSmgr(r), i); + } + } + + /* Update catalog */ + tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid)); + if (!HeapTupleIsValid(tuple)) + elog(ERROR, "cache lookup failed for relation %u", reloid); + + memset(new_val, 0, sizeof(new_val)); + memset(new_null, false, sizeof(new_null)); + memset(new_repl, false, sizeof(new_repl)); + + new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence); + new_null[Anum_pg_class_relpersistence - 1] = false; + new_repl[Anum_pg_class_relpersistence - 1] = true; + + newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel), + new_val, new_null, new_repl); + + CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple); + heap_freetuple(newtuple); + + /* + * While wal_level >= replica, switching to LOGGED requires the + * relation content to be WAL-logged to recover the table. + * We don't emit this while wal_level = minimal. 
+ */ + if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded()) + { + ForkNumber fork; + xl_smgr_truncate xlrec; + + xlrec.blkno = 0; + xlrec.rnode = r->rd_node; + xlrec.flags = SMGR_TRUNCATE_ALL; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + + XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE); + + for (fork = 0; fork < INIT_FORKNUM ; fork++) + { + if (smgrexists(RelationGetSmgr(r), fork)) + log_newpage_range(r, fork, 0, + smgrnblocks(RelationGetSmgr(r), fork), + false); + } + } + + table_close(r, NoLock); + } + + table_close(classRel, NoLock); +} + /* * ATRewriteTables: ALTER TABLE phase 3 */ @@ -5459,47 +5631,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode, tab->relid, tab->rewrite); - /* - * Create transient table that will receive the modified data. - * - * Ensure it is marked correctly as logged or unlogged. We have - * to do this here so that buffers for the new relfilenode will - * have the right persistence set, and at the same time ensure - * that the original filenode's buffers will get read in with the - * correct setting (i.e. the original one). Otherwise a rollback - * after the rewrite would possibly result with buffers for the - * original filenode having the wrong persistence setting. - * - * NB: This relies on swap_relation_files() also swapping the - * persistence. That wouldn't work for pg_class, but that can't be - * unlogged anyway. - */ - OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod, - persistence, lockmode); + if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE) + RelationChangePersistence(tab, persistence, lockmode); + else + { + /* + * Create transient table that will receive the modified data. + * + * Ensure it is marked correctly as logged or unlogged. 
We + * have to do this here so that buffers for the new relfilenode + * will have the right persistence set, and at the same time + * ensure that the original filenode's buffers will get read in + * with the correct setting (i.e. the original one). Otherwise + * a rollback after the rewrite would possibly result with + * buffers for the original filenode having the wrong + * persistence setting. + * + * NB: This relies on swap_relation_files() also swapping the + * persistence. That wouldn't work for pg_class, but that can't + * be unlogged anyway. + */ + OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, + NewAccessMethod, + persistence, lockmode); - /* - * Copy the heap data into the new table with the desired - * modifications, and test the current data within the table - * against new constraints generated by ALTER TABLE commands. - */ - ATRewriteTable(tab, OIDNewHeap, lockmode); + /* + * Copy the heap data into the new table with the desired + * modifications, and test the current data within the table + * against new constraints generated by ALTER TABLE commands. + */ + ATRewriteTable(tab, OIDNewHeap, lockmode); - /* - * Swap the physical files of the old and new heaps, then rebuild - * indexes and discard the old heap. We can use RecentXmin for - * the table's new relfrozenxid because we rewrote all the tuples - * in ATRewriteTable, so no older Xid remains in the table. Also, - * we never try to swap toast tables by content, since we have no - * interest in letting this code work on system catalogs. - */ - finish_heap_swap(tab->relid, OIDNewHeap, - false, false, true, - !OidIsValid(tab->newTableSpace), - RecentXmin, - ReadNextMultiXactId(), - persistence); + /* + * Swap the physical files of the old and new heaps, then + * rebuild indexes and discard the old heap. We can use + * RecentXmin for the table's new relfrozenxid because we + * rewrote all the tuples in ATRewriteTable, so no older Xid + * remains in the table. 
Also, we never try to swap toast + * tables by content, since we have no interest in letting this + * code work on system catalogs. + */ + finish_heap_swap(tab->relid, OIDNewHeap, + false, false, true, + !OidIsValid(tab->newTableSpace), + RecentXmin, + ReadNextMultiXactId(), + persistence); - InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0); + InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0); + } } else { diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c index ec0485705d..45e1a5d817 100644 --- a/src/backend/replication/basebackup.c +++ b/src/backend/replication/basebackup.c @@ -1070,6 +1070,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly, bool excludeFound; ForkNumber relForkNum; /* Type of fork if file is a relation */ int relOidChars; /* Chars in filename that are the rel oid */ + StorageMarks mark; /* Skip special stuff */ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0) @@ -1120,7 +1121,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly, /* Exclude all forks for unlogged tables except the init fork */ if (isDbDir && parse_filename_for_nontemp_relation(de->d_name, &relOidChars, - &relForkNum)) + &relForkNum, &mark)) { /* Never exclude init forks */ if (relForkNum != INIT_FORKNUM) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index b4532948d3..dab74bf99a 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -37,6 +37,7 @@ #include "access/xlogutils.h" #include "catalog/catalog.h" #include "catalog/storage.h" +#include "catalog/storage_xlog.h" #include "executor/instrument.h" #include "lib/binaryheap.h" #include "miscadmin.h" @@ -3154,6 +3155,93 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum, } } +/* --------------------------------------------------------------------- + * SetRelationBuffersPersistence + * + * This function 
changes the persistence of all buffer pages of a relation + * then writes all dirty pages of the relation out to disk when switching + * to PERMANENT. (or more accurately, out to kernel disk buffers), + * ensuring that the kernel has an up-to-date view of the relation. + * + * Generally, the caller should be holding AccessExclusiveLock on the + * target relation to ensure that no other backend is busy dirtying + * more blocks of the relation; the effects can't be expected to last + * after the lock is released. + * + * XXX currently it sequentially searches the buffer pool, should be + * changed to more clever ways of searching. This routine is not + * used in any performance-critical code paths, so it's not worth + * adding additional overhead to normal paths to make it go faster; + * but see also DropRelFileNodeBuffers. + * -------------------------------------------------------------------- + */ +void +SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo) +{ + int i; + RelFileNodeBackend rnode = srel->smgr_rnode; + + Assert (!RelFileNodeBackendIsTemp(rnode)); + + if (!isRedo) + log_smgrbufpersistence(&srel->smgr_rnode.node, permanent); + + ResourceOwnerEnlargeBuffers(CurrentResourceOwner); + + for (i = 0; i < NBuffers; i++) + { + BufferDesc *bufHdr = GetBufferDescriptor(i); + uint32 buf_state; + + if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node)) + continue; + + ReservePrivateRefCountEntry(); + + buf_state = LockBufHdr(bufHdr); + + if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node)) + { + UnlockBufHdr(bufHdr, buf_state); + continue; + } + + if (permanent) + { + /* Init fork is being dropped, drop buffers for it. 
*/ + if (bufHdr->tag.forkNum == INIT_FORKNUM) + { + InvalidateBuffer(bufHdr); + continue; + } + + buf_state |= BM_PERMANENT; + pg_atomic_write_u32(&bufHdr->state, buf_state); + + /* we flush this buffer when switching to PERMANENT */ + if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY)) + { + PinBuffer_Locked(bufHdr); + LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), + LW_SHARED); + FlushBuffer(bufHdr, srel); + LWLockRelease(BufferDescriptorGetContentLock(bufHdr)); + UnpinBuffer(bufHdr, true); + } + else + UnlockBufHdr(bufHdr, buf_state); + } + else + { + /* init fork is always BM_PERMANENT. See BufferAlloc */ + if (bufHdr->tag.forkNum != INIT_FORKNUM) + buf_state &= ~BM_PERMANENT; + + UnlockBufHdr(bufHdr, buf_state); + } + } +} + /* --------------------------------------------------------------------- * DropRelFileNodesAllBuffers * diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index 263057841d..8487ae1f02 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -349,8 +349,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel); static void datadir_fsync_fname(const char *fname, bool isdir, int elevel); static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel); -static int fsync_parent_path(const char *fname, int elevel); - /* * pg_fsync --- do fsync with or without writethrough @@ -3759,7 +3757,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel) * This is aimed at making file operations persistent on disk in case of * an OS crash or power failure. 
*/ -static int +int fsync_parent_path(const char *fname, int elevel) { char parentpath[MAXPGPATH]; diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c index 0ae3fb6902..0137902bb2 100644 --- a/src/backend/storage/file/reinit.c +++ b/src/backend/storage/file/reinit.c @@ -16,29 +16,49 @@ #include <unistd.h> +#include "access/xlog.h" +#include "catalog/pg_tablespace_d.h" #include "common/relpath.h" #include "postmaster/startup.h" +#include "storage/bufmgr.h" #include "storage/copydir.h" #include "storage/fd.h" +#include "storage/md.h" #include "storage/reinit.h" +#include "storage/smgr.h" #include "utils/hsearch.h" #include "utils/memutils.h" static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, - int op); + Oid tspid, int op); static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, - int op); + Oid tspid, Oid dbid, int op); typedef struct { Oid reloid; /* hash key */ -} unlogged_relation_entry; + bool has_init; /* has INIT fork */ + bool dirty_init; /* needs to remove INIT fork */ + bool dirty_all; /* needs to remove all forks */ +} relfile_entry; /* - * Reset unlogged relations from before the last restart. + * Clean up and reset relation files from before the last restart. * - * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any - * relation with an "init" fork, except for the "init" fork itself. + * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations + * depending on the existence of the "cleanup" forks. + * + * If SMGR_MARK_UNCOMMITTED mark file for init fork is present, we remove the + * init fork along with the mark file. + * + * If SMGR_MARK_UNCOMMITTED mark file for main fork is present, we remove the + * whole relation along with the mark file. + * + * Otherwise, if the "init" fork is found, we remove all forks of any relation + * with the "init" fork, except for the "init" fork itself.
+ * + * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all + * relations that have the "cleanup" and/or the "init" forks. * * If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main * fork. @@ -72,7 +92,7 @@ ResetUnloggedRelations(int op) /* * First process unlogged files in pg_default ($PGDATA/base) */ - ResetUnloggedRelationsInTablespaceDir("base", op); + ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op); /* * Cycle through directories for all non-default tablespaces. @@ -81,13 +101,19 @@ ResetUnloggedRelations(int op) while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL) { + Oid tspid; + if (strcmp(spc_de->d_name, ".") == 0 || strcmp(spc_de->d_name, "..") == 0) continue; snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s", spc_de->d_name, TABLESPACE_VERSION_DIRECTORY); - ResetUnloggedRelationsInTablespaceDir(temp_path, op); + + tspid = atooid(spc_de->d_name); + + Assert(tspid != 0); + ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op); } FreeDir(spc_dir); @@ -103,7 +129,8 @@ ResetUnloggedRelations(int op) * Process one tablespace directory for ResetUnloggedRelations */ static void -ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) +ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, + Oid tspid, int op) { DIR *ts_dir; struct dirent *de; @@ -130,6 +157,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) while ((de = ReadDir(ts_dir, tsdirname)) != NULL) { + Oid dbid; + /* * We're only interested in the per-database directories, which have * numeric names. Note that this code will also (properly) ignore "." 
@@ -148,7 +177,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) ereport_startup_progress("resetting unlogged relations (cleanup), elapsed time: %ld.%02d s, current path: %s", dbspace_path); - ResetUnloggedRelationsInDbspaceDir(dbspace_path, op); + dbid = atooid(de->d_name); + Assert(dbid != 0); + + ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op); } FreeDir(ts_dir); @@ -158,125 +190,227 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) * Process one per-dbspace directory for ResetUnloggedRelations */ static void -ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) +ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, + Oid tspid, Oid dbid, int op) { DIR *dbspace_dir; struct dirent *de; char rm_path[MAXPGPATH * 2]; + HTAB *hash; + HASHCTL ctl; /* Caller must specify at least one operation. */ - Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0); + Assert((op & (UNLOGGED_RELATION_CLEANUP | + UNLOGGED_RELATION_DROP_BUFFER | + UNLOGGED_RELATION_INIT)) != 0); /* * Cleanup is a two-pass operation. First, we go through and identify all * the files with init forks. Then, we go through again and nuke * everything with the same OID except the init fork. */ + + /* + * It's possible that someone could create tons of unlogged relations in + * the same database & tablespace, so we'd better use a hash table rather + * than an array or linked list to keep track of which files need to be + * reset. Otherwise, this cleanup operation would be O(n^2). + */ + memset(&ctl, 0, sizeof(ctl)); + ctl.keysize = sizeof(Oid); + ctl.entrysize = sizeof(relfile_entry); + hash = hash_create("relfilenode cleanup hash", + 32, &ctl, HASH_ELEM | HASH_BLOBS); + + /* Collect INIT fork and mark files in the directory. 
*/ + dbspace_dir = AllocateDir(dbspacedirname); + while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) + { + int oidchars; + ForkNumber forkNum; + StorageMarks mark; + + /* Skip anything that doesn't look like a relation data file. */ + if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, + &forkNum, &mark)) + continue; + + if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED) + { + Oid key; + relfile_entry *ent; + bool found; + + /* + * Record the relfilenode information. If it has + * SMGR_MARK_UNCOMMITTED mark files, the relfilenode is in dirty + * state, where clean up is needed. + */ + key = atooid(de->d_name); + ent = hash_search(hash, &key, HASH_ENTER, &found); + + if (!found) + { + ent->has_init = false; + ent->dirty_init = false; + ent->dirty_all = false; + } + + if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED) + ent->dirty_init = true; + else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED) + ent->dirty_all = true; + else + { + Assert(forkNum == INIT_FORKNUM); + ent->has_init = true; + } + } + } + + /* Done with the first pass. */ + FreeDir(dbspace_dir); + + /* nothing to do if we don't have init nor cleanup forks */ + if (hash_get_num_entries(hash) < 1) + { + hash_destroy(hash); + return; + } + + if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0) + { + /* + * When we come here after recovery, smgr object for this file might + * have been created. In that case we need to drop all buffers then the + * smgr object before initializing the unlogged relation. This is safe + * as far as no other backends have accessed the relation before + * starting archive recovery. 
+ */ + HASH_SEQ_STATUS status; + relfile_entry *ent; + SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8); + int maxrels = 8; + int nrels = 0; + int i; + + Assert(!HotStandbyActive()); + + hash_seq_init(&status, hash); + while((ent = (relfile_entry *) hash_seq_search(&status)) != NULL) + { + RelFileNodeBackend rel; + + /* + * The relation is persistent and remains persistent. Don't + * drop the buffers for this relation. + */ + if (ent->has_init && ent->dirty_init) + continue; + + if (maxrels <= nrels) + { + maxrels *= 2; + srels = repalloc(srels, sizeof(SMgrRelation) * maxrels); + } + + rel.backend = InvalidBackendId; + rel.node.spcNode = tspid; + rel.node.dbNode = dbid; + rel.node.relNode = ent->reloid; + + srels[nrels++] = smgropen(rel.node, InvalidBackendId); + } + + DropRelFileNodesAllBuffers(srels, nrels); + + for (i = 0 ; i < nrels ; i++) + smgrclose(srels[i]); + } + + /* + * Now, make a second pass and remove anything that matches. + */ if ((op & UNLOGGED_RELATION_CLEANUP) != 0) { - HTAB *hash; - HASHCTL ctl; - - /* - * It's possible that someone could create a ton of unlogged relations - * in the same database & tablespace, so we'd better use a hash table - * rather than an array or linked list to keep track of which files - * need to be reset. Otherwise, this cleanup operation would be - * O(n^2). - */ - ctl.keysize = sizeof(Oid); - ctl.entrysize = sizeof(unlogged_relation_entry); - ctl.hcxt = CurrentMemoryContext; - hash = hash_create("unlogged relation OIDs", 32, &ctl, - HASH_ELEM | HASH_BLOBS | HASH_CONTEXT); - - /* Scan the directory. */ dbspace_dir = AllocateDir(dbspacedirname); while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) { - ForkNumber forkNum; - int oidchars; - unlogged_relation_entry ent; + ForkNumber forkNum; + StorageMarks mark; + int oidchars; + Oid key; + relfile_entry *ent; + RelFileNodeBackend rel; /* Skip anything that doesn't look like a relation data file.
*/ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) - continue; - - /* Also skip it unless this is the init fork. */ - if (forkNum != INIT_FORKNUM) - continue; - - /* - * Put the OID portion of the name into the hash table, if it - * isn't already. - */ - ent.reloid = atooid(de->d_name); - (void) hash_search(hash, &ent, HASH_ENTER, NULL); - } - - /* Done with the first pass. */ - FreeDir(dbspace_dir); - - /* - * If we didn't find any init forks, there's no point in continuing; - * we can bail out now. - */ - if (hash_get_num_entries(hash) == 0) - { - hash_destroy(hash); - return; - } - - /* - * Now, make a second pass and remove anything that matches. - */ - dbspace_dir = AllocateDir(dbspacedirname); - while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) - { - ForkNumber forkNum; - int oidchars; - unlogged_relation_entry ent; - - /* Skip anything that doesn't look like a relation data file. */ - if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) - continue; - - /* We never remove the init fork. */ - if (forkNum == INIT_FORKNUM) + &forkNum, &mark)) continue; /* * See whether the OID portion of the name shows up in the hash * table. If so, nuke it! */ - ent.reloid = atooid(de->d_name); - if (hash_search(hash, &ent, HASH_FIND, NULL)) + key = atooid(de->d_name); + ent = hash_search(hash, &key, HASH_FIND, NULL); + + if (!ent) + continue; + + if (!ent->dirty_all) { - snprintf(rm_path, sizeof(rm_path), "%s/%s", - dbspacedirname, de->d_name); - if (unlink(rm_path) < 0) - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not remove file \"%s\": %m", - rm_path))); + /* clean permanent relations don't need cleanup */ + if (!ent->has_init) + continue; + + if (ent->dirty_init) + { + /* + * The crashed transaction did SET UNLOGGED. This relation + * is restored to a LOGGED relation.
+ */ + if (forkNum != INIT_FORKNUM) + continue; + } else - elog(DEBUG2, "unlinked file \"%s\"", rm_path); + { + /* + * we don't remove the INIT fork of a non-dirty + * relfilenode + */ + if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE) + continue; + } } + + /* so, nuke it! */ + snprintf(rm_path, sizeof(rm_path), "%s/%s", + dbspacedirname, de->d_name); + if (unlink(rm_path) < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not remove file \"%s\": %m", + rm_path))); + + rel.backend = InvalidBackendId; + rel.node.spcNode = tspid; + rel.node.dbNode = dbid; + rel.node.relNode = atooid(de->d_name); + + ForgetRelationForkSyncRequests(rel, forkNum); } /* Cleanup is complete. */ FreeDir(dbspace_dir); - hash_destroy(hash); } + hash_destroy(hash); + hash = NULL; + /* * Initialization happens after cleanup is complete: we copy each init - * fork file to the corresponding main fork file. Note that if we are - * asked to do both cleanup and init, we may never get here: if the - * cleanup code determines that there are no init forks in this dbspace, - * it will return before we get to this point. + * fork file to the corresponding main fork file. */ if ((op & UNLOGGED_RELATION_INIT) != 0) { @@ -285,6 +419,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) { ForkNumber forkNum; + StorageMarks mark; int oidchars; char oidbuf[OIDCHARS + 1]; char srcpath[MAXPGPATH * 2]; @@ -292,9 +427,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) /* Skip anything that doesn't look like a relation data file. */ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) + &forkNum, &mark)) continue; + Assert(mark == SMGR_MARK_NONE); + /* Also skip it unless this is the init fork. 
*/ if (forkNum != INIT_FORKNUM) continue; @@ -328,15 +465,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) { ForkNumber forkNum; + StorageMarks mark; int oidchars; char oidbuf[OIDCHARS + 1]; char mainpath[MAXPGPATH]; /* Skip anything that doesn't look like a relation data file. */ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) + &forkNum, &mark)) continue; + Assert(mark == SMGR_MARK_NONE); + /* Also skip it unless this is the init fork. */ if (forkNum != INIT_FORKNUM) continue; @@ -379,7 +519,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) */ bool parse_filename_for_nontemp_relation(const char *name, int *oidchars, - ForkNumber *fork) + ForkNumber *fork, StorageMarks *mark) { int pos; @@ -410,11 +550,19 @@ parse_filename_for_nontemp_relation(const char *name, int *oidchars, for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar) ; - if (segchar <= 1) - return false; - pos += segchar; + if (segchar > 1) + pos += segchar; } + /* mark file? */ + if (name[pos] == '.' && name[pos + 1] != 0) + { + *mark = name[pos + 1]; + pos += 2; + } + else + *mark = SMGR_MARK_NONE; + /* Now we should be at the end. */ if (name[pos] != '\0') return false; diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index b4bca7eed6..580b74839f 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -139,7 +139,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno, BlockNumber blkno, bool skipFsync, int behavior); static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg); - +static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum, + StorageMarks mark); /* * mdinit() -- Initialize private state for magnetic disk storage manager. 
@@ -169,6 +170,81 @@ mdexists(SMgrRelation reln, ForkNumber forkNum) return (mdopenfork(reln, forkNum, EXTENSION_RETURN_NULL) != NULL); } +/* + * mdcreatemark() -- Create a mark file. + * + * If isRedo is true, it's okay for the file to exist already. + */ +void +mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark, + bool isRedo) +{ + char *path = markpath(reln->smgr_rnode, forkNum, mark); + int fd; + + /* See mdcreate for details. */ + TablespaceCreateDbspace(reln->smgr_rnode.node.spcNode, + reln->smgr_rnode.node.dbNode, + isRedo); + + fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL); + if (fd < 0 && (!isRedo || errno != EEXIST)) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not create mark file \"%s\": %m", path))); + + pg_fsync(fd); + close(fd); + + /* + * To guarantee that the creation of the file is persistent, fsync its + * parent directory. + */ + fsync_parent_path(path, ERROR); + + pfree(path); +} + + +/* + * mdunlinkmark() -- Delete the mark file + * + * If isRedo is true, it's okay for the file not to be found. + */ +void +mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark, + bool isRedo) +{ + char *path = markpath(reln->smgr_rnode, forkNum, mark); + + if (!isRedo || mdmarkexists(reln, forkNum, mark)) + durable_unlink(path, ERROR); + + pfree(path); +} + +/* + * mdmarkexists() -- Check if the file exists. + */ +static bool +mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark) +{ + char *path = markpath(reln->smgr_rnode, forkNum, mark); + int fd; + + fd = BasicOpenFile(path, O_RDONLY); + if (fd < 0 && errno != ENOENT) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not access mark file \"%s\": %m", path))); + pfree(path); + + if (fd < 0) + return false; + + return true; +} + /* * mdcreate() -- Create a new relation on magnetic disk.
* @@ -1025,6 +1101,15 @@ register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum, RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ ); } +/* + * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork + */ +void +ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, ForkNumber forknum) +{ + register_forget_request(rnode, forknum, 0); +} + /* * ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB */ @@ -1378,12 +1463,14 @@ mdsyncfiletag(const FileTag *ftag, char *path) * Return 0 on success, -1 on failure, with errno set. */ int -mdunlinkfiletag(const FileTag *ftag, char *path) +mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark) { char *p; /* Compute the path. */ - p = relpathperm(ftag->rnode, MAIN_FORKNUM); + p = GetRelationPath(ftag->rnode.dbNode, ftag->rnode.spcNode, + ftag->rnode.relNode, InvalidBackendId, MAIN_FORKNUM, + mark); strlcpy(path, p, MAXPGPATH); pfree(p); diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c index 0fcef4994b..110e64b0b2 100644 --- a/src/backend/storage/smgr/smgr.c +++ b/src/backend/storage/smgr/smgr.c @@ -62,6 +62,10 @@ typedef struct f_smgr void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks); void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum); + void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); + void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); } f_smgr; static const f_smgr smgrsw[] = { @@ -82,6 +86,8 @@ static const f_smgr smgrsw[] = { .smgr_nblocks = mdnblocks, .smgr_truncate = mdtruncate, .smgr_immedsync = mdimmedsync, + .smgr_createmark = mdcreatemark, + .smgr_unlinkmark = mdunlinkmark, } }; @@ -335,6 +341,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo) smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo); } +/* + * smgrcreatemark() -- Create a mark 
file + */ +void +smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark, + bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo); +} + +/* + * smgrunlinkmark() -- Delete a mark file + */ +void +smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark, + bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo); +} + /* * smgrdosyncall() -- Immediately sync all forks of all given relations * @@ -662,6 +688,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum) smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum); } +void +smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo); +} + /* * AtEOXact_SMgr * diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c index d4083e8a56..9563940d45 100644 --- a/src/backend/storage/sync/sync.c +++ b/src/backend/storage/sync/sync.c @@ -89,7 +89,8 @@ static CycleCtr checkpoint_cycle_ctr = 0; typedef struct SyncOps { int (*sync_syncfiletag) (const FileTag *ftag, char *path); - int (*sync_unlinkfiletag) (const FileTag *ftag, char *path); + int (*sync_unlinkfiletag) (const FileTag *ftag, char *path, + StorageMarks mark); bool (*sync_filetagmatches) (const FileTag *ftag, const FileTag *candidate); } SyncOps; @@ -222,7 +223,8 @@ SyncPostCheckpoint(void) /* Unlink the file */ if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag, - path) < 0) + path, + SMGR_MARK_NONE) < 0) { /* * There's a race condition, when the database is dropped at the @@ -236,6 +238,20 @@ SyncPostCheckpoint(void) (errcode_for_file_access(), errmsg("could not remove file \"%s\": %m", path))); } + else if (syncsw[entry->tag.handler].sync_unlinkfiletag( + &entry->tag, path, + SMGR_MARK_UNCOMMITTED) < 0) + { + /* + * And we may have SMGR_MARK_UNCOMMITTED file. Remove it if the + * fork files has been successfully removed. 
It's ok if the file + * does not exist. + */ + if (errno != ENOENT) + ereport(WARNING, + (errcode_for_file_access(), + errmsg("could not remove file \"%s\": %m", path))); + } /* Mark the list entry as canceled, just in case */ entry->canceled = true; diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c index 436df54120..dbc0da5da5 100644 --- a/src/bin/pg_rewind/parsexlog.c +++ b/src/bin/pg_rewind/parsexlog.c @@ -407,6 +407,30 @@ extractPageInfo(XLogReaderState *record) * source system. */ } + else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK) + { + /* + * We can safely ignore these. When we compare the sizes later on, + * we'll notice that they differ, and copy the missing tail from + * source system. + */ + } + else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK) + { + /* + * We can safely ignore these. When we compare the sizes later on, + * we'll notice that they differ, and copy the missing tail from + * source system. + */ + } + else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE) + { + /* + * We can safely ignore these. When we compare the sizes later on, + * we'll notice that they differ, and copy the missing tail from + * source system. 
+ */ + } else if (rmid == RM_XACT_ID && ((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT || (rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED || diff --git a/src/common/relpath.c b/src/common/relpath.c index 1f5c426ec0..67f24890d6 100644 --- a/src/common/relpath.c +++ b/src/common/relpath.c @@ -139,9 +139,15 @@ GetDatabasePath(Oid dbNode, Oid spcNode) */ char * GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, - int backendId, ForkNumber forkNumber) + int backendId, ForkNumber forkNumber, char mark) { char *path; + char markstr[10]; + + if (mark == 0) + markstr[0] = 0; + else + snprintf(markstr, 10, ".%c", mark); if (spcNode == GLOBALTABLESPACE_OID) { @@ -149,10 +155,10 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, Assert(dbNode == 0); Assert(backendId == InvalidBackendId); if (forkNumber != MAIN_FORKNUM) - path = psprintf("global/%u_%s", - relNode, forkNames[forkNumber]); + path = psprintf("global/%u_%s%s", + relNode, forkNames[forkNumber], markstr); else - path = psprintf("global/%u", relNode); + path = psprintf("global/%u%s", relNode, markstr); } else if (spcNode == DEFAULTTABLESPACE_OID) { @@ -160,22 +166,22 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, if (backendId == InvalidBackendId) { if (forkNumber != MAIN_FORKNUM) - path = psprintf("base/%u/%u_%s", + path = psprintf("base/%u/%u_%s%s", dbNode, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("base/%u/%u", - dbNode, relNode); + path = psprintf("base/%u/%u%s", + dbNode, relNode, markstr); } else { if (forkNumber != MAIN_FORKNUM) - path = psprintf("base/%u/t%d_%u_%s", + path = psprintf("base/%u/t%d_%u_%s%s", dbNode, backendId, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("base/%u/t%d_%u", - dbNode, backendId, relNode); + path = psprintf("base/%u/t%d_%u%s", + dbNode, backendId, relNode, markstr); } } else @@ -184,27 +190,28 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, if 
(backendId == InvalidBackendId) { if (forkNumber != MAIN_FORKNUM) - path = psprintf("pg_tblspc/%u/%s/%u/%u_%s", + path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s", spcNode, TABLESPACE_VERSION_DIRECTORY, dbNode, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("pg_tblspc/%u/%s/%u/%u", + path = psprintf("pg_tblspc/%u/%s/%u/%u%s", spcNode, TABLESPACE_VERSION_DIRECTORY, - dbNode, relNode); + dbNode, relNode, markstr); } else { if (forkNumber != MAIN_FORKNUM) - path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s", + path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s", spcNode, TABLESPACE_VERSION_DIRECTORY, dbNode, backendId, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u", + path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s", spcNode, TABLESPACE_VERSION_DIRECTORY, - dbNode, backendId, relNode); + dbNode, backendId, relNode, markstr); } } + return path; } diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h index 0ab32b44e9..382623159c 100644 --- a/src/include/catalog/storage.h +++ b/src/include/catalog/storage.h @@ -23,6 +23,8 @@ extern int wal_skip_threshold; extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence); +extern void RelationCreateInitFork(Relation rel); +extern void RelationDropInitFork(Relation rel); extern void RelationDropStorage(Relation rel); extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit); extern void RelationPreTruncate(Relation rel); diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h index f0814f1458..12346ed7f6 100644 --- a/src/include/catalog/storage_xlog.h +++ b/src/include/catalog/storage_xlog.h @@ -18,17 +18,23 @@ #include "lib/stringinfo.h" #include "storage/block.h" #include "storage/relfilenode.h" +#include "storage/smgr.h" /* * Declarations for smgr-related XLOG records * - * Note: we log file creation and truncation here, but logging of 
deletion - * actions is handled by xact.c, because it is part of transaction commit. + * Note: we log file creation, truncation and buffer persistence change here, + * but logging of deletion actions is handled mainly by xact.c, because it is + * part of transaction commit in most cases. However, there's a case where + * init forks are deleted outside control of transaction. */ /* XLOG gives us high 4 bits */ #define XLOG_SMGR_CREATE 0x10 #define XLOG_SMGR_TRUNCATE 0x20 +#define XLOG_SMGR_UNLINK 0x30 +#define XLOG_SMGR_MARK 0x40 +#define XLOG_SMGR_BUFPERSISTENCE 0x50 typedef struct xl_smgr_create { @@ -36,6 +42,32 @@ typedef struct xl_smgr_create ForkNumber forkNum; } xl_smgr_create; +typedef struct xl_smgr_unlink +{ + RelFileNode rnode; + ForkNumber forkNum; +} xl_smgr_unlink; + +typedef enum smgr_mark_action +{ + XLOG_SMGR_MARK_CREATE = 'c', + XLOG_SMGR_MARK_UNLINK = 'u' +} smgr_mark_action; + +typedef struct xl_smgr_mark +{ + RelFileNode rnode; + ForkNumber forkNum; + StorageMarks mark; + smgr_mark_action action; +} xl_smgr_mark; + +typedef struct xl_smgr_bufpersistence +{ + RelFileNode rnode; + bool persistence; +} xl_smgr_bufpersistence; + /* flags for xl_smgr_truncate */ #define SMGR_TRUNCATE_HEAP 0x0001 #define SMGR_TRUNCATE_VM 0x0002 @@ -51,6 +83,12 @@ typedef struct xl_smgr_truncate } xl_smgr_truncate; extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum); +extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum); +extern void log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum, + StorageMarks mark); +extern void log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum, + StorageMarks mark); +extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence); extern void smgr_redo(XLogReaderState *record); extern void smgr_desc(StringInfo buf, XLogReaderState *record); diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h index a44be11ca0..106a5cf508 100644 --- 
a/src/include/common/relpath.h +++ b/src/include/common/relpath.h @@ -67,7 +67,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork); extern char *GetDatabasePath(Oid dbNode, Oid spcNode); extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, - int backendId, ForkNumber forkNumber); + int backendId, ForkNumber forkNumber, char mark); /* * Wrapper macros for GetRelationPath. Beware of multiple @@ -77,7 +77,7 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, /* First argument is a RelFileNode */ #define relpathbackend(rnode, backend, forknum) \ GetRelationPath((rnode).dbNode, (rnode).spcNode, (rnode).relNode, \ - backend, forknum) + backend, forknum, 0) /* First argument is a RelFileNode */ #define relpathperm(rnode, forknum) \ @@ -87,4 +87,9 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, #define relpath(rnode, forknum) \ relpathbackend((rnode).node, (rnode).backend, forknum) +/* First argument is a RelFileNodeBackend */ +#define markpath(rnode, forknum, mark) \ + GetRelationPath((rnode).node.dbNode, (rnode).node.spcNode, \ + (rnode).node.relNode, \ + (rnode).backend, forknum, mark) #endif /* RELPATH_H */ diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h index cfce23ecbc..f5a7df87a4 100644 --- a/src/include/storage/bufmgr.h +++ b/src/include/storage/bufmgr.h @@ -206,6 +206,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels) extern void FlushDatabaseBuffers(Oid dbid); extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum, int nforks, BlockNumber *firstDelBlock); +extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel, + bool permanent, bool isRedo); extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes); extern void DropDatabaseBuffers(Oid dbid); diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h index 34602ae006..2dc0357ad5 100644 --- 
a/src/include/storage/fd.h +++ b/src/include/storage/fd.h @@ -185,6 +185,7 @@ extern ssize_t pg_pwritev_with_retry(int fd, extern int pg_truncate(const char *path, off_t length); extern void fsync_fname(const char *fname, bool isdir); extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel); +extern int fsync_parent_path(const char *fname, int elevel); extern int durable_rename(const char *oldfile, const char *newfile, int loglevel); extern int durable_unlink(const char *fname, int loglevel); extern int durable_rename_excl(const char *oldfile, const char *newfile, int loglevel); diff --git a/src/include/storage/md.h b/src/include/storage/md.h index 752b440864..99620816b5 100644 --- a/src/include/storage/md.h +++ b/src/include/storage/md.h @@ -23,6 +23,10 @@ extern void mdinit(void); extern void mdopen(SMgrRelation reln); extern void mdclose(SMgrRelation reln, ForkNumber forknum); +extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); +extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo); extern bool mdexists(SMgrRelation reln, ForkNumber forknum); extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo); @@ -41,12 +45,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks); extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum); +extern void ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, + ForkNumber forknum); extern void ForgetDatabaseSyncRequests(Oid dbid); extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo); /* md sync callbacks */ extern int mdsyncfiletag(const FileTag *ftag, char *path); -extern int mdunlinkfiletag(const FileTag *ftag, char *path); +extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark); extern bool mdfiletagmatches(const FileTag 
*ftag, const FileTag *candidate); #endif /* MD_H */ diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h index fad1e5c473..e1f97e9b89 100644 --- a/src/include/storage/reinit.h +++ b/src/include/storage/reinit.h @@ -16,13 +16,15 @@ #define REINIT_H #include "common/relpath.h" - +#include "storage/smgr.h" extern void ResetUnloggedRelations(int op); -extern bool parse_filename_for_nontemp_relation(const char *name, - int *oidchars, ForkNumber *fork); +extern bool parse_filename_for_nontemp_relation(const char *name, int *oidchars, + ForkNumber *fork, + StorageMarks *mark); #define UNLOGGED_RELATION_CLEANUP 0x0001 -#define UNLOGGED_RELATION_INIT 0x0002 +#define UNLOGGED_RELATION_DROP_BUFFER 0x0002 +#define UNLOGGED_RELATION_INIT 0x0004 #endif /* REINIT_H */ diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h index a6fbf7b6a6..201ecace8a 100644 --- a/src/include/storage/smgr.h +++ b/src/include/storage/smgr.h @@ -18,6 +18,18 @@ #include "storage/block.h" #include "storage/relfilenode.h" +/* + * Storage marks is a file of which existence suggests something about a + * file. The name of such files is "<filename>.<mark>", where the mark is one + * of the values of StorageMarks. Since ".<digit>" means segment files so don't + * use digits for the mark character. + */ +typedef enum StorageMarks +{ + SMGR_MARK_NONE = 0, + SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */ +} StorageMarks; + /* * smgr.c maintains a table of SMgrRelation objects, which are essentially * cached file handles. 
An SMgrRelation is created (if not already present) @@ -85,7 +97,12 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln); extern void smgrclose(SMgrRelation reln); extern void smgrcloseall(void); extern void smgrclosenode(RelFileNodeBackend rnode); +extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); +extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo); +extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo); extern void smgrdosyncall(SMgrRelation *rels, int nrels); extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo); extern void smgrextend(SMgrRelation reln, ForkNumber forknum, -- 2.27.0 From 951e264c26bbb0523a872268fb28981227dda041 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Wed, 11 Nov 2020 23:21:09 +0900 Subject: [PATCH v9 2/2] New command ALTER TABLE ALL IN TABLESPACE SET LOGGED/UNLOGGED To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes relation persistence of all tables in the specified tablespace. --- src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++++++++++ src/backend/nodes/copyfuncs.c | 16 ++++ src/backend/nodes/equalfuncs.c | 15 ++++ src/backend/parser/gram.y | 20 +++++ src/backend/tcop/utility.c | 11 +++ src/include/commands/tablecmds.h | 2 + src/include/nodes/nodes.h | 1 + src/include/nodes/parsenodes.h | 9 ++ 8 files changed, 214 insertions(+) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 0d9c801535..7c18ed9e75 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -14499,6 +14499,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt) return new_tablespaceoid; } +/* + * Alter Table ALL ... 
SET LOGGED/UNLOGGED + * + * Allows a user to change persistence of all objects in a given tablespace in + * the current database. Objects can be chosen based on the owner of the + * object also, to allow users to change persistene only their objects. The + * main permissions handling is done by the lower-level change persistence + * function. + * + * All to-be-modified objects are locked first. If NOWAIT is specified and the + * lock can't be acquired then we ereport(ERROR). + */ +void +AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt) +{ + List *relations = NIL; + ListCell *l; + ScanKeyData key[1]; + Relation rel; + TableScanDesc scan; + HeapTuple tuple; + Oid tablespaceoid; + List *role_oids = roleSpecsToIds(NIL); + + /* Ensure we were not asked to change something we can't */ + if (stmt->objtype != OBJECT_TABLE) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("only tables can be specified"))); + + /* Get the tablespace OID */ + tablespaceoid = get_tablespace_oid(stmt->tablespacename, false); + + /* + * Now that the checks are done, check if we should set either to + * InvalidOid because it is our database's default tablespace. + */ + if (tablespaceoid == MyDatabaseTableSpace) + tablespaceoid = InvalidOid; + + /* + * Walk the list of objects in the tablespace to pick up them. This will + * only find objects in our database, of course. + */ + ScanKeyInit(&key[0], + Anum_pg_class_reltablespace, + BTEqualStrategyNumber, F_OIDEQ, + ObjectIdGetDatum(tablespaceoid)); + + rel = table_open(RelationRelationId, AccessShareLock); + scan = table_beginscan_catalog(rel, 1, key); + while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL) + { + Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple); + Oid relOid = relForm->oid; + + /* + * Do not pick-up objects in pg_catalog as part of this, if an admin + * really wishes to do so, they can issue the individual ALTER + * commands directly. 
+ * + * Also, explicitly avoid any shared tables, temp tables, or TOAST + * (TOAST will be changed with the main table). + */ + if (IsCatalogNamespace(relForm->relnamespace) || + relForm->relisshared || + isAnyTempNamespace(relForm->relnamespace) || + IsToastNamespace(relForm->relnamespace)) + continue; + + /* Only pick up the object type requested */ + if (relForm->relkind != RELKIND_RELATION) + continue; + + /* Check if we are only picking-up objects owned by certain roles */ + if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner)) + continue; + + /* + * Handle permissions-checking here since we are locking the tables + * and also to avoid doing a bunch of work only to fail part-way. Note + * that permissions will also be checked by AlterTableInternal(). + * + * Caller must be considered an owner on the table of which we're going + * to change persistence. + */ + if (!pg_class_ownercheck(relOid, GetUserId())) + aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)), + NameStr(relForm->relname)); + + if (stmt->nowait && + !ConditionalLockRelationOid(relOid, AccessExclusiveLock)) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_IN_USE), + errmsg("aborting because lock on relation \"%s.%s\" is not available", + get_namespace_name(relForm->relnamespace), + NameStr(relForm->relname)))); + else + LockRelationOid(relOid, AccessExclusiveLock); + + /* + * Add to our list of objects of which we're going to change + * persistence. + */ + relations = lappend_oid(relations, relOid); + } + + table_endscan(scan); + table_close(rel, AccessShareLock); + + if (relations == NIL) + ereport(NOTICE, + (errcode(ERRCODE_NO_DATA_FOUND), + errmsg("no matching relations in tablespace \"%s\" found", + tablespaceoid == InvalidOid ? "(database default)" : + get_tablespace_name(tablespaceoid)))); + + /* + * Everything is locked, loop through and change persistence of all of the + * relations. 
+ */ + foreach(l, relations) + { + List *cmds = NIL; + AlterTableCmd *cmd = makeNode(AlterTableCmd); + + if (stmt->logged) + cmd->subtype = AT_SetLogged; + else + cmd->subtype = AT_SetUnLogged; + + cmds = lappend(cmds, cmd); + + EventTriggerAlterTableStart((Node *) stmt); + /* OID is set by AlterTableInternal */ + AlterTableInternal(lfirst_oid(l), cmds, false); + EventTriggerAlterTableEnd(); + } +} + static void index_copy_data(Relation rel, RelFileNode newrnode) { diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c index df0b747883..55e38cfe3f 100644 --- a/src/backend/nodes/copyfuncs.c +++ b/src/backend/nodes/copyfuncs.c @@ -4269,6 +4269,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from) return newnode; } +static AlterTableSetLoggedAllStmt * +_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from) +{ + AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt); + + COPY_STRING_FIELD(tablespacename); + COPY_SCALAR_FIELD(objtype); + COPY_SCALAR_FIELD(logged); + COPY_SCALAR_FIELD(nowait); + + return newnode; +} + static CreateExtensionStmt * _copyCreateExtensionStmt(const CreateExtensionStmt *from) { @@ -5622,6 +5635,9 @@ copyObjectImpl(const void *from) case T_AlterTableMoveAllStmt: retval = _copyAlterTableMoveAllStmt(from); break; + case T_AlterTableSetLoggedAllStmt: + retval = _copyAlterTableSetLoggedAllStmt(from); + break; case T_CreateExtensionStmt: retval = _copyCreateExtensionStmt(from); break; diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c index cb7ddd463c..a19b7874d7 100644 --- a/src/backend/nodes/equalfuncs.c +++ b/src/backend/nodes/equalfuncs.c @@ -1916,6 +1916,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a, return true; } +static bool +_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a, + const AlterTableSetLoggedAllStmt *b) +{ + COMPARE_STRING_FIELD(tablespacename); + COMPARE_SCALAR_FIELD(objtype); + 
COMPARE_SCALAR_FIELD(logged); + COMPARE_SCALAR_FIELD(nowait); + + return true; +} + static bool _equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b) { @@ -3625,6 +3637,9 @@ equal(const void *a, const void *b) case T_AlterTableMoveAllStmt: retval = _equalAlterTableMoveAllStmt(a, b); break; + case T_AlterTableSetLoggedAllStmt: + retval = _equalAlterTableSetLoggedAllStmt(a, b); + break; case T_CreateExtensionStmt: retval = _equalCreateExtensionStmt(a, b); break; diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index 3d4dd43e47..9823d57a54 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -1984,6 +1984,26 @@ AlterTableStmt: n->nowait = $13; $$ = (Node *)n; } + | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->logged = true; + n->nowait = $9; + $$ = (Node *)n; + } + | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->logged = false; + n->nowait = $9; + $$ = (Node *)n; + } | ALTER INDEX qualified_name alter_table_cmds { AlterTableStmt *n = makeNode(AlterTableStmt); diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c index 1fbc387d47..1483f9a475 100644 --- a/src/backend/tcop/utility.c +++ b/src/backend/tcop/utility.c @@ -162,6 +162,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree) case T_AlterTSConfigurationStmt: case T_AlterTSDictionaryStmt: case T_AlterTableMoveAllStmt: + case T_AlterTableSetLoggedAllStmt: case T_AlterTableSpaceOptionsStmt: case T_AlterTableStmt: case T_AlterTypeStmt: @@ -1747,6 +1748,12 @@ ProcessUtilitySlow(ParseState *pstate, commandCollected = true; break; + case T_AlterTableSetLoggedAllStmt: + AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree); + 
/* commands are stashed in AlterTableSetLoggedAll */ + commandCollected = true; + break; + case T_DropStmt: ExecDropStmt((DropStmt *) parsetree, isTopLevel); /* no commands stashed for DROP */ @@ -2669,6 +2676,10 @@ CreateCommandTag(Node *parsetree) tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype); break; + case T_AlterTableSetLoggedAllStmt: + tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype); + break; + case T_AlterTableStmt: tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype); break; diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h index 336549cc5f..714077ff4c 100644 --- a/src/include/commands/tablecmds.h +++ b/src/include/commands/tablecmds.h @@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse); extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt); +extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt); + extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt, Oid *oldschema); diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h index 7c657c1241..8860b2e548 100644 --- a/src/include/nodes/nodes.h +++ b/src/include/nodes/nodes.h @@ -428,6 +428,7 @@ typedef enum NodeTag T_AlterCollationStmt, T_CallStmt, T_AlterStatsStmt, + T_AlterTableSetLoggedAllStmt, /* * TAGS FOR PARSE TREE NODES (parsenodes.h) diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h index 4c5a8a39bf..c3e1bc66d1 100644 --- a/src/include/nodes/parsenodes.h +++ b/src/include/nodes/parsenodes.h @@ -2350,6 +2350,15 @@ typedef struct AlterTableMoveAllStmt bool nowait; } AlterTableMoveAllStmt; +typedef struct AlterTableSetLoggedAllStmt +{ + NodeTag type; + char *tablespacename; + ObjectType objtype; /* Object type to move */ + bool logged; + bool nowait; +} AlterTableSetLoggedAllStmt; + /* ---------------------- * Create/Alter Extension Statements * ---------------------- -- 2.27.0
Hi Kyotaro, I'm glad you are still into this

> I didn't register for some reasons.

Right now in v8 there's a typo in ./src/backend/catalog/storage.c :

storage.c: In function 'RelationDropInitFork':
storage.c:385:44: error: expected statement before ')' token
pending->unlink_forknum != INIT_FORKNUM)) <-- here, one ) too much

> 1. I'm not sure that we want to have the new mark files.

I can't help with such a design decision, but if there are doubts, maybe add return-code checks around:
a) pg_fsync() and fsync_parent_path() (??) inside mdcreatemark()
b) mdunlinkmark() inside mdunlinkmark()
and PANIC if something goes wrong?

> 2. Aside of possible bugs, I'm not confident that the crash-safety of
> this patch is actually water-tight. At least we need tests for some
> failure cases.
>
> 3. As mentioned in transam/README, failure in removing smgr mark files
> leads to immediate shut down. I'm not sure this behavior is acceptable.

Doesn't it happen for most of the stuff already? There's even the data_sync_retry GUC.

> 4. Including the reasons above, this is not fully functional.
> For example, if we execute the following commands on primary,
> replica doesn't work correctly. (boom!)
>
> =# CREATE UNLOGGED TABLE t (a int);
> =# ALTER TABLE t SET LOGGED;
>
>
> The following fixes are done in the attached v8.
>
> - Rebased. Referring to Jakub and Justin's work, I replaced direct
> access to ->rd_smgr with RelationGetSmgr() and removed calls to
> RelationOpenSmgr(). I still separate the "ALTER TABLE ALL IN
> TABLESPACE SET LOGGED/UNLOGGED" statement part.
>
> - Fixed RelationCreate/DropInitFork's behavior for non-target
> relations. (From Jakub's work).
>
> - Fixed wording of some comments.
>
> - As revisited, I found a bug around recovery. If the logged-ness of a
> relation gets flipped repeatedly in a transaction, duplicate
> pending-delete entries are accumulated during recovery and work in a
> wrong way. smgr_redo now adds up to one entry for an action.
>
> - The issue 4 above is not fixed (yet).

Thanks again. If you have a list of crash-test ideas, maybe I'll have some minutes to try them out. Is there any go-to list of things to check to add confidence in this patch (as per point #2)?

BTW fast feedback regarding that ALTER patch (there were 4 unlogged tables):
# ALTER TABLE ALL IN TABLESPACE tbs1 set logged;
WARNING: unrecognized node type: 349

-J.
At Mon, 20 Dec 2021 07:59:29 +0000, Jakub Wartak <Jakub.Wartak@tomtom.com> wrote in
> Hi Kyotaro, I'm glad you are still into this
>
> > I didn't register for some reasons.
>
> Right now in v8 there's a typo in ./src/backend/catalog/storage.c :
>
> storage.c: In function 'RelationDropInitFork':
> storage.c:385:44: error: expected statement before ')' token
> pending->unlink_forknum != INIT_FORKNUM)) <-- here, one ) too much

Yeah, I thought that I had removed it. I believe the v9 patch is correct.

> > 1. I'm not sure that we want to have the new mark files.
>
> I can't help with such design decision, but if there are doubts maybe then add checking return codes around:
> a) pg_fsync() and fsync_parent_path() (??) inside mdcreatemark()
> b) mdunlinkmark() inside mdunlinkmark()
> and PANIC if something goes wrong?

The point is whether it is worth the complexity it adds. Since the mark
files can resolve another existing issue (which I don't recall in
detail) and this patchset actually fixes it, they have a certain
persuasiveness. But that doesn't change the fact that it's additional
complexity.

> > 2. Aside of possible bugs, I'm not confident that the crash-safety of
> > this patch is actually water-tight. At least we need tests for some
> > failure cases.
> >
> > 3. As mentioned in transam/README, failure in removing smgr mark files
> > leads to immediate shut down. I'm not sure this behavior is acceptable.
>
> Doesn't it happen for most of the stuff already? There's even data_sync_retry GUC.

Hmm. Yes, actually it is "as water-tight as possible". I just want
others' eyes on that perspective. CF could be the entry point for
others, but I'm a bit hesitant to add a new entry..

> > 4. Including the reasons above, this is not fully functional.
> > For example, if we execute the following commands on primary,
> > replica doesn't work correctly. (boom!)
> >
> > =# CREATE UNLOGGED TABLE t (a int);
> > =# ALTER TABLE t SET LOGGED;
> >
> >
> > The following fixes are done in the attached v8.
> >
> > - Rebased. Referring to Jakub and Justin's work, I replaced direct
> > access to ->rd_smgr with RelationGetSmgr() and removed calls to
> > RelationOpenSmgr(). I still separate the "ALTER TABLE ALL IN
> > TABLESPACE SET LOGGED/UNLOGGED" statement part.
> >
> > - Fixed RelationCreate/DropInitFork's behavior for non-target
> > relations. (From Jakub's work).
> >
> > - Fixed wording of some comments.
> >
> > - As revisited, I found a bug around recovery. If the logged-ness of a
> > relation gets flipped repeatedly in a transaction, duplicate
> > pending-delete entries are accumulated during recovery and work in a
> > wrong way. smgr_redo now adds up to one entry for an action.
> >
> > - The issue 4 above is not fixed (yet).
>
> Thanks again, If you have any list of crash tests ideas maybe I'll have some minutes
> to try to figure them out. Is there is any goto list of stuff to be checked to add confidence
> to this patch (as per point #2) ?

Just causing a crash (kill -9) after executing a problem-prone command
sequence, then checking that recovery works well, would be sufficient.

For example:

"create unlogged table; begin; insert ..; alter table set logged; <crash>": recovery works.

"create logged; begin; {alter unlogged; alter logged;} * 1000; alter logged; commit/abort": doesn't pollute pgdata.

> BTW fast feedback regarding that ALTER patch (there were 4 unlogged tables):
> # ALTER TABLE ALL IN TABLESPACE tbs1 set logged;
> WARNING: unrecognized node type: 349

lol I met a server crash. Will fix. Thanks!

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
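For readers following point 1 above: per the smgr.h comment in the v9 patch, a mark file is named "<filename>.<mark>", and the mark character must not be a digit because ".<digit>" already denotes a segment file. The following is only a minimal sketch of that naming rule. The helper names mark_file_name() and parse_mark_suffix() are hypothetical and do not come from the patch; only the StorageMarks values are taken from it.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Same values as the StorageMarks enum added to smgr.h by the patch. */
typedef enum StorageMarks
{
    SMGR_MARK_NONE = 0,
    SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */
} StorageMarks;

/*
 * Hypothetical helper (not in the patch): write "<relfile>.<mark>" into buf.
 * Returns what snprintf() returns.
 */
int
mark_file_name(char *buf, size_t buflen, const char *relfile, StorageMarks mark)
{
    assert(mark != SMGR_MARK_NONE);
    return snprintf(buf, buflen, "%s.%c", relfile, (char) mark);
}

/*
 * Hypothetical helper (not in the patch): classify the suffix that follows
 * the last '.' in a file name.  A single digit means a segment file, so it
 * is never reported as a mark; only the known mark character is accepted.
 */
StorageMarks
parse_mark_suffix(const char *name)
{
    const char *dot = strrchr(name, '.');

    /* no suffix, empty suffix, or suffix longer than one character */
    if (dot == NULL || dot[1] == '\0' || dot[2] != '\0')
        return SMGR_MARK_NONE;

    /* ".<digit>" is a segment number, not a mark */
    if (isdigit((unsigned char) dot[1]))
        return SMGR_MARK_NONE;

    if (dot[1] == (char) SMGR_MARK_UNCOMMITTED)
        return SMGR_MARK_UNCOMMITTED;

    return SMGR_MARK_NONE;
}
```

With these assumptions, "16384.u" would be recognized as an uncommitted-file mark, while "16384.1" (a segment) and "16384" (the relation file itself) would not.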
At Mon, 20 Dec 2021 17:39:27 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> At Mon, 20 Dec 2021 07:59:29 +0000, Jakub Wartak <Jakub.Wartak@tomtom.com> wrote in
> > BTW fast feedback regarding that ALTER patch (there were 4 unlogged tables):
> > # ALTER TABLE ALL IN TABLESPACE tbs1 set logged;
> > WARNING: unrecognized node type: 349
>
> lol I met a server crash. Will fix. Thanks!

That crash vanished after a recompilation for me and I don't see that
error. On my dev env node# 349 is T_AlterTableSetLoggedAllStmt, which
0002 adds. So perhaps make clean/make all would fix that.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi Kyotaro,

> At Mon, 20 Dec 2021 17:39:27 +0900 (JST), Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote in
> > At Mon, 20 Dec 2021 07:59:29 +0000, Jakub Wartak
> > <Jakub.Wartak@tomtom.com> wrote in
> > > BTW fast feedback regarding that ALTER patch (there were 4 unlogged
> > > tables):
> > > # ALTER TABLE ALL IN TABLESPACE tbs1 set logged;
> > > WARNING: unrecognized node type: 349
> >
> > lol I met a server crash. Will fix. Thanks!
>
> That crash vanished after a recompilation for me and I don't see that error. On
> my dev env node# 349 is T_AlterTableSetLoggedAllStmt, which
> 0002 adds. So perhaps make clean/make all would fix that.

As fast as I could, I've repeated the whole cycle with a fresh v9 (make clean, configure, make install, fresh initdb) and found two problems:

1) check-world seems OK, but make -C src/test/recovery check shows a couple of failing tests here locally and in https://cirrus-ci.com/task/4699985735319552?logs=test#L807:

t/009_twophase.pl (Wstat: 256 Tests: 24 Failed: 1)
  Failed test: 21
  Non-zero exit status: 1
t/014_unlogged_reinit.pl (Wstat: 512 Tests: 12 Failed: 2)
  Failed tests: 9-10
  Non-zero exit status: 2
t/018_wal_optimize.pl (Wstat: 7424 Tests: 0 Failed: 0)
  Non-zero exit status: 29
  Parse errors: Bad plan. You planned 38 tests but ran 0.
t/022_crash_temp_files.pl (Wstat: 7424 Tests: 6 Failed: 0)
  Non-zero exit status: 29
  Parse errors: Bad plan. You planned 9 tests but ran 6.

018 made no sense. I've tried to take a quick look, with wal_level=minimal, at why it is failing. It is a mystery to me, as the sequence seems to be pretty basic but the outcome is not:

~> cat repro.sql
create tablespace tbs1 location '/tbs1';
CREATE TABLE moved (id int);
INSERT INTO moved VALUES (1);
BEGIN;
ALTER TABLE moved SET TABLESPACE tbs1;
CREATE TABLE originated (id int);
INSERT INTO originated VALUES (1);
CREATE UNIQUE INDEX ON originated(id) TABLESPACE tbs1;
COMMIT;

~> psql -f repro.sql z3; sleep 1; /usr/pgsql-15/bin/pg_ctl -D /var/lib/pgsql/15/data -l logfile -m immediate stop
CREATE TABLESPACE
CREATE TABLE
INSERT 0 1
BEGIN
ALTER TABLE
CREATE TABLE
INSERT 0 1
CREATE INDEX
COMMIT
waiting for server to shut down.... done
server stopped
~> /usr/pgsql-15/bin/pg_ctl -D /var/lib/pgsql/15/data -l logfile start
waiting for server to start.... done
server started
z3# select * from moved;
ERROR: could not open file "pg_tblspc/32834/PG_15_202112131/32833/32838": No such file or directory
z3=# select * from originated;
ERROR: could not open file "base/32833/32839": No such file or directory
z3=# \dt+
                                   List of relations
 Schema |    Name    | Type  |  Owner   | Persistence |  Size   | Description
--------+------------+-------+----------+-------------+---------+-------------
 public | moved      | table | postgres | permanent   | 0 bytes |
 public | originated | table | postgres | permanent   | 0 bytes |

This happens even without placing anything on a tablespace at all (for the originated table, but not for the moved one). Either there is some major mishap (commit should guarantee correctness) or I'm tired and have sloppy fingers.

2) A minor test case; still, something is odd.

drop tablespace tbs1;
create tablespace tbs1 location '/tbs1';
CREATE UNLOGGED TABLE t4 (a int) tablespace tbs1;
CREATE UNLOGGED TABLE t5 (a int) tablespace tbs1;
CREATE UNLOGGED TABLE t6 (a int) tablespace tbs1;
CREATE TABLE t7 (a int) tablespace tbs1;
insert into t7 values (1);
insert into t5 values (1);
insert into t6 values (1);
\dt+
                                  List of relations
 Schema | Name | Type  |  Owner   | Persistence |    Size    | Description
--------+------+-------+----------+-------------+------------+-------------
 public | t4   | table | postgres | unlogged    | 0 bytes    |
 public | t5   | table | postgres | unlogged    | 8192 bytes |
 public | t6   | table | postgres | unlogged    | 8192 bytes |
 public | t7   | table | postgres | permanent   | 8192 bytes |
(4 rows)

ALTER TABLE ALL IN TABLESPACE tbs1 set logged;
==> STILL WARNING: unrecognized node type: 349
\dt+
                                  List of relations
 Schema | Name | Type  |  Owner   | Persistence |    Size    | Description
--------+------+-------+----------+-------------+------------+-------------
 public | t4   | table | postgres | permanent   | 0 bytes    |
 public | t5   | table | postgres | permanent   | 8192 bytes |
 public | t6   | table | postgres | permanent   | 8192 bytes |
 public | t7   | table | postgres | permanent   | 8192 bytes |

So it did rewrite, however this warning seems to be unfixed. I've tested on e2c52beecdea152ca680a22ef35c6a7da55aa30f.

-J.
Ugh! I completely forgot about TAP tests.. Thanks for the testing and
sorry for the bugs.

This is a rather big change, so I need a bit of time before posting the
next version.

At Mon, 20 Dec 2021 13:38:35 +0000, Jakub Wartak <Jakub.Wartak@tomtom.com> wrote in
> 1) check-world seems OK but make -C src/test/recovery check shows a couple of failing tests here locally and in https://cirrus-ci.com/task/4699985735319552?logs=test#L807:
> t/009_twophase.pl (Wstat: 256 Tests: 24 Failed: 1)
>   Failed test: 21
>   Non-zero exit status: 1

PREPARE TRANSACTION requires uncommitted file creation to be
committed. Concretely, we need to remove the "mark" files for relation
files created in the transaction during PREPARE TRANSACTION.

pendingSync is not a mechanism parallel to pendingDeletes, so we cannot
move mark deletion to pendingSync. In the end I decided to add a
separate list, pendingCleanups, for pending non-deletion tasks, kept
apart from pendingDeletes, and to execute it before inserting the
commit record. Not only the above but also all of the following
failures vanished with the change.

> t/014_unlogged_reinit.pl (Wstat: 512 Tests: 12 Failed: 2)
>   Failed tests: 9-10
>   Non-zero exit status: 2
> t/018_wal_optimize.pl (Wstat: 7424 Tests: 0 Failed: 0)
>   Non-zero exit status: 29
>   Parse errors: Bad plan. You planned 38 tests but ran 0.
> t/022_crash_temp_files.pl (Wstat: 7424 Tests: 6 Failed: 0)
>   Non-zero exit status: 29
>   Parse errors: Bad plan. You planned 9 tests but ran 6.

> 018 made no sense, I've tried to take a quick look with wal_level=minimal why it is failing , it is mystery to me as the sequence seems to be pretty basic but the outcome is not:

I think this shares the same cause.

> ~> cat repro.sql
> create tablespace tbs1 location '/tbs1';
> CREATE TABLE moved (id int);
> INSERT INTO moved VALUES (1);
> BEGIN;
> ALTER TABLE moved SET TABLESPACE tbs1;
> CREATE TABLE originated (id int);
> INSERT INTO originated VALUES (1);
> CREATE UNIQUE INDEX ON originated(id) TABLESPACE tbs1;
> COMMIT;
..
> ERROR: could not open file "base/32833/32839": No such file or directory
> z3=# \dt+
>                                    List of relations
>  Schema |    Name    | Type  |  Owner   | Persistence |  Size   | Description
> --------+------------+-------+----------+-------------+---------+-------------
>  public | moved      | table | postgres | permanent   | 0 bytes |
>  public | originated | table | postgres | permanent   | 0 bytes |
>
> This happens even without placing on tablespace at all {for originated table , but no for moved on}, some major mishap is there (commit should guarantee correctness) or I'm tired and having sloppy fingers.
>
> 2) minor one testcase, still something is odd.
>
> drop tablespace tbs1;
> create tablespace tbs1 location '/tbs1';
> CREATE UNLOGGED TABLE t4 (a int) tablespace tbs1;
> CREATE UNLOGGED TABLE t5 (a int) tablespace tbs1;
> CREATE UNLOGGED TABLE t6 (a int) tablespace tbs1;
> CREATE TABLE t7 (a int) tablespace tbs1;
> insert into t7 values (1);
> insert into t5 values (1);
> insert into t6 values (1);
> \dt+
>                                   List of relations
>  Schema | Name | Type  |  Owner   | Persistence |    Size    | Description
> --------+------+-------+----------+-------------+------------+-------------
>  public | t4   | table | postgres | unlogged    | 0 bytes    |
>  public | t5   | table | postgres | unlogged    | 8192 bytes |
>  public | t6   | table | postgres | unlogged    | 8192 bytes |
>  public | t7   | table | postgres | permanent   | 8192 bytes |
> (4 rows)
>
> ALTER TABLE ALL IN TABLESPACE tbs1 set logged;
> ==> STILL WARNING: unrecognized node type: 349
> \dt+
>                                   List of relations
>  Schema | Name | Type  |  Owner   | Persistence |    Size    | Description
> --------+------+-------+----------+-------------+------------+-------------
>  public | t4   | table | postgres | permanent   | 0 bytes    |
>  public | t5   | table | postgres | permanent   | 8192 bytes |
>  public | t6   | table | postgres | permanent   | 8192 bytes |
>  public | t7   | table | postgres | permanent   | 8192 bytes |
>
> So it did rewrite however this warning seems to be unfixed. I've tested on e2c52beecdea152ca680a22ef35c6a7da55aa30f.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
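As a rough illustration of the pendingCleanups idea described in the reply above (a list of non-deletion tasks, separate from pendingDeletes, run at transaction end), here is a toy model. Everything in it is an assumption made for illustration: the struct layout, the at_commit flag (borrowed from how pending deletes are usually described), and the function names add_pending_cleanup() and do_pending_cleanups() are hypothetical and are not the patch's actual code. It only prints what would be removed instead of touching the file system.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical, simplified entry; not the patch's data structure. */
typedef struct PendingCleanup
{
    const char *mark_path;          /* mark file this task would remove */
    bool        at_commit;          /* run at commit (true) or at abort (false) */
    struct PendingCleanup *next;
} PendingCleanup;

/* Head of the toy pending-cleanup list. */
PendingCleanup *pending_cleanups = NULL;

/* Register a cleanup task to run at the end of the transaction. */
void
add_pending_cleanup(const char *mark_path, bool at_commit)
{
    PendingCleanup *p = malloc(sizeof(PendingCleanup));

    if (p == NULL)
        abort();
    p->mark_path = mark_path;
    p->at_commit = at_commit;
    p->next = pending_cleanups;
    pending_cleanups = p;
}

/*
 * Run the tasks that match the transaction outcome and discard the rest.
 * The list is always emptied.  Returns the number of tasks executed.
 */
int
do_pending_cleanups(bool is_commit)
{
    int             ndone = 0;
    PendingCleanup *p = pending_cleanups;

    pending_cleanups = NULL;
    while (p != NULL)
    {
        PendingCleanup *next = p->next;

        if (p->at_commit == is_commit)
        {
            /* a real implementation would unlink the mark file here */
            printf("remove mark file %s\n", p->mark_path);
            ndone++;
        }
        free(p);
        p = next;
    }
    return ndone;
}
```

In this toy, a commit or PREPARE path would call do_pending_cleanups(true) before writing its commit record, and an abort path would call do_pending_cleanups(false), mirroring the smgrDoPendingCleanups() call sites that appear in the xact.c hunk of the patch.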
At Tue, 4 Jan 2022 16:05:08 -0800, Andres Freund <andres@anarazel.de> wrote in
> The tap tests seems to fail on all platforms. See
> https://cirrus-ci.com/build/4911549314760704
>
> E.g. the linux failure is
>
> [16:45:15.569]
> [16:45:15.569] # Failed test 'inserted'
> [16:45:15.569] # at t/027_persistence_change.pl line 121.
> [16:45:15.569] # Looks like you failed 1 test of 25.
> [16:45:15.569] [16:45:15] t/027_persistence_change.pl ..........
> [16:45:15.569] Dubious, test returned 1 (wstat 256, 0x100)
> [16:45:15.569] Failed 1/25 subtests
> [16:45:15.569] [16:45:15]
> [16:45:15.569]
> [16:45:15.569] Test Summary Report
> [16:45:15.569] -------------------
> [16:45:15.569] t/027_persistence_change.pl (Wstat: 256 Tests: 25 Failed: 1)
> [16:45:15.569] Failed test: 18
> [16:45:15.569] Non-zero exit status: 1
> [16:45:15.569] Files=27, Tests=315, 220 wallclock secs ( 0.14 usr 0.03 sys + 48.94 cusr 17.13 csys = 66.24 CPU)
>
> https://api.cirrus-ci.com/v1/artifact/task/4785083130314752/tap/src/test/recovery/tmp_check/log/regress_log_027_persistence_change

Thank you very much. It still doesn't fail in my development
environment (CentOS8), but I found a silly bug in the test script.
I'm still not sure why the test item failed, but I'll repost the
updated version and then watch what the CI says.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

From e2deae1bef19827803e0e8f85b1e45e3fcd88505 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v14 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs a heap rewrite, which causes a large amount
of file I/O. This patch makes the command run without a heap rewrite.
In addition to that, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of a massive number of HEAP_INSERTs, which should be
smaller.
Also this allows for the cleanup of files left behind in the crash of the transaction that created it. --- src/backend/access/rmgrdesc/smgrdesc.c | 52 ++ src/backend/access/transam/README | 8 + src/backend/access/transam/xact.c | 7 + src/backend/access/transam/xlog.c | 17 + src/backend/catalog/storage.c | 545 +++++++++++++++++- src/backend/commands/tablecmds.c | 266 +++++++-- src/backend/replication/basebackup.c | 3 +- src/backend/storage/buffer/bufmgr.c | 88 +++ src/backend/storage/file/fd.c | 4 +- src/backend/storage/file/reinit.c | 344 +++++++---- src/backend/storage/smgr/md.c | 93 ++- src/backend/storage/smgr/smgr.c | 32 + src/backend/storage/sync/sync.c | 20 +- src/bin/pg_rewind/parsexlog.c | 24 + src/common/relpath.c | 47 +- src/include/catalog/storage.h | 3 + src/include/catalog/storage_xlog.h | 42 +- src/include/common/relpath.h | 9 +- src/include/storage/bufmgr.h | 2 + src/include/storage/fd.h | 1 + src/include/storage/md.h | 8 +- src/include/storage/reinit.h | 10 +- src/include/storage/smgr.h | 17 + src/test/recovery/t/027_persistence_change.pl | 247 ++++++++ 24 files changed, 1707 insertions(+), 182 deletions(-) create mode 100644 src/test/recovery/t/027_persistence_change.pl diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c index 7755553d57..d251f22207 100644 --- a/src/backend/access/rmgrdesc/smgrdesc.c +++ b/src/backend/access/rmgrdesc/smgrdesc.c @@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record) xlrec->blkno, xlrec->flags); pfree(path); } + else if (info == XLOG_SMGR_UNLINK) + { + xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec; + char *path = relpathperm(xlrec->rnode, xlrec->forkNum); + + appendStringInfoString(buf, path); + pfree(path); + } + else if (info == XLOG_SMGR_MARK) + { + xl_smgr_mark *xlrec = (xl_smgr_mark *) rec; + char *path = GetRelationPath(xlrec->rnode.dbNode, + xlrec->rnode.spcNode, + xlrec->rnode.relNode, + InvalidBackendId, + xlrec->forkNum, xlrec->mark); + char *action; 
+ + switch (xlrec->action) + { + case XLOG_SMGR_MARK_CREATE: + action = "CREATE"; + break; + case XLOG_SMGR_MARK_UNLINK: + action = "DELETE"; + break; + default: + action = "<unknown action>"; + break; + } + + appendStringInfo(buf, "%s %s", action, path); + pfree(path); + } + else if (info == XLOG_SMGR_BUFPERSISTENCE) + { + xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec; + char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM); + + appendStringInfoString(buf, path); + appendStringInfo(buf, " persistence %d", xlrec->persistence); + pfree(path); + } } const char * @@ -55,6 +98,15 @@ smgr_identify(uint8 info) case XLOG_SMGR_TRUNCATE: id = "TRUNCATE"; break; + case XLOG_SMGR_UNLINK: + id = "UNLINK"; + break; + case XLOG_SMGR_MARK: + id = "MARK"; + break; + case XLOG_SMGR_BUFPERSISTENCE: + id = "BUFPERSISTENCE"; + break; } return id; diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README index 1edc8180c1..b344bbe511 100644 --- a/src/backend/access/transam/README +++ b/src/backend/access/transam/README @@ -724,6 +724,14 @@ we must panic and abort recovery. The DBA will have to manually clean up then restart recovery. This is part of the reason for not writing a WAL entry until we've successfully done the original action. +The Smgr MARK files +-------------------------------- + +An smgr mark file is created together with a new relation file to +mark that the relfilenode needs to be cleaned up at recovery time. In +contrast to the four actions above, failure to remove an smgr mark file +would lead to data loss, so the server shuts down in that case.
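To make the mark-file idea in the README hunk above concrete, here is a minimal standalone sketch in plain C. It is not the patch's code: the ".u" suffix and the functions create_storage(), commit_storage() and recover_storage() are names made up for illustration, and it uses ordinary POSIX files instead of smgr and WAL. It assumes only what the patch itself shows: the mark is created before the relation file, it is removed at commit, and a mark that survives a crash tells recovery that the file belongs to an uncommitted transaction.

```c
/* Sketch only: every name below is hypothetical, not the patch's API. */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define MARK_SUFFIX ".u"	/* hypothetical suffix for an "uncommitted" mark */

/* Build "<path>.u" into buf. */
static void
mark_path(const char *path, char *buf, size_t buflen)
{
	assert(path != NULL);
	snprintf(buf, buflen, "%s%s", path, MARK_SUFFIX);
}

/* Create (or truncate) an empty file; returns true on success. */
static bool
touch_file(const char *path)
{
	FILE	   *fp = fopen(path, "w");

	if (fp == NULL)
		return false;
	return fclose(fp) == 0;
}

/*
 * Create the mark first, then the data file.  A crash in between leaves a
 * mark without a file, never a file without a mark.
 */
static bool
create_storage(const char *path)
{
	char		mark[1024];

	mark_path(path, mark, sizeof(mark));
	return touch_file(mark) && touch_file(path);
}

/* Commit: the data file is now legitimate, so only the mark is removed. */
static bool
commit_storage(const char *path)
{
	char		mark[1024];

	mark_path(path, mark, sizeof(mark));
	return unlink(mark) == 0;
}

/*
 * Recovery: a surviving mark means the creating transaction never
 * committed.  Remove the data file first and the mark last, so that a
 * crash during cleanup still leaves the mark behind for the next attempt.
 * Returns true if something was cleaned up.
 */
static bool
recover_storage(const char *path)
{
	char		mark[1024];

	mark_path(path, mark, sizeof(mark));
	if (access(mark, F_OK) != 0)
		return false;			/* no mark: nothing to clean up */
	(void) unlink(path);		/* the data file may already be missing */
	return unlink(mark) == 0;
}
```

Calling create_storage() and then recover_storage() without a commit in between removes both files, while create_storage(), commit_storage(), recover_storage() leaves the data file in place. The sketch leaves out WAL logging, fsync of the file and its directory, and the shutdown-on-failure behavior the README text describes.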
+ Skipping WAL for New RelFileNode -------------------------------- diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index e7b0bc804d..b41186d6d8 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -2197,6 +2197,9 @@ CommitTransaction(void) */ smgrDoPendingSyncs(true, is_parallel_worker); + /* Likewise delete mark files for files created during this transaction. */ + smgrDoPendingCleanups(true); + /* close large objects before lower-level cleanup */ AtEOXact_LargeObject(true); @@ -2447,6 +2450,9 @@ PrepareTransaction(void) */ smgrDoPendingSyncs(true, false); + /* Likewise delete mark files for files created during this transaction. */ + smgrDoPendingCleanups(true); + /* close large objects before lower-level cleanup */ AtEOXact_LargeObject(true); @@ -2772,6 +2778,7 @@ AbortTransaction(void) AfterTriggerEndXact(false); /* 'false' means it's abort */ AtAbort_Portals(); smgrDoPendingSyncs(false, is_parallel_worker); + smgrDoPendingCleanups(false); AtEOXact_LargeObject(false); AtAbort_Notify(); AtEOXact_RelationMap(false, is_parallel_worker); diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 87cd05c945..243860fcb1 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -40,6 +40,7 @@ #include "catalog/catversion.h" #include "catalog/pg_control.h" #include "catalog/pg_database.h" +#include "catalog/storage.h" #include "commands/progress.h" #include "commands/tablespace.h" #include "common/controldata_utils.h" @@ -4564,6 +4565,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode, { ereport(DEBUG1, (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery"))); + + /* cleanup garbage files left during crash recovery */ + ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER | + UNLOGGED_RELATION_CLEANUP); + + /* run rollback cleanup if any */ + smgrDoPendingDeletes(false); + InArchiveRecovery = true; if 
(StandbyModeRequested) StandbyMode = true; @@ -7824,6 +7833,14 @@ StartupXLOG(void) } } + /* cleanup garbage files left during crash recovery */ + if (!InArchiveRecovery) + ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER | + UNLOGGED_RELATION_CLEANUP); + + /* run rollback cleanup if any */ + smgrDoPendingDeletes(false); + /* Allow resource managers to do any required cleanup. */ for (rmid = 0; rmid <= RM_MAX_ID; rmid++) { diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c index c5ad28d71f..d6b30387e9 100644 --- a/src/backend/catalog/storage.c +++ b/src/backend/catalog/storage.c @@ -19,6 +19,7 @@ #include "postgres.h" +#include "access/amapi.h" #include "access/parallel.h" #include "access/visibilitymap.h" #include "access/xact.h" @@ -66,6 +67,23 @@ typedef struct PendingRelDelete struct PendingRelDelete *next; /* linked-list link */ } PendingRelDelete; +#define PCOP_UNLINK_FORK (1 << 0) +#define PCOP_UNLINK_MARK (1 << 1) +#define PCOP_SET_PERSISTENCE (1 << 2) + +typedef struct PendingCleanup +{ + RelFileNode relnode; /* relation that may need to be deleted */ + int op; /* operation mask */ + bool bufpersistence; /* buffer persistence to set */ + int unlink_forknum; /* forknum to unlink */ + StorageMarks unlink_mark; /* mark to unlink */ + BackendId backend; /* InvalidBackendId if not a temp rel */ + bool atCommit; /* T=delete at commit; F=delete at abort */ + int nestLevel; /* xact nesting level of request */ + struct PendingCleanup *next; /* linked-list link */ +} PendingCleanup; + typedef struct PendingRelSync { RelFileNode rnode; @@ -73,6 +91,7 @@ typedef struct PendingRelSync } PendingRelSync; static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */ +static PendingCleanup *pendingCleanups = NULL; /* head of linked list */ HTAB *pendingSyncHash = NULL; @@ -117,7 +136,8 @@ AddPendingSync(const RelFileNode *rnode) SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence) { - PendingRelDelete *pending; 
+ PendingRelDelete *pendingdel; + PendingCleanup *pendingclean; SMgrRelation srel; BackendId backend; bool needs_wal; @@ -143,21 +163,41 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) return NULL; /* placate compiler */ } + /* + * We are going to create a new storage file. If the server crashes before + * the current transaction ends, the file needs to be cleaned up. The + * SMGR_MARK_UNCOMMITTED mark file prompts that work at the next startup. + */ srel = smgropen(rnode, backend); + log_smgrcreatemark(&rnode, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED); + smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false); smgrcreate(srel, MAIN_FORKNUM, false); if (needs_wal) log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM); /* Add the relation to the list of stuff to delete at abort */ - pending = (PendingRelDelete *) + pendingdel = (PendingRelDelete *) MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); - pending->relnode = rnode; - pending->backend = backend; - pending->atCommit = false; /* delete if abort */ - pending->nestLevel = GetCurrentTransactionNestLevel(); - pending->next = pendingDeletes; - pendingDeletes = pending; + pendingdel->relnode = rnode; + pendingdel->backend = backend; + pendingdel->atCommit = false; /* delete if abort */ + pendingdel->nestLevel = GetCurrentTransactionNestLevel(); + pendingdel->next = pendingDeletes; + pendingDeletes = pendingdel; + + /* drop mark files at commit */ + pendingclean = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pendingclean->relnode = rnode; + pendingclean->op = PCOP_UNLINK_MARK; + pendingclean->unlink_forknum = MAIN_FORKNUM; + pendingclean->unlink_mark = SMGR_MARK_UNCOMMITTED; + pendingclean->backend = backend; + pendingclean->atCommit = true; + pendingclean->nestLevel = GetCurrentTransactionNestLevel(); + pendingclean->next = pendingCleanups; + pendingCleanups = pendingclean; if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded()) { @@
-168,6 +208,203 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) return srel; } +/* + * RelationCreateInitFork + * Create physical storage for the init fork of a relation. + * + * Create the init fork for the relation. + * + * This function is transactional. The creation is WAL-logged, and if the + * transaction aborts later on, the init fork will be removed. + */ +void +RelationCreateInitFork(Relation rel) +{ + RelFileNode rnode = rel->rd_node; + PendingCleanup *pending; + PendingCleanup *prev; + PendingCleanup *next; + SMgrRelation srel; + bool create = true; + + /* switch buffer persistence */ + SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false); + + /* + * If we have entries for init-fork operations on this relation, that means + * that we have already registered pending delete entries to drop an + * init-fork preexisting since before the current transaction started. This + * function reverts that change just by removing the entries. + */ + prev = NULL; + for (pending = pendingCleanups; pending != NULL; pending = next) + { + next = pending->next; + + if (RelFileNodeEquals(rnode, pending->relnode) && + pending->unlink_forknum == INIT_FORKNUM) + { + if (prev) + prev->next = next; + else + pendingCleanups = next; + + pfree(pending); + /* prev does not change */ + + create = false; + } + else + prev = pending; + } + + if (!create) + return; + + /* + * We are going to create an init fork. If server crashes before the + * current transaction ends the init fork left alone corrupts data while + * recovery. The mark file works as the sentinel to identify that + * situation. + */ + srel = smgropen(rnode, InvalidBackendId); + log_smgrcreatemark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED); + smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false); + + /* We don't have existing init fork, create it. */ + smgrcreate(srel, INIT_FORKNUM, false); + + /* + * index-init fork needs further initialization. 
ambuildempty should + * WAL-log and sync the file by itself; otherwise we do that ourselves. + */ + if (rel->rd_rel->relkind == RELKIND_INDEX) + rel->rd_indam->ambuildempty(rel); + else + { + log_smgrcreate(&rnode, INIT_FORKNUM); + smgrimmedsync(srel, INIT_FORKNUM); + } + + /* drop the init fork, mark file and revert persistence at abort */ + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = rnode; + pending->op = PCOP_UNLINK_FORK | PCOP_UNLINK_MARK | PCOP_SET_PERSISTENCE; + pending->unlink_forknum = INIT_FORKNUM; + pending->unlink_mark = SMGR_MARK_UNCOMMITTED; + pending->bufpersistence = true; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; + + /* drop mark file at commit */ + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = rnode; + pending->op = PCOP_UNLINK_MARK; + pending->unlink_forknum = INIT_FORKNUM; + pending->unlink_mark = SMGR_MARK_UNCOMMITTED; + pending->backend = InvalidBackendId; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; +} + +/* + * RelationDropInitFork + * Delete physical storage for the init fork of a relation. + * + * Register pending-delete of the init fork. The real deletion is performed by + * smgrDoPendingCleanups at commit. + * + * This function is transactional. If the transaction aborts later on, the + * deletion doesn't happen.
+ */ +void +RelationDropInitFork(Relation rel) +{ + RelFileNode rnode = rel->rd_node; + PendingCleanup *pending; + PendingCleanup *prev; + PendingCleanup *next; + bool inxact_created = false; + + /* switch buffer persistence */ + SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false); + + /* + * If we have entries for init-fork operations of this relation, that means + * that we have created the init fork in the current transaction. We + * remove the init fork and mark file immediately in that case. Otherwise + * just register pending-delete for the existing init fork. + */ + prev = NULL; + for (pending = pendingCleanups; pending != NULL; pending = next) + { + next = pending->next; + + if (RelFileNodeEquals(rnode, pending->relnode) && + pending->unlink_forknum != INIT_FORKNUM) + { + /* unlink list entry */ + if (prev) + prev->next = next; + else + pendingCleanups = next; + + pfree(pending); + /* prev does not change */ + + inxact_created = true; + } + else + prev = pending; + } + + if (inxact_created) + { + SMgrRelation srel = smgropen(rnode, InvalidBackendId); + + /* + * INIT forks never be loaded to shared buffer so no point in dropping + * buffers for such files. 
+ */ + log_smgrunlinkmark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED); + smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false); + log_smgrunlink(&rnode, INIT_FORKNUM); + smgrunlink(srel, INIT_FORKNUM, false); + return; + } + + /* register drop of this init fork file at commit */ + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = rnode; + pending->op = PCOP_UNLINK_FORK; + pending->unlink_forknum = INIT_FORKNUM; + pending->backend = InvalidBackendId; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; + + /* revert buffer-persistence changes at abort */ + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = rnode; + pending->op = PCOP_SET_PERSISTENCE; + pending->bufpersistence = false; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; +} + /* * Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL. */ @@ -187,6 +424,88 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum) XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE); } +/* + * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL. + */ +void +log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum) +{ + xl_smgr_unlink xlrec; + + /* + * Make an XLOG entry reporting the file unlink. + */ + xlrec.rnode = *rnode; + xlrec.forkNum = forkNum; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE); +} + +/* + * Perform XLogInsert of an XLOG_SMGR_CREATEMARK record to WAL. 
+ */ +void +log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum, + StorageMarks mark) +{ + xl_smgr_mark xlrec; + + /* + * Make an XLOG entry reporting the mark file creation. + */ + xlrec.rnode = *rnode; + xlrec.forkNum = forkNum; + xlrec.mark = mark; + xlrec.action = XLOG_SMGR_MARK_CREATE; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE); +} + +/* + * Perform XLogInsert of an XLOG_SMGR_UNLINKMARK record to WAL. + */ +void +log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum, + StorageMarks mark) +{ + xl_smgr_mark xlrec; + + /* + * Make an XLOG entry reporting the mark file removal. + */ + xlrec.rnode = *rnode; + xlrec.forkNum = forkNum; + xlrec.mark = mark; + xlrec.action = XLOG_SMGR_MARK_UNLINK; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE); +} + +/* + * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL. + */ +void +log_smgrbufpersistence(const RelFileNode *rnode, bool persistence) +{ + xl_smgr_bufpersistence xlrec; + + /* + * Make an XLOG entry reporting the change of buffer persistence. + */ + xlrec.rnode = *rnode; + xlrec.persistence = persistence; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE); +} + /* * RelationDropStorage * Schedule unlinking of physical storage at transaction commit. @@ -255,6 +574,7 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit) prev->next = next; else pendingDeletes = next; + pfree(pending); /* prev does not change */ } @@ -673,6 +993,88 @@ smgrDoPendingDeletes(bool isCommit) } } +/* + * smgrDoPendingCleanups() -- Clean up work that emits WAL records + * + * The operations handled in this function emit WAL records, which must be + * emitted before the commit record for the current transaction.
+ */ +void +smgrDoPendingCleanups(bool isCommit) +{ + int nestLevel = GetCurrentTransactionNestLevel(); + PendingCleanup *pending; + PendingCleanup *prev; + PendingCleanup *next; + + prev = NULL; + for (pending = pendingCleanups; pending != NULL; pending = next) + { + next = pending->next; + if (pending->nestLevel < nestLevel) + { + /* outer-level entries should not be processed yet */ + prev = pending; + } + else + { + /* unlink list entry first, so we don't retry on failure */ + if (prev) + prev->next = next; + else + pendingCleanups = next; + + /* do cleanup if called for */ + if (pending->atCommit == isCommit) + { + SMgrRelation srel; + + srel = smgropen(pending->relnode, pending->backend); + + Assert ((pending->op & + ~(PCOP_UNLINK_FORK | PCOP_UNLINK_MARK | + PCOP_SET_PERSISTENCE)) == 0); + + if (pending->op & PCOP_UNLINK_FORK) + { + /* other forks needs to drop buffers */ + Assert(pending->unlink_forknum == INIT_FORKNUM); + + /* Don't emit wal while recovery. */ + if (!InRecovery) + log_smgrunlink(&pending->relnode, + pending->unlink_forknum); + smgrunlink(srel, pending->unlink_forknum, false); + } + + if (pending->op & PCOP_UNLINK_MARK) + { + SMgrRelation srel; + + if (!InRecovery) + log_smgrunlinkmark(&pending->relnode, + pending->unlink_forknum, + pending->unlink_mark); + srel = smgropen(pending->relnode, pending->backend); + smgrunlinkmark(srel, pending->unlink_forknum, + pending->unlink_mark, InRecovery); + smgrclose(srel); + } + + if (pending->op & PCOP_SET_PERSISTENCE) + { + SetRelationBuffersPersistence(srel, pending->bufpersistence, + InRecovery); + } + } + + /* must explicitly free the list entry */ + pfree(pending); + /* prev does not change */ + } + } +} + /* * smgrDoPendingSyncs() -- Take care of relation syncs at end of xact. 
*/ @@ -933,6 +1335,15 @@ smgr_redo(XLogReaderState *record) reln = smgropen(xlrec->rnode, InvalidBackendId); smgrcreate(reln, xlrec->forkNum, true); } + else if (info == XLOG_SMGR_UNLINK) + { + xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record); + SMgrRelation reln; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + smgrunlink(reln, xlrec->forkNum, true); + smgrclose(reln); + } else if (info == XLOG_SMGR_TRUNCATE) { xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record); @@ -1021,6 +1432,124 @@ smgr_redo(XLogReaderState *record) FreeFakeRelcacheEntry(rel); } + else if (info == XLOG_SMGR_MARK) + { + xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record); + SMgrRelation reln; + PendingCleanup *pending; + bool created = false; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + + switch (xlrec->action) + { + case XLOG_SMGR_MARK_CREATE: + smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true); + created = true; + break; + case XLOG_SMGR_MARK_UNLINK: + smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true); + break; + default: + elog(ERROR, "unknown smgr_mark action \"%c\"", xlrec->mark); + } + + if (created) + { + /* revert mark file operation at abort */ + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = xlrec->rnode; + pending->op = PCOP_UNLINK_MARK; + pending->unlink_forknum = xlrec->forkNum; + pending->unlink_mark = xlrec->mark; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; + } + else + { + /* + * Delete pending action for this mark file if any. We should have + * at most one entry for this action. 
+ */ + PendingCleanup *prev = NULL; + + for (pending = pendingCleanups; pending != NULL; + pending = pending->next) + { + if (RelFileNodeEquals(xlrec->rnode, pending->relnode) && + pending->unlink_forknum == xlrec->forkNum && + (pending->op & PCOP_UNLINK_MARK) != 0) + { + if (prev) + prev->next = pending->next; + else + pendingCleanups = pending->next; + + pfree(pending); + break; + } + + prev = pending; + } + } + } + else if (info == XLOG_SMGR_BUFPERSISTENCE) + { + xl_smgr_bufpersistence *xlrec = + (xl_smgr_bufpersistence *) XLogRecGetData(record); + SMgrRelation reln; + PendingCleanup *pending; + PendingCleanup *prev = NULL; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + SetRelationBuffersPersistence(reln, xlrec->persistence, true); + + /* + * Delete pending action for persistence change if any. We should have + * at most one entry for this action. + */ + for (pending = pendingCleanups; pending != NULL; + pending = pending->next) + { + if (RelFileNodeEquals(xlrec->rnode, pending->relnode) && + (pending->op & PCOP_SET_PERSISTENCE) != 0) + { + Assert (pending->bufpersistence == xlrec->persistence); + + if (prev) + prev->next = pending->next; + else + pendingCleanups = pending->next; + + pfree(pending); + break; + } + + prev = pending; + } + + /* + * Revert buffer-persistence changes at abort if the relation is going + * to different persistence from before this transaction. 
+ */ + if (!pending) + { + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = xlrec->rnode; + pending->op = PCOP_SET_PERSISTENCE; + pending->bufpersistence = !xlrec->persistence; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; + } + } else elog(PANIC, "smgr_redo: unknown op code %u", info); } diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 3631b8a929..848fda40ca 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -52,6 +52,7 @@ #include "commands/defrem.h" #include "commands/event_trigger.h" #include "commands/policy.h" +#include "commands/progress.h" #include "commands/sequence.h" #include "commands/tablecmds.h" #include "commands/tablespace.h" @@ -5329,6 +5330,187 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel, return newcmd; } +/* + * RelationChangePersistence: do in-place persistence change of a relation + */ +static void +RelationChangePersistence(AlteredTableInfo *tab, char persistence, + LOCKMODE lockmode) +{ + Relation rel; + Relation classRel; + HeapTuple tuple, + newtuple; + Datum new_val[Natts_pg_class]; + bool new_null[Natts_pg_class], + new_repl[Natts_pg_class]; + int i; + List *relids; + ListCell *lc_oid; + + Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE); + Assert(lockmode == AccessExclusiveLock); + + /* + * Under the following condition, we need to call ATRewriteTable, which + * cannot be false in the AT_REWRITE_ALTER_PERSISTENCE case. 
+ */ + Assert(tab->constraints == NULL && tab->partition_constraint == NULL && + tab->newvals == NULL && !tab->verify_new_notnull); + + rel = table_open(tab->relid, lockmode); + + Assert(rel->rd_rel->relpersistence != persistence); + + elog(DEBUG1, "perform in-place persistence change"); + + /* + * First we collect all relations whose persistence we need to change. + */ + + /* Collect OIDs of indexes and toast relations */ + relids = RelationGetIndexList(rel); + relids = lcons_oid(rel->rd_id, relids); + + /* Add toast relation if any */ + if (OidIsValid(rel->rd_rel->reltoastrelid)) + { + List *toastidx; + Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode); + + relids = lappend_oid(relids, rel->rd_rel->reltoastrelid); + toastidx = RelationGetIndexList(toastrel); + relids = list_concat(relids, toastidx); + pfree(toastidx); + table_close(toastrel, NoLock); + } + + table_close(rel, NoLock); + + /* Make changes in storage */ + classRel = table_open(RelationRelationId, RowExclusiveLock); + + foreach (lc_oid, relids) + { + Oid reloid = lfirst_oid(lc_oid); + Relation r = relation_open(reloid, lockmode); + + /* + * XXXX: Some access methods do not bear up an in-place persistence + * change. Specifically, GiST uses page LSNs to figure out whether a + * block has changed, where UNLOGGED GiST indexes use fake LSNs that + * are incompatible with real LSNs used for LOGGED ones. + * + * Maybe if gistGetFakeLSN behaved the same way for permanent and + * unlogged indexes, we could skip index rebuild in exchange of some + * extra WAL records emitted while it is unlogged. + * + * Check relam against a positive list so that we take this way for + * unknown AMs.
+ */ + if (r->rd_rel->relkind == RELKIND_INDEX && + /* GiST is excluded */ + r->rd_rel->relam != BTREE_AM_OID && + r->rd_rel->relam != HASH_AM_OID && + r->rd_rel->relam != GIN_AM_OID && + r->rd_rel->relam != SPGIST_AM_OID && + r->rd_rel->relam != BRIN_AM_OID) + { + int reindex_flags; + ReindexParams params = {0}; + + /* reindex doesn't allow concurrent use of the index */ + table_close(r, NoLock); + + reindex_flags = + REINDEX_REL_SUPPRESS_INDEX_USE | + REINDEX_REL_CHECK_CONSTRAINTS; + + /* Set the same persistence as the parent relation. */ + if (persistence == RELPERSISTENCE_UNLOGGED) + reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED; + else + reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT; + + reindex_index(reloid, reindex_flags, persistence, &params); + + continue; + } + + /* Create or drop init fork */ + if (persistence == RELPERSISTENCE_UNLOGGED) + RelationCreateInitFork(r); + else + RelationDropInitFork(r); + + /* + * When this relation gets WAL-logged, immediately sync all files but + * initfork to establish the initial state on storage. Buffers have + * already been flushed out by RelationCreate(Drop)InitFork called just + * above. Initfork should have been synced as needed.
+ */ + if (persistence == RELPERSISTENCE_PERMANENT) + { + for (i = 0 ; i < INIT_FORKNUM ; i++) + { + if (smgrexists(RelationGetSmgr(r), i)) + smgrimmedsync(RelationGetSmgr(r), i); + } + } + + /* Update catalog */ + tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid)); + if (!HeapTupleIsValid(tuple)) + elog(ERROR, "cache lookup failed for relation %u", reloid); + + memset(new_val, 0, sizeof(new_val)); + memset(new_null, false, sizeof(new_null)); + memset(new_repl, false, sizeof(new_repl)); + + new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence); + new_null[Anum_pg_class_relpersistence - 1] = false; + new_repl[Anum_pg_class_relpersistence - 1] = true; + + newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel), + new_val, new_null, new_repl); + + CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple); + heap_freetuple(newtuple); + + /* + * While wal_level >= replica, switching to LOGGED requires the + * relation content to be WAL-logged to recover the table. + * We don't emit this while wal_level = minimal. + */ + if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded()) + { + ForkNumber fork; + xl_smgr_truncate xlrec; + + xlrec.blkno = 0; + xlrec.rnode = r->rd_node; + xlrec.flags = SMGR_TRUNCATE_ALL; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + + XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE); + + for (fork = 0; fork < INIT_FORKNUM ; fork++) + { + if (smgrexists(RelationGetSmgr(r), fork)) + log_newpage_range(r, fork, 0, + smgrnblocks(RelationGetSmgr(r), fork), + false); + } + } + + table_close(r, NoLock); + } + + table_close(classRel, NoLock); +} + /* * ATRewriteTables: ALTER TABLE phase 3 */ @@ -5459,47 +5641,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode, tab->relid, tab->rewrite); - /* - * Create transient table that will receive the modified data. - * - * Ensure it is marked correctly as logged or unlogged.
We have - * to do this here so that buffers for the new relfilenode will - * have the right persistence set, and at the same time ensure - * that the original filenode's buffers will get read in with the - * correct setting (i.e. the original one). Otherwise a rollback - * after the rewrite would possibly result with buffers for the - * original filenode having the wrong persistence setting. - * - * NB: This relies on swap_relation_files() also swapping the - * persistence. That wouldn't work for pg_class, but that can't be - * unlogged anyway. - */ - OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod, - persistence, lockmode); + if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE) + RelationChangePersistence(tab, persistence, lockmode); + else + { + /* + * Create transient table that will receive the modified data. + * + * Ensure it is marked correctly as logged or unlogged. We + * have to do this here so that buffers for the new relfilenode + * will have the right persistence set, and at the same time + * ensure that the original filenode's buffers will get read in + * with the correct setting (i.e. the original one). Otherwise + * a rollback after the rewrite would possibly result with + * buffers for the original filenode having the wrong + * persistence setting. + * + * NB: This relies on swap_relation_files() also swapping the + * persistence. That wouldn't work for pg_class, but that can't + * be unlogged anyway. + */ + OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, + NewAccessMethod, + persistence, lockmode); - /* - * Copy the heap data into the new table with the desired - * modifications, and test the current data within the table - * against new constraints generated by ALTER TABLE commands. - */ - ATRewriteTable(tab, OIDNewHeap, lockmode); + /* + * Copy the heap data into the new table with the desired + * modifications, and test the current data within the table + * against new constraints generated by ALTER TABLE commands. 
+ */ + ATRewriteTable(tab, OIDNewHeap, lockmode); - /* - * Swap the physical files of the old and new heaps, then rebuild - * indexes and discard the old heap. We can use RecentXmin for - * the table's new relfrozenxid because we rewrote all the tuples - * in ATRewriteTable, so no older Xid remains in the table. Also, - * we never try to swap toast tables by content, since we have no - * interest in letting this code work on system catalogs. - */ - finish_heap_swap(tab->relid, OIDNewHeap, - false, false, true, - !OidIsValid(tab->newTableSpace), - RecentXmin, - ReadNextMultiXactId(), - persistence); + /* + * Swap the physical files of the old and new heaps, then + * rebuild indexes and discard the old heap. We can use + * RecentXmin for the table's new relfrozenxid because we + * rewrote all the tuples in ATRewriteTable, so no older Xid + * remains in the table. Also, we never try to swap toast + * tables by content, since we have no interest in letting this + * code work on system catalogs. 
+ */ + finish_heap_swap(tab->relid, OIDNewHeap, + false, false, true, + !OidIsValid(tab->newTableSpace), + RecentXmin, + ReadNextMultiXactId(), + persistence); - InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0); + InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0); + } } else { diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c index ec0485705d..45e1a5d817 100644 --- a/src/backend/replication/basebackup.c +++ b/src/backend/replication/basebackup.c @@ -1070,6 +1070,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly, bool excludeFound; ForkNumber relForkNum; /* Type of fork if file is a relation */ int relOidChars; /* Chars in filename that are the rel oid */ + StorageMarks mark; /* Skip special stuff */ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0) @@ -1120,7 +1121,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly, /* Exclude all forks for unlogged tables except the init fork */ if (isDbDir && parse_filename_for_nontemp_relation(de->d_name, &relOidChars, - &relForkNum)) + &relForkNum, &mark)) { /* Never exclude init forks */ if (relForkNum != INIT_FORKNUM) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index b4532948d3..dab74bf99a 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -37,6 +37,7 @@ #include "access/xlogutils.h" #include "catalog/catalog.h" #include "catalog/storage.h" +#include "catalog/storage_xlog.h" #include "executor/instrument.h" #include "lib/binaryheap.h" #include "miscadmin.h" @@ -3154,6 +3155,93 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum, } } +/* --------------------------------------------------------------------- + * SetRelFileNodeBuffersPersistence + * + * This function changes the persistence of all buffer pages of a relation + * then writes all dirty pages of the relation out to disk when switching + 
* to PERMANENT. (or more accurately, out to kernel disk buffers), + * ensuring that the kernel has an up-to-date view of the relation. + * + * Generally, the caller should be holding AccessExclusiveLock on the + * target relation to ensure that no other backend is busy dirtying + * more blocks of the relation; the effects can't be expected to last + * after the lock is released. + * + * XXX currently it sequentially searches the buffer pool, should be + * changed to more clever ways of searching. This routine is not + * used in any performance-critical code paths, so it's not worth + * adding additional overhead to normal paths to make it go faster; + * but see also DropRelFileNodeBuffers. + * -------------------------------------------------------------------- + */ +void +SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo) +{ + int i; + RelFileNodeBackend rnode = srel->smgr_rnode; + + Assert (!RelFileNodeBackendIsTemp(rnode)); + + if (!isRedo) + log_smgrbufpersistence(&srel->smgr_rnode.node, permanent); + + ResourceOwnerEnlargeBuffers(CurrentResourceOwner); + + for (i = 0; i < NBuffers; i++) + { + BufferDesc *bufHdr = GetBufferDescriptor(i); + uint32 buf_state; + + if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node)) + continue; + + ReservePrivateRefCountEntry(); + + buf_state = LockBufHdr(bufHdr); + + if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node)) + { + UnlockBufHdr(bufHdr, buf_state); + continue; + } + + if (permanent) + { + /* Init fork is being dropped, drop buffers for it. 
*/ + if (bufHdr->tag.forkNum == INIT_FORKNUM) + { + InvalidateBuffer(bufHdr); + continue; + } + + buf_state |= BM_PERMANENT; + pg_atomic_write_u32(&bufHdr->state, buf_state); + + /* we flush this buffer when switching to PERMANENT */ + if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY)) + { + PinBuffer_Locked(bufHdr); + LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), + LW_SHARED); + FlushBuffer(bufHdr, srel); + LWLockRelease(BufferDescriptorGetContentLock(bufHdr)); + UnpinBuffer(bufHdr, true); + } + else + UnlockBufHdr(bufHdr, buf_state); + } + else + { + /* init fork is always BM_PERMANENT. See BufferAlloc */ + if (bufHdr->tag.forkNum != INIT_FORKNUM) + buf_state &= ~BM_PERMANENT; + + UnlockBufHdr(bufHdr, buf_state); + } + } +} + /* --------------------------------------------------------------------- * DropRelFileNodesAllBuffers * diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index 263057841d..8487ae1f02 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -349,8 +349,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel); static void datadir_fsync_fname(const char *fname, bool isdir, int elevel); static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel); -static int fsync_parent_path(const char *fname, int elevel); - /* * pg_fsync --- do fsync with or without writethrough @@ -3759,7 +3757,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel) * This is aimed at making file operations persistent on disk in case of * an OS crash or power failure. 
*/ -static int +int fsync_parent_path(const char *fname, int elevel) { char parentpath[MAXPGPATH]; diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c index 0ae3fb6902..0137902bb2 100644 --- a/src/backend/storage/file/reinit.c +++ b/src/backend/storage/file/reinit.c @@ -16,29 +16,49 @@ #include <unistd.h> +#include "access/xlog.h" +#include "catalog/pg_tablespace_d.h" #include "common/relpath.h" #include "postmaster/startup.h" +#include "storage/bufmgr.h" #include "storage/copydir.h" #include "storage/fd.h" +#include "storage/md.h" #include "storage/reinit.h" +#include "storage/smgr.h" #include "utils/hsearch.h" #include "utils/memutils.h" static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, - int op); + Oid tspid, int op); static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, - int op); + Oid tspid, Oid dbid, int op); typedef struct { Oid reloid; /* hash key */ -} unlogged_relation_entry; + bool has_init; /* has INIT fork */ + bool dirty_init; /* needs to remove INIT fork */ + bool dirty_all; /* needs to remove all forks */ +} relfile_entry; /* - * Reset unlogged relations from before the last restart. + * Clean up and reset relation files from before the last restart. * - * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any - * relation with an "init" fork, except for the "init" fork itself. + * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations + * depending on the existence of the "cleanup" forks. + * + * If SMGR_MARK_UNCOMMITTED mark file for init fork is present, we remove the + * init fork along with the mark file. + * + * If SMGR_MARK_UNCOMMITTED mark file for main fork is present, we remove the + * whole relation along with the mark file. + * + * Otherwise, if the "init" fork is found, we remove all forks of any relation + * with the "init" fork, except for the "init" fork itself.
+ * + * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all + * relations that have the "cleanup" and/or the "init" forks. * * If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main * fork. @@ -72,7 +92,7 @@ ResetUnloggedRelations(int op) /* * First process unlogged files in pg_default ($PGDATA/base) */ - ResetUnloggedRelationsInTablespaceDir("base", op); + ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op); /* * Cycle through directories for all non-default tablespaces. @@ -81,13 +101,19 @@ ResetUnloggedRelations(int op) while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL) { + Oid tspid; + if (strcmp(spc_de->d_name, ".") == 0 || strcmp(spc_de->d_name, "..") == 0) continue; snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s", spc_de->d_name, TABLESPACE_VERSION_DIRECTORY); - ResetUnloggedRelationsInTablespaceDir(temp_path, op); + + tspid = atooid(spc_de->d_name); + + Assert(tspid != 0); + ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op); } FreeDir(spc_dir); @@ -103,7 +129,8 @@ ResetUnloggedRelations(int op) * Process one tablespace directory for ResetUnloggedRelations */ static void -ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) +ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, + Oid tspid, int op) { DIR *ts_dir; struct dirent *de; @@ -130,6 +157,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) while ((de = ReadDir(ts_dir, tsdirname)) != NULL) { + Oid dbid; + /* * We're only interested in the per-database directories, which have * numeric names. Note that this code will also (properly) ignore "." 
@@ -148,7 +177,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) ereport_startup_progress("resetting unlogged relations (cleanup), elapsed time: %ld.%02d s, current path: %s", dbspace_path); - ResetUnloggedRelationsInDbspaceDir(dbspace_path, op); + dbid = atooid(de->d_name); + Assert(dbid != 0); + + ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op); } FreeDir(ts_dir); @@ -158,125 +190,227 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) * Process one per-dbspace directory for ResetUnloggedRelations */ static void -ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) +ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, + Oid tspid, Oid dbid, int op) { DIR *dbspace_dir; struct dirent *de; char rm_path[MAXPGPATH * 2]; + HTAB *hash; + HASHCTL ctl; /* Caller must specify at least one operation. */ - Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0); + Assert((op & (UNLOGGED_RELATION_CLEANUP | + UNLOGGED_RELATION_DROP_BUFFER | + UNLOGGED_RELATION_INIT)) != 0); /* * Cleanup is a two-pass operation. First, we go through and identify all * the files with init forks. Then, we go through again and nuke * everything with the same OID except the init fork. */ + + /* + * It's possible that someone could create tons of unlogged relations in + * the same database & tablespace, so we'd better use a hash table rather + * than an array or linked list to keep track of which files need to be + * reset. Otherwise, this cleanup operation would be O(n^2). + */ + memset(&ctl, 0, sizeof(ctl)); + ctl.keysize = sizeof(Oid); + ctl.entrysize = sizeof(relfile_entry); + hash = hash_create("relfilenode cleanup hash", + 32, &ctl, HASH_ELEM | HASH_BLOBS); + + /* Collect INIT fork and mark files in the directory. 
*/ + dbspace_dir = AllocateDir(dbspacedirname); + while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) + { + int oidchars; + ForkNumber forkNum; + StorageMarks mark; + + /* Skip anything that doesn't look like a relation data file. */ + if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, + &forkNum, &mark)) + continue; + + if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED) + { + Oid key; + relfile_entry *ent; + bool found; + + /* + * Record the relfilenode information. If it has + * SMGR_MARK_UNCOMMITTED mark files, the relfilenode is in dirty + * state, where clean up is needed. + */ + key = atooid(de->d_name); + ent = hash_search(hash, &key, HASH_ENTER, &found); + + if (!found) + { + ent->has_init = false; + ent->dirty_init = false; + ent->dirty_all = false; + } + + if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED) + ent->dirty_init = true; + else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED) + ent->dirty_all = true; + else + { + Assert(forkNum == INIT_FORKNUM); + ent->has_init = true; + } + } + } + + /* Done with the first pass. */ + FreeDir(dbspace_dir); + + /* nothing to do if we don't have init nor cleanup forks */ + if (hash_get_num_entries(hash) < 1) + { + hash_destroy(hash); + return; + } + + if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0) + { + /* + * When we come here after recovery, smgr object for this file might + * have been created. In that case we need to drop all buffers then the + * smgr object before initializing the unlogged relation. This is safe + * as far as no other backends have accessed the relation before + * starting archive recovery. 
+ */ + HASH_SEQ_STATUS status; + relfile_entry *ent; + SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8); + int maxrels = 8; + int nrels = 0; + int i; + + Assert(!HotStandbyActive()); + + hash_seq_init(&status, hash); + while((ent = (relfile_entry *) hash_seq_search(&status)) != NULL) + { + RelFileNodeBackend rel; + + /* + * The relation is persistent and remains persistent. Don't + * drop the buffers for this relation. + */ + if (ent->has_init && ent->dirty_init) + continue; + + if (maxrels <= nrels) + { + maxrels *= 2; + srels = repalloc(srels, sizeof(SMgrRelation) * maxrels); + } + + rel.backend = InvalidBackendId; + rel.node.spcNode = tspid; + rel.node.dbNode = dbid; + rel.node.relNode = ent->reloid; + + srels[nrels++] = smgropen(rel.node, InvalidBackendId); + } + + DropRelFileNodesAllBuffers(srels, nrels); + + for (i = 0 ; i < nrels ; i++) + smgrclose(srels[i]); + } + + /* + * Now, make a second pass and remove anything that matches. + */ if ((op & UNLOGGED_RELATION_CLEANUP) != 0) { - HTAB *hash; - HASHCTL ctl; - - /* - * It's possible that someone could create a ton of unlogged relations - * in the same database & tablespace, so we'd better use a hash table - * rather than an array or linked list to keep track of which files - * need to be reset. Otherwise, this cleanup operation would be - * O(n^2). - */ - ctl.keysize = sizeof(Oid); - ctl.entrysize = sizeof(unlogged_relation_entry); - ctl.hcxt = CurrentMemoryContext; - hash = hash_create("unlogged relation OIDs", 32, &ctl, - HASH_ELEM | HASH_BLOBS | HASH_CONTEXT); - - /* Scan the directory. */ dbspace_dir = AllocateDir(dbspacedirname); while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) { - ForkNumber forkNum; - int oidchars; - unlogged_relation_entry ent; + ForkNumber forkNum; + StorageMarks mark; + int oidchars; + Oid key; + relfile_entry *ent; + RelFileNodeBackend rel; /* Skip anything that doesn't look like a relation data file.
*/ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) - continue; - - /* Also skip it unless this is the init fork. */ - if (forkNum != INIT_FORKNUM) - continue; - - /* - * Put the OID portion of the name into the hash table, if it - * isn't already. - */ - ent.reloid = atooid(de->d_name); - (void) hash_search(hash, &ent, HASH_ENTER, NULL); - } - - /* Done with the first pass. */ - FreeDir(dbspace_dir); - - /* - * If we didn't find any init forks, there's no point in continuing; - * we can bail out now. - */ - if (hash_get_num_entries(hash) == 0) - { - hash_destroy(hash); - return; - } - - /* - * Now, make a second pass and remove anything that matches. - */ - dbspace_dir = AllocateDir(dbspacedirname); - while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) - { - ForkNumber forkNum; - int oidchars; - unlogged_relation_entry ent; - - /* Skip anything that doesn't look like a relation data file. */ - if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) - continue; - - /* We never remove the init fork. */ - if (forkNum == INIT_FORKNUM) + &forkNum, &mark)) continue; /* * See whether the OID portion of the name shows up in the hash * table. If so, nuke it! */ - ent.reloid = atooid(de->d_name); - if (hash_search(hash, &ent, HASH_FIND, NULL)) + key = atooid(de->d_name); + ent = hash_search(hash, &key, HASH_FIND, NULL); + + if (!ent) + continue; + + if (!ent->dirty_all) { - snprintf(rm_path, sizeof(rm_path), "%s/%s", - dbspacedirname, de->d_name); - if (unlink(rm_path) < 0) - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not remove file \"%s\": %m", - rm_path))); + /* clean permanent relations don't need cleanup */ + if (!ent->has_init) + continue; + + if (ent->dirty_init) + { + /* + * The crashed transaction did SET UNLOGGED. This relation + * is restored to a LOGGED relation.
+ */ + if (forkNum != INIT_FORKNUM) + continue; + } else - elog(DEBUG2, "unlinked file \"%s\"", rm_path); + { + /* + * we don't remove the INIT fork of a non-dirty + * relfilenode + */ + if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE) + continue; + } } + + /* so, nuke it! */ + snprintf(rm_path, sizeof(rm_path), "%s/%s", + dbspacedirname, de->d_name); + if (unlink(rm_path) < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not remove file \"%s\": %m", + rm_path))); + + rel.backend = InvalidBackendId; + rel.node.spcNode = tspid; + rel.node.dbNode = dbid; + rel.node.relNode = atooid(de->d_name); + + ForgetRelationForkSyncRequests(rel, forkNum); } /* Cleanup is complete. */ FreeDir(dbspace_dir); - hash_destroy(hash); } + hash_destroy(hash); + hash = NULL; + /* * Initialization happens after cleanup is complete: we copy each init - * fork file to the corresponding main fork file. Note that if we are - * asked to do both cleanup and init, we may never get here: if the - * cleanup code determines that there are no init forks in this dbspace, - * it will return before we get to this point. + * fork file to the corresponding main fork file. */ if ((op & UNLOGGED_RELATION_INIT) != 0) { @@ -285,6 +419,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) { ForkNumber forkNum; + StorageMarks mark; int oidchars; char oidbuf[OIDCHARS + 1]; char srcpath[MAXPGPATH * 2]; @@ -292,9 +427,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) /* Skip anything that doesn't look like a relation data file. */ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) + &forkNum, &mark)) continue; + Assert(mark == SMGR_MARK_NONE); + /* Also skip it unless this is the init fork. 
*/ if (forkNum != INIT_FORKNUM) continue; @@ -328,15 +465,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) { ForkNumber forkNum; + StorageMarks mark; int oidchars; char oidbuf[OIDCHARS + 1]; char mainpath[MAXPGPATH]; /* Skip anything that doesn't look like a relation data file. */ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) + &forkNum, &mark)) continue; + Assert(mark == SMGR_MARK_NONE); + /* Also skip it unless this is the init fork. */ if (forkNum != INIT_FORKNUM) continue; @@ -379,7 +519,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) */ bool parse_filename_for_nontemp_relation(const char *name, int *oidchars, - ForkNumber *fork) + ForkNumber *fork, StorageMarks *mark) { int pos; @@ -410,11 +550,19 @@ parse_filename_for_nontemp_relation(const char *name, int *oidchars, for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar) ; - if (segchar <= 1) - return false; - pos += segchar; + if (segchar > 1) + pos += segchar; } + /* mark file? */ + if (name[pos] == '.' && name[pos + 1] != 0) + { + *mark = name[pos + 1]; + pos += 2; + } + else + *mark = SMGR_MARK_NONE; + /* Now we should be at the end. */ if (name[pos] != '\0') return false; diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index b4bca7eed6..580b74839f 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -139,7 +139,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno, BlockNumber blkno, bool skipFsync, int behavior); static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg); - +static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum, + StorageMarks mark); /* * mdinit() -- Initialize private state for magnetic disk storage manager. 
@@ -169,6 +170,81 @@ mdexists(SMgrRelation reln, ForkNumber forkNum) return (mdopenfork(reln, forkNum, EXTENSION_RETURN_NULL) != NULL); } +/* + * mdcreatemark() -- Create a mark file. + * + * If isRedo is true, it's okay for the file to exist already. + */ +void +mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark, + bool isRedo) +{ + char *path = markpath(reln->smgr_rnode, forkNum, mark); + int fd; + + /* See mdcreate for details. */ + TablespaceCreateDbspace(reln->smgr_rnode.node.spcNode, + reln->smgr_rnode.node.dbNode, + isRedo); + + fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL); + if (fd < 0 && (!isRedo || errno != EEXIST)) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not create mark file \"%s\": %m", path))); + + pg_fsync(fd); + close(fd); + + /* + * To guarantee that the creation of the file is persistent, fsync its + * parent directory. + */ + fsync_parent_path(path, ERROR); + + pfree(path); +} + + +/* + * mdunlinkmark() -- Delete the mark file + * + * If isRedo is true, it's okay for the file not to be found. + */ +void +mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark, + bool isRedo) +{ + char *path = markpath(reln->smgr_rnode, forkNum, mark); + + if (!isRedo || mdmarkexists(reln, forkNum, mark)) + durable_unlink(path, ERROR); + + pfree(path); +} + +/* + * mdmarkexists() -- Check if the file exists. + */ +static bool +mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark) +{ + char *path = markpath(reln->smgr_rnode, forkNum, mark); + int fd; + + fd = BasicOpenFile(path, O_RDONLY); + if (fd < 0 && errno != ENOENT) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not access mark file \"%s\": %m", path))); + pfree(path); + + if (fd < 0) + return false; + + return true; +} + /* * mdcreate() -- Create a new relation on magnetic disk.
* @@ -1025,6 +1101,15 @@ register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum, RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ ); } +/* + * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork + */ +void +ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, ForkNumber forknum) +{ + register_forget_request(rnode, forknum, 0); +} + /* * ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB */ @@ -1378,12 +1463,14 @@ mdsyncfiletag(const FileTag *ftag, char *path) * Return 0 on success, -1 on failure, with errno set. */ int -mdunlinkfiletag(const FileTag *ftag, char *path) +mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark) { char *p; /* Compute the path. */ - p = relpathperm(ftag->rnode, MAIN_FORKNUM); + p = GetRelationPath(ftag->rnode.dbNode, ftag->rnode.spcNode, + ftag->rnode.relNode, InvalidBackendId, MAIN_FORKNUM, + mark); strlcpy(path, p, MAXPGPATH); pfree(p); diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c index 0fcef4994b..110e64b0b2 100644 --- a/src/backend/storage/smgr/smgr.c +++ b/src/backend/storage/smgr/smgr.c @@ -62,6 +62,10 @@ typedef struct f_smgr void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks); void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum); + void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); + void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); } f_smgr; static const f_smgr smgrsw[] = { @@ -82,6 +86,8 @@ static const f_smgr smgrsw[] = { .smgr_nblocks = mdnblocks, .smgr_truncate = mdtruncate, .smgr_immedsync = mdimmedsync, + .smgr_createmark = mdcreatemark, + .smgr_unlinkmark = mdunlinkmark, } }; @@ -335,6 +341,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo) smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo); } +/* + * smgrcreatemark() -- Create a mark 
file + */ +void +smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark, + bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo); +} + +/* + * smgrunlinkmark() -- Delete a mark file + */ +void +smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark, + bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo); +} + /* * smgrdosyncall() -- Immediately sync all forks of all given relations * @@ -662,6 +688,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum) smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum); } +void +smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo); +} + /* * AtEOXact_SMgr * diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c index d4083e8a56..9563940d45 100644 --- a/src/backend/storage/sync/sync.c +++ b/src/backend/storage/sync/sync.c @@ -89,7 +89,8 @@ static CycleCtr checkpoint_cycle_ctr = 0; typedef struct SyncOps { int (*sync_syncfiletag) (const FileTag *ftag, char *path); - int (*sync_unlinkfiletag) (const FileTag *ftag, char *path); + int (*sync_unlinkfiletag) (const FileTag *ftag, char *path, + StorageMarks mark); bool (*sync_filetagmatches) (const FileTag *ftag, const FileTag *candidate); } SyncOps; @@ -222,7 +223,8 @@ SyncPostCheckpoint(void) /* Unlink the file */ if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag, - path) < 0) + path, + SMGR_MARK_NONE) < 0) { /* * There's a race condition, when the database is dropped at the @@ -236,6 +238,20 @@ SyncPostCheckpoint(void) (errcode_for_file_access(), errmsg("could not remove file \"%s\": %m", path))); } + else if (syncsw[entry->tag.handler].sync_unlinkfiletag( + &entry->tag, path, + SMGR_MARK_UNCOMMITTED) < 0) + { + /* + * And we may have SMGR_MARK_UNCOMMITTED file. Remove it if the + * fork files have been successfully removed.
It's ok if the file + * does not exist. + */ + if (errno != ENOENT) + ereport(WARNING, + (errcode_for_file_access(), + errmsg("could not remove file \"%s\": %m", path))); + } /* Mark the list entry as canceled, just in case */ entry->canceled = true; diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c index 436df54120..dbc0da5da5 100644 --- a/src/bin/pg_rewind/parsexlog.c +++ b/src/bin/pg_rewind/parsexlog.c @@ -407,6 +407,30 @@ extractPageInfo(XLogReaderState *record) * source system. */ } + else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK) + { + /* + * We can safely ignore these. When we compare the sizes later on, + * we'll notice that they differ, and copy the missing tail from + * source system. + */ + } + else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK) + { + /* + * We can safely ignore these. When we compare the sizes later on, + * we'll notice that they differ, and copy the missing tail from + * source system. + */ + } + else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE) + { + /* + * We can safely ignore these. When we compare the sizes later on, + * we'll notice that they differ, and copy the missing tail from + * source system. 
+ */ + } else if (rmid == RM_XACT_ID && ((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT || (rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED || diff --git a/src/common/relpath.c b/src/common/relpath.c index 1f5c426ec0..4945b111cc 100644 --- a/src/common/relpath.c +++ b/src/common/relpath.c @@ -139,9 +139,15 @@ GetDatabasePath(Oid dbNode, Oid spcNode) */ char * GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, - int backendId, ForkNumber forkNumber) + int backendId, ForkNumber forkNumber, char mark) { char *path; + char markstr[4]; + + if (mark == 0) + markstr[0] = 0; + else + snprintf(markstr, sizeof(markstr), ".%c", mark); if (spcNode == GLOBALTABLESPACE_OID) { @@ -149,10 +155,10 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, Assert(dbNode == 0); Assert(backendId == InvalidBackendId); if (forkNumber != MAIN_FORKNUM) - path = psprintf("global/%u_%s", - relNode, forkNames[forkNumber]); + path = psprintf("global/%u_%s%s", + relNode, forkNames[forkNumber], markstr); else - path = psprintf("global/%u", relNode); + path = psprintf("global/%u%s", relNode, markstr); } else if (spcNode == DEFAULTTABLESPACE_OID) { @@ -160,22 +166,22 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, if (backendId == InvalidBackendId) { if (forkNumber != MAIN_FORKNUM) - path = psprintf("base/%u/%u_%s", + path = psprintf("base/%u/%u_%s%s", dbNode, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("base/%u/%u", - dbNode, relNode); + path = psprintf("base/%u/%u%s", + dbNode, relNode, markstr); } else { if (forkNumber != MAIN_FORKNUM) - path = psprintf("base/%u/t%d_%u_%s", + path = psprintf("base/%u/t%d_%u_%s%s", dbNode, backendId, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("base/%u/t%d_%u", - dbNode, backendId, relNode); + path = psprintf("base/%u/t%d_%u%s", + dbNode, backendId, relNode, markstr); } } else @@ -184,27 +190,28 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid 
relNode, if (backendId == InvalidBackendId) { if (forkNumber != MAIN_FORKNUM) - path = psprintf("pg_tblspc/%u/%s/%u/%u_%s", + path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s", spcNode, TABLESPACE_VERSION_DIRECTORY, dbNode, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("pg_tblspc/%u/%s/%u/%u", + path = psprintf("pg_tblspc/%u/%s/%u/%u%s", spcNode, TABLESPACE_VERSION_DIRECTORY, - dbNode, relNode); + dbNode, relNode, markstr); } else { if (forkNumber != MAIN_FORKNUM) - path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s", + path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s", spcNode, TABLESPACE_VERSION_DIRECTORY, dbNode, backendId, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u", + path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s", spcNode, TABLESPACE_VERSION_DIRECTORY, - dbNode, backendId, relNode); + dbNode, backendId, relNode, markstr); } } + return path; } diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h index 0ab32b44e9..584ebac391 100644 --- a/src/include/catalog/storage.h +++ b/src/include/catalog/storage.h @@ -23,6 +23,8 @@ extern int wal_skip_threshold; extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence); +extern void RelationCreateInitFork(Relation rel); +extern void RelationDropInitFork(Relation rel); extern void RelationDropStorage(Relation rel); extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit); extern void RelationPreTruncate(Relation rel); @@ -41,6 +43,7 @@ extern void RestorePendingSyncs(char *startAddress); extern void smgrDoPendingDeletes(bool isCommit); extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker); extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr); +extern void smgrDoPendingCleanups(bool isCommit); extern void AtSubCommit_smgr(void); extern void AtSubAbort_smgr(void); extern void PostPrepare_smgr(void); diff --git 
a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index f0814f1458..12346ed7f6 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -18,17 +18,23 @@
 #include "lib/stringinfo.h"
 #include "storage/block.h"
 #include "storage/relfilenode.h"
+#include "storage/smgr.h"
 
 /*
  * Declarations for smgr-related XLOG records
  *
- * Note: we log file creation and truncation here, but logging of deletion
- * actions is handled by xact.c, because it is part of transaction commit.
+ * Note: we log file creation, truncation and buffer persistence change here,
+ * but logging of deletion actions is handled mainly by xact.c, because it is
+ * part of transaction commit in most cases.  However, there's a case where
+ * init forks are deleted outside control of transaction.
  */
 
 /* XLOG gives us high 4 bits */
 #define XLOG_SMGR_CREATE	0x10
 #define XLOG_SMGR_TRUNCATE	0x20
+#define XLOG_SMGR_UNLINK	0x30
+#define XLOG_SMGR_MARK		0x40
+#define XLOG_SMGR_BUFPERSISTENCE	0x50
 
 typedef struct xl_smgr_create
 {
@@ -36,6 +42,32 @@ typedef struct xl_smgr_create
 	ForkNumber	forkNum;
 } xl_smgr_create;
 
+typedef struct xl_smgr_unlink
+{
+	RelFileNode rnode;
+	ForkNumber	forkNum;
+} xl_smgr_unlink;
+
+typedef enum smgr_mark_action
+{
+	XLOG_SMGR_MARK_CREATE = 'c',
+	XLOG_SMGR_MARK_UNLINK = 'u'
+} smgr_mark_action;
+
+typedef struct xl_smgr_mark
+{
+	RelFileNode rnode;
+	ForkNumber	forkNum;
+	StorageMarks mark;
+	smgr_mark_action action;
+} xl_smgr_mark;
+
+typedef struct xl_smgr_bufpersistence
+{
+	RelFileNode rnode;
+	bool		persistence;
+} xl_smgr_bufpersistence;
+
 /* flags for xl_smgr_truncate */
 #define SMGR_TRUNCATE_HEAP		0x0001
 #define SMGR_TRUNCATE_VM		0x0002
@@ -51,6 +83,12 @@ typedef struct xl_smgr_truncate
 } xl_smgr_truncate;
 
 extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum);
+extern void log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum,
+							   StorageMarks mark);
+extern void log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum,
+							   StorageMarks mark);
+extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence);
 
 extern void smgr_redo(XLogReaderState *record);
 extern void smgr_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h
index a44be11ca0..106a5cf508 100644
--- a/src/include/common/relpath.h
+++ b/src/include/common/relpath.h
@@ -67,7 +67,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork);
 extern char *GetDatabasePath(Oid dbNode, Oid spcNode);
 
 extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
-							 int backendId, ForkNumber forkNumber);
+							 int backendId, ForkNumber forkNumber, char mark);
 
 /*
  * Wrapper macros for GetRelationPath.  Beware of multiple
@@ -77,7 +77,7 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
 /* First argument is a RelFileNode */
 #define relpathbackend(rnode, backend, forknum) \
 	GetRelationPath((rnode).dbNode, (rnode).spcNode, (rnode).relNode, \
-					backend, forknum)
+					backend, forknum, 0)
 
 /* First argument is a RelFileNode */
 #define relpathperm(rnode, forknum) \
@@ -87,4 +87,9 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode,
 #define relpath(rnode, forknum) \
 	relpathbackend((rnode).node, (rnode).backend, forknum)
 
+/* First argument is a RelFileNodeBackend */
+#define markpath(rnode, forknum, mark) \
+	GetRelationPath((rnode).node.dbNode, (rnode).node.spcNode, \
+					(rnode).node.relNode, \
+					(rnode).backend, forknum, mark)
 #endif							/* RELPATH_H */
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index cfce23ecbc..f5a7df87a4 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -206,6 +206,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels)
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum,
 								   int nforks, BlockNumber *firstDelBlock);
+extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel,
+										  bool permanent, bool isRedo);
 extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes);
 extern void DropDatabaseBuffers(Oid dbid);
 
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 34602ae006..2dc0357ad5 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -185,6 +185,7 @@ extern ssize_t pg_pwritev_with_retry(int fd,
 extern int	pg_truncate(const char *path, off_t length);
 extern void fsync_fname(const char *fname, bool isdir);
 extern int	fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
+extern int	fsync_parent_path(const char *fname, int elevel);
 extern int	durable_rename(const char *oldfile, const char *newfile, int loglevel);
 extern int	durable_unlink(const char *fname, int loglevel);
 extern int	durable_rename_excl(const char *oldfile, const char *newfile, int loglevel);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 752b440864..99620816b5 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -23,6 +23,10 @@
 extern void mdinit(void);
 extern void mdopen(SMgrRelation reln);
 extern void mdclose(SMgrRelation reln, ForkNumber forknum);
+extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum,
+						 StorageMarks mark, bool isRedo);
+extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum,
+						 StorageMarks mark, bool isRedo);
 extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
 extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
@@ -41,12 +45,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
 					   BlockNumber nblocks);
 extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
 
+extern void ForgetRelationForkSyncRequests(RelFileNodeBackend rnode,
+										   ForkNumber forknum);
 extern void ForgetDatabaseSyncRequests(Oid dbid);
 extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
 
 /* md sync callbacks */
 extern int	mdsyncfiletag(const FileTag *ftag, char *path);
-extern int	mdunlinkfiletag(const FileTag *ftag, char *path);
+extern int	mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark);
 extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate);
 
 #endif							/* MD_H */
diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h
index fad1e5c473..e1f97e9b89 100644
--- a/src/include/storage/reinit.h
+++ b/src/include/storage/reinit.h
@@ -16,13 +16,15 @@
 #define REINIT_H
 
 #include "common/relpath.h"
-
+#include "storage/smgr.h"
 
 extern void ResetUnloggedRelations(int op);
-extern bool parse_filename_for_nontemp_relation(const char *name,
-												int *oidchars, ForkNumber *fork);
+extern bool parse_filename_for_nontemp_relation(const char *name, int *oidchars,
+												ForkNumber *fork,
+												StorageMarks *mark);
 
 #define UNLOGGED_RELATION_CLEANUP		0x0001
-#define UNLOGGED_RELATION_INIT			0x0002
+#define UNLOGGED_RELATION_DROP_BUFFER	0x0002
+#define UNLOGGED_RELATION_INIT			0x0004
 
 #endif							/* REINIT_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a6fbf7b6a6..201ecace8a 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,18 @@
 #include "storage/block.h"
 #include "storage/relfilenode.h"
 
+/*
+ * Storage marks is a file of which existence suggests something about a
+ * file. The name of such files is "<filename>.<mark>", where the mark is one
+ * of the values of StorageMarks. Since ".<digit>" means segment files so don't
+ * use digits for the mark character.
+ */
+typedef enum StorageMarks
+{
+	SMGR_MARK_NONE = 0,
+	SMGR_MARK_UNCOMMITTED = 'u'	/* the file is not committed yet */
+} StorageMarks;
+
 /*
  * smgr.c maintains a table of SMgrRelation objects, which are essentially
  * cached file handles.  An SMgrRelation is created (if not already present)
@@ -85,7 +97,12 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
 extern void smgrclose(SMgrRelation reln);
 extern void smgrcloseall(void);
 extern void smgrclosenode(RelFileNodeBackend rnode);
+extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum,
+						   StorageMarks mark, bool isRedo);
+extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum,
+						   StorageMarks mark, bool isRedo);
 extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
 extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/test/recovery/t/027_persistence_change.pl b/src/test/recovery/t/027_persistence_change.pl
new file mode 100644
index 0000000000..c2f7076ea9
--- /dev/null
+++ b/src/test/recovery/t/027_persistence_change.pl
@@ -0,0 +1,247 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test relation persistence change
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use Test::More tests => 30;
+use IPC::Run qw(pump finish timer);
+use Config;
+
+my $data_unit = 2000;
+
+# Initialize primary node.
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+# we don't want checkpointing
+$node->append_conf('postgresql.conf', qq(
+checkpoint_timeout = '24h'
+));
+$node->start;
+create($node);
+
+my $relfilenodes1 = relfilenodes();
+
+# correctly recover empty tables
+$node->stop('immediate');
+$node->start;
+insert($node, 0, $data_unit);
+
+# data persists after a crash
+$node->stop('immediate');
+$node->start;
+checkdataloss($data_unit, 'crash logged 1');
+
+set_unlogged($node);
+# SET UNLOGGED didn't change relfilenode
+my $relfilenodes2 = relfilenodes();
+checkrelfilenodes($relfilenodes1, $relfilenodes2, 'logged->unlogged');
+
+# data cleanly vanishes after a crash
+$node->stop('immediate');
+$node->start;
+checkdataloss(0, 'crash unlogged');
+
+insert($node, 0, $data_unit);
+set_logged($node);
+
+$node->stop('immediate');
+$node->start;
+# SET LOGGED didn't change relfilenode and data survive a crash
+my $relfilenodes3 = relfilenodes();
+checkrelfilenodes($relfilenodes2, $relfilenodes3, 'unlogged->logged');
+checkdataloss($data_unit, 'crash logged 2');
+
+# unlogged insert -> graceful stop
+set_unlogged($node);
+insert($node, $data_unit, $data_unit, 0);
+$node->stop;
+$node->start;
+checkdataloss($data_unit * 2, 'unlogged graceful restart');
+
+# crash during transaction
+set_logged($node);
+$node->stop('immediate');
+$node->start;
+insert($node, $data_unit * 2, $data_unit);
+
+my $h = insert($node, $data_unit * 3, $data_unit, 1); ## this is aborted
+$node->stop('immediate');
+
+# finishing $h stalls this case, just tear it off.
+$h = undef;
+
+# check if indexes are working
+$node->start;
+# drop first half of data to reduce run time
+$node->safe_psql('postgres', 'DELETE FROM t WHERE bt < ' . $data_unit * 2);
+check($node, $data_unit * 2, $data_unit * 3 - 1, 'final check');
+
+sub create
+{
+	my ($node) = @_;
+
+	$node->psql('postgres', qq(
+	CREATE TABLE t (bt int, gin int[], gist point, hash int,
+					brin int, spgist point);
+	CREATE INDEX i_bt ON t USING btree (bt);
+	CREATE INDEX i_gin ON t USING gin (gin);
+	CREATE INDEX i_gist ON t USING gist (gist);
+	CREATE INDEX i_hash ON t USING hash (hash);
+	CREATE INDEX i_brin ON t USING brin (brin);
+	CREATE INDEX i_spgist ON t USING spgist (spgist);));
+}
+
+
+sub insert
+{
+	my ($node, $st, $num, $interactive) = @_;
+	my $ed = $st + $num - 1;
+	my $query = qq(BEGIN;
+INSERT INTO t
+  (SELECT i, ARRAY[i, i * 2], point(i, i * 2), i, i, point(i, i)
+   FROM generate_series($st, $ed) i);
+);
+
+	if ($interactive)
+	{
+		my $in = '';
+		my $out = '';
+		my $timer = timer(10);
+
+		my $h = $node->interactive_psql('postgres', \$in, \$out, $timer);
+		like($out, qr/psql/, "print startup banner");
+
+		$timer->start(10);
+		$in .= $query . "SELECT 'END';\n";
+		pump $h until ($out =~ /\nEND/ || $timer->is_expired);
+		ok(($out =~ /\nEND/ && !$timer->is_expired), "inserted-$st-$num");
+		return $h
+		# the transaction is not terminated
+	}
+	else
+	{
+		$node->psql('postgres', $query . "COMMIT;");
+		return undef;
+	}
+}
+
+sub check
+{
+	my ($node, $st, $ed, $head) = @_;
+	my $num_data = $ed - $st + 1;
+
+	is($node->safe_psql('postgres', qq(
+	SET enable_seqscan TO true;
+	SET enable_indexscan TO false;
+	SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+	WHERE bt = i)),
+	   $num_data, "$head: heap is not broken");
+	is($node->safe_psql('postgres', qq(
+	SET enable_seqscan TO false;
+	SET enable_indexscan TO true;
+	SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+	WHERE bt = i)),
+	   $num_data, "$head: btree is not broken");
+	is($node->safe_psql('postgres', qq(
+	SET enable_seqscan TO false;
+	SET enable_indexscan TO true;
+	SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+	WHERE gin = ARRAY[i, i * 2];)),
+	   $num_data, "$head: gin is not broken");
+	is($node->safe_psql('postgres', qq(
+	SET enable_seqscan TO false;
+	SET enable_indexscan TO true;
+	SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+	WHERE gist <@ box(point(i-0.5, i*2-0.5),point(i+0.5, i*2+0.5));)),
+	   $num_data, "$head: gist is not broken");
+	is($node->safe_psql('postgres', qq(
+	SET enable_seqscan TO false;
+	SET enable_indexscan TO true;
+	SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+	WHERE hash = i;)),
+	   $num_data, "$head: hash is not broken");
+	is($node->safe_psql('postgres', qq(
+	SET enable_seqscan TO false;
+	SET enable_indexscan TO true;
+	SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+	WHERE brin = i;)),
+	   $num_data, "$head: brin is not broken");
+	is($node->safe_psql('postgres', qq(
+	SET enable_seqscan TO false;
+	SET enable_indexscan TO true;
+	SELECT COUNT(*) FROM t, generate_series($st, $ed) i
+	WHERE spgist <@ box(point(i-0.5,i-0.5),point(i+0.5,i+0.5));)),
+	   $num_data, "$head: spgist is not broken");
+}
+
+sub set_unlogged
+{
+	my ($node) = @_;
+
+	$node->psql('postgres', qq(
+	ALTER TABLE t SET UNLOGGED;
+));
+}
+
+sub set_logged
+{
+	my ($node) = @_;
+
+	$node->psql('postgres', qq(
+	ALTER TABLE t SET LOGGED;
+));
+}
+
+sub relfilenodes
+{
+	my $result = $node->safe_psql('postgres', qq{
+	SELECT relname, relfilenode FROM pg_class
+	WHERE relname
+	IN ('t', 'i_bt','i_gin','i_gist','i_hash','i_brin','i_spgist');});
+
+	my %relfilenodes;
+
+	foreach my $l (split(/\n/, $result))
+	{
+		die "unexpected format: $l" if ($l !~ /^([^|]+)\|([0-9]+)$/);
+		$relfilenodes{$1} = $2;
+	}
+
+	# the number must correspond to the in list above
+	is (scalar %relfilenodes, 7, "number of relations is correct");
+
+	return \%relfilenodes;
+}
+
+sub checkrelfilenodes
+{
+	my ($rnodes1, $rnodes2, $s) = @_;
+
+	foreach my $n (keys %{$rnodes1})
+	{
+		if ($n eq 'i_gist')
+		{
+			# persistence of GiST index is not changed in-place
+			isnt($rnodes1->{$n}, $rnodes2->{$n},
+				 "$s: relfilenode is changed: $n");
+		}
+		else
+		{
+			# otherwise all relations are processed in-place
+			is($rnodes1->{$n}, $rnodes2->{$n},
+			   "$s: relfilenode is not changed: $n");
+		}
+	}
+}
+
+sub checkdataloss
+{
+	my ($expected, $s) = @_;
+
+	is($node->safe_psql('postgres', "SELECT count(*) FROM t;"), $expected,
+	   "$s: data in table t is in the expected state");
+}
-- 
2.27.0

From f7a23cafbbdbca874ac5ecdbc15360d0408de160 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 23:21:09 +0900
Subject: [PATCH v14 2/2] New command ALTER TABLE ALL IN TABLESPACE SET LOGGED/UNLOGGED

To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes
relation persistence of all tables in the specified tablespace.
---
 src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++++++++++
 src/backend/nodes/copyfuncs.c    |  16 ++++
 src/backend/nodes/equalfuncs.c   |  15 ++++
 src/backend/parser/gram.y        |  20 +++++
 src/backend/tcop/utility.c       |  11 +++
 src/include/commands/tablecmds.h |   2 +
 src/include/nodes/nodes.h        |   1 +
 src/include/nodes/parsenodes.h   |   9 ++
 8 files changed, 214 insertions(+)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 848fda40ca..9aa263db65 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14509,6 +14509,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt)
 	return new_tablespaceoid;
 }
 
+/*
+ * Alter Table ALL ... SET LOGGED/UNLOGGED
+ *
+ * Allows a user to change persistence of all objects in a given tablespace in
+ * the current database.  Objects can be chosen based on the owner of the
+ * object also, to allow users to change persistence of only their objects.  The
+ * main permissions handling is done by the lower-level change persistence
+ * function.
+ *
+ * All to-be-modified objects are locked first.  If NOWAIT is specified and the
+ * lock can't be acquired then we ereport(ERROR).
+ */
+void
+AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt)
+{
+	List	   *relations = NIL;
+	ListCell   *l;
+	ScanKeyData key[1];
+	Relation	rel;
+	TableScanDesc scan;
+	HeapTuple	tuple;
+	Oid			tablespaceoid;
+	List	   *role_oids = roleSpecsToIds(NIL);
+
+	/* Ensure we were not asked to change something we can't */
+	if (stmt->objtype != OBJECT_TABLE)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("only tables can be specified")));
+
+	/* Get the tablespace OID */
+	tablespaceoid = get_tablespace_oid(stmt->tablespacename, false);
+
+	/*
+	 * Now that the checks are done, check if we should set either to
+	 * InvalidOid because it is our database's default tablespace.
+	 */
+	if (tablespaceoid == MyDatabaseTableSpace)
+		tablespaceoid = InvalidOid;
+
+	/*
+	 * Walk the list of objects in the tablespace to pick up them. This will
+	 * only find objects in our database, of course.
+	 */
+	ScanKeyInit(&key[0],
+				Anum_pg_class_reltablespace,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(tablespaceoid));
+
+	rel = table_open(RelationRelationId, AccessShareLock);
+	scan = table_beginscan_catalog(rel, 1, key);
+	while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	{
+		Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple);
+		Oid			relOid = relForm->oid;
+
+		/*
+		 * Do not pick-up objects in pg_catalog as part of this, if an admin
+		 * really wishes to do so, they can issue the individual ALTER
+		 * commands directly.
+		 *
+		 * Also, explicitly avoid any shared tables, temp tables, or TOAST
+		 * (TOAST will be changed with the main table).
+		 */
+		if (IsCatalogNamespace(relForm->relnamespace) ||
+			relForm->relisshared ||
+			isAnyTempNamespace(relForm->relnamespace) ||
+			IsToastNamespace(relForm->relnamespace))
+			continue;
+
+		/* Only pick up the object type requested */
+		if (relForm->relkind != RELKIND_RELATION)
+			continue;
+
+		/* Check if we are only picking-up objects owned by certain roles */
+		if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner))
+			continue;
+
+		/*
+		 * Handle permissions-checking here since we are locking the tables
+		 * and also to avoid doing a bunch of work only to fail part-way. Note
+		 * that permissions will also be checked by AlterTableInternal().
+		 *
+		 * Caller must be considered an owner on the table of which we're going
+		 * to change persistence.
+		 */
+		if (!pg_class_ownercheck(relOid, GetUserId()))
+			aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)),
+						   NameStr(relForm->relname));
+
+		if (stmt->nowait &&
+			!ConditionalLockRelationOid(relOid, AccessExclusiveLock))
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_IN_USE),
+					 errmsg("aborting because lock on relation \"%s.%s\" is not available",
+							get_namespace_name(relForm->relnamespace),
+							NameStr(relForm->relname))));
+		else
+			LockRelationOid(relOid, AccessExclusiveLock);
+
+		/*
+		 * Add to our list of objects of which we're going to change
+		 * persistence.
+		 */
+		relations = lappend_oid(relations, relOid);
+	}
+
+	table_endscan(scan);
+	table_close(rel, AccessShareLock);
+
+	if (relations == NIL)
+		ereport(NOTICE,
+				(errcode(ERRCODE_NO_DATA_FOUND),
+				 errmsg("no matching relations in tablespace \"%s\" found",
+						tablespaceoid == InvalidOid ? "(database default)" :
+						get_tablespace_name(tablespaceoid))));
+
+	/*
+	 * Everything is locked, loop through and change persistence of all of the
+	 * relations.
+	 */
+	foreach(l, relations)
+	{
+		List	   *cmds = NIL;
+		AlterTableCmd *cmd = makeNode(AlterTableCmd);
+
+		if (stmt->logged)
+			cmd->subtype = AT_SetLogged;
+		else
+			cmd->subtype = AT_SetUnLogged;
+
+		cmds = lappend(cmds, cmd);
+
+		EventTriggerAlterTableStart((Node *) stmt);
+		/* OID is set by AlterTableInternal */
+		AlterTableInternal(lfirst_oid(l), cmds, false);
+		EventTriggerAlterTableEnd();
+	}
+}
+
 static void
 index_copy_data(Relation rel, RelFileNode newrnode)
 {
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 18e778e856..51b6ad757f 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4270,6 +4270,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from)
 	return newnode;
 }
 
+static AlterTableSetLoggedAllStmt *
+_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from)
+{
+	AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt);
+
+	COPY_STRING_FIELD(tablespacename);
+	COPY_SCALAR_FIELD(objtype);
+	COPY_SCALAR_FIELD(logged);
+	COPY_SCALAR_FIELD(nowait);
+
+	return newnode;
+}
+
 static CreateExtensionStmt *
 _copyCreateExtensionStmt(const CreateExtensionStmt *from)
 {
@@ -5623,6 +5636,9 @@ copyObjectImpl(const void *from)
 		case T_AlterTableMoveAllStmt:
 			retval = _copyAlterTableMoveAllStmt(from);
 			break;
+		case T_AlterTableSetLoggedAllStmt:
+			retval = _copyAlterTableSetLoggedAllStmt(from);
+			break;
 		case T_CreateExtensionStmt:
 			retval = _copyCreateExtensionStmt(from);
 			break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index cb7ddd463c..a19b7874d7 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -1916,6 +1916,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a,
 	return true;
 }
 
+static bool
+_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a,
+								 const AlterTableSetLoggedAllStmt *b)
+{
+	COMPARE_STRING_FIELD(tablespacename);
+	COMPARE_SCALAR_FIELD(objtype);
+	COMPARE_SCALAR_FIELD(logged);
+	COMPARE_SCALAR_FIELD(nowait);
+
+	return true;
+}
+
 static bool
 _equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b)
 {
@@ -3625,6 +3637,9 @@ equal(const void *a, const void *b)
 		case T_AlterTableMoveAllStmt:
 			retval = _equalAlterTableMoveAllStmt(a, b);
 			break;
+		case T_AlterTableSetLoggedAllStmt:
+			retval = _equalAlterTableSetLoggedAllStmt(a, b);
+			break;
 		case T_CreateExtensionStmt:
 			retval = _equalCreateExtensionStmt(a, b);
 			break;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 6dddc07947..a55ea302c1 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -1984,6 +1984,26 @@ AlterTableStmt:
 					n->nowait = $13;
 					$$ = (Node *)n;
 				}
+		|	ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait
+				{
+					AlterTableSetLoggedAllStmt *n =
+						makeNode(AlterTableSetLoggedAllStmt);
+					n->tablespacename = $6;
+					n->objtype = OBJECT_TABLE;
+					n->logged = true;
+					n->nowait = $9;
+					$$ = (Node *)n;
+				}
+		|	ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait
+				{
+					AlterTableSetLoggedAllStmt *n =
+						makeNode(AlterTableSetLoggedAllStmt);
+					n->tablespacename = $6;
+					n->objtype = OBJECT_TABLE;
+					n->logged = false;
+					n->nowait = $9;
+					$$ = (Node *)n;
+				}
 		|	ALTER INDEX qualified_name alter_table_cmds
 				{
 					AlterTableStmt *n = makeNode(AlterTableStmt);
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 1fbc387d47..1483f9a475 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -162,6 +162,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree)
 		case T_AlterTSConfigurationStmt:
 		case T_AlterTSDictionaryStmt:
 		case T_AlterTableMoveAllStmt:
+		case T_AlterTableSetLoggedAllStmt:
 		case T_AlterTableSpaceOptionsStmt:
 		case T_AlterTableStmt:
 		case T_AlterTypeStmt:
@@ -1747,6 +1748,12 @@ ProcessUtilitySlow(ParseState *pstate,
 				commandCollected = true;
 				break;
 
+			case T_AlterTableSetLoggedAllStmt:
+				AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree);
+				/* commands are stashed in AlterTableSetLoggedAll */
+				commandCollected = true;
+				break;
+
 			case T_DropStmt:
 				ExecDropStmt((DropStmt *) parsetree, isTopLevel);
 				/* no commands stashed for DROP */
@@ -2669,6 +2676,10 @@ CreateCommandTag(Node *parsetree)
 			tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype);
 			break;
 
+		case T_AlterTableSetLoggedAllStmt:
+			tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype);
+			break;
+
 		case T_AlterTableStmt:
 			tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype);
 			break;
diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h
index 336549cc5f..714077ff4c 100644
--- a/src/include/commands/tablecmds.h
+++ b/src/include/commands/tablecmds.h
@@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse);
 
 extern Oid	AlterTableMoveAll(AlterTableMoveAllStmt *stmt);
 
+extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt);
+
 extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt,
 										 Oid *oldschema);
 
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 7c657c1241..8860b2e548 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -428,6 +428,7 @@ typedef enum NodeTag
 	T_AlterCollationStmt,
 	T_CallStmt,
 	T_AlterStatsStmt,
+	T_AlterTableSetLoggedAllStmt,
 
 	/*
 	 * TAGS FOR PARSE TREE NODES (parsenodes.h)
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 593e301f7a..b9226a7cd9 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2350,6 +2350,15 @@ typedef struct AlterTableMoveAllStmt
 	bool		nowait;
 } AlterTableMoveAllStmt;
 
+typedef struct AlterTableSetLoggedAllStmt
+{
+	NodeTag		type;
+	char	   *tablespacename;
+	ObjectType	objtype;		/* Object type to move */
+	bool		logged;
+	bool		nowait;
+} AlterTableSetLoggedAllStmt;
+
 /* ----------------------
  *		Create/Alter Extension Statements
  * ----------------------
-- 
2.27.0
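
(Editorial sketch, not part of the posted patch.) The mark-file naming rule stated in the smgr.h comment of patch 1/2, "<filename>.<mark>" with digit suffixes reserved for segment files, can be illustrated with a small standalone C fragment. The helper names build_mark_path and classify_suffix and the SuffixKind enum are hypothetical and exist only for this illustration; the patch itself does this work through GetRelationPath() and parse_filename_for_nontemp_relation(), whose bodies are not shown in this part of the thread. The sketch only looks at the text after the last '.', which is an assumption about how marks and segment numbers combine.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Same values as the StorageMarks enum added to smgr.h by the patch. */
typedef enum StorageMarks
{
    SMGR_MARK_NONE = 0,
    SMGR_MARK_UNCOMMITTED = 'u'
} StorageMarks;

/* Hypothetical classification of a relation file name's last suffix. */
typedef enum SuffixKind
{
    SUFFIX_NONE,      /* no '.' at all: first segment of a fork */
    SUFFIX_SEGMENT,   /* ".<digits>": an additional segment file */
    SUFFIX_MARK,      /* ".u": an "uncommitted" mark file */
    SUFFIX_INVALID    /* anything else */
} SuffixKind;

/*
 * Hypothetical helper: write "<relpath>.<mark>" into dst, or just
 * "<relpath>" when mark is SMGR_MARK_NONE.  Returns 0 on success and
 * -1 if the result does not fit into dst.
 */
static int
build_mark_path(char *dst, size_t dstlen, const char *relpath,
                StorageMarks mark)
{
    int n;

    assert(dst != NULL && relpath != NULL);

    if (mark == SMGR_MARK_NONE)
        n = snprintf(dst, dstlen, "%s", relpath);
    else
        n = snprintf(dst, dstlen, "%s.%c", relpath, (char) mark);

    return (n < 0 || (size_t) n >= dstlen) ? -1 : 0;
}

/*
 * Hypothetical helper: decide what the text after the last '.' in a
 * file name means.  Digits are segment numbers, so a mark character
 * must never be a digit.
 */
static SuffixKind
classify_suffix(const char *fname)
{
    const char *dot = strrchr(fname, '.');
    const char *p;

    if (dot == NULL)
        return SUFFIX_NONE;
    if (dot[1] == '\0')
        return SUFFIX_INVALID;

    /* An all-digit suffix is a segment number. */
    for (p = dot + 1; *p != '\0'; p++)
    {
        if (!isdigit((unsigned char) *p))
            break;
    }
    if (*p == '\0')
        return SUFFIX_SEGMENT;

    /* A single known mark character is a mark file. */
    if (dot[1] == SMGR_MARK_UNCOMMITTED && dot[2] == '\0')
        return SUFFIX_MARK;

    return SUFFIX_INVALID;
}
```

With these helpers, a relation file "base/12759/16384" and the mark 'u' would yield "base/12759/16384.u", while "16384.1" is classified as a segment file rather than a mark file.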
Hi,

On January 5, 2022 8:30:17 PM PST, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>At Tue, 4 Jan 2022 16:05:08 -0800, Andres Freund <andres@anarazel.de> wrote in
>> The TAP tests seem to fail on all platforms. See
>> https://cirrus-ci.com/build/4911549314760704
>>
>> E.g. the linux failure is
>>
>> [16:45:15.569]
>> [16:45:15.569] # Failed test 'inserted'
>> [16:45:15.569] # at t/027_persistence_change.pl line 121.
>> [16:45:15.569] # Looks like you failed 1 test of 25.
>> [16:45:15.569] [16:45:15] t/027_persistence_change.pl ..........
>> [16:45:15.569] Dubious, test returned 1 (wstat 256, 0x100)
>> [16:45:15.569] Failed 1/25 subtests
>> [16:45:15.569] [16:45:15]
>> [16:45:15.569]
>> [16:45:15.569] Test Summary Report
>> [16:45:15.569] -------------------
>> [16:45:15.569] t/027_persistence_change.pl (Wstat: 256 Tests: 25 Failed: 1)
>> [16:45:15.569] Failed test: 18
>> [16:45:15.569] Non-zero exit status: 1
>> [16:45:15.569] Files=27, Tests=315, 220 wallclock secs ( 0.14 usr 0.03 sys + 48.94 cusr 17.13 csys = 66.24 CPU)
>>
>> https://api.cirrus-ci.com/v1/artifact/task/4785083130314752/tap/src/test/recovery/tmp_check/log/regress_log_027_persistence_change
>
>Thank you very much. It still doesn't fail in my development
>environment (CentOS8), but I found a silly bug in the test script.
>I'm still not sure why that test item failed, but I'll repost the
>updated version and watch what the CI says.

Fwiw, you can now test the same way as cfbot does, with a lower turnaround time, as explained in src/tools/ci/README.
At Wed, 05 Jan 2022 20:42:32 -0800, Andres Freund <andres@anarazel.de> wrote in
> Hi,
>
> On January 5, 2022 8:30:17 PM PST, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> >I'm still not sure why that test item failed, but I'll repost the
> >updated version and watch what the CI says.
>
> Fwiw, you can now test the same way as cfbot does, with a lower turnaround time, as explained in src/tools/ci/README.

Fantastic! I'll give it a try. Thanks!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
At Thu, 06 Jan 2022 16:39:21 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> Fantastic! I'll give it a try. Thanks!

I did that and found that the test stumbled on newlines. The tests
succeeded on every platform other than Windows. The Windows run fails
because of a real, known issue.

[7916][postmaster] LOG: received immediate shutdown request
[7916][postmaster] LOG: database system is shut down
[6228][postmaster] LOG: starting PostgreSQL 15devel, compiled by Visual C++ build 1929, 64-bit
[6228][postmaster] LOG: listening on Unix socket "C:/Users/ContainerAdministrator/AppData/Local/Temp/NcMnt2KTsr/.s.PGSQL.58698"
[2948][startup] LOG: database system was interrupted; last known up at 2022-01-07 07:12:14 GMT
[2948][startup] LOG: database system was not properly shut down; automatic recovery in progress
[2948][startup] LOG: redo starts at 0/1484280
[2948][startup] LOG: invalid record length at 0/14A47B8: wanted 24, got 0
[2948][startup] FATAL: could not remove file "base/12759/16384.u": Permission denied
[6228][postmaster] LOG: startup process (PID 2948) exited with exit code 1

Hmm. Is someone still holding the file open after the restart?

Anyway, I am posting the fixed version. It still fails on Windows.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

From 48527df0c7d094a8ca7cc8d0c90df02bfd7c2614 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v15 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, it currently runs a heap rewrite, which causes a large amount
of file I/O.  This patch makes the command run without a heap rewrite.
In addition, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of a massive number of HEAP_INSERTs, which should be
smaller.

This also allows cleanup of files left behind when the transaction that
created them crashes.
--- src/backend/access/rmgrdesc/smgrdesc.c | 52 ++ src/backend/access/transam/README | 8 + src/backend/access/transam/xact.c | 7 + src/backend/access/transam/xlog.c | 17 + src/backend/catalog/storage.c | 545 +++++++++++++++++- src/backend/commands/tablecmds.c | 266 +++++++-- src/backend/replication/basebackup.c | 3 +- src/backend/storage/buffer/bufmgr.c | 88 +++ src/backend/storage/file/fd.c | 4 +- src/backend/storage/file/reinit.c | 344 +++++++---- src/backend/storage/smgr/md.c | 93 ++- src/backend/storage/smgr/smgr.c | 32 + src/backend/storage/sync/sync.c | 20 +- src/bin/pg_rewind/parsexlog.c | 24 + src/common/relpath.c | 47 +- src/include/catalog/storage.h | 3 + src/include/catalog/storage_xlog.h | 42 +- src/include/common/relpath.h | 9 +- src/include/storage/bufmgr.h | 2 + src/include/storage/fd.h | 1 + src/include/storage/md.h | 8 +- src/include/storage/reinit.h | 10 +- src/include/storage/smgr.h | 17 + src/test/recovery/t/027_persistence_change.pl | 247 ++++++++ 24 files changed, 1707 insertions(+), 182 deletions(-) create mode 100644 src/test/recovery/t/027_persistence_change.pl diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c index 7755553d57..d251f22207 100644 --- a/src/backend/access/rmgrdesc/smgrdesc.c +++ b/src/backend/access/rmgrdesc/smgrdesc.c @@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record) xlrec->blkno, xlrec->flags); pfree(path); } + else if (info == XLOG_SMGR_UNLINK) + { + xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec; + char *path = relpathperm(xlrec->rnode, xlrec->forkNum); + + appendStringInfoString(buf, path); + pfree(path); + } + else if (info == XLOG_SMGR_MARK) + { + xl_smgr_mark *xlrec = (xl_smgr_mark *) rec; + char *path = GetRelationPath(xlrec->rnode.dbNode, + xlrec->rnode.spcNode, + xlrec->rnode.relNode, + InvalidBackendId, + xlrec->forkNum, xlrec->mark); + char *action; + + switch (xlrec->action) + { + case XLOG_SMGR_MARK_CREATE: + action = "CREATE"; + break; + case 
XLOG_SMGR_MARK_UNLINK: + action = "DELETE"; + break; + default: + action = "<unknown action>"; + break; + } + + appendStringInfo(buf, "%s %s", action, path); + pfree(path); + } + else if (info == XLOG_SMGR_BUFPERSISTENCE) + { + xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec; + char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM); + + appendStringInfoString(buf, path); + appendStringInfo(buf, " persistence %d", xlrec->persistence); + pfree(path); + } } const char * @@ -55,6 +98,15 @@ smgr_identify(uint8 info) case XLOG_SMGR_TRUNCATE: id = "TRUNCATE"; break; + case XLOG_SMGR_UNLINK: + id = "UNLINK"; + break; + case XLOG_SMGR_MARK: + id = "MARK"; + break; + case XLOG_SMGR_BUFPERSISTENCE: + id = "BUFPERSISTENCE"; + break; } return id; diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README index 1edc8180c1..b344bbe511 100644 --- a/src/backend/access/transam/README +++ b/src/backend/access/transam/README @@ -724,6 +724,14 @@ we must panic and abort recovery. The DBA will have to manually clean up then restart recovery. This is part of the reason for not writing a WAL entry until we've successfully done the original action. +The Smgr MARK files +-------------------------------- + +An smgr mark file is created when a new relation file is created to +mark the relfilenode needs to be cleaned up at recovery time. In +contrast to the four actions above, failure to remove smgr mark files +will lead to data loss, in which case the server will shut down. + Skipping WAL for New RelFileNode -------------------------------- diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index e7b0bc804d..b41186d6d8 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -2197,6 +2197,9 @@ CommitTransaction(void) */ smgrDoPendingSyncs(true, is_parallel_worker); + /* Likewise delete mark files for files created during this transaction. 
*/ + smgrDoPendingCleanups(true); + /* close large objects before lower-level cleanup */ AtEOXact_LargeObject(true); @@ -2447,6 +2450,9 @@ PrepareTransaction(void) */ smgrDoPendingSyncs(true, false); + /* Likewise delete mark files for files created during this transaction. */ + smgrDoPendingCleanups(true); + /* close large objects before lower-level cleanup */ AtEOXact_LargeObject(true); @@ -2772,6 +2778,7 @@ AbortTransaction(void) AfterTriggerEndXact(false); /* 'false' means it's abort */ AtAbort_Portals(); smgrDoPendingSyncs(false, is_parallel_worker); + smgrDoPendingCleanups(false); AtEOXact_LargeObject(false); AtAbort_Notify(); AtEOXact_RelationMap(false, is_parallel_worker); diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 87cd05c945..243860fcb1 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -40,6 +40,7 @@ #include "catalog/catversion.h" #include "catalog/pg_control.h" #include "catalog/pg_database.h" +#include "catalog/storage.h" #include "commands/progress.h" #include "commands/tablespace.h" #include "common/controldata_utils.h" @@ -4564,6 +4565,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode, { ereport(DEBUG1, (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery"))); + + /* cleanup garbage files left during crash recovery */ + ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER | + UNLOGGED_RELATION_CLEANUP); + + /* run rollback cleanup if any */ + smgrDoPendingDeletes(false); + InArchiveRecovery = true; if (StandbyModeRequested) StandbyMode = true; @@ -7824,6 +7833,14 @@ StartupXLOG(void) } } + /* cleanup garbage files left during crash recovery */ + if (!InArchiveRecovery) + ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER | + UNLOGGED_RELATION_CLEANUP); + + /* run rollback cleanup if any */ + smgrDoPendingDeletes(false); + /* Allow resource managers to do any required cleanup. 
*/ for (rmid = 0; rmid <= RM_MAX_ID; rmid++) { diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c index c5ad28d71f..d6b30387e9 100644 --- a/src/backend/catalog/storage.c +++ b/src/backend/catalog/storage.c @@ -19,6 +19,7 @@ #include "postgres.h" +#include "access/amapi.h" #include "access/parallel.h" #include "access/visibilitymap.h" #include "access/xact.h" @@ -66,6 +67,23 @@ typedef struct PendingRelDelete struct PendingRelDelete *next; /* linked-list link */ } PendingRelDelete; +#define PCOP_UNLINK_FORK (1 << 0) +#define PCOP_UNLINK_MARK (1 << 1) +#define PCOP_SET_PERSISTENCE (1 << 2) + +typedef struct PendingCleanup +{ + RelFileNode relnode; /* relation that may need to be deleted */ + int op; /* operation mask */ + bool bufpersistence; /* buffer persistence to set */ + int unlink_forknum; /* forknum to unlink */ + StorageMarks unlink_mark; /* mark to unlink */ + BackendId backend; /* InvalidBackendId if not a temp rel */ + bool atCommit; /* T=delete at commit; F=delete at abort */ + int nestLevel; /* xact nesting level of request */ + struct PendingCleanup *next; /* linked-list link */ +} PendingCleanup; + typedef struct PendingRelSync { RelFileNode rnode; @@ -73,6 +91,7 @@ typedef struct PendingRelSync } PendingRelSync; static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */ +static PendingCleanup *pendingCleanups = NULL; /* head of linked list */ HTAB *pendingSyncHash = NULL; @@ -117,7 +136,8 @@ AddPendingSync(const RelFileNode *rnode) SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence) { - PendingRelDelete *pending; + PendingRelDelete *pendingdel; + PendingCleanup *pendingclean; SMgrRelation srel; BackendId backend; bool needs_wal; @@ -143,21 +163,41 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) return NULL; /* placate compiler */ } + /* + * We are going to create a new storage file. If server crashes before the + * current transaction ends the file needs to be cleaned up. 
The + * SMGR_MARK_UNCOMMITTED mark file prompts that work at the next startup. + */ srel = smgropen(rnode, backend); + log_smgrcreatemark(&rnode, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED); + smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false); smgrcreate(srel, MAIN_FORKNUM, false); if (needs_wal) log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM); /* Add the relation to the list of stuff to delete at abort */ - pending = (PendingRelDelete *) + pendingdel = (PendingRelDelete *) MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); - pending->relnode = rnode; - pending->backend = backend; - pending->atCommit = false; /* delete if abort */ - pending->nestLevel = GetCurrentTransactionNestLevel(); - pending->next = pendingDeletes; - pendingDeletes = pending; + pendingdel->relnode = rnode; + pendingdel->backend = backend; + pendingdel->atCommit = false; /* delete if abort */ + pendingdel->nestLevel = GetCurrentTransactionNestLevel(); + pendingdel->next = pendingDeletes; + pendingDeletes = pendingdel; + + /* drop mark files at commit */ + pendingclean = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pendingclean->relnode = rnode; + pendingclean->op = PCOP_UNLINK_MARK; + pendingclean->unlink_forknum = MAIN_FORKNUM; + pendingclean->unlink_mark = SMGR_MARK_UNCOMMITTED; + pendingclean->backend = backend; + pendingclean->atCommit = true; + pendingclean->nestLevel = GetCurrentTransactionNestLevel(); + pendingclean->next = pendingCleanups; + pendingCleanups = pendingclean; if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded()) { @@ -168,6 +208,203 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) return srel; } +/* + * RelationCreateInitFork + * Create physical storage for the init fork of a relation. + * + * Create the init fork for the relation. + * + * This function is transactional. The creation is WAL-logged, and if the + * transaction aborts later on, the init fork will be removed. 
+ */ +void +RelationCreateInitFork(Relation rel) +{ + RelFileNode rnode = rel->rd_node; + PendingCleanup *pending; + PendingCleanup *prev; + PendingCleanup *next; + SMgrRelation srel; + bool create = true; + + /* switch buffer persistence */ + SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false); + + /* + * If we have entries for init-fork operations on this relation, that means + * that we have already registered pending-delete entries to drop an + * init fork that existed before the current transaction started. This + * function reverts that change just by removing the entries. + */ + prev = NULL; + for (pending = pendingCleanups; pending != NULL; pending = next) + { + next = pending->next; + + if (RelFileNodeEquals(rnode, pending->relnode) && + pending->unlink_forknum == INIT_FORKNUM) + { + if (prev) + prev->next = next; + else + pendingCleanups = next; + + pfree(pending); + /* prev does not change */ + + create = false; + } + else + prev = pending; + } + + if (!create) + return; + + /* + * We are going to create an init fork. If the server crashes before the + * current transaction ends, the init fork left behind corrupts data during + * recovery. The mark file works as the sentinel to identify that + * situation. + */ + srel = smgropen(rnode, InvalidBackendId); + log_smgrcreatemark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED); + smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false); + + /* We don't have an existing init fork, create it. */ + smgrcreate(srel, INIT_FORKNUM, false); + + /* + * The init fork of an index needs further initialization. ambuildempty should do + * the WAL-logging and file sync by itself; otherwise we do that ourselves. 
+ */ + if (rel->rd_rel->relkind == RELKIND_INDEX) + rel->rd_indam->ambuildempty(rel); + else + { + log_smgrcreate(&rnode, INIT_FORKNUM); + smgrimmedsync(srel, INIT_FORKNUM); + } + + /* drop the init fork, mark file and revert persistence at abort */ + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = rnode; + pending->op = PCOP_UNLINK_FORK | PCOP_UNLINK_MARK | PCOP_SET_PERSISTENCE; + pending->unlink_forknum = INIT_FORKNUM; + pending->unlink_mark = SMGR_MARK_UNCOMMITTED; + pending->bufpersistence = true; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; + + /* drop mark file at commit */ + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = rnode; + pending->op = PCOP_UNLINK_MARK; + pending->unlink_forknum = INIT_FORKNUM; + pending->unlink_mark = SMGR_MARK_UNCOMMITTED; + pending->backend = InvalidBackendId; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; +} + +/* + * RelationDropInitFork + * Delete physical storage for the init fork of a relation. + * + * Register pending-delete of the init fork. The real deletion is performed by + * smgrDoPendingDeletes at commit. + * + * This function is transactional. If the transaction aborts later on, the + * deletion doesn't happen. 
+ */ +void +RelationDropInitFork(Relation rel) +{ + RelFileNode rnode = rel->rd_node; + PendingCleanup *pending; + PendingCleanup *prev; + PendingCleanup *next; + bool inxact_created = false; + + /* switch buffer persistence */ + SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false); + + /* + * If we have entries for init-fork operations of this relation, that means + * that we have created the init fork in the current transaction. We + * remove the init fork and mark file immediately in that case. Otherwise + * just register pending-delete for the existing init fork. + */ + prev = NULL; + for (pending = pendingCleanups; pending != NULL; pending = next) + { + next = pending->next; + + if (RelFileNodeEquals(rnode, pending->relnode) && + pending->unlink_forknum == INIT_FORKNUM) + { + /* unlink list entry */ + if (prev) + prev->next = next; + else + pendingCleanups = next; + + pfree(pending); + /* prev does not change */ + + inxact_created = true; + } + else + prev = pending; + } + + if (inxact_created) + { + SMgrRelation srel = smgropen(rnode, InvalidBackendId); + + /* + * INIT forks are never loaded into shared buffers, so there is no point in dropping + * buffers for such files. 
+ */ + log_smgrunlinkmark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED); + smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false); + log_smgrunlink(&rnode, INIT_FORKNUM); + smgrunlink(srel, INIT_FORKNUM, false); + return; + } + + /* register drop of this init fork file at commit */ + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = rnode; + pending->op = PCOP_UNLINK_FORK; + pending->unlink_forknum = INIT_FORKNUM; + pending->backend = InvalidBackendId; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; + + /* revert buffer-persistence changes at abort */ + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = rnode; + pending->op = PCOP_SET_PERSISTENCE; + pending->bufpersistence = false; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; +} + /* * Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL. */ @@ -187,6 +424,88 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum) XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE); } +/* + * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL. + */ +void +log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum) +{ + xl_smgr_unlink xlrec; + + /* + * Make an XLOG entry reporting the file unlink. + */ + xlrec.rnode = *rnode; + xlrec.forkNum = forkNum; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE); +} + +/* + * Perform XLogInsert of an XLOG_SMGR_CREATEMARK record to WAL. 
+ */ +void +log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum, + StorageMarks mark) +{ + xl_smgr_mark xlrec; + + /* + * Make an XLOG entry reporting the mark file creation. + */ + xlrec.rnode = *rnode; + xlrec.forkNum = forkNum; + xlrec.mark = mark; + xlrec.action = XLOG_SMGR_MARK_CREATE; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE); +} + +/* + * Perform XLogInsert of an XLOG_SMGR_MARK record with the + * XLOG_SMGR_MARK_UNLINK action to WAL. + */ +void +log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum, + StorageMarks mark) +{ + xl_smgr_mark xlrec; + + /* + * Make an XLOG entry reporting the mark file removal. + */ + xlrec.rnode = *rnode; + xlrec.forkNum = forkNum; + xlrec.mark = mark; + xlrec.action = XLOG_SMGR_MARK_UNLINK; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE); +} + +/* + * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL. + */ +void +log_smgrbufpersistence(const RelFileNode *rnode, bool persistence) +{ + xl_smgr_bufpersistence xlrec; + + /* + * Make an XLOG entry reporting the change of buffer persistence. + */ + xlrec.rnode = *rnode; + xlrec.persistence = persistence; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE); +} + /* * RelationDropStorage * Schedule unlinking of physical storage at transaction commit. @@ -255,6 +574,7 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit) prev->next = next; else pendingDeletes = next; + pfree(pending); /* prev does not change */ } @@ -673,6 +993,88 @@ smgrDoPendingDeletes(bool isCommit) } } +/* + * smgrDoPendingCleanups() -- Clean up work that emits WAL records + * + * The operations handled in this function emit WAL records, which must be + * emitted before the commit record for the current transaction. 
+ */ +void +smgrDoPendingCleanups(bool isCommit) +{ + int nestLevel = GetCurrentTransactionNestLevel(); + PendingCleanup *pending; + PendingCleanup *prev; + PendingCleanup *next; + + prev = NULL; + for (pending = pendingCleanups; pending != NULL; pending = next) + { + next = pending->next; + if (pending->nestLevel < nestLevel) + { + /* outer-level entries should not be processed yet */ + prev = pending; + } + else + { + /* unlink list entry first, so we don't retry on failure */ + if (prev) + prev->next = next; + else + pendingCleanups = next; + + /* do cleanup if called for */ + if (pending->atCommit == isCommit) + { + SMgrRelation srel; + + srel = smgropen(pending->relnode, pending->backend); + + Assert ((pending->op & + ~(PCOP_UNLINK_FORK | PCOP_UNLINK_MARK | + PCOP_SET_PERSISTENCE)) == 0); + + if (pending->op & PCOP_UNLINK_FORK) + { + /* other forks needs to drop buffers */ + Assert(pending->unlink_forknum == INIT_FORKNUM); + + /* Don't emit wal while recovery. */ + if (!InRecovery) + log_smgrunlink(&pending->relnode, + pending->unlink_forknum); + smgrunlink(srel, pending->unlink_forknum, false); + } + + if (pending->op & PCOP_UNLINK_MARK) + { + SMgrRelation srel; + + if (!InRecovery) + log_smgrunlinkmark(&pending->relnode, + pending->unlink_forknum, + pending->unlink_mark); + srel = smgropen(pending->relnode, pending->backend); + smgrunlinkmark(srel, pending->unlink_forknum, + pending->unlink_mark, InRecovery); + smgrclose(srel); + } + + if (pending->op & PCOP_SET_PERSISTENCE) + { + SetRelationBuffersPersistence(srel, pending->bufpersistence, + InRecovery); + } + } + + /* must explicitly free the list entry */ + pfree(pending); + /* prev does not change */ + } + } +} + /* * smgrDoPendingSyncs() -- Take care of relation syncs at end of xact. 
*/ @@ -933,6 +1335,15 @@ smgr_redo(XLogReaderState *record) reln = smgropen(xlrec->rnode, InvalidBackendId); smgrcreate(reln, xlrec->forkNum, true); } + else if (info == XLOG_SMGR_UNLINK) + { + xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record); + SMgrRelation reln; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + smgrunlink(reln, xlrec->forkNum, true); + smgrclose(reln); + } else if (info == XLOG_SMGR_TRUNCATE) { xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record); @@ -1021,6 +1432,124 @@ smgr_redo(XLogReaderState *record) FreeFakeRelcacheEntry(rel); } + else if (info == XLOG_SMGR_MARK) + { + xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record); + SMgrRelation reln; + PendingCleanup *pending; + bool created = false; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + + switch (xlrec->action) + { + case XLOG_SMGR_MARK_CREATE: + smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true); + created = true; + break; + case XLOG_SMGR_MARK_UNLINK: + smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true); + break; + default: + elog(ERROR, "unknown smgr_mark action \"%c\"", xlrec->mark); + } + + if (created) + { + /* revert mark file operation at abort */ + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = xlrec->rnode; + pending->op = PCOP_UNLINK_MARK; + pending->unlink_forknum = xlrec->forkNum; + pending->unlink_mark = xlrec->mark; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; + } + else + { + /* + * Delete pending action for this mark file if any. We should have + * at most one entry for this action. 
+ */ + PendingCleanup *prev = NULL; + + for (pending = pendingCleanups; pending != NULL; + pending = pending->next) + { + if (RelFileNodeEquals(xlrec->rnode, pending->relnode) && + pending->unlink_forknum == xlrec->forkNum && + (pending->op & PCOP_UNLINK_MARK) != 0) + { + if (prev) + prev->next = pending->next; + else + pendingCleanups = pending->next; + + pfree(pending); + break; + } + + prev = pending; + } + } + } + else if (info == XLOG_SMGR_BUFPERSISTENCE) + { + xl_smgr_bufpersistence *xlrec = + (xl_smgr_bufpersistence *) XLogRecGetData(record); + SMgrRelation reln; + PendingCleanup *pending; + PendingCleanup *prev = NULL; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + SetRelationBuffersPersistence(reln, xlrec->persistence, true); + + /* + * Delete pending action for persistence change if any. We should have + * at most one entry for this action. + */ + for (pending = pendingCleanups; pending != NULL; + pending = pending->next) + { + if (RelFileNodeEquals(xlrec->rnode, pending->relnode) && + (pending->op & PCOP_SET_PERSISTENCE) != 0) + { + Assert (pending->bufpersistence == xlrec->persistence); + + if (prev) + prev->next = pending->next; + else + pendingCleanups = pending->next; + + pfree(pending); + break; + } + + prev = pending; + } + + /* + * Revert buffer-persistence changes at abort if the relation is going + * to different persistence from before this transaction. 
+ */ + if (!pending) + { + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = xlrec->rnode; + pending->op = PCOP_SET_PERSISTENCE; + pending->bufpersistence = !xlrec->persistence; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; + } + } else elog(PANIC, "smgr_redo: unknown op code %u", info); } diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 89bc865e28..51fcf9ca5f 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -52,6 +52,7 @@ #include "commands/defrem.h" #include "commands/event_trigger.h" #include "commands/policy.h" +#include "commands/progress.h" #include "commands/sequence.h" #include "commands/tablecmds.h" #include "commands/tablespace.h" @@ -5346,6 +5347,187 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel, return newcmd; } +/* + * RelationChangePersistence: do in-place persistence change of a relation + */ +static void +RelationChangePersistence(AlteredTableInfo *tab, char persistence, + LOCKMODE lockmode) +{ + Relation rel; + Relation classRel; + HeapTuple tuple, + newtuple; + Datum new_val[Natts_pg_class]; + bool new_null[Natts_pg_class], + new_repl[Natts_pg_class]; + int i; + List *relids; + ListCell *lc_oid; + + Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE); + Assert(lockmode == AccessExclusiveLock); + + /* + * Under the following condition, we need to call ATRewriteTable, which + * cannot be false in the AT_REWRITE_ALTER_PERSISTENCE case. 
+ */ + Assert(tab->constraints == NULL && tab->partition_constraint == NULL && + tab->newvals == NULL && !tab->verify_new_notnull); + + rel = table_open(tab->relid, lockmode); + + Assert(rel->rd_rel->relpersistence != persistence); + + elog(DEBUG1, "perform in-place persistence change"); + + /* + * First, collect all relations whose persistence we need to change. + */ + + /* Collect OIDs of indexes and toast relations */ + relids = RelationGetIndexList(rel); + relids = lcons_oid(rel->rd_id, relids); + + /* Add toast relation if any */ + if (OidIsValid(rel->rd_rel->reltoastrelid)) + { + List *toastidx; + Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode); + + relids = lappend_oid(relids, rel->rd_rel->reltoastrelid); + toastidx = RelationGetIndexList(toastrel); + relids = list_concat(relids, toastidx); + pfree(toastidx); + table_close(toastrel, NoLock); + } + + table_close(rel, NoLock); + + /* Make changes in storage */ + classRel = table_open(RelationRelationId, RowExclusiveLock); + + foreach (lc_oid, relids) + { + Oid reloid = lfirst_oid(lc_oid); + Relation r = relation_open(reloid, lockmode); + + /* + * XXXX: Some access methods do not support an in-place persistence + * change. Specifically, GiST uses page LSNs to figure out whether a + * block has changed, where UNLOGGED GiST indexes use fake LSNs that + * are incompatible with real LSNs used for LOGGED ones. + * + * Maybe if gistGetFakeLSN behaved the same way for permanent and + * unlogged indexes, we could skip index rebuild in exchange for some + * extra WAL records emitted while it is unlogged. + * + * Check relam against a positive list so that we take this way for + * unknown AMs. 
+ */ + if (r->rd_rel->relkind == RELKIND_INDEX && + /* GiST is excluded */ + r->rd_rel->relam != BTREE_AM_OID && + r->rd_rel->relam != HASH_AM_OID && + r->rd_rel->relam != GIN_AM_OID && + r->rd_rel->relam != SPGIST_AM_OID && + r->rd_rel->relam != BRIN_AM_OID) + { + int reindex_flags; + ReindexParams params = {0}; + + /* reindex doesn't allow concurrent use of the index */ + table_close(r, NoLock); + + reindex_flags = + REINDEX_REL_SUPPRESS_INDEX_USE | + REINDEX_REL_CHECK_CONSTRAINTS; + + /* Set the same persistence as the parent relation. */ + if (persistence == RELPERSISTENCE_UNLOGGED) + reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED; + else + reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT; + + reindex_index(reloid, reindex_flags, persistence, &params); + + continue; + } + + /* Create or drop init fork */ + if (persistence == RELPERSISTENCE_UNLOGGED) + RelationCreateInitFork(r); + else + RelationDropInitFork(r); + + /* + * When this relation gets WAL-logged, immediately sync all files but + * initfork to establish the initial state on storage. Buffers have + * already been flushed out by RelationCreate(Drop)InitFork called just + * above. Initfork should have been synced as needed. 
+ */ + if (persistence == RELPERSISTENCE_PERMANENT) + { + for (i = 0 ; i < INIT_FORKNUM ; i++) + { + if (smgrexists(RelationGetSmgr(r), i)) + smgrimmedsync(RelationGetSmgr(r), i); + } + } + + /* Update catalog */ + tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid)); + if (!HeapTupleIsValid(tuple)) + elog(ERROR, "cache lookup failed for relation %u", reloid); + + memset(new_val, 0, sizeof(new_val)); + memset(new_null, false, sizeof(new_null)); + memset(new_repl, false, sizeof(new_repl)); + + new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence); + new_null[Anum_pg_class_relpersistence - 1] = false; + new_repl[Anum_pg_class_relpersistence - 1] = true; + + newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel), + new_val, new_null, new_repl); + + CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple); + heap_freetuple(newtuple); + + /* + * While wal_level >= replica, switching to LOGGED requires the + * relation content to be WAL-logged to recover the table. + * We don't emit this while wal_level = minimal. + */ + if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded()) + { + ForkNumber fork; + xl_smgr_truncate xlrec; + + xlrec.blkno = 0; + xlrec.rnode = r->rd_node; + xlrec.flags = SMGR_TRUNCATE_ALL; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + + XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE); + + for (fork = 0; fork < INIT_FORKNUM ; fork++) + { + if (smgrexists(RelationGetSmgr(r), fork)) + log_newpage_range(r, fork, 0, + smgrnblocks(RelationGetSmgr(r), fork), + false); + } + } + + table_close(r, NoLock); + } + + table_close(classRel, NoLock); +} + /* * ATRewriteTables: ALTER TABLE phase 3 */ @@ -5476,47 +5658,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode, tab->relid, tab->rewrite); - /* - * Create transient table that will receive the modified data. - * - * Ensure it is marked correctly as logged or unlogged. 
We have - * to do this here so that buffers for the new relfilenode will - * have the right persistence set, and at the same time ensure - * that the original filenode's buffers will get read in with the - * correct setting (i.e. the original one). Otherwise a rollback - * after the rewrite would possibly result with buffers for the - * original filenode having the wrong persistence setting. - * - * NB: This relies on swap_relation_files() also swapping the - * persistence. That wouldn't work for pg_class, but that can't be - * unlogged anyway. - */ - OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod, - persistence, lockmode); + if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE) + RelationChangePersistence(tab, persistence, lockmode); + else + { + /* + * Create transient table that will receive the modified data. + * + * Ensure it is marked correctly as logged or unlogged. We + * have to do this here so that buffers for the new relfilenode + * will have the right persistence set, and at the same time + * ensure that the original filenode's buffers will get read in + * with the correct setting (i.e. the original one). Otherwise + * a rollback after the rewrite would possibly result with + * buffers for the original filenode having the wrong + * persistence setting. + * + * NB: This relies on swap_relation_files() also swapping the + * persistence. That wouldn't work for pg_class, but that can't + * be unlogged anyway. + */ + OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, + NewAccessMethod, + persistence, lockmode); - /* - * Copy the heap data into the new table with the desired - * modifications, and test the current data within the table - * against new constraints generated by ALTER TABLE commands. - */ - ATRewriteTable(tab, OIDNewHeap, lockmode); + /* + * Copy the heap data into the new table with the desired + * modifications, and test the current data within the table + * against new constraints generated by ALTER TABLE commands. 
+ */ + ATRewriteTable(tab, OIDNewHeap, lockmode); - /* - * Swap the physical files of the old and new heaps, then rebuild - * indexes and discard the old heap. We can use RecentXmin for - * the table's new relfrozenxid because we rewrote all the tuples - * in ATRewriteTable, so no older Xid remains in the table. Also, - * we never try to swap toast tables by content, since we have no - * interest in letting this code work on system catalogs. - */ - finish_heap_swap(tab->relid, OIDNewHeap, - false, false, true, - !OidIsValid(tab->newTableSpace), - RecentXmin, - ReadNextMultiXactId(), - persistence); + /* + * Swap the physical files of the old and new heaps, then + * rebuild indexes and discard the old heap. We can use + * RecentXmin for the table's new relfrozenxid because we + * rewrote all the tuples in ATRewriteTable, so no older Xid + * remains in the table. Also, we never try to swap toast + * tables by content, since we have no interest in letting this + * code work on system catalogs. 
+ */ + finish_heap_swap(tab->relid, OIDNewHeap, + false, false, true, + !OidIsValid(tab->newTableSpace), + RecentXmin, + ReadNextMultiXactId(), + persistence); - InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0); + InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0); + } } else { diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c index ec0485705d..45e1a5d817 100644 --- a/src/backend/replication/basebackup.c +++ b/src/backend/replication/basebackup.c @@ -1070,6 +1070,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly, bool excludeFound; ForkNumber relForkNum; /* Type of fork if file is a relation */ int relOidChars; /* Chars in filename that are the rel oid */ + StorageMarks mark; /* Skip special stuff */ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0) @@ -1120,7 +1121,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly, /* Exclude all forks for unlogged tables except the init fork */ if (isDbDir && parse_filename_for_nontemp_relation(de->d_name, &relOidChars, - &relForkNum)) + &relForkNum, &mark)) { /* Never exclude init forks */ if (relForkNum != INIT_FORKNUM) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index b4532948d3..dab74bf99a 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -37,6 +37,7 @@ #include "access/xlogutils.h" #include "catalog/catalog.h" #include "catalog/storage.h" +#include "catalog/storage_xlog.h" #include "executor/instrument.h" #include "lib/binaryheap.h" #include "miscadmin.h" @@ -3154,6 +3155,93 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum, } } +/* --------------------------------------------------------------------- + * SetRelFileNodeBuffersPersistence + * + * This function changes the persistence of all buffer pages of a relation + * then writes all dirty pages of the relation out to disk when switching + 
* to PERMANENT. (or more accurately, out to kernel disk buffers), + * ensuring that the kernel has an up-to-date view of the relation. + * + * Generally, the caller should be holding AccessExclusiveLock on the + * target relation to ensure that no other backend is busy dirtying + * more blocks of the relation; the effects can't be expected to last + * after the lock is released. + * + * XXX currently it sequentially searches the buffer pool, should be + * changed to more clever ways of searching. This routine is not + * used in any performance-critical code paths, so it's not worth + * adding additional overhead to normal paths to make it go faster; + * but see also DropRelFileNodeBuffers. + * -------------------------------------------------------------------- + */ +void +SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo) +{ + int i; + RelFileNodeBackend rnode = srel->smgr_rnode; + + Assert (!RelFileNodeBackendIsTemp(rnode)); + + if (!isRedo) + log_smgrbufpersistence(&srel->smgr_rnode.node, permanent); + + ResourceOwnerEnlargeBuffers(CurrentResourceOwner); + + for (i = 0; i < NBuffers; i++) + { + BufferDesc *bufHdr = GetBufferDescriptor(i); + uint32 buf_state; + + if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node)) + continue; + + ReservePrivateRefCountEntry(); + + buf_state = LockBufHdr(bufHdr); + + if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node)) + { + UnlockBufHdr(bufHdr, buf_state); + continue; + } + + if (permanent) + { + /* Init fork is being dropped, drop buffers for it. 
*/ + if (bufHdr->tag.forkNum == INIT_FORKNUM) + { + InvalidateBuffer(bufHdr); + continue; + } + + buf_state |= BM_PERMANENT; + pg_atomic_write_u32(&bufHdr->state, buf_state); + + /* we flush this buffer when switching to PERMANENT */ + if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY)) + { + PinBuffer_Locked(bufHdr); + LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), + LW_SHARED); + FlushBuffer(bufHdr, srel); + LWLockRelease(BufferDescriptorGetContentLock(bufHdr)); + UnpinBuffer(bufHdr, true); + } + else + UnlockBufHdr(bufHdr, buf_state); + } + else + { + /* init fork is always BM_PERMANENT. See BufferAlloc */ + if (bufHdr->tag.forkNum != INIT_FORKNUM) + buf_state &= ~BM_PERMANENT; + + UnlockBufHdr(bufHdr, buf_state); + } + } +} + /* --------------------------------------------------------------------- * DropRelFileNodesAllBuffers * diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index 263057841d..8487ae1f02 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -349,8 +349,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel); static void datadir_fsync_fname(const char *fname, bool isdir, int elevel); static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel); -static int fsync_parent_path(const char *fname, int elevel); - /* * pg_fsync --- do fsync with or without writethrough @@ -3759,7 +3757,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel) * This is aimed at making file operations persistent on disk in case of * an OS crash or power failure. 
*/ -static int +int fsync_parent_path(const char *fname, int elevel) { char parentpath[MAXPGPATH]; diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c index 0ae3fb6902..0137902bb2 100644 --- a/src/backend/storage/file/reinit.c +++ b/src/backend/storage/file/reinit.c @@ -16,29 +16,49 @@ #include <unistd.h> +#include "access/xlog.h" +#include "catalog/pg_tablespace_d.h" #include "common/relpath.h" #include "postmaster/startup.h" +#include "storage/bufmgr.h" #include "storage/copydir.h" #include "storage/fd.h" +#include "storage/md.h" #include "storage/reinit.h" +#include "storage/smgr.h" #include "utils/hsearch.h" #include "utils/memutils.h" static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, - int op); + Oid tspid, int op); static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, - int op); + Oid tspid, Oid dbid, int op); typedef struct { Oid reloid; /* hash key */ -} unlogged_relation_entry; + bool has_init; /* has INIT fork */ + bool dirty_init; /* needs to remove INIT fork */ + bool dirty_all; /* needs to remove all forks */ +} relfile_entry; /* - * Reset unlogged relations from before the last restart. + * Clean up and reset relation files from before the last restart. * - * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any - * relation with an "init" fork, except for the "init" fork itself. + * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations + * depending on the existence of the "cleanup" forks. + * + * If SMGR_MARK_UNCOMMITTED mark file for init fork is present, we remove the + * init fork along with the mark file. + * + * If SMGR_MARK_UNCOMMITTED mark file for main fork is present we remove the + * whole relation along with the mark file. + * + * Otherwise, if the "init" fork is found. we remove all forks of any relation + * with the "init" fork, except for the "init" fork itself. 
+ * + * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all + * relations that have the "cleanup" and/or the "init" forks. * * If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main * fork. @@ -72,7 +92,7 @@ ResetUnloggedRelations(int op) /* * First process unlogged files in pg_default ($PGDATA/base) */ - ResetUnloggedRelationsInTablespaceDir("base", op); + ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op); /* * Cycle through directories for all non-default tablespaces. @@ -81,13 +101,19 @@ ResetUnloggedRelations(int op) while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL) { + Oid tspid; + if (strcmp(spc_de->d_name, ".") == 0 || strcmp(spc_de->d_name, "..") == 0) continue; snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s", spc_de->d_name, TABLESPACE_VERSION_DIRECTORY); - ResetUnloggedRelationsInTablespaceDir(temp_path, op); + + tspid = atooid(spc_de->d_name); + + Assert(tspid != 0); + ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op); } FreeDir(spc_dir); @@ -103,7 +129,8 @@ ResetUnloggedRelations(int op) * Process one tablespace directory for ResetUnloggedRelations */ static void -ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) +ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, + Oid tspid, int op) { DIR *ts_dir; struct dirent *de; @@ -130,6 +157,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) while ((de = ReadDir(ts_dir, tsdirname)) != NULL) { + Oid dbid; + /* * We're only interested in the per-database directories, which have * numeric names. Note that this code will also (properly) ignore "." 
@@ -148,7 +177,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) ereport_startup_progress("resetting unlogged relations (cleanup), elapsed time: %ld.%02d s, current path: %s", dbspace_path); - ResetUnloggedRelationsInDbspaceDir(dbspace_path, op); + dbid = atooid(de->d_name); + Assert(dbid != 0); + + ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op); } FreeDir(ts_dir); @@ -158,125 +190,227 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) * Process one per-dbspace directory for ResetUnloggedRelations */ static void -ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) +ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, + Oid tspid, Oid dbid, int op) { DIR *dbspace_dir; struct dirent *de; char rm_path[MAXPGPATH * 2]; + HTAB *hash; + HASHCTL ctl; /* Caller must specify at least one operation. */ - Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0); + Assert((op & (UNLOGGED_RELATION_CLEANUP | + UNLOGGED_RELATION_DROP_BUFFER | + UNLOGGED_RELATION_INIT)) != 0); /* * Cleanup is a two-pass operation. First, we go through and identify all * the files with init forks. Then, we go through again and nuke * everything with the same OID except the init fork. */ + + /* + * It's possible that someone could create tons of unlogged relations in + * the same database & tablespace, so we'd better use a hash table rather + * than an array or linked list to keep track of which files need to be + * reset. Otherwise, this cleanup operation would be O(n^2). + */ + memset(&ctl, 0, sizeof(ctl)); + ctl.keysize = sizeof(Oid); + ctl.entrysize = sizeof(relfile_entry); + hash = hash_create("relfilenode cleanup hash", + 32, &ctl, HASH_ELEM | HASH_BLOBS); + + /* Collect INIT fork and mark files in the directory. 
*/ + dbspace_dir = AllocateDir(dbspacedirname); + while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) + { + int oidchars; + ForkNumber forkNum; + StorageMarks mark; + + /* Skip anything that doesn't look like a relation data file. */ + if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, + &forkNum, &mark)) + continue; + + if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED) + { + Oid key; + relfile_entry *ent; + bool found; + + /* + * Record the relfilenode information. If it has + * SMGR_MARK_UNCOMMITTED mark files, the relfilenode is in dirty + * state, where clean up is needed. + */ + key = atooid(de->d_name); + ent = hash_search(hash, &key, HASH_ENTER, &found); + + if (!found) + { + ent->has_init = false; + ent->dirty_init = false; + ent->dirty_all = false; + } + + if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED) + ent->dirty_init = true; + else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED) + ent->dirty_all = true; + else + { + Assert(forkNum == INIT_FORKNUM); + ent->has_init = true; + } + } + } + + /* Done with the first pass. */ + FreeDir(dbspace_dir); + + /* nothing to do if we don't have init nor cleanup forks */ + if (hash_get_num_entries(hash) < 1) + { + hash_destroy(hash); + return; + } + + if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0) + { + /* + * When we come here after recovery, smgr object for this file might + * have been created. In that case we need to drop all buffers then the + * smgr object before initializing the unlogged relation. This is safe + * as far as no other backends have accessed the relation before + * starting archive recovery. 
+ */ + HASH_SEQ_STATUS status; + relfile_entry *ent; + SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8); + int maxrels = 8; + int nrels = 0; + int i; + + Assert(!HotStandbyActive()); + + hash_seq_init(&status, hash); + while ((ent = (relfile_entry *) hash_seq_search(&status)) != NULL) + { + RelFileNodeBackend rel; + + /* + * The relation is persistent and remains persistent. Don't + * drop the buffers for this relation. + */ + if (ent->has_init && ent->dirty_init) + continue; + + if (maxrels <= nrels) + { + maxrels *= 2; + srels = repalloc(srels, sizeof(SMgrRelation) * maxrels); + } + + rel.backend = InvalidBackendId; + rel.node.spcNode = tspid; + rel.node.dbNode = dbid; + rel.node.relNode = ent->reloid; + + srels[nrels++] = smgropen(rel.node, InvalidBackendId); + } + + DropRelFileNodesAllBuffers(srels, nrels); + + for (i = 0; i < nrels; i++) + smgrclose(srels[i]); + } + + /* + * Now, make a second pass and remove anything that matches. + */ if ((op & UNLOGGED_RELATION_CLEANUP) != 0) { - HTAB *hash; - HASHCTL ctl; - - /* - * It's possible that someone could create a ton of unlogged relations - * in the same database & tablespace, so we'd better use a hash table - * rather than an array or linked list to keep track of which files - * need to be reset. Otherwise, this cleanup operation would be - * O(n^2). - */ - ctl.keysize = sizeof(Oid); - ctl.entrysize = sizeof(unlogged_relation_entry); - ctl.hcxt = CurrentMemoryContext; - hash = hash_create("unlogged relation OIDs", 32, &ctl, - HASH_ELEM | HASH_BLOBS | HASH_CONTEXT); - - /* Scan the directory. */ dbspace_dir = AllocateDir(dbspacedirname); while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) { - ForkNumber forkNum; - int oidchars; - unlogged_relation_entry ent; + ForkNumber forkNum; + StorageMarks mark; + int oidchars; + Oid key; + relfile_entry *ent; + RelFileNodeBackend rel; /* Skip anything that doesn't look like a relation data file. 
*/ + if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) - continue; - - /* Also skip it unless this is the init fork. */ - if (forkNum != INIT_FORKNUM) - continue; - - /* - * Put the OID portion of the name into the hash table, if it - * isn't already. - */ - ent.reloid = atooid(de->d_name); - (void) hash_search(hash, &ent, HASH_ENTER, NULL); - } - - /* Done with the first pass. */ - FreeDir(dbspace_dir); - - /* - * If we didn't find any init forks, there's no point in continuing; - * we can bail out now. - */ - if (hash_get_num_entries(hash) == 0) - { - hash_destroy(hash); - return; - } - - /* - * Now, make a second pass and remove anything that matches. - */ - dbspace_dir = AllocateDir(dbspacedirname); - while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) - { - ForkNumber forkNum; - int oidchars; - unlogged_relation_entry ent; - - /* Skip anything that doesn't look like a relation data file. */ - if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) - continue; - - /* We never remove the init fork. */ - if (forkNum == INIT_FORKNUM) + &forkNum, &mark)) continue; /* * See whether the OID portion of the name shows up in the hash * table. If so, nuke it! */ - ent.reloid = atooid(de->d_name); - if (hash_search(hash, &ent, HASH_FIND, NULL)) + key = atooid(de->d_name); + ent = hash_search(hash, &key, HASH_FIND, NULL); + + if (!ent) + continue; + + if (!ent->dirty_all) { - snprintf(rm_path, sizeof(rm_path), "%s/%s", - dbspacedirname, de->d_name); - if (unlink(rm_path) < 0) - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not remove file \"%s\": %m", - rm_path))); + /* clean permanent relations don't need cleanup */ + if (!ent->has_init) + continue; + + if (ent->dirty_init) + { + /* + * The crashed transaction did SET UNLOGGED. This relation + * is restored to a LOGGED relation. 
+ */ + if (forkNum != INIT_FORKNUM) + continue; + } else - elog(DEBUG2, "unlinked file \"%s\"", rm_path); + { + /* + * we don't remove the INIT fork of a non-dirty + * relfilenode + */ + if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE) + continue; + } } + + /* so, nuke it! */ + snprintf(rm_path, sizeof(rm_path), "%s/%s", + dbspacedirname, de->d_name); + if (unlink(rm_path) < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not remove file \"%s\": %m", + rm_path))); + + rel.backend = InvalidBackendId; + rel.node.spcNode = tspid; + rel.node.dbNode = dbid; + rel.node.relNode = atooid(de->d_name); + + ForgetRelationForkSyncRequests(rel, forkNum); } /* Cleanup is complete. */ FreeDir(dbspace_dir); - hash_destroy(hash); } + hash_destroy(hash); + hash = NULL; + /* * Initialization happens after cleanup is complete: we copy each init - * fork file to the corresponding main fork file. Note that if we are - * asked to do both cleanup and init, we may never get here: if the - * cleanup code determines that there are no init forks in this dbspace, - * it will return before we get to this point. + * fork file to the corresponding main fork file. */ if ((op & UNLOGGED_RELATION_INIT) != 0) { @@ -285,6 +419,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) { ForkNumber forkNum; + StorageMarks mark; int oidchars; char oidbuf[OIDCHARS + 1]; char srcpath[MAXPGPATH * 2]; @@ -292,9 +427,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) /* Skip anything that doesn't look like a relation data file. */ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) + &forkNum, &mark)) continue; + Assert(mark == SMGR_MARK_NONE); + /* Also skip it unless this is the init fork. 
*/ if (forkNum != INIT_FORKNUM) continue; @@ -328,15 +465,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) { ForkNumber forkNum; + StorageMarks mark; int oidchars; char oidbuf[OIDCHARS + 1]; char mainpath[MAXPGPATH]; /* Skip anything that doesn't look like a relation data file. */ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) + &forkNum, &mark)) continue; + Assert(mark == SMGR_MARK_NONE); + /* Also skip it unless this is the init fork. */ if (forkNum != INIT_FORKNUM) continue; @@ -379,7 +519,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) */ bool parse_filename_for_nontemp_relation(const char *name, int *oidchars, - ForkNumber *fork) + ForkNumber *fork, StorageMarks *mark) { int pos; @@ -410,11 +550,19 @@ parse_filename_for_nontemp_relation(const char *name, int *oidchars, for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar) ; - if (segchar <= 1) - return false; - pos += segchar; + if (segchar > 1) + pos += segchar; } + /* mark file? */ + if (name[pos] == '.' && name[pos + 1] != 0) + { + *mark = name[pos + 1]; + pos += 2; + } + else + *mark = SMGR_MARK_NONE; + /* Now we should be at the end. */ if (name[pos] != '\0') return false; diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index b4bca7eed6..580b74839f 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -139,7 +139,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno, BlockNumber blkno, bool skipFsync, int behavior); static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg); - +static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum, + StorageMarks mark); /* * mdinit() -- Initialize private state for magnetic disk storage manager. 
@@ -169,6 +170,81 @@ mdexists(SMgrRelation reln, ForkNumber forkNum) return (mdopenfork(reln, forkNum, EXTENSION_RETURN_NULL) != NULL); } +/* + * mdcreatemark() -- Create a mark file. + * + * If isRedo is true, it's okay for the file to exist already. + */ +void +mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark, + bool isRedo) +{ + char *path = markpath(reln->smgr_rnode, forkNum, mark); + int fd; + + /* See mdcreate for details. */ + TablespaceCreateDbspace(reln->smgr_rnode.node.spcNode, + reln->smgr_rnode.node.dbNode, + isRedo); + + fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL); + if (fd < 0 && (!isRedo || errno != EEXIST)) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not create mark file \"%s\": %m", path))); + + pg_fsync(fd); + close(fd); + + /* + * To guarantee that the creation of the file is persistent, fsync its + * parent directory. + */ + fsync_parent_path(path, ERROR); + + pfree(path); +} + + +/* + * mdunlinkmark() -- Delete the mark file + * + * If isRedo is true, it's okay for the file to be missing. + */ +void +mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark, + bool isRedo) +{ + char *path = markpath(reln->smgr_rnode, forkNum, mark); + + if (!isRedo || mdmarkexists(reln, forkNum, mark)) + durable_unlink(path, ERROR); + + pfree(path); +} + +/* + * mdmarkexists() -- Check if the file exists. + */ +static bool +mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark) +{ + char *path = markpath(reln->smgr_rnode, forkNum, mark); + int fd; + + fd = BasicOpenFile(path, O_RDONLY); + if (fd < 0 && errno != ENOENT) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not access mark file \"%s\": %m", path))); + pfree(path); + + if (fd < 0) + return false; + close(fd); /* don't leak the descriptor */ + return true; +} + /* * mdcreate() -- Create a new relation on magnetic disk. 
* @@ -1025,6 +1101,15 @@ register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum, RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ ); } +/* + * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork + */ +void +ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, ForkNumber forknum) +{ + register_forget_request(rnode, forknum, 0); +} + /* * ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB */ @@ -1378,12 +1463,14 @@ mdsyncfiletag(const FileTag *ftag, char *path) * Return 0 on success, -1 on failure, with errno set. */ int -mdunlinkfiletag(const FileTag *ftag, char *path) +mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark) { char *p; /* Compute the path. */ - p = relpathperm(ftag->rnode, MAIN_FORKNUM); + p = GetRelationPath(ftag->rnode.dbNode, ftag->rnode.spcNode, + ftag->rnode.relNode, InvalidBackendId, MAIN_FORKNUM, + mark); strlcpy(path, p, MAXPGPATH); pfree(p); diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c index 0fcef4994b..110e64b0b2 100644 --- a/src/backend/storage/smgr/smgr.c +++ b/src/backend/storage/smgr/smgr.c @@ -62,6 +62,10 @@ typedef struct f_smgr void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks); void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum); + void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); + void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); } f_smgr; static const f_smgr smgrsw[] = { @@ -82,6 +86,8 @@ static const f_smgr smgrsw[] = { .smgr_nblocks = mdnblocks, .smgr_truncate = mdtruncate, .smgr_immedsync = mdimmedsync, + .smgr_createmark = mdcreatemark, + .smgr_unlinkmark = mdunlinkmark, } }; @@ -335,6 +341,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo) smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo); } +/* + * smgrcreatemark() -- Create a mark 
file + */ +void +smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark, + bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo); +} + +/* + * smgrunlinkmark() -- Delete a mark file + */ +void +smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark, + bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo); +} + /* * smgrdosyncall() -- Immediately sync all forks of all given relations * @@ -662,6 +688,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum) smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum); } +void +smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo); +} + /* * AtEOXact_SMgr * diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c index d4083e8a56..9563940d45 100644 --- a/src/backend/storage/sync/sync.c +++ b/src/backend/storage/sync/sync.c @@ -89,7 +89,8 @@ static CycleCtr checkpoint_cycle_ctr = 0; typedef struct SyncOps { int (*sync_syncfiletag) (const FileTag *ftag, char *path); - int (*sync_unlinkfiletag) (const FileTag *ftag, char *path); + int (*sync_unlinkfiletag) (const FileTag *ftag, char *path, + StorageMarks mark); bool (*sync_filetagmatches) (const FileTag *ftag, const FileTag *candidate); } SyncOps; @@ -222,7 +223,8 @@ SyncPostCheckpoint(void) /* Unlink the file */ if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag, - path) < 0) + path, + SMGR_MARK_NONE) < 0) { /* * There's a race condition, when the database is dropped at the @@ -236,6 +238,20 @@ SyncPostCheckpoint(void) (errcode_for_file_access(), errmsg("could not remove file \"%s\": %m", path))); } + else if (syncsw[entry->tag.handler].sync_unlinkfiletag( + &entry->tag, path, + SMGR_MARK_UNCOMMITTED) < 0) + { + /* + * And we may have SMGR_MARK_UNCOMMITTED file. Remove it if the + * fork files has been successfully removed. 
It's ok if the file + * does not exist. + */ + if (errno != ENOENT) + ereport(WARNING, + (errcode_for_file_access(), + errmsg("could not remove file \"%s\": %m", path))); + } /* Mark the list entry as canceled, just in case */ entry->canceled = true; diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c index 436df54120..dbc0da5da5 100644 --- a/src/bin/pg_rewind/parsexlog.c +++ b/src/bin/pg_rewind/parsexlog.c @@ -407,6 +407,30 @@ extractPageInfo(XLogReaderState *record) * source system. */ } + else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK) + { + /* + * We can safely ignore these. When we compare the sizes later on, + * we'll notice that they differ, and copy the missing tail from + * source system. + */ + } + else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK) + { + /* + * We can safely ignore these. When we compare the sizes later on, + * we'll notice that they differ, and copy the missing tail from + * source system. + */ + } + else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE) + { + /* + * We can safely ignore these. When we compare the sizes later on, + * we'll notice that they differ, and copy the missing tail from + * source system. 
+ */ + } else if (rmid == RM_XACT_ID && ((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT || (rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED || diff --git a/src/common/relpath.c b/src/common/relpath.c index 1f5c426ec0..4945b111cc 100644 --- a/src/common/relpath.c +++ b/src/common/relpath.c @@ -139,9 +139,15 @@ GetDatabasePath(Oid dbNode, Oid spcNode) */ char * GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, - int backendId, ForkNumber forkNumber) + int backendId, ForkNumber forkNumber, char mark) { char *path; + char markstr[4]; + + if (mark == 0) + markstr[0] = 0; + else + snprintf(markstr, sizeof(markstr), ".%c", mark); if (spcNode == GLOBALTABLESPACE_OID) { @@ -149,10 +155,10 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, Assert(dbNode == 0); Assert(backendId == InvalidBackendId); if (forkNumber != MAIN_FORKNUM) - path = psprintf("global/%u_%s", - relNode, forkNames[forkNumber]); + path = psprintf("global/%u_%s%s", + relNode, forkNames[forkNumber], markstr); else - path = psprintf("global/%u", relNode); + path = psprintf("global/%u%s", relNode, markstr); } else if (spcNode == DEFAULTTABLESPACE_OID) { @@ -160,22 +166,22 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, if (backendId == InvalidBackendId) { if (forkNumber != MAIN_FORKNUM) - path = psprintf("base/%u/%u_%s", + path = psprintf("base/%u/%u_%s%s", dbNode, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("base/%u/%u", - dbNode, relNode); + path = psprintf("base/%u/%u%s", + dbNode, relNode, markstr); } else { if (forkNumber != MAIN_FORKNUM) - path = psprintf("base/%u/t%d_%u_%s", + path = psprintf("base/%u/t%d_%u_%s%s", dbNode, backendId, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("base/%u/t%d_%u", - dbNode, backendId, relNode); + path = psprintf("base/%u/t%d_%u%s", + dbNode, backendId, relNode, markstr); } } else @@ -184,27 +190,28 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid 
relNode, if (backendId == InvalidBackendId) { if (forkNumber != MAIN_FORKNUM) - path = psprintf("pg_tblspc/%u/%s/%u/%u_%s", + path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s", spcNode, TABLESPACE_VERSION_DIRECTORY, dbNode, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("pg_tblspc/%u/%s/%u/%u", + path = psprintf("pg_tblspc/%u/%s/%u/%u%s", spcNode, TABLESPACE_VERSION_DIRECTORY, - dbNode, relNode); + dbNode, relNode, markstr); } else { if (forkNumber != MAIN_FORKNUM) - path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s", + path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s", spcNode, TABLESPACE_VERSION_DIRECTORY, dbNode, backendId, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u", + path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s", spcNode, TABLESPACE_VERSION_DIRECTORY, - dbNode, backendId, relNode); + dbNode, backendId, relNode, markstr); } } + return path; } diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h index 0ab32b44e9..584ebac391 100644 --- a/src/include/catalog/storage.h +++ b/src/include/catalog/storage.h @@ -23,6 +23,8 @@ extern int wal_skip_threshold; extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence); +extern void RelationCreateInitFork(Relation rel); +extern void RelationDropInitFork(Relation rel); extern void RelationDropStorage(Relation rel); extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit); extern void RelationPreTruncate(Relation rel); @@ -41,6 +43,7 @@ extern void RestorePendingSyncs(char *startAddress); extern void smgrDoPendingDeletes(bool isCommit); extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker); extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr); +extern void smgrDoPendingCleanups(bool isCommit); extern void AtSubCommit_smgr(void); extern void AtSubAbort_smgr(void); extern void PostPrepare_smgr(void); diff --git 
a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h index f0814f1458..12346ed7f6 100644 --- a/src/include/catalog/storage_xlog.h +++ b/src/include/catalog/storage_xlog.h @@ -18,17 +18,23 @@ #include "lib/stringinfo.h" #include "storage/block.h" #include "storage/relfilenode.h" +#include "storage/smgr.h" /* * Declarations for smgr-related XLOG records * - * Note: we log file creation and truncation here, but logging of deletion - * actions is handled by xact.c, because it is part of transaction commit. + * Note: we log file creation, truncation and buffer persistence change here, + * but logging of deletion actions is handled mainly by xact.c, because it is + * part of transaction commit in most cases. However, there's a case where + * init forks are deleted outside control of transaction. */ /* XLOG gives us high 4 bits */ #define XLOG_SMGR_CREATE 0x10 #define XLOG_SMGR_TRUNCATE 0x20 +#define XLOG_SMGR_UNLINK 0x30 +#define XLOG_SMGR_MARK 0x40 +#define XLOG_SMGR_BUFPERSISTENCE 0x50 typedef struct xl_smgr_create { @@ -36,6 +42,32 @@ typedef struct xl_smgr_create ForkNumber forkNum; } xl_smgr_create; +typedef struct xl_smgr_unlink +{ + RelFileNode rnode; + ForkNumber forkNum; +} xl_smgr_unlink; + +typedef enum smgr_mark_action +{ + XLOG_SMGR_MARK_CREATE = 'c', + XLOG_SMGR_MARK_UNLINK = 'u' +} smgr_mark_action; + +typedef struct xl_smgr_mark +{ + RelFileNode rnode; + ForkNumber forkNum; + StorageMarks mark; + smgr_mark_action action; +} xl_smgr_mark; + +typedef struct xl_smgr_bufpersistence +{ + RelFileNode rnode; + bool persistence; +} xl_smgr_bufpersistence; + /* flags for xl_smgr_truncate */ #define SMGR_TRUNCATE_HEAP 0x0001 #define SMGR_TRUNCATE_VM 0x0002 @@ -51,6 +83,12 @@ typedef struct xl_smgr_truncate } xl_smgr_truncate; extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum); +extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum); +extern void log_smgrcreatemark(const RelFileNode *rnode, ForkNumber 
forkNum, + StorageMarks mark); +extern void log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum, + StorageMarks mark); +extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence); extern void smgr_redo(XLogReaderState *record); extern void smgr_desc(StringInfo buf, XLogReaderState *record); diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h index a44be11ca0..106a5cf508 100644 --- a/src/include/common/relpath.h +++ b/src/include/common/relpath.h @@ -67,7 +67,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork); extern char *GetDatabasePath(Oid dbNode, Oid spcNode); extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, - int backendId, ForkNumber forkNumber); + int backendId, ForkNumber forkNumber, char mark); /* * Wrapper macros for GetRelationPath. Beware of multiple @@ -77,7 +77,7 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, /* First argument is a RelFileNode */ #define relpathbackend(rnode, backend, forknum) \ GetRelationPath((rnode).dbNode, (rnode).spcNode, (rnode).relNode, \ - backend, forknum) + backend, forknum, 0) /* First argument is a RelFileNode */ #define relpathperm(rnode, forknum) \ @@ -87,4 +87,9 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, #define relpath(rnode, forknum) \ relpathbackend((rnode).node, (rnode).backend, forknum) +/* First argument is a RelFileNodeBackend */ +#define markpath(rnode, forknum, mark) \ + GetRelationPath((rnode).node.dbNode, (rnode).node.spcNode, \ + (rnode).node.relNode, \ + (rnode).backend, forknum, mark) #endif /* RELPATH_H */ diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h index cfce23ecbc..f5a7df87a4 100644 --- a/src/include/storage/bufmgr.h +++ b/src/include/storage/bufmgr.h @@ -206,6 +206,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels) extern void FlushDatabaseBuffers(Oid dbid); extern void DropRelFileNodeBuffers(struct 
SMgrRelationData *smgr_reln, ForkNumber *forkNum, int nforks, BlockNumber *firstDelBlock); +extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel, + bool permanent, bool isRedo); extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes); extern void DropDatabaseBuffers(Oid dbid); diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h index 34602ae006..2dc0357ad5 100644 --- a/src/include/storage/fd.h +++ b/src/include/storage/fd.h @@ -185,6 +185,7 @@ extern ssize_t pg_pwritev_with_retry(int fd, extern int pg_truncate(const char *path, off_t length); extern void fsync_fname(const char *fname, bool isdir); extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel); +extern int fsync_parent_path(const char *fname, int elevel); extern int durable_rename(const char *oldfile, const char *newfile, int loglevel); extern int durable_unlink(const char *fname, int loglevel); extern int durable_rename_excl(const char *oldfile, const char *newfile, int loglevel); diff --git a/src/include/storage/md.h b/src/include/storage/md.h index 752b440864..99620816b5 100644 --- a/src/include/storage/md.h +++ b/src/include/storage/md.h @@ -23,6 +23,10 @@ extern void mdinit(void); extern void mdopen(SMgrRelation reln); extern void mdclose(SMgrRelation reln, ForkNumber forknum); +extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); +extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo); extern bool mdexists(SMgrRelation reln, ForkNumber forknum); extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo); @@ -41,12 +45,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks); extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum); +extern void 
ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, + ForkNumber forknum); extern void ForgetDatabaseSyncRequests(Oid dbid); extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo); /* md sync callbacks */ extern int mdsyncfiletag(const FileTag *ftag, char *path); -extern int mdunlinkfiletag(const FileTag *ftag, char *path); +extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark); extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate); #endif /* MD_H */ diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h index fad1e5c473..e1f97e9b89 100644 --- a/src/include/storage/reinit.h +++ b/src/include/storage/reinit.h @@ -16,13 +16,15 @@ #define REINIT_H #include "common/relpath.h" - +#include "storage/smgr.h" extern void ResetUnloggedRelations(int op); -extern bool parse_filename_for_nontemp_relation(const char *name, - int *oidchars, ForkNumber *fork); +extern bool parse_filename_for_nontemp_relation(const char *name, int *oidchars, + ForkNumber *fork, + StorageMarks *mark); #define UNLOGGED_RELATION_CLEANUP 0x0001 -#define UNLOGGED_RELATION_INIT 0x0002 +#define UNLOGGED_RELATION_DROP_BUFFER 0x0002 +#define UNLOGGED_RELATION_INIT 0x0004 #endif /* REINIT_H */ diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h index a6fbf7b6a6..201ecace8a 100644 --- a/src/include/storage/smgr.h +++ b/src/include/storage/smgr.h @@ -18,6 +18,18 @@ #include "storage/block.h" #include "storage/relfilenode.h" +/* + * A storage mark is a file whose existence indicates something about another + * file. Such files are named "<filename>.<mark>", where the mark is one + * of the values of StorageMarks. Since ".<digit>" denotes a segment file, don't + * use digits for the mark character. 
+ */ +typedef enum StorageMarks +{ + SMGR_MARK_NONE = 0, + SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */ +} StorageMarks; + /* * smgr.c maintains a table of SMgrRelation objects, which are essentially * cached file handles. An SMgrRelation is created (if not already present) @@ -85,7 +97,12 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln); extern void smgrclose(SMgrRelation reln); extern void smgrcloseall(void); extern void smgrclosenode(RelFileNodeBackend rnode); +extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); +extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo); +extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo); extern void smgrdosyncall(SMgrRelation *rels, int nrels); extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo); extern void smgrextend(SMgrRelation reln, ForkNumber forknum, diff --git a/src/test/recovery/t/027_persistence_change.pl b/src/test/recovery/t/027_persistence_change.pl new file mode 100644 index 0000000000..526b19cbda --- /dev/null +++ b/src/test/recovery/t/027_persistence_change.pl @@ -0,0 +1,247 @@ + +# Copyright (c) 2021, PostgreSQL Global Development Group + +# Test relation persistence change +use strict; +use warnings; +use PostgreSQL::Test::Cluster; +use PostgreSQL::Test::Utils; +use Test::More; +use Test::More tests => 30; +use IPC::Run qw(pump finish timer); +use Config; + +my $data_unit = 2000; + +# Initialize primary node. 
+my $node = PostgreSQL::Test::Cluster->new('node'); +$node->init; +# we don't want checkpointing +$node->append_conf('postgresql.conf', qq( +checkpoint_timeout = '24h' +)); +$node->start; +create($node); + +my $relfilenodes1 = relfilenodes(); + +# correctly recover empty tables +$node->stop('immediate'); +$node->start; +insert($node, 0, $data_unit, 0); + +# data persists after a crash +$node->stop('immediate'); +$node->start; +checkdataloss($data_unit, 'crash logged 1'); + +set_unlogged($node); +# SET UNLOGGED didn't change relfilenode +my $relfilenodes2 = relfilenodes(); +checkrelfilenodes($relfilenodes1, $relfilenodes2, 'logged->unlogged'); + +# data cleanly vanishes after a crash +$node->stop('immediate'); +$node->start; +checkdataloss(0, 'crash unlogged'); + +insert($node, 0, $data_unit, 0); +set_logged($node); + +$node->stop('immediate'); +$node->start; +# SET LOGGED didn't change relfilenode and data survive a crash +my $relfilenodes3 = relfilenodes(); +checkrelfilenodes($relfilenodes2, $relfilenodes3, 'unlogged->logged'); +checkdataloss($data_unit, 'crash logged 2'); + +# unlogged insert -> graceful stop +set_unlogged($node); +insert($node, $data_unit, $data_unit, 0); +$node->stop; +$node->start; +checkdataloss($data_unit * 2, 'unlogged graceful restart'); + +# crash during transaction +set_logged($node); +$node->stop('immediate'); +$node->start; +insert($node, $data_unit * 2, $data_unit, 0); + +my $h = insert($node, $data_unit * 3, $data_unit, 1); ## this is aborted +$node->stop('immediate'); + +# finishing $h stalls this case, just tear it off. +$h = undef; + +# check if indexes are working +$node->start; +# drop first half of data to reduce run time +$node->safe_psql('postgres', 'DELETE FROM t WHERE bt < ' . 
$data_unit * 2); +check($node, $data_unit * 2, $data_unit * 3 - 1, 'final check'); + +sub create +{ + my ($node) = @_; + + $node->psql('postgres', qq( + CREATE TABLE t (bt int, gin int[], gist point, hash int, + brin int, spgist point); + CREATE INDEX i_bt ON t USING btree (bt); + CREATE INDEX i_gin ON t USING gin (gin); + CREATE INDEX i_gist ON t USING gist (gist); + CREATE INDEX i_hash ON t USING hash (hash); + CREATE INDEX i_brin ON t USING brin (brin); + CREATE INDEX i_spgist ON t USING spgist (spgist);)); +} + + +sub insert +{ + my ($node, $st, $num, $interactive) = @_; + my $ed = $st + $num - 1; + my $query = qq(BEGIN; +INSERT INTO t + (SELECT i, ARRAY[i, i * 2], point(i, i * 2), i, i, point(i, i) + FROM generate_series($st, $ed) i); +); + + if ($interactive) + { + my $in = ''; + my $out = ''; + my $timer = timer(10); + + my $h = $node->interactive_psql('postgres', \$in, \$out, $timer); + like($out, qr/psql/, "print startup banner"); + + $in .= "$query\n"; + pump $h until ($out =~ /[\n\r]+INSERT 0 $num[\n\r]+/ || + $timer->is_expired); + ok(($out =~ /[\n\r]+INSERT 0 $num[\n\r]+/), "inserted-$st-$num"); + return $h + # the trasaction is not terminated + } + else + { + $node->psql('postgres', $query . 
"COMMIT;"); + return undef; + } +} + +sub check +{ + my ($node, $st, $ed, $head) = @_; + my $num_data = $ed - $st + 1; + + is($node->safe_psql('postgres', qq( + SET enable_seqscan TO true; + SET enable_indexscan TO false; + SELECT COUNT(*) FROM t, generate_series($st, $ed) i + WHERE bt = i)), + $num_data, "$head: heap is not broken"); + is($node->safe_psql('postgres', qq( + SET enable_seqscan TO false; + SET enable_indexscan TO true; + SELECT COUNT(*) FROM t, generate_series($st, $ed) i + WHERE bt = i)), + $num_data, "$head: btree is not broken"); + is($node->safe_psql('postgres', qq( + SET enable_seqscan TO false; + SET enable_indexscan TO true; + SELECT COUNT(*) FROM t, generate_series($st, $ed) i + WHERE gin = ARRAY[i, i * 2];)), + $num_data, "$head: gin is not broken"); + is($node->safe_psql('postgres', qq( + SET enable_seqscan TO false; + SET enable_indexscan TO true; + SELECT COUNT(*) FROM t, generate_series($st, $ed) i + WHERE gist <@ box(point(i-0.5, i*2-0.5),point(i+0.5, i*2+0.5));)), + $num_data, "$head: gist is not broken"); + is($node->safe_psql('postgres', qq( + SET enable_seqscan TO false; + SET enable_indexscan TO true; + SELECT COUNT(*) FROM t, generate_series($st, $ed) i + WHERE hash = i;)), + $num_data, "$head: hash is not broken"); + is($node->safe_psql('postgres', qq( + SET enable_seqscan TO false; + SET enable_indexscan TO true; + SELECT COUNT(*) FROM t, generate_series($st, $ed) i + WHERE brin = i;)), + $num_data, "$head: brin is not broken"); + is($node->safe_psql('postgres', qq( + SET enable_seqscan TO false; + SET enable_indexscan TO true; + SELECT COUNT(*) FROM t, generate_series($st, $ed) i + WHERE spgist <@ box(point(i-0.5,i-0.5),point(i+0.5,i+0.5));)), + $num_data, "$head: spgist is not broken"); +} + +sub set_unlogged +{ + my ($node) = @_; + + $node->psql('postgres', qq( + ALTER TABLE t SET UNLOGGED; +)); +} + +sub set_logged +{ + my ($node) = @_; + + $node->psql('postgres', qq( + ALTER TABLE t SET LOGGED; +)); +} + +sub relfilenodes 
+{ + my $result = $node->safe_psql('postgres', qq{ + SELECT relname, relfilenode FROM pg_class + WHERE relname + IN ('t', 'i_bt','i_gin','i_gist','i_hash','i_brin','i_spgist');}); + + my %relfilenodes; + + foreach my $l (split(/\n/, $result)) + { + die "unexpected format: $l" if ($l !~ /^([^|]+)\|([0-9]+)$/); + $relfilenodes{$1} = $2; + } + + # the number must correspond to the in list above + is (scalar %relfilenodes, 7, "number of relations is correct"); + + return \%relfilenodes; +} + +sub checkrelfilenodes +{ + my ($rnodes1, $rnodes2, $s) = @_; + + foreach my $n (keys %{$rnodes1}) + { + if ($n eq 'i_gist') + { + # persistence of GiST index is not changed in-place + isnt($rnodes1->{$n}, $rnodes2->{$n}, + "$s: relfilenode is changed: $n"); + } + else + { + # otherwise all relations are processed in-place + is($rnodes1->{$n}, $rnodes2->{$n}, + "$s: relfilenode is not changed: $n"); + } + } +} + +sub checkdataloss +{ + my ($expected, $s) = @_; + + is($node->safe_psql('postgres', "SELECT count(*) FROM t;"), $expected, + "$s: data in table t is in the expected state"); +} -- 2.27.0 From 4e1d78c0eaf0c34f58c2ab2708244a75f3791add Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Wed, 11 Nov 2020 23:21:09 +0900 Subject: [PATCH v15 2/2] New command ALTER TABLE ALL IN TABLESPACE SET LOGGED/UNLOGGED To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes relation persistence of all tables in the specified tablespace. 
--- src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++++++++++ src/backend/nodes/copyfuncs.c | 16 ++++ src/backend/nodes/equalfuncs.c | 15 ++++ src/backend/parser/gram.y | 20 +++++ src/backend/tcop/utility.c | 11 +++ src/include/commands/tablecmds.h | 2 + src/include/nodes/nodes.h | 1 + src/include/nodes/parsenodes.h | 9 ++ 8 files changed, 214 insertions(+) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 51fcf9ca5f..1620fe771d 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -14770,6 +14770,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt) return new_tablespaceoid; } +/* + * Alter Table ALL ... SET LOGGED/UNLOGGED + * + * Allows a user to change persistence of all objects in a given tablespace in + * the current database. Objects can be chosen based on the owner of the + * object also, to allow users to change persistene only their objects. The + * main permissions handling is done by the lower-level change persistence + * function. + * + * All to-be-modified objects are locked first. If NOWAIT is specified and the + * lock can't be acquired then we ereport(ERROR). + */ +void +AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt) +{ + List *relations = NIL; + ListCell *l; + ScanKeyData key[1]; + Relation rel; + TableScanDesc scan; + HeapTuple tuple; + Oid tablespaceoid; + List *role_oids = roleSpecsToIds(NIL); + + /* Ensure we were not asked to change something we can't */ + if (stmt->objtype != OBJECT_TABLE) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("only tables can be specified"))); + + /* Get the tablespace OID */ + tablespaceoid = get_tablespace_oid(stmt->tablespacename, false); + + /* + * Now that the checks are done, check if we should set either to + * InvalidOid because it is our database's default tablespace. 
+ */ + if (tablespaceoid == MyDatabaseTableSpace) + tablespaceoid = InvalidOid; + + /* + * Walk the list of objects in the tablespace to pick up them. This will + * only find objects in our database, of course. + */ + ScanKeyInit(&key[0], + Anum_pg_class_reltablespace, + BTEqualStrategyNumber, F_OIDEQ, + ObjectIdGetDatum(tablespaceoid)); + + rel = table_open(RelationRelationId, AccessShareLock); + scan = table_beginscan_catalog(rel, 1, key); + while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL) + { + Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple); + Oid relOid = relForm->oid; + + /* + * Do not pick-up objects in pg_catalog as part of this, if an admin + * really wishes to do so, they can issue the individual ALTER + * commands directly. + * + * Also, explicitly avoid any shared tables, temp tables, or TOAST + * (TOAST will be changed with the main table). + */ + if (IsCatalogNamespace(relForm->relnamespace) || + relForm->relisshared || + isAnyTempNamespace(relForm->relnamespace) || + IsToastNamespace(relForm->relnamespace)) + continue; + + /* Only pick up the object type requested */ + if (relForm->relkind != RELKIND_RELATION) + continue; + + /* Check if we are only picking-up objects owned by certain roles */ + if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner)) + continue; + + /* + * Handle permissions-checking here since we are locking the tables + * and also to avoid doing a bunch of work only to fail part-way. Note + * that permissions will also be checked by AlterTableInternal(). + * + * Caller must be considered an owner on the table of which we're going + * to change persistence. 
+ */ + if (!pg_class_ownercheck(relOid, GetUserId())) + aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)), + NameStr(relForm->relname)); + + if (stmt->nowait && + !ConditionalLockRelationOid(relOid, AccessExclusiveLock)) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_IN_USE), + errmsg("aborting because lock on relation \"%s.%s\" is not available", + get_namespace_name(relForm->relnamespace), + NameStr(relForm->relname)))); + else + LockRelationOid(relOid, AccessExclusiveLock); + + /* + * Add to our list of objects of which we're going to change + * persistence. + */ + relations = lappend_oid(relations, relOid); + } + + table_endscan(scan); + table_close(rel, AccessShareLock); + + if (relations == NIL) + ereport(NOTICE, + (errcode(ERRCODE_NO_DATA_FOUND), + errmsg("no matching relations in tablespace \"%s\" found", + tablespaceoid == InvalidOid ? "(database default)" : + get_tablespace_name(tablespaceoid)))); + + /* + * Everything is locked, loop through and change persistence of all of the + * relations. 
+ */ + foreach(l, relations) + { + List *cmds = NIL; + AlterTableCmd *cmd = makeNode(AlterTableCmd); + + if (stmt->logged) + cmd->subtype = AT_SetLogged; + else + cmd->subtype = AT_SetUnLogged; + + cmds = lappend(cmds, cmd); + + EventTriggerAlterTableStart((Node *) stmt); + /* OID is set by AlterTableInternal */ + AlterTableInternal(lfirst_oid(l), cmds, false); + EventTriggerAlterTableEnd(); + } +} + static void index_copy_data(Relation rel, RelFileNode newrnode) { diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c index 18e778e856..51b6ad757f 100644 --- a/src/backend/nodes/copyfuncs.c +++ b/src/backend/nodes/copyfuncs.c @@ -4270,6 +4270,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from) return newnode; } +static AlterTableSetLoggedAllStmt * +_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from) +{ + AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt); + + COPY_STRING_FIELD(tablespacename); + COPY_SCALAR_FIELD(objtype); + COPY_SCALAR_FIELD(logged); + COPY_SCALAR_FIELD(nowait); + + return newnode; +} + static CreateExtensionStmt * _copyCreateExtensionStmt(const CreateExtensionStmt *from) { @@ -5623,6 +5636,9 @@ copyObjectImpl(const void *from) case T_AlterTableMoveAllStmt: retval = _copyAlterTableMoveAllStmt(from); break; + case T_AlterTableSetLoggedAllStmt: + retval = _copyAlterTableSetLoggedAllStmt(from); + break; case T_CreateExtensionStmt: retval = _copyCreateExtensionStmt(from); break; diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c index cb7ddd463c..a19b7874d7 100644 --- a/src/backend/nodes/equalfuncs.c +++ b/src/backend/nodes/equalfuncs.c @@ -1916,6 +1916,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a, return true; } +static bool +_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a, + const AlterTableSetLoggedAllStmt *b) +{ + COMPARE_STRING_FIELD(tablespacename); + COMPARE_SCALAR_FIELD(objtype); + 
COMPARE_SCALAR_FIELD(logged); + COMPARE_SCALAR_FIELD(nowait); + + return true; +} + static bool _equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b) { @@ -3625,6 +3637,9 @@ equal(const void *a, const void *b) case T_AlterTableMoveAllStmt: retval = _equalAlterTableMoveAllStmt(a, b); break; + case T_AlterTableSetLoggedAllStmt: + retval = _equalAlterTableSetLoggedAllStmt(a, b); + break; case T_CreateExtensionStmt: retval = _equalCreateExtensionStmt(a, b); break; diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index 6dddc07947..a55ea302c1 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -1984,6 +1984,26 @@ AlterTableStmt: n->nowait = $13; $$ = (Node *)n; } + | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->logged = true; + n->nowait = $9; + $$ = (Node *)n; + } + | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->logged = false; + n->nowait = $9; + $$ = (Node *)n; + } | ALTER INDEX qualified_name alter_table_cmds { AlterTableStmt *n = makeNode(AlterTableStmt); diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c index 1fbc387d47..1483f9a475 100644 --- a/src/backend/tcop/utility.c +++ b/src/backend/tcop/utility.c @@ -162,6 +162,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree) case T_AlterTSConfigurationStmt: case T_AlterTSDictionaryStmt: case T_AlterTableMoveAllStmt: + case T_AlterTableSetLoggedAllStmt: case T_AlterTableSpaceOptionsStmt: case T_AlterTableStmt: case T_AlterTypeStmt: @@ -1747,6 +1748,12 @@ ProcessUtilitySlow(ParseState *pstate, commandCollected = true; break; + case T_AlterTableSetLoggedAllStmt: + AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree); + 
/* commands are stashed in AlterTableSetLoggedAll */ + commandCollected = true; + break; + case T_DropStmt: ExecDropStmt((DropStmt *) parsetree, isTopLevel); /* no commands stashed for DROP */ @@ -2669,6 +2676,10 @@ CreateCommandTag(Node *parsetree) tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype); break; + case T_AlterTableSetLoggedAllStmt: + tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype); + break; + case T_AlterTableStmt: tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype); break; diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h index 336549cc5f..714077ff4c 100644 --- a/src/include/commands/tablecmds.h +++ b/src/include/commands/tablecmds.h @@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse); extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt); +extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt); + extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt, Oid *oldschema); diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h index 7c657c1241..8860b2e548 100644 --- a/src/include/nodes/nodes.h +++ b/src/include/nodes/nodes.h @@ -428,6 +428,7 @@ typedef enum NodeTag T_AlterCollationStmt, T_CallStmt, T_AlterStatsStmt, + T_AlterTableSetLoggedAllStmt, /* * TAGS FOR PARSE TREE NODES (parsenodes.h) diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h index 593e301f7a..b9226a7cd9 100644 --- a/src/include/nodes/parsenodes.h +++ b/src/include/nodes/parsenodes.h @@ -2350,6 +2350,15 @@ typedef struct AlterTableMoveAllStmt bool nowait; } AlterTableMoveAllStmt; +typedef struct AlterTableSetLoggedAllStmt +{ + NodeTag type; + char *tablespacename; + ObjectType objtype; /* Object type to move */ + bool logged; + bool nowait; +} AlterTableSetLoggedAllStmt; + /* ---------------------- * Create/Alter Extension Statements * ---------------------- -- 2.27.0
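For readers following the 0001 patch above: the smgr.h comment says a storage mark is a file named "<filename>.<mark>", and that the mark character must not be a digit because ".<digit>" already denotes a segment file. The sketch below only illustrates that naming rule. It is not code from the patch; DemoMark, demo_mark_path() and demo_classify() are hypothetical names made up for this example, and the real logic lives in GetRelationPath() and parse_filename_for_nontemp_relation(), whose details may differ.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the patch's StorageMarks enum. */
typedef enum DemoMark
{
	DEMO_MARK_NONE = 0,
	DEMO_MARK_UNCOMMITTED = 'u'	/* the file is not committed yet */
} DemoMark;

/*
 * Build "<filename>.<mark>" into dst.
 * Returns 0 on success, -1 if the mark is unusable (none, or a digit,
 * which would be mistaken for a segment number) or dst is too small.
 */
int
demo_mark_path(char *dst, size_t dstlen, const char *filename, DemoMark mark)
{
	int			n;

	if (mark == DEMO_MARK_NONE || isdigit((unsigned char) mark))
		return -1;

	n = snprintf(dst, dstlen, "%s.%c", filename, (char) mark);
	return (n < 0 || (size_t) n >= dstlen) ? -1 : 0;
}

/*
 * Classify the suffix after the last '.' in name:
 *   's' = segment file (all-digit suffix)
 *   'm' = mark file (single known mark character; *mark is set)
 *   'n' = neither
 */
char
demo_classify(const char *name, DemoMark *mark)
{
	const char *dot;
	const char *p;

	assert(name != NULL && mark != NULL);

	*mark = DEMO_MARK_NONE;
	dot = strrchr(name, '.');
	if (dot == NULL || dot[1] == '\0')
		return 'n';

	/* An all-digit suffix is a segment number, e.g. "16384.1". */
	for (p = dot + 1; *p != '\0'; p++)
	{
		if (!isdigit((unsigned char) *p))
			break;
	}
	if (*p == '\0')
		return 's';

	/* A one-character suffix matching a known mark, e.g. "16384_init.u". */
	if (dot[2] == '\0' && dot[1] == DEMO_MARK_UNCOMMITTED)
	{
		*mark = DEMO_MARK_UNCOMMITTED;
		return 'm';
	}

	return 'n';
}
```

Under these assumptions, "16384_init.u" is classified as a mark file, "16384.1" as a segment file, and a bare "16384" as neither, which is why a digit can never serve as a mark character.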
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: not tested

I've retested v15 of the patch with everything that came to my mind. The patch passes all my tests (well, there's this just-Windows / cfbot issue). The patch looks good to me. I haven't looked in depth at the code, so the patch might still need review.

FYI, about potential usage of this patch: the most advanced test that I did was continually bouncing wal_level, and it works. So the chain is:
1. wal_level=replica->minimal
2. alter table set unlogged and load a lot of data, set logged
3. wal_level=minimal->replica
4. barman incremental backup # rsync(1) just backs up changed files in steps 2 and 3 (not whole DB)
5. some other (logged) work

The idea is that when performing mass alterations to the DB (think nightly ETL/ELT on TB-sized DBs), one could skip backing up most of the DB and then just quickly back up only the changed files during the maintenance window, e.g. thanks to the local-rsync barman mode. This is the output of barman show-backups after loading data to an unlogged table in each such cycle:
mydb 20220110T100236 - Mon Jan 10 10:05:14 2022 - Size: 144.1 GiB - WAL Size: 16.0 KiB
mydb 20220110T094905 - Mon Jan 10 09:50:12 2022 - Size: 128.5 GiB - WAL Size: 80.2 KiB
mydb 20220110T094016 - Mon Jan 10 09:40:17 2022 - Size: 109.1 GiB - WAL Size: 496.3 KiB
And the dedupe ratio of the last one: Backup size: 144.1 GiB. Actual size on disk: 36.1 GiB (-74.96% deduplication ratio).

The only thing I've found out is that bouncing wal_level also forces max_wal_senders=X -> 0 -> X, which in turn requires dropping the replication slot for pg_receivewal (e.g. barman receive-wal --create-slot/--drop-slot/--reset). I have tested the restore using barman recover afterwards to backup 20220110T094905 and indeed it worked OK using this patch too.

The new status of this patch is: Needs review
I found a bug.

mdmarkexists() didn't close the tentatively opened fd. This is a silent leak on Linux and similar systems, and it causes a delete failure on Windows. It was the reason for the CI failure.

027_persistence_change.pl uses interactive_psql(), which doesn't work on the Windows VM on the CI.

In this version the following changes have been made in 0001.

- Properly close the file descriptor in mdmarkexists.
- Skip some tests when IO::Pty is not available. It might be better to separate that part.

Looking again at the ALTER TABLE ALL IN TABLESPACE SET LOGGED patch, I noticed that it doesn't implement the OWNED BY part and doesn't have tests or documentation (it was a PoC). Added all of them to 0002.

At Tue, 11 Jan 2022 09:33:55 +0000, Jakub Wartak <jakub.wartak@tomtom.com> wrote in
> The following review has been posted through the commitfest application:
> make installcheck-world: tested, passed
> Implements feature: tested, passed
> Spec compliant: tested, passed
> Documentation: not tested
>
> I've retested v15 of the patch with everything that came to my mind. The patch passes all my tests (well, there's this just-Windows / cfbot issue). The patch looks good to me. I haven't looked in depth at the code, so the patch might still need review.

Thanks for checking.

> FYI, about potential usage of this patch: the most advanced test that I did was continually bouncing wal_level, and it works. So the chain is:
> 1. wal_level=replica->minimal
> 2. alter table set unlogged and load a lot of data, set logged
> 3. wal_level=minimal->replica
> 4. barman incremental backup # rsync(1) just backs up changed files in steps 2 and 3 (not whole DB)
> 5. some other (logged) work
> The idea is that when performing mass alterations to the DB (think nightly ETL/ELT on TB-sized DBs), one could skip backing up most of the DB and then just quickly back up only the changed files during the maintenance window, e.g. thanks to the local-rsync barman mode.
> This is the output of barman show-backups after loading data to an unlogged table in each such cycle:
> mydb 20220110T100236 - Mon Jan 10 10:05:14 2022 - Size: 144.1 GiB - WAL Size: 16.0 KiB
> mydb 20220110T094905 - Mon Jan 10 09:50:12 2022 - Size: 128.5 GiB - WAL Size: 80.2 KiB
> mydb 20220110T094016 - Mon Jan 10 09:40:17 2022 - Size: 109.1 GiB - WAL Size: 496.3 KiB
> And the dedupe ratio of the last one: Backup size: 144.1 GiB. Actual size on disk: 36.1 GiB (-74.96% deduplication ratio).

Ah, the patch skips duplicating relation files. This is advantageous in that it not only eliminates the I/O activity the duplication causes but also reduces the size of incremental backups. I hadn't noticed the latter advantage.

> The only thing I've found out is that bouncing wal_level also forces max_wal_senders=X -> 0 -> X, which in turn requires dropping the replication slot for pg_receivewal (e.g. barman receive-wal --create-slot/--drop-slot/--reset). I have tested the restore using barman recover afterwards to backup 20220110T094905 and indeed it worked OK using this patch too.

Yeah, it is irrelevant to this patch, but I'm annoyed by the restriction. I think it would be okay for max_wal_senders to be forcibly set to 0 while wal_level=minimal.

> The new status of this patch is: Needs review

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

From d6bf0bd0d60391b24d5be7942b546acfffa3d7b1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v16 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data rewriting, currently it runs a heap rewrite, which causes a large amount of file I/O. This patch makes the command run without a heap rewrite. In addition to that, SET LOGGED while wal_level > minimal emits WAL using XLOG_FPI instead of a massive number of HEAP_INSERTs, which should be smaller.
Also this allows for the cleanup of files left behind in the crash of the transaction that created it. --- src/backend/access/rmgrdesc/smgrdesc.c | 52 ++ src/backend/access/transam/README | 8 + src/backend/access/transam/xact.c | 7 + src/backend/access/transam/xlog.c | 17 + src/backend/catalog/storage.c | 545 +++++++++++++++++- src/backend/commands/tablecmds.c | 266 +++++++-- src/backend/replication/basebackup.c | 3 +- src/backend/storage/buffer/bufmgr.c | 88 +++ src/backend/storage/file/fd.c | 4 +- src/backend/storage/file/reinit.c | 344 +++++++---- src/backend/storage/smgr/md.c | 94 ++- src/backend/storage/smgr/smgr.c | 32 + src/backend/storage/sync/sync.c | 20 +- src/bin/pg_rewind/parsexlog.c | 24 + src/common/relpath.c | 47 +- src/include/catalog/storage.h | 3 + src/include/catalog/storage_xlog.h | 42 +- src/include/common/relpath.h | 9 +- src/include/storage/bufmgr.h | 2 + src/include/storage/fd.h | 1 + src/include/storage/md.h | 8 +- src/include/storage/reinit.h | 10 +- src/include/storage/smgr.h | 17 + src/test/recovery/t/027_persistence_change.pl | 263 +++++++++ 24 files changed, 1724 insertions(+), 182 deletions(-) create mode 100644 src/test/recovery/t/027_persistence_change.pl diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c index 7755553d57..d251f22207 100644 --- a/src/backend/access/rmgrdesc/smgrdesc.c +++ b/src/backend/access/rmgrdesc/smgrdesc.c @@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record) xlrec->blkno, xlrec->flags); pfree(path); } + else if (info == XLOG_SMGR_UNLINK) + { + xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec; + char *path = relpathperm(xlrec->rnode, xlrec->forkNum); + + appendStringInfoString(buf, path); + pfree(path); + } + else if (info == XLOG_SMGR_MARK) + { + xl_smgr_mark *xlrec = (xl_smgr_mark *) rec; + char *path = GetRelationPath(xlrec->rnode.dbNode, + xlrec->rnode.spcNode, + xlrec->rnode.relNode, + InvalidBackendId, + xlrec->forkNum, xlrec->mark); + char *action; 
+ + switch (xlrec->action) + { + case XLOG_SMGR_MARK_CREATE: + action = "CREATE"; + break; + case XLOG_SMGR_MARK_UNLINK: + action = "DELETE"; + break; + default: + action = "<unknown action>"; + break; + } + + appendStringInfo(buf, "%s %s", action, path); + pfree(path); + } + else if (info == XLOG_SMGR_BUFPERSISTENCE) + { + xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec; + char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM); + + appendStringInfoString(buf, path); + appendStringInfo(buf, " persistence %d", xlrec->persistence); + pfree(path); + } } const char * @@ -55,6 +98,15 @@ smgr_identify(uint8 info) case XLOG_SMGR_TRUNCATE: id = "TRUNCATE"; break; + case XLOG_SMGR_UNLINK: + id = "UNLINK"; + break; + case XLOG_SMGR_MARK: + id = "MARK"; + break; + case XLOG_SMGR_BUFPERSISTENCE: + id = "BUFPERSISTENCE"; + break; } return id; diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README index 1edc8180c1..b344bbe511 100644 --- a/src/backend/access/transam/README +++ b/src/backend/access/transam/README @@ -724,6 +724,14 @@ we must panic and abort recovery. The DBA will have to manually clean up then restart recovery. This is part of the reason for not writing a WAL entry until we've successfully done the original action. +The Smgr MARK files +-------------------------------- + +An smgr mark file is created when a new relation file is created to +mark the relfilenode needs to be cleaned up at recovery time. In +contrast to the four actions above, failure to remove smgr mark files +will lead to data loss, in which case the server will shut down. 
+ Skipping WAL for New RelFileNode -------------------------------- diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index e7b0bc804d..b41186d6d8 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -2197,6 +2197,9 @@ CommitTransaction(void) */ smgrDoPendingSyncs(true, is_parallel_worker); + /* Likewise delete mark files for files created during this transaction. */ + smgrDoPendingCleanups(true); + /* close large objects before lower-level cleanup */ AtEOXact_LargeObject(true); @@ -2447,6 +2450,9 @@ PrepareTransaction(void) */ smgrDoPendingSyncs(true, false); + /* Likewise delete mark files for files created during this transaction. */ + smgrDoPendingCleanups(true); + /* close large objects before lower-level cleanup */ AtEOXact_LargeObject(true); @@ -2772,6 +2778,7 @@ AbortTransaction(void) AfterTriggerEndXact(false); /* 'false' means it's abort */ AtAbort_Portals(); smgrDoPendingSyncs(false, is_parallel_worker); + smgrDoPendingCleanups(false); AtEOXact_LargeObject(false); AtAbort_Notify(); AtEOXact_RelationMap(false, is_parallel_worker); diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 87cd05c945..243860fcb1 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -40,6 +40,7 @@ #include "catalog/catversion.h" #include "catalog/pg_control.h" #include "catalog/pg_database.h" +#include "catalog/storage.h" #include "commands/progress.h" #include "commands/tablespace.h" #include "common/controldata_utils.h" @@ -4564,6 +4565,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode, { ereport(DEBUG1, (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery"))); + + /* cleanup garbage files left during crash recovery */ + ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER | + UNLOGGED_RELATION_CLEANUP); + + /* run rollback cleanup if any */ + smgrDoPendingDeletes(false); + InArchiveRecovery = true; if 
(StandbyModeRequested) StandbyMode = true; @@ -7824,6 +7833,14 @@ StartupXLOG(void) } } + /* cleanup garbage files left during crash recovery */ + if (!InArchiveRecovery) + ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER | + UNLOGGED_RELATION_CLEANUP); + + /* run rollback cleanup if any */ + smgrDoPendingDeletes(false); + /* Allow resource managers to do any required cleanup. */ for (rmid = 0; rmid <= RM_MAX_ID; rmid++) { diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c index c5ad28d71f..d6b30387e9 100644 --- a/src/backend/catalog/storage.c +++ b/src/backend/catalog/storage.c @@ -19,6 +19,7 @@ #include "postgres.h" +#include "access/amapi.h" #include "access/parallel.h" #include "access/visibilitymap.h" #include "access/xact.h" @@ -66,6 +67,23 @@ typedef struct PendingRelDelete struct PendingRelDelete *next; /* linked-list link */ } PendingRelDelete; +#define PCOP_UNLINK_FORK (1 << 0) +#define PCOP_UNLINK_MARK (1 << 1) +#define PCOP_SET_PERSISTENCE (1 << 2) + +typedef struct PendingCleanup +{ + RelFileNode relnode; /* relation that may need to be deleted */ + int op; /* operation mask */ + bool bufpersistence; /* buffer persistence to set */ + int unlink_forknum; /* forknum to unlink */ + StorageMarks unlink_mark; /* mark to unlink */ + BackendId backend; /* InvalidBackendId if not a temp rel */ + bool atCommit; /* T=delete at commit; F=delete at abort */ + int nestLevel; /* xact nesting level of request */ + struct PendingCleanup *next; /* linked-list link */ +} PendingCleanup; + typedef struct PendingRelSync { RelFileNode rnode; @@ -73,6 +91,7 @@ typedef struct PendingRelSync } PendingRelSync; static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */ +static PendingCleanup *pendingCleanups = NULL; /* head of linked list */ HTAB *pendingSyncHash = NULL; @@ -117,7 +136,8 @@ AddPendingSync(const RelFileNode *rnode) SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence) { - PendingRelDelete *pending; 
+ PendingRelDelete *pendingdel; + PendingCleanup *pendingclean; SMgrRelation srel; BackendId backend; bool needs_wal; @@ -143,21 +163,41 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) return NULL; /* placate compiler */ } + /* + * We are going to create a new storage file. If server crashes before the + * current transaction ends the file needs to be cleaned up. The + * SMGR_MARK_UNCOMMITED mark file prompts that work at the next startup. + */ srel = smgropen(rnode, backend); + log_smgrcreatemark(&rnode, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED); + smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false); smgrcreate(srel, MAIN_FORKNUM, false); if (needs_wal) log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM); /* Add the relation to the list of stuff to delete at abort */ - pending = (PendingRelDelete *) + pendingdel = (PendingRelDelete *) MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); - pending->relnode = rnode; - pending->backend = backend; - pending->atCommit = false; /* delete if abort */ - pending->nestLevel = GetCurrentTransactionNestLevel(); - pending->next = pendingDeletes; - pendingDeletes = pending; + pendingdel->relnode = rnode; + pendingdel->backend = backend; + pendingdel->atCommit = false; /* delete if abort */ + pendingdel->nestLevel = GetCurrentTransactionNestLevel(); + pendingdel->next = pendingDeletes; + pendingDeletes = pendingdel; + + /* drop mark files at commit */ + pendingclean = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pendingclean->relnode = rnode; + pendingclean->op = PCOP_UNLINK_MARK; + pendingclean->unlink_forknum = MAIN_FORKNUM; + pendingclean->unlink_mark = SMGR_MARK_UNCOMMITTED; + pendingclean->backend = backend; + pendingclean->atCommit = true; + pendingclean->nestLevel = GetCurrentTransactionNestLevel(); + pendingclean->next = pendingCleanups; + pendingCleanups = pendingclean; if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded()) { @@ 
-168,6 +208,203 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) return srel; } +/* + * RelationCreateInitFork + * Create physical storage for the init fork of a relation. + * + * Create the init fork for the relation. + * + * This function is transactional. The creation is WAL-logged, and if the + * transaction aborts later on, the init fork will be removed. + */ +void +RelationCreateInitFork(Relation rel) +{ + RelFileNode rnode = rel->rd_node; + PendingCleanup *pending; + PendingCleanup *prev; + PendingCleanup *next; + SMgrRelation srel; + bool create = true; + + /* switch buffer persistence */ + SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false); + + /* + * If we have entries for init-fork operations on this relation, that means + * that we have already registered pending delete entries to drop an + * init-fork preexisting since before the current transaction started. This + * function reverts that change just by removing the entries. + */ + prev = NULL; + for (pending = pendingCleanups; pending != NULL; pending = next) + { + next = pending->next; + + if (RelFileNodeEquals(rnode, pending->relnode) && + pending->unlink_forknum == INIT_FORKNUM) + { + if (prev) + prev->next = next; + else + pendingCleanups = next; + + pfree(pending); + /* prev does not change */ + + create = false; + } + else + prev = pending; + } + + if (!create) + return; + + /* + * We are going to create an init fork. If server crashes before the + * current transaction ends the init fork left alone corrupts data while + * recovery. The mark file works as the sentinel to identify that + * situation. + */ + srel = smgropen(rnode, InvalidBackendId); + log_smgrcreatemark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED); + smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false); + + /* We don't have existing init fork, create it. */ + smgrcreate(srel, INIT_FORKNUM, false); + + /* + * index-init fork needs further initialization. 
ambuildempty should do + * WAL-log and file sync by itself but otherwise we do that by ourselves. + */ + if (rel->rd_rel->relkind == RELKIND_INDEX) + rel->rd_indam->ambuildempty(rel); + else + { + log_smgrcreate(&rnode, INIT_FORKNUM); + smgrimmedsync(srel, INIT_FORKNUM); + } + + /* drop the init fork, mark file and revert persistence at abort */ + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = rnode; + pending->op = PCOP_UNLINK_FORK | PCOP_UNLINK_MARK | PCOP_SET_PERSISTENCE; + pending->unlink_forknum = INIT_FORKNUM; + pending->unlink_mark = SMGR_MARK_UNCOMMITTED; + pending->bufpersistence = true; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; + + /* drop mark file at commit */ + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = rnode; + pending->op = PCOP_UNLINK_MARK; + pending->unlink_forknum = INIT_FORKNUM; + pending->unlink_mark = SMGR_MARK_UNCOMMITTED; + pending->backend = InvalidBackendId; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; +} + +/* + * RelationDropInitFork + * Delete physical storage for the init fork of a relation. + * + * Register pending-delete of the init fork. The real deletion is performed by + * smgrDoPendingDeletes at commit. + * + * This function is transactional. If the transaction aborts later on, the + * deletion doesn't happen. 
+ */ +void +RelationDropInitFork(Relation rel) +{ + RelFileNode rnode = rel->rd_node; + PendingCleanup *pending; + PendingCleanup *prev; + PendingCleanup *next; + bool inxact_created = false; + + /* switch buffer persistence */ + SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false); + + /* + * If we have entries for init-fork operations of this relation, that means + * that we have created the init fork in the current transaction. We + * remove the init fork and mark file immediately in that case. Otherwise + * just register pending-delete for the existing init fork. + */ + prev = NULL; + for (pending = pendingCleanups; pending != NULL; pending = next) + { + next = pending->next; + + if (RelFileNodeEquals(rnode, pending->relnode) && + pending->unlink_forknum != INIT_FORKNUM) + { + /* unlink list entry */ + if (prev) + prev->next = next; + else + pendingCleanups = next; + + pfree(pending); + /* prev does not change */ + + inxact_created = true; + } + else + prev = pending; + } + + if (inxact_created) + { + SMgrRelation srel = smgropen(rnode, InvalidBackendId); + + /* + * INIT forks never be loaded to shared buffer so no point in dropping + * buffers for such files. 
+ */ + log_smgrunlinkmark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED); + smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false); + log_smgrunlink(&rnode, INIT_FORKNUM); + smgrunlink(srel, INIT_FORKNUM, false); + return; + } + + /* register drop of this init fork file at commit */ + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = rnode; + pending->op = PCOP_UNLINK_FORK; + pending->unlink_forknum = INIT_FORKNUM; + pending->backend = InvalidBackendId; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; + + /* revert buffer-persistence changes at abort */ + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = rnode; + pending->op = PCOP_SET_PERSISTENCE; + pending->bufpersistence = false; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; +} + /* * Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL. */ @@ -187,6 +424,88 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum) XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE); } +/* + * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL. + */ +void +log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum) +{ + xl_smgr_unlink xlrec; + + /* + * Make an XLOG entry reporting the file unlink. + */ + xlrec.rnode = *rnode; + xlrec.forkNum = forkNum; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE); +} + +/* + * Perform XLogInsert of an XLOG_SMGR_CREATEMARK record to WAL. 
+ */ +void +log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum, + StorageMarks mark) +{ + xl_smgr_mark xlrec; + + /* + * Make an XLOG entry reporting the mark file creation. + */ + xlrec.rnode = *rnode; + xlrec.forkNum = forkNum; + xlrec.mark = mark; + xlrec.action = XLOG_SMGR_MARK_CREATE; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE); +} + +/* + * Perform XLogInsert of an XLOG_SMGR_UNLINKMARK record to WAL. + */ +void +log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum, + StorageMarks mark) +{ + xl_smgr_mark xlrec; + + /* + * Make an XLOG entry reporting the mark file removal. + */ + xlrec.rnode = *rnode; + xlrec.forkNum = forkNum; + xlrec.mark = mark; + xlrec.action = XLOG_SMGR_MARK_UNLINK; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE); +} + +/* + * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL. + */ +void +log_smgrbufpersistence(const RelFileNode *rnode, bool persistence) +{ + xl_smgr_bufpersistence xlrec; + + /* + * Make an XLOG entry reporting the change of buffer persistence. + */ + xlrec.rnode = *rnode; + xlrec.persistence = persistence; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE); +} + /* * RelationDropStorage * Schedule unlinking of physical storage at transaction commit. @@ -255,6 +574,7 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit) prev->next = next; else pendingDeletes = next; + pfree(pending); /* prev does not change */ } @@ -673,6 +993,88 @@ smgrDoPendingDeletes(bool isCommit) } } +/* + * smgrDoPendingCleanups() -- Clean up work that emits WAL records + * + * The operations handled in this function emit WAL records, which must be + * emitted before the commit record for the current transaction. 
+ */ +void +smgrDoPendingCleanups(bool isCommit) +{ + int nestLevel = GetCurrentTransactionNestLevel(); + PendingCleanup *pending; + PendingCleanup *prev; + PendingCleanup *next; + + prev = NULL; + for (pending = pendingCleanups; pending != NULL; pending = next) + { + next = pending->next; + if (pending->nestLevel < nestLevel) + { + /* outer-level entries should not be processed yet */ + prev = pending; + } + else + { + /* unlink list entry first, so we don't retry on failure */ + if (prev) + prev->next = next; + else + pendingCleanups = next; + + /* do cleanup if called for */ + if (pending->atCommit == isCommit) + { + SMgrRelation srel; + + srel = smgropen(pending->relnode, pending->backend); + + Assert ((pending->op & + ~(PCOP_UNLINK_FORK | PCOP_UNLINK_MARK | + PCOP_SET_PERSISTENCE)) == 0); + + if (pending->op & PCOP_UNLINK_FORK) + { + /* other forks needs to drop buffers */ + Assert(pending->unlink_forknum == INIT_FORKNUM); + + /* Don't emit wal while recovery. */ + if (!InRecovery) + log_smgrunlink(&pending->relnode, + pending->unlink_forknum); + smgrunlink(srel, pending->unlink_forknum, false); + } + + if (pending->op & PCOP_UNLINK_MARK) + { + SMgrRelation srel; + + if (!InRecovery) + log_smgrunlinkmark(&pending->relnode, + pending->unlink_forknum, + pending->unlink_mark); + srel = smgropen(pending->relnode, pending->backend); + smgrunlinkmark(srel, pending->unlink_forknum, + pending->unlink_mark, InRecovery); + smgrclose(srel); + } + + if (pending->op & PCOP_SET_PERSISTENCE) + { + SetRelationBuffersPersistence(srel, pending->bufpersistence, + InRecovery); + } + } + + /* must explicitly free the list entry */ + pfree(pending); + /* prev does not change */ + } + } +} + /* * smgrDoPendingSyncs() -- Take care of relation syncs at end of xact. 
*/ @@ -933,6 +1335,15 @@ smgr_redo(XLogReaderState *record) reln = smgropen(xlrec->rnode, InvalidBackendId); smgrcreate(reln, xlrec->forkNum, true); } + else if (info == XLOG_SMGR_UNLINK) + { + xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record); + SMgrRelation reln; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + smgrunlink(reln, xlrec->forkNum, true); + smgrclose(reln); + } else if (info == XLOG_SMGR_TRUNCATE) { xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record); @@ -1021,6 +1432,124 @@ smgr_redo(XLogReaderState *record) FreeFakeRelcacheEntry(rel); } + else if (info == XLOG_SMGR_MARK) + { + xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record); + SMgrRelation reln; + PendingCleanup *pending; + bool created = false; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + + switch (xlrec->action) + { + case XLOG_SMGR_MARK_CREATE: + smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true); + created = true; + break; + case XLOG_SMGR_MARK_UNLINK: + smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true); + break; + default: + elog(ERROR, "unknown smgr_mark action \"%c\"", xlrec->mark); + } + + if (created) + { + /* revert mark file operation at abort */ + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = xlrec->rnode; + pending->op = PCOP_UNLINK_MARK; + pending->unlink_forknum = xlrec->forkNum; + pending->unlink_mark = xlrec->mark; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; + } + else + { + /* + * Delete pending action for this mark file if any. We should have + * at most one entry for this action. 
+ */ + PendingCleanup *prev = NULL; + + for (pending = pendingCleanups; pending != NULL; + pending = pending->next) + { + if (RelFileNodeEquals(xlrec->rnode, pending->relnode) && + pending->unlink_forknum == xlrec->forkNum && + (pending->op & PCOP_UNLINK_MARK) != 0) + { + if (prev) + prev->next = pending->next; + else + pendingCleanups = pending->next; + + pfree(pending); + break; + } + + prev = pending; + } + } + } + else if (info == XLOG_SMGR_BUFPERSISTENCE) + { + xl_smgr_bufpersistence *xlrec = + (xl_smgr_bufpersistence *) XLogRecGetData(record); + SMgrRelation reln; + PendingCleanup *pending; + PendingCleanup *prev = NULL; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + SetRelationBuffersPersistence(reln, xlrec->persistence, true); + + /* + * Delete pending action for persistence change if any. We should have + * at most one entry for this action. + */ + for (pending = pendingCleanups; pending != NULL; + pending = pending->next) + { + if (RelFileNodeEquals(xlrec->rnode, pending->relnode) && + (pending->op & PCOP_SET_PERSISTENCE) != 0) + { + Assert (pending->bufpersistence == xlrec->persistence); + + if (prev) + prev->next = pending->next; + else + pendingCleanups = pending->next; + + pfree(pending); + break; + } + + prev = pending; + } + + /* + * Revert buffer-persistence changes at abort if the relation is going + * to different persistence from before this transaction. 
+ */ + if (!pending) + { + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = xlrec->rnode; + pending->op = PCOP_SET_PERSISTENCE; + pending->bufpersistence = !xlrec->persistence; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; + } + } else elog(PANIC, "smgr_redo: unknown op code %u", info); } diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 89bc865e28..51fcf9ca5f 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -52,6 +52,7 @@ #include "commands/defrem.h" #include "commands/event_trigger.h" #include "commands/policy.h" +#include "commands/progress.h" #include "commands/sequence.h" #include "commands/tablecmds.h" #include "commands/tablespace.h" @@ -5346,6 +5347,187 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel, return newcmd; } +/* + * RelationChangePersistence: do in-place persistence change of a relation + */ +static void +RelationChangePersistence(AlteredTableInfo *tab, char persistence, + LOCKMODE lockmode) +{ + Relation rel; + Relation classRel; + HeapTuple tuple, + newtuple; + Datum new_val[Natts_pg_class]; + bool new_null[Natts_pg_class], + new_repl[Natts_pg_class]; + int i; + List *relids; + ListCell *lc_oid; + + Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE); + Assert(lockmode == AccessExclusiveLock); + + /* + * Under the following condition, we need to call ATRewriteTable, which + * cannot be false in the AT_REWRITE_ALTER_PERSISTENCE case. 
+ */ + Assert(tab->constraints == NULL && tab->partition_constraint == NULL && + tab->newvals == NULL && !tab->verify_new_notnull); + + rel = table_open(tab->relid, lockmode); + + Assert(rel->rd_rel->relpersistence != persistence); + + elog(DEBUG1, "perform in-place persistence change"); + + /* + * First we collect all relations that we need to change persistence. + */ + + /* Collect OIDs of indexes and toast relations */ + relids = RelationGetIndexList(rel); + relids = lcons_oid(rel->rd_id, relids); + + /* Add toast relation if any */ + if (OidIsValid(rel->rd_rel->reltoastrelid)) + { + List *toastidx; + Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode); + + relids = lappend_oid(relids, rel->rd_rel->reltoastrelid); + toastidx = RelationGetIndexList(toastrel); + relids = list_concat(relids, toastidx); + pfree(toastidx); + table_close(toastrel, NoLock); + } + + table_close(rel, NoLock); + + /* Make changes in storage */ + classRel = table_open(RelationRelationId, RowExclusiveLock); + + foreach (lc_oid, relids) + { + Oid reloid = lfirst_oid(lc_oid); + Relation r = relation_open(reloid, lockmode); + + /* + * XXXX: Some access methods do not bear up an in-place persistence + * change. Specifically, GiST uses page LSNs to figure out whether a + * block has changed, where UNLOGGED GiST indexes use fake LSNs that + * are incompatible with real LSNs used for LOGGED ones. + * + * Maybe if gistGetFakeLSN behaved the same way for permanent and + * unlogged indexes, we could skip index rebuild in exchange of some + * extra WAL records emitted while it is unlogged. + * + * Check relam against a positive list so that we take this way for + * unknown AMs. 
+ */ + if (r->rd_rel->relkind == RELKIND_INDEX && + /* GiST is excluded */ + r->rd_rel->relam != BTREE_AM_OID && + r->rd_rel->relam != HASH_AM_OID && + r->rd_rel->relam != GIN_AM_OID && + r->rd_rel->relam != SPGIST_AM_OID && + r->rd_rel->relam != BRIN_AM_OID) + { + int reindex_flags; + ReindexParams params = {0}; + + /* reindex doesn't allow concurrent use of the index */ + table_close(r, NoLock); + + reindex_flags = + REINDEX_REL_SUPPRESS_INDEX_USE | + REINDEX_REL_CHECK_CONSTRAINTS; + + /* Set the same persistence with the parent relation. */ + if (persistence == RELPERSISTENCE_UNLOGGED) + reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED; + else + reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT; + + reindex_index(reloid, reindex_flags, persistence, &params); + + continue; + } + + /* Create or drop init fork */ + if (persistence == RELPERSISTENCE_UNLOGGED) + RelationCreateInitFork(r); + else + RelationDropInitFork(r); + + /* + * When this relation gets WAL-logged, immediately sync all files but + * initfork to establish the initial state on storage. Buffers have + * already flushed out by RelationCreate(Drop)InitFork called just + * above. Initfork should have been synced as needed. 
+ */ + if (persistence == RELPERSISTENCE_PERMANENT) + { + for (i = 0 ; i < INIT_FORKNUM ; i++) + { + if (smgrexists(RelationGetSmgr(r), i)) + smgrimmedsync(RelationGetSmgr(r), i); + } + } + + /* Update catalog */ + tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid)); + if (!HeapTupleIsValid(tuple)) + elog(ERROR, "cache lookup failed for relation %u", reloid); + + memset(new_val, 0, sizeof(new_val)); + memset(new_null, false, sizeof(new_null)); + memset(new_repl, false, sizeof(new_repl)); + + new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence); + new_null[Anum_pg_class_relpersistence - 1] = false; + new_repl[Anum_pg_class_relpersistence - 1] = true; + + newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel), + new_val, new_null, new_repl); + + CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple); + heap_freetuple(newtuple); + + /* + * While wal_level >= replica, switching to LOGGED requires the + * relation content to be WAL-logged to recover the table. + * We don't emit this while wal_level = minimal. + */ + if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded()) + { + ForkNumber fork; + xl_smgr_truncate xlrec; + + xlrec.blkno = 0; + xlrec.rnode = r->rd_node; + xlrec.flags = SMGR_TRUNCATE_ALL; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + + XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE); + + for (fork = 0; fork < INIT_FORKNUM ; fork++) + { + if (smgrexists(RelationGetSmgr(r), fork)) + log_newpage_range(r, fork, 0, + smgrnblocks(RelationGetSmgr(r), fork), + false); + } + } + + table_close(r, NoLock); + } + + table_close(classRel, NoLock); +} + /* * ATRewriteTables: ALTER TABLE phase 3 */ @@ -5476,47 +5658,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode, tab->relid, tab->rewrite); - /* - * Create transient table that will receive the modified data. - * - * Ensure it is marked correctly as logged or unlogged. 
We have - * to do this here so that buffers for the new relfilenode will - * have the right persistence set, and at the same time ensure - * that the original filenode's buffers will get read in with the - * correct setting (i.e. the original one). Otherwise a rollback - * after the rewrite would possibly result with buffers for the - * original filenode having the wrong persistence setting. - * - * NB: This relies on swap_relation_files() also swapping the - * persistence. That wouldn't work for pg_class, but that can't be - * unlogged anyway. - */ - OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod, - persistence, lockmode); + if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE) + RelationChangePersistence(tab, persistence, lockmode); + else + { + /* + * Create transient table that will receive the modified data. + * + * Ensure it is marked correctly as logged or unlogged. We + * have to do this here so that buffers for the new relfilenode + * will have the right persistence set, and at the same time + * ensure that the original filenode's buffers will get read in + * with the correct setting (i.e. the original one). Otherwise + * a rollback after the rewrite would possibly result with + * buffers for the original filenode having the wrong + * persistence setting. + * + * NB: This relies on swap_relation_files() also swapping the + * persistence. That wouldn't work for pg_class, but that can't + * be unlogged anyway. + */ + OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, + NewAccessMethod, + persistence, lockmode); - /* - * Copy the heap data into the new table with the desired - * modifications, and test the current data within the table - * against new constraints generated by ALTER TABLE commands. - */ - ATRewriteTable(tab, OIDNewHeap, lockmode); + /* + * Copy the heap data into the new table with the desired + * modifications, and test the current data within the table + * against new constraints generated by ALTER TABLE commands. 
+ */ + ATRewriteTable(tab, OIDNewHeap, lockmode); - /* - * Swap the physical files of the old and new heaps, then rebuild - * indexes and discard the old heap. We can use RecentXmin for - * the table's new relfrozenxid because we rewrote all the tuples - * in ATRewriteTable, so no older Xid remains in the table. Also, - * we never try to swap toast tables by content, since we have no - * interest in letting this code work on system catalogs. - */ - finish_heap_swap(tab->relid, OIDNewHeap, - false, false, true, - !OidIsValid(tab->newTableSpace), - RecentXmin, - ReadNextMultiXactId(), - persistence); + /* + * Swap the physical files of the old and new heaps, then + * rebuild indexes and discard the old heap. We can use + * RecentXmin for the table's new relfrozenxid because we + * rewrote all the tuples in ATRewriteTable, so no older Xid + * remains in the table. Also, we never try to swap toast + * tables by content, since we have no interest in letting this + * code work on system catalogs. 
+ */ + finish_heap_swap(tab->relid, OIDNewHeap, + false, false, true, + !OidIsValid(tab->newTableSpace), + RecentXmin, + ReadNextMultiXactId(), + persistence); - InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0); + InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0); + } } else { diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c index ec0485705d..45e1a5d817 100644 --- a/src/backend/replication/basebackup.c +++ b/src/backend/replication/basebackup.c @@ -1070,6 +1070,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly, bool excludeFound; ForkNumber relForkNum; /* Type of fork if file is a relation */ int relOidChars; /* Chars in filename that are the rel oid */ + StorageMarks mark; /* Skip special stuff */ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0) @@ -1120,7 +1121,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly, /* Exclude all forks for unlogged tables except the init fork */ if (isDbDir && parse_filename_for_nontemp_relation(de->d_name, &relOidChars, - &relForkNum)) + &relForkNum, &mark)) { /* Never exclude init forks */ if (relForkNum != INIT_FORKNUM) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index b4532948d3..dab74bf99a 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -37,6 +37,7 @@ #include "access/xlogutils.h" #include "catalog/catalog.h" #include "catalog/storage.h" +#include "catalog/storage_xlog.h" #include "executor/instrument.h" #include "lib/binaryheap.h" #include "miscadmin.h" @@ -3154,6 +3155,93 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum, } } +/* --------------------------------------------------------------------- + * SetRelFileNodeBuffersPersistence + * + * This function changes the persistence of all buffer pages of a relation + * then writes all dirty pages of the relation out to disk when switching + 
* to PERMANENT. (or more accurately, out to kernel disk buffers), + * ensuring that the kernel has an up-to-date view of the relation. + * + * Generally, the caller should be holding AccessExclusiveLock on the + * target relation to ensure that no other backend is busy dirtying + * more blocks of the relation; the effects can't be expected to last + * after the lock is released. + * + * XXX currently it sequentially searches the buffer pool, should be + * changed to more clever ways of searching. This routine is not + * used in any performance-critical code paths, so it's not worth + * adding additional overhead to normal paths to make it go faster; + * but see also DropRelFileNodeBuffers. + * -------------------------------------------------------------------- + */ +void +SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo) +{ + int i; + RelFileNodeBackend rnode = srel->smgr_rnode; + + Assert (!RelFileNodeBackendIsTemp(rnode)); + + if (!isRedo) + log_smgrbufpersistence(&srel->smgr_rnode.node, permanent); + + ResourceOwnerEnlargeBuffers(CurrentResourceOwner); + + for (i = 0; i < NBuffers; i++) + { + BufferDesc *bufHdr = GetBufferDescriptor(i); + uint32 buf_state; + + if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node)) + continue; + + ReservePrivateRefCountEntry(); + + buf_state = LockBufHdr(bufHdr); + + if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node)) + { + UnlockBufHdr(bufHdr, buf_state); + continue; + } + + if (permanent) + { + /* Init fork is being dropped, drop buffers for it. 
*/ + if (bufHdr->tag.forkNum == INIT_FORKNUM) + { + InvalidateBuffer(bufHdr); + continue; + } + + buf_state |= BM_PERMANENT; + pg_atomic_write_u32(&bufHdr->state, buf_state); + + /* we flush this buffer when switching to PERMANENT */ + if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY)) + { + PinBuffer_Locked(bufHdr); + LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), + LW_SHARED); + FlushBuffer(bufHdr, srel); + LWLockRelease(BufferDescriptorGetContentLock(bufHdr)); + UnpinBuffer(bufHdr, true); + } + else + UnlockBufHdr(bufHdr, buf_state); + } + else + { + /* init fork is always BM_PERMANENT. See BufferAlloc */ + if (bufHdr->tag.forkNum != INIT_FORKNUM) + buf_state &= ~BM_PERMANENT; + + UnlockBufHdr(bufHdr, buf_state); + } + } +} + /* --------------------------------------------------------------------- * DropRelFileNodesAllBuffers * diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index 263057841d..8487ae1f02 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -349,8 +349,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel); static void datadir_fsync_fname(const char *fname, bool isdir, int elevel); static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel); -static int fsync_parent_path(const char *fname, int elevel); - /* * pg_fsync --- do fsync with or without writethrough @@ -3759,7 +3757,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel) * This is aimed at making file operations persistent on disk in case of * an OS crash or power failure. 
*/ -static int +int fsync_parent_path(const char *fname, int elevel) { char parentpath[MAXPGPATH]; diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c index 0ae3fb6902..0137902bb2 100644 --- a/src/backend/storage/file/reinit.c +++ b/src/backend/storage/file/reinit.c @@ -16,29 +16,49 @@ #include <unistd.h> +#include "access/xlog.h" +#include "catalog/pg_tablespace_d.h" #include "common/relpath.h" #include "postmaster/startup.h" +#include "storage/bufmgr.h" #include "storage/copydir.h" #include "storage/fd.h" +#include "storage/md.h" #include "storage/reinit.h" +#include "storage/smgr.h" #include "utils/hsearch.h" #include "utils/memutils.h" static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, - int op); + Oid tspid, int op); static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, - int op); + Oid tspid, Oid dbid, int op); typedef struct { Oid reloid; /* hash key */ -} unlogged_relation_entry; + bool has_init; /* has INIT fork */ + bool dirty_init; /* needs to remove INIT fork */ + bool dirty_all; /* needs to remove all forks */ +} relfile_entry; /* - * Reset unlogged relations from before the last restart. + * Clean up and reset relation files from before the last restart. * - * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any - * relation with an "init" fork, except for the "init" fork itself. + * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations + * depending on the existence of the "cleanup" forks. + * + * If SMGR_MARK_UNCOMMITTED mark file for init fork is present, we remove the + * init fork along with the mark file. + * + * If SMGR_MARK_UNCOMMITTED mark file for main fork is present we remove the + * whole relation along with the mark file. + * + * Otherwise, if the "init" fork is found. we remove all forks of any relation + * with the "init" fork, except for the "init" fork itself. 
+ * + * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all + * relations that have the "cleanup" and/or the "init" forks. * * If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main * fork. @@ -72,7 +92,7 @@ ResetUnloggedRelations(int op) /* * First process unlogged files in pg_default ($PGDATA/base) */ - ResetUnloggedRelationsInTablespaceDir("base", op); + ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op); /* * Cycle through directories for all non-default tablespaces. @@ -81,13 +101,19 @@ ResetUnloggedRelations(int op) while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL) { + Oid tspid; + if (strcmp(spc_de->d_name, ".") == 0 || strcmp(spc_de->d_name, "..") == 0) continue; snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s", spc_de->d_name, TABLESPACE_VERSION_DIRECTORY); - ResetUnloggedRelationsInTablespaceDir(temp_path, op); + + tspid = atooid(spc_de->d_name); + + Assert(tspid != 0); + ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op); } FreeDir(spc_dir); @@ -103,7 +129,8 @@ ResetUnloggedRelations(int op) * Process one tablespace directory for ResetUnloggedRelations */ static void -ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) +ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, + Oid tspid, int op) { DIR *ts_dir; struct dirent *de; @@ -130,6 +157,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) while ((de = ReadDir(ts_dir, tsdirname)) != NULL) { + Oid dbid; + /* * We're only interested in the per-database directories, which have * numeric names. Note that this code will also (properly) ignore "." 
@@ -148,7 +177,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) ereport_startup_progress("resetting unlogged relations (cleanup), elapsed time: %ld.%02d s, current path: %s", dbspace_path); - ResetUnloggedRelationsInDbspaceDir(dbspace_path, op); + dbid = atooid(de->d_name); + Assert(dbid != 0); + + ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op); } FreeDir(ts_dir); @@ -158,125 +190,227 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) * Process one per-dbspace directory for ResetUnloggedRelations */ static void -ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) +ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, + Oid tspid, Oid dbid, int op) { DIR *dbspace_dir; struct dirent *de; char rm_path[MAXPGPATH * 2]; + HTAB *hash; + HASHCTL ctl; /* Caller must specify at least one operation. */ - Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0); + Assert((op & (UNLOGGED_RELATION_CLEANUP | + UNLOGGED_RELATION_DROP_BUFFER | + UNLOGGED_RELATION_INIT)) != 0); /* * Cleanup is a two-pass operation. First, we go through and identify all * the files with init forks. Then, we go through again and nuke * everything with the same OID except the init fork. */ + + /* + * It's possible that someone could create tons of unlogged relations in + * the same database & tablespace, so we'd better use a hash table rather + * than an array or linked list to keep track of which files need to be + * reset. Otherwise, this cleanup operation would be O(n^2). + */ + memset(&ctl, 0, sizeof(ctl)); + ctl.keysize = sizeof(Oid); + ctl.entrysize = sizeof(relfile_entry); + hash = hash_create("relfilenode cleanup hash", + 32, &ctl, HASH_ELEM | HASH_BLOBS); + + /* Collect INIT fork and mark files in the directory. 
*/ + dbspace_dir = AllocateDir(dbspacedirname); + while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) + { + int oidchars; + ForkNumber forkNum; + StorageMarks mark; + + /* Skip anything that doesn't look like a relation data file. */ + if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, + &forkNum, &mark)) + continue; + + if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED) + { + Oid key; + relfile_entry *ent; + bool found; + + /* + * Record the relfilenode information. If it has + * SMGR_MARK_UNCOMMITTED mark files, the relfilenode is in dirty + * state, where clean up is needed. + */ + key = atooid(de->d_name); + ent = hash_search(hash, &key, HASH_ENTER, &found); + + if (!found) + { + ent->has_init = false; + ent->dirty_init = false; + ent->dirty_all = false; + } + + if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED) + ent->dirty_init = true; + else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED) + ent->dirty_all = true; + else + { + Assert(forkNum == INIT_FORKNUM); + ent->has_init = true; + } + } + } + + /* Done with the first pass. */ + FreeDir(dbspace_dir); + + /* nothing to do if we don't have init nor cleanup forks */ + if (hash_get_num_entries(hash) < 1) + { + hash_destroy(hash); + return; + } + + if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0) + { + /* + * When we come here after recovery, smgr object for this file might + * have been created. In that case we need to drop all buffers then the + * smgr object before initializing the unlogged relation. This is safe + * as far as no other backends have accessed the relation before + * starting archive recovery. 
+ */ + HASH_SEQ_STATUS status; + relfile_entry *ent; + SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8); + int maxrels = 8; + int nrels = 0; + int i; + + Assert(!HotStandbyActive()); + + hash_seq_init(&status, hash); + while ((ent = (relfile_entry *) hash_seq_search(&status)) != NULL) + { + RelFileNodeBackend rel; + + /* + * The relation is persistent and remains persistent. Don't + * drop the buffers for this relation. + */ + if (ent->has_init && ent->dirty_init) + continue; + + if (maxrels <= nrels) + { + maxrels *= 2; + srels = repalloc(srels, sizeof(SMgrRelation) * maxrels); + } + + rel.backend = InvalidBackendId; + rel.node.spcNode = tspid; + rel.node.dbNode = dbid; + rel.node.relNode = ent->reloid; + + srels[nrels++] = smgropen(rel.node, InvalidBackendId); + } + + DropRelFileNodesAllBuffers(srels, nrels); + + for (i = 0; i < nrels; i++) + smgrclose(srels[i]); + } + + /* + * Now, make a second pass and remove anything that matches. + */ if ((op & UNLOGGED_RELATION_CLEANUP) != 0) { - HTAB *hash; - HASHCTL ctl; - - /* - * It's possible that someone could create a ton of unlogged relations - * in the same database & tablespace, so we'd better use a hash table - * rather than an array or linked list to keep track of which files - * need to be reset. Otherwise, this cleanup operation would be - * O(n^2). - */ - ctl.keysize = sizeof(Oid); - ctl.entrysize = sizeof(unlogged_relation_entry); - ctl.hcxt = CurrentMemoryContext; - hash = hash_create("unlogged relation OIDs", 32, &ctl, - HASH_ELEM | HASH_BLOBS | HASH_CONTEXT); - - /* Scan the directory. */ dbspace_dir = AllocateDir(dbspacedirname); while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) { - ForkNumber forkNum; - int oidchars; - unlogged_relation_entry ent; + ForkNumber forkNum; + StorageMarks mark; + int oidchars; + Oid key; + relfile_entry *ent; + RelFileNodeBackend rel; /* Skip anything that doesn't look like a relation data file. 
*/ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) - continue; - - /* Also skip it unless this is the init fork. */ - if (forkNum != INIT_FORKNUM) - continue; - - /* - * Put the OID portion of the name into the hash table, if it - * isn't already. - */ - ent.reloid = atooid(de->d_name); - (void) hash_search(hash, &ent, HASH_ENTER, NULL); - } - - /* Done with the first pass. */ - FreeDir(dbspace_dir); - - /* - * If we didn't find any init forks, there's no point in continuing; - * we can bail out now. - */ - if (hash_get_num_entries(hash) == 0) - { - hash_destroy(hash); - return; - } - - /* - * Now, make a second pass and remove anything that matches. - */ - dbspace_dir = AllocateDir(dbspacedirname); - while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) - { - ForkNumber forkNum; - int oidchars; - unlogged_relation_entry ent; - - /* Skip anything that doesn't look like a relation data file. */ - if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) - continue; - - /* We never remove the init fork. */ - if (forkNum == INIT_FORKNUM) + &forkNum, &mark)) continue; /* * See whether the OID portion of the name shows up in the hash * table. If so, nuke it! */ - ent.reloid = atooid(de->d_name); - if (hash_search(hash, &ent, HASH_FIND, NULL)) + key = atooid(de->d_name); + ent = hash_search(hash, &key, HASH_FIND, NULL); + + if (!ent) + continue; + + if (!ent->dirty_all) { - snprintf(rm_path, sizeof(rm_path), "%s/%s", - dbspacedirname, de->d_name); - if (unlink(rm_path) < 0) - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not remove file \"%s\": %m", - rm_path))); + /* clean permanent relations don't need cleanup */ + if (!ent->has_init) + continue; + + if (ent->dirty_init) + { + /* + * The crashed transaction did SET UNLOGGED. This relation + * is restored to a LOGGED relation. 
+ */ + if (forkNum != INIT_FORKNUM) + continue; + } else - elog(DEBUG2, "unlinked file \"%s\"", rm_path); + { + /* + * we don't remove the INIT fork of a non-dirty + * relfilenode + */ + if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE) + continue; + } } + + /* so, nuke it! */ + snprintf(rm_path, sizeof(rm_path), "%s/%s", + dbspacedirname, de->d_name); + if (unlink(rm_path) < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not remove file \"%s\": %m", + rm_path))); + + rel.backend = InvalidBackendId; + rel.node.spcNode = tspid; + rel.node.dbNode = dbid; + rel.node.relNode = atooid(de->d_name); + + ForgetRelationForkSyncRequests(rel, forkNum); } /* Cleanup is complete. */ FreeDir(dbspace_dir); - hash_destroy(hash); } + hash_destroy(hash); + hash = NULL; + /* * Initialization happens after cleanup is complete: we copy each init - * fork file to the corresponding main fork file. Note that if we are - * asked to do both cleanup and init, we may never get here: if the - * cleanup code determines that there are no init forks in this dbspace, - * it will return before we get to this point. + * fork file to the corresponding main fork file. */ if ((op & UNLOGGED_RELATION_INIT) != 0) { @@ -285,6 +419,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) { ForkNumber forkNum; + StorageMarks mark; int oidchars; char oidbuf[OIDCHARS + 1]; char srcpath[MAXPGPATH * 2]; @@ -292,9 +427,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) /* Skip anything that doesn't look like a relation data file. */ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) + &forkNum, &mark)) continue; + Assert(mark == SMGR_MARK_NONE); + /* Also skip it unless this is the init fork. 
*/ if (forkNum != INIT_FORKNUM) continue; @@ -328,15 +465,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) { ForkNumber forkNum; + StorageMarks mark; int oidchars; char oidbuf[OIDCHARS + 1]; char mainpath[MAXPGPATH]; /* Skip anything that doesn't look like a relation data file. */ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) + &forkNum, &mark)) continue; + Assert(mark == SMGR_MARK_NONE); + /* Also skip it unless this is the init fork. */ if (forkNum != INIT_FORKNUM) continue; @@ -379,7 +519,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) */ bool parse_filename_for_nontemp_relation(const char *name, int *oidchars, - ForkNumber *fork) + ForkNumber *fork, StorageMarks *mark) { int pos; @@ -410,11 +550,19 @@ parse_filename_for_nontemp_relation(const char *name, int *oidchars, for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar) ; - if (segchar <= 1) - return false; - pos += segchar; + if (segchar > 1) + pos += segchar; } + /* mark file? */ + if (name[pos] == '.' && name[pos + 1] != 0) + { + *mark = name[pos + 1]; + pos += 2; + } + else + *mark = SMGR_MARK_NONE; + /* Now we should be at the end. */ if (name[pos] != '\0') return false; diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index b4bca7eed6..1f3aac5bcc 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -139,7 +139,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno, BlockNumber blkno, bool skipFsync, int behavior); static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg); - +static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum, + StorageMarks mark); /* * mdinit() -- Initialize private state for magnetic disk storage manager. 
@@ -169,6 +170,82 @@ mdexists(SMgrRelation reln, ForkNumber forkNum) return (mdopenfork(reln, forkNum, EXTENSION_RETURN_NULL) != NULL); } +/* + * mdcreatemark() -- Create a mark file. + * + * If isRedo is true, it's okay for the file to exist already. + */ +void +mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark, + bool isRedo) +{ + char *path = markpath(reln->smgr_rnode, forkNum, mark); + int fd; + + /* See mdcreate for details. */ + TablespaceCreateDbspace(reln->smgr_rnode.node.spcNode, + reln->smgr_rnode.node.dbNode, + isRedo); + + fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL); + if (fd < 0 && (!isRedo || errno != EEXIST)) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not create mark file \"%s\": %m", path))); + + pg_fsync(fd); + close(fd); + + /* + * To guarantee that the creation of the file is persistent, fsync its + * parent directory. + */ + fsync_parent_path(path, ERROR); + + pfree(path); +} + + +/* + * mdunlinkmark() -- Delete the mark file + * + * If isRedo is true, it's okay if the file is not found. + */ +void +mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark, + bool isRedo) +{ + char *path = markpath(reln->smgr_rnode, forkNum, mark); + + if (!isRedo || mdmarkexists(reln, forkNum, mark)) + durable_unlink(path, ERROR); + + pfree(path); +} + +/* + * mdmarkexists() -- Check if the file exists. + */ +static bool +mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark) +{ + char *path = markpath(reln->smgr_rnode, forkNum, mark); + int fd; + + fd = BasicOpenFile(path, O_RDONLY); + if (fd < 0 && errno != ENOENT) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not access mark file \"%s\": %m", path))); + pfree(path); + + if (fd < 0) + return false; + + close(fd); + return true; +} + /* * mdcreate() -- Create a new relation on magnetic disk. 
* @@ -1025,6 +1102,15 @@ register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum, RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ ); } +/* + * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork + */ +void +ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, ForkNumber forknum) +{ + register_forget_request(rnode, forknum, 0); +} + /* * ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB */ @@ -1378,12 +1464,14 @@ mdsyncfiletag(const FileTag *ftag, char *path) * Return 0 on success, -1 on failure, with errno set. */ int -mdunlinkfiletag(const FileTag *ftag, char *path) +mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark) { char *p; /* Compute the path. */ - p = relpathperm(ftag->rnode, MAIN_FORKNUM); + p = GetRelationPath(ftag->rnode.dbNode, ftag->rnode.spcNode, + ftag->rnode.relNode, InvalidBackendId, MAIN_FORKNUM, + mark); strlcpy(path, p, MAXPGPATH); pfree(p); diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c index 0fcef4994b..110e64b0b2 100644 --- a/src/backend/storage/smgr/smgr.c +++ b/src/backend/storage/smgr/smgr.c @@ -62,6 +62,10 @@ typedef struct f_smgr void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks); void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum); + void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); + void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); } f_smgr; static const f_smgr smgrsw[] = { @@ -82,6 +86,8 @@ static const f_smgr smgrsw[] = { .smgr_nblocks = mdnblocks, .smgr_truncate = mdtruncate, .smgr_immedsync = mdimmedsync, + .smgr_createmark = mdcreatemark, + .smgr_unlinkmark = mdunlinkmark, } }; @@ -335,6 +341,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo) smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo); } +/* + * smgrcreatemark() -- Create a mark 
file + */ +void +smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark, + bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo); +} + +/* + * smgrunlinkmark() -- Delete a mark file + */ +void +smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark, + bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo); +} + /* * smgrdosyncall() -- Immediately sync all forks of all given relations * @@ -662,6 +688,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum) smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum); } +void +smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo); +} + /* * AtEOXact_SMgr * diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c index d4083e8a56..9563940d45 100644 --- a/src/backend/storage/sync/sync.c +++ b/src/backend/storage/sync/sync.c @@ -89,7 +89,8 @@ static CycleCtr checkpoint_cycle_ctr = 0; typedef struct SyncOps { int (*sync_syncfiletag) (const FileTag *ftag, char *path); - int (*sync_unlinkfiletag) (const FileTag *ftag, char *path); + int (*sync_unlinkfiletag) (const FileTag *ftag, char *path, + StorageMarks mark); bool (*sync_filetagmatches) (const FileTag *ftag, const FileTag *candidate); } SyncOps; @@ -222,7 +223,8 @@ SyncPostCheckpoint(void) /* Unlink the file */ if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag, - path) < 0) + path, + SMGR_MARK_NONE) < 0) { /* * There's a race condition, when the database is dropped at the @@ -236,6 +238,20 @@ SyncPostCheckpoint(void) (errcode_for_file_access(), errmsg("could not remove file \"%s\": %m", path))); } + else if (syncsw[entry->tag.handler].sync_unlinkfiletag( + &entry->tag, path, + SMGR_MARK_UNCOMMITTED) < 0) + { + /* + * And we may have SMGR_MARK_UNCOMMITTED file. Remove it if the + * fork files has been successfully removed. 
It's ok if the file + * does not exist. + */ + if (errno != ENOENT) + ereport(WARNING, + (errcode_for_file_access(), + errmsg("could not remove file \"%s\": %m", path))); + } /* Mark the list entry as canceled, just in case */ entry->canceled = true; diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c index 436df54120..dbc0da5da5 100644 --- a/src/bin/pg_rewind/parsexlog.c +++ b/src/bin/pg_rewind/parsexlog.c @@ -407,6 +407,30 @@ extractPageInfo(XLogReaderState *record) * source system. */ } + else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK) + { + /* + * We can safely ignore these. When we compare the sizes later on, + * we'll notice that they differ, and copy the missing tail from + * source system. + */ + } + else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK) + { + /* + * We can safely ignore these. When we compare the sizes later on, + * we'll notice that they differ, and copy the missing tail from + * source system. + */ + } + else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE) + { + /* + * We can safely ignore these. When we compare the sizes later on, + * we'll notice that they differ, and copy the missing tail from + * source system. 
+ */ + } else if (rmid == RM_XACT_ID && ((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT || (rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED || diff --git a/src/common/relpath.c b/src/common/relpath.c index 1f5c426ec0..4945b111cc 100644 --- a/src/common/relpath.c +++ b/src/common/relpath.c @@ -139,9 +139,15 @@ GetDatabasePath(Oid dbNode, Oid spcNode) */ char * GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, - int backendId, ForkNumber forkNumber) + int backendId, ForkNumber forkNumber, char mark) { char *path; + char markstr[4]; + + if (mark == 0) + markstr[0] = 0; + else + snprintf(markstr, sizeof(markstr), ".%c", mark); if (spcNode == GLOBALTABLESPACE_OID) { @@ -149,10 +155,10 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, Assert(dbNode == 0); Assert(backendId == InvalidBackendId); if (forkNumber != MAIN_FORKNUM) - path = psprintf("global/%u_%s", - relNode, forkNames[forkNumber]); + path = psprintf("global/%u_%s%s", + relNode, forkNames[forkNumber], markstr); else - path = psprintf("global/%u", relNode); + path = psprintf("global/%u%s", relNode, markstr); } else if (spcNode == DEFAULTTABLESPACE_OID) { @@ -160,22 +166,22 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, if (backendId == InvalidBackendId) { if (forkNumber != MAIN_FORKNUM) - path = psprintf("base/%u/%u_%s", + path = psprintf("base/%u/%u_%s%s", dbNode, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("base/%u/%u", - dbNode, relNode); + path = psprintf("base/%u/%u%s", + dbNode, relNode, markstr); } else { if (forkNumber != MAIN_FORKNUM) - path = psprintf("base/%u/t%d_%u_%s", + path = psprintf("base/%u/t%d_%u_%s%s", dbNode, backendId, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("base/%u/t%d_%u", - dbNode, backendId, relNode); + path = psprintf("base/%u/t%d_%u%s", + dbNode, backendId, relNode, markstr); } } else @@ -184,27 +190,28 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid 
relNode, if (backendId == InvalidBackendId) { if (forkNumber != MAIN_FORKNUM) - path = psprintf("pg_tblspc/%u/%s/%u/%u_%s", + path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s", spcNode, TABLESPACE_VERSION_DIRECTORY, dbNode, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("pg_tblspc/%u/%s/%u/%u", + path = psprintf("pg_tblspc/%u/%s/%u/%u%s", spcNode, TABLESPACE_VERSION_DIRECTORY, - dbNode, relNode); + dbNode, relNode, markstr); } else { if (forkNumber != MAIN_FORKNUM) - path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s", + path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s", spcNode, TABLESPACE_VERSION_DIRECTORY, dbNode, backendId, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u", + path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s", spcNode, TABLESPACE_VERSION_DIRECTORY, - dbNode, backendId, relNode); + dbNode, backendId, relNode, markstr); } } + return path; } diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h index 0ab32b44e9..584ebac391 100644 --- a/src/include/catalog/storage.h +++ b/src/include/catalog/storage.h @@ -23,6 +23,8 @@ extern int wal_skip_threshold; extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence); +extern void RelationCreateInitFork(Relation rel); +extern void RelationDropInitFork(Relation rel); extern void RelationDropStorage(Relation rel); extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit); extern void RelationPreTruncate(Relation rel); @@ -41,6 +43,7 @@ extern void RestorePendingSyncs(char *startAddress); extern void smgrDoPendingDeletes(bool isCommit); extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker); extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr); +extern void smgrDoPendingCleanups(bool isCommit); extern void AtSubCommit_smgr(void); extern void AtSubAbort_smgr(void); extern void PostPrepare_smgr(void); diff --git 
a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h index f0814f1458..12346ed7f6 100644 --- a/src/include/catalog/storage_xlog.h +++ b/src/include/catalog/storage_xlog.h @@ -18,17 +18,23 @@ #include "lib/stringinfo.h" #include "storage/block.h" #include "storage/relfilenode.h" +#include "storage/smgr.h" /* * Declarations for smgr-related XLOG records * - * Note: we log file creation and truncation here, but logging of deletion - * actions is handled by xact.c, because it is part of transaction commit. + * Note: we log file creation, truncation and buffer persistence change here, + * but logging of deletion actions is handled mainly by xact.c, because it is + * part of transaction commit in most cases. However, there's a case where + * init forks are deleted outside control of transaction. */ /* XLOG gives us high 4 bits */ #define XLOG_SMGR_CREATE 0x10 #define XLOG_SMGR_TRUNCATE 0x20 +#define XLOG_SMGR_UNLINK 0x30 +#define XLOG_SMGR_MARK 0x40 +#define XLOG_SMGR_BUFPERSISTENCE 0x50 typedef struct xl_smgr_create { @@ -36,6 +42,32 @@ typedef struct xl_smgr_create ForkNumber forkNum; } xl_smgr_create; +typedef struct xl_smgr_unlink +{ + RelFileNode rnode; + ForkNumber forkNum; +} xl_smgr_unlink; + +typedef enum smgr_mark_action +{ + XLOG_SMGR_MARK_CREATE = 'c', + XLOG_SMGR_MARK_UNLINK = 'u' +} smgr_mark_action; + +typedef struct xl_smgr_mark +{ + RelFileNode rnode; + ForkNumber forkNum; + StorageMarks mark; + smgr_mark_action action; +} xl_smgr_mark; + +typedef struct xl_smgr_bufpersistence +{ + RelFileNode rnode; + bool persistence; +} xl_smgr_bufpersistence; + /* flags for xl_smgr_truncate */ #define SMGR_TRUNCATE_HEAP 0x0001 #define SMGR_TRUNCATE_VM 0x0002 @@ -51,6 +83,12 @@ typedef struct xl_smgr_truncate } xl_smgr_truncate; extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum); +extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum); +extern void log_smgrcreatemark(const RelFileNode *rnode, ForkNumber 
forkNum, + StorageMarks mark); +extern void log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum, + StorageMarks mark); +extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence); extern void smgr_redo(XLogReaderState *record); extern void smgr_desc(StringInfo buf, XLogReaderState *record); diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h index a44be11ca0..106a5cf508 100644 --- a/src/include/common/relpath.h +++ b/src/include/common/relpath.h @@ -67,7 +67,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork); extern char *GetDatabasePath(Oid dbNode, Oid spcNode); extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, - int backendId, ForkNumber forkNumber); + int backendId, ForkNumber forkNumber, char mark); /* * Wrapper macros for GetRelationPath. Beware of multiple @@ -77,7 +77,7 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, /* First argument is a RelFileNode */ #define relpathbackend(rnode, backend, forknum) \ GetRelationPath((rnode).dbNode, (rnode).spcNode, (rnode).relNode, \ - backend, forknum) + backend, forknum, 0) /* First argument is a RelFileNode */ #define relpathperm(rnode, forknum) \ @@ -87,4 +87,9 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, #define relpath(rnode, forknum) \ relpathbackend((rnode).node, (rnode).backend, forknum) +/* First argument is a RelFileNodeBackend */ +#define markpath(rnode, forknum, mark) \ + GetRelationPath((rnode).node.dbNode, (rnode).node.spcNode, \ + (rnode).node.relNode, \ + (rnode).backend, forknum, mark) #endif /* RELPATH_H */ diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h index cfce23ecbc..f5a7df87a4 100644 --- a/src/include/storage/bufmgr.h +++ b/src/include/storage/bufmgr.h @@ -206,6 +206,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels) extern void FlushDatabaseBuffers(Oid dbid); extern void DropRelFileNodeBuffers(struct 
SMgrRelationData *smgr_reln, ForkNumber *forkNum, int nforks, BlockNumber *firstDelBlock); +extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel, + bool permanent, bool isRedo); extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes); extern void DropDatabaseBuffers(Oid dbid); diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h index 34602ae006..2dc0357ad5 100644 --- a/src/include/storage/fd.h +++ b/src/include/storage/fd.h @@ -185,6 +185,7 @@ extern ssize_t pg_pwritev_with_retry(int fd, extern int pg_truncate(const char *path, off_t length); extern void fsync_fname(const char *fname, bool isdir); extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel); +extern int fsync_parent_path(const char *fname, int elevel); extern int durable_rename(const char *oldfile, const char *newfile, int loglevel); extern int durable_unlink(const char *fname, int loglevel); extern int durable_rename_excl(const char *oldfile, const char *newfile, int loglevel); diff --git a/src/include/storage/md.h b/src/include/storage/md.h index 752b440864..99620816b5 100644 --- a/src/include/storage/md.h +++ b/src/include/storage/md.h @@ -23,6 +23,10 @@ extern void mdinit(void); extern void mdopen(SMgrRelation reln); extern void mdclose(SMgrRelation reln, ForkNumber forknum); +extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); +extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo); extern bool mdexists(SMgrRelation reln, ForkNumber forknum); extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo); @@ -41,12 +45,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks); extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum); +extern void 
ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, + ForkNumber forknum); extern void ForgetDatabaseSyncRequests(Oid dbid); extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo); /* md sync callbacks */ extern int mdsyncfiletag(const FileTag *ftag, char *path); -extern int mdunlinkfiletag(const FileTag *ftag, char *path); +extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark); extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate); #endif /* MD_H */ diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h index fad1e5c473..e1f97e9b89 100644 --- a/src/include/storage/reinit.h +++ b/src/include/storage/reinit.h @@ -16,13 +16,15 @@ #define REINIT_H #include "common/relpath.h" - +#include "storage/smgr.h" extern void ResetUnloggedRelations(int op); -extern bool parse_filename_for_nontemp_relation(const char *name, - int *oidchars, ForkNumber *fork); +extern bool parse_filename_for_nontemp_relation(const char *name, int *oidchars, + ForkNumber *fork, + StorageMarks *mark); #define UNLOGGED_RELATION_CLEANUP 0x0001 -#define UNLOGGED_RELATION_INIT 0x0002 +#define UNLOGGED_RELATION_DROP_BUFFER 0x0002 +#define UNLOGGED_RELATION_INIT 0x0004 #endif /* REINIT_H */ diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h index a6fbf7b6a6..201ecace8a 100644 --- a/src/include/storage/smgr.h +++ b/src/include/storage/smgr.h @@ -18,6 +18,18 @@ #include "storage/block.h" #include "storage/relfilenode.h" +/* + * A storage mark is a file whose existence indicates something about another + * file. Such files are named "<filename>.<mark>", where the mark is one + * of the values of StorageMarks. Since ".<digit>" denotes a segment file, don't + * use digits for the mark character. 
+ */ +typedef enum StorageMarks +{ + SMGR_MARK_NONE = 0, + SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */ +} StorageMarks; + /* * smgr.c maintains a table of SMgrRelation objects, which are essentially * cached file handles. An SMgrRelation is created (if not already present) @@ -85,7 +97,12 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln); extern void smgrclose(SMgrRelation reln); extern void smgrcloseall(void); extern void smgrclosenode(RelFileNodeBackend rnode); +extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); +extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo); +extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo); extern void smgrdosyncall(SMgrRelation *rels, int nrels); extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo); extern void smgrextend(SMgrRelation reln, ForkNumber forknum, diff --git a/src/test/recovery/t/027_persistence_change.pl b/src/test/recovery/t/027_persistence_change.pl new file mode 100644 index 0000000000..261c4cf943 --- /dev/null +++ b/src/test/recovery/t/027_persistence_change.pl @@ -0,0 +1,263 @@ + +# Copyright (c) 2021, PostgreSQL Global Development Group + +# Test relation persistence change +use strict; +use warnings; +use PostgreSQL::Test::Cluster; +use PostgreSQL::Test::Utils; +use Test::More; +use Test::More tests => 30; +use IPC::Run qw(pump finish timer); +use Config; + +my $data_unit = 2000; + +# Initialize primary node. 
+my $node = PostgreSQL::Test::Cluster->new('node'); +$node->init; +# we don't want checkpointing +$node->append_conf('postgresql.conf', qq( +checkpoint_timeout = '24h' +)); +$node->start; +create($node); + +my $relfilenodes1 = relfilenodes(); + +# correctly recover empty tables +$node->stop('immediate'); +$node->start; +insert($node, 0, $data_unit, 0); + +# data persists after a crash +$node->stop('immediate'); +$node->start; +checkdataloss($data_unit, 'crash logged 1'); + +set_unlogged($node); +# SET UNLOGGED shouldn't change relfilenode +my $relfilenodes2 = relfilenodes(); +checkrelfilenodes($relfilenodes1, $relfilenodes2, 'logged->unlogged'); + +# data cleanly vanishes after a crash +$node->stop('immediate'); +$node->start; +checkdataloss(0, 'crash unlogged'); + +insert($node, 0, $data_unit, 0); +set_logged($node); + +$node->stop('immediate'); +$node->start; +# SET LOGGED shouldn't change relfilenode and data should survive the crash +my $relfilenodes3 = relfilenodes(); +checkrelfilenodes($relfilenodes2, $relfilenodes3, 'unlogged->logged'); +checkdataloss($data_unit, 'crash logged 2'); + +# unlogged insert -> graceful stop +set_unlogged($node); +insert($node, $data_unit, $data_unit, 0); +$node->stop; +$node->start; +checkdataloss($data_unit * 2, 'unlogged graceful restart'); + +# crash during transaction +set_logged($node); +$node->stop('immediate'); +$node->start; +insert($node, $data_unit * 2, $data_unit, 0); + +my $h; + +# insert(,,,1) requires IO::Pty. Skip the test if the module is not +# available, but do the insert to make the expected situation for the +# later tests. +eval { require IO::Pty; }; +if ($@) +{ + insert($node, $data_unit * 3, $data_unit, 0); + ok (1, 'SKIPPED: IO::Pty is needed'); + ok (1, 'SKIPPED: IO::Pty is needed'); +} +else +{ + $h = insert($node, $data_unit * 3, $data_unit, 1); ## this is aborted +} + +$node->stop('immediate'); + +# finishing $h stalls this case, just tear it off. 
+$h = undef; + +# check if indexes are working +$node->start; +# drop first half of data to reduce run time +$node->safe_psql('postgres', 'DELETE FROM t WHERE bt < ' . $data_unit * 2); +check($node, $data_unit * 2, $data_unit * 3 - 1, 'final check'); + +sub create +{ + my ($node) = @_; + + $node->psql('postgres', qq( + CREATE TABLE t (bt int, gin int[], gist point, hash int, + brin int, spgist point); + CREATE INDEX i_bt ON t USING btree (bt); + CREATE INDEX i_gin ON t USING gin (gin); + CREATE INDEX i_gist ON t USING gist (gist); + CREATE INDEX i_hash ON t USING hash (hash); + CREATE INDEX i_brin ON t USING brin (brin); + CREATE INDEX i_spgist ON t USING spgist (spgist);)); +} + + +sub insert +{ + my ($node, $st, $num, $interactive) = @_; + my $ed = $st + $num - 1; + my $query = qq(BEGIN; +INSERT INTO t + (SELECT i, ARRAY[i, i * 2], point(i, i * 2), i, i, point(i, i) + FROM generate_series($st, $ed) i); +); + + if ($interactive) + { + my $in = ''; + my $out = ''; + my $timer = timer(10); + + my $h = $node->interactive_psql('postgres', \$in, \$out, $timer); + like($out, qr/psql/, "print startup banner"); + + $in .= "$query\n"; + pump $h until ($out =~ /[\n\r]+INSERT 0 $num[\n\r]+/ || + $timer->is_expired); + ok(($out =~ /[\n\r]+INSERT 0 $num[\n\r]+/), "inserted-$st-$num"); + return $h; + # the transaction is not terminated + } + else + { + $node->psql('postgres', $query . 
"COMMIT;"); + return undef; + } +} + +sub check +{ + my ($node, $st, $ed, $head) = @_; + my $num_data = $ed - $st + 1; + + is($node->safe_psql('postgres', qq( + SET enable_seqscan TO true; + SET enable_indexscan TO false; + SELECT COUNT(*) FROM t, generate_series($st, $ed) i + WHERE bt = i)), + $num_data, "$head: heap is not broken"); + is($node->safe_psql('postgres', qq( + SET enable_seqscan TO false; + SET enable_indexscan TO true; + SELECT COUNT(*) FROM t, generate_series($st, $ed) i + WHERE bt = i)), + $num_data, "$head: btree is not broken"); + is($node->safe_psql('postgres', qq( + SET enable_seqscan TO false; + SET enable_indexscan TO true; + SELECT COUNT(*) FROM t, generate_series($st, $ed) i + WHERE gin = ARRAY[i, i * 2];)), + $num_data, "$head: gin is not broken"); + is($node->safe_psql('postgres', qq( + SET enable_seqscan TO false; + SET enable_indexscan TO true; + SELECT COUNT(*) FROM t, generate_series($st, $ed) i + WHERE gist <@ box(point(i-0.5, i*2-0.5),point(i+0.5, i*2+0.5));)), + $num_data, "$head: gist is not broken"); + is($node->safe_psql('postgres', qq( + SET enable_seqscan TO false; + SET enable_indexscan TO true; + SELECT COUNT(*) FROM t, generate_series($st, $ed) i + WHERE hash = i;)), + $num_data, "$head: hash is not broken"); + is($node->safe_psql('postgres', qq( + SET enable_seqscan TO false; + SET enable_indexscan TO true; + SELECT COUNT(*) FROM t, generate_series($st, $ed) i + WHERE brin = i;)), + $num_data, "$head: brin is not broken"); + is($node->safe_psql('postgres', qq( + SET enable_seqscan TO false; + SET enable_indexscan TO true; + SELECT COUNT(*) FROM t, generate_series($st, $ed) i + WHERE spgist <@ box(point(i-0.5,i-0.5),point(i+0.5,i+0.5));)), + $num_data, "$head: spgist is not broken"); +} + +sub set_unlogged +{ + my ($node) = @_; + + $node->psql('postgres', qq( + ALTER TABLE t SET UNLOGGED; +)); +} + +sub set_logged +{ + my ($node) = @_; + + $node->psql('postgres', qq( + ALTER TABLE t SET LOGGED; +)); +} + +sub relfilenodes 
+{ + my $result = $node->safe_psql('postgres', qq{ + SELECT relname, relfilenode FROM pg_class + WHERE relname + IN ('t', 'i_bt','i_gin','i_gist','i_hash','i_brin','i_spgist');}); + + my %relfilenodes; + + foreach my $l (split(/\n/, $result)) + { + die "unexpected format: $l" if ($l !~ /^([^|]+)\|([0-9]+)$/); + $relfilenodes{$1} = $2; + } + + # the number must correspond to the IN list above + is (scalar(keys %relfilenodes), 7, "number of relations is correct"); + + return \%relfilenodes; +} + +sub checkrelfilenodes +{ + my ($rnodes1, $rnodes2, $s) = @_; + + foreach my $n (keys %{$rnodes1}) + { + if ($n eq 'i_gist') + { + # persistence of GiST index is not changed in-place + isnt($rnodes1->{$n}, $rnodes2->{$n}, + "$s: relfilenode is changed: $n"); + } + else + { + # otherwise all relations are processed in-place + is($rnodes1->{$n}, $rnodes2->{$n}, + "$s: relfilenode is not changed: $n"); + } + } +} + +sub checkdataloss +{ + my ($expected, $s) = @_; + + is($node->safe_psql('postgres', "SELECT count(*) FROM t;"), $expected, + "$s: data in table t is in the expected state"); +} -- 2.27.0 From edb09262d0793df84dfcb9138bad0309f84cfe87 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Wed, 11 Nov 2020 23:21:09 +0900 Subject: [PATCH v16 2/2] New command ALTER TABLE ALL IN TABLESPACE SET LOGGED/UNLOGGED To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes relation persistence of all tables in the specified tablespace. 
--- doc/src/sgml/ref/alter_table.sgml | 15 +++ src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++ src/backend/nodes/copyfuncs.c | 16 +++ src/backend/nodes/equalfuncs.c | 15 +++ src/backend/parser/gram.y | 42 +++++++ src/backend/tcop/utility.c | 11 ++ src/include/commands/tablecmds.h | 2 + src/include/nodes/nodes.h | 1 + src/include/nodes/parsenodes.h | 10 ++ src/test/regress/expected/tablespace.out | 76 ++++++++++++ src/test/regress/sql/tablespace.sql | 41 +++++++ 11 files changed, 369 insertions(+) diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml index a76e2e7322..6f108980af 100644 --- a/doc/src/sgml/ref/alter_table.sgml +++ b/doc/src/sgml/ref/alter_table.sgml @@ -33,6 +33,8 @@ ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable> SET SCHEMA <replaceable class="parameter">new_schema</replaceable> ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable>[, ... ] ] SET TABLESPACE <replaceable class="parameter">new_tablespace</replaceable> [ NOWAIT ] +ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable>[, ... ] ] + SET { LOGGED | UNLOGGED } [ NOWAIT ] ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable> ATTACH PARTITION <replaceable class="parameter">partition_name</replaceable> { FOR VALUES <replaceable class="parameter">partition_bound_spec</replaceable>| DEFAULT } ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable> @@ -753,6 +755,19 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM (see <xref linkend="sql-createtable-unlogged"/>). It cannot be applied to a temporary table. 
</para> + + <para> + All tables in the current database in a tablespace can be changed by using + the <literal>ALL IN TABLESPACE</literal> form, which will lock all tables + to be changed first and then change each one. This form also supports + <literal>OWNED BY</literal>, which will only change tables owned by the + roles specified. If the <literal>NOWAIT</literal> option is specified + then the command will fail if it is unable to acquire all of the locks + required immediately. The <literal>information_schema</literal> + relations are not considered part of the system catalogs and will be + changed. See also + <link linkend="sql-createtablespace"><command>CREATE TABLESPACE</command></link>. + </para> </listitem> </varlistentry> diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 51fcf9ca5f..524c9d5c1b 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -14770,6 +14770,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt) return new_tablespaceoid; } +/* + * Alter Table ALL ... SET LOGGED/UNLOGGED + * + * Allows a user to change persistence of all objects in a given tablespace in + * the current database. Objects can be chosen based on the owner of the + * object also, to allow users to change persistence of only their objects. The + * main permissions handling is done by the lower-level change persistence + * function. + * + * All to-be-modified objects are locked first. If NOWAIT is specified and the + * lock can't be acquired then we ereport(ERROR). 
+ */ +void +AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt) +{ + List *relations = NIL; + ListCell *l; + ScanKeyData key[1]; + Relation rel; + TableScanDesc scan; + HeapTuple tuple; + Oid tablespaceoid; + List *role_oids = roleSpecsToIds(stmt->roles); + + /* Ensure we were not asked to change something we can't */ + if (stmt->objtype != OBJECT_TABLE) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("only tables can be specified"))); + + /* Get the tablespace OID */ + tablespaceoid = get_tablespace_oid(stmt->tablespacename, false); + + /* + * Now that the checks are done, check if we should set it to + * InvalidOid because it is our database's default tablespace. + */ + if (tablespaceoid == MyDatabaseTableSpace) + tablespaceoid = InvalidOid; + + /* + * Walk the list of objects in the tablespace to pick them up. This will + * only find objects in our database, of course. + */ + ScanKeyInit(&key[0], + Anum_pg_class_reltablespace, + BTEqualStrategyNumber, F_OIDEQ, + ObjectIdGetDatum(tablespaceoid)); + + rel = table_open(RelationRelationId, AccessShareLock); + scan = table_beginscan_catalog(rel, 1, key); + while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL) + { + Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple); + Oid relOid = relForm->oid; + + /* + * Do not pick-up objects in pg_catalog as part of this, if an admin + * really wishes to do so, they can issue the individual ALTER + * commands directly. + * + * Also, explicitly avoid any shared tables, temp tables, or TOAST + * (TOAST will be changed with the main table). 
+ */ + if (IsCatalogNamespace(relForm->relnamespace) || + relForm->relisshared || + isAnyTempNamespace(relForm->relnamespace) || + IsToastNamespace(relForm->relnamespace)) + continue; + + /* Only pick up the object type requested */ + if (relForm->relkind != RELKIND_RELATION) + continue; + + /* Check if we are only picking-up objects owned by certain roles */ + if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner)) + continue; + + /* + * Handle permissions-checking here since we are locking the tables + * and also to avoid doing a bunch of work only to fail part-way. Note + * that permissions will also be checked by AlterTableInternal(). + * + * Caller must be considered an owner on the table of which we're going + * to change persistence. + */ + if (!pg_class_ownercheck(relOid, GetUserId())) + aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)), + NameStr(relForm->relname)); + + if (stmt->nowait && + !ConditionalLockRelationOid(relOid, AccessExclusiveLock)) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_IN_USE), + errmsg("aborting because lock on relation \"%s.%s\" is not available", + get_namespace_name(relForm->relnamespace), + NameStr(relForm->relname)))); + else + LockRelationOid(relOid, AccessExclusiveLock); + + /* + * Add to our list of objects of which we're going to change + * persistence. + */ + relations = lappend_oid(relations, relOid); + } + + table_endscan(scan); + table_close(rel, AccessShareLock); + + if (relations == NIL) + ereport(NOTICE, + (errcode(ERRCODE_NO_DATA_FOUND), + errmsg("no matching relations in tablespace \"%s\" found", + tablespaceoid == InvalidOid ? "(database default)" : + get_tablespace_name(tablespaceoid)))); + + /* + * Everything is locked, loop through and change persistence of all of the + * relations. 
+ */ + foreach(l, relations) + { + List *cmds = NIL; + AlterTableCmd *cmd = makeNode(AlterTableCmd); + + if (stmt->logged) + cmd->subtype = AT_SetLogged; + else + cmd->subtype = AT_SetUnLogged; + + cmds = lappend(cmds, cmd); + + EventTriggerAlterTableStart((Node *) stmt); + /* OID is set by AlterTableInternal */ + AlterTableInternal(lfirst_oid(l), cmds, false); + EventTriggerAlterTableEnd(); + } +} + static void index_copy_data(Relation rel, RelFileNode newrnode) { diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c index 18e778e856..51b6ad757f 100644 --- a/src/backend/nodes/copyfuncs.c +++ b/src/backend/nodes/copyfuncs.c @@ -4270,6 +4270,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from) return newnode; } +static AlterTableSetLoggedAllStmt * +_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from) +{ + AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt); + + COPY_STRING_FIELD(tablespacename); + COPY_SCALAR_FIELD(objtype); + COPY_SCALAR_FIELD(logged); + COPY_SCALAR_FIELD(nowait); + + return newnode; +} + static CreateExtensionStmt * _copyCreateExtensionStmt(const CreateExtensionStmt *from) { @@ -5623,6 +5636,9 @@ copyObjectImpl(const void *from) case T_AlterTableMoveAllStmt: retval = _copyAlterTableMoveAllStmt(from); break; + case T_AlterTableSetLoggedAllStmt: + retval = _copyAlterTableSetLoggedAllStmt(from); + break; case T_CreateExtensionStmt: retval = _copyCreateExtensionStmt(from); break; diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c index cb7ddd463c..a19b7874d7 100644 --- a/src/backend/nodes/equalfuncs.c +++ b/src/backend/nodes/equalfuncs.c @@ -1916,6 +1916,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a, return true; } +static bool +_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a, + const AlterTableSetLoggedAllStmt *b) +{ + COMPARE_STRING_FIELD(tablespacename); + COMPARE_SCALAR_FIELD(objtype); + 
COMPARE_SCALAR_FIELD(logged); + COMPARE_SCALAR_FIELD(nowait); + + return true; +} + static bool _equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b) { @@ -3625,6 +3637,9 @@ equal(const void *a, const void *b) case T_AlterTableMoveAllStmt: retval = _equalAlterTableMoveAllStmt(a, b); break; + case T_AlterTableSetLoggedAllStmt: + retval = _equalAlterTableSetLoggedAllStmt(a, b); + break; case T_CreateExtensionStmt: retval = _equalCreateExtensionStmt(a, b); break; diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index 6dddc07947..50bc3190de 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -1984,6 +1984,48 @@ AlterTableStmt: n->nowait = $13; $$ = (Node *)n; } + | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->logged = true; + n->nowait = $9; + $$ = (Node *)n; + } + | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET LOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->roles = $9; + n->logged = true; + n->nowait = $12; + $$ = (Node *)n; + } + | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->logged = false; + n->nowait = $9; + $$ = (Node *)n; + } + | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET UNLOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->roles = $9; + n->logged = false; + n->nowait = $12; + $$ = (Node *)n; + } | ALTER INDEX qualified_name alter_table_cmds { AlterTableStmt *n = makeNode(AlterTableStmt); diff --git a/src/backend/tcop/utility.c 
b/src/backend/tcop/utility.c index 1fbc387d47..1483f9a475 100644 --- a/src/backend/tcop/utility.c +++ b/src/backend/tcop/utility.c @@ -162,6 +162,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree) case T_AlterTSConfigurationStmt: case T_AlterTSDictionaryStmt: case T_AlterTableMoveAllStmt: + case T_AlterTableSetLoggedAllStmt: case T_AlterTableSpaceOptionsStmt: case T_AlterTableStmt: case T_AlterTypeStmt: @@ -1747,6 +1748,12 @@ ProcessUtilitySlow(ParseState *pstate, commandCollected = true; break; + case T_AlterTableSetLoggedAllStmt: + AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree); + /* commands are stashed in AlterTableSetLoggedAll */ + commandCollected = true; + break; + case T_DropStmt: ExecDropStmt((DropStmt *) parsetree, isTopLevel); /* no commands stashed for DROP */ @@ -2669,6 +2676,10 @@ CreateCommandTag(Node *parsetree) tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype); break; + case T_AlterTableSetLoggedAllStmt: + tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype); + break; + case T_AlterTableStmt: tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype); break; diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h index 336549cc5f..714077ff4c 100644 --- a/src/include/commands/tablecmds.h +++ b/src/include/commands/tablecmds.h @@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse); extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt); +extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt); + extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt, Oid *oldschema); diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h index 7c657c1241..8860b2e548 100644 --- a/src/include/nodes/nodes.h +++ b/src/include/nodes/nodes.h @@ -428,6 +428,7 @@ typedef enum NodeTag T_AlterCollationStmt, T_CallStmt, T_AlterStatsStmt, + T_AlterTableSetLoggedAllStmt, /* * TAGS FOR 
PARSE TREE NODES (parsenodes.h) diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h index 593e301f7a..01661e9622 100644 --- a/src/include/nodes/parsenodes.h +++ b/src/include/nodes/parsenodes.h @@ -2350,6 +2350,16 @@ typedef struct AlterTableMoveAllStmt bool nowait; } AlterTableMoveAllStmt; +typedef struct AlterTableSetLoggedAllStmt +{ + NodeTag type; + char *tablespacename; + ObjectType objtype; /* Object type to move */ + List *roles; /* List of roles to change objects of */ + bool logged; + bool nowait; +} AlterTableSetLoggedAllStmt; + /* ---------------------- * Create/Alter Extension Statements * ---------------------- diff --git a/src/test/regress/expected/tablespace.out b/src/test/regress/expected/tablespace.out index 864f4b6e20..420eed0717 100644 --- a/src/test/regress/expected/tablespace.out +++ b/src/test/regress/expected/tablespace.out @@ -935,5 +935,81 @@ drop cascades to table testschema.asexecute drop cascades to table testschema.part drop cascades to table testschema.atable drop cascades to table testschema.tablespace_acl +-- +-- Check persistence change in a tablespace +CREATE SCHEMA testschema; +GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1; +CREATE TABLESPACE regress_tablespace LOCATION :'testtablespace'; +GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1; +CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace; +CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace; +CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default; +CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default; +SET ROLE regress_tablespace_user1; +CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace; +CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace; +CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default; +CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default; +SELECT relname, t.spcname, relpersistence + FROM pg_class 
c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid) + WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid; + relname | spcname | relpersistence +---------+--------------------+---------------- + lsu | regress_tablespace | p + usu | regress_tablespace | u + lu1 | regress_tablespace | p + uu1 | regress_tablespace | u + _lsu | | p + _usu | | u + _lu1 | | p + _uu1 | | u +(8 rows) + +ALTER TABLE ALL IN TABLESPACE regress_tablespace + OWNED BY regress_tablespace_user1 SET LOGGED; +SELECT relname, t.spcname, relpersistence + FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid) + WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid; + relname | spcname | relpersistence +---------+--------------------+---------------- + lsu | regress_tablespace | p + usu | regress_tablespace | u + lu1 | regress_tablespace | p + uu1 | regress_tablespace | p + _lsu | | p + _usu | | u + _lu1 | | p + _uu1 | | u +(8 rows) + +RESET ROLE; +ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED; +SELECT relname, t.spcname, relpersistence + FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid) + WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid; + relname | spcname | relpersistence +---------+--------------------+---------------- + lsu | regress_tablespace | u + usu | regress_tablespace | u + lu1 | regress_tablespace | u + uu1 | regress_tablespace | u + _lsu | | p + _usu | | u + _lu1 | | p + _uu1 | | u +(8 rows) + +-- Should succeed +DROP SCHEMA testschema CASCADE; +NOTICE: drop cascades to 8 other objects +DETAIL: drop cascades to table testschema.lsu +drop cascades to table testschema.usu +drop cascades to table testschema._lsu +drop cascades to table testschema._usu +drop cascades to table testschema.lu1 +drop cascades to table testschema.uu1 +drop cascades to table testschema._lu1 +drop cascades to table testschema._uu1 +DROP TABLESPACE regress_tablespace; DROP ROLE 
regress_tablespace_user1; DROP ROLE regress_tablespace_user2; diff --git a/src/test/regress/sql/tablespace.sql b/src/test/regress/sql/tablespace.sql index 92076db9a1..0025c56401 100644 --- a/src/test/regress/sql/tablespace.sql +++ b/src/test/regress/sql/tablespace.sql @@ -412,5 +412,46 @@ DROP TABLESPACE regress_tblspace_renamed; DROP SCHEMA testschema CASCADE; + +-- +-- Check persistence change in a tablespace +CREATE SCHEMA testschema; +GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1; +CREATE TABLESPACE regress_tablespace LOCATION :'testtablespace'; +GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1; + +CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace; +CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace; +CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default; +CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default; +SET ROLE regress_tablespace_user1; +CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace; +CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace; +CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default; +CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default; + +SELECT relname, t.spcname, relpersistence + FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid) + WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid; + +ALTER TABLE ALL IN TABLESPACE regress_tablespace + OWNED BY regress_tablespace_user1 SET LOGGED; + +SELECT relname, t.spcname, relpersistence + FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid) + WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid; + +RESET ROLE; + +ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED; + +SELECT relname, t.spcname, relpersistence + FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid) + WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid; + +-- 
Should succeed +DROP SCHEMA testschema CASCADE; +DROP TABLESPACE regress_tablespace; + DROP ROLE regress_tablespace_user1; DROP ROLE regress_tablespace_user2; -- 2.27.0
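For readers skimming the 0002 patch above, the relation filter applied inside AlterTableSetLoggedAll() can be sketched outside PostgreSQL roughly as follows. This is only an illustration under stated assumptions: RelInfo, owner_matches() and should_change_persistence() are hypothetical names that do not appear in the patch, and the boolean fields merely stand in for the IsCatalogNamespace(), relisshared, isAnyTempNamespace() and IsToastNamespace() checks. Locking and permission checks are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical, simplified stand-in for a pg_class row.  This is NOT
 * PostgreSQL's Form_pg_class; the fields only mirror the checks in the patch.
 */
typedef struct RelInfo
{
	unsigned int owner;			/* owning role id */
	char		relkind;		/* 'r' = ordinary table, 'i' = index, ... */
	bool		in_catalog;		/* lives in pg_catalog */
	bool		is_shared;		/* shared across databases */
	bool		is_temp;		/* lives in a temporary namespace */
	bool		is_toast;		/* lives in the TOAST namespace */
} RelInfo;

/* An empty role list means "no OWNED BY clause", so every owner matches. */
static bool
owner_matches(const unsigned int *roles, size_t nroles, unsigned int owner)
{
	if (nroles == 0)
		return true;

	for (size_t i = 0; i < nroles; i++)
	{
		if (roles[i] == owner)
			return true;
	}
	return false;
}

/* Return true if the relation would be picked up by the ALL IN TABLESPACE form. */
bool
should_change_persistence(const RelInfo *rel,
						  const unsigned int *roles, size_t nroles)
{
	assert(rel != NULL);

	/* skip catalog, shared, temp and TOAST relations */
	if (rel->in_catalog || rel->is_shared || rel->is_temp || rel->is_toast)
		return false;

	/* only ordinary tables are requested */
	if (rel->relkind != 'r')
		return false;

	/* honour OWNED BY, if given */
	return owner_matches(roles, nroles, rel->owner);
}
```

Calling should_change_persistence(&rel, NULL, 0) corresponds to the form without OWNED BY; passing a non-empty role array restricts the result to relations owned by one of those roles.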
Hi, On Fri, Jan 14, 2022 at 11:43:10AM +0900, Kyotaro Horiguchi wrote: > I found a bug. > > mdmarkexists() didn't close the tentatively opened fd. This is a silent > leak on Linux and similar systems, and it causes delete failure on Windows. > It was the reason for the CI failure. > > 027_persistence_change.pl uses interactive_psql() that doesn't work on > the Windows VM on the CI. > > In this version the following changes have been made in 0001. > > - Properly close file descriptor in mdmarkexists. > > - Skip some tests when IO::Pty is not available. > It might be better to separate that part. > > Looking again at the ALTER TABLE ALL IN TABLESPACE SET LOGGED patch, I > noticed that it doesn't implement the OWNED BY part and doesn't have tests > and documentation (it was a PoC). Added all of them to 0002. The cfbot is failing on all OS with this version of the patch. Apparently v16-0002 introduces some usage of "testtablespace" client-side variable that's never defined, e.g. https://api.cirrus-ci.com/v1/artifact/task/4670105480069120/regress_diffs/src/bin/pg_upgrade/tmp_check/regress/regression.diffs: diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/tablespace.out /tmp/cirrus-ci-build/src/bin/pg_upgrade/tmp_check/regress/results/tablespace.out --- /tmp/cirrus-ci-build/src/test/regress/expected/tablespace.out 2022-01-18 04:26:38.744707547 +0000 +++ /tmp/cirrus-ci-build/src/bin/pg_upgrade/tmp_check/regress/results/tablespace.out 2022-01-18 04:30:37.557078083 +0000 @@ -948,76 +948,71 @@ CREATE SCHEMA testschema; GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1; CREATE TABLESPACE regress_tablespace LOCATION :'testtablespace'; +ERROR: syntax error at or near ":" +LINE 1: CREATE TABLESPACE regress_tablespace LOCATION :'testtablespa...
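As a side note on the descriptor leak quoted above, the general pattern of probing for a file by opening it tentatively looks roughly like the sketch below. It is an assumption-laden illustration, not the patch's mdmarkexists(): file_exists_by_open() is a hypothetical helper using plain POSIX open()/close(), whereas the real code goes through PostgreSQL's own file access layer. The point is only that the descriptor is closed on every successful open.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the patch): report whether a file exists
 * by opening it tentatively.  The descriptor is always closed again, so the
 * probe neither leaks an fd nor keeps the file open when it is deleted later.
 */
bool
file_exists_by_open(const char *path)
{
	int			fd;

	assert(path != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return false;			/* this sketch treats any failure as "absent" */

	close(fd);					/* the step the quoted bug report was about */
	return true;
}
```

A real implementation would presumably distinguish ENOENT from other errors rather than folding them together as this sketch does.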
Julien Rouhaud <rjuju123@gmail.com> writes: > The cfbot is failing on all OS with this version of the patch. Apparently > v16-0002 introduces some usage of "testtablespace" client-side variable that's > never defined, e.g. That test infrastructure got rearranged very recently, see d6d317dbf. regards, tom lane
At Tue, 18 Jan 2022 10:37:53 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in > Julien Rouhaud <rjuju123@gmail.com> writes: > > The cfbot is failing on all OS with this version of the patch. Apparently > > v16-0002 introduces some usage of "testtablespace" client-side variable that's > > never defined, e.g. > > That test infrastructure got rearranged very recently, see d6d317dbf. Thanks to both. Even though I knew about the change, I forgot to bring my repo up to date before checking. The attached v17 changes only the following line (and the corresponding "expected" file). -+CREATE TABLESPACE regress_tablespace LOCATION :'testtablespace'; ++CREATE TABLESPACE regress_tablespace LOCATION ''; regards. -- Kyotaro Horiguchi NTT Open Source Software Center From c227842521de00d5da9dffb2f2dd86e8c1c171a8 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Wed, 11 Nov 2020 21:51:11 +0900 Subject: [PATCH v17 1/2] In-place table persistence change Even though ALTER TABLE SET LOGGED/UNLOGGED does not require rewriting data, it currently runs a heap rewrite, which causes a large amount of file I/O. This patch makes the command run without a heap rewrite. In addition, SET LOGGED while wal_level > minimal emits WAL using XLOG_FPI instead of a massive number of HEAP_INSERTs, which should be smaller. This also allows for the cleanup of files left behind when the transaction that created them crashes. 
--- src/backend/access/rmgrdesc/smgrdesc.c | 52 ++ src/backend/access/transam/README | 8 + src/backend/access/transam/xact.c | 7 + src/backend/access/transam/xlog.c | 17 + src/backend/catalog/storage.c | 545 +++++++++++++++++- src/backend/commands/tablecmds.c | 266 +++++++-- src/backend/replication/basebackup.c | 3 +- src/backend/storage/buffer/bufmgr.c | 88 +++ src/backend/storage/file/fd.c | 4 +- src/backend/storage/file/reinit.c | 344 +++++++---- src/backend/storage/smgr/md.c | 94 ++- src/backend/storage/smgr/smgr.c | 32 + src/backend/storage/sync/sync.c | 20 +- src/bin/pg_rewind/parsexlog.c | 24 + src/common/relpath.c | 47 +- src/include/catalog/storage.h | 3 + src/include/catalog/storage_xlog.h | 42 +- src/include/common/relpath.h | 9 +- src/include/storage/bufmgr.h | 2 + src/include/storage/fd.h | 1 + src/include/storage/md.h | 8 +- src/include/storage/reinit.h | 10 +- src/include/storage/smgr.h | 17 + src/test/recovery/t/027_persistence_change.pl | 263 +++++++++ 24 files changed, 1724 insertions(+), 182 deletions(-) create mode 100644 src/test/recovery/t/027_persistence_change.pl diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c index 7547813254..2c674e5de0 100644 --- a/src/backend/access/rmgrdesc/smgrdesc.c +++ b/src/backend/access/rmgrdesc/smgrdesc.c @@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record) xlrec->blkno, xlrec->flags); pfree(path); } + else if (info == XLOG_SMGR_UNLINK) + { + xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec; + char *path = relpathperm(xlrec->rnode, xlrec->forkNum); + + appendStringInfoString(buf, path); + pfree(path); + } + else if (info == XLOG_SMGR_MARK) + { + xl_smgr_mark *xlrec = (xl_smgr_mark *) rec; + char *path = GetRelationPath(xlrec->rnode.dbNode, + xlrec->rnode.spcNode, + xlrec->rnode.relNode, + InvalidBackendId, + xlrec->forkNum, xlrec->mark); + char *action; + + switch (xlrec->action) + { + case XLOG_SMGR_MARK_CREATE: + action = "CREATE"; + break; + case 
XLOG_SMGR_MARK_UNLINK: + action = "DELETE"; + break; + default: + action = "<unknown action>"; + break; + } + + appendStringInfo(buf, "%s %s", action, path); + pfree(path); + } + else if (info == XLOG_SMGR_BUFPERSISTENCE) + { + xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec; + char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM); + + appendStringInfoString(buf, path); + appendStringInfo(buf, " persistence %d", xlrec->persistence); + pfree(path); + } } const char * @@ -55,6 +98,15 @@ smgr_identify(uint8 info) case XLOG_SMGR_TRUNCATE: id = "TRUNCATE"; break; + case XLOG_SMGR_UNLINK: + id = "UNLINK"; + break; + case XLOG_SMGR_MARK: + id = "MARK"; + break; + case XLOG_SMGR_BUFPERSISTENCE: + id = "BUFPERSISTENCE"; + break; } return id; diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README index 1edc8180c1..b344bbe511 100644 --- a/src/backend/access/transam/README +++ b/src/backend/access/transam/README @@ -724,6 +724,14 @@ we must panic and abort recovery. The DBA will have to manually clean up then restart recovery. This is part of the reason for not writing a WAL entry until we've successfully done the original action. +The Smgr MARK files +-------------------------------- + +An smgr mark file is created when a new relation file is created, to +mark that the relfilenode needs to be cleaned up at recovery time. In +contrast to the four actions above, failure to remove smgr mark files +will lead to data loss, in which case the server will shut down. + Skipping WAL for New RelFileNode -------------------------------- diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index c9516e03fa..3c7010eb0f 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -2197,6 +2197,9 @@ CommitTransaction(void) */ smgrDoPendingSyncs(true, is_parallel_worker); + /* Likewise delete mark files for files created during this transaction. 
*/ + smgrDoPendingCleanups(true); + /* close large objects before lower-level cleanup */ AtEOXact_LargeObject(true); @@ -2447,6 +2450,9 @@ PrepareTransaction(void) */ smgrDoPendingSyncs(true, false); + /* Likewise delete mark files for files created during this transaction. */ + smgrDoPendingCleanups(true); + /* close large objects before lower-level cleanup */ AtEOXact_LargeObject(true); @@ -2772,6 +2778,7 @@ AbortTransaction(void) AfterTriggerEndXact(false); /* 'false' means it's abort */ AtAbort_Portals(); smgrDoPendingSyncs(false, is_parallel_worker); + smgrDoPendingCleanups(false); AtEOXact_LargeObject(false); AtAbort_Notify(); AtEOXact_RelationMap(false, is_parallel_worker); diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index c9d4cbf3ff..7cab6a0170 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -40,6 +40,7 @@ #include "catalog/catversion.h" #include "catalog/pg_control.h" #include "catalog/pg_database.h" +#include "catalog/storage.h" #include "commands/progress.h" #include "commands/tablespace.h" #include "common/controldata_utils.h" @@ -4564,6 +4565,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode, { ereport(DEBUG1, (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery"))); + + /* cleanup garbage files left during crash recovery */ + ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER | + UNLOGGED_RELATION_CLEANUP); + + /* run rollback cleanup if any */ + smgrDoPendingDeletes(false); + InArchiveRecovery = true; if (StandbyModeRequested) StandbyMode = true; @@ -7824,6 +7833,14 @@ StartupXLOG(void) } } + /* cleanup garbage files left during crash recovery */ + if (!InArchiveRecovery) + ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER | + UNLOGGED_RELATION_CLEANUP); + + /* run rollback cleanup if any */ + smgrDoPendingDeletes(false); + /* Allow resource managers to do any required cleanup. 
*/ for (rmid = 0; rmid <= RM_MAX_ID; rmid++) { diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c index 9b8075536a..92a9451e90 100644 --- a/src/backend/catalog/storage.c +++ b/src/backend/catalog/storage.c @@ -19,6 +19,7 @@ #include "postgres.h" +#include "access/amapi.h" #include "access/parallel.h" #include "access/visibilitymap.h" #include "access/xact.h" @@ -66,6 +67,23 @@ typedef struct PendingRelDelete struct PendingRelDelete *next; /* linked-list link */ } PendingRelDelete; +#define PCOP_UNLINK_FORK (1 << 0) +#define PCOP_UNLINK_MARK (1 << 1) +#define PCOP_SET_PERSISTENCE (1 << 2) + +typedef struct PendingCleanup +{ + RelFileNode relnode; /* relation that may need to be deleted */ + int op; /* operation mask */ + bool bufpersistence; /* buffer persistence to set */ + int unlink_forknum; /* forknum to unlink */ + StorageMarks unlink_mark; /* mark to unlink */ + BackendId backend; /* InvalidBackendId if not a temp rel */ + bool atCommit; /* T=delete at commit; F=delete at abort */ + int nestLevel; /* xact nesting level of request */ + struct PendingCleanup *next; /* linked-list link */ +} PendingCleanup; + typedef struct PendingRelSync { RelFileNode rnode; @@ -73,6 +91,7 @@ typedef struct PendingRelSync } PendingRelSync; static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */ +static PendingCleanup *pendingCleanups = NULL; /* head of linked list */ HTAB *pendingSyncHash = NULL; @@ -117,7 +136,8 @@ AddPendingSync(const RelFileNode *rnode) SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence) { - PendingRelDelete *pending; + PendingRelDelete *pendingdel; + PendingCleanup *pendingclean; SMgrRelation srel; BackendId backend; bool needs_wal; @@ -143,21 +163,41 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) return NULL; /* placate compiler */ } + /* + * We are going to create a new storage file. If server crashes before the + * current transaction ends the file needs to be cleaned up. 
The + * SMGR_MARK_UNCOMMITTED mark file prompts that work at the next startup. + */ srel = smgropen(rnode, backend); + log_smgrcreatemark(&rnode, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED); + smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false); smgrcreate(srel, MAIN_FORKNUM, false); if (needs_wal) log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM); /* Add the relation to the list of stuff to delete at abort */ - pending = (PendingRelDelete *) + pendingdel = (PendingRelDelete *) MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); - pending->relnode = rnode; - pending->backend = backend; - pending->atCommit = false; /* delete if abort */ - pending->nestLevel = GetCurrentTransactionNestLevel(); - pending->next = pendingDeletes; - pendingDeletes = pending; + pendingdel->relnode = rnode; + pendingdel->backend = backend; + pendingdel->atCommit = false; /* delete if abort */ + pendingdel->nestLevel = GetCurrentTransactionNestLevel(); + pendingdel->next = pendingDeletes; + pendingDeletes = pendingdel; + + /* drop mark files at commit */ + pendingclean = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pendingclean->relnode = rnode; + pendingclean->op = PCOP_UNLINK_MARK; + pendingclean->unlink_forknum = MAIN_FORKNUM; + pendingclean->unlink_mark = SMGR_MARK_UNCOMMITTED; + pendingclean->backend = backend; + pendingclean->atCommit = true; + pendingclean->nestLevel = GetCurrentTransactionNestLevel(); + pendingclean->next = pendingCleanups; + pendingCleanups = pendingclean; if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded()) { @@ -168,6 +208,203 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) return srel; } +/* + * RelationCreateInitFork + * Create physical storage for the init fork of a relation. + * + * Create the init fork for the relation. + * + * This function is transactional. The creation is WAL-logged, and if the + * transaction aborts later on, the init fork will be removed.
+ */ +void +RelationCreateInitFork(Relation rel) +{ + RelFileNode rnode = rel->rd_node; + PendingCleanup *pending; + PendingCleanup *prev; + PendingCleanup *next; + SMgrRelation srel; + bool create = true; + + /* switch buffer persistence */ + SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false); + + /* + * If we have entries for init-fork operations on this relation, that means + * that we have already registered pending delete entries to drop an + * init-fork preexisting since before the current transaction started. This + * function reverts that change just by removing the entries. + */ + prev = NULL; + for (pending = pendingCleanups; pending != NULL; pending = next) + { + next = pending->next; + + if (RelFileNodeEquals(rnode, pending->relnode) && + pending->unlink_forknum == INIT_FORKNUM) + { + if (prev) + prev->next = next; + else + pendingCleanups = next; + + pfree(pending); + /* prev does not change */ + + create = false; + } + else + prev = pending; + } + + if (!create) + return; + + /* + * We are going to create an init fork. If server crashes before the + * current transaction ends the init fork left alone corrupts data while + * recovery. The mark file works as the sentinel to identify that + * situation. + */ + srel = smgropen(rnode, InvalidBackendId); + log_smgrcreatemark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED); + smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false); + + /* We don't have existing init fork, create it. */ + smgrcreate(srel, INIT_FORKNUM, false); + + /* + * index-init fork needs further initialization. ambuildempty should do + * WAL-log and file sync by itself but otherwise we do that by ourselves.
+ */ + if (rel->rd_rel->relkind == RELKIND_INDEX) + rel->rd_indam->ambuildempty(rel); + else + { + log_smgrcreate(&rnode, INIT_FORKNUM); + smgrimmedsync(srel, INIT_FORKNUM); + } + + /* drop the init fork, mark file and revert persistence at abort */ + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = rnode; + pending->op = PCOP_UNLINK_FORK | PCOP_UNLINK_MARK | PCOP_SET_PERSISTENCE; + pending->unlink_forknum = INIT_FORKNUM; + pending->unlink_mark = SMGR_MARK_UNCOMMITTED; + pending->bufpersistence = true; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; + + /* drop mark file at commit */ + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = rnode; + pending->op = PCOP_UNLINK_MARK; + pending->unlink_forknum = INIT_FORKNUM; + pending->unlink_mark = SMGR_MARK_UNCOMMITTED; + pending->backend = InvalidBackendId; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; +} + +/* + * RelationDropInitFork + * Delete physical storage for the init fork of a relation. + * + * Register pending-delete of the init fork. The real deletion is performed by + * smgrDoPendingDeletes at commit. + * + * This function is transactional. If the transaction aborts later on, the + * deletion doesn't happen. 
+ */ +void +RelationDropInitFork(Relation rel) +{ + RelFileNode rnode = rel->rd_node; + PendingCleanup *pending; + PendingCleanup *prev; + PendingCleanup *next; + bool inxact_created = false; + + /* switch buffer persistence */ + SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false); + + /* + * If we have entries for init-fork operations of this relation, that means + * that we have created the init fork in the current transaction. We + * remove the init fork and mark file immediately in that case. Otherwise + * just register pending-delete for the existing init fork. + */ + prev = NULL; + for (pending = pendingCleanups; pending != NULL; pending = next) + { + next = pending->next; + + if (RelFileNodeEquals(rnode, pending->relnode) && + pending->unlink_forknum != INIT_FORKNUM) + { + /* unlink list entry */ + if (prev) + prev->next = next; + else + pendingCleanups = next; + + pfree(pending); + /* prev does not change */ + + inxact_created = true; + } + else + prev = pending; + } + + if (inxact_created) + { + SMgrRelation srel = smgropen(rnode, InvalidBackendId); + + /* + * INIT forks never be loaded to shared buffer so no point in dropping + * buffers for such files. 
+ */ + log_smgrunlinkmark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED); + smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false); + log_smgrunlink(&rnode, INIT_FORKNUM); + smgrunlink(srel, INIT_FORKNUM, false); + return; + } + + /* register drop of this init fork file at commit */ + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = rnode; + pending->op = PCOP_UNLINK_FORK; + pending->unlink_forknum = INIT_FORKNUM; + pending->backend = InvalidBackendId; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; + + /* revert buffer-persistence changes at abort */ + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = rnode; + pending->op = PCOP_SET_PERSISTENCE; + pending->bufpersistence = false; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; +} + /* * Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL. */ @@ -187,6 +424,88 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum) XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE); } +/* + * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL. + */ +void +log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum) +{ + xl_smgr_unlink xlrec; + + /* + * Make an XLOG entry reporting the file unlink. + */ + xlrec.rnode = *rnode; + xlrec.forkNum = forkNum; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE); +} + +/* + * Perform XLogInsert of an XLOG_SMGR_CREATEMARK record to WAL. 
+ */ +void +log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum, + StorageMarks mark) +{ + xl_smgr_mark xlrec; + + /* + * Make an XLOG entry reporting the file creation. + */ + xlrec.rnode = *rnode; + xlrec.forkNum = forkNum; + xlrec.mark = mark; + xlrec.action = XLOG_SMGR_MARK_CREATE; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE); +} + +/* + * Perform XLogInsert of an XLOG_SMGR_UNLINKMARK record to WAL. + */ +void +log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum, + StorageMarks mark) +{ + xl_smgr_mark xlrec; + + /* + * Make an XLOG entry reporting the file creation. + */ + xlrec.rnode = *rnode; + xlrec.forkNum = forkNum; + xlrec.mark = mark; + xlrec.action = XLOG_SMGR_MARK_UNLINK; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE); +} + +/* + * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL. + */ +void +log_smgrbufpersistence(const RelFileNode *rnode, bool persistence) +{ + xl_smgr_bufpersistence xlrec; + + /* + * Make an XLOG entry reporting the change of buffer persistence. + */ + xlrec.rnode = *rnode; + xlrec.persistence = persistence; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE); +} + /* * RelationDropStorage * Schedule unlinking of physical storage at transaction commit. @@ -255,6 +574,7 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit) prev->next = next; else pendingDeletes = next; + pfree(pending); /* prev does not change */ } @@ -673,6 +993,88 @@ smgrDoPendingDeletes(bool isCommit) } } +/* + * smgrDoPendingUnmark() -- Clean up work that emits WAL records + * + * The operations handled in the function emits WAL records, which must be + * emitted before the commit record for the current transaction. 
+ */ +void +smgrDoPendingCleanups(bool isCommit) +{ + int nestLevel = GetCurrentTransactionNestLevel(); + PendingCleanup *pending; + PendingCleanup *prev; + PendingCleanup *next; + + prev = NULL; + for (pending = pendingCleanups; pending != NULL; pending = next) + { + next = pending->next; + if (pending->nestLevel < nestLevel) + { + /* outer-level entries should not be processed yet */ + prev = pending; + } + else + { + /* unlink list entry first, so we don't retry on failure */ + if (prev) + prev->next = next; + else + pendingCleanups = next; + + /* do cleanup if called for */ + if (pending->atCommit == isCommit) + { + SMgrRelation srel; + + srel = smgropen(pending->relnode, pending->backend); + + Assert ((pending->op & + ~(PCOP_UNLINK_FORK | PCOP_UNLINK_MARK | + PCOP_SET_PERSISTENCE)) == 0); + + if (pending->op & PCOP_UNLINK_FORK) + { + /* other forks needs to drop buffers */ + Assert(pending->unlink_forknum == INIT_FORKNUM); + + /* Don't emit wal while recovery. */ + if (!InRecovery) + log_smgrunlink(&pending->relnode, + pending->unlink_forknum); + smgrunlink(srel, pending->unlink_forknum, false); + } + + if (pending->op & PCOP_UNLINK_MARK) + { + SMgrRelation srel; + + if (!InRecovery) + log_smgrunlinkmark(&pending->relnode, + pending->unlink_forknum, + pending->unlink_mark); + srel = smgropen(pending->relnode, pending->backend); + smgrunlinkmark(srel, pending->unlink_forknum, + pending->unlink_mark, InRecovery); + smgrclose(srel); + } + + if (pending->op & PCOP_SET_PERSISTENCE) + { + SetRelationBuffersPersistence(srel, pending->bufpersistence, + InRecovery); + } + } + + /* must explicitly free the list entry */ + pfree(pending); + /* prev does not change */ + } + } +} + /* * smgrDoPendingSyncs() -- Take care of relation syncs at end of xact. 
*/ @@ -933,6 +1335,15 @@ smgr_redo(XLogReaderState *record) reln = smgropen(xlrec->rnode, InvalidBackendId); smgrcreate(reln, xlrec->forkNum, true); } + else if (info == XLOG_SMGR_UNLINK) + { + xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record); + SMgrRelation reln; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + smgrunlink(reln, xlrec->forkNum, true); + smgrclose(reln); + } else if (info == XLOG_SMGR_TRUNCATE) { xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record); @@ -1021,6 +1432,124 @@ smgr_redo(XLogReaderState *record) FreeFakeRelcacheEntry(rel); } + else if (info == XLOG_SMGR_MARK) + { + xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record); + SMgrRelation reln; + PendingCleanup *pending; + bool created = false; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + + switch (xlrec->action) + { + case XLOG_SMGR_MARK_CREATE: + smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true); + created = true; + break; + case XLOG_SMGR_MARK_UNLINK: + smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true); + break; + default: + elog(ERROR, "unknown smgr_mark action \"%c\"", xlrec->mark); + } + + if (created) + { + /* revert mark file operation at abort */ + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = xlrec->rnode; + pending->op = PCOP_UNLINK_MARK; + pending->unlink_forknum = xlrec->forkNum; + pending->unlink_mark = xlrec->mark; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; + } + else + { + /* + * Delete pending action for this mark file if any. We should have + * at most one entry for this action. 
+ */ + PendingCleanup *prev = NULL; + + for (pending = pendingCleanups; pending != NULL; + pending = pending->next) + { + if (RelFileNodeEquals(xlrec->rnode, pending->relnode) && + pending->unlink_forknum == xlrec->forkNum && + (pending->op & PCOP_UNLINK_MARK) != 0) + { + if (prev) + prev->next = pending->next; + else + pendingCleanups = pending->next; + + pfree(pending); + break; + } + + prev = pending; + } + } + } + else if (info == XLOG_SMGR_BUFPERSISTENCE) + { + xl_smgr_bufpersistence *xlrec = + (xl_smgr_bufpersistence *) XLogRecGetData(record); + SMgrRelation reln; + PendingCleanup *pending; + PendingCleanup *prev = NULL; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + SetRelationBuffersPersistence(reln, xlrec->persistence, true); + + /* + * Delete pending action for persistence change if any. We should have + * at most one entry for this action. + */ + for (pending = pendingCleanups; pending != NULL; + pending = pending->next) + { + if (RelFileNodeEquals(xlrec->rnode, pending->relnode) && + (pending->op & PCOP_SET_PERSISTENCE) != 0) + { + Assert (pending->bufpersistence == xlrec->persistence); + + if (prev) + prev->next = pending->next; + else + pendingCleanups = pending->next; + + pfree(pending); + break; + } + + prev = pending; + } + + /* + * Revert buffer-persistence changes at abort if the relation is going + * to different persistence from before this transaction. 
+ */ + if (!pending) + { + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = xlrec->rnode; + pending->op = PCOP_SET_PERSISTENCE; + pending->bufpersistence = !xlrec->persistence; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; + } + } else elog(PANIC, "smgr_redo: unknown op code %u", info); } diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 1f0654c2f5..9e673ba68f 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -52,6 +52,7 @@ #include "commands/defrem.h" #include "commands/event_trigger.h" #include "commands/policy.h" +#include "commands/progress.h" #include "commands/sequence.h" #include "commands/tablecmds.h" #include "commands/tablespace.h" @@ -5346,6 +5347,187 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel, return newcmd; } +/* + * RelationChangePersistence: do in-place persistence change of a relation + */ +static void +RelationChangePersistence(AlteredTableInfo *tab, char persistence, + LOCKMODE lockmode) +{ + Relation rel; + Relation classRel; + HeapTuple tuple, + newtuple; + Datum new_val[Natts_pg_class]; + bool new_null[Natts_pg_class], + new_repl[Natts_pg_class]; + int i; + List *relids; + ListCell *lc_oid; + + Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE); + Assert(lockmode == AccessExclusiveLock); + + /* + * Under the following condition, we need to call ATRewriteTable, which + * cannot be false in the AT_REWRITE_ALTER_PERSISTENCE case. 
+ */ + Assert(tab->constraints == NULL && tab->partition_constraint == NULL && + tab->newvals == NULL && !tab->verify_new_notnull); + + rel = table_open(tab->relid, lockmode); + + Assert(rel->rd_rel->relpersistence != persistence); + + elog(DEBUG1, "perform in-place persistence change"); + + /* + * First we collect all relations that we need to change persistence. + */ + + /* Collect OIDs of indexes and toast relations */ + relids = RelationGetIndexList(rel); + relids = lcons_oid(rel->rd_id, relids); + + /* Add toast relation if any */ + if (OidIsValid(rel->rd_rel->reltoastrelid)) + { + List *toastidx; + Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode); + + relids = lappend_oid(relids, rel->rd_rel->reltoastrelid); + toastidx = RelationGetIndexList(toastrel); + relids = list_concat(relids, toastidx); + pfree(toastidx); + table_close(toastrel, NoLock); + } + + table_close(rel, NoLock); + + /* Make changes in storage */ + classRel = table_open(RelationRelationId, RowExclusiveLock); + + foreach (lc_oid, relids) + { + Oid reloid = lfirst_oid(lc_oid); + Relation r = relation_open(reloid, lockmode); + + /* + * XXXX: Some access methods do not bear up an in-place persistence + * change. Specifically, GiST uses page LSNs to figure out whether a + * block has changed, where UNLOGGED GiST indexes use fake LSNs that + * are incompatible with real LSNs used for LOGGED ones. + * + * Maybe if gistGetFakeLSN behaved the same way for permanent and + * unlogged indexes, we could skip index rebuild in exchange of some + * extra WAL records emitted while it is unlogged. + * + * Check relam against a positive list so that we take this way for + * unknown AMs.
+ */ + if (r->rd_rel->relkind == RELKIND_INDEX && + /* GiST is excluded */ + r->rd_rel->relam != BTREE_AM_OID && + r->rd_rel->relam != HASH_AM_OID && + r->rd_rel->relam != GIN_AM_OID && + r->rd_rel->relam != SPGIST_AM_OID && + r->rd_rel->relam != BRIN_AM_OID) + { + int reindex_flags; + ReindexParams params = {0}; + + /* reindex doesn't allow concurrent use of the index */ + table_close(r, NoLock); + + reindex_flags = + REINDEX_REL_SUPPRESS_INDEX_USE | + REINDEX_REL_CHECK_CONSTRAINTS; + + /* Set the same persistence with the parent relation. */ + if (persistence == RELPERSISTENCE_UNLOGGED) + reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED; + else + reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT; + + reindex_index(reloid, reindex_flags, persistence, &params); + + continue; + } + + /* Create or drop init fork */ + if (persistence == RELPERSISTENCE_UNLOGGED) + RelationCreateInitFork(r); + else + RelationDropInitFork(r); + + /* + * When this relation gets WAL-logged, immediately sync all files but + * initfork to establish the initial state on storage. Buffers have + * already flushed out by RelationCreate(Drop)InitFork called just + * above. Initfork should have been synced as needed.
+ */ + if (persistence == RELPERSISTENCE_PERMANENT) + { + for (i = 0 ; i < INIT_FORKNUM ; i++) + { + if (smgrexists(RelationGetSmgr(r), i)) + smgrimmedsync(RelationGetSmgr(r), i); + } + } + + /* Update catalog */ + tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid)); + if (!HeapTupleIsValid(tuple)) + elog(ERROR, "cache lookup failed for relation %u", reloid); + + memset(new_val, 0, sizeof(new_val)); + memset(new_null, false, sizeof(new_null)); + memset(new_repl, false, sizeof(new_repl)); + + new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence); + new_null[Anum_pg_class_relpersistence - 1] = false; + new_repl[Anum_pg_class_relpersistence - 1] = true; + + newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel), + new_val, new_null, new_repl); + + CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple); + heap_freetuple(newtuple); + + /* + * While wal_level >= replica, switching to LOGGED requires the + * relation content to be WAL-logged to recover the table. + * We don't emit this while wal_level = minimal. + */ + if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded()) + { + ForkNumber fork; + xl_smgr_truncate xlrec; + + xlrec.blkno = 0; + xlrec.rnode = r->rd_node; + xlrec.flags = SMGR_TRUNCATE_ALL; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + + XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE); + + for (fork = 0; fork < INIT_FORKNUM ; fork++) + { + if (smgrexists(RelationGetSmgr(r), fork)) + log_newpage_range(r, fork, 0, + smgrnblocks(RelationGetSmgr(r), fork), + false); + } + } + + table_close(r, NoLock); + } + + table_close(classRel, NoLock); +} + /* * ATRewriteTables: ALTER TABLE phase 3 */ @@ -5476,47 +5658,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode, tab->relid, tab->rewrite); - /* - * Create transient table that will receive the modified data. - * - * Ensure it is marked correctly as logged or unlogged.
We have - * to do this here so that buffers for the new relfilenode will - * have the right persistence set, and at the same time ensure - * that the original filenode's buffers will get read in with the - * correct setting (i.e. the original one). Otherwise a rollback - * after the rewrite would possibly result with buffers for the - * original filenode having the wrong persistence setting. - * - * NB: This relies on swap_relation_files() also swapping the - * persistence. That wouldn't work for pg_class, but that can't be - * unlogged anyway. - */ - OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod, - persistence, lockmode); + if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE) + RelationChangePersistence(tab, persistence, lockmode); + else + { + /* + * Create transient table that will receive the modified data. + * + * Ensure it is marked correctly as logged or unlogged. We + * have to do this here so that buffers for the new relfilenode + * will have the right persistence set, and at the same time + * ensure that the original filenode's buffers will get read in + * with the correct setting (i.e. the original one). Otherwise + * a rollback after the rewrite would possibly result with + * buffers for the original filenode having the wrong + * persistence setting. + * + * NB: This relies on swap_relation_files() also swapping the + * persistence. That wouldn't work for pg_class, but that can't + * be unlogged anyway. + */ + OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, + NewAccessMethod, + persistence, lockmode); - /* - * Copy the heap data into the new table with the desired - * modifications, and test the current data within the table - * against new constraints generated by ALTER TABLE commands. - */ - ATRewriteTable(tab, OIDNewHeap, lockmode); + /* + * Copy the heap data into the new table with the desired + * modifications, and test the current data within the table + * against new constraints generated by ALTER TABLE commands. 
+ */ + ATRewriteTable(tab, OIDNewHeap, lockmode); - /* - * Swap the physical files of the old and new heaps, then rebuild - * indexes and discard the old heap. We can use RecentXmin for - * the table's new relfrozenxid because we rewrote all the tuples - * in ATRewriteTable, so no older Xid remains in the table. Also, - * we never try to swap toast tables by content, since we have no - * interest in letting this code work on system catalogs. - */ - finish_heap_swap(tab->relid, OIDNewHeap, - false, false, true, - !OidIsValid(tab->newTableSpace), - RecentXmin, - ReadNextMultiXactId(), - persistence); + /* + * Swap the physical files of the old and new heaps, then + * rebuild indexes and discard the old heap. We can use + * RecentXmin for the table's new relfrozenxid because we + * rewrote all the tuples in ATRewriteTable, so no older Xid + * remains in the table. Also, we never try to swap toast + * tables by content, since we have no interest in letting this + * code work on system catalogs. 
+ */ + finish_heap_swap(tab->relid, OIDNewHeap, + false, false, true, + !OidIsValid(tab->newTableSpace), + RecentXmin, + ReadNextMultiXactId(), + persistence); - InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0); + InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0); + } } else { diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c index 3afbbe7e02..3f16b5f58c 100644 --- a/src/backend/replication/basebackup.c +++ b/src/backend/replication/basebackup.c @@ -1102,6 +1102,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly, bool excludeFound; ForkNumber relForkNum; /* Type of fork if file is a relation */ int relOidChars; /* Chars in filename that are the rel oid */ + StorageMarks mark; /* Skip special stuff */ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0) @@ -1152,7 +1153,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly, /* Exclude all forks for unlogged tables except the init fork */ if (isDbDir && parse_filename_for_nontemp_relation(de->d_name, &relOidChars, - &relForkNum)) + &relForkNum, &mark)) { /* Never exclude init forks */ if (relForkNum != INIT_FORKNUM) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index a2512e750c..6384b4efbe 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -37,6 +37,7 @@ #include "access/xlogutils.h" #include "catalog/catalog.h" #include "catalog/storage.h" +#include "catalog/storage_xlog.h" #include "executor/instrument.h" #include "lib/binaryheap.h" #include "miscadmin.h" @@ -3154,6 +3155,93 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum, } } +/* --------------------------------------------------------------------- + * SetRelFileNodeBuffersPersistence + * + * This function changes the persistence of all buffer pages of a relation + * then writes all dirty pages of the relation out to disk when switching + 
* to PERMANENT. (or more accurately, out to kernel disk buffers), + * ensuring that the kernel has an up-to-date view of the relation. + * + * Generally, the caller should be holding AccessExclusiveLock on the + * target relation to ensure that no other backend is busy dirtying + * more blocks of the relation; the effects can't be expected to last + * after the lock is released. + * + * XXX currently it sequentially searches the buffer pool, should be + * changed to more clever ways of searching. This routine is not + * used in any performance-critical code paths, so it's not worth + * adding additional overhead to normal paths to make it go faster; + * but see also DropRelFileNodeBuffers. + * -------------------------------------------------------------------- + */ +void +SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo) +{ + int i; + RelFileNodeBackend rnode = srel->smgr_rnode; + + Assert (!RelFileNodeBackendIsTemp(rnode)); + + if (!isRedo) + log_smgrbufpersistence(&srel->smgr_rnode.node, permanent); + + ResourceOwnerEnlargeBuffers(CurrentResourceOwner); + + for (i = 0; i < NBuffers; i++) + { + BufferDesc *bufHdr = GetBufferDescriptor(i); + uint32 buf_state; + + if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node)) + continue; + + ReservePrivateRefCountEntry(); + + buf_state = LockBufHdr(bufHdr); + + if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node)) + { + UnlockBufHdr(bufHdr, buf_state); + continue; + } + + if (permanent) + { + /* Init fork is being dropped, drop buffers for it. 
*/ + if (bufHdr->tag.forkNum == INIT_FORKNUM) + { + InvalidateBuffer(bufHdr); + continue; + } + + buf_state |= BM_PERMANENT; + pg_atomic_write_u32(&bufHdr->state, buf_state); + + /* we flush this buffer when switching to PERMANENT */ + if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY)) + { + PinBuffer_Locked(bufHdr); + LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), + LW_SHARED); + FlushBuffer(bufHdr, srel); + LWLockRelease(BufferDescriptorGetContentLock(bufHdr)); + UnpinBuffer(bufHdr, true); + } + else + UnlockBufHdr(bufHdr, buf_state); + } + else + { + /* init fork is always BM_PERMANENT. See BufferAlloc */ + if (bufHdr->tag.forkNum != INIT_FORKNUM) + buf_state &= ~BM_PERMANENT; + + UnlockBufHdr(bufHdr, buf_state); + } + } +} + /* --------------------------------------------------------------------- * DropRelFileNodesAllBuffers * diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index 14b77f2861..2fc9f17c28 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -349,8 +349,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel); static void datadir_fsync_fname(const char *fname, bool isdir, int elevel); static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel); -static int fsync_parent_path(const char *fname, int elevel); - /* * pg_fsync --- do fsync with or without writethrough @@ -3759,7 +3757,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel) * This is aimed at making file operations persistent on disk in case of * an OS crash or power failure. 
*/ -static int +int fsync_parent_path(const char *fname, int elevel) { char parentpath[MAXPGPATH]; diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c index f053fe0495..1124e95d0d 100644 --- a/src/backend/storage/file/reinit.c +++ b/src/backend/storage/file/reinit.c @@ -16,29 +16,49 @@ #include <unistd.h> +#include "access/xlog.h" +#include "catalog/pg_tablespace_d.h" #include "common/relpath.h" #include "postmaster/startup.h" +#include "storage/bufmgr.h" #include "storage/copydir.h" #include "storage/fd.h" +#include "storage/md.h" #include "storage/reinit.h" +#include "storage/smgr.h" #include "utils/hsearch.h" #include "utils/memutils.h" static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, - int op); + Oid tspid, int op); static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, - int op); + Oid tspid, Oid dbid, int op); typedef struct { Oid reloid; /* hash key */ -} unlogged_relation_entry; + bool has_init; /* has INIT fork */ + bool dirty_init; /* needs to remove INIT fork */ + bool dirty_all; /* needs to remove all forks */ +} relfile_entry; /* - * Reset unlogged relations from before the last restart. + * Clean up and reset relation files from before the last restart. * - * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any - * relation with an "init" fork, except for the "init" fork itself. + * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations + * depending on the existence of the "cleanup" forks. + * + * If SMGR_MARK_UNCOMMITTED mark file for init fork is present, we remove the + * init fork along with the mark file. + * + * If SMGR_MARK_UNCOMMITTED mark file for main fork is present we remove the + * whole relation along with the mark file. + * + * Otherwise, if the "init" fork is found. we remove all forks of any relation + * with the "init" fork, except for the "init" fork itself. 
+ * + * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all + * relations that have the "cleanup" and/or the "init" forks. * * If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main * fork. @@ -72,7 +92,7 @@ ResetUnloggedRelations(int op) /* * First process unlogged files in pg_default ($PGDATA/base) */ - ResetUnloggedRelationsInTablespaceDir("base", op); + ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op); /* * Cycle through directories for all non-default tablespaces. @@ -81,13 +101,19 @@ ResetUnloggedRelations(int op) while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL) { + Oid tspid; + if (strcmp(spc_de->d_name, ".") == 0 || strcmp(spc_de->d_name, "..") == 0) continue; snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s", spc_de->d_name, TABLESPACE_VERSION_DIRECTORY); - ResetUnloggedRelationsInTablespaceDir(temp_path, op); + + tspid = atooid(spc_de->d_name); + + Assert(tspid != 0); + ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op); } FreeDir(spc_dir); @@ -103,7 +129,8 @@ ResetUnloggedRelations(int op) * Process one tablespace directory for ResetUnloggedRelations */ static void -ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) +ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, + Oid tspid, int op) { DIR *ts_dir; struct dirent *de; @@ -130,6 +157,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) while ((de = ReadDir(ts_dir, tsdirname)) != NULL) { + Oid dbid; + /* * We're only interested in the per-database directories, which have * numeric names. Note that this code will also (properly) ignore "." 
@@ -148,7 +177,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) ereport_startup_progress("resetting unlogged relations (cleanup), elapsed time: %ld.%02d s, current path: %s", dbspace_path); - ResetUnloggedRelationsInDbspaceDir(dbspace_path, op); + dbid = atooid(de->d_name); + Assert(dbid != 0); + + ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op); } FreeDir(ts_dir); @@ -158,125 +190,227 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) * Process one per-dbspace directory for ResetUnloggedRelations */ static void -ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) +ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, + Oid tspid, Oid dbid, int op) { DIR *dbspace_dir; struct dirent *de; char rm_path[MAXPGPATH * 2]; + HTAB *hash; + HASHCTL ctl; /* Caller must specify at least one operation. */ - Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0); + Assert((op & (UNLOGGED_RELATION_CLEANUP | + UNLOGGED_RELATION_DROP_BUFFER | + UNLOGGED_RELATION_INIT)) != 0); /* * Cleanup is a two-pass operation. First, we go through and identify all * the files with init forks. Then, we go through again and nuke * everything with the same OID except the init fork. */ + + /* + * It's possible that someone could create tons of unlogged relations in + * the same database & tablespace, so we'd better use a hash table rather + * than an array or linked list to keep track of which files need to be + * reset. Otherwise, this cleanup operation would be O(n^2). + */ + memset(&ctl, 0, sizeof(ctl)); + ctl.keysize = sizeof(Oid); + ctl.entrysize = sizeof(relfile_entry); + hash = hash_create("relfilenode cleanup hash", + 32, &ctl, HASH_ELEM | HASH_BLOBS); + + /* Collect INIT fork and mark files in the directory. 
*/ + dbspace_dir = AllocateDir(dbspacedirname); + while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) + { + int oidchars; + ForkNumber forkNum; + StorageMarks mark; + + /* Skip anything that doesn't look like a relation data file. */ + if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, + &forkNum, &mark)) + continue; + + if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED) + { + Oid key; + relfile_entry *ent; + bool found; + + /* + * Record the relfilenode information. If it has + * SMGR_MARK_UNCOMMITTED mark files, the relfilenode is in dirty + * state, where clean up is needed. + */ + key = atooid(de->d_name); + ent = hash_search(hash, &key, HASH_ENTER, &found); + + if (!found) + { + ent->has_init = false; + ent->dirty_init = false; + ent->dirty_all = false; + } + + if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED) + ent->dirty_init = true; + else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED) + ent->dirty_all = true; + else + { + Assert(forkNum == INIT_FORKNUM); + ent->has_init = true; + } + } + } + + /* Done with the first pass. */ + FreeDir(dbspace_dir); + + /* nothing to do if we don't have init nor cleanup forks */ + if (hash_get_num_entries(hash) < 1) + { + hash_destroy(hash); + return; + } + + if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0) + { + /* + * When we come here after recovery, smgr object for this file might + * have been created. In that case we need to drop all buffers then the + * smgr object before initializing the unlogged relation. This is safe + * as far as no other backends have accessed the relation before + * starting archive recovery. 
+ */ + HASH_SEQ_STATUS status; + relfile_entry *ent; + SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8); + int maxrels = 8; + int nrels = 0; + int i; + + Assert(!HotStandbyActive()); + + hash_seq_init(&status, hash); + while ((ent = (relfile_entry *) hash_seq_search(&status)) != NULL) + { + RelFileNodeBackend rel; + + /* + * The relation is persistent and remains persistent. Don't + * drop the buffers for this relation. + */ + if (ent->has_init && ent->dirty_init) + continue; + + if (maxrels <= nrels) + { + maxrels *= 2; + srels = repalloc(srels, sizeof(SMgrRelation) * maxrels); + } + + rel.backend = InvalidBackendId; + rel.node.spcNode = tspid; + rel.node.dbNode = dbid; + rel.node.relNode = ent->reloid; + + srels[nrels++] = smgropen(rel.node, InvalidBackendId); + } + + DropRelFileNodesAllBuffers(srels, nrels); + + for (i = 0; i < nrels; i++) + smgrclose(srels[i]); + } + + /* + * Now, make a second pass and remove anything that matches. + */ if ((op & UNLOGGED_RELATION_CLEANUP) != 0) { - HTAB *hash; - HASHCTL ctl; - - /* - * It's possible that someone could create a ton of unlogged relations - * in the same database & tablespace, so we'd better use a hash table - * rather than an array or linked list to keep track of which files - * need to be reset. Otherwise, this cleanup operation would be - * O(n^2). - */ - ctl.keysize = sizeof(Oid); - ctl.entrysize = sizeof(unlogged_relation_entry); - ctl.hcxt = CurrentMemoryContext; - hash = hash_create("unlogged relation OIDs", 32, &ctl, - HASH_ELEM | HASH_BLOBS | HASH_CONTEXT); - - /* Scan the directory. */ dbspace_dir = AllocateDir(dbspacedirname); while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) { - ForkNumber forkNum; - int oidchars; - unlogged_relation_entry ent; + ForkNumber forkNum; + StorageMarks mark; + int oidchars; + Oid key; + relfile_entry *ent; + RelFileNodeBackend rel; /* Skip anything that doesn't look like a relation data file.
*/ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) - continue; - - /* Also skip it unless this is the init fork. */ - if (forkNum != INIT_FORKNUM) - continue; - - /* - * Put the OID portion of the name into the hash table, if it - * isn't already. - */ - ent.reloid = atooid(de->d_name); - (void) hash_search(hash, &ent, HASH_ENTER, NULL); - } - - /* Done with the first pass. */ - FreeDir(dbspace_dir); - - /* - * If we didn't find any init forks, there's no point in continuing; - * we can bail out now. - */ - if (hash_get_num_entries(hash) == 0) - { - hash_destroy(hash); - return; - } - - /* - * Now, make a second pass and remove anything that matches. - */ - dbspace_dir = AllocateDir(dbspacedirname); - while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) - { - ForkNumber forkNum; - int oidchars; - unlogged_relation_entry ent; - - /* Skip anything that doesn't look like a relation data file. */ - if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) - continue; - - /* We never remove the init fork. */ - if (forkNum == INIT_FORKNUM) + &forkNum, &mark)) continue; /* * See whether the OID portion of the name shows up in the hash * table. If so, nuke it! */ - ent.reloid = atooid(de->d_name); - if (hash_search(hash, &ent, HASH_FIND, NULL)) + key = atooid(de->d_name); + ent = hash_search(hash, &key, HASH_FIND, NULL); + + if (!ent) + continue; + + if (!ent->dirty_all) { - snprintf(rm_path, sizeof(rm_path), "%s/%s", - dbspacedirname, de->d_name); - if (unlink(rm_path) < 0) - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not remove file \"%s\": %m", - rm_path))); + /* clean permanent relations don't need cleanup */ + if (!ent->has_init) + continue; + + if (ent->dirty_init) + { + /* + * The crashed transaction did SET UNLOGGED. This relation + * is restored to a LOGGED relation.
+ */ + if (forkNum != INIT_FORKNUM) + continue; + } else - elog(DEBUG2, "unlinked file \"%s\"", rm_path); + { + /* + * we don't remove the INIT fork of a non-dirty + * relfilenode + */ + if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE) + continue; + } } + + /* so, nuke it! */ + snprintf(rm_path, sizeof(rm_path), "%s/%s", + dbspacedirname, de->d_name); + if (unlink(rm_path) < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not remove file \"%s\": %m", + rm_path))); + + rel.backend = InvalidBackendId; + rel.node.spcNode = tspid; + rel.node.dbNode = dbid; + rel.node.relNode = atooid(de->d_name); + + ForgetRelationForkSyncRequests(rel, forkNum); } /* Cleanup is complete. */ FreeDir(dbspace_dir); - hash_destroy(hash); } + hash_destroy(hash); + hash = NULL; + /* * Initialization happens after cleanup is complete: we copy each init - * fork file to the corresponding main fork file. Note that if we are - * asked to do both cleanup and init, we may never get here: if the - * cleanup code determines that there are no init forks in this dbspace, - * it will return before we get to this point. + * fork file to the corresponding main fork file. */ if ((op & UNLOGGED_RELATION_INIT) != 0) { @@ -285,6 +419,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) { ForkNumber forkNum; + StorageMarks mark; int oidchars; char oidbuf[OIDCHARS + 1]; char srcpath[MAXPGPATH * 2]; @@ -292,9 +427,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) /* Skip anything that doesn't look like a relation data file. */ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) + &forkNum, &mark)) continue; + Assert(mark == SMGR_MARK_NONE); + /* Also skip it unless this is the init fork. 
*/ if (forkNum != INIT_FORKNUM) continue; @@ -328,15 +465,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) { ForkNumber forkNum; + StorageMarks mark; int oidchars; char oidbuf[OIDCHARS + 1]; char mainpath[MAXPGPATH]; /* Skip anything that doesn't look like a relation data file. */ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) + &forkNum, &mark)) continue; + Assert(mark == SMGR_MARK_NONE); + /* Also skip it unless this is the init fork. */ if (forkNum != INIT_FORKNUM) continue; @@ -379,7 +519,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) */ bool parse_filename_for_nontemp_relation(const char *name, int *oidchars, - ForkNumber *fork) + ForkNumber *fork, StorageMarks *mark) { int pos; @@ -410,11 +550,19 @@ parse_filename_for_nontemp_relation(const char *name, int *oidchars, for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar) ; - if (segchar <= 1) - return false; - pos += segchar; + if (segchar > 1) + pos += segchar; } + /* mark file? */ + if (name[pos] == '.' && name[pos + 1] != 0) + { + *mark = name[pos + 1]; + pos += 2; + } + else + *mark = SMGR_MARK_NONE; + /* Now we should be at the end. */ if (name[pos] != '\0') return false; diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index d26c915f90..007efe68a5 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -139,7 +139,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno, BlockNumber blkno, bool skipFsync, int behavior); static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg); - +static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum, + StorageMarks mark); /* * mdinit() -- Initialize private state for magnetic disk storage manager. 
@@ -169,6 +170,82 @@ mdexists(SMgrRelation reln, ForkNumber forkNum) return (mdopenfork(reln, forkNum, EXTENSION_RETURN_NULL) != NULL); } +/* + * mdcreatemark() -- Create a mark file. + * + * If isRedo is true, it's okay for the file to exist already. + */ +void +mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark, + bool isRedo) +{ + char *path = markpath(reln->smgr_rnode, forkNum, mark); + int fd; + + /* See mdcreate for details. */ + TablespaceCreateDbspace(reln->smgr_rnode.node.spcNode, + reln->smgr_rnode.node.dbNode, + isRedo); + + fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL); + if (fd < 0 && (!isRedo || errno != EEXIST)) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not create mark file \"%s\": %m", path))); + + pg_fsync(fd); + close(fd); + + /* + * To guarantee that the creation of the file is persistent, fsync its + * parent directory. + */ + fsync_parent_path(path, ERROR); + + pfree(path); +} + + +/* + * mdunlinkmark() -- Delete the mark file + * + * If isRedo is true, it's okay if the file is not found. + */ +void +mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark, + bool isRedo) +{ + char *path = markpath(reln->smgr_rnode, forkNum, mark); + + if (!isRedo || mdmarkexists(reln, forkNum, mark)) + durable_unlink(path, ERROR); + + pfree(path); +} + +/* + * mdmarkexists() -- Check if the file exists. + */ +static bool +mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark) +{ + char *path = markpath(reln->smgr_rnode, forkNum, mark); + int fd; + + fd = BasicOpenFile(path, O_RDONLY); + if (fd < 0 && errno != ENOENT) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not access mark file \"%s\": %m", path))); + pfree(path); + + if (fd < 0) + return false; + + close(fd); + return true; +} + /* * mdcreate() -- Create a new relation on magnetic disk.
* @@ -1025,6 +1102,15 @@ register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum, RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ ); } +/* + * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork + */ +void +ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, ForkNumber forknum) +{ + register_forget_request(rnode, forknum, 0); +} + /* * ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB */ @@ -1378,12 +1464,14 @@ mdsyncfiletag(const FileTag *ftag, char *path) * Return 0 on success, -1 on failure, with errno set. */ int -mdunlinkfiletag(const FileTag *ftag, char *path) +mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark) { char *p; /* Compute the path. */ - p = relpathperm(ftag->rnode, MAIN_FORKNUM); + p = GetRelationPath(ftag->rnode.dbNode, ftag->rnode.spcNode, + ftag->rnode.relNode, InvalidBackendId, MAIN_FORKNUM, + mark); strlcpy(path, p, MAXPGPATH); pfree(p); diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c index eb701dce57..4819b5c404 100644 --- a/src/backend/storage/smgr/smgr.c +++ b/src/backend/storage/smgr/smgr.c @@ -62,6 +62,10 @@ typedef struct f_smgr void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks); void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum); + void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); + void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); } f_smgr; static const f_smgr smgrsw[] = { @@ -82,6 +86,8 @@ static const f_smgr smgrsw[] = { .smgr_nblocks = mdnblocks, .smgr_truncate = mdtruncate, .smgr_immedsync = mdimmedsync, + .smgr_createmark = mdcreatemark, + .smgr_unlinkmark = mdunlinkmark, } }; @@ -335,6 +341,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo) smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo); } +/* + * smgrcreatemark() -- Create a mark 
file + */ +void +smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark, + bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo); +} + +/* + * smgrunlinkmark() -- Delete a mark file + */ +void +smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark, + bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo); +} + /* * smgrdosyncall() -- Immediately sync all forks of all given relations * @@ -662,6 +688,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum) smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum); } +void +smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo); +} + /* * AtEOXact_SMgr * diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c index 11fa17ddea..ddc344dad2 100644 --- a/src/backend/storage/sync/sync.c +++ b/src/backend/storage/sync/sync.c @@ -89,7 +89,8 @@ static CycleCtr checkpoint_cycle_ctr = 0; typedef struct SyncOps { int (*sync_syncfiletag) (const FileTag *ftag, char *path); - int (*sync_unlinkfiletag) (const FileTag *ftag, char *path); + int (*sync_unlinkfiletag) (const FileTag *ftag, char *path, + StorageMarks mark); bool (*sync_filetagmatches) (const FileTag *ftag, const FileTag *candidate); } SyncOps; @@ -222,7 +223,8 @@ SyncPostCheckpoint(void) /* Unlink the file */ if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag, - path) < 0) + path, + SMGR_MARK_NONE) < 0) { /* * There's a race condition, when the database is dropped at the @@ -236,6 +238,20 @@ SyncPostCheckpoint(void) (errcode_for_file_access(), errmsg("could not remove file \"%s\": %m", path))); } + else if (syncsw[entry->tag.handler].sync_unlinkfiletag( + &entry->tag, path, + SMGR_MARK_UNCOMMITTED) < 0) + { + /* + * And we may have SMGR_MARK_UNCOMMITTED file. Remove it if the + * fork files has been successfully removed. 
It's ok if the file + * does not exist. + */ + if (errno != ENOENT) + ereport(WARNING, + (errcode_for_file_access(), + errmsg("could not remove file \"%s\": %m", path))); + } /* Mark the list entry as canceled, just in case */ entry->canceled = true; diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c index 9143797458..b21d01d04a 100644 --- a/src/bin/pg_rewind/parsexlog.c +++ b/src/bin/pg_rewind/parsexlog.c @@ -407,6 +407,30 @@ extractPageInfo(XLogReaderState *record) * source system. */ } + else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK) + { + /* + * We can safely ignore these. When we compare the sizes later on, + * we'll notice that they differ, and copy the missing tail from + * source system. + */ + } + else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK) + { + /* + * We can safely ignore these. When we compare the sizes later on, + * we'll notice that they differ, and copy the missing tail from + * source system. + */ + } + else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE) + { + /* + * We can safely ignore these. When we compare the sizes later on, + * we'll notice that they differ, and copy the missing tail from + * source system. 
+ */ + } else if (rmid == RM_XACT_ID && ((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT || (rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED || diff --git a/src/common/relpath.c b/src/common/relpath.c index 636c96efd3..1c19e16fea 100644 --- a/src/common/relpath.c +++ b/src/common/relpath.c @@ -139,9 +139,15 @@ GetDatabasePath(Oid dbNode, Oid spcNode) */ char * GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, - int backendId, ForkNumber forkNumber) + int backendId, ForkNumber forkNumber, char mark) { char *path; + char markstr[4]; + + if (mark == 0) + markstr[0] = 0; + else + snprintf(markstr, sizeof(markstr), ".%c", mark); if (spcNode == GLOBALTABLESPACE_OID) { @@ -149,10 +155,10 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, Assert(dbNode == 0); Assert(backendId == InvalidBackendId); if (forkNumber != MAIN_FORKNUM) - path = psprintf("global/%u_%s", - relNode, forkNames[forkNumber]); + path = psprintf("global/%u_%s%s", + relNode, forkNames[forkNumber], markstr); else - path = psprintf("global/%u", relNode); + path = psprintf("global/%u%s", relNode, markstr); } else if (spcNode == DEFAULTTABLESPACE_OID) { @@ -160,22 +166,22 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, if (backendId == InvalidBackendId) { if (forkNumber != MAIN_FORKNUM) - path = psprintf("base/%u/%u_%s", + path = psprintf("base/%u/%u_%s%s", dbNode, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("base/%u/%u", - dbNode, relNode); + path = psprintf("base/%u/%u%s", + dbNode, relNode, markstr); } else { if (forkNumber != MAIN_FORKNUM) - path = psprintf("base/%u/t%d_%u_%s", + path = psprintf("base/%u/t%d_%u_%s%s", dbNode, backendId, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("base/%u/t%d_%u", - dbNode, backendId, relNode); + path = psprintf("base/%u/t%d_%u%s", + dbNode, backendId, relNode, markstr); } } else @@ -184,27 +190,28 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid 
relNode, if (backendId == InvalidBackendId) { if (forkNumber != MAIN_FORKNUM) - path = psprintf("pg_tblspc/%u/%s/%u/%u_%s", + path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s", spcNode, TABLESPACE_VERSION_DIRECTORY, dbNode, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("pg_tblspc/%u/%s/%u/%u", + path = psprintf("pg_tblspc/%u/%s/%u/%u%s", spcNode, TABLESPACE_VERSION_DIRECTORY, - dbNode, relNode); + dbNode, relNode, markstr); } else { if (forkNumber != MAIN_FORKNUM) - path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s", + path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s", spcNode, TABLESPACE_VERSION_DIRECTORY, dbNode, backendId, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u", + path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s", spcNode, TABLESPACE_VERSION_DIRECTORY, - dbNode, backendId, relNode); + dbNode, backendId, relNode, markstr); } } + return path; } diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h index 9ffc741913..d362d62ed2 100644 --- a/src/include/catalog/storage.h +++ b/src/include/catalog/storage.h @@ -23,6 +23,8 @@ extern int wal_skip_threshold; extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence); +extern void RelationCreateInitFork(Relation rel); +extern void RelationDropInitFork(Relation rel); extern void RelationDropStorage(Relation rel); extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit); extern void RelationPreTruncate(Relation rel); @@ -41,6 +43,7 @@ extern void RestorePendingSyncs(char *startAddress); extern void smgrDoPendingDeletes(bool isCommit); extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker); extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr); +extern void smgrDoPendingCleanups(bool isCommit); extern void AtSubCommit_smgr(void); extern void AtSubAbort_smgr(void); extern void PostPrepare_smgr(void); diff --git 
a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h index 622de22b03..8139308634 100644 --- a/src/include/catalog/storage_xlog.h +++ b/src/include/catalog/storage_xlog.h @@ -18,17 +18,23 @@ #include "lib/stringinfo.h" #include "storage/block.h" #include "storage/relfilenode.h" +#include "storage/smgr.h" /* * Declarations for smgr-related XLOG records * - * Note: we log file creation and truncation here, but logging of deletion - * actions is handled by xact.c, because it is part of transaction commit. + * Note: we log file creation, truncation and buffer persistence change here, + * but logging of deletion actions is handled mainly by xact.c, because it is + * part of transaction commit in most cases. However, there's a case where + * init forks are deleted outside control of transaction. */ /* XLOG gives us high 4 bits */ #define XLOG_SMGR_CREATE 0x10 #define XLOG_SMGR_TRUNCATE 0x20 +#define XLOG_SMGR_UNLINK 0x30 +#define XLOG_SMGR_MARK 0x40 +#define XLOG_SMGR_BUFPERSISTENCE 0x50 typedef struct xl_smgr_create { @@ -36,6 +42,32 @@ typedef struct xl_smgr_create ForkNumber forkNum; } xl_smgr_create; +typedef struct xl_smgr_unlink +{ + RelFileNode rnode; + ForkNumber forkNum; +} xl_smgr_unlink; + +typedef enum smgr_mark_action +{ + XLOG_SMGR_MARK_CREATE = 'c', + XLOG_SMGR_MARK_UNLINK = 'u' +} smgr_mark_action; + +typedef struct xl_smgr_mark +{ + RelFileNode rnode; + ForkNumber forkNum; + StorageMarks mark; + smgr_mark_action action; +} xl_smgr_mark; + +typedef struct xl_smgr_bufpersistence +{ + RelFileNode rnode; + bool persistence; +} xl_smgr_bufpersistence; + /* flags for xl_smgr_truncate */ #define SMGR_TRUNCATE_HEAP 0x0001 #define SMGR_TRUNCATE_VM 0x0002 @@ -51,6 +83,12 @@ typedef struct xl_smgr_truncate } xl_smgr_truncate; extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum); +extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum); +extern void log_smgrcreatemark(const RelFileNode *rnode, ForkNumber 
forkNum, + StorageMarks mark); +extern void log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum, + StorageMarks mark); +extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence); extern void smgr_redo(XLogReaderState *record); extern void smgr_desc(StringInfo buf, XLogReaderState *record); diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h index a4b5dc853b..a864c91614 100644 --- a/src/include/common/relpath.h +++ b/src/include/common/relpath.h @@ -67,7 +67,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork); extern char *GetDatabasePath(Oid dbNode, Oid spcNode); extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, - int backendId, ForkNumber forkNumber); + int backendId, ForkNumber forkNumber, char mark); /* * Wrapper macros for GetRelationPath. Beware of multiple @@ -77,7 +77,7 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, /* First argument is a RelFileNode */ #define relpathbackend(rnode, backend, forknum) \ GetRelationPath((rnode).dbNode, (rnode).spcNode, (rnode).relNode, \ - backend, forknum) + backend, forknum, 0) /* First argument is a RelFileNode */ #define relpathperm(rnode, forknum) \ @@ -87,4 +87,9 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, #define relpath(rnode, forknum) \ relpathbackend((rnode).node, (rnode).backend, forknum) +/* First argument is a RelFileNodeBackend */ +#define markpath(rnode, forknum, mark) \ + GetRelationPath((rnode).node.dbNode, (rnode).node.spcNode, \ + (rnode).node.relNode, \ + (rnode).backend, forknum, mark) #endif /* RELPATH_H */ diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h index dd01841c30..739b386216 100644 --- a/src/include/storage/bufmgr.h +++ b/src/include/storage/bufmgr.h @@ -206,6 +206,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels) extern void FlushDatabaseBuffers(Oid dbid); extern void DropRelFileNodeBuffers(struct 
SMgrRelationData *smgr_reln, ForkNumber *forkNum, int nforks, BlockNumber *firstDelBlock); +extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel, + bool permanent, bool isRedo); extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes); extern void DropDatabaseBuffers(Oid dbid); diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h index 29209e2724..8bf746bf45 100644 --- a/src/include/storage/fd.h +++ b/src/include/storage/fd.h @@ -185,6 +185,7 @@ extern ssize_t pg_pwritev_with_retry(int fd, extern int pg_truncate(const char *path, off_t length); extern void fsync_fname(const char *fname, bool isdir); extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel); +extern int fsync_parent_path(const char *fname, int elevel); extern int durable_rename(const char *oldfile, const char *newfile, int loglevel); extern int durable_unlink(const char *fname, int loglevel); extern int durable_rename_excl(const char *oldfile, const char *newfile, int loglevel); diff --git a/src/include/storage/md.h b/src/include/storage/md.h index ffffa40db7..046afdb5fb 100644 --- a/src/include/storage/md.h +++ b/src/include/storage/md.h @@ -23,6 +23,10 @@ extern void mdinit(void); extern void mdopen(SMgrRelation reln); extern void mdclose(SMgrRelation reln, ForkNumber forknum); +extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); +extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo); extern bool mdexists(SMgrRelation reln, ForkNumber forknum); extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo); @@ -41,12 +45,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks); extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum); +extern void 
ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, + ForkNumber forknum); extern void ForgetDatabaseSyncRequests(Oid dbid); extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo); /* md sync callbacks */ extern int mdsyncfiletag(const FileTag *ftag, char *path); -extern int mdunlinkfiletag(const FileTag *ftag, char *path); +extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark); extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate); #endif /* MD_H */ diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h index bf2c10d443..e399aec0c7 100644 --- a/src/include/storage/reinit.h +++ b/src/include/storage/reinit.h @@ -16,13 +16,15 @@ #define REINIT_H #include "common/relpath.h" - +#include "storage/smgr.h" extern void ResetUnloggedRelations(int op); -extern bool parse_filename_for_nontemp_relation(const char *name, - int *oidchars, ForkNumber *fork); +extern bool parse_filename_for_nontemp_relation(const char *name, int *oidchars, + ForkNumber *fork, + StorageMarks *mark); #define UNLOGGED_RELATION_CLEANUP 0x0001 -#define UNLOGGED_RELATION_INIT 0x0002 +#define UNLOGGED_RELATION_DROP_BUFFER 0x0002 +#define UNLOGGED_RELATION_INIT 0x0004 #endif /* REINIT_H */ diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h index 052e0b8426..48e69ab69b 100644 --- a/src/include/storage/smgr.h +++ b/src/include/storage/smgr.h @@ -18,6 +18,18 @@ #include "storage/block.h" #include "storage/relfilenode.h" +/* + * A storage mark is a file whose existence indicates something about another + * file. The name of such a file is "<filename>.<mark>", where the mark is one + * of the values of StorageMarks. Since ".<digit>" denotes a segment file, don't + * use digits for the mark character.
+ */ +typedef enum StorageMarks +{ + SMGR_MARK_NONE = 0, + SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */ +} StorageMarks; + /* * smgr.c maintains a table of SMgrRelation objects, which are essentially * cached file handles. An SMgrRelation is created (if not already present) @@ -85,7 +97,12 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln); extern void smgrclose(SMgrRelation reln); extern void smgrcloseall(void); extern void smgrclosenode(RelFileNodeBackend rnode); +extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); +extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo); +extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo); extern void smgrdosyncall(SMgrRelation *rels, int nrels); extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo); extern void smgrextend(SMgrRelation reln, ForkNumber forknum, diff --git a/src/test/recovery/t/027_persistence_change.pl b/src/test/recovery/t/027_persistence_change.pl new file mode 100644 index 0000000000..261c4cf943 --- /dev/null +++ b/src/test/recovery/t/027_persistence_change.pl @@ -0,0 +1,263 @@ + +# Copyright (c) 2021, PostgreSQL Global Development Group + +# Test relation persistence change +use strict; +use warnings; +use PostgreSQL::Test::Cluster; +use PostgreSQL::Test::Utils; +use Test::More; +use Test::More tests => 30; +use IPC::Run qw(pump finish timer); +use Config; + +my $data_unit = 2000; + +# Initialize primary node. 
+my $node = PostgreSQL::Test::Cluster->new('node'); +$node->init; +# we don't want checkpointing +$node->append_conf('postgresql.conf', qq( +checkpoint_timeout = '24h' +)); +$node->start; +create($node); + +my $relfilenodes1 = relfilenodes(); + +# correctly recover empty tables +$node->stop('immediate'); +$node->start; +insert($node, 0, $data_unit, 0); + +# data persists after a crash +$node->stop('immediate'); +$node->start; +checkdataloss($data_unit, 'crash logged 1'); + +set_unlogged($node); +# SET UNLOGGED shouldn't change relfilenode +my $relfilenodes2 = relfilenodes(); +checkrelfilenodes($relfilenodes1, $relfilenodes2, 'logged->unlogged'); + +# data cleanly vanishes after a crash +$node->stop('immediate'); +$node->start; +checkdataloss(0, 'crash unlogged'); + +insert($node, 0, $data_unit, 0); +set_logged($node); + +$node->stop('immediate'); +$node->start; +# SET LOGGED shouldn't change relfilenode and data should survive the crash +my $relfilenodes3 = relfilenodes(); +checkrelfilenodes($relfilenodes2, $relfilenodes3, 'unlogged->logged'); +checkdataloss($data_unit, 'crash logged 2'); + +# unlogged insert -> graceful stop +set_unlogged($node); +insert($node, $data_unit, $data_unit, 0); +$node->stop; +$node->start; +checkdataloss($data_unit * 2, 'unlogged graceful restart'); + +# crash during transaction +set_logged($node); +$node->stop('immediate'); +$node->start; +insert($node, $data_unit * 2, $data_unit, 0); + +my $h; + +# insert(,,,1) requires IO::Pty. Skip the test if the module is not +# available, but do the insert to make the expected situation for the +# later tests. +eval { require IO::Pty; }; +if ($@) +{ + insert($node, $data_unit * 3, $data_unit, 0); + ok (1, 'SKIPPED: IO::Pty is needed'); + ok (1, 'SKIPPED: IO::Pty is needed'); +} +else +{ + $h = insert($node, $data_unit * 3, $data_unit, 1); ## this is aborted +} + +$node->stop('immediate'); + +# finishing $h stalls this case, just tear it off. 
+$h = undef; + +# check if indexes are working +$node->start; +# drop first half of data to reduce run time +$node->safe_psql('postgres', 'DELETE FROM t WHERE bt < ' . $data_unit * 2); +check($node, $data_unit * 2, $data_unit * 3 - 1, 'final check'); + +sub create +{ + my ($node) = @_; + + $node->psql('postgres', qq( + CREATE TABLE t (bt int, gin int[], gist point, hash int, + brin int, spgist point); + CREATE INDEX i_bt ON t USING btree (bt); + CREATE INDEX i_gin ON t USING gin (gin); + CREATE INDEX i_gist ON t USING gist (gist); + CREATE INDEX i_hash ON t USING hash (hash); + CREATE INDEX i_brin ON t USING brin (brin); + CREATE INDEX i_spgist ON t USING spgist (spgist);)); +} + + +sub insert +{ + my ($node, $st, $num, $interactive) = @_; + my $ed = $st + $num - 1; + my $query = qq(BEGIN; +INSERT INTO t + (SELECT i, ARRAY[i, i * 2], point(i, i * 2), i, i, point(i, i) + FROM generate_series($st, $ed) i); +); + + if ($interactive) + { + my $in = ''; + my $out = ''; + my $timer = timer(10); + + my $h = $node->interactive_psql('postgres', \$in, \$out, $timer); + like($out, qr/psql/, "print startup banner"); + + $in .= "$query\n"; + pump $h until ($out =~ /[\n\r]+INSERT 0 $num[\n\r]+/ || + $timer->is_expired); + ok(($out =~ /[\n\r]+INSERT 0 $num[\n\r]+/), "inserted-$st-$num"); + return $h + # the trasaction is not terminated + } + else + { + $node->psql('postgres', $query . 
"COMMIT;"); + return undef; + } +} + +sub check +{ + my ($node, $st, $ed, $head) = @_; + my $num_data = $ed - $st + 1; + + is($node->safe_psql('postgres', qq( + SET enable_seqscan TO true; + SET enable_indexscan TO false; + SELECT COUNT(*) FROM t, generate_series($st, $ed) i + WHERE bt = i)), + $num_data, "$head: heap is not broken"); + is($node->safe_psql('postgres', qq( + SET enable_seqscan TO false; + SET enable_indexscan TO true; + SELECT COUNT(*) FROM t, generate_series($st, $ed) i + WHERE bt = i)), + $num_data, "$head: btree is not broken"); + is($node->safe_psql('postgres', qq( + SET enable_seqscan TO false; + SET enable_indexscan TO true; + SELECT COUNT(*) FROM t, generate_series($st, $ed) i + WHERE gin = ARRAY[i, i * 2];)), + $num_data, "$head: gin is not broken"); + is($node->safe_psql('postgres', qq( + SET enable_seqscan TO false; + SET enable_indexscan TO true; + SELECT COUNT(*) FROM t, generate_series($st, $ed) i + WHERE gist <@ box(point(i-0.5, i*2-0.5),point(i+0.5, i*2+0.5));)), + $num_data, "$head: gist is not broken"); + is($node->safe_psql('postgres', qq( + SET enable_seqscan TO false; + SET enable_indexscan TO true; + SELECT COUNT(*) FROM t, generate_series($st, $ed) i + WHERE hash = i;)), + $num_data, "$head: hash is not broken"); + is($node->safe_psql('postgres', qq( + SET enable_seqscan TO false; + SET enable_indexscan TO true; + SELECT COUNT(*) FROM t, generate_series($st, $ed) i + WHERE brin = i;)), + $num_data, "$head: brin is not broken"); + is($node->safe_psql('postgres', qq( + SET enable_seqscan TO false; + SET enable_indexscan TO true; + SELECT COUNT(*) FROM t, generate_series($st, $ed) i + WHERE spgist <@ box(point(i-0.5,i-0.5),point(i+0.5,i+0.5));)), + $num_data, "$head: spgist is not broken"); +} + +sub set_unlogged +{ + my ($node) = @_; + + $node->psql('postgres', qq( + ALTER TABLE t SET UNLOGGED; +)); +} + +sub set_logged +{ + my ($node) = @_; + + $node->psql('postgres', qq( + ALTER TABLE t SET LOGGED; +)); +} + +sub relfilenodes 
+{ + my $result = $node->safe_psql('postgres', qq{ + SELECT relname, relfilenode FROM pg_class + WHERE relname + IN ('t', 'i_bt','i_gin','i_gist','i_hash','i_brin','i_spgist');}); + + my %relfilenodes; + + foreach my $l (split(/\n/, $result)) + { + die "unexpected format: $l" if ($l !~ /^([^|]+)\|([0-9]+)$/); + $relfilenodes{$1} = $2; + } + + # the number must correspond to the in list above + is (scalar %relfilenodes, 7, "number of relations is correct"); + + return \%relfilenodes; +} + +sub checkrelfilenodes +{ + my ($rnodes1, $rnodes2, $s) = @_; + + foreach my $n (keys %{$rnodes1}) + { + if ($n eq 'i_gist') + { + # persistence of GiST index is not changed in-place + isnt($rnodes1->{$n}, $rnodes2->{$n}, + "$s: relfilenode is changed: $n"); + } + else + { + # otherwise all relations are processed in-place + is($rnodes1->{$n}, $rnodes2->{$n}, + "$s: relfilenode is not changed: $n"); + } + } +} + +sub checkdataloss +{ + my ($expected, $s) = @_; + + is($node->safe_psql('postgres', "SELECT count(*) FROM t;"), $expected, + "$s: data in table t is in the expected state"); +} -- 2.27.0 From f621f134e7c48b52a65e3b60ad42c0259e226a40 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Wed, 11 Nov 2020 23:21:09 +0900 Subject: [PATCH v17 2/2] New command ALTER TABLE ALL IN TABLESPACE SET LOGGED/UNLOGGED To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes relation persistence of all tables in the specified tablespace. 
--- doc/src/sgml/ref/alter_table.sgml | 15 +++ src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++ src/backend/nodes/copyfuncs.c | 16 +++ src/backend/nodes/equalfuncs.c | 15 +++ src/backend/parser/gram.y | 42 +++++++ src/backend/tcop/utility.c | 11 ++ src/include/commands/tablecmds.h | 2 + src/include/nodes/nodes.h | 1 + src/include/nodes/parsenodes.h | 10 ++ src/test/regress/expected/tablespace.out | 76 ++++++++++++ src/test/regress/sql/tablespace.sql | 41 +++++++ 11 files changed, 369 insertions(+) diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml index a76e2e7322..6f108980af 100644 --- a/doc/src/sgml/ref/alter_table.sgml +++ b/doc/src/sgml/ref/alter_table.sgml @@ -33,6 +33,8 @@ ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable> SET SCHEMA <replaceable class="parameter">new_schema</replaceable> ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable>[, ... ] ] SET TABLESPACE <replaceable class="parameter">new_tablespace</replaceable> [ NOWAIT ] +ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable>[, ... ] ] + SET { LOGGED | UNLOGGED } [ NOWAIT ] ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable> ATTACH PARTITION <replaceable class="parameter">partition_name</replaceable> { FOR VALUES <replaceable class="parameter">partition_bound_spec</replaceable>| DEFAULT } ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable> @@ -753,6 +755,19 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM (see <xref linkend="sql-createtable-unlogged"/>). It cannot be applied to a temporary table. 
</para> + + <para> + All tables in the current database in a tablespace can be changed by using + the <literal>ALL IN TABLESPACE</literal> form, which will lock all tables + to be changed first and then change each one. This form also supports + <literal>OWNED BY</literal>, which will only change tables owned by the + roles specified. If the <literal>NOWAIT</literal> option is specified + then the command will fail if it is unable to acquire all of the locks + required immediately. The <literal>information_schema</literal> + relations are not considered part of the system catalogs and will be + changed. See also + <link linkend="sql-createtablespace"><command>CREATE TABLESPACE</command></link>. + </para> </listitem> </varlistentry> diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 9e673ba68f..25bbdb5664 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -14769,6 +14769,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt) return new_tablespaceoid; } +/* + * Alter Table ALL ... SET LOGGED/UNLOGGED + * + * Allows a user to change persistence of all objects in a given tablespace in + * the current database. Objects can be chosen based on the owner of the + * object also, to allow users to change persistene only their objects. The + * main permissions handling is done by the lower-level change persistence + * function. + * + * All to-be-modified objects are locked first. If NOWAIT is specified and the + * lock can't be acquired then we ereport(ERROR). 
+ */ +void +AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt) +{ + List *relations = NIL; + ListCell *l; + ScanKeyData key[1]; + Relation rel; + TableScanDesc scan; + HeapTuple tuple; + Oid tablespaceoid; + List *role_oids = roleSpecsToIds(stmt->roles); + + /* Ensure we were not asked to change something we can't */ + if (stmt->objtype != OBJECT_TABLE) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("only tables can be specified"))); + + /* Get the tablespace OID */ + tablespaceoid = get_tablespace_oid(stmt->tablespacename, false); + + /* + * Now that the checks are done, check if we should set either to + * InvalidOid because it is our database's default tablespace. + */ + if (tablespaceoid == MyDatabaseTableSpace) + tablespaceoid = InvalidOid; + + /* + * Walk the list of objects in the tablespace to pick up them. This will + * only find objects in our database, of course. + */ + ScanKeyInit(&key[0], + Anum_pg_class_reltablespace, + BTEqualStrategyNumber, F_OIDEQ, + ObjectIdGetDatum(tablespaceoid)); + + rel = table_open(RelationRelationId, AccessShareLock); + scan = table_beginscan_catalog(rel, 1, key); + while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL) + { + Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple); + Oid relOid = relForm->oid; + + /* + * Do not pick-up objects in pg_catalog as part of this, if an admin + * really wishes to do so, they can issue the individual ALTER + * commands directly. + * + * Also, explicitly avoid any shared tables, temp tables, or TOAST + * (TOAST will be changed with the main table). 
+ */ + if (IsCatalogNamespace(relForm->relnamespace) || + relForm->relisshared || + isAnyTempNamespace(relForm->relnamespace) || + IsToastNamespace(relForm->relnamespace)) + continue; + + /* Only pick up the object type requested */ + if (relForm->relkind != RELKIND_RELATION) + continue; + + /* Check if we are only picking-up objects owned by certain roles */ + if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner)) + continue; + + /* + * Handle permissions-checking here since we are locking the tables + * and also to avoid doing a bunch of work only to fail part-way. Note + * that permissions will also be checked by AlterTableInternal(). + * + * Caller must be considered an owner on the table of which we're going + * to change persistence. + */ + if (!pg_class_ownercheck(relOid, GetUserId())) + aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)), + NameStr(relForm->relname)); + + if (stmt->nowait && + !ConditionalLockRelationOid(relOid, AccessExclusiveLock)) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_IN_USE), + errmsg("aborting because lock on relation \"%s.%s\" is not available", + get_namespace_name(relForm->relnamespace), + NameStr(relForm->relname)))); + else + LockRelationOid(relOid, AccessExclusiveLock); + + /* + * Add to our list of objects of which we're going to change + * persistence. + */ + relations = lappend_oid(relations, relOid); + } + + table_endscan(scan); + table_close(rel, AccessShareLock); + + if (relations == NIL) + ereport(NOTICE, + (errcode(ERRCODE_NO_DATA_FOUND), + errmsg("no matching relations in tablespace \"%s\" found", + tablespaceoid == InvalidOid ? "(database default)" : + get_tablespace_name(tablespaceoid)))); + + /* + * Everything is locked, loop through and change persistence of all of the + * relations. 
+ */ + foreach(l, relations) + { + List *cmds = NIL; + AlterTableCmd *cmd = makeNode(AlterTableCmd); + + if (stmt->logged) + cmd->subtype = AT_SetLogged; + else + cmd->subtype = AT_SetUnLogged; + + cmds = lappend(cmds, cmd); + + EventTriggerAlterTableStart((Node *) stmt); + /* OID is set by AlterTableInternal */ + AlterTableInternal(lfirst_oid(l), cmds, false); + EventTriggerAlterTableEnd(); + } +} + static void index_copy_data(Relation rel, RelFileNode newrnode) { diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c index 90b5da51c9..bbc9eb28e6 100644 --- a/src/backend/nodes/copyfuncs.c +++ b/src/backend/nodes/copyfuncs.c @@ -4273,6 +4273,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from) return newnode; } +static AlterTableSetLoggedAllStmt * +_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from) +{ + AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt); + + COPY_STRING_FIELD(tablespacename); + COPY_SCALAR_FIELD(objtype); + COPY_SCALAR_FIELD(logged); + COPY_SCALAR_FIELD(nowait); + + return newnode; +} + static CreateExtensionStmt * _copyCreateExtensionStmt(const CreateExtensionStmt *from) { @@ -5639,6 +5652,9 @@ copyObjectImpl(const void *from) case T_AlterTableMoveAllStmt: retval = _copyAlterTableMoveAllStmt(from); break; + case T_AlterTableSetLoggedAllStmt: + retval = _copyAlterTableSetLoggedAllStmt(from); + break; case T_CreateExtensionStmt: retval = _copyCreateExtensionStmt(from); break; diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c index 06345da3ba..603bd2a044 100644 --- a/src/backend/nodes/equalfuncs.c +++ b/src/backend/nodes/equalfuncs.c @@ -1916,6 +1916,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a, return true; } +static bool +_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a, + const AlterTableSetLoggedAllStmt *b) +{ + COMPARE_STRING_FIELD(tablespacename); + COMPARE_SCALAR_FIELD(objtype); + 
COMPARE_SCALAR_FIELD(logged); + COMPARE_SCALAR_FIELD(nowait); + + return true; +} + static bool _equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b) { @@ -3636,6 +3648,9 @@ equal(const void *a, const void *b) case T_AlterTableMoveAllStmt: retval = _equalAlterTableMoveAllStmt(a, b); break; + case T_AlterTableSetLoggedAllStmt: + retval = _equalAlterTableSetLoggedAllStmt(a, b); + break; case T_CreateExtensionStmt: retval = _equalCreateExtensionStmt(a, b); break; diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index b5966712ce..682684c2ee 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -1984,6 +1984,48 @@ AlterTableStmt: n->nowait = $13; $$ = (Node *)n; } + | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->logged = true; + n->nowait = $9; + $$ = (Node *)n; + } + | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET LOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->roles = $9; + n->logged = true; + n->nowait = $12; + $$ = (Node *)n; + } + | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->logged = false; + n->nowait = $9; + $$ = (Node *)n; + } + | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET UNLOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->roles = $9; + n->logged = false; + n->nowait = $12; + $$ = (Node *)n; + } | ALTER INDEX qualified_name alter_table_cmds { AlterTableStmt *n = makeNode(AlterTableStmt); diff --git a/src/backend/tcop/utility.c 
b/src/backend/tcop/utility.c index 83e4e37c78..750e0ecac9 100644 --- a/src/backend/tcop/utility.c +++ b/src/backend/tcop/utility.c @@ -162,6 +162,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree) case T_AlterTSConfigurationStmt: case T_AlterTSDictionaryStmt: case T_AlterTableMoveAllStmt: + case T_AlterTableSetLoggedAllStmt: case T_AlterTableSpaceOptionsStmt: case T_AlterTableStmt: case T_AlterTypeStmt: @@ -1747,6 +1748,12 @@ ProcessUtilitySlow(ParseState *pstate, commandCollected = true; break; + case T_AlterTableSetLoggedAllStmt: + AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree); + /* commands are stashed in AlterTableSetLoggedAll */ + commandCollected = true; + break; + case T_DropStmt: ExecDropStmt((DropStmt *) parsetree, isTopLevel); /* no commands stashed for DROP */ @@ -2669,6 +2676,10 @@ CreateCommandTag(Node *parsetree) tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype); break; + case T_AlterTableSetLoggedAllStmt: + tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype); + break; + case T_AlterTableStmt: tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype); break; diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h index 5d4037f26e..c381dad3e5 100644 --- a/src/include/commands/tablecmds.h +++ b/src/include/commands/tablecmds.h @@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse); extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt); +extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt); + extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt, Oid *oldschema); diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h index f9ddafd345..a83c66cad6 100644 --- a/src/include/nodes/nodes.h +++ b/src/include/nodes/nodes.h @@ -429,6 +429,7 @@ typedef enum NodeTag T_AlterCollationStmt, T_CallStmt, T_AlterStatsStmt, + T_AlterTableSetLoggedAllStmt, /* * TAGS FOR 
PARSE TREE NODES (parsenodes.h) diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h index 3e9bdc781f..f19bd3c569 100644 --- a/src/include/nodes/parsenodes.h +++ b/src/include/nodes/parsenodes.h @@ -2351,6 +2351,16 @@ typedef struct AlterTableMoveAllStmt bool nowait; } AlterTableMoveAllStmt; +typedef struct AlterTableSetLoggedAllStmt +{ + NodeTag type; + char *tablespacename; + ObjectType objtype; /* Object type to move */ + List *roles; /* List of roles to change objects of */ + bool logged; + bool nowait; +} AlterTableSetLoggedAllStmt; + /* ---------------------- * Create/Alter Extension Statements * ---------------------- diff --git a/src/test/regress/expected/tablespace.out b/src/test/regress/expected/tablespace.out index 2dfbcfdebe..c02afdcb68 100644 --- a/src/test/regress/expected/tablespace.out +++ b/src/test/regress/expected/tablespace.out @@ -943,5 +943,81 @@ drop cascades to table testschema.asexecute drop cascades to table testschema.part drop cascades to table testschema.atable drop cascades to table testschema.tablespace_acl +-- +-- Check persistence change in a tablespace +CREATE SCHEMA testschema; +GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1; +CREATE TABLESPACE regress_tablespace LOCATION ''; +GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1; +CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace; +CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace; +CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default; +CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default; +SET ROLE regress_tablespace_user1; +CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace; +CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace; +CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default; +CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default; +SELECT relname, t.spcname, relpersistence + FROM pg_class c LEFT JOIN 
pg_tablespace t ON (c.reltablespace = t.oid) + WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid; + relname | spcname | relpersistence +---------+--------------------+---------------- + lsu | regress_tablespace | p + usu | regress_tablespace | u + lu1 | regress_tablespace | p + uu1 | regress_tablespace | u + _lsu | | p + _usu | | u + _lu1 | | p + _uu1 | | u +(8 rows) + +ALTER TABLE ALL IN TABLESPACE regress_tablespace + OWNED BY regress_tablespace_user1 SET LOGGED; +SELECT relname, t.spcname, relpersistence + FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid) + WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid; + relname | spcname | relpersistence +---------+--------------------+---------------- + lsu | regress_tablespace | p + usu | regress_tablespace | u + lu1 | regress_tablespace | p + uu1 | regress_tablespace | p + _lsu | | p + _usu | | u + _lu1 | | p + _uu1 | | u +(8 rows) + +RESET ROLE; +ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED; +SELECT relname, t.spcname, relpersistence + FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid) + WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid; + relname | spcname | relpersistence +---------+--------------------+---------------- + lsu | regress_tablespace | u + usu | regress_tablespace | u + lu1 | regress_tablespace | u + uu1 | regress_tablespace | u + _lsu | | p + _usu | | u + _lu1 | | p + _uu1 | | u +(8 rows) + +-- Should succeed +DROP SCHEMA testschema CASCADE; +NOTICE: drop cascades to 8 other objects +DETAIL: drop cascades to table testschema.lsu +drop cascades to table testschema.usu +drop cascades to table testschema._lsu +drop cascades to table testschema._usu +drop cascades to table testschema.lu1 +drop cascades to table testschema.uu1 +drop cascades to table testschema._lu1 +drop cascades to table testschema._uu1 +DROP TABLESPACE regress_tablespace; DROP ROLE regress_tablespace_user1; DROP ROLE 
regress_tablespace_user2; diff --git a/src/test/regress/sql/tablespace.sql b/src/test/regress/sql/tablespace.sql index 896f05cea3..4e407eb8c0 100644 --- a/src/test/regress/sql/tablespace.sql +++ b/src/test/regress/sql/tablespace.sql @@ -419,5 +419,46 @@ DROP TABLESPACE regress_tblspace_renamed; DROP SCHEMA testschema CASCADE; + +-- +-- Check persistence change in a tablespace +CREATE SCHEMA testschema; +GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1; +CREATE TABLESPACE regress_tablespace LOCATION ''; +GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1; + +CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace; +CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace; +CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default; +CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default; +SET ROLE regress_tablespace_user1; +CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace; +CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace; +CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default; +CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default; + +SELECT relname, t.spcname, relpersistence + FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid) + WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid; + +ALTER TABLE ALL IN TABLESPACE regress_tablespace + OWNED BY regress_tablespace_user1 SET LOGGED; + +SELECT relname, t.spcname, relpersistence + FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid) + WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid; + +RESET ROLE; + +ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED; + +SELECT relname, t.spcname, relpersistence + FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid) + WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid; + +-- Should succeed +DROP SCHEMA testschema CASCADE; +DROP 
TABLESPACE regress_tablespace; + DROP ROLE regress_tablespace_user1; DROP ROLE regress_tablespace_user2; -- 2.27.0
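As an aside for readers of the v18 patch that follows: the mark-file idea it describes (RelationCreateStorage() creates an SMGR_MARK_UNCOMMITTED mark before the storage file, commit removes the mark, and startup cleans up files whose mark survived a crash) can be sketched in standalone C. This is only an illustration under stated assumptions, not code from the patch: plain POSIX files stand in for smgr, the ".u" suffix and the names create_storage(), commit_storage() and recover_storage() are hypothetical, and WAL logging, fsync and error reporting are left out.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Build "<path>.u"; the suffix is an assumption modelled on
 * SMGR_MARK_UNCOMMITTED = 'u', not the patch's actual file naming. */
static void
mark_path(const char *path, char *buf, size_t len)
{
	snprintf(buf, len, "%s.u", path);
}

/* Create an empty file (or open an existing one); true on success. */
static bool
touch_file(const char *path)
{
	int			fd = open(path, O_CREAT | O_WRONLY, 0600);

	if (fd < 0)
		return false;
	close(fd);
	return true;
}

/* Create the mark first, then the storage file, so that a crash can
 * never leave a storage file behind without its mark. */
static bool
create_storage(const char *path)
{
	char		mark[1024];

	mark_path(path, mark, sizeof(mark));
	return touch_file(mark) && touch_file(path);
}

/* At commit the storage file becomes permanent: drop only the mark. */
static bool
commit_storage(const char *path)
{
	char		mark[1024];

	mark_path(path, mark, sizeof(mark));
	return unlink(mark) == 0;
}

/* At startup: a surviving mark means the creating transaction never
 * committed.  Remove the storage file first and the mark last.
 * Returns true if cleanup was performed. */
static bool
recover_storage(const char *path)
{
	char		mark[1024];

	mark_path(path, mark, sizeof(mark));
	if (access(mark, F_OK) != 0)
		return false;			/* committed, or never created */
	unlink(path);				/* may already be missing */
	unlink(mark);
	return true;
}
```

Removing the storage file before the mark keeps the cleanup retryable in this sketch: if it is interrupted, the mark is still present at the next startup and the cleanup simply runs again.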
Rebased on a recent xlog refactoring. No functional changes have been made.

- Removed the default case in smgr_desc, since it seems to me that we don't assume out-of-definition values in xlog records elsewhere.

- Simplified some code added to storage.c.

- Fixed copy-pasted comments in extractPageInfo().

- The previous version of smgrDoPendingCleanups() assumed that init forks are not loaded into shared buffers, but that is wrong (SetRelationBuffersPersistence assumes the opposite). Thus we need to drop buffers before unlinking an init fork. That is already guaranteed by the logic, so I rewrote the comment for PCOP_UNLINK_FORK.

> * Unlink the fork file. Currently we use this only for
> * init forks and we're sure that the init fork is not
> * loaded on shared buffers. For RelationDropInitFork
> * case, the function dropped that buffers. For
> * RelationCreateInitFork case, PCOP_SET_PERSISTENCE(true)
> * is set and the buffers have been dropped just before.

This logic has the same critical window as DropRelFileNodeBuffers. That is, if file deletion fails after the buffers have been dropped successfully, the file content of the init fork may theoretically be stale. However, AFAICS the init fork is a write-once fork, so I don't think that actually matters.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

From 420a9d9a0dae3bcfb1396c14997624ad67a3e557 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v18 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data rewriting, currently it runs a heap rewrite, which causes a large amount of file I/O. This patch makes the command run without a heap rewrite. In addition to that, SET LOGGED while wal_level > minimal emits WAL using XLOG_FPI instead of a massive number of HEAP_INSERTs, which should be smaller. This also allows for the cleanup of files left behind when the transaction that created them crashes.
--- src/backend/access/rmgrdesc/smgrdesc.c | 49 ++ src/backend/access/transam/README | 9 + src/backend/access/transam/xact.c | 7 + src/backend/access/transam/xlogrecovery.c | 18 + src/backend/catalog/storage.c | 548 +++++++++++++++++++++- src/backend/commands/tablecmds.c | 266 +++++++++-- src/backend/replication/basebackup.c | 3 +- src/backend/storage/buffer/bufmgr.c | 86 ++++ src/backend/storage/file/fd.c | 4 +- src/backend/storage/file/reinit.c | 344 ++++++++++---- src/backend/storage/smgr/md.c | 94 +++- src/backend/storage/smgr/smgr.c | 32 ++ src/backend/storage/sync/sync.c | 20 +- src/bin/pg_rewind/parsexlog.c | 22 + src/common/relpath.c | 47 +- src/include/catalog/storage.h | 3 + src/include/catalog/storage_xlog.h | 42 +- src/include/common/relpath.h | 9 +- src/include/storage/bufmgr.h | 2 + src/include/storage/fd.h | 1 + src/include/storage/md.h | 8 +- src/include/storage/reinit.h | 10 +- src/include/storage/smgr.h | 17 + 23 files changed, 1459 insertions(+), 182 deletions(-) diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c index 7547813254..f8908e2c0a 100644 --- a/src/backend/access/rmgrdesc/smgrdesc.c +++ b/src/backend/access/rmgrdesc/smgrdesc.c @@ -40,6 +40,46 @@ smgr_desc(StringInfo buf, XLogReaderState *record) xlrec->blkno, xlrec->flags); pfree(path); } + else if (info == XLOG_SMGR_UNLINK) + { + xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec; + char *path = relpathperm(xlrec->rnode, xlrec->forkNum); + + appendStringInfoString(buf, path); + pfree(path); + } + else if (info == XLOG_SMGR_MARK) + { + xl_smgr_mark *xlrec = (xl_smgr_mark *) rec; + char *path = GetRelationPath(xlrec->rnode.dbNode, + xlrec->rnode.spcNode, + xlrec->rnode.relNode, + InvalidBackendId, + xlrec->forkNum, xlrec->mark); + char *action; + + switch (xlrec->action) + { + case XLOG_SMGR_MARK_CREATE: + action = "CREATE"; + break; + case XLOG_SMGR_MARK_UNLINK: + action = "DELETE"; + break; + } + + appendStringInfo(buf, "%s %s", action, path); + 
pfree(path); + } + else if (info == XLOG_SMGR_BUFPERSISTENCE) + { + xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec; + char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM); + + appendStringInfoString(buf, path); + appendStringInfo(buf, " persistence %d", xlrec->persistence); + pfree(path); + } } const char * @@ -55,6 +95,15 @@ smgr_identify(uint8 info) case XLOG_SMGR_TRUNCATE: id = "TRUNCATE"; break; + case XLOG_SMGR_UNLINK: + id = "UNLINK"; + break; + case XLOG_SMGR_MARK: + id = "MARK"; + break; + case XLOG_SMGR_BUFPERSISTENCE: + id = "BUFPERSISTENCE"; + break; } return id; diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README index 1edc8180c1..2ecd8c8c7c 100644 --- a/src/backend/access/transam/README +++ b/src/backend/access/transam/README @@ -725,6 +725,15 @@ then restart recovery. This is part of the reason for not writing a WAL entry until we've successfully done the original action. +The Smgr MARK files +-------------------------------- + +An smgr mark file is created when a new relation file is created to +mark the relfilenode needs to be cleaned up at recovery time. In +contrast to the four actions above, failure to remove smgr mark files +will lead to data loss, in which case the server will shut down. + + Skipping WAL for New RelFileNode -------------------------------- diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index adf763a8ea..559666b802 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -2198,6 +2198,9 @@ CommitTransaction(void) */ smgrDoPendingSyncs(true, is_parallel_worker); + /* Likewise delete mark files for files created during this transaction. */ + smgrDoPendingCleanups(true); + /* close large objects before lower-level cleanup */ AtEOXact_LargeObject(true); @@ -2448,6 +2451,9 @@ PrepareTransaction(void) */ smgrDoPendingSyncs(true, false); + /* Likewise delete mark files for files created during this transaction. 
*/ + smgrDoPendingCleanups(true); + /* close large objects before lower-level cleanup */ AtEOXact_LargeObject(true); @@ -2773,6 +2779,7 @@ AbortTransaction(void) AfterTriggerEndXact(false); /* 'false' means it's abort */ AtAbort_Portals(); smgrDoPendingSyncs(false, is_parallel_worker); + smgrDoPendingCleanups(false); AtEOXact_LargeObject(false); AtAbort_Notify(); AtEOXact_RelationMap(false, is_parallel_worker); diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c index f9f212680b..2923b8ef8c 100644 --- a/src/backend/access/transam/xlogrecovery.c +++ b/src/backend/access/transam/xlogrecovery.c @@ -40,6 +40,7 @@ #include "access/xlogrecovery.h" #include "access/xlogutils.h" #include "catalog/pg_control.h" +#include "catalog/storage.h" #include "commands/tablespace.h" #include "miscadmin.h" #include "pgstat.h" @@ -53,6 +54,7 @@ #include "storage/pmsignal.h" #include "storage/proc.h" #include "storage/procarray.h" +#include "storage/reinit.h" #include "storage/spin.h" #include "utils/builtins.h" #include "utils/guc.h" @@ -1746,6 +1748,14 @@ PerformWalRecovery(void) } } + /* cleanup garbage files left during crash recovery */ + if (!InArchiveRecovery) + ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER | + UNLOGGED_RELATION_CLEANUP); + + /* run rollback cleanup if any */ + smgrDoPendingDeletes(false); + /* Allow resource managers to do any required cleanup. 
*/ for (rmid = 0; rmid <= RM_MAX_ID; rmid++) { @@ -3022,6 +3032,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode, { ereport(DEBUG1, (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery"))); + + /* cleanup garbage files left during crash recovery */ + ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER | + UNLOGGED_RELATION_CLEANUP); + + /* run rollback cleanup if any */ + smgrDoPendingDeletes(false); + InArchiveRecovery = true; if (StandbyModeRequested) StandbyMode = true; diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c index 9b8075536a..cd1445713a 100644 --- a/src/backend/catalog/storage.c +++ b/src/backend/catalog/storage.c @@ -19,6 +19,7 @@ #include "postgres.h" +#include "access/amapi.h" #include "access/parallel.h" #include "access/visibilitymap.h" #include "access/xact.h" @@ -66,6 +67,23 @@ typedef struct PendingRelDelete struct PendingRelDelete *next; /* linked-list link */ } PendingRelDelete; +#define PCOP_UNLINK_FORK (1 << 0) +#define PCOP_UNLINK_MARK (1 << 1) +#define PCOP_SET_PERSISTENCE (1 << 2) + +typedef struct PendingCleanup +{ + RelFileNode relnode; /* relation that may need to be deleted */ + int op; /* operation mask */ + bool bufpersistence; /* buffer persistence to set */ + int unlink_forknum; /* forknum to unlink */ + StorageMarks unlink_mark; /* mark to unlink */ + BackendId backend; /* InvalidBackendId if not a temp rel */ + bool atCommit; /* T=delete at commit; F=delete at abort */ + int nestLevel; /* xact nesting level of request */ + struct PendingCleanup *next; /* linked-list link */ +} PendingCleanup; + typedef struct PendingRelSync { RelFileNode rnode; @@ -73,6 +91,7 @@ typedef struct PendingRelSync } PendingRelSync; static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */ +static PendingCleanup *pendingCleanups = NULL; /* head of linked list */ HTAB *pendingSyncHash = NULL; @@ -117,7 +136,8 @@ AddPendingSync(const RelFileNode *rnode) SMgrRelation 
RelationCreateStorage(RelFileNode rnode, char relpersistence) { - PendingRelDelete *pending; + PendingRelDelete *pendingdel; + PendingCleanup *pendingclean; SMgrRelation srel; BackendId backend; bool needs_wal; @@ -143,21 +163,41 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) return NULL; /* placate compiler */ } + /* + * We are going to create a new storage file. If the server crashes before + * the current transaction ends, the file needs to be cleaned up. The + * SMGR_MARK_UNCOMMITTED mark file prompts that work at the next startup. + */ srel = smgropen(rnode, backend); + log_smgrcreatemark(&rnode, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED); + smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false); smgrcreate(srel, MAIN_FORKNUM, false); if (needs_wal) log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM); /* Add the relation to the list of stuff to delete at abort */ - pending = (PendingRelDelete *) + pendingdel = (PendingRelDelete *) MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); - pending->relnode = rnode; - pending->backend = backend; - pending->atCommit = false; /* delete if abort */ - pending->nestLevel = GetCurrentTransactionNestLevel(); - pending->next = pendingDeletes; - pendingDeletes = pending; + pendingdel->relnode = rnode; + pendingdel->backend = backend; + pendingdel->atCommit = false; /* delete if abort */ + pendingdel->nestLevel = GetCurrentTransactionNestLevel(); + pendingdel->next = pendingDeletes; + pendingDeletes = pendingdel; + + /* drop mark files at commit */ + pendingclean = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pendingclean->relnode = rnode; + pendingclean->op = PCOP_UNLINK_MARK; + pendingclean->unlink_forknum = MAIN_FORKNUM; + pendingclean->unlink_mark = SMGR_MARK_UNCOMMITTED; + pendingclean->backend = backend; + pendingclean->atCommit = true; + pendingclean->nestLevel = GetCurrentTransactionNestLevel(); + pendingclean->next = pendingCleanups; +
pendingCleanups = pendingclean; if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded()) { @@ -168,6 +208,200 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) return srel; } +/* + * RelationCreateInitFork + * Create physical storage for the init fork of a relation. + * + * Create the init fork for the relation. + * + * This function is transactional. The creation is WAL-logged, and if the + * transaction aborts later on, the init fork will be removed. + */ +void +RelationCreateInitFork(Relation rel) +{ + RelFileNode rnode = rel->rd_node; + PendingCleanup *pending; + PendingCleanup *prev; + PendingCleanup *next; + SMgrRelation srel; + bool create = true; + + /* switch buffer persistence */ + SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false); + + /* + * If we have a pending-unlink for the init-fork of this relation, that + * means the init-fork exists since before the current transaction + * started. This function reverts that change just by removing the entry. + * See RelationDropInitFork. + */ + prev = NULL; + for (pending = pendingCleanups; pending != NULL; pending = next) + { + next = pending->next; + + if (RelFileNodeEquals(rnode, pending->relnode) && + pending->unlink_forknum == INIT_FORKNUM) + { + if (prev) + prev->next = next; + else + pendingCleanups = next; + + pfree(pending); + /* prev does not change */ + + create = false; + } + else + prev = pending; + } + + if (!create) + return; + + /* create the init fork, along with the commit-sentinel file */ + srel = smgropen(rnode, InvalidBackendId); + log_smgrcreatemark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED); + smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false); + + /* We don't have existing init fork, create it. */ + smgrcreate(srel, INIT_FORKNUM, false); + + /* + * init fork for indexes needs further initialization. ambuildempty should + * do WAL-log and file sync by itself but otherwise we do that by + * ourselves. 
+ */ + if (rel->rd_rel->relkind == RELKIND_INDEX) + rel->rd_indam->ambuildempty(rel); + else + { + log_smgrcreate(&rnode, INIT_FORKNUM); + smgrimmedsync(srel, INIT_FORKNUM); + } + + /* drop the init fork, mark file and revert persistence at abort */ + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = rnode; + pending->op = PCOP_UNLINK_FORK | PCOP_UNLINK_MARK | PCOP_SET_PERSISTENCE; + pending->unlink_forknum = INIT_FORKNUM; + pending->unlink_mark = SMGR_MARK_UNCOMMITTED; + pending->bufpersistence = true; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; + + /* drop mark file at commit */ + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = rnode; + pending->op = PCOP_UNLINK_MARK; + pending->unlink_forknum = INIT_FORKNUM; + pending->unlink_mark = SMGR_MARK_UNCOMMITTED; + pending->backend = InvalidBackendId; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; +} + +/* + * RelationDropInitFork + * Delete physical storage for the init fork of a relation. + * + * Register pending-delete of the init fork. The real deletion is performed by + * smgrDoPendingDeletes at commit. + * + * This function is transactional. If the transaction aborts later on, the + * deletion is canceled. + */ +void +RelationDropInitFork(Relation rel) +{ + RelFileNode rnode = rel->rd_node; + PendingCleanup *pending; + PendingCleanup *prev; + PendingCleanup *next; + bool inxact_created = false; + + /* switch buffer persistence */ + SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false); + + /* + * If we have a pending-unlink for the init-fork of this relation, that + * means the init fork is created in the current transaction. 
We remove + * both the init fork and mark file immediately in that case. Otherwise + * just register a pending-unlink for the existing init fork. See + * RelationCreateInitFork. + */ + prev = NULL; + for (pending = pendingCleanups; pending != NULL; pending = next) + { + next = pending->next; + + if (RelFileNodeEquals(rnode, pending->relnode) && + pending->unlink_forknum == INIT_FORKNUM) + { + /* unlink list entry */ + if (prev) + prev->next = next; + else + pendingCleanups = next; + + pfree(pending); + /* prev does not change */ + + inxact_created = true; + } + else + prev = pending; + } + + if (inxact_created) + { + SMgrRelation srel = smgropen(rnode, InvalidBackendId); + + /* + * INIT forks are never loaded to shared buffer so no point in dropping + * buffers for such files. + */ + log_smgrunlinkmark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED); + smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false); + log_smgrunlink(&rnode, INIT_FORKNUM); + smgrunlink(srel, INIT_FORKNUM, false); + return; + } + + /* register drop of this init fork file at commit */ + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = rnode; + pending->op = PCOP_UNLINK_FORK; + pending->unlink_forknum = INIT_FORKNUM; + pending->backend = InvalidBackendId; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; + + /* revert buffer-persistence changes at abort */ + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = rnode; + pending->op = PCOP_SET_PERSISTENCE; + pending->bufpersistence = false; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; +} + /* * Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL.
*/ @@ -187,6 +421,88 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum) XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE); } +/* + * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL. + */ +void +log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum) +{ + xl_smgr_unlink xlrec; + + /* + * Make an XLOG entry reporting the file unlink. + */ + xlrec.rnode = *rnode; + xlrec.forkNum = forkNum; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE); +} + +/* + * Perform XLogInsert of an XLOG_SMGR_CREATEMARK record to WAL. + */ +void +log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum, + StorageMarks mark) +{ + xl_smgr_mark xlrec; + + /* + * Make an XLOG entry reporting the mark file creation. + */ + xlrec.rnode = *rnode; + xlrec.forkNum = forkNum; + xlrec.mark = mark; + xlrec.action = XLOG_SMGR_MARK_CREATE; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE); +} + +/* + * Perform XLogInsert of an XLOG_SMGR_UNLINKMARK record to WAL. + */ +void +log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum, + StorageMarks mark) +{ + xl_smgr_mark xlrec; + + /* + * Make an XLOG entry reporting the mark file unlink. + */ + xlrec.rnode = *rnode; + xlrec.forkNum = forkNum; + xlrec.mark = mark; + xlrec.action = XLOG_SMGR_MARK_UNLINK; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE); +} + +/* + * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL. + */ +void +log_smgrbufpersistence(const RelFileNode *rnode, bool persistence) +{ + xl_smgr_bufpersistence xlrec; + + /* + * Make an XLOG entry reporting the change of buffer persistence.
+ */ + xlrec.rnode = *rnode; + xlrec.persistence = persistence; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE); +} + /* * RelationDropStorage * Schedule unlinking of physical storage at transaction commit. @@ -673,6 +989,95 @@ smgrDoPendingDeletes(bool isCommit) } } +/* + * smgrDoPendingCleanups() -- Clean up work that emits WAL records + * + * The operations handled in this function emit WAL records, which must be + * part of the current transaction. + */ +void +smgrDoPendingCleanups(bool isCommit) +{ + int nestLevel = GetCurrentTransactionNestLevel(); + PendingCleanup *pending; + PendingCleanup *prev; + PendingCleanup *next; + + prev = NULL; + for (pending = pendingCleanups; pending != NULL; pending = next) + { + next = pending->next; + if (pending->nestLevel < nestLevel) + { + /* outer-level entries should not be processed yet */ + prev = pending; + } + else + { + /* unlink list entry first, so we don't retry on failure */ + if (prev) + prev->next = next; + else + pendingCleanups = next; + + /* do cleanup if called for */ + if (pending->atCommit == isCommit) + { + SMgrRelation srel; + + srel = smgropen(pending->relnode, pending->backend); + + Assert ((pending->op & + ~(PCOP_UNLINK_FORK | PCOP_UNLINK_MARK | + PCOP_SET_PERSISTENCE)) == 0); + + if (pending->op & PCOP_SET_PERSISTENCE) + { + SetRelationBuffersPersistence(srel, pending->bufpersistence, + InRecovery); + } + + if (pending->op & PCOP_UNLINK_FORK) + { + /* + * Unlink the fork file. Currently we use this only for + * init forks and we're sure that the init fork is not + * loaded on shared buffers. For RelationDropInitFork + * case, the function dropped those buffers. For + * RelationCreateInitFork case, PCOP_SET_PERSISTENCE(true) + * is set and the buffers have been dropped just before. + */ + Assert(pending->unlink_forknum == INIT_FORKNUM); + + /* Don't emit WAL during recovery.
*/ + if (!InRecovery) + log_smgrunlink(&pending->relnode, + pending->unlink_forknum); + smgrunlink(srel, pending->unlink_forknum, false); + } + + if (pending->op & PCOP_UNLINK_MARK) + { + SMgrRelation srel; + + if (!InRecovery) + log_smgrunlinkmark(&pending->relnode, + pending->unlink_forknum, + pending->unlink_mark); + srel = smgropen(pending->relnode, pending->backend); + smgrunlinkmark(srel, pending->unlink_forknum, + pending->unlink_mark, InRecovery); + smgrclose(srel); + } + } + + /* must explicitly free the list entry */ + pfree(pending); + /* prev does not change */ + } + } +} + /* * smgrDoPendingSyncs() -- Take care of relation syncs at end of xact. */ @@ -933,6 +1338,15 @@ smgr_redo(XLogReaderState *record) reln = smgropen(xlrec->rnode, InvalidBackendId); smgrcreate(reln, xlrec->forkNum, true); } + else if (info == XLOG_SMGR_UNLINK) + { + xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record); + SMgrRelation reln; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + smgrunlink(reln, xlrec->forkNum, true); + smgrclose(reln); + } else if (info == XLOG_SMGR_TRUNCATE) { xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record); @@ -1021,6 +1435,124 @@ smgr_redo(XLogReaderState *record) FreeFakeRelcacheEntry(rel); } + else if (info == XLOG_SMGR_MARK) + { + xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record); + SMgrRelation reln; + PendingCleanup *pending; + bool created = false; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + + switch (xlrec->action) + { + case XLOG_SMGR_MARK_CREATE: + smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true); + created = true; + break; + case XLOG_SMGR_MARK_UNLINK: + smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true); + break; + default: + elog(ERROR, "unknown smgr_mark action \"%c\"", xlrec->mark); + } + + if (created) + { + /* revert mark file operation at abort */ + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = 
xlrec->rnode; + pending->op = PCOP_UNLINK_MARK; + pending->unlink_forknum = xlrec->forkNum; + pending->unlink_mark = xlrec->mark; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; + } + else + { + /* + * Delete pending action for this mark file if any. We should have + * at most one entry for this action. + */ + PendingCleanup *prev = NULL; + + for (pending = pendingCleanups; pending != NULL; + pending = pending->next) + { + if (RelFileNodeEquals(xlrec->rnode, pending->relnode) && + pending->unlink_forknum == xlrec->forkNum && + (pending->op & PCOP_UNLINK_MARK) != 0) + { + if (prev) + prev->next = pending->next; + else + pendingCleanups = pending->next; + + pfree(pending); + break; + } + + prev = pending; + } + } + } + else if (info == XLOG_SMGR_BUFPERSISTENCE) + { + xl_smgr_bufpersistence *xlrec = + (xl_smgr_bufpersistence *) XLogRecGetData(record); + SMgrRelation reln; + PendingCleanup *pending; + PendingCleanup *prev = NULL; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + SetRelationBuffersPersistence(reln, xlrec->persistence, true); + + /* + * Delete pending action for persistence change if any. We should have + * at most one entry for this action. + */ + for (pending = pendingCleanups; pending != NULL; + pending = pending->next) + { + if (RelFileNodeEquals(xlrec->rnode, pending->relnode) && + (pending->op & PCOP_SET_PERSISTENCE) != 0) + { + Assert (pending->bufpersistence == xlrec->persistence); + + if (prev) + prev->next = pending->next; + else + pendingCleanups = pending->next; + + pfree(pending); + break; + } + + prev = pending; + } + + /* + * Revert buffer-persistence changes at abort if the relation is going + * to different persistence from before this transaction. 
+ */ + if (!pending) + { + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = xlrec->rnode; + pending->op = PCOP_SET_PERSISTENCE; + pending->bufpersistence = !xlrec->persistence; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; + } + } else elog(PANIC, "smgr_redo: unknown op code %u", info); } diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 3e83f375b5..9e5b77e94a 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -53,6 +53,7 @@ #include "commands/defrem.h" #include "commands/event_trigger.h" #include "commands/policy.h" +#include "commands/progress.h" #include "commands/sequence.h" #include "commands/tablecmds.h" #include "commands/tablespace.h" @@ -5347,6 +5348,187 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel, return newcmd; } +/* + * RelationChangePersistence: do in-place persistence change of a relation + */ +static void +RelationChangePersistence(AlteredTableInfo *tab, char persistence, + LOCKMODE lockmode) +{ + Relation rel; + Relation classRel; + HeapTuple tuple, + newtuple; + Datum new_val[Natts_pg_class]; + bool new_null[Natts_pg_class], + new_repl[Natts_pg_class]; + int i; + List *relids; + ListCell *lc_oid; + + Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE); + Assert(lockmode == AccessExclusiveLock); + + /* + * Under the following condition, we need to call ATRewriteTable, which + * cannot be false in the AT_REWRITE_ALTER_PERSISTENCE case. 
+ */ + Assert(tab->constraints == NULL && tab->partition_constraint == NULL && + tab->newvals == NULL && !tab->verify_new_notnull); + + rel = table_open(tab->relid, lockmode); + + Assert(rel->rd_rel->relpersistence != persistence); + + elog(DEBUG1, "perform in-place persistence change"); + + /* + * First we collect all relations that we need to change persistence. + */ + + /* Collect OIDs of indexes and toast relations */ + relids = RelationGetIndexList(rel); + relids = lcons_oid(rel->rd_id, relids); + + /* Add toast relation if any */ + if (OidIsValid(rel->rd_rel->reltoastrelid)) + { + List *toastidx; + Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode); + + relids = lappend_oid(relids, rel->rd_rel->reltoastrelid); + toastidx = RelationGetIndexList(toastrel); + relids = list_concat(relids, toastidx); + pfree(toastidx); + table_close(toastrel, NoLock); + } + + table_close(rel, NoLock); + + /* Make changes in storage */ + classRel = table_open(RelationRelationId, RowExclusiveLock); + + foreach (lc_oid, relids) + { + Oid reloid = lfirst_oid(lc_oid); + Relation r = relation_open(reloid, lockmode); + + /* + * XXXX: Some access methods do not bear up an in-place persistence + * change. Specifically, GiST uses page LSNs to figure out whether a + * block has changed, where UNLOGGED GiST indexes use fake LSNs that + * are incompatible with real LSNs used for LOGGED ones. + * + * Maybe if gistGetFakeLSN behaved the same way for permanent and + * unlogged indexes, we could skip index rebuild in exchange of some + * extra WAL records emitted while it is unlogged. + * + * Check relam against a positive list so that we take this way for + * unknown AMs.
+ */ + if (r->rd_rel->relkind == RELKIND_INDEX && + /* GiST is excluded */ + r->rd_rel->relam != BTREE_AM_OID && + r->rd_rel->relam != HASH_AM_OID && + r->rd_rel->relam != GIN_AM_OID && + r->rd_rel->relam != SPGIST_AM_OID && + r->rd_rel->relam != BRIN_AM_OID) + { + int reindex_flags; + ReindexParams params = {0}; + + /* reindex doesn't allow concurrent use of the index */ + table_close(r, NoLock); + + reindex_flags = + REINDEX_REL_SUPPRESS_INDEX_USE | + REINDEX_REL_CHECK_CONSTRAINTS; + + /* Set the same persistence with the parent relation. */ + if (persistence == RELPERSISTENCE_UNLOGGED) + reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED; + else + reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT; + + reindex_index(reloid, reindex_flags, persistence, &params); + + continue; + } + + /* Create or drop init fork */ + if (persistence == RELPERSISTENCE_UNLOGGED) + RelationCreateInitFork(r); + else + RelationDropInitFork(r); + + /* + * When this relation gets WAL-logged, immediately sync all files but + * initfork to establish the initial state on storage. Buffers have + * already been flushed out by RelationCreate(Drop)InitFork called just + * above. Initfork should have been synced as needed.
+ */ + if (persistence == RELPERSISTENCE_PERMANENT) + { + for (i = 0 ; i < INIT_FORKNUM ; i++) + { + if (smgrexists(RelationGetSmgr(r), i)) + smgrimmedsync(RelationGetSmgr(r), i); + } + } + + /* Update catalog */ + tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid)); + if (!HeapTupleIsValid(tuple)) + elog(ERROR, "cache lookup failed for relation %u", reloid); + + memset(new_val, 0, sizeof(new_val)); + memset(new_null, false, sizeof(new_null)); + memset(new_repl, false, sizeof(new_repl)); + + new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence); + new_null[Anum_pg_class_relpersistence - 1] = false; + new_repl[Anum_pg_class_relpersistence - 1] = true; + + newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel), + new_val, new_null, new_repl); + + CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple); + heap_freetuple(newtuple); + + /* + * While wal_level >= replica, switching to LOGGED requires the + * relation content to be WAL-logged to recover the table. + * We don't emit this while wal_level = minimal. + */ + if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded()) + { + ForkNumber fork; + xl_smgr_truncate xlrec; + + xlrec.blkno = 0; + xlrec.rnode = r->rd_node; + xlrec.flags = SMGR_TRUNCATE_ALL; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + + XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE); + + for (fork = 0; fork < INIT_FORKNUM ; fork++) + { + if (smgrexists(RelationGetSmgr(r), fork)) + log_newpage_range(r, fork, 0, + smgrnblocks(RelationGetSmgr(r), fork), + false); + } + } + + table_close(r, NoLock); + } + + table_close(classRel, NoLock); +} + /* * ATRewriteTables: ALTER TABLE phase 3 */ @@ -5477,47 +5659,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode, tab->relid, tab->rewrite); - /* - * Create transient table that will receive the modified data. - * - * Ensure it is marked correctly as logged or unlogged.
We have - * to do this here so that buffers for the new relfilenode will - * have the right persistence set, and at the same time ensure - * that the original filenode's buffers will get read in with the - * correct setting (i.e. the original one). Otherwise a rollback - * after the rewrite would possibly result with buffers for the - * original filenode having the wrong persistence setting. - * - * NB: This relies on swap_relation_files() also swapping the - * persistence. That wouldn't work for pg_class, but that can't be - * unlogged anyway. - */ - OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod, - persistence, lockmode); + if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE) + RelationChangePersistence(tab, persistence, lockmode); + else + { + /* + * Create transient table that will receive the modified data. + * + * Ensure it is marked correctly as logged or unlogged. We + * have to do this here so that buffers for the new relfilenode + * will have the right persistence set, and at the same time + * ensure that the original filenode's buffers will get read in + * with the correct setting (i.e. the original one). Otherwise + * a rollback after the rewrite would possibly result with + * buffers for the original filenode having the wrong + * persistence setting. + * + * NB: This relies on swap_relation_files() also swapping the + * persistence. That wouldn't work for pg_class, but that can't + * be unlogged anyway. + */ + OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, + NewAccessMethod, + persistence, lockmode); - /* - * Copy the heap data into the new table with the desired - * modifications, and test the current data within the table - * against new constraints generated by ALTER TABLE commands. - */ - ATRewriteTable(tab, OIDNewHeap, lockmode); + /* + * Copy the heap data into the new table with the desired + * modifications, and test the current data within the table + * against new constraints generated by ALTER TABLE commands. 
+ */ + ATRewriteTable(tab, OIDNewHeap, lockmode); - /* - * Swap the physical files of the old and new heaps, then rebuild - * indexes and discard the old heap. We can use RecentXmin for - * the table's new relfrozenxid because we rewrote all the tuples - * in ATRewriteTable, so no older Xid remains in the table. Also, - * we never try to swap toast tables by content, since we have no - * interest in letting this code work on system catalogs. - */ - finish_heap_swap(tab->relid, OIDNewHeap, - false, false, true, - !OidIsValid(tab->newTableSpace), - RecentXmin, - ReadNextMultiXactId(), - persistence); + /* + * Swap the physical files of the old and new heaps, then + * rebuild indexes and discard the old heap. We can use + * RecentXmin for the table's new relfrozenxid because we + * rewrote all the tuples in ATRewriteTable, so no older Xid + * remains in the table. Also, we never try to swap toast + * tables by content, since we have no interest in letting this + * code work on system catalogs. 
+ */ + finish_heap_swap(tab->relid, OIDNewHeap, + false, false, true, + !OidIsValid(tab->newTableSpace), + RecentXmin, + ReadNextMultiXactId(), + persistence); - InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0); + InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0); + } } else { diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c index 0bf28b55d7..17185f4e55 100644 --- a/src/backend/replication/basebackup.c +++ b/src/backend/replication/basebackup.c @@ -1209,6 +1209,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly, bool excludeFound; ForkNumber relForkNum; /* Type of fork if file is a relation */ int relOidChars; /* Chars in filename that are the rel oid */ + StorageMarks mark; /* Skip special stuff */ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0) @@ -1259,7 +1260,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly, /* Exclude all forks for unlogged tables except the init fork */ if (isDbDir && parse_filename_for_nontemp_relation(de->d_name, &relOidChars, - &relForkNum)) + &relForkNum, &mark)) { /* Never exclude init forks */ if (relForkNum != INIT_FORKNUM) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index f5459c68f8..6cd010429a 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -38,6 +38,7 @@ #include "access/xlogutils.h" #include "catalog/catalog.h" #include "catalog/storage.h" +#include "catalog/storage_xlog.h" #include "executor/instrument.h" #include "lib/binaryheap.h" #include "miscadmin.h" @@ -3155,6 +3156,91 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum, } } +/* --------------------------------------------------------------------- + * SetRelFileNodeBuffersPersistence + * + * This function changes the persistence of all buffer pages of a relation + * then writes all dirty pages of the relation out to disk when switching + 
* to PERMANENT. (or more accurately, out to kernel disk buffers), + * ensuring that the kernel has an up-to-date view of the relation. + * + * Generally, the caller should be holding AccessExclusiveLock on the + * target relation to ensure that no other backend is busy dirtying + * more blocks of the relation; the effects can't be expected to last + * after the lock is released. + * + * XXX currently it sequentially searches the buffer pool, should be + * changed to more clever ways of searching. This routine is not + * used in any performance-critical code paths, so it's not worth + * adding additional overhead to normal paths to make it go faster; + * but see also DropRelFileNodeBuffers. + * -------------------------------------------------------------------- + */ +void +SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo) +{ + int i; + RelFileNodeBackend rnode = srel->smgr_rnode; + + Assert (!RelFileNodeBackendIsTemp(rnode)); + + if (!isRedo) + log_smgrbufpersistence(&srel->smgr_rnode.node, permanent); + + ResourceOwnerEnlargeBuffers(CurrentResourceOwner); + + for (i = 0; i < NBuffers; i++) + { + BufferDesc *bufHdr = GetBufferDescriptor(i); + uint32 buf_state; + + if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node)) + continue; + + ReservePrivateRefCountEntry(); + + buf_state = LockBufHdr(bufHdr); + + if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node)) + { + UnlockBufHdr(bufHdr, buf_state); + continue; + } + + if (permanent) + { + /* Init fork is being dropped, drop buffers for it. 
*/ + if (bufHdr->tag.forkNum == INIT_FORKNUM) + { + InvalidateBuffer(bufHdr); + continue; + } + + buf_state |= BM_PERMANENT; + pg_atomic_write_u32(&bufHdr->state, buf_state); + + /* we flush this buffer when switching to PERMANENT */ + if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY)) + { + PinBuffer_Locked(bufHdr); + LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), + LW_SHARED); + FlushBuffer(bufHdr, srel); + LWLockRelease(BufferDescriptorGetContentLock(bufHdr)); + UnpinBuffer(bufHdr, true); + } + else + UnlockBufHdr(bufHdr, buf_state); + } + else + { + /* There shouldn't be an init fork */ + Assert(bufHdr->tag.forkNum != INIT_FORKNUM); + UnlockBufHdr(bufHdr, buf_state); + } + } +} + /* --------------------------------------------------------------------- * DropRelFileNodesAllBuffers * diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index 14b77f2861..2fc9f17c28 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -349,8 +349,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel); static void datadir_fsync_fname(const char *fname, bool isdir, int elevel); static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel); -static int fsync_parent_path(const char *fname, int elevel); - /* * pg_fsync --- do fsync with or without writethrough @@ -3759,7 +3757,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel) * This is aimed at making file operations persistent on disk in case of * an OS crash or power failure. 
*/ -static int +int fsync_parent_path(const char *fname, int elevel) { char parentpath[MAXPGPATH]; diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c index f053fe0495..f28f55baa6 100644 --- a/src/backend/storage/file/reinit.c +++ b/src/backend/storage/file/reinit.c @@ -16,29 +16,49 @@ #include <unistd.h> +#include "access/xlogrecovery.h" +#include "catalog/pg_tablespace_d.h" #include "common/relpath.h" #include "postmaster/startup.h" +#include "storage/bufmgr.h" #include "storage/copydir.h" #include "storage/fd.h" +#include "storage/md.h" #include "storage/reinit.h" +#include "storage/smgr.h" #include "utils/hsearch.h" #include "utils/memutils.h" static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, - int op); + Oid tspid, int op); static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, - int op); + Oid tspid, Oid dbid, int op); typedef struct { Oid reloid; /* hash key */ -} unlogged_relation_entry; + bool has_init; /* has INIT fork */ + bool dirty_init; /* needs to remove INIT fork */ + bool dirty_all; /* needs to remove all forks */ +} relfile_entry; /* - * Reset unlogged relations from before the last restart. + * Clean up and reset relation files from before the last restart. * - * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any - * relation with an "init" fork, except for the "init" fork itself. + * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations + * depending on the existence of the "cleanup" forks. + * + * If SMGR_MARK_UNCOMMITTED mark file for init fork is present, we remove the + * init fork along with the mark file. + * + * If SMGR_MARK_UNCOMMITTED mark file for main fork is present we remove the + * whole relation along with the mark file. + * + * Otherwise, if the "init" fork is found. we remove all forks of any relation + * with the "init" fork, except for the "init" fork itself. 
+ * + * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all + * relations that have the "cleanup" and/or the "init" forks. * * If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main * fork. @@ -72,7 +92,7 @@ ResetUnloggedRelations(int op) /* * First process unlogged files in pg_default ($PGDATA/base) */ - ResetUnloggedRelationsInTablespaceDir("base", op); + ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op); /* * Cycle through directories for all non-default tablespaces. @@ -81,13 +101,19 @@ ResetUnloggedRelations(int op) while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL) { + Oid tspid; + if (strcmp(spc_de->d_name, ".") == 0 || strcmp(spc_de->d_name, "..") == 0) continue; snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s", spc_de->d_name, TABLESPACE_VERSION_DIRECTORY); - ResetUnloggedRelationsInTablespaceDir(temp_path, op); + + tspid = atooid(spc_de->d_name); + + Assert(tspid != 0); + ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op); } FreeDir(spc_dir); @@ -103,7 +129,8 @@ ResetUnloggedRelations(int op) * Process one tablespace directory for ResetUnloggedRelations */ static void -ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) +ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, + Oid tspid, int op) { DIR *ts_dir; struct dirent *de; @@ -130,6 +157,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) while ((de = ReadDir(ts_dir, tsdirname)) != NULL) { + Oid dbid; + /* * We're only interested in the per-database directories, which have * numeric names. Note that this code will also (properly) ignore "." 
@@ -148,7 +177,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) ereport_startup_progress("resetting unlogged relations (cleanup), elapsed time: %ld.%02d s, current path: %s", dbspace_path); - ResetUnloggedRelationsInDbspaceDir(dbspace_path, op); + dbid = atooid(de->d_name); + Assert(dbid != 0); + + ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op); } FreeDir(ts_dir); @@ -158,125 +190,227 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) * Process one per-dbspace directory for ResetUnloggedRelations */ static void -ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) +ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, + Oid tspid, Oid dbid, int op) { DIR *dbspace_dir; struct dirent *de; char rm_path[MAXPGPATH * 2]; + HTAB *hash; + HASHCTL ctl; /* Caller must specify at least one operation. */ - Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0); + Assert((op & (UNLOGGED_RELATION_CLEANUP | + UNLOGGED_RELATION_DROP_BUFFER | + UNLOGGED_RELATION_INIT)) != 0); /* * Cleanup is a two-pass operation. First, we go through and identify all * the files with init forks. Then, we go through again and nuke * everything with the same OID except the init fork. */ + + /* + * It's possible that someone could create tons of unlogged relations in + * the same database & tablespace, so we'd better use a hash table rather + * than an array or linked list to keep track of which files need to be + * reset. Otherwise, this cleanup operation would be O(n^2). + */ + memset(&ctl, 0, sizeof(ctl)); + ctl.keysize = sizeof(Oid); + ctl.entrysize = sizeof(relfile_entry); + hash = hash_create("relfilenode cleanup hash", + 32, &ctl, HASH_ELEM | HASH_BLOBS); + + /* Collect INIT fork and mark files in the directory. 
*/ + dbspace_dir = AllocateDir(dbspacedirname); + while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) + { + int oidchars; + ForkNumber forkNum; + StorageMarks mark; + + /* Skip anything that doesn't look like a relation data file. */ + if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, + &forkNum, &mark)) + continue; + + if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED) + { + Oid key; + relfile_entry *ent; + bool found; + + /* + * Record the relfilenode information. If it has + * SMGR_MARK_UNCOMMITTED mark files, the relfilenode is in dirty + * state, where clean up is needed. + */ + key = atooid(de->d_name); + ent = hash_search(hash, &key, HASH_ENTER, &found); + + if (!found) + { + ent->has_init = false; + ent->dirty_init = false; + ent->dirty_all = false; + } + + if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED) + ent->dirty_init = true; + else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED) + ent->dirty_all = true; + else + { + Assert(forkNum == INIT_FORKNUM); + ent->has_init = true; + } + } + } + + /* Done with the first pass. */ + FreeDir(dbspace_dir); + + /* nothing to do if we don't have init nor cleanup forks */ + if (hash_get_num_entries(hash) < 1) + { + hash_destroy(hash); + return; + } + + if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0) + { + /* + * When we come here after recovery, smgr object for this file might + * have been created. In that case we need to drop all buffers then the + * smgr object before initializing the unlogged relation. This is safe + * as far as no other backends have accessed the relation before + * starting archive recovery. 
+ */ + HASH_SEQ_STATUS status; + relfile_entry *ent; + SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8); + int maxrels = 8; + int nrels = 0; + int i; + + Assert(!HotStandbyActive()); + + hash_seq_init(&status, hash); + while ((ent = (relfile_entry *) hash_seq_search(&status)) != NULL) + { + RelFileNodeBackend rel; + + /* + * The relation is persistent and remains persistent. Don't + * drop the buffers for this relation. + */ + if (ent->has_init && ent->dirty_init) + continue; + + if (maxrels <= nrels) + { + maxrels *= 2; + srels = repalloc(srels, sizeof(SMgrRelation) * maxrels); + } + + rel.backend = InvalidBackendId; + rel.node.spcNode = tspid; + rel.node.dbNode = dbid; + rel.node.relNode = ent->reloid; + + srels[nrels++] = smgropen(rel.node, InvalidBackendId); + } + + DropRelFileNodesAllBuffers(srels, nrels); + + for (i = 0; i < nrels; i++) + smgrclose(srels[i]); + } + + /* + * Now, make a second pass and remove anything that matches. + */ if ((op & UNLOGGED_RELATION_CLEANUP) != 0) { - HTAB *hash; - HASHCTL ctl; - - /* - * It's possible that someone could create a ton of unlogged relations - * in the same database & tablespace, so we'd better use a hash table - * rather than an array or linked list to keep track of which files - * need to be reset. Otherwise, this cleanup operation would be - * O(n^2). - */ - ctl.keysize = sizeof(Oid); - ctl.entrysize = sizeof(unlogged_relation_entry); - ctl.hcxt = CurrentMemoryContext; - hash = hash_create("unlogged relation OIDs", 32, &ctl, - HASH_ELEM | HASH_BLOBS | HASH_CONTEXT); - - /* Scan the directory. */ dbspace_dir = AllocateDir(dbspacedirname); while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) { - ForkNumber forkNum; - int oidchars; - unlogged_relation_entry ent; + ForkNumber forkNum; + StorageMarks mark; + int oidchars; + Oid key; + relfile_entry *ent; + RelFileNodeBackend rel; /* Skip anything that doesn't look like a relation data file. 
*/ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) - continue; - - /* Also skip it unless this is the init fork. */ - if (forkNum != INIT_FORKNUM) - continue; - - /* - * Put the OID portion of the name into the hash table, if it - * isn't already. - */ - ent.reloid = atooid(de->d_name); - (void) hash_search(hash, &ent, HASH_ENTER, NULL); - } - - /* Done with the first pass. */ - FreeDir(dbspace_dir); - - /* - * If we didn't find any init forks, there's no point in continuing; - * we can bail out now. - */ - if (hash_get_num_entries(hash) == 0) - { - hash_destroy(hash); - return; - } - - /* - * Now, make a second pass and remove anything that matches. - */ - dbspace_dir = AllocateDir(dbspacedirname); - while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) - { - ForkNumber forkNum; - int oidchars; - unlogged_relation_entry ent; - - /* Skip anything that doesn't look like a relation data file. */ - if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) - continue; - - /* We never remove the init fork. */ - if (forkNum == INIT_FORKNUM) + &forkNum, &mark)) continue; /* * See whether the OID portion of the name shows up in the hash * table. If so, nuke it! */ - ent.reloid = atooid(de->d_name); - if (hash_search(hash, &ent, HASH_FIND, NULL)) + key = atooid(de->d_name); + ent = hash_search(hash, &key, HASH_FIND, NULL); + + if (!ent) + continue; + + if (!ent->dirty_all) { - snprintf(rm_path, sizeof(rm_path), "%s/%s", - dbspacedirname, de->d_name); - if (unlink(rm_path) < 0) - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not remove file \"%s\": %m", - rm_path))); + /* clean permanent relations don't need cleanup */ + if (!ent->has_init) + continue; + + if (ent->dirty_init) + { + /* + * The crashed transaction did SET UNLOGGED. This relation + * is restored to a LOGGED relation. 
+ */ + if (forkNum != INIT_FORKNUM) + continue; + } else - elog(DEBUG2, "unlinked file \"%s\"", rm_path); + { + /* + * we don't remove the INIT fork of a non-dirty + * relfilenode + */ + if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE) + continue; + } } + + /* so, nuke it! */ + snprintf(rm_path, sizeof(rm_path), "%s/%s", + dbspacedirname, de->d_name); + if (unlink(rm_path) < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not remove file \"%s\": %m", + rm_path))); + + rel.backend = InvalidBackendId; + rel.node.spcNode = tspid; + rel.node.dbNode = dbid; + rel.node.relNode = atooid(de->d_name); + + ForgetRelationForkSyncRequests(rel, forkNum); } /* Cleanup is complete. */ FreeDir(dbspace_dir); - hash_destroy(hash); } + hash_destroy(hash); + hash = NULL; + /* * Initialization happens after cleanup is complete: we copy each init - * fork file to the corresponding main fork file. Note that if we are - * asked to do both cleanup and init, we may never get here: if the - * cleanup code determines that there are no init forks in this dbspace, - * it will return before we get to this point. + * fork file to the corresponding main fork file. */ if ((op & UNLOGGED_RELATION_INIT) != 0) { @@ -285,6 +419,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) { ForkNumber forkNum; + StorageMarks mark; int oidchars; char oidbuf[OIDCHARS + 1]; char srcpath[MAXPGPATH * 2]; @@ -292,9 +427,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) /* Skip anything that doesn't look like a relation data file. */ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) + &forkNum, &mark)) continue; + Assert(mark == SMGR_MARK_NONE); + /* Also skip it unless this is the init fork. 
*/ if (forkNum != INIT_FORKNUM) continue; @@ -328,15 +465,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) { ForkNumber forkNum; + StorageMarks mark; int oidchars; char oidbuf[OIDCHARS + 1]; char mainpath[MAXPGPATH]; /* Skip anything that doesn't look like a relation data file. */ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) + &forkNum, &mark)) continue; + Assert(mark == SMGR_MARK_NONE); + /* Also skip it unless this is the init fork. */ if (forkNum != INIT_FORKNUM) continue; @@ -379,7 +519,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) */ bool parse_filename_for_nontemp_relation(const char *name, int *oidchars, - ForkNumber *fork) + ForkNumber *fork, StorageMarks *mark) { int pos; @@ -410,11 +550,19 @@ parse_filename_for_nontemp_relation(const char *name, int *oidchars, for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar) ; - if (segchar <= 1) - return false; - pos += segchar; + if (segchar > 1) + pos += segchar; } + /* mark file? */ + if (name[pos] == '.' && name[pos + 1] != 0) + { + *mark = name[pos + 1]; + pos += 2; + } + else + *mark = SMGR_MARK_NONE; + /* Now we should be at the end. */ if (name[pos] != '\0') return false; diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index 879f647dbc..4d44bdd78b 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -139,7 +139,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno, BlockNumber blkno, bool skipFsync, int behavior); static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg); - +static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum, + StorageMarks mark); /* * mdinit() -- Initialize private state for magnetic disk storage manager. 
@@ -169,6 +170,82 @@ mdexists(SMgrRelation reln, ForkNumber forkNum) return (mdopenfork(reln, forkNum, EXTENSION_RETURN_NULL) != NULL); } +/* + * mdcreatemark() -- Create a mark file. + * + * If isRedo is true, it's okay for the file to exist already. + */ +void +mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark, + bool isRedo) +{ + char *path = markpath(reln->smgr_rnode, forkNum, mark); + int fd; + + /* See mdcreate for details. */ + TablespaceCreateDbspace(reln->smgr_rnode.node.spcNode, + reln->smgr_rnode.node.dbNode, + isRedo); + + fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL); + if (fd < 0 && (!isRedo || errno != EEXIST)) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not create mark file \"%s\": %m", path))); + + pg_fsync(fd); + close(fd); + + /* + * To guarantee that the creation of the file is persistent, fsync its + * parent directory. + */ + fsync_parent_path(path, ERROR); + + pfree(path); +} + + +/* + * mdunlinkmark() -- Delete the mark file + * + * If isRedo is true, it's okay for the file not to be found. + */ +void +mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark, + bool isRedo) +{ + char *path = markpath(reln->smgr_rnode, forkNum, mark); + + if (!isRedo || mdmarkexists(reln, forkNum, mark)) + durable_unlink(path, ERROR); + + pfree(path); +} + +/* + * mdmarkexists() -- Check if the file exists. + */ +static bool +mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark) +{ + char *path = markpath(reln->smgr_rnode, forkNum, mark); + int fd; + + fd = BasicOpenFile(path, O_RDONLY); + if (fd < 0 && errno != ENOENT) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not access mark file \"%s\": %m", path))); + pfree(path); + + if (fd < 0) + return false; + + close(fd); + return true; +} + /* * mdcreate() -- Create a new relation on magnetic disk. 
* @@ -1031,6 +1108,15 @@ register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum, RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ ); } +/* + * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork + */ +void +ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, ForkNumber forknum) +{ + register_forget_request(rnode, forknum, 0); +} + /* * ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB */ @@ -1384,12 +1470,14 @@ mdsyncfiletag(const FileTag *ftag, char *path) * Return 0 on success, -1 on failure, with errno set. */ int -mdunlinkfiletag(const FileTag *ftag, char *path) +mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark) { char *p; /* Compute the path. */ - p = relpathperm(ftag->rnode, MAIN_FORKNUM); + p = GetRelationPath(ftag->rnode.dbNode, ftag->rnode.spcNode, + ftag->rnode.relNode, InvalidBackendId, MAIN_FORKNUM, + mark); strlcpy(path, p, MAXPGPATH); pfree(p); diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c index d71a557a35..0710e8b145 100644 --- a/src/backend/storage/smgr/smgr.c +++ b/src/backend/storage/smgr/smgr.c @@ -63,6 +63,10 @@ typedef struct f_smgr void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks); void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum); + void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); + void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); } f_smgr; static const f_smgr smgrsw[] = { @@ -84,6 +88,8 @@ static const f_smgr smgrsw[] = { .smgr_nblocks = mdnblocks, .smgr_truncate = mdtruncate, .smgr_immedsync = mdimmedsync, + .smgr_createmark = mdcreatemark, + .smgr_unlinkmark = mdunlinkmark, } }; @@ -337,6 +343,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo) smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo); } +/* + * smgrcreatemark() -- Create a mark 
file + */ +void +smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark, + bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo); +} + +/* + * smgrunlinkmark() -- Delete a mark file + */ +void +smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark, + bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo); +} + /* * smgrdosyncall() -- Immediately sync all forks of all given relations * @@ -664,6 +690,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum) smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum); } +void +smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo); +} + /* * AtEOXact_SMgr * diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c index e161d57761..f5ded7cb34 100644 --- a/src/backend/storage/sync/sync.c +++ b/src/backend/storage/sync/sync.c @@ -90,7 +90,8 @@ static CycleCtr checkpoint_cycle_ctr = 0; typedef struct SyncOps { int (*sync_syncfiletag) (const FileTag *ftag, char *path); - int (*sync_unlinkfiletag) (const FileTag *ftag, char *path); + int (*sync_unlinkfiletag) (const FileTag *ftag, char *path, + StorageMarks mark); bool (*sync_filetagmatches) (const FileTag *ftag, const FileTag *candidate); } SyncOps; @@ -223,7 +224,8 @@ SyncPostCheckpoint(void) /* Unlink the file */ if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag, - path) < 0) + path, + SMGR_MARK_NONE) < 0) { /* * There's a race condition, when the database is dropped at the @@ -237,6 +239,20 @@ SyncPostCheckpoint(void) (errcode_for_file_access(), errmsg("could not remove file \"%s\": %m", path))); } + else if (syncsw[entry->tag.handler].sync_unlinkfiletag( + &entry->tag, path, + SMGR_MARK_UNCOMMITTED) < 0) + { + /* + * And we may have SMGR_MARK_UNCOMMITTED file. Remove it if the + * fork files has been successfully removed. 
It's ok if the file + * does not exist. + */ + if (errno != ENOENT) + ereport(WARNING, + (errcode_for_file_access(), + errmsg("could not remove file \"%s\": %m", path))); + } /* Mark the list entry as canceled, just in case */ entry->canceled = true; diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c index 56df08c64f..f1382d4c4f 100644 --- a/src/bin/pg_rewind/parsexlog.c +++ b/src/bin/pg_rewind/parsexlog.c @@ -407,6 +407,28 @@ extractPageInfo(XLogReaderState *record) * source system. */ } + else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK) + { + /* + * We can safely ignore these. We'll see that the files don't exist in + * the target data dir, and copy them in from the source system. No + * need to do anything special here. + */ + } + else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK) + { + /* + * We can safely ignore these. The file will be removed from the + * target, if it doesn't exist in the source system. The files are + * empty so we don't need to bother with the content. + */ + } + else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE) + { + /* + * We can safely ignore these. These don't make any on-disk changes. 
+ */ + } else if (rmid == RM_XACT_ID && ((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT || (rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED || diff --git a/src/common/relpath.c b/src/common/relpath.c index 636c96efd3..1c19e16fea 100644 --- a/src/common/relpath.c +++ b/src/common/relpath.c @@ -139,9 +139,15 @@ GetDatabasePath(Oid dbNode, Oid spcNode) */ char * GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, - int backendId, ForkNumber forkNumber) + int backendId, ForkNumber forkNumber, char mark) { char *path; + char markstr[4]; + + if (mark == 0) + markstr[0] = 0; + else + snprintf(markstr, sizeof(markstr), ".%c", mark); if (spcNode == GLOBALTABLESPACE_OID) { @@ -149,10 +155,10 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, Assert(dbNode == 0); Assert(backendId == InvalidBackendId); if (forkNumber != MAIN_FORKNUM) - path = psprintf("global/%u_%s", - relNode, forkNames[forkNumber]); + path = psprintf("global/%u_%s%s", + relNode, forkNames[forkNumber], markstr); else - path = psprintf("global/%u", relNode); + path = psprintf("global/%u%s", relNode, markstr); } else if (spcNode == DEFAULTTABLESPACE_OID) { @@ -160,22 +166,22 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, if (backendId == InvalidBackendId) { if (forkNumber != MAIN_FORKNUM) - path = psprintf("base/%u/%u_%s", + path = psprintf("base/%u/%u_%s%s", dbNode, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("base/%u/%u", - dbNode, relNode); + path = psprintf("base/%u/%u%s", + dbNode, relNode, markstr); } else { if (forkNumber != MAIN_FORKNUM) - path = psprintf("base/%u/t%d_%u_%s", + path = psprintf("base/%u/t%d_%u_%s%s", dbNode, backendId, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("base/%u/t%d_%u", - dbNode, backendId, relNode); + path = psprintf("base/%u/t%d_%u%s", + dbNode, backendId, relNode, markstr); } } else @@ -184,27 +190,28 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid 
relNode, if (backendId == InvalidBackendId) { if (forkNumber != MAIN_FORKNUM) - path = psprintf("pg_tblspc/%u/%s/%u/%u_%s", + path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s", spcNode, TABLESPACE_VERSION_DIRECTORY, dbNode, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("pg_tblspc/%u/%s/%u/%u", + path = psprintf("pg_tblspc/%u/%s/%u/%u%s", spcNode, TABLESPACE_VERSION_DIRECTORY, - dbNode, relNode); + dbNode, relNode, markstr); } else { if (forkNumber != MAIN_FORKNUM) - path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s", + path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s", spcNode, TABLESPACE_VERSION_DIRECTORY, dbNode, backendId, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u", + path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s", spcNode, TABLESPACE_VERSION_DIRECTORY, - dbNode, backendId, relNode); + dbNode, backendId, relNode, markstr); } } + return path; } diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h index 9ffc741913..d362d62ed2 100644 --- a/src/include/catalog/storage.h +++ b/src/include/catalog/storage.h @@ -23,6 +23,8 @@ extern int wal_skip_threshold; extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence); +extern void RelationCreateInitFork(Relation rel); +extern void RelationDropInitFork(Relation rel); extern void RelationDropStorage(Relation rel); extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit); extern void RelationPreTruncate(Relation rel); @@ -41,6 +43,7 @@ extern void RestorePendingSyncs(char *startAddress); extern void smgrDoPendingDeletes(bool isCommit); extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker); extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr); +extern void smgrDoPendingCleanups(bool isCommit); extern void AtSubCommit_smgr(void); extern void AtSubAbort_smgr(void); extern void PostPrepare_smgr(void); diff --git 
a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h index 622de22b03..8139308634 100644 --- a/src/include/catalog/storage_xlog.h +++ b/src/include/catalog/storage_xlog.h @@ -18,17 +18,23 @@ #include "lib/stringinfo.h" #include "storage/block.h" #include "storage/relfilenode.h" +#include "storage/smgr.h" /* * Declarations for smgr-related XLOG records * - * Note: we log file creation and truncation here, but logging of deletion - * actions is handled by xact.c, because it is part of transaction commit. + * Note: we log file creation, truncation and buffer persistence change here, + * but logging of deletion actions is handled mainly by xact.c, because it is + * part of transaction commit in most cases. However, there's a case where + * init forks are deleted outside control of transaction. */ /* XLOG gives us high 4 bits */ #define XLOG_SMGR_CREATE 0x10 #define XLOG_SMGR_TRUNCATE 0x20 +#define XLOG_SMGR_UNLINK 0x30 +#define XLOG_SMGR_MARK 0x40 +#define XLOG_SMGR_BUFPERSISTENCE 0x50 typedef struct xl_smgr_create { @@ -36,6 +42,32 @@ typedef struct xl_smgr_create ForkNumber forkNum; } xl_smgr_create; +typedef struct xl_smgr_unlink +{ + RelFileNode rnode; + ForkNumber forkNum; +} xl_smgr_unlink; + +typedef enum smgr_mark_action +{ + XLOG_SMGR_MARK_CREATE = 'c', + XLOG_SMGR_MARK_UNLINK = 'u' +} smgr_mark_action; + +typedef struct xl_smgr_mark +{ + RelFileNode rnode; + ForkNumber forkNum; + StorageMarks mark; + smgr_mark_action action; +} xl_smgr_mark; + +typedef struct xl_smgr_bufpersistence +{ + RelFileNode rnode; + bool persistence; +} xl_smgr_bufpersistence; + /* flags for xl_smgr_truncate */ #define SMGR_TRUNCATE_HEAP 0x0001 #define SMGR_TRUNCATE_VM 0x0002 @@ -51,6 +83,12 @@ typedef struct xl_smgr_truncate } xl_smgr_truncate; extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum); +extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum); +extern void log_smgrcreatemark(const RelFileNode *rnode, ForkNumber 
forkNum, + StorageMarks mark); +extern void log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum, + StorageMarks mark); +extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence); extern void smgr_redo(XLogReaderState *record); extern void smgr_desc(StringInfo buf, XLogReaderState *record); diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h index a4b5dc853b..a864c91614 100644 --- a/src/include/common/relpath.h +++ b/src/include/common/relpath.h @@ -67,7 +67,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork); extern char *GetDatabasePath(Oid dbNode, Oid spcNode); extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, - int backendId, ForkNumber forkNumber); + int backendId, ForkNumber forkNumber, char mark); /* * Wrapper macros for GetRelationPath. Beware of multiple @@ -77,7 +77,7 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, /* First argument is a RelFileNode */ #define relpathbackend(rnode, backend, forknum) \ GetRelationPath((rnode).dbNode, (rnode).spcNode, (rnode).relNode, \ - backend, forknum) + backend, forknum, 0) /* First argument is a RelFileNode */ #define relpathperm(rnode, forknum) \ @@ -87,4 +87,9 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, #define relpath(rnode, forknum) \ relpathbackend((rnode).node, (rnode).backend, forknum) +/* First argument is a RelFileNodeBackend */ +#define markpath(rnode, forknum, mark) \ + GetRelationPath((rnode).node.dbNode, (rnode).node.spcNode, \ + (rnode).node.relNode, \ + (rnode).backend, forknum, mark) #endif /* RELPATH_H */ diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h index dd01841c30..739b386216 100644 --- a/src/include/storage/bufmgr.h +++ b/src/include/storage/bufmgr.h @@ -206,6 +206,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels) extern void FlushDatabaseBuffers(Oid dbid); extern void DropRelFileNodeBuffers(struct 
SMgrRelationData *smgr_reln, ForkNumber *forkNum, int nforks, BlockNumber *firstDelBlock); +extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel, + bool permanent, bool isRedo); extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes); extern void DropDatabaseBuffers(Oid dbid); diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h index 29209e2724..8bf746bf45 100644 --- a/src/include/storage/fd.h +++ b/src/include/storage/fd.h @@ -185,6 +185,7 @@ extern ssize_t pg_pwritev_with_retry(int fd, extern int pg_truncate(const char *path, off_t length); extern void fsync_fname(const char *fname, bool isdir); extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel); +extern int fsync_parent_path(const char *fname, int elevel); extern int durable_rename(const char *oldfile, const char *newfile, int loglevel); extern int durable_unlink(const char *fname, int loglevel); extern int durable_rename_excl(const char *oldfile, const char *newfile, int loglevel); diff --git a/src/include/storage/md.h b/src/include/storage/md.h index 6e46d8d96a..ef5fdaf4f8 100644 --- a/src/include/storage/md.h +++ b/src/include/storage/md.h @@ -24,6 +24,10 @@ extern void mdinit(void); extern void mdopen(SMgrRelation reln); extern void mdclose(SMgrRelation reln, ForkNumber forknum); extern void mdrelease(void); +extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); +extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo); extern bool mdexists(SMgrRelation reln, ForkNumber forknum); extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo); @@ -42,12 +46,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks); extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum); +extern void 
ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, + ForkNumber forknum); extern void ForgetDatabaseSyncRequests(Oid dbid); extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo); /* md sync callbacks */ extern int mdsyncfiletag(const FileTag *ftag, char *path); -extern int mdunlinkfiletag(const FileTag *ftag, char *path); +extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark); extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate); #endif /* MD_H */ diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h index bf2c10d443..e399aec0c7 100644 --- a/src/include/storage/reinit.h +++ b/src/include/storage/reinit.h @@ -16,13 +16,15 @@ #define REINIT_H #include "common/relpath.h" - +#include "storage/smgr.h" extern void ResetUnloggedRelations(int op); -extern bool parse_filename_for_nontemp_relation(const char *name, - int *oidchars, ForkNumber *fork); +extern bool parse_filename_for_nontemp_relation(const char *name, int *oidchars, + ForkNumber *fork, + StorageMarks *mark); #define UNLOGGED_RELATION_CLEANUP 0x0001 -#define UNLOGGED_RELATION_INIT 0x0002 +#define UNLOGGED_RELATION_DROP_BUFFER 0x0002 +#define UNLOGGED_RELATION_INIT 0x0004 #endif /* REINIT_H */ diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h index 8e3ef92cda..022654b7b2 100644 --- a/src/include/storage/smgr.h +++ b/src/include/storage/smgr.h @@ -18,6 +18,18 @@ #include "storage/block.h" #include "storage/relfilenode.h" +/* + * A storage mark is a file whose existence indicates something about a + * file. The name of such a file is "<filename>.<mark>", where the mark is one + * of the values of StorageMarks. Since ".<digit>" denotes segment files, don't + * use digits for the mark character. 
+ */ +typedef enum StorageMarks +{ + SMGR_MARK_NONE = 0, + SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */ +} StorageMarks; + /* * smgr.c maintains a table of SMgrRelation objects, which are essentially * cached file handles. An SMgrRelation is created (if not already present) @@ -85,7 +97,12 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln); extern void smgrclose(SMgrRelation reln); extern void smgrcloseall(void); extern void smgrclosenode(RelFileNodeBackend rnode); +extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); +extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo); +extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo); extern void smgrdosyncall(SMgrRelation *rels, int nrels); extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo); extern void smgrextend(SMgrRelation reln, ForkNumber forknum, -- 2.27.0 From d7caa6b33f364ad1a88a8f74306a255e607a6639 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Wed, 11 Nov 2020 23:21:09 +0900 Subject: [PATCH v18 2/2] New command ALTER TABLE ALL IN TABLESPACE SET LOGGED/UNLOGGED To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes relation persistence of all tables in the specified tablespace. 
--- doc/src/sgml/ref/alter_table.sgml | 15 +++ src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++ src/backend/nodes/copyfuncs.c | 16 +++ src/backend/nodes/equalfuncs.c | 15 +++ src/backend/parser/gram.y | 42 +++++++ src/backend/tcop/utility.c | 11 ++ src/include/commands/tablecmds.h | 2 + src/include/nodes/nodes.h | 1 + src/include/nodes/parsenodes.h | 10 ++ src/test/regress/expected/tablespace.out | 76 ++++++++++++ src/test/regress/sql/tablespace.sql | 41 +++++++ 11 files changed, 369 insertions(+) diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml index 5c0735e08a..b03d5511a6 100644 --- a/doc/src/sgml/ref/alter_table.sgml +++ b/doc/src/sgml/ref/alter_table.sgml @@ -33,6 +33,8 @@ ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable> SET SCHEMA <replaceable class="parameter">new_schema</replaceable> ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable>[, ... ] ] SET TABLESPACE <replaceable class="parameter">new_tablespace</replaceable> [ NOWAIT ] +ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable>[, ... ] ] + SET { LOGGED | UNLOGGED } [ NOWAIT ] ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable> ATTACH PARTITION <replaceable class="parameter">partition_name</replaceable> { FOR VALUES <replaceable class="parameter">partition_bound_spec</replaceable>| DEFAULT } ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable> @@ -753,6 +755,19 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM (see <xref linkend="sql-createtable-unlogged"/>). It cannot be applied to a temporary table. 
 </para> + + <para> + All tables in the current database in a tablespace can be changed by using + the <literal>ALL IN TABLESPACE</literal> form, which will lock all tables + to be changed first and then change each one. This form also supports + <literal>OWNED BY</literal>, which will only change tables owned by the + roles specified. If the <literal>NOWAIT</literal> option is specified + then the command will fail if it is unable to acquire all of the locks + required immediately. The <literal>information_schema</literal> + relations are not considered part of the system catalogs and will be + changed. See also + <link linkend="sql-createtablespace"><command>CREATE TABLESPACE</command></link>. + </para> </listitem> </varlistentry> diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 9e5b77e94a..0724d0e1d2 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -14770,6 +14770,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt) return new_tablespaceoid; } +/* + * Alter Table ALL ... SET LOGGED/UNLOGGED + * + * Allows a user to change persistence of all objects in a given tablespace in + * the current database. Objects can also be chosen based on their owner, to + * allow users to change the persistence of only their own objects. The + * main permissions handling is done by the lower-level change persistence + * function. + * + * All to-be-modified objects are locked first. If NOWAIT is specified and the + * lock can't be acquired then we ereport(ERROR). 
+ */ +void +AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt) +{ + List *relations = NIL; + ListCell *l; + ScanKeyData key[1]; + Relation rel; + TableScanDesc scan; + HeapTuple tuple; + Oid tablespaceoid; + List *role_oids = roleSpecsToIds(stmt->roles); + + /* Ensure we were not asked to change something we can't */ + if (stmt->objtype != OBJECT_TABLE) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("only tables can be specified"))); + + /* Get the tablespace OID */ + tablespaceoid = get_tablespace_oid(stmt->tablespacename, false); + + /* + * Now that the checks are done, check if we should set either to + * InvalidOid because it is our database's default tablespace. + */ + if (tablespaceoid == MyDatabaseTableSpace) + tablespaceoid = InvalidOid; + + /* + * Walk the list of objects in the tablespace to pick up them. This will + * only find objects in our database, of course. + */ + ScanKeyInit(&key[0], + Anum_pg_class_reltablespace, + BTEqualStrategyNumber, F_OIDEQ, + ObjectIdGetDatum(tablespaceoid)); + + rel = table_open(RelationRelationId, AccessShareLock); + scan = table_beginscan_catalog(rel, 1, key); + while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL) + { + Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple); + Oid relOid = relForm->oid; + + /* + * Do not pick-up objects in pg_catalog as part of this, if an admin + * really wishes to do so, they can issue the individual ALTER + * commands directly. + * + * Also, explicitly avoid any shared tables, temp tables, or TOAST + * (TOAST will be changed with the main table). 
+ */ + if (IsCatalogNamespace(relForm->relnamespace) || + relForm->relisshared || + isAnyTempNamespace(relForm->relnamespace) || + IsToastNamespace(relForm->relnamespace)) + continue; + + /* Only pick up the object type requested */ + if (relForm->relkind != RELKIND_RELATION) + continue; + + /* Check if we are only picking-up objects owned by certain roles */ + if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner)) + continue; + + /* + * Handle permissions-checking here since we are locking the tables + * and also to avoid doing a bunch of work only to fail part-way. Note + * that permissions will also be checked by AlterTableInternal(). + * + * Caller must be considered an owner on the table of which we're going + * to change persistence. + */ + if (!pg_class_ownercheck(relOid, GetUserId())) + aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)), + NameStr(relForm->relname)); + + if (stmt->nowait && + !ConditionalLockRelationOid(relOid, AccessExclusiveLock)) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_IN_USE), + errmsg("aborting because lock on relation \"%s.%s\" is not available", + get_namespace_name(relForm->relnamespace), + NameStr(relForm->relname)))); + else + LockRelationOid(relOid, AccessExclusiveLock); + + /* + * Add to our list of objects of which we're going to change + * persistence. + */ + relations = lappend_oid(relations, relOid); + } + + table_endscan(scan); + table_close(rel, AccessShareLock); + + if (relations == NIL) + ereport(NOTICE, + (errcode(ERRCODE_NO_DATA_FOUND), + errmsg("no matching relations in tablespace \"%s\" found", + tablespaceoid == InvalidOid ? "(database default)" : + get_tablespace_name(tablespaceoid)))); + + /* + * Everything is locked, loop through and change persistence of all of the + * relations. 
+ */ + foreach(l, relations) + { + List *cmds = NIL; + AlterTableCmd *cmd = makeNode(AlterTableCmd); + + if (stmt->logged) + cmd->subtype = AT_SetLogged; + else + cmd->subtype = AT_SetUnLogged; + + cmds = lappend(cmds, cmd); + + EventTriggerAlterTableStart((Node *) stmt); + /* OID is set by AlterTableInternal */ + AlterTableInternal(lfirst_oid(l), cmds, false); + EventTriggerAlterTableEnd(); + } +} + static void index_copy_data(Relation rel, RelFileNode newrnode) { diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c index d4f8455a2b..ba605405a9 100644 --- a/src/backend/nodes/copyfuncs.c +++ b/src/backend/nodes/copyfuncs.c @@ -4285,6 +4285,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from) return newnode; } +static AlterTableSetLoggedAllStmt * +_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from) +{ + AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt); + + COPY_STRING_FIELD(tablespacename); + COPY_SCALAR_FIELD(objtype); + COPY_SCALAR_FIELD(logged); + COPY_SCALAR_FIELD(nowait); + + return newnode; +} + static CreateExtensionStmt * _copyCreateExtensionStmt(const CreateExtensionStmt *from) { @@ -5655,6 +5668,9 @@ copyObjectImpl(const void *from) case T_AlterTableMoveAllStmt: retval = _copyAlterTableMoveAllStmt(from); break; + case T_AlterTableSetLoggedAllStmt: + retval = _copyAlterTableSetLoggedAllStmt(from); + break; case T_CreateExtensionStmt: retval = _copyCreateExtensionStmt(from); break; diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c index f1002afe7a..b76fc872a5 100644 --- a/src/backend/nodes/equalfuncs.c +++ b/src/backend/nodes/equalfuncs.c @@ -1925,6 +1925,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a, return true; } +static bool +_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a, + const AlterTableSetLoggedAllStmt *b) +{ + COMPARE_STRING_FIELD(tablespacename); + COMPARE_SCALAR_FIELD(objtype); + 
COMPARE_SCALAR_FIELD(logged); + COMPARE_SCALAR_FIELD(nowait); + + return true; +} + static bool _equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b) { @@ -3650,6 +3662,9 @@ equal(const void *a, const void *b) case T_AlterTableMoveAllStmt: retval = _equalAlterTableMoveAllStmt(a, b); break; + case T_AlterTableSetLoggedAllStmt: + retval = _equalAlterTableSetLoggedAllStmt(a, b); + break; case T_CreateExtensionStmt: retval = _equalCreateExtensionStmt(a, b); break; diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index a03b33b53b..f8a41de2dd 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -1985,6 +1985,48 @@ AlterTableStmt: n->nowait = $13; $$ = (Node *)n; } + | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->logged = true; + n->nowait = $9; + $$ = (Node *)n; + } + | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET LOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->roles = $9; + n->logged = true; + n->nowait = $12; + $$ = (Node *)n; + } + | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->logged = false; + n->nowait = $9; + $$ = (Node *)n; + } + | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET UNLOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->roles = $9; + n->logged = false; + n->nowait = $12; + $$ = (Node *)n; + } | ALTER INDEX qualified_name alter_table_cmds { AlterTableStmt *n = makeNode(AlterTableStmt); diff --git a/src/backend/tcop/utility.c 
b/src/backend/tcop/utility.c index 3780c6e812..80d1e360b3 100644 --- a/src/backend/tcop/utility.c +++ b/src/backend/tcop/utility.c @@ -163,6 +163,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree) case T_AlterTSConfigurationStmt: case T_AlterTSDictionaryStmt: case T_AlterTableMoveAllStmt: + case T_AlterTableSetLoggedAllStmt: case T_AlterTableSpaceOptionsStmt: case T_AlterTableStmt: case T_AlterTypeStmt: @@ -1753,6 +1754,12 @@ ProcessUtilitySlow(ParseState *pstate, commandCollected = true; break; + case T_AlterTableSetLoggedAllStmt: + AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree); + /* commands are stashed in AlterTableSetLoggedAll */ + commandCollected = true; + break; + case T_DropStmt: ExecDropStmt((DropStmt *) parsetree, isTopLevel); /* no commands stashed for DROP */ @@ -2675,6 +2682,10 @@ CreateCommandTag(Node *parsetree) tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype); break; + case T_AlterTableSetLoggedAllStmt: + tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype); + break; + case T_AlterTableStmt: tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype); break; diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h index 5d4037f26e..c381dad3e5 100644 --- a/src/include/commands/tablecmds.h +++ b/src/include/commands/tablecmds.h @@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse); extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt); +extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt); + extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt, Oid *oldschema); diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h index 5d075f0c34..d8e1f223c8 100644 --- a/src/include/nodes/nodes.h +++ b/src/include/nodes/nodes.h @@ -430,6 +430,7 @@ typedef enum NodeTag T_AlterCollationStmt, T_CallStmt, T_AlterStatsStmt, + T_AlterTableSetLoggedAllStmt, /* * TAGS FOR 
PARSE TREE NODES (parsenodes.h) diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h index 1617702d9d..4fa9d9360f 100644 --- a/src/include/nodes/parsenodes.h +++ b/src/include/nodes/parsenodes.h @@ -2352,6 +2352,16 @@ typedef struct AlterTableMoveAllStmt bool nowait; } AlterTableMoveAllStmt; +typedef struct AlterTableSetLoggedAllStmt +{ + NodeTag type; + char *tablespacename; + ObjectType objtype; /* Object type to move */ + List *roles; /* List of roles to change objects of */ + bool logged; + bool nowait; +} AlterTableSetLoggedAllStmt; + /* ---------------------- * Create/Alter Extension Statements * ---------------------- diff --git a/src/test/regress/expected/tablespace.out b/src/test/regress/expected/tablespace.out index 2dfbcfdebe..c02afdcb68 100644 --- a/src/test/regress/expected/tablespace.out +++ b/src/test/regress/expected/tablespace.out @@ -943,5 +943,81 @@ drop cascades to table testschema.asexecute drop cascades to table testschema.part drop cascades to table testschema.atable drop cascades to table testschema.tablespace_acl +-- +-- Check persistence change in a tablespace +CREATE SCHEMA testschema; +GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1; +CREATE TABLESPACE regress_tablespace LOCATION ''; +GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1; +CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace; +CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace; +CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default; +CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default; +SET ROLE regress_tablespace_user1; +CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace; +CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace; +CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default; +CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default; +SELECT relname, t.spcname, relpersistence + FROM pg_class c LEFT JOIN 
pg_tablespace t ON (c.reltablespace = t.oid) + WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid; + relname | spcname | relpersistence +---------+--------------------+---------------- + lsu | regress_tablespace | p + usu | regress_tablespace | u + lu1 | regress_tablespace | p + uu1 | regress_tablespace | u + _lsu | | p + _usu | | u + _lu1 | | p + _uu1 | | u +(8 rows) + +ALTER TABLE ALL IN TABLESPACE regress_tablespace + OWNED BY regress_tablespace_user1 SET LOGGED; +SELECT relname, t.spcname, relpersistence + FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid) + WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid; + relname | spcname | relpersistence +---------+--------------------+---------------- + lsu | regress_tablespace | p + usu | regress_tablespace | u + lu1 | regress_tablespace | p + uu1 | regress_tablespace | p + _lsu | | p + _usu | | u + _lu1 | | p + _uu1 | | u +(8 rows) + +RESET ROLE; +ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED; +SELECT relname, t.spcname, relpersistence + FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid) + WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid; + relname | spcname | relpersistence +---------+--------------------+---------------- + lsu | regress_tablespace | u + usu | regress_tablespace | u + lu1 | regress_tablespace | u + uu1 | regress_tablespace | u + _lsu | | p + _usu | | u + _lu1 | | p + _uu1 | | u +(8 rows) + +-- Should succeed +DROP SCHEMA testschema CASCADE; +NOTICE: drop cascades to 8 other objects +DETAIL: drop cascades to table testschema.lsu +drop cascades to table testschema.usu +drop cascades to table testschema._lsu +drop cascades to table testschema._usu +drop cascades to table testschema.lu1 +drop cascades to table testschema.uu1 +drop cascades to table testschema._lu1 +drop cascades to table testschema._uu1 +DROP TABLESPACE regress_tablespace; DROP ROLE regress_tablespace_user1; DROP ROLE 
regress_tablespace_user2; diff --git a/src/test/regress/sql/tablespace.sql b/src/test/regress/sql/tablespace.sql index 896f05cea3..4e407eb8c0 100644 --- a/src/test/regress/sql/tablespace.sql +++ b/src/test/regress/sql/tablespace.sql @@ -419,5 +419,46 @@ DROP TABLESPACE regress_tblspace_renamed; DROP SCHEMA testschema CASCADE; + +-- +-- Check persistence change in a tablespace +CREATE SCHEMA testschema; +GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1; +CREATE TABLESPACE regress_tablespace LOCATION ''; +GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1; + +CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace; +CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace; +CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default; +CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default; +SET ROLE regress_tablespace_user1; +CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace; +CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace; +CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default; +CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default; + +SELECT relname, t.spcname, relpersistence + FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid) + WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid; + +ALTER TABLE ALL IN TABLESPACE regress_tablespace + OWNED BY regress_tablespace_user1 SET LOGGED; + +SELECT relname, t.spcname, relpersistence + FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid) + WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid; + +RESET ROLE; + +ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED; + +SELECT relname, t.spcname, relpersistence + FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid) + WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid; + +-- Should succeed +DROP SCHEMA testschema CASCADE; +DROP 
TABLESPACE regress_tablespace; + DROP ROLE regress_tablespace_user1; DROP ROLE regress_tablespace_user2; -- 2.27.0
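For readers following the smgr.h hunk earlier in this patch set, here is a minimal sketch of the "<filename>.<mark>" naming rule described in its comment: a suffix that starts with a digit is a segment number, and a single non-digit character is a mark. This is only an illustration under stated assumptions, not code from the patch. The helper name sketch_parse_mark and the sample file names are hypothetical, the StorageMarks enum is redeclared locally so the block compiles on its own, and the patch's actual parsing lives in parse_filename_for_nontemp_relation(), which may behave differently.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Local copy of the enum from the patch, so the sketch is self-contained. */
typedef enum StorageMarks
{
	SMGR_MARK_NONE = 0,
	SMGR_MARK_UNCOMMITTED = 'u'	/* the file is not committed yet */
} StorageMarks;

/*
 * Hypothetical helper (not part of the patch): inspect the text after the
 * last '.' in a relation file name.  Returns 1 and sets *mark when the name
 * ends in a known one-character mark; returns 0 and sets *mark to
 * SMGR_MARK_NONE otherwise.
 */
static int
sketch_parse_mark(const char *name, StorageMarks *mark)
{
	const char *dot = strrchr(name, '.');

	*mark = SMGR_MARK_NONE;

	/* no suffix at all, or a trailing dot */
	if (dot == NULL || dot[1] == '\0')
		return 0;

	/* ".<digit>..." is a segment number, never a mark */
	if (isdigit((unsigned char) dot[1]))
		return 0;

	/* a mark is exactly one character long */
	if (dot[2] != '\0')
		return 0;

	if (dot[1] == SMGR_MARK_UNCOMMITTED)
	{
		*mark = SMGR_MARK_UNCOMMITTED;
		return 1;
	}

	return 0;					/* unknown mark character */
}

/* Usage example: a few hypothetical file names and how they classify. */
static void
sketch_self_check(void)
{
	StorageMarks mark;

	assert(sketch_parse_mark("16384.u", &mark) && mark == SMGR_MARK_UNCOMMITTED);
	assert(!sketch_parse_mark("16384.1", &mark));	/* segment file */
	assert(!sketch_parse_mark("16384", &mark));		/* plain main fork */
}
```

Keeping digits out of the mark alphabet is what lets a single check on the first suffix character separate segment files from mark files, which is presumably why the comment forbids digits.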
At Tue, 01 Mar 2022 14:14:13 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > - Removed the default case in smgr_desc since it seems to me we don't > assume out-of-definition values in xlog records elsewhere. Stupid. The compiler in the CI environment complains about an uninitialized variable even though it (presumably) knows that all paths of the switch statement set the variable. Added a default value to try to silence the compiler. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From ec75c49ffd939f6db8e0d840ef043c18845d1b9d Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Wed, 11 Nov 2020 21:51:11 +0900 Subject: [PATCH v19 1/2] In-place table persistence change Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data rewriting, it currently runs a heap rewrite, which causes a large amount of file I/O. This patch makes the command run without a heap rewrite. In addition, SET LOGGED while wal_level > minimal emits WAL using XLOG_FPI instead of a massive number of HEAP_INSERTs, which should be smaller. This also allows cleanup of files left behind when the transaction that created them crashes.
--- src/backend/access/rmgrdesc/smgrdesc.c | 49 ++ src/backend/access/transam/README | 9 + src/backend/access/transam/xact.c | 7 + src/backend/access/transam/xlogrecovery.c | 18 + src/backend/catalog/storage.c | 548 +++++++++++++++++++++- src/backend/commands/tablecmds.c | 266 +++++++++-- src/backend/replication/basebackup.c | 3 +- src/backend/storage/buffer/bufmgr.c | 86 ++++ src/backend/storage/file/fd.c | 4 +- src/backend/storage/file/reinit.c | 344 ++++++++++---- src/backend/storage/smgr/md.c | 94 +++- src/backend/storage/smgr/smgr.c | 32 ++ src/backend/storage/sync/sync.c | 20 +- src/bin/pg_rewind/parsexlog.c | 22 + src/bin/pg_rewind/pg_rewind.c | 1 - src/common/relpath.c | 47 +- src/include/catalog/storage.h | 3 + src/include/catalog/storage_xlog.h | 42 +- src/include/common/relpath.h | 9 +- src/include/storage/bufmgr.h | 2 + src/include/storage/fd.h | 1 + src/include/storage/md.h | 8 +- src/include/storage/reinit.h | 10 +- src/include/storage/smgr.h | 17 + 24 files changed, 1459 insertions(+), 183 deletions(-) diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c index 7547813254..225ffbafef 100644 --- a/src/backend/access/rmgrdesc/smgrdesc.c +++ b/src/backend/access/rmgrdesc/smgrdesc.c @@ -40,6 +40,46 @@ smgr_desc(StringInfo buf, XLogReaderState *record) xlrec->blkno, xlrec->flags); pfree(path); } + else if (info == XLOG_SMGR_UNLINK) + { + xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec; + char *path = relpathperm(xlrec->rnode, xlrec->forkNum); + + appendStringInfoString(buf, path); + pfree(path); + } + else if (info == XLOG_SMGR_MARK) + { + xl_smgr_mark *xlrec = (xl_smgr_mark *) rec; + char *path = GetRelationPath(xlrec->rnode.dbNode, + xlrec->rnode.spcNode, + xlrec->rnode.relNode, + InvalidBackendId, + xlrec->forkNum, xlrec->mark); + char *action = "<none>"; + + switch (xlrec->action) + { + case XLOG_SMGR_MARK_CREATE: + action = "CREATE"; + break; + case XLOG_SMGR_MARK_UNLINK: + action = "DELETE"; + break; + } + + 
appendStringInfo(buf, "%s %s", action, path); + pfree(path); + } + else if (info == XLOG_SMGR_BUFPERSISTENCE) + { + xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec; + char *path = relpathperm(xlrec->rnode, MAIN_FORKNUM); + + appendStringInfoString(buf, path); + appendStringInfo(buf, " persistence %d", xlrec->persistence); + pfree(path); + } } const char * @@ -55,6 +95,15 @@ smgr_identify(uint8 info) case XLOG_SMGR_TRUNCATE: id = "TRUNCATE"; break; + case XLOG_SMGR_UNLINK: + id = "UNLINK"; + break; + case XLOG_SMGR_MARK: + id = "MARK"; + break; + case XLOG_SMGR_BUFPERSISTENCE: + id = "BUFPERSISTENCE"; + break; } return id; diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README index 1edc8180c1..2ecd8c8c7c 100644 --- a/src/backend/access/transam/README +++ b/src/backend/access/transam/README @@ -725,6 +725,15 @@ then restart recovery. This is part of the reason for not writing a WAL entry until we've successfully done the original action. +The Smgr MARK files +-------------------------------- + +An smgr mark file is created when a new relation file is created to +mark the relfilenode needs to be cleaned up at recovery time. In +contrast to the four actions above, failure to remove smgr mark files +will lead to data loss, in which case the server will shut down. + + Skipping WAL for New RelFileNode -------------------------------- diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index adf763a8ea..559666b802 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -2198,6 +2198,9 @@ CommitTransaction(void) */ smgrDoPendingSyncs(true, is_parallel_worker); + /* Likewise delete mark files for files created during this transaction. 
*/ + smgrDoPendingCleanups(true); + /* close large objects before lower-level cleanup */ AtEOXact_LargeObject(true); @@ -2448,6 +2451,9 @@ PrepareTransaction(void) */ smgrDoPendingSyncs(true, false); + /* Likewise delete mark files for files created during this transaction. */ + smgrDoPendingCleanups(true); + /* close large objects before lower-level cleanup */ AtEOXact_LargeObject(true); @@ -2773,6 +2779,7 @@ AbortTransaction(void) AfterTriggerEndXact(false); /* 'false' means it's abort */ AtAbort_Portals(); smgrDoPendingSyncs(false, is_parallel_worker); + smgrDoPendingCleanups(false); AtEOXact_LargeObject(false); AtAbort_Notify(); AtEOXact_RelationMap(false, is_parallel_worker); diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c index f9f212680b..2923b8ef8c 100644 --- a/src/backend/access/transam/xlogrecovery.c +++ b/src/backend/access/transam/xlogrecovery.c @@ -40,6 +40,7 @@ #include "access/xlogrecovery.h" #include "access/xlogutils.h" #include "catalog/pg_control.h" +#include "catalog/storage.h" #include "commands/tablespace.h" #include "miscadmin.h" #include "pgstat.h" @@ -53,6 +54,7 @@ #include "storage/pmsignal.h" #include "storage/proc.h" #include "storage/procarray.h" +#include "storage/reinit.h" #include "storage/spin.h" #include "utils/builtins.h" #include "utils/guc.h" @@ -1746,6 +1748,14 @@ PerformWalRecovery(void) } } + /* cleanup garbage files left during crash recovery */ + if (!InArchiveRecovery) + ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER | + UNLOGGED_RELATION_CLEANUP); + + /* run rollback cleanup if any */ + smgrDoPendingDeletes(false); + /* Allow resource managers to do any required cleanup. 
*/ for (rmid = 0; rmid <= RM_MAX_ID; rmid++) { @@ -3022,6 +3032,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode, { ereport(DEBUG1, (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery"))); + + /* cleanup garbage files left during crash recovery */ + ResetUnloggedRelations(UNLOGGED_RELATION_DROP_BUFFER | + UNLOGGED_RELATION_CLEANUP); + + /* run rollback cleanup if any */ + smgrDoPendingDeletes(false); + InArchiveRecovery = true; if (StandbyModeRequested) StandbyMode = true; diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c index 9b8075536a..cd1445713a 100644 --- a/src/backend/catalog/storage.c +++ b/src/backend/catalog/storage.c @@ -19,6 +19,7 @@ #include "postgres.h" +#include "access/amapi.h" #include "access/parallel.h" #include "access/visibilitymap.h" #include "access/xact.h" @@ -66,6 +67,23 @@ typedef struct PendingRelDelete struct PendingRelDelete *next; /* linked-list link */ } PendingRelDelete; +#define PCOP_UNLINK_FORK (1 << 0) +#define PCOP_UNLINK_MARK (1 << 1) +#define PCOP_SET_PERSISTENCE (1 << 2) + +typedef struct PendingCleanup +{ + RelFileNode relnode; /* relation that may need to be deleted */ + int op; /* operation mask */ + bool bufpersistence; /* buffer persistence to set */ + int unlink_forknum; /* forknum to unlink */ + StorageMarks unlink_mark; /* mark to unlink */ + BackendId backend; /* InvalidBackendId if not a temp rel */ + bool atCommit; /* T=delete at commit; F=delete at abort */ + int nestLevel; /* xact nesting level of request */ + struct PendingCleanup *next; /* linked-list link */ +} PendingCleanup; + typedef struct PendingRelSync { RelFileNode rnode; @@ -73,6 +91,7 @@ typedef struct PendingRelSync } PendingRelSync; static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */ +static PendingCleanup *pendingCleanups = NULL; /* head of linked list */ HTAB *pendingSyncHash = NULL; @@ -117,7 +136,8 @@ AddPendingSync(const RelFileNode *rnode) SMgrRelation 
RelationCreateStorage(RelFileNode rnode, char relpersistence) { - PendingRelDelete *pending; + PendingRelDelete *pendingdel; + PendingCleanup *pendingclean; SMgrRelation srel; BackendId backend; bool needs_wal; @@ -143,21 +163,41 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) return NULL; /* placate compiler */ } + /* + * We are going to create a new storage file. If the server crashes before + * the current transaction ends, the file needs to be cleaned up. The + * SMGR_MARK_UNCOMMITTED mark file prompts that work at the next startup. + */ srel = smgropen(rnode, backend); + log_smgrcreatemark(&rnode, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED); + smgrcreatemark(srel, MAIN_FORKNUM, SMGR_MARK_UNCOMMITTED, false); smgrcreate(srel, MAIN_FORKNUM, false); if (needs_wal) log_smgrcreate(&srel->smgr_rnode.node, MAIN_FORKNUM); /* Add the relation to the list of stuff to delete at abort */ - pending = (PendingRelDelete *) + pendingdel = (PendingRelDelete *) MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); - pending->relnode = rnode; - pending->backend = backend; - pending->atCommit = false; /* delete if abort */ - pending->nestLevel = GetCurrentTransactionNestLevel(); - pending->next = pendingDeletes; - pendingDeletes = pending; + pendingdel->relnode = rnode; + pendingdel->backend = backend; + pendingdel->atCommit = false; /* delete if abort */ + pendingdel->nestLevel = GetCurrentTransactionNestLevel(); + pendingdel->next = pendingDeletes; + pendingDeletes = pendingdel; + + /* drop mark files at commit */ + pendingclean = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pendingclean->relnode = rnode; + pendingclean->op = PCOP_UNLINK_MARK; + pendingclean->unlink_forknum = MAIN_FORKNUM; + pendingclean->unlink_mark = SMGR_MARK_UNCOMMITTED; + pendingclean->backend = backend; + pendingclean->atCommit = true; + pendingclean->nestLevel = GetCurrentTransactionNestLevel(); + pendingclean->next = pendingCleanups; + 
pendingCleanups = pendingclean; if (relpersistence == RELPERSISTENCE_PERMANENT && !XLogIsNeeded()) { @@ -168,6 +208,200 @@ RelationCreateStorage(RelFileNode rnode, char relpersistence) return srel; } +/* + * RelationCreateInitFork + * Create physical storage for the init fork of a relation. + * + * Create the init fork for the relation. + * + * This function is transactional. The creation is WAL-logged, and if the + * transaction aborts later on, the init fork will be removed. + */ +void +RelationCreateInitFork(Relation rel) +{ + RelFileNode rnode = rel->rd_node; + PendingCleanup *pending; + PendingCleanup *prev; + PendingCleanup *next; + SMgrRelation srel; + bool create = true; + + /* switch buffer persistence */ + SetRelationBuffersPersistence(RelationGetSmgr(rel), false, false); + + /* + * If we have a pending-unlink for the init-fork of this relation, that + * means the init-fork exists since before the current transaction + * started. This function reverts that change just by removing the entry. + * See RelationDropInitFork. + */ + prev = NULL; + for (pending = pendingCleanups; pending != NULL; pending = next) + { + next = pending->next; + + if (RelFileNodeEquals(rnode, pending->relnode) && + pending->unlink_forknum == INIT_FORKNUM) + { + if (prev) + prev->next = next; + else + pendingCleanups = next; + + pfree(pending); + /* prev does not change */ + + create = false; + } + else + prev = pending; + } + + if (!create) + return; + + /* create the init fork, along with the commit-sentinel file */ + srel = smgropen(rnode, InvalidBackendId); + log_smgrcreatemark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED); + smgrcreatemark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false); + + /* We don't have existing init fork, create it. */ + smgrcreate(srel, INIT_FORKNUM, false); + + /* + * init fork for indexes needs further initialization. ambuildempty should + * do WAL-log and file sync by itself but otherwise we do that by + * ourselves. 
+ */ + if (rel->rd_rel->relkind == RELKIND_INDEX) + rel->rd_indam->ambuildempty(rel); + else + { + log_smgrcreate(&rnode, INIT_FORKNUM); + smgrimmedsync(srel, INIT_FORKNUM); + } + + /* drop the init fork, mark file and revert persistence at abort */ + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = rnode; + pending->op = PCOP_UNLINK_FORK | PCOP_UNLINK_MARK | PCOP_SET_PERSISTENCE; + pending->unlink_forknum = INIT_FORKNUM; + pending->unlink_mark = SMGR_MARK_UNCOMMITTED; + pending->bufpersistence = true; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; + + /* drop mark file at commit */ + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = rnode; + pending->op = PCOP_UNLINK_MARK; + pending->unlink_forknum = INIT_FORKNUM; + pending->unlink_mark = SMGR_MARK_UNCOMMITTED; + pending->backend = InvalidBackendId; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; +} + +/* + * RelationDropInitFork + * Delete physical storage for the init fork of a relation. + * + * Register pending-delete of the init fork. The real deletion is performed by + * smgrDoPendingDeletes at commit. + * + * This function is transactional. If the transaction aborts later on, the + * deletion is canceled. + */ +void +RelationDropInitFork(Relation rel) +{ + RelFileNode rnode = rel->rd_node; + PendingCleanup *pending; + PendingCleanup *prev; + PendingCleanup *next; + bool inxact_created = false; + + /* switch buffer persistence */ + SetRelationBuffersPersistence(RelationGetSmgr(rel), true, false); + + /* + * If we have a pending-unlink for the init-fork of this relation, that + * means the init fork is created in the current transaction. 
We remove + * both the init fork and mark file immediately in that case. Otherwise + * just register a pending-unlink for the existing init fork. See + * RelationCreateInitFork. + */ + prev = NULL; + for (pending = pendingCleanups; pending != NULL; pending = next) + { + next = pending->next; + + if (RelFileNodeEquals(rnode, pending->relnode) && + pending->unlink_forknum == INIT_FORKNUM) + { + /* unlink list entry */ + if (prev) + prev->next = next; + else + pendingCleanups = next; + + pfree(pending); + /* prev does not change */ + + inxact_created = true; + } + else + prev = pending; + } + + if (inxact_created) + { + SMgrRelation srel = smgropen(rnode, InvalidBackendId); + + /* + * INIT forks are never loaded to shared buffer so no point in dropping + * buffers for such files. + */ + log_smgrunlinkmark(&rnode, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED); + smgrunlinkmark(srel, INIT_FORKNUM, SMGR_MARK_UNCOMMITTED, false); + log_smgrunlink(&rnode, INIT_FORKNUM); + smgrunlink(srel, INIT_FORKNUM, false); + return; + } + + /* register drop of this init fork file at commit */ + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = rnode; + pending->op = PCOP_UNLINK_FORK; + pending->unlink_forknum = INIT_FORKNUM; + pending->backend = InvalidBackendId; + pending->atCommit = true; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; + + /* revert buffer-persistence changes at abort */ + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = rnode; + pending->op = PCOP_SET_PERSISTENCE; + pending->bufpersistence = false; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; +} + /* * Perform XLogInsert of an XLOG_SMGR_CREATE record to WAL. 
*/ @@ -187,6 +421,88 @@ log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum) XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE); } +/* + * Perform XLogInsert of an XLOG_SMGR_UNLINK record to WAL. + */ +void +log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum) +{ + xl_smgr_unlink xlrec; + + /* + * Make an XLOG entry reporting the file unlink. + */ + xlrec.rnode = *rnode; + xlrec.forkNum = forkNum; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_UNLINK | XLR_SPECIAL_REL_UPDATE); +} + +/* + * Perform XLogInsert of an XLOG_SMGR_MARK (create) record to WAL. + */ +void +log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum, + StorageMarks mark) +{ + xl_smgr_mark xlrec; + + /* + * Make an XLOG entry reporting the mark file creation. + */ + xlrec.rnode = *rnode; + xlrec.forkNum = forkNum; + xlrec.mark = mark; + xlrec.action = XLOG_SMGR_MARK_CREATE; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE); +} + +/* + * Perform XLogInsert of an XLOG_SMGR_MARK (unlink) record to WAL. + */ +void +log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum, + StorageMarks mark) +{ + xl_smgr_mark xlrec; + + /* + * Make an XLOG entry reporting the mark file unlink. + */ + xlrec.rnode = *rnode; + xlrec.forkNum = forkNum; + xlrec.mark = mark; + xlrec.action = XLOG_SMGR_MARK_UNLINK; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_MARK | XLR_SPECIAL_REL_UPDATE); +} + +/* + * Perform XLogInsert of an XLOG_SMGR_BUFPERSISTENCE record to WAL. + */ +void +log_smgrbufpersistence(const RelFileNode *rnode, bool persistence) +{ + xl_smgr_bufpersistence xlrec; + + /* + * Make an XLOG entry reporting the change of buffer persistence. 
+ */ + xlrec.rnode = *rnode; + xlrec.persistence = persistence; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + XLogInsert(RM_SMGR_ID, XLOG_SMGR_BUFPERSISTENCE | XLR_SPECIAL_REL_UPDATE); +} + /* * RelationDropStorage * Schedule unlinking of physical storage at transaction commit. @@ -673,6 +989,95 @@ smgrDoPendingDeletes(bool isCommit) } } +/* + * smgrDoPendingCleanups() -- Clean up work that emits WAL records + * + * The operations handled in this function emit WAL records, which must be + * part of the current transaction. + */ +void +smgrDoPendingCleanups(bool isCommit) +{ + int nestLevel = GetCurrentTransactionNestLevel(); + PendingCleanup *pending; + PendingCleanup *prev; + PendingCleanup *next; + + prev = NULL; + for (pending = pendingCleanups; pending != NULL; pending = next) + { + next = pending->next; + if (pending->nestLevel < nestLevel) + { + /* outer-level entries should not be processed yet */ + prev = pending; + } + else + { + /* unlink list entry first, so we don't retry on failure */ + if (prev) + prev->next = next; + else + pendingCleanups = next; + + /* do cleanup if called for */ + if (pending->atCommit == isCommit) + { + SMgrRelation srel; + + srel = smgropen(pending->relnode, pending->backend); + + Assert ((pending->op & + ~(PCOP_UNLINK_FORK | PCOP_UNLINK_MARK | + PCOP_SET_PERSISTENCE)) == 0); + + if (pending->op & PCOP_SET_PERSISTENCE) + { + SetRelationBuffersPersistence(srel, pending->bufpersistence, + InRecovery); + } + + if (pending->op & PCOP_UNLINK_FORK) + { + /* + * Unlink the fork file. Currently we use this only for + * init forks and we're sure that the init fork is not + * loaded on shared buffers. For RelationDropInitFork + * case, the function dropped those buffers. For + * RelationCreateInitFork case, PCOP_SET_PERSISTENCE(true) + * is set and the buffers have been dropped just before. + */ + Assert(pending->unlink_forknum == INIT_FORKNUM); + + /* Don't emit WAL during recovery. 
*/ + if (!InRecovery) + log_smgrunlink(&pending->relnode, + pending->unlink_forknum); + smgrunlink(srel, pending->unlink_forknum, false); + } + + if (pending->op & PCOP_UNLINK_MARK) + { + SMgrRelation srel; + + if (!InRecovery) + log_smgrunlinkmark(&pending->relnode, + pending->unlink_forknum, + pending->unlink_mark); + srel = smgropen(pending->relnode, pending->backend); + smgrunlinkmark(srel, pending->unlink_forknum, + pending->unlink_mark, InRecovery); + smgrclose(srel); + } + } + + /* must explicitly free the list entry */ + pfree(pending); + /* prev does not change */ + } + } +} + /* * smgrDoPendingSyncs() -- Take care of relation syncs at end of xact. */ @@ -933,6 +1338,15 @@ smgr_redo(XLogReaderState *record) reln = smgropen(xlrec->rnode, InvalidBackendId); smgrcreate(reln, xlrec->forkNum, true); } + else if (info == XLOG_SMGR_UNLINK) + { + xl_smgr_unlink *xlrec = (xl_smgr_unlink *) XLogRecGetData(record); + SMgrRelation reln; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + smgrunlink(reln, xlrec->forkNum, true); + smgrclose(reln); + } else if (info == XLOG_SMGR_TRUNCATE) { xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record); @@ -1021,6 +1435,124 @@ smgr_redo(XLogReaderState *record) FreeFakeRelcacheEntry(rel); } + else if (info == XLOG_SMGR_MARK) + { + xl_smgr_mark *xlrec = (xl_smgr_mark *) XLogRecGetData(record); + SMgrRelation reln; + PendingCleanup *pending; + bool created = false; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + + switch (xlrec->action) + { + case XLOG_SMGR_MARK_CREATE: + smgrcreatemark(reln, xlrec->forkNum, xlrec->mark, true); + created = true; + break; + case XLOG_SMGR_MARK_UNLINK: + smgrunlinkmark(reln, xlrec->forkNum, xlrec->mark, true); + break; + default: + elog(ERROR, "unknown smgr_mark action \"%c\"", xlrec->action); + } + + if (created) + { + /* revert mark file operation at abort */ + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = 
xlrec->rnode; + pending->op = PCOP_UNLINK_MARK; + pending->unlink_forknum = xlrec->forkNum; + pending->unlink_mark = xlrec->mark; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; + } + else + { + /* + * Delete pending action for this mark file if any. We should have + * at most one entry for this action. + */ + PendingCleanup *prev = NULL; + + for (pending = pendingCleanups; pending != NULL; + pending = pending->next) + { + if (RelFileNodeEquals(xlrec->rnode, pending->relnode) && + pending->unlink_forknum == xlrec->forkNum && + (pending->op & PCOP_UNLINK_MARK) != 0) + { + if (prev) + prev->next = pending->next; + else + pendingCleanups = pending->next; + + pfree(pending); + break; + } + + prev = pending; + } + } + } + else if (info == XLOG_SMGR_BUFPERSISTENCE) + { + xl_smgr_bufpersistence *xlrec = + (xl_smgr_bufpersistence *) XLogRecGetData(record); + SMgrRelation reln; + PendingCleanup *pending; + PendingCleanup *prev = NULL; + + reln = smgropen(xlrec->rnode, InvalidBackendId); + SetRelationBuffersPersistence(reln, xlrec->persistence, true); + + /* + * Delete pending action for persistence change if any. We should have + * at most one entry for this action. + */ + for (pending = pendingCleanups; pending != NULL; + pending = pending->next) + { + if (RelFileNodeEquals(xlrec->rnode, pending->relnode) && + (pending->op & PCOP_SET_PERSISTENCE) != 0) + { + Assert (pending->bufpersistence == xlrec->persistence); + + if (prev) + prev->next = pending->next; + else + pendingCleanups = pending->next; + + pfree(pending); + break; + } + + prev = pending; + } + + /* + * Revert buffer-persistence changes at abort if the relation is going + * to different persistence from before this transaction. 
+ */ + if (!pending) + { + pending = (PendingCleanup *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingCleanup)); + pending->relnode = xlrec->rnode; + pending->op = PCOP_SET_PERSISTENCE; + pending->bufpersistence = !xlrec->persistence; + pending->backend = InvalidBackendId; + pending->atCommit = false; + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingCleanups; + pendingCleanups = pending; + } + } else elog(PANIC, "smgr_redo: unknown op code %u", info); } diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 3e83f375b5..9e5b77e94a 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -53,6 +53,7 @@ #include "commands/defrem.h" #include "commands/event_trigger.h" #include "commands/policy.h" +#include "commands/progress.h" #include "commands/sequence.h" #include "commands/tablecmds.h" #include "commands/tablespace.h" @@ -5347,6 +5348,187 @@ ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab, Relation rel, return newcmd; } +/* + * RelationChangePersistence: do in-place persistence change of a relation + */ +static void +RelationChangePersistence(AlteredTableInfo *tab, char persistence, + LOCKMODE lockmode) +{ + Relation rel; + Relation classRel; + HeapTuple tuple, + newtuple; + Datum new_val[Natts_pg_class]; + bool new_null[Natts_pg_class], + new_repl[Natts_pg_class]; + int i; + List *relids; + ListCell *lc_oid; + + Assert(tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE); + Assert(lockmode == AccessExclusiveLock); + + /* + * Under the following condition, we need to call ATRewriteTable, which + * cannot be false in the AT_REWRITE_ALTER_PERSISTENCE case. 
+ */ + Assert(tab->constraints == NULL && tab->partition_constraint == NULL && + tab->newvals == NULL && !tab->verify_new_notnull); + + rel = table_open(tab->relid, lockmode); + + Assert(rel->rd_rel->relpersistence != persistence); + + elog(DEBUG1, "perform in-place persistence change"); + + /* + * First we collect all relations that we need to change persistence. + */ + + /* Collect OIDs of indexes and toast relations */ + relids = RelationGetIndexList(rel); + relids = lcons_oid(rel->rd_id, relids); + + /* Add toast relation if any */ + if (OidIsValid(rel->rd_rel->reltoastrelid)) + { + List *toastidx; + Relation toastrel = table_open(rel->rd_rel->reltoastrelid, lockmode); + + relids = lappend_oid(relids, rel->rd_rel->reltoastrelid); + toastidx = RelationGetIndexList(toastrel); + relids = list_concat(relids, toastidx); + pfree(toastidx); + table_close(toastrel, NoLock); + } + + table_close(rel, NoLock); + + /* Make changes in storage */ + classRel = table_open(RelationRelationId, RowExclusiveLock); + + foreach (lc_oid, relids) + { + Oid reloid = lfirst_oid(lc_oid); + Relation r = relation_open(reloid, lockmode); + + /* + * XXXX: Some access methods do not bear up an in-place persistence + * change. Specifically, GiST uses page LSNs to figure out whether a + * block has changed, where UNLOGGED GiST indexes use fake LSNs that + * are incompatible with real LSNs used for LOGGED ones. + * + * Maybe if gistGetFakeLSN behaved the same way for permanent and + * unlogged indexes, we could skip index rebuild in exchange of some + * extra WAL records emitted while it is unlogged. + * + * Check relam against a positive list so that we take this way for + * unknown AMs. 
+ */ + if (r->rd_rel->relkind == RELKIND_INDEX && + /* GiST is excluded */ + r->rd_rel->relam != BTREE_AM_OID && + r->rd_rel->relam != HASH_AM_OID && + r->rd_rel->relam != GIN_AM_OID && + r->rd_rel->relam != SPGIST_AM_OID && + r->rd_rel->relam != BRIN_AM_OID) + { + int reindex_flags; + ReindexParams params = {0}; + + /* reindex doesn't allow concurrent use of the index */ + table_close(r, NoLock); + + reindex_flags = + REINDEX_REL_SUPPRESS_INDEX_USE | + REINDEX_REL_CHECK_CONSTRAINTS; + + /* Set the same persistence as the parent relation. */ + if (persistence == RELPERSISTENCE_UNLOGGED) + reindex_flags |= REINDEX_REL_FORCE_INDEXES_UNLOGGED; + else + reindex_flags |= REINDEX_REL_FORCE_INDEXES_PERMANENT; + + reindex_index(reloid, reindex_flags, persistence, &params); + + continue; + } + + /* Create or drop init fork */ + if (persistence == RELPERSISTENCE_UNLOGGED) + RelationCreateInitFork(r); + else + RelationDropInitFork(r); + + /* + * When this relation gets WAL-logged, immediately sync all files but + * initfork to establish the initial state on storage. Buffers have + * already been flushed out by RelationCreate(Drop)InitFork called just + * above. Initfork should have been synced as needed. 
+ */ + if (persistence == RELPERSISTENCE_PERMANENT) + { + for (i = 0 ; i < INIT_FORKNUM ; i++) + { + if (smgrexists(RelationGetSmgr(r), i)) + smgrimmedsync(RelationGetSmgr(r), i); + } + } + + /* Update catalog */ + tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid)); + if (!HeapTupleIsValid(tuple)) + elog(ERROR, "cache lookup failed for relation %u", reloid); + + memset(new_val, 0, sizeof(new_val)); + memset(new_null, false, sizeof(new_null)); + memset(new_repl, false, sizeof(new_repl)); + + new_val[Anum_pg_class_relpersistence - 1] = CharGetDatum(persistence); + new_null[Anum_pg_class_relpersistence - 1] = false; + new_repl[Anum_pg_class_relpersistence - 1] = true; + + newtuple = heap_modify_tuple(tuple, RelationGetDescr(classRel), + new_val, new_null, new_repl); + + CatalogTupleUpdate(classRel, &newtuple->t_self, newtuple); + heap_freetuple(newtuple); + + /* + * While wal_level >= replica, switching to LOGGED requires the + * relation content to be WAL-logged to recover the table. + * We don't emit this while wal_level = minimal. + */ + if (persistence == RELPERSISTENCE_PERMANENT && XLogIsNeeded()) + { + ForkNumber fork; + xl_smgr_truncate xlrec; + + xlrec.blkno = 0; + xlrec.rnode = r->rd_node; + xlrec.flags = SMGR_TRUNCATE_ALL; + + XLogBeginInsert(); + XLogRegisterData((char *) &xlrec, sizeof(xlrec)); + + XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE); + + for (fork = 0; fork < INIT_FORKNUM ; fork++) + { + if (smgrexists(RelationGetSmgr(r), fork)) + log_newpage_range(r, fork, 0, + smgrnblocks(RelationGetSmgr(r), fork), + false); + } + } + + table_close(r, NoLock); + } + + table_close(classRel, NoLock); +} + /* * ATRewriteTables: ALTER TABLE phase 3 */ @@ -5477,47 +5659,55 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode, tab->relid, tab->rewrite); - /* - * Create transient table that will receive the modified data. - * - * Ensure it is marked correctly as logged or unlogged. 
We have - * to do this here so that buffers for the new relfilenode will - * have the right persistence set, and at the same time ensure - * that the original filenode's buffers will get read in with the - * correct setting (i.e. the original one). Otherwise a rollback - * after the rewrite would possibly result with buffers for the - * original filenode having the wrong persistence setting. - * - * NB: This relies on swap_relation_files() also swapping the - * persistence. That wouldn't work for pg_class, but that can't be - * unlogged anyway. - */ - OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod, - persistence, lockmode); + if (tab->rewrite == AT_REWRITE_ALTER_PERSISTENCE) + RelationChangePersistence(tab, persistence, lockmode); + else + { + /* + * Create transient table that will receive the modified data. + * + * Ensure it is marked correctly as logged or unlogged. We + * have to do this here so that buffers for the new relfilenode + * will have the right persistence set, and at the same time + * ensure that the original filenode's buffers will get read in + * with the correct setting (i.e. the original one). Otherwise + * a rollback after the rewrite would possibly result with + * buffers for the original filenode having the wrong + * persistence setting. + * + * NB: This relies on swap_relation_files() also swapping the + * persistence. That wouldn't work for pg_class, but that can't + * be unlogged anyway. + */ + OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, + NewAccessMethod, + persistence, lockmode); - /* - * Copy the heap data into the new table with the desired - * modifications, and test the current data within the table - * against new constraints generated by ALTER TABLE commands. - */ - ATRewriteTable(tab, OIDNewHeap, lockmode); + /* + * Copy the heap data into the new table with the desired + * modifications, and test the current data within the table + * against new constraints generated by ALTER TABLE commands. 
+ */ + ATRewriteTable(tab, OIDNewHeap, lockmode); - /* - * Swap the physical files of the old and new heaps, then rebuild - * indexes and discard the old heap. We can use RecentXmin for - * the table's new relfrozenxid because we rewrote all the tuples - * in ATRewriteTable, so no older Xid remains in the table. Also, - * we never try to swap toast tables by content, since we have no - * interest in letting this code work on system catalogs. - */ - finish_heap_swap(tab->relid, OIDNewHeap, - false, false, true, - !OidIsValid(tab->newTableSpace), - RecentXmin, - ReadNextMultiXactId(), - persistence); + /* + * Swap the physical files of the old and new heaps, then + * rebuild indexes and discard the old heap. We can use + * RecentXmin for the table's new relfrozenxid because we + * rewrote all the tuples in ATRewriteTable, so no older Xid + * remains in the table. Also, we never try to swap toast + * tables by content, since we have no interest in letting this + * code work on system catalogs. 
+ */ + finish_heap_swap(tab->relid, OIDNewHeap, + false, false, true, + !OidIsValid(tab->newTableSpace), + RecentXmin, + ReadNextMultiXactId(), + persistence); - InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0); + InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0); + } } else { diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c index 0bf28b55d7..17185f4e55 100644 --- a/src/backend/replication/basebackup.c +++ b/src/backend/replication/basebackup.c @@ -1209,6 +1209,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly, bool excludeFound; ForkNumber relForkNum; /* Type of fork if file is a relation */ int relOidChars; /* Chars in filename that are the rel oid */ + StorageMarks mark; /* Skip special stuff */ if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0) @@ -1259,7 +1260,7 @@ sendDir(bbsink *sink, const char *path, int basepathlen, bool sizeonly, /* Exclude all forks for unlogged tables except the init fork */ if (isDbDir && parse_filename_for_nontemp_relation(de->d_name, &relOidChars, - &relForkNum)) + &relForkNum, &mark)) { /* Never exclude init forks */ if (relForkNum != INIT_FORKNUM) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index f5459c68f8..6cd010429a 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -38,6 +38,7 @@ #include "access/xlogutils.h" #include "catalog/catalog.h" #include "catalog/storage.h" +#include "catalog/storage_xlog.h" #include "executor/instrument.h" #include "lib/binaryheap.h" #include "miscadmin.h" @@ -3155,6 +3156,91 @@ DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum, } } +/* --------------------------------------------------------------------- + * SetRelFileNodeBuffersPersistence + * + * This function changes the persistence of all buffer pages of a relation + * then writes all dirty pages of the relation out to disk when switching + 
* to PERMANENT (or more accurately, out to kernel disk buffers), + * ensuring that the kernel has an up-to-date view of the relation. + * + * Generally, the caller should be holding AccessExclusiveLock on the + * target relation to ensure that no other backend is busy dirtying + * more blocks of the relation; the effects can't be expected to last + * after the lock is released. + * + * XXX currently it sequentially searches the buffer pool, should be + * changed to more clever ways of searching. This routine is not + * used in any performance-critical code paths, so it's not worth + * adding additional overhead to normal paths to make it go faster; + * but see also DropRelFileNodeBuffers. + * -------------------------------------------------------------------- + */ +void +SetRelationBuffersPersistence(SMgrRelation srel, bool permanent, bool isRedo) +{ + int i; + RelFileNodeBackend rnode = srel->smgr_rnode; + + Assert (!RelFileNodeBackendIsTemp(rnode)); + + if (!isRedo) + log_smgrbufpersistence(&srel->smgr_rnode.node, permanent); + + ResourceOwnerEnlargeBuffers(CurrentResourceOwner); + + for (i = 0; i < NBuffers; i++) + { + BufferDesc *bufHdr = GetBufferDescriptor(i); + uint32 buf_state; + + if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node)) + continue; + + ReservePrivateRefCountEntry(); + + buf_state = LockBufHdr(bufHdr); + + if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node)) + { + UnlockBufHdr(bufHdr, buf_state); + continue; + } + + if (permanent) + { + /* Init fork is being dropped, drop buffers for it. 
*/ + if (bufHdr->tag.forkNum == INIT_FORKNUM) + { + InvalidateBuffer(bufHdr); + continue; + } + + buf_state |= BM_PERMANENT; + pg_atomic_write_u32(&bufHdr->state, buf_state); + + /* we flush this buffer when switching to PERMANENT */ + if ((buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY)) + { + PinBuffer_Locked(bufHdr); + LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), + LW_SHARED); + FlushBuffer(bufHdr, srel); + LWLockRelease(BufferDescriptorGetContentLock(bufHdr)); + UnpinBuffer(bufHdr, true); + } + else + UnlockBufHdr(bufHdr, buf_state); + } + else + { + /* There shouldn't be an init fork */ + Assert(bufHdr->tag.forkNum != INIT_FORKNUM); + UnlockBufHdr(bufHdr, buf_state); + } + } +} + /* --------------------------------------------------------------------- * DropRelFileNodesAllBuffers * diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index 14b77f2861..2fc9f17c28 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -349,8 +349,6 @@ static void pre_sync_fname(const char *fname, bool isdir, int elevel); static void datadir_fsync_fname(const char *fname, bool isdir, int elevel); static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel); -static int fsync_parent_path(const char *fname, int elevel); - /* * pg_fsync --- do fsync with or without writethrough @@ -3759,7 +3757,7 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel) * This is aimed at making file operations persistent on disk in case of * an OS crash or power failure. 
*/ -static int +int fsync_parent_path(const char *fname, int elevel) { char parentpath[MAXPGPATH]; diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c index f053fe0495..f28f55baa6 100644 --- a/src/backend/storage/file/reinit.c +++ b/src/backend/storage/file/reinit.c @@ -16,29 +16,49 @@ #include <unistd.h> +#include "access/xlogrecovery.h" +#include "catalog/pg_tablespace_d.h" #include "common/relpath.h" #include "postmaster/startup.h" +#include "storage/bufmgr.h" #include "storage/copydir.h" #include "storage/fd.h" +#include "storage/md.h" #include "storage/reinit.h" +#include "storage/smgr.h" #include "utils/hsearch.h" #include "utils/memutils.h" static void ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, - int op); + Oid tspid, int op); static void ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, - int op); + Oid tspid, Oid dbid, int op); typedef struct { Oid reloid; /* hash key */ -} unlogged_relation_entry; + bool has_init; /* has INIT fork */ + bool dirty_init; /* needs to remove INIT fork */ + bool dirty_all; /* needs to remove all forks */ +} relfile_entry; /* - * Reset unlogged relations from before the last restart. + * Clean up and reset relation files from before the last restart. * - * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any - * relation with an "init" fork, except for the "init" fork itself. + * If op includes UNLOGGED_RELATION_CLEANUP, we perform different operations + * depending on the existence of the "cleanup" forks. + * + * If SMGR_MARK_UNCOMMITTED mark file for init fork is present, we remove the + * init fork along with the mark file. + * + * If SMGR_MARK_UNCOMMITTED mark file for main fork is present we remove the + * whole relation along with the mark file. + * + * Otherwise, if the "init" fork is found. we remove all forks of any relation + * with the "init" fork, except for the "init" fork itself. 
+ * + * If op includes UNLOGGED_RELATION_DROP_BUFFER, we drop all buffers for all + * relations that have the "cleanup" and/or the "init" forks. * * If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main * fork. @@ -72,7 +92,7 @@ ResetUnloggedRelations(int op) /* * First process unlogged files in pg_default ($PGDATA/base) */ - ResetUnloggedRelationsInTablespaceDir("base", op); + ResetUnloggedRelationsInTablespaceDir("base", DEFAULTTABLESPACE_OID, op); /* * Cycle through directories for all non-default tablespaces. @@ -81,13 +101,19 @@ ResetUnloggedRelations(int op) while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL) { + Oid tspid; + if (strcmp(spc_de->d_name, ".") == 0 || strcmp(spc_de->d_name, "..") == 0) continue; snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s", spc_de->d_name, TABLESPACE_VERSION_DIRECTORY); - ResetUnloggedRelationsInTablespaceDir(temp_path, op); + + tspid = atooid(spc_de->d_name); + + Assert(tspid != 0); + ResetUnloggedRelationsInTablespaceDir(temp_path, tspid, op); } FreeDir(spc_dir); @@ -103,7 +129,8 @@ ResetUnloggedRelations(int op) * Process one tablespace directory for ResetUnloggedRelations */ static void -ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) +ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, + Oid tspid, int op) { DIR *ts_dir; struct dirent *de; @@ -130,6 +157,8 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) while ((de = ReadDir(ts_dir, tsdirname)) != NULL) { + Oid dbid; + /* * We're only interested in the per-database directories, which have * numeric names. Note that this code will also (properly) ignore "." 
@@ -148,7 +177,10 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) ereport_startup_progress("resetting unlogged relations (cleanup), elapsed time: %ld.%02d s, current path: %s", dbspace_path); - ResetUnloggedRelationsInDbspaceDir(dbspace_path, op); + dbid = atooid(de->d_name); + Assert(dbid != 0); + + ResetUnloggedRelationsInDbspaceDir(dbspace_path, tspid, dbid, op); } FreeDir(ts_dir); @@ -158,125 +190,227 @@ ResetUnloggedRelationsInTablespaceDir(const char *tsdirname, int op) * Process one per-dbspace directory for ResetUnloggedRelations */ static void -ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) +ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, + Oid tspid, Oid dbid, int op) { DIR *dbspace_dir; struct dirent *de; char rm_path[MAXPGPATH * 2]; + HTAB *hash; + HASHCTL ctl; /* Caller must specify at least one operation. */ - Assert((op & (UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT)) != 0); + Assert((op & (UNLOGGED_RELATION_CLEANUP | + UNLOGGED_RELATION_DROP_BUFFER | + UNLOGGED_RELATION_INIT)) != 0); /* * Cleanup is a two-pass operation. First, we go through and identify all * the files with init forks. Then, we go through again and nuke * everything with the same OID except the init fork. */ + + /* + * It's possible that someone could create tons of unlogged relations in + * the same database & tablespace, so we'd better use a hash table rather + * than an array or linked list to keep track of which files need to be + * reset. Otherwise, this cleanup operation would be O(n^2). + */ + memset(&ctl, 0, sizeof(ctl)); + ctl.keysize = sizeof(Oid); + ctl.entrysize = sizeof(relfile_entry); + hash = hash_create("relfilenode cleanup hash", + 32, &ctl, HASH_ELEM | HASH_BLOBS); + + /* Collect INIT fork and mark files in the directory. 
*/ + dbspace_dir = AllocateDir(dbspacedirname); + while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) + { + int oidchars; + ForkNumber forkNum; + StorageMarks mark; + + /* Skip anything that doesn't look like a relation data file. */ + if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, + &forkNum, &mark)) + continue; + + if (forkNum == INIT_FORKNUM || mark == SMGR_MARK_UNCOMMITTED) + { + Oid key; + relfile_entry *ent; + bool found; + + /* + * Record the relfilenode information. If it has + * SMGR_MARK_UNCOMMITTED mark files, the relfilenode is in dirty + * state, where clean up is needed. + */ + key = atooid(de->d_name); + ent = hash_search(hash, &key, HASH_ENTER, &found); + + if (!found) + { + ent->has_init = false; + ent->dirty_init = false; + ent->dirty_all = false; + } + + if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_UNCOMMITTED) + ent->dirty_init = true; + else if (forkNum == MAIN_FORKNUM && mark == SMGR_MARK_UNCOMMITTED) + ent->dirty_all = true; + else + { + Assert(forkNum == INIT_FORKNUM); + ent->has_init = true; + } + } + } + + /* Done with the first pass. */ + FreeDir(dbspace_dir); + + /* nothing to do if we don't have init nor cleanup forks */ + if (hash_get_num_entries(hash) < 1) + { + hash_destroy(hash); + return; + } + + if ((op & UNLOGGED_RELATION_DROP_BUFFER) != 0) + { + /* + * When we come here after recovery, smgr object for this file might + * have been created. In that case we need to drop all buffers then the + * smgr object before initializing the unlogged relation. This is safe + * as far as no other backends have accessed the relation before + * starting archive recovery. 
+ */ + HASH_SEQ_STATUS status; + relfile_entry *ent; + SMgrRelation *srels = palloc(sizeof(SMgrRelation) * 8); + int maxrels = 8; + int nrels = 0; + int i; + + Assert(!HotStandbyActive()); + + hash_seq_init(&status, hash); + while((ent = (relfile_entry *) hash_seq_search(&status)) != NULL) + { + RelFileNodeBackend rel; + + /* + * The relation is persistent and remains persistent. Don't + * drop the buffers for this relation. + */ + if (ent->has_init && ent->dirty_init) + continue; + + if (maxrels <= nrels) + { + maxrels *= 2; + srels = repalloc(srels, sizeof(SMgrRelation) * maxrels); + } + + rel.backend = InvalidBackendId; + rel.node.spcNode = tspid; + rel.node.dbNode = dbid; + rel.node.relNode = ent->reloid; + + srels[nrels++] = smgropen(rel.node, InvalidBackendId); + } + + DropRelFileNodesAllBuffers(srels, nrels); + + for (i = 0 ; i < nrels ; i++) + smgrclose(srels[i]); + } + + /* + * Now, make a second pass and remove anything that matches. + */ if ((op & UNLOGGED_RELATION_CLEANUP) != 0) { - HTAB *hash; - HASHCTL ctl; - - /* - * It's possible that someone could create a ton of unlogged relations - * in the same database & tablespace, so we'd better use a hash table - * rather than an array or linked list to keep track of which files - * need to be reset. Otherwise, this cleanup operation would be - * O(n^2). - */ - ctl.keysize = sizeof(Oid); - ctl.entrysize = sizeof(unlogged_relation_entry); - ctl.hcxt = CurrentMemoryContext; - hash = hash_create("unlogged relation OIDs", 32, &ctl, - HASH_ELEM | HASH_BLOBS | HASH_CONTEXT); - - /* Scan the directory. */ dbspace_dir = AllocateDir(dbspacedirname); while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) { - ForkNumber forkNum; - int oidchars; - unlogged_relation_entry ent; + ForkNumber forkNum; + StorageMarks mark; + int oidchars; + Oid key; + relfile_entry *ent; + RelFileNodeBackend rel; /* Skip anything that doesn't look like a relation data file. 
*/ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) - continue; - - /* Also skip it unless this is the init fork. */ - if (forkNum != INIT_FORKNUM) - continue; - - /* - * Put the OID portion of the name into the hash table, if it - * isn't already. - */ - ent.reloid = atooid(de->d_name); - (void) hash_search(hash, &ent, HASH_ENTER, NULL); - } - - /* Done with the first pass. */ - FreeDir(dbspace_dir); - - /* - * If we didn't find any init forks, there's no point in continuing; - * we can bail out now. - */ - if (hash_get_num_entries(hash) == 0) - { - hash_destroy(hash); - return; - } - - /* - * Now, make a second pass and remove anything that matches. - */ - dbspace_dir = AllocateDir(dbspacedirname); - while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) - { - ForkNumber forkNum; - int oidchars; - unlogged_relation_entry ent; - - /* Skip anything that doesn't look like a relation data file. */ - if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) - continue; - - /* We never remove the init fork. */ - if (forkNum == INIT_FORKNUM) + &forkNum, &mark)) continue; /* * See whether the OID portion of the name shows up in the hash * table. If so, nuke it! */ - ent.reloid = atooid(de->d_name); - if (hash_search(hash, &ent, HASH_FIND, NULL)) + key = atooid(de->d_name); + ent = hash_search(hash, &key, HASH_FIND, NULL); + + if (!ent) + continue; + + if (!ent->dirty_all) { - snprintf(rm_path, sizeof(rm_path), "%s/%s", - dbspacedirname, de->d_name); - if (unlink(rm_path) < 0) - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not remove file \"%s\": %m", - rm_path))); + /* clean permanent relations don't need cleanup */ + if (!ent->has_init) + continue; + + if (ent->dirty_init) + { + /* + * The crashed transaction did SET UNLOGGED. This relation + * is restored to a LOGGED relation. 
+ */ + if (forkNum != INIT_FORKNUM) + continue; + } else - elog(DEBUG2, "unlinked file \"%s\"", rm_path); + { + /* + * we don't remove the INIT fork of a non-dirty + * relfilenode + */ + if (forkNum == INIT_FORKNUM && mark == SMGR_MARK_NONE) + continue; + } } + + /* so, nuke it! */ + snprintf(rm_path, sizeof(rm_path), "%s/%s", + dbspacedirname, de->d_name); + if (unlink(rm_path) < 0) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not remove file \"%s\": %m", + rm_path))); + + rel.backend = InvalidBackendId; + rel.node.spcNode = tspid; + rel.node.dbNode = dbid; + rel.node.relNode = atooid(de->d_name); + + ForgetRelationForkSyncRequests(rel, forkNum); } /* Cleanup is complete. */ FreeDir(dbspace_dir); - hash_destroy(hash); } + hash_destroy(hash); + hash = NULL; + /* * Initialization happens after cleanup is complete: we copy each init - * fork file to the corresponding main fork file. Note that if we are - * asked to do both cleanup and init, we may never get here: if the - * cleanup code determines that there are no init forks in this dbspace, - * it will return before we get to this point. + * fork file to the corresponding main fork file. */ if ((op & UNLOGGED_RELATION_INIT) != 0) { @@ -285,6 +419,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) { ForkNumber forkNum; + StorageMarks mark; int oidchars; char oidbuf[OIDCHARS + 1]; char srcpath[MAXPGPATH * 2]; @@ -292,9 +427,11 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) /* Skip anything that doesn't look like a relation data file. */ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) + &forkNum, &mark)) continue; + Assert(mark == SMGR_MARK_NONE); + /* Also skip it unless this is the init fork. 
*/ if (forkNum != INIT_FORKNUM) continue; @@ -328,15 +465,18 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL) { ForkNumber forkNum; + StorageMarks mark; int oidchars; char oidbuf[OIDCHARS + 1]; char mainpath[MAXPGPATH]; /* Skip anything that doesn't look like a relation data file. */ if (!parse_filename_for_nontemp_relation(de->d_name, &oidchars, - &forkNum)) + &forkNum, &mark)) continue; + Assert(mark == SMGR_MARK_NONE); + /* Also skip it unless this is the init fork. */ if (forkNum != INIT_FORKNUM) continue; @@ -379,7 +519,7 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op) */ bool parse_filename_for_nontemp_relation(const char *name, int *oidchars, - ForkNumber *fork) + ForkNumber *fork, StorageMarks *mark) { int pos; @@ -410,11 +550,19 @@ parse_filename_for_nontemp_relation(const char *name, int *oidchars, for (segchar = 1; isdigit((unsigned char) name[pos + segchar]); ++segchar) ; - if (segchar <= 1) - return false; - pos += segchar; + if (segchar > 1) + pos += segchar; } + /* mark file? */ + if (name[pos] == '.' && name[pos + 1] != 0) + { + *mark = name[pos + 1]; + pos += 2; + } + else + *mark = SMGR_MARK_NONE; + /* Now we should be at the end. */ if (name[pos] != '\0') return false; diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index 879f647dbc..4d44bdd78b 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -139,7 +139,8 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno, BlockNumber blkno, bool skipFsync, int behavior); static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg); - +static bool mdmarkexists(SMgrRelation reln, ForkNumber forkNum, + StorageMarks mark); /* * mdinit() -- Initialize private state for magnetic disk storage manager. 
@@ -169,6 +170,82 @@ mdexists(SMgrRelation reln, ForkNumber forkNum) return (mdopenfork(reln, forkNum, EXTENSION_RETURN_NULL) != NULL); } +/* + * mdcreatemark() -- Create a mark file. + * + * If isRedo is true, it's okay for the file to exist already. + */ +void +mdcreatemark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark, + bool isRedo) +{ + char *path = markpath(reln->smgr_rnode, forkNum, mark); + int fd; + + /* See mdcreate for details. */ + TablespaceCreateDbspace(reln->smgr_rnode.node.spcNode, + reln->smgr_rnode.node.dbNode, + isRedo); + + fd = BasicOpenFile(path, O_WRONLY | O_CREAT | O_EXCL); + if (fd < 0 && (!isRedo || errno != EEXIST)) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not create mark file \"%s\": %m", path))); + + pg_fsync(fd); + close(fd); + + /* + * To guarantee that the creation of the file is persistent, fsync its + * parent directory. + */ + fsync_parent_path(path, ERROR); + + pfree(path); +} + + +/* + * mdunlinkmark() -- Delete the mark file + * + * If isRedo is true, it's okay for the file not to exist. + */ +void +mdunlinkmark(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark, + bool isRedo) +{ + char *path = markpath(reln->smgr_rnode, forkNum, mark); + + if (!isRedo || mdmarkexists(reln, forkNum, mark)) + durable_unlink(path, ERROR); + + pfree(path); +} + +/* + * mdmarkexists() -- Check if the file exists. + */ +static bool +mdmarkexists(SMgrRelation reln, ForkNumber forkNum, StorageMarks mark) +{ + char *path = markpath(reln->smgr_rnode, forkNum, mark); + int fd; + + fd = BasicOpenFile(path, O_RDONLY); + if (fd < 0 && errno != ENOENT) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not access mark file \"%s\": %m", path))); + pfree(path); + + if (fd < 0) + return false; + + close(fd); + return true; +} + /* * mdcreate() -- Create a new relation on magnetic disk. 
* @@ -1031,6 +1108,15 @@ register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum, RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ ); } +/* + * ForgetRelationForkSyncRequests -- forget any fsyncs and unlinks for a fork + */ +void +ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, ForkNumber forknum) +{ + register_forget_request(rnode, forknum, 0); +} + /* * ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB */ @@ -1384,12 +1470,14 @@ mdsyncfiletag(const FileTag *ftag, char *path) * Return 0 on success, -1 on failure, with errno set. */ int -mdunlinkfiletag(const FileTag *ftag, char *path) +mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark) { char *p; /* Compute the path. */ - p = relpathperm(ftag->rnode, MAIN_FORKNUM); + p = GetRelationPath(ftag->rnode.dbNode, ftag->rnode.spcNode, + ftag->rnode.relNode, InvalidBackendId, MAIN_FORKNUM, + mark); strlcpy(path, p, MAXPGPATH); pfree(p); diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c index d71a557a35..0710e8b145 100644 --- a/src/backend/storage/smgr/smgr.c +++ b/src/backend/storage/smgr/smgr.c @@ -63,6 +63,10 @@ typedef struct f_smgr void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks); void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum); + void (*smgr_createmark) (SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); + void (*smgr_unlinkmark) (SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); } f_smgr; static const f_smgr smgrsw[] = { @@ -84,6 +88,8 @@ static const f_smgr smgrsw[] = { .smgr_nblocks = mdnblocks, .smgr_truncate = mdtruncate, .smgr_immedsync = mdimmedsync, + .smgr_createmark = mdcreatemark, + .smgr_unlinkmark = mdunlinkmark, } }; @@ -337,6 +343,26 @@ smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo) smgrsw[reln->smgr_which].smgr_create(reln, forknum, isRedo); } +/* + * smgrcreatemark() -- Create a mark 
file + */ +void +smgrcreatemark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark, + bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_createmark(reln, forknum, mark, isRedo); +} + +/* + * smgrunlinkmark() -- Delete a mark file + */ +void +smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, StorageMarks mark, + bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_unlinkmark(reln, forknum, mark, isRedo); +} + /* * smgrdosyncall() -- Immediately sync all forks of all given relations * @@ -664,6 +690,12 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum) smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum); } +void +smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo) +{ + smgrsw[reln->smgr_which].smgr_unlink(reln->smgr_rnode, forknum, isRedo); +} + /* * AtEOXact_SMgr * diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c index e161d57761..f5ded7cb34 100644 --- a/src/backend/storage/sync/sync.c +++ b/src/backend/storage/sync/sync.c @@ -90,7 +90,8 @@ static CycleCtr checkpoint_cycle_ctr = 0; typedef struct SyncOps { int (*sync_syncfiletag) (const FileTag *ftag, char *path); - int (*sync_unlinkfiletag) (const FileTag *ftag, char *path); + int (*sync_unlinkfiletag) (const FileTag *ftag, char *path, + StorageMarks mark); bool (*sync_filetagmatches) (const FileTag *ftag, const FileTag *candidate); } SyncOps; @@ -223,7 +224,8 @@ SyncPostCheckpoint(void) /* Unlink the file */ if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag, - path) < 0) + path, + SMGR_MARK_NONE) < 0) { /* * There's a race condition, when the database is dropped at the @@ -237,6 +239,20 @@ SyncPostCheckpoint(void) (errcode_for_file_access(), errmsg("could not remove file \"%s\": %m", path))); } + else if (syncsw[entry->tag.handler].sync_unlinkfiletag( + &entry->tag, path, + SMGR_MARK_UNCOMMITTED) < 0) + { + /* + * And we may have SMGR_MARK_UNCOMMITTED file. Remove it if the + * fork files has been successfully removed. 
It's ok if the file + * does not exist. + */ + if (errno != ENOENT) + ereport(WARNING, + (errcode_for_file_access(), + errmsg("could not remove file \"%s\": %m", path))); + } /* Mark the list entry as canceled, just in case */ entry->canceled = true; diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c index 56df08c64f..f1382d4c4f 100644 --- a/src/bin/pg_rewind/parsexlog.c +++ b/src/bin/pg_rewind/parsexlog.c @@ -407,6 +407,28 @@ extractPageInfo(XLogReaderState *record) * source system. */ } + else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_UNLINK) + { + /* + * We can safely ignore these. We'll see that the files don't exist in + * the target data dir, and copy them in from the source system. No + * need to do anything special here. + */ + } + else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_MARK) + { + /* + * We can safely ignore these. The file will be removed from the + * target if it doesn't exist in the source system. The files are + * empty, so we don't need to bother with the content. + */ + } + else if (rmid == RM_SMGR_ID && rminfo == XLOG_SMGR_BUFPERSISTENCE) + { + /* + * We can safely ignore these. These don't make any on-disk changes. 
+ */ + } else if (rmid == RM_XACT_ID && ((rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT || (rminfo & XLOG_XACT_OPMASK) == XLOG_XACT_COMMIT_PREPARED || diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c index efb82a4034..b289df4060 100644 --- a/src/bin/pg_rewind/pg_rewind.c +++ b/src/bin/pg_rewind/pg_rewind.c @@ -412,7 +412,6 @@ main(int argc, char **argv) if (showprogress) pg_log_info("reading source file list"); source->traverse_files(source, &process_source_file); - if (showprogress) pg_log_info("reading target file list"); traverse_datadir(datadir_target, &process_target_file); diff --git a/src/common/relpath.c b/src/common/relpath.c index 636c96efd3..1c19e16fea 100644 --- a/src/common/relpath.c +++ b/src/common/relpath.c @@ -139,9 +139,15 @@ GetDatabasePath(Oid dbNode, Oid spcNode) */ char * GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, - int backendId, ForkNumber forkNumber) + int backendId, ForkNumber forkNumber, char mark) { char *path; + char markstr[4]; + + if (mark == 0) + markstr[0] = 0; + else + snprintf(markstr, sizeof(markstr), ".%c", mark); if (spcNode == GLOBALTABLESPACE_OID) { @@ -149,10 +155,10 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, Assert(dbNode == 0); Assert(backendId == InvalidBackendId); if (forkNumber != MAIN_FORKNUM) - path = psprintf("global/%u_%s", - relNode, forkNames[forkNumber]); + path = psprintf("global/%u_%s%s", + relNode, forkNames[forkNumber], markstr); else - path = psprintf("global/%u", relNode); + path = psprintf("global/%u%s", relNode, markstr); } else if (spcNode == DEFAULTTABLESPACE_OID) { @@ -160,22 +166,22 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, if (backendId == InvalidBackendId) { if (forkNumber != MAIN_FORKNUM) - path = psprintf("base/%u/%u_%s", + path = psprintf("base/%u/%u_%s%s", dbNode, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("base/%u/%u", - dbNode, relNode); + path = psprintf("base/%u/%u%s", + 
dbNode, relNode, markstr); } else { if (forkNumber != MAIN_FORKNUM) - path = psprintf("base/%u/t%d_%u_%s", + path = psprintf("base/%u/t%d_%u_%s%s", dbNode, backendId, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("base/%u/t%d_%u", - dbNode, backendId, relNode); + path = psprintf("base/%u/t%d_%u%s", + dbNode, backendId, relNode, markstr); } } else @@ -184,27 +190,28 @@ GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, if (backendId == InvalidBackendId) { if (forkNumber != MAIN_FORKNUM) - path = psprintf("pg_tblspc/%u/%s/%u/%u_%s", + path = psprintf("pg_tblspc/%u/%s/%u/%u_%s%s", spcNode, TABLESPACE_VERSION_DIRECTORY, dbNode, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("pg_tblspc/%u/%s/%u/%u", + path = psprintf("pg_tblspc/%u/%s/%u/%u%s", spcNode, TABLESPACE_VERSION_DIRECTORY, - dbNode, relNode); + dbNode, relNode, markstr); } else { if (forkNumber != MAIN_FORKNUM) - path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s", + path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u_%s%s", spcNode, TABLESPACE_VERSION_DIRECTORY, dbNode, backendId, relNode, - forkNames[forkNumber]); + forkNames[forkNumber], markstr); else - path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u", + path = psprintf("pg_tblspc/%u/%s/%u/t%d_%u%s", spcNode, TABLESPACE_VERSION_DIRECTORY, - dbNode, backendId, relNode); + dbNode, backendId, relNode, markstr); } } + return path; } diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h index 9ffc741913..d362d62ed2 100644 --- a/src/include/catalog/storage.h +++ b/src/include/catalog/storage.h @@ -23,6 +23,8 @@ extern int wal_skip_threshold; extern SMgrRelation RelationCreateStorage(RelFileNode rnode, char relpersistence); +extern void RelationCreateInitFork(Relation rel); +extern void RelationDropInitFork(Relation rel); extern void RelationDropStorage(Relation rel); extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit); extern void 
RelationPreTruncate(Relation rel); @@ -41,6 +43,7 @@ extern void RestorePendingSyncs(char *startAddress); extern void smgrDoPendingDeletes(bool isCommit); extern void smgrDoPendingSyncs(bool isCommit, bool isParallelWorker); extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr); +extern void smgrDoPendingCleanups(bool isCommit); extern void AtSubCommit_smgr(void); extern void AtSubAbort_smgr(void); extern void PostPrepare_smgr(void); diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h index 622de22b03..8139308634 100644 --- a/src/include/catalog/storage_xlog.h +++ b/src/include/catalog/storage_xlog.h @@ -18,17 +18,23 @@ #include "lib/stringinfo.h" #include "storage/block.h" #include "storage/relfilenode.h" +#include "storage/smgr.h" /* * Declarations for smgr-related XLOG records * - * Note: we log file creation and truncation here, but logging of deletion - * actions is handled by xact.c, because it is part of transaction commit. + * Note: we log file creation, truncation and buffer persistence change here, + * but logging of deletion actions is handled mainly by xact.c, because it is + * part of transaction commit in most cases. However, there's a case where + * init forks are deleted outside control of transaction. 
*/ /* XLOG gives us high 4 bits */ #define XLOG_SMGR_CREATE 0x10 #define XLOG_SMGR_TRUNCATE 0x20 +#define XLOG_SMGR_UNLINK 0x30 +#define XLOG_SMGR_MARK 0x40 +#define XLOG_SMGR_BUFPERSISTENCE 0x50 typedef struct xl_smgr_create { @@ -36,6 +42,32 @@ typedef struct xl_smgr_create ForkNumber forkNum; } xl_smgr_create; +typedef struct xl_smgr_unlink +{ + RelFileNode rnode; + ForkNumber forkNum; +} xl_smgr_unlink; + +typedef enum smgr_mark_action +{ + XLOG_SMGR_MARK_CREATE = 'c', + XLOG_SMGR_MARK_UNLINK = 'u' +} smgr_mark_action; + +typedef struct xl_smgr_mark +{ + RelFileNode rnode; + ForkNumber forkNum; + StorageMarks mark; + smgr_mark_action action; +} xl_smgr_mark; + +typedef struct xl_smgr_bufpersistence +{ + RelFileNode rnode; + bool persistence; +} xl_smgr_bufpersistence; + /* flags for xl_smgr_truncate */ #define SMGR_TRUNCATE_HEAP 0x0001 #define SMGR_TRUNCATE_VM 0x0002 @@ -51,6 +83,12 @@ typedef struct xl_smgr_truncate } xl_smgr_truncate; extern void log_smgrcreate(const RelFileNode *rnode, ForkNumber forkNum); +extern void log_smgrunlink(const RelFileNode *rnode, ForkNumber forkNum); +extern void log_smgrcreatemark(const RelFileNode *rnode, ForkNumber forkNum, + StorageMarks mark); +extern void log_smgrunlinkmark(const RelFileNode *rnode, ForkNumber forkNum, + StorageMarks mark); +extern void log_smgrbufpersistence(const RelFileNode *rnode, bool persistence); extern void smgr_redo(XLogReaderState *record); extern void smgr_desc(StringInfo buf, XLogReaderState *record); diff --git a/src/include/common/relpath.h b/src/include/common/relpath.h index a4b5dc853b..a864c91614 100644 --- a/src/include/common/relpath.h +++ b/src/include/common/relpath.h @@ -67,7 +67,7 @@ extern int forkname_chars(const char *str, ForkNumber *fork); extern char *GetDatabasePath(Oid dbNode, Oid spcNode); extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, - int backendId, ForkNumber forkNumber); + int backendId, ForkNumber forkNumber, char mark); /* * Wrapper macros for 
GetRelationPath. Beware of multiple @@ -77,7 +77,7 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, /* First argument is a RelFileNode */ #define relpathbackend(rnode, backend, forknum) \ GetRelationPath((rnode).dbNode, (rnode).spcNode, (rnode).relNode, \ - backend, forknum) + backend, forknum, 0) /* First argument is a RelFileNode */ #define relpathperm(rnode, forknum) \ @@ -87,4 +87,9 @@ extern char *GetRelationPath(Oid dbNode, Oid spcNode, Oid relNode, #define relpath(rnode, forknum) \ relpathbackend((rnode).node, (rnode).backend, forknum) +/* First argument is a RelFileNodeBackend */ +#define markpath(rnode, forknum, mark) \ + GetRelationPath((rnode).node.dbNode, (rnode).node.spcNode, \ + (rnode).node.relNode, \ + (rnode).backend, forknum, mark) #endif /* RELPATH_H */ diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h index dd01841c30..739b386216 100644 --- a/src/include/storage/bufmgr.h +++ b/src/include/storage/bufmgr.h @@ -206,6 +206,8 @@ extern void FlushRelationsAllBuffers(struct SMgrRelationData **smgrs, int nrels) extern void FlushDatabaseBuffers(Oid dbid); extern void DropRelFileNodeBuffers(struct SMgrRelationData *smgr_reln, ForkNumber *forkNum, int nforks, BlockNumber *firstDelBlock); +extern void SetRelationBuffersPersistence(struct SMgrRelationData *srel, + bool permanent, bool isRedo); extern void DropRelFileNodesAllBuffers(struct SMgrRelationData **smgr_reln, int nnodes); extern void DropDatabaseBuffers(Oid dbid); diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h index 29209e2724..8bf746bf45 100644 --- a/src/include/storage/fd.h +++ b/src/include/storage/fd.h @@ -185,6 +185,7 @@ extern ssize_t pg_pwritev_with_retry(int fd, extern int pg_truncate(const char *path, off_t length); extern void fsync_fname(const char *fname, bool isdir); extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel); +extern int fsync_parent_path(const char *fname, int elevel); extern 
int durable_rename(const char *oldfile, const char *newfile, int loglevel); extern int durable_unlink(const char *fname, int loglevel); extern int durable_rename_excl(const char *oldfile, const char *newfile, int loglevel); diff --git a/src/include/storage/md.h b/src/include/storage/md.h index 6e46d8d96a..ef5fdaf4f8 100644 --- a/src/include/storage/md.h +++ b/src/include/storage/md.h @@ -24,6 +24,10 @@ extern void mdinit(void); extern void mdopen(SMgrRelation reln); extern void mdclose(SMgrRelation reln, ForkNumber forknum); extern void mdrelease(void); +extern void mdcreatemark(SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); +extern void mdunlinkmark(SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo); extern bool mdexists(SMgrRelation reln, ForkNumber forknum); extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo); @@ -42,12 +46,14 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks); extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum); +extern void ForgetRelationForkSyncRequests(RelFileNodeBackend rnode, + ForkNumber forknum); extern void ForgetDatabaseSyncRequests(Oid dbid); extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo); /* md sync callbacks */ extern int mdsyncfiletag(const FileTag *ftag, char *path); -extern int mdunlinkfiletag(const FileTag *ftag, char *path); +extern int mdunlinkfiletag(const FileTag *ftag, char *path, StorageMarks mark); extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate); #endif /* MD_H */ diff --git a/src/include/storage/reinit.h b/src/include/storage/reinit.h index bf2c10d443..e399aec0c7 100644 --- a/src/include/storage/reinit.h +++ b/src/include/storage/reinit.h @@ -16,13 +16,15 @@ #define REINIT_H #include "common/relpath.h" - +#include "storage/smgr.h" extern void ResetUnloggedRelations(int 
op); -extern bool parse_filename_for_nontemp_relation(const char *name, - int *oidchars, ForkNumber *fork); +extern bool parse_filename_for_nontemp_relation(const char *name, int *oidchars, + ForkNumber *fork, + StorageMarks *mark); #define UNLOGGED_RELATION_CLEANUP 0x0001 -#define UNLOGGED_RELATION_INIT 0x0002 +#define UNLOGGED_RELATION_DROP_BUFFER 0x0002 +#define UNLOGGED_RELATION_INIT 0x0004 #endif /* REINIT_H */ diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h index 8e3ef92cda..022654b7b2 100644 --- a/src/include/storage/smgr.h +++ b/src/include/storage/smgr.h @@ -18,6 +18,18 @@ #include "storage/block.h" #include "storage/relfilenode.h" +/* + * A storage mark is a file whose existence indicates something about + * another file. Such a file is named "<filename>.<mark>", where the mark is + * one of the values of StorageMarks. Since ".<digit>" denotes a segment file, + * don't use digits for the mark character. + */ +typedef enum StorageMarks +{ + SMGR_MARK_NONE = 0, + SMGR_MARK_UNCOMMITTED = 'u' /* the file is not committed yet */ +} StorageMarks; + /* * smgr.c maintains a table of SMgrRelation objects, which are essentially * cached file handles. 
An SMgrRelation is created (if not already present) @@ -85,7 +97,12 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln); extern void smgrclose(SMgrRelation reln); extern void smgrcloseall(void); extern void smgrclosenode(RelFileNodeBackend rnode); +extern void smgrcreatemark(SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); +extern void smgrunlinkmark(SMgrRelation reln, ForkNumber forknum, + StorageMarks mark, bool isRedo); extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo); +extern void smgrunlink(SMgrRelation reln, ForkNumber forknum, bool isRedo); extern void smgrdosyncall(SMgrRelation *rels, int nrels); extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo); extern void smgrextend(SMgrRelation reln, ForkNumber forknum, -- 2.27.0 From 26cac5c8a65ff27e294996198333924c7e839a00 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Wed, 11 Nov 2020 23:21:09 +0900 Subject: [PATCH v19 2/2] New command ALTER TABLE ALL IN TABLESPACE SET LOGGED/UNLOGGED To ease invoking ALTER TABLE SET LOGGED/UNLOGGED, this command changes relation persistence of all tables in the specified tablespace. 
--- doc/src/sgml/ref/alter_table.sgml | 15 +++ src/backend/commands/tablecmds.c | 140 +++++++++++++++++++++++ src/backend/nodes/copyfuncs.c | 16 +++ src/backend/nodes/equalfuncs.c | 15 +++ src/backend/parser/gram.y | 42 +++++++ src/backend/tcop/utility.c | 11 ++ src/include/commands/tablecmds.h | 2 + src/include/nodes/nodes.h | 1 + src/include/nodes/parsenodes.h | 10 ++ src/test/regress/expected/tablespace.out | 76 ++++++++++++ src/test/regress/sql/tablespace.sql | 41 +++++++ 11 files changed, 369 insertions(+) diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml index 5c0735e08a..b03d5511a6 100644 --- a/doc/src/sgml/ref/alter_table.sgml +++ b/doc/src/sgml/ref/alter_table.sgml @@ -33,6 +33,8 @@ ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable> SET SCHEMA <replaceable class="parameter">new_schema</replaceable> ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable>[, ... ] ] SET TABLESPACE <replaceable class="parameter">new_tablespace</replaceable> [ NOWAIT ] +ALTER TABLE ALL IN TABLESPACE <replaceable class="parameter">name</replaceable> [ OWNED BY <replaceable class="parameter">role_name</replaceable>[, ... ] ] + SET { LOGGED | UNLOGGED } [ NOWAIT ] ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable> ATTACH PARTITION <replaceable class="parameter">partition_name</replaceable> { FOR VALUES <replaceable class="parameter">partition_bound_spec</replaceable>| DEFAULT } ALTER TABLE [ IF EXISTS ] <replaceable class="parameter">name</replaceable> @@ -753,6 +755,19 @@ WITH ( MODULUS <replaceable class="parameter">numeric_literal</replaceable>, REM (see <xref linkend="sql-createtable-unlogged"/>). It cannot be applied to a temporary table. 
</para> + + <para> + All tables in the current database in a tablespace can be changed by using + the <literal>ALL IN TABLESPACE</literal> form, which will lock all tables + to be changed first and then change each one. This form also supports + <literal>OWNED BY</literal>, which will only change tables owned by the + roles specified. If the <literal>NOWAIT</literal> option is specified + then the command will fail if it is unable to acquire all of the locks + required immediately. The <literal>information_schema</literal> + relations are not considered part of the system catalogs and will be + changed. See also + <link linkend="sql-createtablespace"><command>CREATE TABLESPACE</command></link>. + </para> </listitem> </varlistentry> diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 9e5b77e94a..0724d0e1d2 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -14770,6 +14770,146 @@ AlterTableMoveAll(AlterTableMoveAllStmt *stmt) return new_tablespaceoid; } +/* + * Alter Table ALL ... SET LOGGED/UNLOGGED + * + * Allows a user to change persistence of all objects in a given tablespace in + * the current database. Objects can also be chosen based on their owner, to + * allow users to change the persistence of only their own objects. The + * main permissions handling is done by the lower-level change persistence + * function. + * + * All to-be-modified objects are locked first. If NOWAIT is specified and the + * lock can't be acquired then we ereport(ERROR). 
+ */ +void +AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt) +{ + List *relations = NIL; + ListCell *l; + ScanKeyData key[1]; + Relation rel; + TableScanDesc scan; + HeapTuple tuple; + Oid tablespaceoid; + List *role_oids = roleSpecsToIds(stmt->roles); + + /* Ensure we were not asked to change something we can't */ + if (stmt->objtype != OBJECT_TABLE) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("only tables can be specified"))); + + /* Get the tablespace OID */ + tablespaceoid = get_tablespace_oid(stmt->tablespacename, false); + + /* + * Now that the checks are done, check if we should set it to + * InvalidOid because it is our database's default tablespace. + */ + if (tablespaceoid == MyDatabaseTableSpace) + tablespaceoid = InvalidOid; + + /* + * Walk the list of objects in the tablespace to pick them up. This will + * only find objects in our database, of course. + */ + ScanKeyInit(&key[0], + Anum_pg_class_reltablespace, + BTEqualStrategyNumber, F_OIDEQ, + ObjectIdGetDatum(tablespaceoid)); + + rel = table_open(RelationRelationId, AccessShareLock); + scan = table_beginscan_catalog(rel, 1, key); + while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL) + { + Form_pg_class relForm = (Form_pg_class) GETSTRUCT(tuple); + Oid relOid = relForm->oid; + + /* + * Do not pick up objects in pg_catalog as part of this; if an admin + * really wishes to do so, they can issue the individual ALTER + * commands directly. + * + * Also, explicitly avoid any shared tables, temp tables, or TOAST + * (TOAST will be changed with the main table). 
+ */ + if (IsCatalogNamespace(relForm->relnamespace) || + relForm->relisshared || + isAnyTempNamespace(relForm->relnamespace) || + IsToastNamespace(relForm->relnamespace)) + continue; + + /* Only pick up the object type requested */ + if (relForm->relkind != RELKIND_RELATION) + continue; + + /* Check if we are only picking-up objects owned by certain roles */ + if (role_oids != NIL && !list_member_oid(role_oids, relForm->relowner)) + continue; + + /* + * Handle permissions-checking here since we are locking the tables + * and also to avoid doing a bunch of work only to fail part-way. Note + * that permissions will also be checked by AlterTableInternal(). + * + * Caller must be considered an owner on the table of which we're going + * to change persistence. + */ + if (!pg_class_ownercheck(relOid, GetUserId())) + aclcheck_error(ACLCHECK_NOT_OWNER, get_relkind_objtype(get_rel_relkind(relOid)), + NameStr(relForm->relname)); + + if (stmt->nowait && + !ConditionalLockRelationOid(relOid, AccessExclusiveLock)) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_IN_USE), + errmsg("aborting because lock on relation \"%s.%s\" is not available", + get_namespace_name(relForm->relnamespace), + NameStr(relForm->relname)))); + else + LockRelationOid(relOid, AccessExclusiveLock); + + /* + * Add to our list of objects of which we're going to change + * persistence. + */ + relations = lappend_oid(relations, relOid); + } + + table_endscan(scan); + table_close(rel, AccessShareLock); + + if (relations == NIL) + ereport(NOTICE, + (errcode(ERRCODE_NO_DATA_FOUND), + errmsg("no matching relations in tablespace \"%s\" found", + tablespaceoid == InvalidOid ? "(database default)" : + get_tablespace_name(tablespaceoid)))); + + /* + * Everything is locked, loop through and change persistence of all of the + * relations. 
+ */ + foreach(l, relations) + { + List *cmds = NIL; + AlterTableCmd *cmd = makeNode(AlterTableCmd); + + if (stmt->logged) + cmd->subtype = AT_SetLogged; + else + cmd->subtype = AT_SetUnLogged; + + cmds = lappend(cmds, cmd); + + EventTriggerAlterTableStart((Node *) stmt); + /* OID is set by AlterTableInternal */ + AlterTableInternal(lfirst_oid(l), cmds, false); + EventTriggerAlterTableEnd(); + } +} + static void index_copy_data(Relation rel, RelFileNode newrnode) { diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c index d4f8455a2b..ba605405a9 100644 --- a/src/backend/nodes/copyfuncs.c +++ b/src/backend/nodes/copyfuncs.c @@ -4285,6 +4285,19 @@ _copyAlterTableMoveAllStmt(const AlterTableMoveAllStmt *from) return newnode; } +static AlterTableSetLoggedAllStmt * +_copyAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *from) +{ + AlterTableSetLoggedAllStmt *newnode = makeNode(AlterTableSetLoggedAllStmt); + + COPY_STRING_FIELD(tablespacename); + COPY_SCALAR_FIELD(objtype); + COPY_SCALAR_FIELD(logged); + COPY_SCALAR_FIELD(nowait); + + return newnode; +} + static CreateExtensionStmt * _copyCreateExtensionStmt(const CreateExtensionStmt *from) { @@ -5655,6 +5668,9 @@ copyObjectImpl(const void *from) case T_AlterTableMoveAllStmt: retval = _copyAlterTableMoveAllStmt(from); break; + case T_AlterTableSetLoggedAllStmt: + retval = _copyAlterTableSetLoggedAllStmt(from); + break; case T_CreateExtensionStmt: retval = _copyCreateExtensionStmt(from); break; diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c index f1002afe7a..b76fc872a5 100644 --- a/src/backend/nodes/equalfuncs.c +++ b/src/backend/nodes/equalfuncs.c @@ -1925,6 +1925,18 @@ _equalAlterTableMoveAllStmt(const AlterTableMoveAllStmt *a, return true; } +static bool +_equalAlterTableSetLoggedAllStmt(const AlterTableSetLoggedAllStmt *a, + const AlterTableSetLoggedAllStmt *b) +{ + COMPARE_STRING_FIELD(tablespacename); + COMPARE_SCALAR_FIELD(objtype); + 
COMPARE_SCALAR_FIELD(logged); + COMPARE_SCALAR_FIELD(nowait); + + return true; +} + static bool _equalCreateExtensionStmt(const CreateExtensionStmt *a, const CreateExtensionStmt *b) { @@ -3650,6 +3662,9 @@ equal(const void *a, const void *b) case T_AlterTableMoveAllStmt: retval = _equalAlterTableMoveAllStmt(a, b); break; + case T_AlterTableSetLoggedAllStmt: + retval = _equalAlterTableSetLoggedAllStmt(a, b); + break; case T_CreateExtensionStmt: retval = _equalCreateExtensionStmt(a, b); break; diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index a03b33b53b..f8a41de2dd 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -1985,6 +1985,48 @@ AlterTableStmt: n->nowait = $13; $$ = (Node *)n; } + | ALTER TABLE ALL IN_P TABLESPACE name SET LOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->logged = true; + n->nowait = $9; + $$ = (Node *)n; + } + | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET LOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->roles = $9; + n->logged = true; + n->nowait = $12; + $$ = (Node *)n; + } + | ALTER TABLE ALL IN_P TABLESPACE name SET UNLOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->logged = false; + n->nowait = $9; + $$ = (Node *)n; + } + | ALTER TABLE ALL IN_P TABLESPACE name OWNED BY role_list SET UNLOGGED opt_nowait + { + AlterTableSetLoggedAllStmt *n = + makeNode(AlterTableSetLoggedAllStmt); + n->tablespacename = $6; + n->objtype = OBJECT_TABLE; + n->roles = $9; + n->logged = false; + n->nowait = $12; + $$ = (Node *)n; + } | ALTER INDEX qualified_name alter_table_cmds { AlterTableStmt *n = makeNode(AlterTableStmt); diff --git a/src/backend/tcop/utility.c 
b/src/backend/tcop/utility.c index 3780c6e812..80d1e360b3 100644 --- a/src/backend/tcop/utility.c +++ b/src/backend/tcop/utility.c @@ -163,6 +163,7 @@ ClassifyUtilityCommandAsReadOnly(Node *parsetree) case T_AlterTSConfigurationStmt: case T_AlterTSDictionaryStmt: case T_AlterTableMoveAllStmt: + case T_AlterTableSetLoggedAllStmt: case T_AlterTableSpaceOptionsStmt: case T_AlterTableStmt: case T_AlterTypeStmt: @@ -1753,6 +1754,12 @@ ProcessUtilitySlow(ParseState *pstate, commandCollected = true; break; + case T_AlterTableSetLoggedAllStmt: + AlterTableSetLoggedAll((AlterTableSetLoggedAllStmt *) parsetree); + /* commands are stashed in AlterTableSetLoggedAll */ + commandCollected = true; + break; + case T_DropStmt: ExecDropStmt((DropStmt *) parsetree, isTopLevel); /* no commands stashed for DROP */ @@ -2675,6 +2682,10 @@ CreateCommandTag(Node *parsetree) tag = AlterObjectTypeCommandTag(((AlterTableMoveAllStmt *) parsetree)->objtype); break; + case T_AlterTableSetLoggedAllStmt: + tag = AlterObjectTypeCommandTag(((AlterTableSetLoggedAllStmt *) parsetree)->objtype); + break; + case T_AlterTableStmt: tag = AlterObjectTypeCommandTag(((AlterTableStmt *) parsetree)->objtype); break; diff --git a/src/include/commands/tablecmds.h b/src/include/commands/tablecmds.h index 5d4037f26e..c381dad3e5 100644 --- a/src/include/commands/tablecmds.h +++ b/src/include/commands/tablecmds.h @@ -42,6 +42,8 @@ extern void AlterTableInternal(Oid relid, List *cmds, bool recurse); extern Oid AlterTableMoveAll(AlterTableMoveAllStmt *stmt); +extern void AlterTableSetLoggedAll(AlterTableSetLoggedAllStmt *stmt); + extern ObjectAddress AlterTableNamespace(AlterObjectSchemaStmt *stmt, Oid *oldschema); diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h index 5d075f0c34..d8e1f223c8 100644 --- a/src/include/nodes/nodes.h +++ b/src/include/nodes/nodes.h @@ -430,6 +430,7 @@ typedef enum NodeTag T_AlterCollationStmt, T_CallStmt, T_AlterStatsStmt, + T_AlterTableSetLoggedAllStmt, /* * TAGS FOR 
PARSE TREE NODES (parsenodes.h) diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h index 1617702d9d..4fa9d9360f 100644 --- a/src/include/nodes/parsenodes.h +++ b/src/include/nodes/parsenodes.h @@ -2352,6 +2352,16 @@ typedef struct AlterTableMoveAllStmt bool nowait; } AlterTableMoveAllStmt; +typedef struct AlterTableSetLoggedAllStmt +{ + NodeTag type; + char *tablespacename; + ObjectType objtype; /* Object type to move */ + List *roles; /* List of roles to change objects of */ + bool logged; + bool nowait; +} AlterTableSetLoggedAllStmt; + /* ---------------------- * Create/Alter Extension Statements * ---------------------- diff --git a/src/test/regress/expected/tablespace.out b/src/test/regress/expected/tablespace.out index 2dfbcfdebe..c02afdcb68 100644 --- a/src/test/regress/expected/tablespace.out +++ b/src/test/regress/expected/tablespace.out @@ -943,5 +943,81 @@ drop cascades to table testschema.asexecute drop cascades to table testschema.part drop cascades to table testschema.atable drop cascades to table testschema.tablespace_acl +-- +-- Check persistence change in a tablespace +CREATE SCHEMA testschema; +GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1; +CREATE TABLESPACE regress_tablespace LOCATION ''; +GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1; +CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace; +CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace; +CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default; +CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default; +SET ROLE regress_tablespace_user1; +CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace; +CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace; +CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default; +CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default; +SELECT relname, t.spcname, relpersistence + FROM pg_class c LEFT JOIN 
pg_tablespace t ON (c.reltablespace = t.oid) + WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid; + relname | spcname | relpersistence +---------+--------------------+---------------- + lsu | regress_tablespace | p + usu | regress_tablespace | u + lu1 | regress_tablespace | p + uu1 | regress_tablespace | u + _lsu | | p + _usu | | u + _lu1 | | p + _uu1 | | u +(8 rows) + +ALTER TABLE ALL IN TABLESPACE regress_tablespace + OWNED BY regress_tablespace_user1 SET LOGGED; +SELECT relname, t.spcname, relpersistence + FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid) + WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid; + relname | spcname | relpersistence +---------+--------------------+---------------- + lsu | regress_tablespace | p + usu | regress_tablespace | u + lu1 | regress_tablespace | p + uu1 | regress_tablespace | p + _lsu | | p + _usu | | u + _lu1 | | p + _uu1 | | u +(8 rows) + +RESET ROLE; +ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED; +SELECT relname, t.spcname, relpersistence + FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid) + WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid; + relname | spcname | relpersistence +---------+--------------------+---------------- + lsu | regress_tablespace | u + usu | regress_tablespace | u + lu1 | regress_tablespace | u + uu1 | regress_tablespace | u + _lsu | | p + _usu | | u + _lu1 | | p + _uu1 | | u +(8 rows) + +-- Should succeed +DROP SCHEMA testschema CASCADE; +NOTICE: drop cascades to 8 other objects +DETAIL: drop cascades to table testschema.lsu +drop cascades to table testschema.usu +drop cascades to table testschema._lsu +drop cascades to table testschema._usu +drop cascades to table testschema.lu1 +drop cascades to table testschema.uu1 +drop cascades to table testschema._lu1 +drop cascades to table testschema._uu1 +DROP TABLESPACE regress_tablespace; DROP ROLE regress_tablespace_user1; DROP ROLE 
regress_tablespace_user2; diff --git a/src/test/regress/sql/tablespace.sql b/src/test/regress/sql/tablespace.sql index 896f05cea3..4e407eb8c0 100644 --- a/src/test/regress/sql/tablespace.sql +++ b/src/test/regress/sql/tablespace.sql @@ -419,5 +419,46 @@ DROP TABLESPACE regress_tblspace_renamed; DROP SCHEMA testschema CASCADE; + +-- +-- Check persistence change in a tablespace +CREATE SCHEMA testschema; +GRANT CREATE ON SCHEMA testschema TO regress_tablespace_user1; +CREATE TABLESPACE regress_tablespace LOCATION ''; +GRANT CREATE ON TABLESPACE regress_tablespace TO regress_tablespace_user1; + +CREATE TABLE testschema.lsu(a int) TABLESPACE regress_tablespace; +CREATE UNLOGGED TABLE testschema.usu(a int) TABLESPACE regress_tablespace; +CREATE TABLE testschema._lsu(a int) TABLESPACE pg_default; +CREATE UNLOGGED TABLE testschema._usu(a int) TABLESPACE pg_default; +SET ROLE regress_tablespace_user1; +CREATE TABLE testschema.lu1(a int) TABLESPACE regress_tablespace; +CREATE UNLOGGED TABLE testschema.uu1(a int) TABLESPACE regress_tablespace; +CREATE TABLE testschema._lu1(a int) TABLESPACE pg_default; +CREATE UNLOGGED TABLE testschema._uu1(a int) TABLESPACE pg_default; + +SELECT relname, t.spcname, relpersistence + FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid) + WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid; + +ALTER TABLE ALL IN TABLESPACE regress_tablespace + OWNED BY regress_tablespace_user1 SET LOGGED; + +SELECT relname, t.spcname, relpersistence + FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid) + WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid; + +RESET ROLE; + +ALTER TABLE ALL IN TABLESPACE regress_tablespace SET UNLOGGED; + +SELECT relname, t.spcname, relpersistence + FROM pg_class c LEFT JOIN pg_tablespace t ON (c.reltablespace = t.oid) + WHERE relnamespace = 'testschema'::regnamespace ORDER BY spcname, c.oid; + +-- Should succeed +DROP SCHEMA testschema CASCADE; +DROP 
TABLESPACE regress_tablespace; + DROP ROLE regress_tablespace_user1; DROP ROLE regress_tablespace_user2; -- 2.27.0
On Tue, Mar 01, 2022 at 02:14:13PM +0900, Kyotaro Horiguchi wrote:
> Rebased on a recent xlog refactoring.

It'll come as no surprise that this needs to be rebased again.
At least a few typos I reported in January aren't fixed.
Set to "waiting".
Thanks! Version 20 is attached.

At Wed, 30 Mar 2022 08:44:02 -0500, Justin Pryzby <pryzby@telsasoft.com> wrote in
> On Tue, Mar 01, 2022 at 02:14:13PM +0900, Kyotaro Horiguchi wrote:
> > Rebased on a recent xlog refactoring.
>
> It'll come as no surprise that this needs to be rebased again.
> At least a few typos I reported in January aren't fixed.
> Set to "waiting".

Oh, I'm sorry for overlooking it. It somehow didn't show up in my mailer.

> I started looking at this and reviewed docs and comments again.
>
> > +typedef struct PendingCleanup
> > +{
> > + RelFileNode relnode; /* relation that may need to be deleted */
> > + int op; /* operation mask */
> > + bool bufpersistence; /* buffer persistence to set */
> > + int unlink_forknum; /* forknum to unlink */
>
> This can be of data type "ForkNumber"

Right. Fixed.

> > + * We are going to create an init fork. If server crashes before the
> > + * current transaction ends the init fork left alone corrupts data while
> > + * recovery. The mark file works as the sentinel to identify that
> > + * situation.
>
> s/while/during/

This was in v17, but disappeared in v18.

> > + * index-init fork needs further initialization. ambuildempty shoud do
>
> should (I reported this before)
>
> > + if (inxact_created)
> > + {
> > + SMgrRelation srel = smgropen(rnode, InvalidBackendId);
> > +
> > + /*
> > + * INIT forks never be loaded to shared buffer so no point in dropping
>
> "are never loaded"

It was fixed in v18.

> > + elog(DEBUG1, "perform in-place persistnce change");
>
> persistence (I reported this before)

Sorry. Fixed.

> > + /*
> > + * While wal_level >= replica, switching to LOGGED requires the
> > + * relation content to be WAL-logged to recover the table.
> > + * We don't emit this fhile wal_level = minimal.
>
> while (or "if")

There are "While" and "fhile". I changed the latter to "if".

> > + * The relation is persistent and stays remain persistent. Don't
> > + * drop the buffers for this relation.
>
> "stays remain" is redundant (I reported this before)

Thanks. I changed it to "stays persistent".

> > + if (unlink(rm_path) < 0)
> > + ereport(ERROR,
> > + (errcode_for_file_access(),
> > + errmsg("could not remove file \"%s\": %m",
> > + rm_path)));
>
> The parens around errcode are unnecessary since last year.
> I suggest to avoid using them here and elsewhere.

It is just moved from elsewhere without editing, but of course I can do that. I didn't know about that change to ereport and could not find the corresponding commit, but I found that Tom mentioned that change.

https://www.postgresql.org/message-id/flat/5063.1584641224%40sss.pgh.pa.us#63e611c30800133bbddb48de857668e8

> Now that we can rely on having varargs macros, I think we could
> stop requiring the extra level of parentheses, ie instead of
...
> ereport(ERROR,
> errcode(ERRCODE_DIVISION_BY_ZERO),
> errmsg("division by zero"));
>
> (The old syntax had better still work, of course. I'm not advocating
> running around and changing existing calls.)

I changed all ereport calls added by this patch to this style.

> > + * And we may have SMGR_MARK_UNCOMMITTED file. Remove it if the
> > + * fork files has been successfully removed. It's ok if the file
>
> file

Fixed.

> > + <para>
> > + All tables in the current database in a tablespace can be changed by using
>
> given tablespace

I did /database in a tablespace/database in the given tablespace/. Is it right?

> > + the <literal>ALL IN TABLESPACE</literal> form, which will lock all tables
>
> which will first lock
>
> > + to be changed first and then change each one. This form also supports
>
> remove "first" here

This is almost a dead copy of the description of SET TABLESPACE. This change makes the two almost-identical descriptions differ slightly in wording. Anyway, I did that as suggested, only for the part this patch adds in this version.

> > + <literal>OWNED BY</literal>, which will only change tables owned by the
> > + roles specified. If the <literal>NOWAIT</literal> option is specified
>
> specified roles.
> is specified, (comma)

This is the same as above. I did that, but it makes the description differ from another almost-the-same description.

> > + then the command will fail if it is unable to acquire all of the locks
>
> if it is unable to immediately acquire
>
> > + required immediately. The <literal>information_schema</literal>
>
> remove immediately

Ditto.

> > + relations are not considered part of the system catalogs and will be
>
> I think you need to first say that "relations in the pg_catalog schema cannot
> be changed".

Mmm. I don't agree on this. Aren't such "exceptions"-ish descriptions usually placed after the descriptions of how the feature works? This is also the same structure as SET TABLESPACE.

> in patch 2/2:
> typo: persistene

Hmm. Bad. I checked the spelling throughout the patches and found some typos.

+ * The crashed trasaction did SET UNLOGGED. This relation
+ * is restored to a LOGGED relation.

s/trasaction/transaction/

+ errmsg("could not crete mark file \"%s\": %m", path));

s/crete/create/

Then rebased on 9c08aea6a3 then pgindent'ed.

Thanks!

--
Kyotaro Horiguchi
NTT Open Source Software Center
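Editorial illustration of the ereport() style point in the message above: the quoted remark says that, with varargs macros available, the extra level of parentheses is no longer required while the old syntax still works. The following is only a minimal sketch of why both spellings can be accepted by one variadic macro. It does not reproduce PostgreSQL's actual macro definition; REPORT, set_code, set_msg and the numeric code are hypothetical stand-ins invented for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for error-reporting state; not PostgreSQL code. */
static int  report_code;
static char report_msg[128];

/* Hypothetical auxiliary calls, playing the role of errcode()/errmsg(). */
static int
set_code(int code)
{
    report_code = code;
    return 0;
}

static int
set_msg(const char *msg)
{
    snprintf(report_msg, sizeof(report_msg), "%s", msg);
    return 0;
}

/*
 * Variadic macro (assumed shape, for illustration only): the auxiliary
 * calls end up joined by the comma operator, so they can be passed bare
 * or wrapped in one extra pair of parentheses.
 */
#define REPORT(...) ((void) (__VA_ARGS__))

/* Old style: extra parentheses around the auxiliary calls. */
static void
report_old_style(void)
{
    REPORT((set_code(22012), set_msg("division by zero")));
}

/* New style: auxiliary calls passed directly as macro arguments. */
static void
report_new_style(void)
{
    REPORT(set_code(22012), set_msg("division by zero"));
}
```

Both functions expand to a single comma expression that evaluates the auxiliary calls left to right, which is why existing call sites would not need to be touched if a macro is shaped this way.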
Attachment
On Thu, Mar 31, 2022 at 01:58:45PM +0900, Kyotaro Horiguchi wrote:
> Thanks! Version 20 is attached.

The patch failed on all CI tasks, and seems to have caused the macos task to hang.

http://cfbot.cputube.org/kyotaro-horiguchi.html

Would you send a fixed patch, or remove this thread from the CFBOT? Otherwise cirrus will try every day to rerun but take 1hr to time out, which is twice as slow as the slowest OS.

I think this patch should be moved to the next CF and set to v16.

Thanks,
--
Justin
At Thu, 31 Mar 2022 00:37:07 -0500, Justin Pryzby <pryzby@telsasoft.com> wrote in
> On Thu, Mar 31, 2022 at 01:58:45PM +0900, Kyotaro Horiguchi wrote:
> > Thanks! Version 20 is attached.
>
> The patch failed on all CI tasks, and seems to have caused the macos task to
> hang.
>
> http://cfbot.cputube.org/kyotaro-horiguchi.html
>
> Would you send a fixed patch, or remove this thread from the CFBOT? Otherwise
> cirrus will try every day to rerun but take 1hr to time out, which is twice
> as slow as the slowest OS.

That turned out to be a thinko that leaves mark files behind in a new database created by the logged version of CREATE DATABASE. It is easily fixed.

That being said, this failure revealed that pg_checksums or pg_basebackup dislikes the mark files. That can happen, even if only with quite a low probability. This would need further consideration and extra rounds of review.

> I think this patch should be moved to the next CF and set to v16.

I don't think this can be committed to 15. So I'll post the fixed version and then move this to the next CF.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachment
At Thu, 31 Mar 2022 18:33:18 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> I don't think this can be committed to 15. So I post the fixed version
> then move this to the next CF.

Then done. Thanks!

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Mar 31, 2022 at 2:36 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Thu, 31 Mar 2022 18:33:18 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > I don't think this can be committed to 15. So I post the fixed version
> > then move this to the next CF.
>
> Then done. Thanks!

Hello! This patchset will need to be rebased over latest -- looks like b74e94dc27f (Rethink PROCSIGNAL_BARRIER_SMGRRELEASE) and 5c279a6d350 (Custom WAL Resource Managers) are interfering.

Thanks,
--Jacob
At Wed, 6 Jul 2022 08:44:18 -0700, Jacob Champion <jchampion@timescale.com> wrote in
> On Thu, Mar 31, 2022 at 2:36 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
> > At Thu, 31 Mar 2022 18:33:18 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > > I don't think this can be committed to 15. So I post the fixed version
> > > then move this to the next CF.
> >
> > Then done. Thanks!
>
> Hello! This patchset will need to be rebased over latest -- looks like
> b74e94dc27f (Rethink PROCSIGNAL_BARRIER_SMGRRELEASE) and 5c279a6d350
> (Custom WAL Resource Managers) are interfering.

Thank you for checking that! It was hit even harder by b0a55e4329 (RelFileNumber). The commit message suggests that "relfilenode", where it means files, should be replaced with "relation storage/file", so I did that in ResetUnloggedRelationsInDbspaceDir.

This patch said that:

> * INIT forks are never loaded to shared buffer so no point in
> * dropping buffers for such files.

But actually some *buildempty() functions use ReadBufferExtended() for INIT_FORK, so that's wrong. So I fixed that, but... I don't like it. Or rather, I don't like that some AMs leave buffers for the INIT fork behind. But I feel I'm misunderstanding something here, since I don't understand how the INIT fork can work as expected after a crash that happens before the next checkpoint flushes the buffers.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachment
(Mmm. I hadn't noticed the annoying misspelling in the subject X( )

> Thank you for checking that! It got a wider attack by b0a55e4329
> (RelFileNumber). The commit message suggests "relfilenode" as files

And now I have stepped on my own foot. Rebased on the nodefuncs autogeneration as well.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachment
Just rebased. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Attachment
On Wed, Sep 28, 2022 at 17:21, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> Just rebased.

Hi

cfbot reports the patch no longer applies. As CommitFest 2022-11 is currently underway, this would be an excellent time to update the patch.

Thanks

Ian Barwick
At Fri, 4 Nov 2022 09:32:52 +0900, Ian Lawrence Barwick <barwick@gmail.com> wrote in > cfbot reports the patch no longer applies. As CommitFest 2022-11 is > currently underway, this would be an excellent time to update the patch. Indeed, thanks! I'll do that in a few days. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Tue, 08 Nov 2022 11:33:53 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> Indeed, thanks! I'll do that in a few days.

It took too long, but rebased. The contents of the two patches in the last version were a bit shuffled, but that is fixed now.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachment
I want to call out this part of this patch:

> Also this allows for the cleanup of files left behind in the crash of
> the transaction that created it.

This is interesting to a much wider audience than ALTER TABLE SET LOGGED/UNLOGGED. It also adds most of the complexity, with the new marker files. Can you please split the first patch into two:

1. Cleanup of newly created relations on crash

2. ALTER TABLE SET LOGGED/UNLOGGED changes

Then we can review the first part independently.

Regarding the first part, I'm not sure the marker files are the best approach to implement it. You need to create an extra file for every relation, just to delete it at commit. It feels a bit silly, but maybe it's OK in practice. The undo log patch set solved this problem with the undo log, but it looks like that patch set isn't going anywhere. Maybe invent a very lightweight version of the undo log for this?

- Heikki
Thank you for the comment!

At Fri, 3 Feb 2023 08:42:52 +0100, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
> I want to call out this part of this patch:
>
> > Also this allows for the cleanup of files left behind in the crash of
> > the transaction that created it.
>
> This is interesting to a lot wider audience than ALTER TABLE SET
> LOGGED/UNLOGGED. It also adds most of the complexity, with the new
> marker files. Can you please split the first patch into two:
>
> 1. Cleanup of newly created relations on crash
>
> 2. ALTER TABLE SET LOGGED/UNLOGGED changes
>
> Then we can review the first part independently.

Ah, indeed. I'll do that.

> Regarding the first part, I'm not sure the marker files are the best
> approach to implement it. You need to create an extra file for every
> relation, just to delete it at commit. It feels a bit silly, but maybe

Agreed. (But I couldn't come up with a better idea.)

> it's OK in practice. The undo log patch set solved this problem with
> the undo log, but it looks like that patch set isn't going
> anywhere. Maybe invent a very lightweight version of the undo log for
> this?

I hadn't thought along those lines. Yes, indeed, the marker files are a kind of undo log.

Anyway, I'll split the current patch into two parts as suggested.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, 6 Feb 2023 at 23:48, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > Thank you for the comment! > > At Fri, 3 Feb 2023 08:42:52 +0100, Heikki Linnakangas <hlinnaka@iki.fi> wrote in > > I want to call out this part of this patch: Looks like this patch has received some solid feedback from Heikki and you have a path forward. It's not currently building in the build farm either. I'll set the patch to Waiting on Author for now. -- Gregory Stark As Commitfest Manager
At Wed, 1 Mar 2023 14:56:25 -0500, "Gregory Stark (as CFM)" <stark.cfm@gmail.com> wrote in
> On Mon, 6 Feb 2023 at 23:48, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> >
> > Thank you for the comment!
> >
> > At Fri, 3 Feb 2023 08:42:52 +0100, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
> > > I want to call out this part of this patch:
>
> Looks like this patch has received some solid feedback from Heikki and
> you have a path forward. It's not currently building in the build farm
> either.
>
> I'll set the patch to Waiting on Author for now.

To be precise, there are three parts. The attached patch is the first part - the storage mark files, which are used to identify storage files that have not been committed and should be removed during the next startup. This feature resolves the issue of orphaned storage files that may result from a crash occurring during the execution of a transaction involving the creation of a new table.

I'll post all of the three parts shortly.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachment
At Fri, 03 Mar 2023 18:03:53 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> To be precise, there are three parts. The attached patch is the first part -
> the storage mark files, which are used to identify storage files that
> have not been committed and should be removed during the next
> startup. This feature resolves the issue of orphaned storage files
> that may result from a crash occurring during the execution of a
> transaction involving the creation of a new table.
>
> I'll post all of the three parts shortly.

Mmm. It took longer than I said, but this is the patch set that includes all three parts.

1. "Mark files" to prevent orphan storage files for in-transaction created relations after a crash.

2. In-place persistence change: For ALTER TABLE SET LOGGED/UNLOGGED with wal_level minimal, and ALTER TABLE SET UNLOGGED with other wal_levels, the commands don't require a file copy for the relation storage. ALTER TABLE SET LOGGED with non-minimal wal_level emits bulk FPIs instead of a bunch of individual INSERTs.

3. An extension to ALTER TABLE SET (UN)LOGGED that can handle all tables in a tablespace at once.

As a side note, let me quickly go over the behavior of the mark files introduced by the first patch, particularly what happens when deletion fails.

(1) The mark file for MAIN fork ("<oid>.u") corresponds to all forks, while the mark file for INIT fork ("<oid>_init.u") corresponds to INIT fork alone.

(2) The mark file is created just before the corresponding storage file is made. This is always logged in the WAL.

(3) The mark file is deleted after removing the corresponding storage file during the commit and rollback. This action is logged in the WAL, too. If the deletion fails, an ERROR is raised and the transaction aborts.

(4) If a crash leaves a mark file behind, the server will try to delete it after successfully removing the corresponding storage file during the subsequent startup that runs a recovery. If deletion fails, the server leaves the mark file in place and emits a WARNING. (The same behavior as for non-mark files.)

(5) If the deletion of the mark file fails, the leftover mark file prevents the creation of the corresponding storage file (causing an ERROR). Because of that behavior, leftover mark files don't result in the removal of the wrong files.

(6) The mark file for an INIT fork is created only when ALTER TABLE SET UNLOGGED is executed (not for CREATE UNLOGGED TABLE) to signal the crash-cleanup code to remove the INIT fork. (Otherwise the cleanup code removes the main fork instead. This is the main objective of introducing the mark files.)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
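Editorial illustration of points (1) and (6) in the message above: a minimal C sketch of how a cleanup scan could tell the two kinds of mark files apart by name. It is not taken from the patch. The function and type names are hypothetical, segment numbers are ignored, and the only facts used from the thread are the "<oid>.u" and "<oid>_init.u" naming and the rule that the former covers all forks while the latter covers the INIT fork alone.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* What a directory entry means to this (hypothetical) crash-cleanup scan. */
typedef enum
{
    NOT_A_MARK,             /* storage file or something unrelated */
    MARK_ALL_FORKS,         /* "<oid>.u": all forks are uncommitted */
    MARK_INIT_FORK_ONLY     /* "<oid>_init.u": only the INIT fork is */
} MarkKind;

/*
 * Classify a file name using the naming from point (1).
 * *relnumber is written only when a mark file is recognized.
 */
static MarkKind
classify_mark_file(const char *name, unsigned long *relnumber)
{
    char       *end;
    unsigned long value;

    if (!isdigit((unsigned char) name[0]))
        return NOT_A_MARK;

    value = strtoul(name, &end, 10);

    if (strcmp(end, ".u") == 0)
    {
        *relnumber = value;
        return MARK_ALL_FORKS;
    }
    if (strcmp(end, "_init.u") == 0)
    {
        *relnumber = value;
        return MARK_INIT_FORK_ONLY;
    }
    return NOT_A_MARK;
}

/*
 * Decide whether cleanup should remove the fork with the given file-name
 * suffix ("" for MAIN, "_fsm", "_vm", "_init").  Returns 1 for "remove".
 */
static int
fork_removed_by_mark(MarkKind mark, const char *fork_suffix)
{
    switch (mark)
    {
        case MARK_ALL_FORKS:
            return 1;
        case MARK_INIT_FORK_ONLY:
            return strcmp(fork_suffix, "_init") == 0;
        default:
            return 0;
    }
}
```

Under these assumptions, a leftover "16384_init.u" would lead the scan to remove only the INIT fork of relation 16384 and keep its main fork, which is the case point (6) describes for a crashed ALTER TABLE SET UNLOGGED.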
Attachment
At Fri, 17 Mar 2023 15:16:34 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> Mmm. It took longer than I said, but this is the patch set that
> includes all three parts.
>
> 1. "Mark files" to prevent orphan storage files for in-transaction
> created relations after a crash.
>
> 2. In-place persistence change: For ALTER TABLE SET LOGGED/UNLOGGED
> with wal_level minimal, and ALTER TABLE SET UNLOGGED with other
> wal_levels, the commands don't require a file copy for the relation
> storage. ALTER TABLE SET LOGGED with non-minimal wal_level emits
> bulk FPIs instead of a bunch of individual INSERTs.
>
> 3. An extension to ALTER TABLE SET (UN)LOGGED that can handle all
> tables in a tablespace at once.
>
>
> As a side note, I quickly go over the behavior of the mark files
> introduced by the first patch, particularly what happens when deletion
> fails.
>
> (1) The mark file for MAIN fork ("<oid>.u") corresponds to all forks,
> while the mark file for INIT fork ("<oid>_init.u") corresponds to
> INIT fork alone.
>
> (2) The mark file is created just before the corresponding storage
> file is made. This is always logged in the WAL.
>
> (3) The mark file is deleted after removing the corresponding storage
> file during the commit and rollback. This action is logged in the
> WAL, too. If the deletion fails, an ERROR is output and the
> transaction aborts.
>
> (4) If a crash leaves a mark file behind, server will try to delete it
> after successfully removing the corresponding storage file during
> the subsequent startup that runs a recovery. If deletion fails,
> server leaves the mark file alone with emitting a WARNING. (The
> same behavior for non-mark files.)
>
> (5) If the deletion of the mark file fails, the leftover mark file
> prevents the creation of the corresponding storage file (causing
> an ERROR). The leftover mark files don't result in the removal of
> the wrong files due to that behavior.
>
> (6) The mark file for an INIT fork is created only when ALTER TABLE
> SET UNLOGGED is executed (not for CREATE UNLOGGED TABLE) to signal
> the crash-cleanup code to remove the INIT fork. (Otherwise the
> cleanup code removes the main fork instead. This is the main
> objective of introducing the mark files.)

Rebased.

I fixed some code comments and commit messages. I fixed the wrong arrangement of some changes among the patches. Most importantly, I fixed a bug caused by the wrong assumption that an init fork never resides in shared buffers. Now smgrDoPendingCleanups drops the buffers of an init fork that is to be removed.

The new fourth patch is a temporary fix for recently added code, which will soon be no longer needed.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachment
On Tue, Apr 25, 2023 at 9:55 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> Rebased.
>
> I fixed some code comments and commit messages. I fixed the wrong
> arrangement of some changes among patches. Most importantly, I fixed
> the a bug based on a wrong assumption that init-fork is not resides on
> shared buffers. Now smgrDoPendingCleanups drops buffer for a init-fork
> to be removed.
>
> The new fourth patch is a temporary fix for recently added code, which
> will soon be no longer needed.
>

Hi Kyotaro,

I've retested v28 of the patch with everything that came to my mind (basic tests, --enable-tap-tests, restarts/crashes along adding the data, checking if there were any files left over and I've checked for stuff that earlier was causing problems: GiST on geometry[PostGIS]). The only thing I've not tested this time were the performance runs done earlier. The patch passed all my very limited tests along with make check-world. Patch looks good to me on the surface from a usability point of view. I haven't looked at the code, so the patch might still need an in-depth review.

Regards,
-Jakub Wartak.
(I find the misspelled subject makes it difficult to find the thread..) At Thu, 27 Apr 2023 14:47:41 +0200, Jakub Wartak <jakub.wartak@enterprisedb.com> wrote in > On Tue, Apr 25, 2023 at 9:55 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > > Rebased. > > > > I fixed some code comments and commit messages. I fixed the wrong > > arrangement of some changes among patches. Most importantly, I fixed > > the a bug based on a wrong assumption that init-fork is not resides on > > shared buffers. Now smgrDoPendingCleanups drops buffer for a init-fork > > to be removed. > > > > The new fourth patch is a temporary fix for recently added code, which > > will soon be no longer needed. This is no longer needed. Thank you, Thomas! > Hi Kyotaro, > > I've retested v28 of the patch with everything that came to my mind > (basic tests, --enable-tap-tests, restarts/crashes along adding the > data, checking if there were any files left over and I've checked for > stuff that earlier was causing problems: GiST on geometry[PostGIS]). Maybe it's fixed by dropping buffers. > The only thing I've not tested this time were the performance runs > done earlier. The patch passed all my very limited tests along with > make check-world. Patch looks good to me on the surface from a > usability point of view. I haven't looked at the code, so the patch > might still need an in-depth review. Thank you for conducting a thorough test. In this patchset, the first one might be useful on its own and it is the most complex part. I'll recheck it. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
I think there are some good ideas here. I started to take a look at the patches, and I've attached a rebased version of the patch set. Apologies if I am repeating any discussions from upthread. First, I tested the time difference in ALTER TABLE SET UNLOGGED/LOGGED with the patch applied, and the results looked pretty impressive. before: postgres=# alter table test set unlogged; ALTER TABLE Time: 5108.071 ms (00:05.108) postgres=# alter table test set logged; ALTER TABLE Time: 6747.648 ms (00:06.748) after: postgres=# alter table test set unlogged; ALTER TABLE Time: 25.609 ms postgres=# alter table test set logged; ALTER TABLE Time: 1241.800 ms (00:01.242) My first question is whether 0001 is a prerequisite to 0002. I'm assuming it is, but the reason wasn't immediately obvious to me. If it's just nice-to-have, perhaps we could simplify the patch set a bit. I see that Heikki had some general concerns with the marker file approach [0], so perhaps it is at least worth brainstorming some alternatives if we _do_ need it. [0] https://postgr.es/m/9827ebd3-de2e-fd52-4091-a568387b1fc2%40iki.fi -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
Thank you for looking at this! At Mon, 14 Aug 2023 12:38:48 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in > I think there are some good ideas here. I started to take a look at the > patches, and I've attached a rebased version of the patch set. Apologies > if I am repeating any discussions from upthread. > > First, I tested the time difference in ALTER TABLE SET UNLOGGED/LOGGED with > the patch applied, and the results looked pretty impressive. > > before: > postgres=# alter table test set unlogged; > ALTER TABLE > Time: 5108.071 ms (00:05.108) > postgres=# alter table test set logged; > ALTER TABLE > Time: 6747.648 ms (00:06.748) > > after: > postgres=# alter table test set unlogged; > ALTER TABLE > Time: 25.609 ms > postgres=# alter table test set logged; > ALTER TABLE > Time: 1241.800 ms (00:01.242) Thanks for the confirmation. The difference between the two directions is that making a table logged requires emitting WAL records for the entire content. > My first question is whether 0001 is a prerequisite to 0002. I'm assuming > it is, but the reason wasn't immediately obvious to me. If it's just In 0002, if a backend crashes after creating an init fork file but before the associated commit, a lingering fork file could result in data loss on the next startup. Thus, an utterly reliable file cleanup mechanism is essential. 0001 also addresses the orphan storage files issue arising from ALTER TABLE and similar commands. > nice-to-have, perhaps we could simplify the patch set a bit. I see that > Heikki had some general concerns with the marker file approach [0], so > perhaps it is at least worth brainstorming some alternatives if we _do_ > need it. The rationale behind the file-based implementation is that any leftover init fork file from a crash needs to be deleted before the reinit(INIT) process kicks in, which happens independently of WAL, before the start of crash recovery. 
I could implement it separately from the reinit module, but I didn't, since that would result in near-duplicate code. As commented in xlog.c, the purpose of the pre-recovery reinit CLEANUP phase is to ensure hot standbys don't encounter erroneous unlogged relations. Based on that requirement, we need a mechanism to guarantee that additional crucial operations are executed reliably at the next startup post-crash, right before recovery kicks in (or reinit CLEANUP). 0001 persists this data on a per-operation basis, tightly bound to the target objects. I could turn this into something like undo logs in a simple form, but I'd rather not craft a general-purpose undo log system for this unless it's absolutely necessary. > [0] https://postgr.es/m/9827ebd3-de2e-fd52-4091-a568387b1fc2%40iki.fi regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Thu, 24 Aug 2023 11:22:32 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > I could turn this into something like undo logs in a simple form, but > I'd rather not craft a general-purpose undo log system for this unless > it's absolutely necessary. This is a patch for a basic undo log implementation. It looks like it works well for some orphan-files-after-a-crash and data-loss-on-reinit cases. However, it is far from complete and likely has issues with crash-safety and the durability of undo log files (and memory leaks and performance and..). I'm posting this to move the discussion forward. (This doesn't contain the third file "ALTER TABLE ..ALL IN TABLESPACE" part.) regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Attachment
On Mon, 4 Sept 2023 at 16:59, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Thu, 24 Aug 2023 11:22:32 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > > I could turn this into something like undo longs in a simple form, but > > I'd rather not craft a general-purpose undo log system for this unelss > > it's absolutely necessary. > > This is a patch for a basic undo log implementation. It looks like it > works well for some orphan-files-after-a-crash and data-loss-on-reinit > cases. However, it is far from complete and likely has issues with > crash-safety and the durability of undo log files (and memory leaks > and performance and..). > > I'm posting this to move the discussion forward. > > (This doesn't contain the third file "ALTER TABLE ..ALL IN TABLESPACE" part.) CFBot shows compilation issues at [1] with: 09:34:44.987] /usr/bin/ld: src/backend/postgres_lib.a.p/access_transam_twophase.c.o: in function `FinishPreparedTransaction': [09:34:44.987] /tmp/cirrus-ci-build/build/../src/backend/access/transam/twophase.c:1569: undefined reference to `AtEOXact_SimpleUndoLog' [09:34:44.987] /usr/bin/ld: src/backend/postgres_lib.a.p/access_transam_xact.c.o: in function `CommitTransaction': [09:34:44.987] /tmp/cirrus-ci-build/build/../src/backend/access/transam/xact.c:2372: undefined reference to `AtEOXact_SimpleUndoLog' [09:34:44.987] /usr/bin/ld: src/backend/postgres_lib.a.p/access_transam_xact.c.o: in function `AbortTransaction': [09:34:44.987] /tmp/cirrus-ci-build/build/../src/backend/access/transam/xact.c:2878: undefined reference to `AtEOXact_SimpleUndoLog' [09:34:44.987] /usr/bin/ld: src/backend/postgres_lib.a.p/access_transam_xact.c.o: in function `CommitSubTransaction': [09:34:44.987] /tmp/cirrus-ci-build/build/../src/backend/access/transam/xact.c:5016: undefined reference to `AtEOXact_SimpleUndoLog' [09:34:44.987] /usr/bin/ld: src/backend/postgres_lib.a.p/access_transam_xact.c.o: in function `AbortSubTransaction': [09:34:44.987] 
/tmp/cirrus-ci-build/build/../src/backend/access/transam/xact.c:5197: undefined reference to `AtEOXact_SimpleUndoLog' [09:34:44.987] /usr/bin/ld: src/backend/postgres_lib.a.p/access_transam_xact.c.o:/tmp/cirrus-ci-build/build/../src/backend/access/transam/xact.c:6080: more undefined references to `AtEOXact_SimpleUndoLog' follow [1] - https://cirrus-ci.com/task/5916232528953344 Regards, Vignesh
At Tue, 9 Jan 2024 15:07:20 +0530, vignesh C <vignesh21@gmail.com> wrote in > CFBot shows compilation issues at [1] with: Thanks! The reason for those errors was that I didn't consider Meson at the time. Additionally, the signature change of reindex_index() caused the build failure. I fixed both issues. While addressing these issues, I modified the simpleundolog module to honor wal_sync_method. Previously, the sync method for undo logs was determined independently, separate from xlog.c. However, I'm still not satisfied with the method for handling PG_O_DIRECT. In this version, I have added the changes to enable the use of wal_sync_method outside of xlog.c as the first part of the patchset. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Attachment
2024-01 Commitfest. Hi, This patch has a CF status of "Needs Review" [1], but it seems there was a CFbot test failure last time it was run [2]. Please have a look and post an updated version if necessary. ====== [1] https://commitfest.postgresql.org/46/3461/ [2] https://cirrus-ci.com/task/6050020441456640 Kind Regards, Peter Smith.
At Mon, 22 Jan 2024 15:36:31 +1100, Peter Smith <smithpb2250@gmail.com> wrote in > 2024-01 Commitfest. > > Hi, This patch has a CF status of "Needs Review" [1], but it seems > there was a CFbot test failure last time it was run [2]. Please have a > look and post an updated version if necessary. Thanks! I have added the necessary includes to the header file this patch adds. With this change, "make headerscheck" now passes. However, when I run "make cpluspluscheck" in my environment, it generates a large number of errors in other areas, but I didn't find one related to this patch. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Attachment
Rebased. Along with rebasing, I changed the interface of XLogFsyncFile() to return a boolean instead of an error message. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Attachment
On Fri, 24 May 2024 at 00:09, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > Along with rebasing, I changed the interface of XLogFsyncFile() to > return a boolean instead of an error message. Two notes after looking at this quickly during the advanced patch feedback session: 1. I would maybe split 0003 into two separate patches. One to make SET UNLOGGED fast, which seems quite easy to do because no WAL is needed. And then a follow up to make SET LOGGED fast, which does all the XLOG_FPI stuff. 2. When wal_level = minimal, still some WAL logging is needed. The pages that were changed since the last still need to be made available for crash recovery.
On Tue, May 28, 2024 at 04:49:45PM -0700, Jelte Fennema-Nio wrote: > Two notes after looking at this quickly during the advanced patch > feedback session: > > 1. I would maybe split 0003 into two separate patches. One to make SET > UNLOGGED fast, which seems quite easy to do because no WAL is needed. > And then a follow up to make SET LOGGED fast, which does all the > XLOG_FPI stuff. Yeah, that would make sense. The LOGGED->UNLOGGED part is straight-forward because we only care about the init fork. The UNLOGGED->LOGGED case bugs me, though, a lot. > 2. When wal_level = minimal, still some WAL logging is needed. The > pages that were changed since the last still need to be made available > for crash recovery. More notes from me, as I was part of this session. + * XXXX: Some access methods don't support in-place persistence + * changes. GiST uses page LSNs to figure out whether a block has been [...] + if (r->rd_rel->relkind == RELKIND_INDEX && + /* GiST is excluded */ + r->rd_rel->relam != BTREE_AM_OID && + r->rd_rel->relam != HASH_AM_OID && + r->rd_rel->relam != GIN_AM_OID && + r->rd_rel->relam != SPGIST_AM_OID && + r->rd_rel->relam != BRIN_AM_OID) This knowledge should not be encapsulated in the backend code. The index AMs should be able to tell, instead, if they are able to support this code path so that any out-of-core index AM can decide things on its own. This ought to be split into its own patch, as simple as a boolean or a routine telling how this backend path should behave. + for (fork = 0; fork < INIT_FORKNUM; fork++) + { + if (smgrexists(RelationGetSmgr(r), fork)) + log_newpage_range(r, fork, 0, + smgrnblocks(RelationGetSmgr(r), fork), + false); + } A simple copy of the blocks means that we keep anything bloated in them, while a rewrite in ALTER TABLE means that we would start afresh by deforming the tuples from the origin before giving them to the target, without any bloat. 
The compression of the FPWs and the removal of the holes in the pages would surely limit the impact, but this has not been discussed on this thread, and this is a nice property of the existing implementation that would get silently removed by this patch set. Another point that Nathan has made is that it may be more appealing to study how this is better than an integration with the multi-INSERT APIs into AMs, so that it is possible to group the inserts in batches rather than process them one-at-a-time, see [1]. I am ready to accept that what this patch does is more efficient as long as everything is block-based in some cases. Still there is a risk-vs-gain argument here, and I am not sure whether what we have here is a good tradeoff compared to the potential risk of breaking things. The amount of new infrastructure is large for this code path. Grouping the inserts in large batches may end up being more efficient than a WAL stream full of FPWs, as well, even if toast values are deformed? So perhaps there is an argument for making that optional at query level, instead. As a whole, I can say that grouping the INSERTs will always be more efficient, while what we have here can be less efficient in some cases. I'm OK to be outvoted, but the level of complications created by this block-based copy and WAL-logging concerns me when it comes to tweaking the relpersistence like that. [1]: https://commitfest.postgresql.org/48/4777/ -- Michael
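The suggestion in the message above, that each index AM should report for itself whether it supports the in-place persistence change, could look roughly like the following. This is a simplified sketch under stated assumptions and not part of any posted patch: `SketchIndexAm`, its field `supports_inplace_persistence`, and `sketch_can_change_in_place()` are hypothetical names standing in for whatever flag or callback the real AM interface would gain.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical, heavily simplified stand-in for an index AM descriptor.
 * The field name below is an assumption made for this sketch only.
 */
typedef struct SketchIndexAm
{
    const char *name;
    bool        supports_inplace_persistence;
} SketchIndexAm;

/*
 * Core code asks the AM itself instead of comparing relam against a
 * hard-coded list of built-in AM OIDs. A missing descriptor or an unset
 * flag means "not supported", so callers fall back to the rewrite path.
 */
bool
sketch_can_change_in_place(const SketchIndexAm *am)
{
    return am != NULL && am->supports_inplace_persistence;
}
```

With a zero-initialized flag, an out-of-core AM that knows nothing about the feature would default to the existing table-rewrite path, which seems the safer default.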
Attachment
Thank you for the comments. # The most significant feedback I received was that this approach is # not misdirected.. At Tue, 4 Jun 2024 09:09:12 +0900, Michael Paquier <michael@paquier.xyz> wrote in > On Tue, May 28, 2024 at 04:49:45PM -0700, Jelte Fennema-Nio wrote: > > Two notes after looking at this quickly during the advanced patch > > feedback session: > > > > 1. I would maybe split 0003 into two separate patches. One to make SET > > UNLOGGED fast, which seems quite easy to do because no WAL is needed. > > And then a follow up to make SET LOGGED fast, which does all the > > XLOG_FPI stuff. > > Yeah, that would make sense. The LOGGED->UNLOGGED part is > straight-forward because we only care about the init fork. The > UNLOGGED->LOGGED case bugs me, though, a lot. I indeed agree with that. Will do that in the next version. > > 2. When wal_level = minimal, still some WAL logging is needed. The > > pages that were changed since the last still need to be made available > > for crash recovery. I don't quite understand this. It seems that you are referring to the LOGGED to UNLOGGED case. UNLOGGED tables are emptied after a crash, and the newly created INIT fork does the trick. Maybe I'm misunderstanding something, though. > More notes from me, as I was part of this session. > > + * XXXX: Some access methods don't support in-place persistence > + * changes. GiST uses page LSNs to figure out whether a block has been > [...] > + if (r->rd_rel->relkind == RELKIND_INDEX && > + /* GiST is excluded */ > + r->rd_rel->relam != BTREE_AM_OID && > + r->rd_rel->relam != HASH_AM_OID && > + r->rd_rel->relam != GIN_AM_OID && > + r->rd_rel->relam != SPGIST_AM_OID && > + r->rd_rel->relam != BRIN_AM_OID) > > This knowledge should not be encapsulated in the backend code. The > index AMs should be able to tell, instead, if they are able to support > this code path so as any out-of-core index AM can decide things on its > own. 
This ought to be split in its own patch, simple enough as of a > boolean or a routine telling how this backend path should behave. Right. I was hesitant to expand the scope before being certain that I can proceed in this direction without significant objections. Now I can include that in the next version. > + for (fork = 0; fork < INIT_FORKNUM; fork++) > + { > + if (smgrexists(RelationGetSmgr(r), fork)) > + log_newpage_range(r, fork, 0, > + smgrnblocks(RelationGetSmgr(r), fork), > + false); > + } > > A simple copy of the blocks means that we keep anything bloated in > them, while a rewrite in ALTER TABLE means that we would start afresh > by deforming the tuples from the origin before giving them to the > target, without any bloat. The compression of the FPWs and the > removal of the holes in the pages would surely limit the impact, but > this has not been discussed on this thread, and this is a nice > property of the existing implementation that would get silently > removed by this patch set. Sure. That bloat can be removed beforehand by explicitly running VACUUM on the table if needed, but it would be ideal if the same compression occurred automatically. Alternatively, it might be an option to fall back to the existing path when the target table is found to have excessive bloat (though I'm not sure how much should be considered excessive). We could also allow users to decide by adding a command option. > Another point that Nathan has made is that it may be more appealling > to study how this is better than an integration with the multi-INSERT > APIs into AMs, so as it is possible to group the inserts in batches > rather than process them one-at-a-time, see [1]. I am ready to accept > that what this patch does is more efficient as long as everything is > block-based in some cases. Still there is a risk-vs-gain argument > here, and I am not sure whether what we have here is a good tradeoff > compared to the potential risk of breaking things. 
The amount of new > infrastructure is large for this code path. Grouping the inserts in > large batches may finish by being more efficient than a WAL stream > full of FPWs, as well, even if toast values are deformed? So perhaps > there is an argument for making that optional at query level, instead. I agree about the uncertainties. With the switching feature mentioned above, it might be sufficient to use the multi-insert stuff in the existing path. However, the uncertainties regarding performance would still remain. > As a hole, I can say that grouping the INSERTs will be always more > efficient, while what we have here can be less efficient in some > cases. I'm OK to be outvoted, but the level of complications created > by this block-based copy and WAL-logging concerns me when it comes to > tweaking the relpersistence like that. Of course, it is a promising option to move away from the block-logging and fall back to the existing path using the multi-insert stuff in the UNLOGGED to LOGGED case. Let me consider that point. Besides the above, even though this discussion might become unnecessary, there was a concern that the blockwise logging might result in unexpected outcomes due to unflushed buffer data. (although I could be mistaken). I believe that is not the case because all buffer blocks are flushed out beforehand. > [1]: https://commitfest.postgresql.org/48/4777/ regards. -- Kyotaro Horiguchi NTT Open Source Software Center
+Bharath On Tue, Jun 04, 2024 at 04:00:32PM +0900, Kyotaro Horiguchi wrote: > At Tue, 4 Jun 2024 09:09:12 +0900, Michael Paquier <michael@paquier.xyz> wrote in >> Another point that Nathan has made is that it may be more appealling >> to study how this is better than an integration with the multi-INSERT >> APIs into AMs, so as it is possible to group the inserts in batches >> rather than process them one-at-a-time, see [1]. I am ready to accept >> that what this patch does is more efficient as long as everything is >> block-based in some cases. Still there is a risk-vs-gain argument >> here, and I am not sure whether what we have here is a good tradeoff >> compared to the potential risk of breaking things. The amount of new >> infrastructure is large for this code path. Grouping the inserts in >> large batches may finish by being more efficient than a WAL stream >> full of FPWs, as well, even if toast values are deformed? So perhaps >> there is an argument for making that optional at query level, instead. > > I agree about the uncertainties. With the switching feature mentioned > above, it might be sufficient to use the multi-insert stuff in the > existing path. However, the uncertainties regarding performance would > still remain. Bharath, does the multi-INSERT stuff apply when changing a table to be LOGGED? If so, I think it would be interesting to compare it with the FPI approach being discussed here. -- nathan
On Tue, Jun 04, 2024 at 03:50:51PM -0500, Nathan Bossart wrote: > Bharath, does the multi-INSERT stuff apply when changing a table to be > LOGGED? If so, I think it would be interesting to compare it with the FPI > approach being discussed here. The answer to this question is yes AFAIK. Look at patch 0002 in the latest series posted here, that touches ATRewriteTable() in tablecmds.c where the rewrite happens should a relation's relpersistence, AM, column or default requires a switch (particularly if more than one property is changed in a single command, grep for AT_REWRITE_*): https://www.postgresql.org/message-id/CALj2ACUz5+_YNEa4ZY-XG960_oXefM50MjD71VgSCAVDkF3bzQ@mail.gmail.com I've just read through the patch set, and they are rather pleasant to the eye. I have comments about them, actually, but that's a topic for the other thread. -- Michael
Attachment
On 31/08/2024 19:09, Kyotaro Horiguchi wrote: > - UNDO log(0002): This handles file deletion during transaction aborts, > which was previously managed, in part, by the commit XLOG record at > the end of a transaction. > > - Prevent orphan files after a crash (0005): This is another use-case > of the UNDO log system. Nice, I'm very excited if we can fix that long-standing issue! I'll try to review this properly later, but at a quick 5 minute glance, one thing caught my eye: This requires fsync()ing the per-xid undo log file every time a relation is created. I fear that can be a pretty big performance hit for workloads that repeatedly create and drop small tables. Especially if they're otherwise running with synchronous_commit=off. Instead of flushing the undo log file after every write, I'd suggest WAL-logging the undo log like regular relations and SLRUs. So before writing the entry to the undo log, WAL-log it. And with a little more effort, you could postpone creating the files altogether until a checkpoint happens, similar to how twophase state files are checkpointed nowadays. I wonder if the twophase state files and undo log files should be merged into one file. They're similar in many ways: there's one file per transaction, named using the XID. I haven't thought this fully through, just a thought.. > +static void > +undolog_set_filename(char *buf, TransactionId xid) > +{ > + snprintf(buf, MAXPGPATH, "%s/%08x", SIMPLE_UNDOLOG_DIR, xid); > +} I'd suggest using FullTransactionId. Doesn't matter much, but seems like a good future-proofing. -- Heikki Linnakangas Neon (https://neon.tech)
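The FullTransactionId suggestion at the end of the message above might be sketched as follows. This is a standalone illustration with assumed names, not the patch's code: `SketchFullXid` stands in for a 64-bit full transaction ID, `SKETCH_UNDOLOG_DIR` is a made-up directory name (the value of `SIMPLE_UNDOLOG_DIR` is not shown in the thread), and `sketch_undolog_set_filename()` is hypothetical.

```c
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>
#include <inttypes.h>

#define SKETCH_UNDOLOG_DIR "pg_ulog"    /* hypothetical directory name */

/* Hypothetical stand-in for a 64-bit full transaction ID (epoch + xid). */
typedef struct SketchFullXid
{
    uint64_t    value;
} SketchFullXid;

/*
 * Build the per-transaction undo log file name from the full 64-bit ID.
 * Sixteen hex digits cover the whole value, so two transactions whose
 * 32-bit XIDs are equal but whose epochs differ get different names.
 */
int
sketch_undolog_set_filename(char *buf, size_t buflen, SketchFullXid fxid)
{
    return snprintf(buf, buflen, "%s/%016" PRIx64,
                    SKETCH_UNDOLOG_DIR, fxid.value);
}
```

Compared with the quoted "%08x" format on a 32-bit TransactionId, the only change in this sketch is the width of the identifier; whether the real patch keeps the same directory layout is not something the thread states.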
Hello. Thank you for the response. At Sun, 1 Sep 2024 22:15:00 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote in > On 31/08/2024 19:09, Kyotaro Horiguchi wrote: > > - UNDO log(0002): This handles file deletion during transaction aborts, > > which was previously managed, in part, by the commit XLOG record at > > the end of a transaction. > > - Prevent orphan files after a crash (0005): This is another use-case > > of the UNDO log system. > > Nice, I'm very excited if we can fix that long-standing issue! I'll > try to review this properly later, but at a quick 5 minute glance, one > thing caught my eye: > > This requires fsync()ing the per-xid undo log file every time a > relation is created. I fear that can be a pretty big performance hit > for workloads that repeatedly create and drop small tables. Especially I initially thought that one additional file manipulation during file creation wouldn't be an issue. However, the created storage file isn't being synced, so your concern seems valid. > if they're otherwise running with synchronous_commit=off. Instead of > flushing the undo log file after every write, I'd suggest WAL-logging > the undo log like regular relations and SLRUs. So before writing the > entry to the undo log, WAL-log it. And with a little more effort, you > could postpone creating the files altogether until a checkpoint > happens, similar to how twophase state files are checkpointed > nowadays. I thought that an UNDO log file not flushed before the last checkpoint might not survive a system crash. However, including UNDO files in the checkpointing process resolves that concern. Thank you for the suggestion. > I wonder if the twophase state files and undo log files should be > merged into one file. They're similar in many ways: there's one file > per transaction, named using the XID. I haven't thought this fully > through, just a thought.. To be precise, UNDO log files are created per subtransaction, unlike twophase files. 
It might be possible if we allow the twophase files (as they are currently named) to be overwritten or modified at every subcommit. If ULOG contents are WAL-logged, these two things will become even more similar. However, I'm not planning to include that in the next version for now. > > +static void > > +undolog_set_filename(char *buf, TransactionId xid) > > +{ > > + snprintf(buf, MAXPGPATH, "%s/%08x", SIMPLE_UNDOLOG_DIR, xid); > > +} > > I'd suggest using FullTransactionId. Doesn't matter much, but seems > like a good future-proofing. Agreed. Will fix it in the next version. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Mon, 2 Sep 2024 09:30:20 +0900, Michael Paquier <michael@paquier.xyz> wrote in > On Sun, Sep 01, 2024 at 10:15:00PM +0300, Heikki Linnakangas wrote: > > I wonder if the twophase state files and undo log files should be merged > > into one file. They're similar in many ways: there's one file per > > transaction, named using the XID. I haven't thought this fully through, just > > a thought.. > > Hmm. It could be possible to extract some of this knowledge out of > twophase.c and design some APIs that could be used for both, but would > that be really necessary? The 2PC data and the LSNs used by the files > to check if things are replayed or on disk rely on > GlobalTransactionData that has its own idea of things and timings at > recovery. I'm not sure, but I feel that Heikki mentioned only about using the file format and in/out functions if the file formats of the two are sufficiently overlapping. > Or perhaps your point is actually to do that and add one layer for the > file handlings and their flush timings? I am not sure, TBH, what this > thread is trying to fix is complicated enough that it may be better to > live with two different code paths. But perhaps my gut feeling is > just wrong reading your paragraph. I believe this statement is valid, so I’m not in a hurry to do this. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 31/08/2024 19:09, Kyotaro Horiguchi wrote: > Subject: [PATCH v34 03/16] Remove function for retaining files on outer > transaction aborts > > The function RelationPreserveStorage() was initially created to keep > storage files committed in a subtransaction on the abort of outer > transactions. It was introduced by commit b9b8831ad6 in 2010, but no > use case for this behavior has emerged since then. If we move the > at-commit removal feature of storage files from pendingDeletes to the > UNDO log system, the UNDO system would need to accept the cancellation > of already logged entries, which makes the system overly complex with > no benefit. Therefore, remove the feature. I don't think that's quite right. I don't think this was meant for subtransaction aborts, but to make sure that if the top-transaction aborts after AtEOXact_RelationMap() has already been called, we don't remove the new relation. AtEOXact_RelationMap() is called very late in the commit process to keep the window as small as possible, but if it nevertheless happens, the consequences are pretty bad if you remove a relation file that is in fact needed. -- Heikki Linnakangas Neon (https://neon.tech)
On 31/10/2024 10:01, Kyotaro Horiguchi wrote: > After some delays, here’s the new version. In this update, UNDO logs > are WAL-logged and processed in memory under most conditions. During > checkpoints, they’re flushed to files, which are then read when a > specific XID’s UNDO log is accessed for the first time during > recovery. > > The biggest changes are in patches 0001 through 0004 (equivalent to > the previous 0001-0002). After that, there aren’t any major > changes. Since this update involves removing some existing features, > I’ve split these parts into multiple smaller identity transformations > to make them clearer. > > As for changes beyond that, the main one is lifting the previous > restriction on PREPARE for transactions after a persistence > change. This was made possible because, with the shift to in-memory > processing of UNDO logs, commit-time crash recovery detection is now > simpler. Additional changes include completely removing the > abort-handling portion from the pendingDeletes mechanism (0008-0010). In this patch version, the undo log is kept in dynamic shared memory. It can grow indefinitely. On a checkpoint, it's flushed to disk. If I'm reading it correctly, the undo records are kept in the DSA area even after it's flushed to disk. That's not necessary; system never needs to read the undo log unless there's a crash, so there's no need to keep it in memory after it's been flushed to disk. That's true today; we could start relying on the undo log to clean up on abort even when there's no crash, but I think it's a good design to not do that and rely on backend-private state for non-crash transaction abort. I'd suggest doing this the other way 'round. Let's treat the on-disk representation as the primary representation, not the in-memory one. Let's use a small fixed-size shared memory area just as a write buffer to hold the dirty undo log entries that haven't been written to disk yet. 
Most transactions are short, so most undo log entries never need to be flushed to disk, but I think it'll be simpler to think of it that way. On checkpoint, flush all the buffered dirty entries from memory to disk and clear the buffer. Also do that if the buffer fills up. A high-level overview comment of the on-disk format would be nice. If I understand correctly, there's a magic constant at the beginning of each undo file, followed by UndoLogRecords. There are no other file headers and no page structure within the file. That format seems reasonable. For cross-checking, maybe add the XID to the file header too. There is a separate CRC value on each record, which is nice, but not strictly necessary since the writes to the UNDO log are WAL-logged. The WAL needs CRCs on each record to detect the end of log, but the UNDO log doesn't need that. Anyway, it's fine. I somehow dislike the file per subxid design. I'm sure it works, it's just more of a feeling that it doesn't feel right. I'm somewhat worried about ending up with lots of files, if you e.g. use temporary tables with subtransactions heavily. Could we have just one file per top-level XID? I guess that can become a problem too, if you have a lot of aborted subtransactions. The UNDO records for the aborted subtransactions would bloat the undo file. But maybe that's nevertheless better? -- Heikki Linnakangas Neon (https://neon.tech)
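The write-buffer idea in the message above (a small fixed-size area holding only not-yet-written undo entries, flushed on checkpoint or when it fills up) can be sketched in miniature as below. This is a single-process illustration under stated assumptions, not the proposed design itself: all `sketch_` names, the 64-byte buffer size, and the callback-based sink are hypothetical, and the locking and shared-memory placement discussed in the thread are left out entirely.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_BUF_SIZE 64      /* tiny on purpose; the real size is a design choice */

/* Hypothetical sink that receives buffered bytes when they are flushed. */
typedef void (*SketchFlushFn) (const char *data, size_t len, void *arg);

typedef struct SketchUndoBuffer
{
    char        data[SKETCH_BUF_SIZE];
    size_t      used;           /* bytes buffered but not yet flushed */
    SketchFlushFn flush;
    void       *flush_arg;
} SketchUndoBuffer;

void
sketch_buffer_init(SketchUndoBuffer *b, SketchFlushFn flush, void *arg)
{
    b->used = 0;
    b->flush = flush;
    b->flush_arg = arg;
}

/* Hand all buffered entries to the sink and clear the buffer. */
void
sketch_buffer_flush(SketchUndoBuffer *b)
{
    if (b->used > 0)
    {
        b->flush(b->data, b->used, b->flush_arg);
        b->used = 0;
    }
}

/*
 * Append one entry, flushing first if it would not fit. Returns false
 * only if the entry is larger than the whole buffer.
 */
bool
sketch_buffer_append(SketchUndoBuffer *b, const void *rec, size_t len)
{
    if (len > SKETCH_BUF_SIZE)
        return false;
    if (b->used + len > SKETCH_BUF_SIZE)
        sketch_buffer_flush(b);
    memcpy(b->data + b->used, rec, len);
    b->used += len;
    return true;
}

/* Example sink: only counts flushed bytes; a real one would write to the undo file. */
void
sketch_count_bytes(const char *data, size_t len, void *arg)
{
    (void) data;
    *(size_t *) arg += len;
}
```

In this sketch a checkpoint would simply call `sketch_buffer_flush()`, and the same function runs implicitly when an append does not fit. How entries of different transactions are separated inside the buffer, and how concurrent writers are serialized, are exactly the open questions raised in the replies.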
Thank you for the quick comments. At Thu, 31 Oct 2024 23:24:36 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in > On 31/10/2024 10:01, Kyotaro Horiguchi wrote: > > After some delays, here’s the new version. In this update, UNDO logs > > are WAL-logged and processed in memory under most conditions. During > > checkpoints, they’re flushed to files, which are then read when a > > specific XID’s UNDO log is accessed for the first time during > > recovery. > > The biggest changes are in patches 0001 through 0004 (equivalent to > > the previous 0001-0002). After that, there aren’t any major > > changes. Since this update involves removing some existing features, > > I’ve split these parts into multiple smaller identity transformations > > to make them clearer. > > As for changes beyond that, the main one is lifting the previous > > restriction on PREPARE for transactions after a persistence > > change. This was made possible because, with the shift to in-memory > > processing of UNDO logs, commit-time crash recovery detection is now > > simpler. Additional changes include completely removing the > > abort-handling portion from the pendingDeletes mechanism (0008-0010). > > In this patch version, the undo log is kept in dynamic shared > memory. It can grow indefinitely. On a checkpoint, it's flushed to > disk. > > If I'm reading it correctly, the undo records are kept in the DSA area > even after it's flushed to disk. That's not necessary; system never > needs to read the undo log unless there's a crash, so there's no need The system also needs to read the undo log whenever additional undo logs are added. In this version, I’ve moved all abort-time pendingDeletes data entirely to the undo logs. In other words, the DSA area is expanded in exchange for reducing the pendingDelete list. As a result, there is minimal impact on overall memory usage. Additionally, the current flushing code is straightforward because it relies on the in-memory primary image. 
If we drop the in-memory image during flush, we might need exclusive locking or possibly some ordering techniques. Anyway, I’ll consider that approach. > to keep it in memory after it's been flushed to disk. That's true > today; we could start relying on the undo log to clean up on abort > even when there's no crash, but I think it's a good design to not do > that and rely on backend-private state for non-crash transaction > abort. Hmm. Sounds reasonable. In the next version, I'll revert the changes to pendingDeletes and adjust it to just discard the log on regular aborts. > I'd suggest doing this the other way 'round. Let's treat the on-disk > representation as the primary representation, not the in-memory > one. Let's use a small fixed-size shared memory area just as a write > buffer to hold the dirty undo log entries that haven't been written to > disk yet. Most transactions are short, so most undo log entries never > need to be flushed to disk, but I think it'll be simpler to think of > it that way. On checkpoint, flush all the buffered dirty entries from > memory to disk and clear the buffer. Also do that if the buffer fills > up. I'd like to clarify the specific concept of these fixed-length memory slots. Is it something like this: each slot is keyed by an XID, followed by an in-file offset and a series of, say, 1024-byte areas? When writing a log for a new XID, if no slot is available, the backend would immediately evict the slot with the smallest XID to disk to free up space. If an existing slot runs out of space while writing new logs, the backend would flush it immediately and continue using the area. Is this correct? Additionally, if multiple processes try to write to a single slot, stricter locking might be needed. For example, if a slot is evicted by a backend other than its user, exclusive control might be required during the file write. Is there any effective way to avoid such locking? 
In the current patch set, I’m avoiding any impact on the backend from checkpointer file writes by treating the in-memory image as primary. And regarding the number of these areas… although I’m not entirely sure, it seems unlikely we’d have hundreds of sessions simultaneously creating tables, so would it make sense to make this configurable, with a default of around 32 areas? > A high-level overview comment of the on-disk format would be nice. If > I understand correctly, there's a magic constant at the beginning of > each undo file, followed by UndoLogRecords. There are no other file > headers and no page structure within the file. Right. > That format seems reasonable. For cross-checking, maybe add the XID to > the file header too. There is a separate CRC value on each record, > which is nice, but not strictly necessary since the writes to the UNDO > log are WAL-logged. The WAL needs CRCs on each record to detect the > end of log, but the UNDO log doesn't need that. Anyway, it's fine. For the first point, I considered it when designing the previous patch set but chose not to implement it. As for the CRC, you're right - it’s simply a leftover from the previous design. I have no issues with following both points. > I somehow dislike the file per subxid design. I'm sure it works, it's > just more of a feeling that it doesn't feel right. I'm somewhat > worried about ending up with lots of files, if you e.g. use temporary > tables with subtransactions heavily. Could we have just one file per I first thought the same thing when working on the previous patch. > top-level XID? I guess that can become a problem too, if you have a > lot of aborted subtransactions. The UNDO records for the aborted > subtransactions would bloat the undo file. But maybe that's > nevertheless better? In the current patch set, normal abort processing is handled by the UNDO log, so maintaining the performance of the UNDO process is essential. 
If we were to return this to pendingDeletes, it might also be feasible to add an XID cancellation record to the UNDO log and scan the entire file once before executing individual logs. I’ll give it some thought. At Mon, 28 Oct 2024 15:33:41 +0200, Heikki Linnakangas <hlinnaka@iki.fi> wrote in > On 31/08/2024 19:09, Kyotaro Horiguchi wrote: > > Subject: [PATCH v34 03/16] Remove function for retaining files on > > outer > > transaction aborts > > The function RelationPreserveStorage() was initially created to keep > > storage files committed in a subtransaction on the abort of outer > > transactions. It was introduced by commit b9b8831ad6 in 2010, but no > > use case for this behavior has emerged since then. If we move the > > at-commit removal feature of storage files from pendingDeletes to the > > UNDO log system, the UNDO system would need to accept the cancellation > > of already logged entries, which makes the system overly complex with > > no benefit. Therefore, remove the feature. > > I don't think that's quite right. I don't think this was meant for > subtransaction aborts, but to make sure that if the top-transaction > aborts after AtEOXact_RelationMap() has already been called, we don't > remove the new relation. AtEOXact_RelationMap() is called very late in Hmm. I believe I wrote that. It prevents storage removal once it’s committed in any subtransaction, even if that subtransaction is finally aborted, including by the top transaction. > the commit process to keep the window as small as possible, but if it > nevertheless happens, the consequences are pretty bad if you remove a > relation file that is in fact needed. However, on second thought, it does seem odd. I may have confused something here. If pendingDeletes is restored and undo cancellation is implemented, this change would be unnecessary. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
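[Editor's sketch] The fixed-length slot scheme asked about earlier in this message (slots keyed by XID, an in-file offset, a 1024-byte area, eviction of the smallest XID when no slot is free, flush when a slot fills up) might be sketched as below. Everything here is an assumption for illustration: the names UndoSlot, slot_acquire, undo_append and undo_checkpoint are hypothetical, the "disk" is simulated by a per-XID length table (disk_len) instead of real file writes, XID wraparound is ignored when picking the smallest XID, and no locking is shown, which is exactly the open question raised above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NSLOTS      4       /* the mail floats ~32; kept small here */
#define SLOT_BYTES  1024
#define INVALID_XID 0
#define MAX_XID     64      /* bound of the simulated disk table */

typedef struct UndoSlot
{
    uint32_t xid;           /* key; INVALID_XID when the slot is free */
    uint64_t file_offset;   /* where the next flush lands in the XID's file */
    uint32_t used;          /* bytes currently buffered in data[] */
    char     data[SLOT_BYTES];
} UndoSlot;

static UndoSlot slots[NSLOTS];
static uint64_t disk_len[MAX_XID];  /* stand-in for per-XID undo files */

/* "Write" the buffered bytes at file_offset and empty the slot. */
static void
slot_flush(UndoSlot *s)
{
    assert(s->file_offset == disk_len[s->xid]);
    disk_len[s->xid] += s->used;
    s->file_offset += s->used;
    s->used = 0;
}

/* Find the slot for xid, else a free one, else evict the smallest XID. */
static UndoSlot *
slot_acquire(uint32_t xid)
{
    UndoSlot *free_slot = NULL;
    UndoSlot *oldest = NULL;

    assert(xid != INVALID_XID && xid < MAX_XID);
    for (int i = 0; i < NSLOTS; i++)
    {
        UndoSlot *s = &slots[i];

        if (s->xid == xid)
            return s;
        if (s->xid == INVALID_XID)
        {
            if (free_slot == NULL)
                free_slot = s;
        }
        else if (oldest == NULL || s->xid < oldest->xid)
            oldest = s;
    }
    if (free_slot == NULL)
    {
        slot_flush(oldest);         /* evict: its entries go to disk */
        free_slot = oldest;
    }
    free_slot->xid = xid;
    free_slot->file_offset = disk_len[xid];  /* resume after prior flushes */
    free_slot->used = 0;
    return free_slot;
}

/* Append one undo entry, flushing the slot first if it would overflow. */
static void
undo_append(uint32_t xid, const void *rec, uint32_t len)
{
    UndoSlot *s;

    assert(len <= SLOT_BYTES);
    s = slot_acquire(xid);
    if (s->used + len > SLOT_BYTES)
        slot_flush(s);
    memcpy(s->data + s->used, rec, len);
    s->used += len;
}

/* On checkpoint: flush every buffered entry and clear the buffer. */
static void
undo_checkpoint(void)
{
    for (int i = 0; i < NSLOTS; i++)
    {
        if (slots[i].xid == INVALID_XID)
            continue;
        slot_flush(&slots[i]);
        slots[i].xid = INVALID_XID;
    }
}
```

In this single-process sketch an eviction by someone other than the slot's owner is harmless; with several backends and a checkpointer, slot_flush and undo_append on the same slot would need the exclusive control discussed above.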
A bit out of the blue, but I remembered why I was able to make the change that I previously agreed seemed off. Just thought I’d let you know. At Tue, 05 Nov 2024 13:25:26 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in me> > the commit process to keep the window as small as possible, but if it me> > nevertheless happens, the consequences are pretty bad if you remove a me> > relation file that is in fact needed. me> me> However, on second thought, it does seem odd. I may have confused me> something here. If pendingDeletes is restored and undo cancellation is me> implemented, this change would be unnecessary. The change would indeed be incorrect if updates to mapped relations could occur within subtransactions. However, in reality, trying to perform such an operation raises an error (something like “cannot do this in a subtransaction”) and is rejected. So, there’s actually no path where the removed code would be used. That’s why I judged it was safe to remove that part. However, from that perspective, I think the explanations in the comments and commit messages were somewhat lacking or missed the point. Currently, I’m leaning toward implementing per-relation undo cancellation. Previously, this path was active even during normal aborts, so there were performance concerns, but now it only runs during recovery cleanup, so there are no performance issues with handling cancellation. In the current state, the code has been simplified overall. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
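[Editor's sketch] Per-relation undo cancellation during recovery cleanup, as leaned toward above, could be handled by scanning the whole undo log once for cancellation records before executing anything. The following is a minimal sketch under assumptions, not the patch's code: the record types UNDO_REMOVE_ON_ABORT and UNDO_CANCEL, the UndoRecord layout, and undo_plan_removals are hypothetical names, and a cancellation is assumed to apply to every entry for that relation in the log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RECS 64     /* arbitrary bound for this sketch */

typedef enum
{
    UNDO_REMOVE_ON_ABORT,   /* remove this relation's storage on abort */
    UNDO_CANCEL             /* cancel earlier entries for this relation */
} UndoType;

typedef struct UndoRecord
{
    UndoType type;
    uint32_t relnumber;
} UndoRecord;

/*
 * Decide which storage files recovery cleanup should remove.
 * to_remove must have room for n entries; returns how many were filled.
 */
static size_t
undo_plan_removals(const UndoRecord *recs, size_t n, uint32_t *to_remove)
{
    uint32_t cancelled[MAX_RECS];
    size_t   ncancelled = 0;
    size_t   nremove = 0;

    assert(n <= MAX_RECS);

    /* Pass 1: scan the whole log once and collect cancelled relations. */
    for (size_t i = 0; i < n; i++)
        if (recs[i].type == UNDO_CANCEL)
            cancelled[ncancelled++] = recs[i].relnumber;

    /* Pass 2: keep only removal entries that were not cancelled. */
    for (size_t i = 0; i < n; i++)
    {
        bool skip = false;

        if (recs[i].type != UNDO_REMOVE_ON_ABORT)
            continue;
        for (size_t j = 0; j < ncancelled; j++)
            if (cancelled[j] == recs[i].relnumber)
                skip = true;
        if (!skip)
            to_remove[nremove++] = recs[i].relnumber;
    }
    return nremove;
}
```

Because this would run only during recovery cleanup and not on normal aborts, the extra pass and the quadratic lookup are not on a performance-critical path, which matches the reasoning in the message.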