Re: [HACKERS] WAL logging problem in 9.4.3? - Mailing list pgsql-hackers
From: Noah Misch
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Msg-id: 20191226001521.GA1772687@rfd.leadboat.com
In response to: Re: [HACKERS] WAL logging problem in 9.4.3? (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List: pgsql-hackers
By improving AssertPendingSyncs_RelationCache() and by testing with
-DRELCACHE_FORCE_RELEASE, I now know of three defects in the attached
v30nm.  Would you fix these?

=== Defect 1: Forgets to skip WAL after SAVEPOINT; DROP TABLE; ROLLBACK TO

A test in transactions.sql now fails in AssertPendingSyncs_RelationCache(),
when running "make check" under wal_level=minimal.  I test this way:

  printf '%s\n%s\n' 'wal_level = minimal' 'max_wal_senders = 0' >$PWD/minimal.conf
  make check TEMP_CONFIG=$PWD/minimal.conf

Self-contained demonstration:

  begin;
  create table t (c int);
  savepoint q; drop table t; rollback to q;  -- forgets table is skipping wal
  commit;  -- assertion failure

=== Defect 2: Forgets to skip WAL due to oversimplification in heap_create()

In ALTER TABLE cases where TryReuseIndex() avoided an index rebuild, we need
to transfer WAL-skipped state to the new index relation.  Before v24nm, the
new index relation skipped WAL unconditionally.  Since v24nm, the new index
relation never skips WAL.  I've added a test to alter_table.sql that reveals
this problem under wal_level=minimal.

=== Defect 3: storage.c checks size decrease of MAIN_FORKNUM only

storage.c tracks only MAIN_FORKNUM in pendingsync->max_truncated.  Is it
possible for MAIN_FORKNUM to have a net size increase while FSM_FORKNUM has
a net size decrease?  I haven't tested, but this sequence seems possible:

  TRUNCATE
    reduces MAIN_FORKNUM from 100 blocks to 0 blocks
    reduces FSM_FORKNUM from 3 blocks to 0 blocks
  COPY
    raises MAIN_FORKNUM from 0 blocks to 110 blocks
    does not change FSM_FORKNUM
  COMMIT
    should fsync, but wrongly chooses log_newpage_range() approach

If that's indeed a problem, beside the obvious option of tracking every
fork's max_truncated, we could convert max_truncated to a bool and use
fsync anytime the relation experienced an mdtruncate().  (While FSM_FORKNUM
is not critical for database operations, the choice to subject it to
checksums entails protecting it here.)
If that's not a problem, would you explain?

=== Non-defect notes

Once you have a correct patch, would you run check-world with
-DCLOBBER_CACHE_ALWAYS?  That may reveal additional defects.  It may take a
day or more, but that's fine.

The new smgrimmedsync() calls are potentially fragile, because they
sometimes target a file of a dropped relation.  However, the mdexists()
test prevents anything bad from happening.  No change is necessary.
Example:

  SET wal_skip_threshold = 0;
  BEGIN;
  SAVEPOINT q;
  CREATE TABLE t (c) AS SELECT 1;
  ROLLBACK TO q;  -- truncates the relfilenode
  CHECKPOINT;  -- unlinks the relfilenode
  COMMIT;  -- calls mdexists() on the relfilenode

=== Notable changes in v30nm

- Changed "wal_skip_threshold * 1024" to an expression that can't overflow.
  Without this, wal_skip_threshold=1TB behaved like wal_skip_threshold=0.

- Changed AssertPendingSyncs_RelationCache() to open all relations on which
  the transaction holds locks.  This permits detection of cases where
  RelationNeedsWAL() returns true but storage.c will sync the relation.

  Removed the assertions from RelationIdGetRelation().  Using
  "-DRELCACHE_FORCE_RELEASE" made them fail for usage patterns that aren't
  actually problematic, since invalidation updates rd_node while other code
  updates rd_firstRelfilenodeSubid.  This is not a significant loss, now
  that AssertPendingSyncs_RelationCache() opens relations.  (I considered
  making the update of rd_firstRelfilenodeSubid more like rd_node, where we
  store it somewhere until the next CommandCounterIncrement(), which would
  make it actually affect RelationNeedsWAL().  That might have been better
  in general, but it felt complex without clear benefits.)

  Skip AssertPendingSyncs_RelationCache() at abort, like v24nm did.  Making
  that work no matter what does ereport(ERROR) would be tricky and
  low-value.

- Extracted the RelationTruncate() changes into new function
  RelationPreTruncate(), so table access methods that can't use
  RelationTruncate() have another way to request that work.
- Changed wal_skip_threshold default to 2MB.  My second preference was for
  4MB.  In your data, 2MB and 4MB had similar performance at optimal
  wal_buffers, but 4MB performed worse at low wal_buffers.

- Reverted most post-v24nm changes to swap_relation_files().  Under
  "-DRELCACHE_FORCE_RELEASE", relcache.c quickly discards the
  rel1->rd_node.relNode update.  Clearing rel2->rd_createSubid is not right
  if we're running CLUSTER for the second time in one transaction.  I used
  relation_open(r1, NoLock) instead of AccessShareLock, because we deserve
  an assertion failure if we hold no lock at that point.

- Changed toast_get_valid_index() to retain locks until end of transaction.
  When I adopted relation_open(r1, NoLock) in swap_relation_files(), that
  revealed that we retain no lock on the TOAST index.

- Ran pgindent and perltidy.  Updated some comments and names.

On Mon, Dec 09, 2019 at 06:04:06PM +0900, Kyotaro Horiguchi wrote:
> Anyway the default value ought to be defined based on the default
> configuration.

PostgreSQL does not follow that principle.  Settings that change permanent
resource consumption, such as wal_buffers, have small defaults.  Settings
that don't change permanent resource consumption can have defaults that
favor a well-tuned system.