Re: [HACKERS] WAL logging problem in 9.4.3? - Mailing list pgsql-hackers

From: Noah Misch
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Msg-id: 20191226001521.GA1772687@rfd.leadboat.com
In response to: Re: [HACKERS] WAL logging problem in 9.4.3?  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses: Re: [HACKERS] WAL logging problem in 9.4.3?
List: pgsql-hackers

By improving AssertPendingSyncs_RelationCache() and by testing with
-DRELCACHE_FORCE_RELEASE, I now know of three defects in the attached v30nm.
Would you fix these?


=== Defect 1: Forgets to skip WAL after SAVEPOINT; DROP TABLE; ROLLBACK TO

A test in transactions.sql now fails in AssertPendingSyncs_RelationCache(),
when running "make check" under wal_level=minimal.  I test this way:

printf '%s\n%s\n' 'wal_level = minimal' 'max_wal_senders = 0' >$PWD/minimal.conf
make check TEMP_CONFIG=$PWD/minimal.conf

Self-contained demonstration:
  begin;
  create table t (c int);
  savepoint q; drop table t; rollback to q;  -- forgets table is skipping wal
  commit;  -- assertion failure


=== Defect 2: Forgets to skip WAL due to oversimplification in heap_create()

In ALTER TABLE cases where TryReuseIndex() avoided an index rebuild, we need
to transfer WAL-skipped state to the new index relation.  Before v24nm, the
new index relation skipped WAL unconditionally.  Since v24nm, the new index
relation never skips WAL.  I've added a test to alter_table.sql that reveals
this problem under wal_level=minimal.


=== Defect 3: storage.c checks size decrease of MAIN_FORKNUM only

storage.c tracks only MAIN_FORKNUM in pendingsync->max_truncated.  Is it
possible for MAIN_FORKNUM to have a net size increase while FSM_FORKNUM has a
net size decrease?  I haven't tested, but this sequence seems possible:

  TRUNCATE
    reduces MAIN_FORKNUM from 100 blocks to 0 blocks
    reduces FSM_FORKNUM from 3 blocks to 0 blocks
  COPY
    raises MAIN_FORKNUM from 0 blocks to 110 blocks
    does not change FSM_FORKNUM
  COMMIT
    should fsync, but wrongly chooses log_newpage_range() approach

If that's indeed a problem, besides the obvious option of tracking every fork's
max_truncated, we could convert max_truncated to a bool and use fsync anytime
the relation experienced an mdtruncate().  (While FSM_FORKNUM is not critical
for database operations, the choice to subject it to checksums entails
protecting it here.)  If that's not a problem, would you explain?


=== Non-defect notes

Once you have a correct patch, would you run check-world with
-DCLOBBER_CACHE_ALWAYS?  That may reveal additional defects.  It may take a
day or more, but that's fine.

The new smgrimmedsync() calls are potentially fragile, because they sometimes
target a file of a dropped relation.  However, the mdexists() test prevents
anything bad from happening.  No change is necessary.  Example:

  SET wal_skip_threshold = 0;
  BEGIN;
  SAVEPOINT q;
  CREATE TABLE t (c) AS SELECT 1;
  ROLLBACK TO q;  -- truncates the relfilenode
  CHECKPOINT;  -- unlinks the relfilenode
  COMMIT;  -- calls mdexists() on the relfilenode


=== Notable changes in v30nm

- Changed "wal_skip_threshold * 1024" to an expression that can't overflow.
  Without this, wal_skip_threshold=1TB behaved like wal_skip_threshold=0.

- Changed AssertPendingSyncs_RelationCache() to open all relations on which
  the transaction holds locks.  This permits detection of cases where
  RelationNeedsWAL() returns true but storage.c will sync the relation.

  Removed the assertions from RelationIdGetRelation().  Using
  "-DRELCACHE_FORCE_RELEASE" made them fail for usage patterns that aren't
  actually problematic, since invalidation updates rd_node while other code
  updates rd_firstRelfilenodeSubid.  This is not a significant loss, now that
  AssertPendingSyncs_RelationCache() opens relations.  (I considered making
  the update of rd_firstRelfilenodeSubid more like rd_node, where we store it
  somewhere until the next CommandCounterIncrement(), which would make it
  actually affect RelationNeedsWAL().  That might have been better in general,
  but it felt complex without clear benefits.)

  Skipped AssertPendingSyncs_RelationCache() at abort, as v24nm did.  Making
  that work no matter which code path does ereport(ERROR) would be tricky and
  low-value.

- Extracted the RelationTruncate() changes into new function
  RelationPreTruncate(), so table access methods that can't use
  RelationTruncate() have another way to request that work.

- Changed wal_skip_threshold default to 2MB.  My second preference was for
  4MB.  In your data, 2MB and 4MB had similar performance at optimal
  wal_buffers, but 4MB performed worse at low wal_buffers.

- Reverted most post-v24nm changes to swap_relation_files().  Under
  "-DRELCACHE_FORCE_RELEASE", relcache.c quickly discards the
  rel1->rd_node.relNode update.  Clearing rel2->rd_createSubid is not right if
  we're running CLUSTER for the second time in one transaction.  I used
  relation_open(r1, NoLock) instead of AccessShareLock, because we deserve an
  assertion failure if we hold no lock at that point.

- Changed toast_get_valid_index() to retain locks until end of transaction.
  When I adopted relation_open(r1, NoLock) in swap_relation_files(), that
  revealed that we retain no lock on the TOAST index.

- Ran pgindent and perltidy.  Updated some comments and names.

On Mon, Dec 09, 2019 at 06:04:06PM +0900, Kyotaro Horiguchi wrote:
> Anyway the default value ought to be defined based on the default
> configuration.

PostgreSQL does not follow that principle.  Settings that change permanent
resource consumption, such as wal_buffers, have small defaults.  Settings that
don't change permanent resource consumption can have defaults that favor a
well-tuned system.

