Re: [HACKERS] WAL logging problem in 9.4.3? - Mailing list pgsql-hackers

From	Noah Misch
Subject	Re: [HACKERS] WAL logging problem in 9.4.3?
Date	March 21, 2020 22:49:20
Msg-id	20200321224920.GB1763544@rfd.leadboat.com
In response to	Re: [HACKERS] WAL logging problem in 9.4.3? (Noah Misch <noah@leadboat.com>)
Responses	Re: [HACKERS] WAL logging problem in 9.4.3?
List	pgsql-hackers

On Sat, Mar 21, 2020 at 12:01:27PM -0700, Noah Misch wrote:
> Pushed, after adding a missing "break" to gist_identify() and tweaking two
> more comments.  However, a diverse minority of buildfarm members are failing
> like this, in most branches:
> 
> Mar 21 13:16:37 #   Failed test 'wal_level = minimal, SET TABLESPACE, hint bit'
> Mar 21 13:16:37 #   at t/018_wal_optimize.pl line 231.
> Mar 21 13:16:37 #          got: '1'
> Mar 21 13:16:37 #     expected: '2'
> Mar 21 13:16:46 # Looks like you failed 1 test of 34.
> Mar 21 13:16:46 [13:16:46] t/018_wal_optimize.pl ................ 
>   -- https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2020-03-21%2016%3A52%3A05
> 
> Since I run two of the failing animals, I expect to reproduce this soon.

force_parallel_mode = regress was the setting needed to reproduce this:

  printf '%s\n%s\n' 'log_statement = all' 'force_parallel_mode = regress' >/tmp/force_parallel.conf
  make -C src/test/recovery check PROVE_TESTS=t/018_wal_optimize.pl TEMP_CONFIG=/tmp/force_parallel.conf

The proximate cause is the RelFileNodeSkippingWAL() call that we added to
MarkBufferDirtyHint().  MarkBufferDirtyHint() runs in parallel workers, but
parallel workers have zeroes for pendingSyncHash and rd_*Subid.  I hacked up
the attached patch to understand the scope of the problem (not to commit).  It
logs a message whenever a parallel worker uses pendingSyncHash or
RelationNeedsWAL().  Some of the cases happen often enough to make logs huge,
so the patch suppresses logging for them.  You can see the lower-volume calls
like this:

  printf '%s\n%s\n%s\n%s\n' 'log_statement = all' 'wal_level = minimal' 'max_wal_senders = 0' 'force_parallel_mode = regress' >/tmp/minimal_parallel.conf

  make check-world TEMP_CONFIG=/tmp/minimal_parallel.conf
  find . -name log | xargs grep -rl 'nm0 invalid'

Not all are actual bugs.  For example, get_relation_info() behaves fine:

    /* Temporary and unlogged relations are inaccessible during recovery. */
    if (!RelationNeedsWAL(relation) && RecoveryInProgress())
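
As a rough illustration of the worker-state problem described above, here is a
toy model, not PostgreSQL source.  ToyBackend, toy_skipping_wal() and
toy_start_worker() are hypothetical names, and a plain array stands in for
pendingSyncHash.  It assumes the lookup treats empty state as "not skipping
WAL", so a worker gets a different answer than its leader for the same
relfilenode:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for per-backend state; NOT PostgreSQL's real structures. */
typedef struct ToyBackend
{
    const unsigned *pending_sync;   /* relfilenodes skipping WAL, or NULL */
    size_t          npending;       /* number of entries in pending_sync */
} ToyBackend;

/*
 * Hypothetical analogue of RelFileNodeSkippingWAL(): report whether this
 * backend believes WAL is being skipped for the given relfilenode.
 */
bool
toy_skipping_wal(const ToyBackend *backend, unsigned relfilenode)
{
    assert(backend != NULL);

    /* Empty state is read as "nothing is skipping WAL". */
    if (backend->pending_sync == NULL)
        return false;

    for (size_t i = 0; i < backend->npending; i++)
    {
        if (backend->pending_sync[i] == relfilenode)
            return true;
    }
    return false;
}

/*
 * Model of worker startup: the worker begins with zeroed state, and the
 * leader's pending-sync entries are not copied over.
 */
ToyBackend
toy_start_worker(const ToyBackend *leader)
{
    ToyBackend  worker = {NULL, 0};

    (void) leader;              /* deliberately unused: nothing is inherited */
    return worker;
}
```

With a leader whose pending list holds relfilenode 16384, toy_skipping_wal()
returns true in the leader and false in a worker started from it, which is the
kind of disagreement that MarkBufferDirtyHint() can run into.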

Kyotaro, can you look through the affected code and propose a strategy for
good coexistence of parallel query with the WAL skipping mechanism?

Since I don't expect one strategy to win clearly and quickly, I plan to revert
the main patch around 2020-03-22 17:30 UTC.  That will give the patch about
twenty-four hours in the buildfarm, so more animals can report in.  I will
leave the three smaller patches in place.

> fairywren failed differently on 9.5; I have not yet studied it:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-03-21%2018%3A01%3A10

This did not remain specific to 9.5.  On platforms where SIZEOF_SIZE_T==4 or
SIZEOF_LONG==4, wal_skip_threshold cannot exceed 2GB.  A simple s/1TB/1GB/ in
the test should fix this.
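
The arithmetic behind that limit can be sketched as follows.  This is a
standalone illustration, not PostgreSQL code.  It assumes wal_skip_threshold is
an int counted in kilobytes whose ceiling follows the MAX_KILOBYTES rule as I
recall it from guc.h (INT_MAX when both size_t and long are wider than 4 bytes,
INT_MAX / 1024 otherwise).  toy_max_kilobytes() and toy_threshold_fits() are
hypothetical helpers that take the type sizes as parameters so both platform
flavors can be checked on one machine:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Assumed ceiling for kilobyte-valued settings, modeled on the
 * MAX_KILOBYTES rule described above (an assumption, not copied source).
 */
int
toy_max_kilobytes(int sizeof_size_t, int sizeof_long)
{
    if (sizeof_size_t > 4 && sizeof_long > 4)
        return INT_MAX;         /* roughly 2TB worth of kilobytes */
    return INT_MAX / 1024;      /* roughly 2GB worth of kilobytes */
}

/* Would a threshold of this many kilobytes be accepted on such a platform? */
bool
toy_threshold_fits(long long kilobytes, int sizeof_size_t, int sizeof_long)
{
    assert(kilobytes >= 0);
    return kilobytes <= toy_max_kilobytes(sizeof_size_t, sizeof_long);
}
```

Under these assumptions 1TB is 1024 * 1024 * 1024 kB, which fits when both
types are 8 bytes but not when long is 4 bytes (as on 64-bit Windows), while
1GB (1024 * 1024 kB) fits in either case.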

Attachment

debug-parallel-skip-wal-v0.patch
