On Sat, Mar 21, 2020 at 12:01:27PM -0700, Noah Misch wrote:
> Pushed, after adding a missing "break" to gist_identify() and tweaking two
> more comments. However, a diverse minority of buildfarm members are failing
> like this, in most branches:
>
> Mar 21 13:16:37 # Failed test 'wal_level = minimal, SET TABLESPACE, hint bit'
> Mar 21 13:16:37 # at t/018_wal_optimize.pl line 231.
> Mar 21 13:16:37 # got: '1'
> Mar 21 13:16:37 # expected: '2'
> Mar 21 13:16:46 # Looks like you failed 1 test of 34.
> Mar 21 13:16:46 [13:16:46] t/018_wal_optimize.pl ................
> -- https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2020-03-21%2016%3A52%3A05
>
> Since I run two of the failing animals, I expect to reproduce this soon.
force_parallel_mode = regress was the setting needed to reproduce this:
printf '%s\n%s\n' 'log_statement = all' 'force_parallel_mode = regress' >/tmp/force_parallel.conf
make -C src/test/recovery check PROVE_TESTS=t/018_wal_optimize.pl TEMP_CONFIG=/tmp/force_parallel.conf
The proximate cause is the RelFileNodeSkippingWAL() call that we added to
MarkBufferDirtyHint(). MarkBufferDirtyHint() runs in parallel workers, but
parallel workers have zeroes for pendingSyncHash and rd_*Subid. I hacked up
the attached patch to understand the scope of the problem (not to commit). It
logs a message whenever a parallel worker uses pendingSyncHash or
RelationNeedsWAL(). Some of the cases happen often enough to make logs huge,
so the patch suppresses logging for them. You can see the lower-volume calls
like this:
printf '%s\n%s\n%s\n%s\n' 'log_statement = all' 'wal_level = minimal' 'max_wal_senders = 0' 'force_parallel_mode = regress' >/tmp/minimal_parallel.conf
make check-world TEMP_CONFIG=/tmp/minimal_parallel.conf
find . -name log | xargs grep -rl 'nm0 invalid'
Not all are actual bugs. For example, get_relation_info() behaves fine:
/* Temporary and unlogged relations are inaccessible during recovery. */
if (!RelationNeedsWAL(relation) && RecoveryInProgress())
Kyotaro, can you look through the affected code and propose a strategy for
good coexistence of parallel query with the WAL skipping mechanism?
Since I don't expect one strategy to win clearly and quickly, I plan to revert
the main patch around 2020-03-22 17:30 UTC. That will give the patch about
twenty-four hours in the buildfarm, so more animals can report in. I will
leave the three smaller patches in place.
> fairywren failed differently on 9.5; I have not yet studied it:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-03-21%2018%3A01%3A10
This did not remain specific to 9.5. On platforms where SIZEOF_SIZE_T==4 or
SIZEOF_LONG==4, wal_skip_threshold cannot exceed 2GB. A simple s/1TB/1GB/ in
the test should fix this.
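
To spell out the arithmetic behind that limit, here is a rough sketch, not the actual GUC code. It assumes the ceiling follows the MAX_KILOBYTES rule as I remember it from guc.h (INT_MAX when both size_t and long are wider than 4 bytes, INT_MAX / 1024 otherwise), that the setting is measured in kilobytes, and that int is 32 bits. The helper names max_kilobytes() and threshold_fits() are made up for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Hypothetical helper: largest value, in kB, that a kilobyte-valued integer
 * setting could take for the given type widths.  Assumed to mirror the
 * MAX_KILOBYTES rule; not copied from the PostgreSQL tree.
 */
static int
max_kilobytes(int sizeof_size_t, int sizeof_long)
{
    assert(sizeof_size_t > 0 && sizeof_long > 0);

    if (sizeof_size_t > 4 && sizeof_long > 4)
        return INT_MAX;         /* kB count limited only by int */
    return INT_MAX / 1024;      /* byte count must fit in 31 bits: just under 2GB */
}

/* Hypothetical helper: would a threshold of "kilobytes" kB be accepted? */
static bool
threshold_fits(long long kilobytes, int sizeof_size_t, int sizeof_long)
{
    return kilobytes <= max_kilobytes(sizeof_size_t, sizeof_long);
}
```

Under those assumptions, 1TB is 1073741824 kB, which fits when both widths are 8 but not when either is 4 (for example, 64-bit Windows, where long is 4 bytes). 1GB is 1048576 kB, which fits under the narrower limit of 2097151 kB as well.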