Re: stats test intermittent failure - Mailing list pgsql-hackers

From: Alexander Lakhin
Subject: Re: stats test intermittent failure
Msg-id: 8d526f29-c269-dc3d-38ac-10fa3a0a7fb3@gmail.com
In response to: stats test intermittent failure (Melanie Plageman <melanieplageman@gmail.com>)
List: pgsql-hackers

Hi Melanie,

10.07.2023 21:35, Melanie Plageman wrote:
Hi,

Jeff pointed out that one of the pg_stat_io tests has failed a few times
over the past months (here on morepork [1] and more recently here on
francolin [2]).

Failing test diff for those who prefer not to scroll:

+++ /home/bf/bf-build/francolin/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/stats.out   2023-07-07 18:48:25.976313231 +0000
@@ -1415,7 +1415,7 @@
        :io_sum_vac_strategy_after_reuses > :io_sum_vac_strategy_before_reuses;
  ?column? | ?column?
 ----------+----------
- t        | t
+ t        | f

My theory about the test failure is that, when there is enough demand
for shared buffers, the flapping test fails because it expects buffer
access strategy *reuses* and concurrent queries already flushed those
buffers before they could be reused. Attached is a patch which I think
will fix the test while keeping some code coverage. If we count
evictions and reuses together, those should have increased.
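
To make the difference between the two checks concrete, here is a toy sketch in bash. It is not the regression test itself: the function names and all counter values are made up for illustration, and it assumes (per the theory above) that a buffer which could not be reused shows up as an eviction instead.

```shell
#!/usr/bin/env bash
# Toy model of the two checks; every number below is hypothetical.

# Strict check: reuses must have grown during the VACUUM.
strict_check() {    # args: reuses_before reuses_after
  if (( $2 > $1 )); then echo t; else echo f; fi
}

# Relaxed check: reuses and evictions are counted together.
combined_check() {  # args: reuses_before evictions_before reuses_after evictions_after
  if (( $3 + $4 > $1 + $2 )); then echo t; else echo f; fi
}

# Hypothetical run under shared-buffer pressure:
# no new reuses, but 32 new evictions.
strict_check 100 100           # prints f
combined_check 100 5 100 37    # prints t
```

With these invented numbers the strict comparison reports f, as in the failing diff, while the combined comparison still reports t.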

I managed to reproduce that failure on master with the attached patch
applied and with the following script (which effectively multiplies the
probability of the failure by 360):
CPPFLAGS="-O0" ./configure -q --enable-debug --enable-cassert --enable-tap-tests  && make  -s -j`nproc` && make -s check -C src/test/recovery
mkdir -p src/test/recovery00/t
cp src/test/recovery/t/027_stream_regress.pl src/test/recovery00/t/
cp src/test/recovery/Makefile src/test/recovery00/
for ((i=1;i<=9;i++)); do cp -r src/test/recovery00/ src/test/recovery$i; done

for ((i=1;i<=10;i++)); do
    echo "iteration $i"
    NO_TEMP_INSTALL=1 parallel --halt now,fail=1 -j9 --linebuffer --tag \
        make -s check -C src/test/{} ::: recovery1 recovery2 recovery3 \
        recovery4 recovery5 recovery6 recovery7 recovery8 recovery9 || break
done
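
The "repeat until the first failure" pattern used in the loop above can also be factored into a small helper. This is only a sketch: `run_until_failure` is a hypothetical name, not part of the PostgreSQL tree or of GNU parallel.

```shell
#!/usr/bin/env bash
# run_until_failure N CMD [ARGS...]
# Run CMD up to N times, printing the iteration number before each run.
# Stops and returns 1 at the first failing run; returns 0 if all runs pass.
run_until_failure() {
  local max=$1
  shift
  local i
  for ((i = 1; i <= max; i++)); do
    echo "iteration $i"
    "$@" || return 1
  done
  return 0
}

# Trivial demonstration with a command that always succeeds.
run_until_failure 3 true
```

For example, `run_until_failure 10 make -s check -C src/test/recovery1` would rerun a single copy of the test directory until it fails or 10 runs have passed.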

Without your patch, I get:
iteration 2
...
recovery5       #   Failed test 'regression tests pass'
recovery5       #   at t/027_stream_regress.pl line 92.
recovery5       #          got: '256'
recovery5       #     expected: '0'
...
src/test/recovery5/tmp_check/log/regress_log_027_stream_regress contains:
--- .../src/test/regress/expected/stats.out  2023-07-11 20:05:10.536059706 +0300
+++ .../src/test/recovery5/tmp_check/results/stats.out   2023-07-11 20:30:46.790551305 +0300
@@ -1418,7 +1418,7 @@
        :io_sum_vac_strategy_after_reuses > :io_sum_vac_strategy_before_reuses;
  ?column? | ?column?
 ----------+----------
- t        | t
+ t        | f
 (1 row)

With your patch applied, all 10 iterations completed successfully for me.
So it looks like your theory and your fix are correct.

Best regards,
Alexander
