Hi Melanie,
10.07.2023 21:35, Melanie Plageman wrote:
Hi,
Jeff pointed out that one of the pg_stat_io tests has failed a few times
over the past months (here on morepork [1] and more recently here on
francolin [2]).
Failing test diff for those who prefer not to scroll:
+++ /home/bf/bf-build/francolin/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/stats.out 2023-07-07 18:48:25.976313231 +0000
@@ -1415,7 +1415,7 @@ :io_sum_vac_strategy_after_reuses > :io_sum_vac_strategy_before_reuses; ?column? | ?column? ----------+----------
- t | t
+ t | f
My theory about the test failure is that, when there is enough demand
for shared buffers, the flapping test fails because it expects buffer
access strategy *reuses* and concurrent queries already flushed those
buffers before they could be reused. Attached is a patch which I think
will fix the test while keeping some code coverage. If we count
evictions and reuses together, those should have increased.
I managed to reproduce that failure with the attached patch applied
(on master) and with the following script (that effectively multiplies
probability of the failure by 360):
CPPFLAGS="-O0" ./configure -q --enable-debug --enable-cassert --enable-tap-tests && make -s -j`nproc` && make -s check -C src/test/recovery
mkdir -p src/test/recovery00/t
cp src/test/recovery/t/027_stream_regress.pl src/test/recovery00/t/
cp src/test/recovery/Makefile src/test/recovery00/
for ((i=1;i<=9;i++)); do cp -r src/test/recovery00/ src/test/recovery$i; done
for ((i=1;i<=10;i++)); do echo "iteration $i"; NO_TEMP_INSTALL=1 parallel --halt now,fail=1 -j9 --linebuffer --tag make -s check -C src/test/{} ::: recovery1 recovery2 recovery3 recovery4 recovery5 recovery6 recovery7 recovery8 recovery9 || break; done
Without your patch, I get:
iteration 2
...
recovery5 # Failed test 'regression tests pass'
recovery5 # at t/027_stream_regress.pl line 92.
recovery5 # got: '256'
recovery5 # expected: '0'
...
src/test/recovery5/tmp_check/log/regress_log_027_stream_regress contains:
--- .../src/test/regress/expected/stats.out 2023-07-11 20:05:10.536059706 +0300
+++ .../src/test/recovery5/tmp_check/results/stats.out 2023-07-11 20:30:46.790551305 +0300
@@ -1418,7 +1418,7 @@
:io_sum_vac_strategy_after_reuses > :io_sum_vac_strategy_before_reuses;
?column? | ?column?
----------+----------
- t | t
+ t | f
(1 row)
With your patch applied, 10 iterations performed successfully for me.
So it looks like your theory and your fix are correct.
Best regards,
Alexander