Thread: recoveryCheck/008_fsm_truncation is failing on dodo in v14- (due to slow fsync?)

Hello hackers,

This week dodo started failing on the 008_fsm_truncation test sporadically
due to pg_ctl timeout. For example, [1] (REL_14_STABLE):
### Starting node "standby"
# Running: pg_ctl -D /media/pi/250gb/proj/bf2/v17/buildroot/REL_14_STABLE/pgsql.build/src/test/recovery/tmp_check/t_008_fsm_truncation_standby_data/pgdata -l /media/pi/250gb/proj/bf2/v17/buildroot/REL_14_STABLE/pgsql.build/src/test/recovery/tmp_check/log/008_fsm_truncation_standby.log -o --cluster-name=standby start
waiting for server to start...........................................................................................................................
stopped waiting
pg_ctl: server did not start in time
# pg_ctl start failed; logfile:
2024-06-19 21:27:30.843 ACST [13244:1] LOG:  starting PostgreSQL 14.12 on armv7l-unknown-linux-gnueabihf, compiled by gcc (GCC) 15.0.0 20240617 (experimental), 32-bit
2024-06-19 21:27:31.768 ACST [13244:2] LOG:  listening on Unix socket "/media/pi/250gb/proj/bf2/v17/buildroot/tmp/vLgcHgvc7O/.s.PGSQL.50013"
2024-06-19 21:27:35.055 ACST [13246:1] LOG:  database system was interrupted; last known up at 2024-06-19 21:26:55 ACST

A successful run between two failures [2]:
2024-06-21 05:17:43.102 ACST [18033:1] LOG:  database system was interrupted; last known up at 2024-06-21 05:17:31 ACST
2024-06-21 05:18:06.718 ACST [18033:2] LOG:  entering standby mode
(That is, that start took around 20 seconds.)

We can also find a longer "successful" duration at [3]:
008_fsm_truncation_standby.log:
2024-06-19 23:11:13.854 ACST [26202:1] LOG:  database system was interrupted; last known up at 2024-06-19 23:11:02 ACST
2024-06-19 23:12:07.115 ACST [26202:2] LOG:  entering standby mode
(That start took almost a minute.)

So it doesn't seem impossible for this operation to last for more than two
minutes.

The facts that SyncDataDirectory() is executed between these two log
messages, that 008_fsm_truncation is the only test which turns fsync on,
and that we see no such failures in newer branches (because of a7f417107)
make me suspect that dodo is slow at fsync.
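
For a concrete number, the duration of that phase can be pulled out of the
standby log by diffing the timestamps of those two messages (for runs where
both of them made it into the log). A quick sketch, assuming GNU date and the
log file name from above:

$ log=008_fsm_truncation_standby.log
$ t1=$(awk '/database system was interrupted/ {print $1" "$2; exit}' "$log")
$ t2=$(awk '/entering standby mode/ {print $1" "$2; exit}' "$log")
$ echo "$(( $(date -d "${t2%.*}" +%s) - $(date -d "${t1%.*}" +%s) )) seconds"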

I managed to reproduce a similar fsync degradation (this start operation
reached a duration of 40 seconds) on a slow armv7 device with an SD card,
which gets significantly slower after many test runs unless fstrim is
executed periodically.

So maybe fstrim could help dodo too...
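
For reference, the periodic fstrim amounts to something like the following
(assuming the filesystem and the drive actually support discard; many distros
also ship an fstrim.timer systemd unit that does the same on a schedule), with
dodo's mount point used only as an example:

$ sudo fstrim -v /media/pi/250gb
$ # then re-check the fsync speed, e.g.:
$ dd if=/dev/zero of=/media/pi/250gb/test count=1024 oflag=dsync bs=128k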

On the other hand, backporting a7f417107 could fix the issue too, but I'm
afraid we'll still see other tests (027_stream_regress) failing, as in [4].
When similar failures were observed on Andres Freund's animals, Andres
came to the conclusion that they were caused by fsync too ([5]).
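
Independently of a7f417107, v14 already has recovery_init_sync_method, and
setting it to syncfs on the standby might shrink the SyncDataDirectory() phase
on Linux. Just a thought, sketched below with $PGDATA standing for the
standby's data directory (not something I've verified on dodo):

$ echo "recovery_init_sync_method = syncfs" >> "$PGDATA/postgresql.conf"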

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-06-19%2010%3A15%3A08
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-06-20%2018%3A30%3A53
[3] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=dodo&dt=2024-06-19%2012%3A30%3A51&stg=recovery-check
[4] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-06-21%2018%3A31%3A11
[5] https://www.postgresql.org/message-id/20240321063953.oyfojyq3wbc77xxb%40awork3.anarazel.de

Best regards,
Alexander




On Sat, 22 Jun 2024 at 18:30, Alexander Lakhin <exclusion@gmail.com> wrote:
So it doesn't seem impossible for this operation to last for more than two
minutes.

The facts that SyncDataDirectory() is executed between these two log
messages, that 008_fsm_truncation is the only test which turns fsync on,
and that we see no such failures in newer branches (because of a7f417107)
make me suspect that dodo is slow at fsync.


Not sure if it helps, but I can confirm that dodo is used for multiple tasks and that
it is using a (slow) external USB3 disk. Also, while using dodo last week (for
something unrelated), I noticed iotop showing ~30MB/s of IO and a 1-minute load
average of around 7.

Right now (while dodo is idle), dd confirms that ~30MB/s is pretty much the maximum:

pi@pi4:/media/pi/250gb $ dd if=/dev/zero of=./test count=1024 oflag=direct bs=128k
1024+0 records in
1024+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 4.51225 s, 29.7 MB/s

pi@pi4:/media/pi/250gb $ dd if=/dev/zero of=./test count=1024 oflag=dsync bs=128k
1024+0 records in
1024+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 24.4916 s, 5.5 MB/s
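
If it's of any use, fio (when installed) gives a more fsync-focused number than
sequential dd, including per-fdatasync latency percentiles; a rough sketch of
what I'd run there (I haven't run this on dodo yet):

$ fio --name=dsync-probe --directory=/media/pi/250gb --rw=write --bs=8k \
      --size=64M --fdatasync=1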

-
robins
22.06.2024 12:00, Alexander Lakhin wrote:
> On the other hand, backporting a7f417107 could fix the issue too, but I'm
> afraid we'll still see other tests (027_stream_regress) failing, as in [4].
> When similar failures were observed on Andres Freund's animals, Andres
> came to the conclusion that they were caused by fsync too ([5]).
>

It seems to me that another dodo failure [1] has the same cause:
t/001_emergency_vacuum.pl .. ok
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 2.
t/002_limits.pl ............
Dubious, test returned 29 (wstat 7424, 0x1d00)
All 2 subtests passed
t/003_wraparounds.pl ....... ok

Test Summary Report
-------------------
t/002_limits.pl          (Wstat: 7424 Tests: 2 Failed: 0)
   Non-zero exit status: 29
   Parse errors: No plan found in TAP output
Files=3, Tests=10, 4235 wallclock secs ( 0.10 usr  0.13 sys + 18.05 cusr 12.76 csys = 31.04 CPU)
Result: FAIL

Unfortunately, the buildfarm log doesn't contain regress_log_002_limits,
but I managed to reproduce the failure on the device mentioned above,
when its storage is as slow as:
$ dd if=/dev/zero of=./test count=1024 oflag=dsync bs=128k
1024+0 records in
1024+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 33.9446 s, 4.0 MB/s

The test fails as below:
[15:36:04.253](729.754s) ok 1 - warn-limit reached
#### Begin standard error
psql:<stdin>:1: WARNING:  database "postgres" must be vacuumed within 37483631 transactions
HINT:  To avoid XID assignment failures, execute a database-wide VACUUM in that database.
You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
#### End standard error
[15:36:45.220](40.968s) ok 2 - stop-limit
[15:36:45.222](0.002s) # issuing query via background psql: COMMIT
IPC::Run: timeout on timer #1 at /usr/share/perl5/IPC/Run.pm line 2944.

It looks like this bump (introduced in [2]) is not enough for machines that
are this slow:
# Bump the query timeout to avoid false negatives on slow test systems.
my $psql_timeout_secs = 4 * $PostgreSQL::Test::Utils::timeout_default;
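
One way to accommodate such machines without patching the test is to raise the
base timeout itself, since PostgreSQL::Test::Utils::timeout_default honors the
PG_TEST_TIMEOUT_DEFAULT environment variable. For a make-based build this would
look roughly like the following (the value 600 is just an illustration):

$ PG_TEST_TIMEOUT_DEFAULT=600 make -C src/test/modules/xid_wraparound check PROVE_TESTS=t/002_limits.pl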

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-06-20%2007%3A18%3A46
[2] https://www.postgresql.org/message-id/CAD21AoBKBVkXyEwkApSUqN98CuOWw%3DYQdbkeE6gGJ0zH7z-TBw%40mail.gmail.com

Best regards,
Alexander



On Sun, 23 Jun 2024 at 22:30, Alexander Lakhin <exclusion@gmail.com> wrote:
 
Unfortunately, the buildfarm log doesn't contain regress_log_002_limits,
but I managed to reproduce the failure on the device mentioned above,
when its storage is as slow as:
$ dd if=/dev/zero of=./test count=1024 oflag=dsync bs=128k
1024+0 records in
1024+0 records out
134217728 bytes (134 MB, 128 MiB) copied, 33.9446 s, 4.0 MB/s


Over the past week or so, I have tried to space out all the other tasks on the machine,
so as to ensure that the 1-minute load average stays mostly below 2 (and thus that not
many things are hammering the disk), and with that I have seen zero failures these past
few days. This isn't conclusive by any means, but it does seem that reducing IO
contention has helped remove the errors, in line with what Alexander suspects and
reproduces here.
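
For what it's worth, anyone wanting to watch that contention while the tests run
could use something like this (assuming sysstat is installed) to see per-device
utilisation and await times at five-second intervals:

$ iostat -dxm 5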

Just a note that I've now reverted some of those recent changes, so if the theory
holds true, I wouldn't be surprised to see some of these errors reappear on dodo.

-
robins