recoveryCheck/008_fsm_truncation is failing on dodo in v14- (due to slow fsync?) - Mailing list pgsql-hackers
From | Alexander Lakhin |
---|---|
Subject | recoveryCheck/008_fsm_truncation is failing on dodo in v14- (due to slow fsync?) |
Date | |
Msg-id | 95ca84ca-39b4-f6aa-260f-da5f73d05a90@gmail.com Whole thread Raw |
Responses |
Re: recoveryCheck/008_fsm_truncation is failing on dodo in v14- (due to slow fsync?)
Re: recoveryCheck/008_fsm_truncation is failing on dodo in v14- (due to slow fsync?) |
List | pgsql-hackers |
Hello hackers, This week dodo started failing on the 008_fsm_truncation test sporadically due to pg_ctl timeout. For example, [1] (REL_14_STABLE): ### Starting node "standby" # Running: pg_ctl -D /media/pi/250gb/proj/bf2/v17/buildroot/REL_14_STABLE/pgsql.build/src/test/recovery/tmp_check/t_008_fsm_truncation_standby_data/pgdata -l /media/pi/250gb/proj/bf2/v17/buildroot/REL_14_STABLE/pgsql.build/src/test/recovery/tmp_check/log/008_fsm_truncation_standby.log -o --cluster-name=standby start waiting for server to start........................................................................................................................... stopped waiting pg_ctl: server did not start in time # pg_ctl start failed; logfile: 2024-06-19 21:27:30.843 ACST [13244:1] LOG: starting PostgreSQL 14.12 on armv7l-unknown-linux-gnueabihf, compiled by gcc (GCC) 15.0.0 20240617 (experimental), 32-bit 2024-06-19 21:27:31.768 ACST [13244:2] LOG: listening on Unix socket "/media/pi/250gb/proj/bf2/v17/buildroot/tmp/vLgcHgvc7O/.s.PGSQL.50013" 2024-06-19 21:27:35.055 ACST [13246:1] LOG: database system was interrupted; last known up at 2024-06-19 21:26:55 ACST A successful run between two failures [2]: 2024-06-21 05:17:43.102 ACST [18033:1] LOG: database system was interrupted; last known up at 2024-06-21 05:17:31 ACST 2024-06-21 05:18:06.718 ACST [18033:2] LOG: entering standby mode (That is, that start took around 20 seconds.) We can also find longer "successful" duration at [3]: 008_fsm_truncation_standby.log: 2024-06-19 23:11:13.854 ACST [26202:1] LOG: database system was interrupted; last known up at 2024-06-19 23:11:02 ACST 2024-06-19 23:12:07.115 ACST [26202:2] LOG: entering standby mode (That start took almost a minute.) So it doesn't seem impossible for this operation to last for more than two minutes. The facts that SyncDataDirectory() is executed between these two messages logged, 008_fsm_truncation is the only test which turns fsync on, and we see no such failures in newer branches (because of a7f417107), make me suspect that dodo is slow on fsync. I managed to reproduce similar fsync degradation (and reached 40 seconds duration of this start operation) on a slow armv7 device with a SD card, which getting significantly slower after many test runs without executing fstrim periodically. So maybe fstrim could help dodo too... On the other hand, backporting a7f417107 could fix the issue too, but I'm afraid we'll still see other tests (027_stream_regress) failing like [4]. When similar failures were observed on Andres Freund's animals, Andres came to conclusion that they were caused by fsync too ([5]). [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-06-19%2010%3A15%3A08 [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-06-20%2018%3A30%3A53 [3] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=dodo&dt=2024-06-19%2012%3A30%3A51&stg=recovery-check [4] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-06-21%2018%3A31%3A11 [5] https://www.postgresql.org/message-id/20240321063953.oyfojyq3wbc77xxb%40awork3.anarazel.de Best regards, Alexander
pgsql-hackers by date: