Thread: stats.sql fails during installcheck on mac
Hi, When running "make installcheck" on my mac, I ran into a failure: """ @@ -1459,7 +1459,7 @@ OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs; ?column? ---------- - t + f (1 row) -- Change the tablespace so that the table is rewritten directly, then SELECT """ which occurs when regress/sql/stats.sql is checking for # of fsync calls for wal writesto disk. This test was added in a051e71e28a1 """ SELECT current_setting('fsync') = 'off' OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs; """ However, because on mac, the default wal_sync_method is open_datasync, there is no fsync() or similar call being issued when wal is synced to disk. I think we should modify the test to only check for the failing stat only if wal_sync_method = fdatasync|fsync|fsync_writethrough and the io object is 'wal'. See attached patch with a simple fix to this test. "make check" does not fail because it runs with fsync = off [1] Also, The documentation for pg_stat_wal already makes this point clear in [0] that "wal_sync is only incremented when the wal_sync_method is either fdatasync, fsync or fsync_writethrough". Perhaps, the same clarification will be beneficial for the pg_stat_io.fsyncs* fields? [0] https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-PG-STAT-WAL-VIEW [1] https://github.com/postgres/postgres/blob/master/src/test/regress/pg_regress.c#L2485 -- Sami Imseih Amazon Web Services (AWS)
Attachment
Sami Imseih <samimseih@gmail.com> writes: > When running "make installcheck" on my mac, I ran into a failure: > @@ -1459,7 +1459,7 @@ > OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs; > ?column? > ---------- > - t > + f > (1 row) Hmm, that's a little nasty, because it's not showing up in the buildfarm. It appears from a little testing that the issue only manifests if you have fsync = on, which we generally don't on buildfarm animals. That concerns me independently of this specific failure, because it calls into question how realistic our testing of things like fsync statistics really is. Anyway, back to the patch: there are multiple places in stats.sql that are effectively disabling tests if fsync = off, not only this one. Why does only this one need the exception? Should we be rethinking or tightening some of the others? regards, tom lane
> Hmm, that's a little nasty, because it's not showing up in the > buildfarm. It appears from a little testing that the issue only > manifests if you have fsync = on, which we generally don't on > buildfarm animals. right, "make check" does not encounter this because it runs with fsync=off, as I mentioned at the top of the thread. > That concerns me independently of this specific failure, because > it calls into question how realistic our testing of things like > fsync statistics really is. I agree, the 3 tests in stats.sql that do this SELECT current_setting('fsync') = 'off' OR some stats test are skipping over the fsync counters test altogether. Perhaps we should enable fsync for these specific parts of the test? it's SIGHUP > Anyway, back to the patch: there are multiple places in > stats.sql that are effectively disabling tests if fsync = off, > not only this one. Why does only this one need the exception? > Should we be rethinking or tightening some of the others? IIUC, this is only an issue for wal syncing """ postgres=# select n into t from generate_series(1, 10000) as n; SELECT 10000 postgres=# checkpoint; checkpoint; CHECKPOINT CHECKPOINT postgres=# SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs, object FROM pg_stat_io WHERE context = 'normal' group by object; writes | fsyncs | object --------+--------+--------------- 7 | 0 | wal 0 | | temp relation 79 | 32 | relation (3 rows) postgres=# show wal_sync_method ; wal_sync_method ----------------- open_datasync (1 row) """ so only the below test with object = 'wal' needs to be tightened SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs FROM pg_stat_io WHERE context = 'normal' AND object = 'wal' \gset io_sum_wal_normal_after_ -- Sami Imseih Amazon Web Services (AWS)
On Thu, Apr 10, 2025 at 06:45:36PM -0500, Sami Imseih wrote: > IIUC, this is only an issue for wal syncing Yes, good catch. I have missed this effect of issue_xlog_fsync(), which has two callers. The first one in XLogWrite() never happens if wal_sync_method is open_sync or open_datasync. The second call just relies on the sync to be discarded internally by issue_xlog_fsync(). > so only the below test with object = 'wal' needs to be tightened > > SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs > FROM pg_stat_io > WHERE context = 'normal' AND object = 'wal' \gset io_sum_wal_normal_after_ My mistake here. Sorry about that. SELECT current_setting('fsync') = 'off' + OR current_setting('wal_sync_method') NOT IN ('fdatasync', 'fsync', 'fsync_writethrough') The code in xlog.c filters out the syncs for WAL_SYNC_METHOD_OPEN and WAL_SYNC_METHOD_OPEN_DSYNC, wouldn't it be more consistent to do the same in the code and the SQL test, using an IN clause with the two values that block the syncs rather than a NOT IN clause with the three values that allow the syncs? This translates to the attached, which is the same as your patch, but this is more consistent with the code. -- Michael
Attachment
On Thu, Apr 10, 2025 at 04:39:45PM -0500, Sami Imseih wrote: > Also, The documentation for pg_stat_wal already makes this point clear > in [0] that "wal_sync is only > incremented when the wal_sync_method is either fdatasync, fsync or > fsync_writethrough". > > Perhaps, the same clarification will be beneficial for the > pg_stat_io.fsyncs* fields? wal_sync and wal_sync_time (and the write parts) have been moved from pg_stat_wal to pg_stat_io under the 'wal' object, and these details with wal_sync_method are documented here: https://www.postgresql.org/docs/devel/wal-configuration.html The docs of pg_stat_io include a paragraph that link to the WAL configuration: https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-IO-VIEW `For the object wal, fsyncs and fsync_time track the fsync.. blah.` The current format is intentional to reduce the amount of duplication in the docs regarding this level of details to one place: the section that describes in details the internals of the WAL I/O activity. -- Michael
Attachment
> The code in xlog.c filters out the syncs for WAL_SYNC_METHOD_OPEN and > WAL_SYNC_METHOD_OPEN_DSYNC, wouldn't it be more consistent to do the > same in the code and the SQL test, using an IN clause with the two > values that block the syncs rather than a NOT IN clause with the three > values that allow the syncs? I actually originally had it this way, but for some reason felt it would be better to be explicit about the methods we want to test rather than not test. I can't think of a very compelling reason to go either way, so v2 LGTM. >> Hmm, that's a little nasty, because it's not showing up in the >> buildfarm. It appears from a little testing that the issue only >> manifests if you have fsync = on, which we generally don't on > >buildfarm animals. > right, "make check" does not encounter this because it runs > with fsync=off, as I mentioned at the top of the thread. what do you think of this? I think we should set fsync = on at least for the part of the test that proceeds the 2 checkpoints and set if back to off at the end of the tests for fsync stats. It is concerning the tests for the fsync stats are not being exercised in the buildfarm. -- Sami Imseih Amazon Web Services (AWS)
On Fri, Apr 11, 2025 at 10:44:59AM -0500, Sami Imseih wrote: > I actually originally had it this way, but for some reason > felt it would be better to be explicit about the methods we want to test rather > than not test. I can't think of a very compelling reason to go either way, so v2 > LGTM. I will proceed with v2 then, thanks. > what do you think of this? I think we should set fsync = on > at least for the part of the test that proceeds the 2 checkpoints and > set if back to off at the end of the tests for fsync stats. It is concerning > the tests for the fsync stats are not being exercised in > the buildfarm. One thing I fear here is the impact for animals with little capacity, like PIs and the like. On the other hand, I could just switch one of my animals to use fsync = on on at least one branch. -- Michael
Attachment
> > what do you think of this? I think we should set fsync = on > > at least for the part of the test that proceeds the 2 checkpoints and > > set if back to off at the end of the tests for fsync stats. It is concerning > > the tests for the fsync stats are not being exercised in > > the buildfarm. > > One thing I fear here is the impact for animals with little capacity, > like PIs and the like. On the other hand, I could just switch one of > my animals to use fsync = on on at least one branch. Yes, there should be some tests running for these stats, so if it's possible to enable fsync on one or a few animals, that will be better than nothing. -- Sami
On Fri, Apr 11, 2025 at 07:37:49PM -0500, Sami Imseih wrote: > Yes, there should be some tests running for these stats, > so if it's possible to enable fsync on one or a few animals, that > will be better than nothing. I have just done that on batta that only tests HEAD, that's a start. -- Michael