Thread: The stats.sql test is failing sporadically in v14- on POWER7/AIX 7.1 buildfarm animals

Hello hackers,

Yesterday, the buildfarm animal sungazer was benevolent enough to
demonstrate a rare anomaly related to the old stats collector [1]:
test stats                        ... FAILED   469155 ms

========================
  1 of 212 tests failed.
========================

--- /home/nm/farm/gcc64/REL_14_STABLE/pgsql.build/src/test/regress/expected/stats.out 2022-03-30 01:18:17.000000000 +0000
+++ /home/nm/farm/gcc64/REL_14_STABLE/pgsql.build/src/test/regress/results/stats.out 2024-07-30 09:49:39.000000000 +0000
@@ -165,11 +165,11 @@
    WHERE relname like 'trunc_stats_test%' order by relname;
        relname      | n_tup_ins | n_tup_upd | n_tup_del | n_live_tup | n_dead_tup
  -------------------+-----------+-----------+-----------+------------+------------
-  trunc_stats_test  |         3 |         0 |         0 |          0 |          0
-  trunc_stats_test1 |         4 |         2 |         1 |          1 |          0
-  trunc_stats_test2 |         1 |         0 |         0 |          1 |          0
-  trunc_stats_test3 |         4 |         0 |         0 |          2 |          2
-  trunc_stats_test4 |         2 |         0 |         0 |          0 |          2
+  trunc_stats_test  |         0 |         0 |         0 |          0 |          0
+  trunc_stats_test1 |         0 |         0 |         0 |          0 |          0
+  trunc_stats_test2 |         0 |         0 |         0 |          0 |          0
+  trunc_stats_test3 |         0 |         0 |         0 |          0 |          0
+  trunc_stats_test4 |         0 |         0 |         0 |          0 |          0
...

inst/logfile contains:
2024-07-30 09:25:11.225 UTC [63307946:1] LOG:  using stale statistics instead of current ones because stats collector is not responding
2024-07-30 09:25:11.345 UTC [11206724:559] pg_regress/create_index LOG:  using stale statistics instead of current ones because stats collector is not responding
...
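
For anyone triaging similar reports, here is a minimal sketch (not part of
the buildfarm tooling) of how one might count this message in a saved server
log. The function name and the sample text are hypothetical; only the message
string is taken from the log excerpt above. Whitespace is collapsed before
matching, so a message that got wrapped across lines is still found.

```python
import re

# Message text copied from the log excerpt in this report.
STALE_STATS_MSG = (
    "using stale statistics instead of current ones "
    "because stats collector is not responding"
)


def count_stale_stats_warnings(log_text: str) -> int:
    """Count occurrences of the stale-statistics message in a log.

    Runs of whitespace (including line breaks) are collapsed to a single
    space first, so wrapped messages are counted as well.
    """
    normalized = re.sub(r"\s+", " ", log_text)
    return normalized.count(STALE_STATS_MSG)


if __name__ == "__main__":
    # Hypothetical sample: one wrapped message and one unwrapped message.
    sample = (
        "2024-07-30 09:25:11.225 UTC [63307946:1] LOG:  using stale statistics\n"
        "instead of current ones because stats collector is not responding\n"
        "2024-07-30 09:25:11.345 UTC [11206724:559] pg_regress/create_index LOG:  "
        "using stale statistics instead of current ones because stats collector "
        "is not responding\n"
    )
    print(count_stale_stats_warnings(sample))
```

A non-zero count next to a stats.sql failure suggests the same anomaly, though
it does not prove it.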

That's not the only failure of that kind that occurred on sungazer; there were
also [2] (REL_13_STABLE), [3] (REL_13_STABLE), [4] (REL_12_STABLE).
Moreover, such failures were produced by all the other POWER7/AIX 7.1
animals: hornet ([5], [6]), tern ([7], [8]), mandrill ([9], [10], ...).
But I could not find such failures coming from POWER8 animals: hoverfly
(running AIX 7200-04-03-2038), ayu, boa, chub, and I did not encounter such
anomalies on x86 or ARM platforms.

Thus, it looks like this stats collector issue happens only on this
specific platform, and given [11], I think such failures should perhaps
just be ignored for the next two years (until v14 EOL), unless AIX 7.1
is upgraded and we see them on a vendor-supported OS version.

So I'm parking this information here just for reference.

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2024-07-30%2003%3A49%3A35
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2023-02-09%2009%3A29%3A10
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2022-06-16%2009%3A52%3A47
[4] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2023-12-13%2003%3A40%3A42
[5] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2024-03-29%2005%3A27%3A09
[6] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2024-03-19%2002%3A09%3A07
[7] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2022-12-16%2009%3A17%3A38
[8] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2021-04-01%2003%3A09%3A38
[9] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2021-04-05%2004%3A22%3A17
[10] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2021-07-12%2004%3A31%3A37
[11] https://www.postgresql.org/message-id/3154146.1697661946%40sss.pgh.pa.us

Best regards,
Alexander



On Wed, Jul 31, 2024 at 02:00:00PM +0300, Alexander Lakhin wrote:
> --- /home/nm/farm/gcc64/REL_14_STABLE/pgsql.build/src/test/regress/expected/stats.out 2022-03-30 01:18:17.000000000 +0000
> +++ /home/nm/farm/gcc64/REL_14_STABLE/pgsql.build/src/test/regress/results/stats.out 2024-07-30 09:49:39.000000000 +0000
> @@ -165,11 +165,11 @@
>    WHERE relname like 'trunc_stats_test%' order by relname;
>         relname      | n_tup_ins | n_tup_upd | n_tup_del | n_live_tup | n_dead_tup
>   -------------------+-----------+-----------+-----------+------------+------------
> -  trunc_stats_test  |         3 |         0 |         0 |          0 |          0
> -  trunc_stats_test1 |         4 |         2 |         1 |          1 |          0
> -  trunc_stats_test2 |         1 |         0 |         0 |          1 |          0
> -  trunc_stats_test3 |         4 |         0 |         0 |          2 |          2
> -  trunc_stats_test4 |         2 |         0 |         0 |          0 |          2
> +  trunc_stats_test  |         0 |         0 |         0 |          0 |          0
> +  trunc_stats_test1 |         0 |         0 |         0 |          0 |          0
> +  trunc_stats_test2 |         0 |         0 |         0 |          0 |          0
> +  trunc_stats_test3 |         0 |         0 |         0 |          0 |          0
> +  trunc_stats_test4 |         0 |         0 |         0 |          0 |          0
> ...
> 
> inst/logfile contains:
> 2024-07-30 09:25:11.225 UTC [63307946:1] LOG:  using stale statistics instead of current ones because stats collector is not responding
> 2024-07-30 09:25:11.345 UTC [11206724:559] pg_regress/create_index LOG:  using stale statistics instead of current ones because stats collector is not responding
> ...

> I could not find such failures coming from POWER8 animals: hoverfly
> (running AIX 7200-04-03-2038), ayu, boa, chub, and I did not encounter such
> anomalies on x86 or ARM platforms.

The animals you list as affected share a filesystem.  The failure arises from
that filesystem's slow metadata operations.

> Thus, it looks like this stats collector issue happens only on this
> specific platform, and given [11], I think such failures should perhaps
> just be ignored for the next two years (until v14 EOL), unless AIX 7.1
> is upgraded and we see them on a vendor-supported OS version.

This has happened on non-POWER, I/O-constrained machines.  Still, I have been
ignoring these failures.  The stats subsystem was designed to drop stats
updates at times, which was always at odds with the need for stable tests.  So
the failures indicate a defect in the test, not a defect in the backend.
Stabilizing this test was a known benefit of the new stats implementation.
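
To make the "dropped updates versus stable tests" point concrete, here is a
toy model, not PostgreSQL code.  The function name and the delivered/dropped
flags are hypothetical; the table names and insert counts are the expected
n_tup_ins values from the diff quoted above.  It only illustrates that a test
comparing exact counters cannot pass if the reporting path is allowed to lose
updates.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def collector_totals(
    updates: Iterable[Tuple[str, int]], delivered: Iterable[bool]
) -> Dict[str, int]:
    """Sum per-table insert counts, counting only updates that were delivered.

    A dropped update still registers the table, but contributes zero rows.
    """
    totals: Counter = Counter()
    for (table, inserted), ok in zip(updates, delivered):
        totals[table] += inserted if ok else 0
    return dict(totals)


# Expected n_tup_ins values from the regression diff (assumed here to arrive
# as one update per table, which is a simplification).
UPDATES = [
    ("trunc_stats_test", 3),
    ("trunc_stats_test1", 4),
    ("trunc_stats_test2", 1),
    ("trunc_stats_test3", 4),
    ("trunc_stats_test4", 2),
]

if __name__ == "__main__":
    # Every update delivered: counters match the expected output.
    print(collector_totals(UPDATES, [True] * len(UPDATES)))
    # No update delivered: every counter reads zero, as in the failing run.
    print(collector_totals(UPDATES, [False] * len(UPDATES)))
```

The backend is behaving as designed in both runs; only the second one fails
an exact-match expected file.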