Thread: Recent pg_rewind test failures in buildfarm
In the last day or so, both skink and mamba have hit this in the pg_rewind test suite [1][2]: #3 0x01f03f7c in ExceptionalCondition (conditionName=conditionName@entry=0x2119c4c "pending_since == 0", fileName=fileName@entry=0x2119454"pgstat.c", lineNumber=lineNumber@entry=734) at assert.c:66 #4 0x01d7994c in pgstat_report_stat (force=force@entry=true) at pgstat.c:734 #5 0x01d7a248 in pgstat_shutdown_hook (code=<optimized out>, arg=<optimized out>) at pgstat.c:615 #6 0x01d1e7c0 in shmem_exit (code=code@entry=0) at ipc.c:243 #7 0x01d1e92c in proc_exit_prepare (code=code@entry=0) at ipc.c:198 #8 0x01d1e9fc in proc_exit (code=0) at ipc.c:111 #9 0x01cd5bd0 in ProcessRepliesIfAny () at walsender.c:2294 #10 0x01cd8084 in WalSndLoop (send_data=send_data@entry=0x1cd7628 <XLogSendPhysical>) at walsender.c:2790 #11 0x01cd8f40 in StartReplication (cmd=0xfd048210) at walsender.c:960 #12 exec_replication_command (cmd_string=cmd_string@entry=0xfd53c098 "START_REPLICATION 0/3000000 TIMELINE 2") at walsender.c:2135 #13 0x01d5bd84 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4767 That assert appears to be several years old, and the 008_min_recovery_point.pl test script that's triggering it hasn't changed very recently either, so I'm baffled where to start digging. It has the odor of a timing problem, so maybe we just started hitting this by chance. Still ... anybody have an idea? regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-04-13%2018%3A55%3A03 [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2025-04-15%2001%3A00%3A04
Hello Tom, 15.04.2025 05:58, Tom Lane wrote: > In the last day or so, both skink and mamba have hit this > in the pg_rewind test suite [1][2]: > > #3 0x01f03f7c in ExceptionalCondition (conditionName=conditionName@entry=0x2119c4c "pending_since == 0", fileName=fileName@entry=0x2119454"pgstat.c", lineNumber=lineNumber@entry=734) at assert.c:66 > ... > > That assert appears to be several years old, and the > 008_min_recovery_point.pl test script that's triggering it hasn't > changed very recently either, so I'm baffled where to start digging. > It has the odor of a timing problem, so maybe we just started hitting > this by chance. Still ... anybody have an idea? > > [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-04-13%2018%3A55%3A03 > [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2025-04-15%2001%3A00%3A04 > FWIW, there was also another test (028_row_filter.pl) failed due to that assert triggered: [3]. I've managed to reproduce this failure locally when running 10 instances of 008_min_recovery_point.pl in parallel under Valgrind. Will try to investigate the issue during this week. [3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-04-11%2007%3A41%3A36 Best regards, Alexander Lakhin Neon (https://neon.tech)
Hi, On Tue, Apr 15, 2025 at 08:00:00AM +0300, Alexander Lakhin wrote: > Hello Tom, > > 15.04.2025 05:58, Tom Lane wrote: > > In the last day or so, both skink and mamba have hit this > > in the pg_rewind test suite [1][2]: > > > > #3 0x01f03f7c in ExceptionalCondition (conditionName=conditionName@entry=0x2119c4c "pending_since == 0", fileName=fileName@entry=0x2119454"pgstat.c", lineNumber=lineNumber@entry=734) at assert.c:66 > > ... > > > > That assert appears to be several years old, and the > > 008_min_recovery_point.pl test script that's triggering it hasn't > > changed very recently either, so I'm baffled where to start digging. > > It has the odor of a timing problem, so maybe we just started hitting > > this by chance. Still ... anybody have an idea? Sorry, can't look at the details right now but it might be linked to 039549d70f6 which is recent enough and in this area. Will give it a look once I've time. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Apr 14, 2025 at 10:58:28PM -0400, Tom Lane wrote: > That assert appears to be several years old, and the > 008_min_recovery_point.pl test script that's triggering it hasn't > changed very recently either, so I'm baffled where to start digging. > It has the odor of a timing problem, so maybe we just started hitting > this by chance. Still ... anybody have an idea? Likely 039549d70f6a. The proc_exit(0) calls done at this stage are reporting that we have non-flushed stats, but we should have some remaining around. This sequence is part of the pgstat_shutdown_hook callback, which has been initialized by pgstat_initialize() done for the WAL senders in BaseInit(). It looks like we are going to need to be more aggressive with the stats flushes, so as we have nothing left when reaching this stage of the shutdown in ProcessRepliesIfAny(). (Sorry for the fuzzy reply, I cannot dive into all the details now, that's just a quick take on the matter so I may be wrong with my assumptions.) -- Michael
Attachment
Hi, On 2025-04-14 22:58:28 -0400, Tom Lane wrote: > In the last day or so, both skink and mamba have hit this > in the pg_rewind test suite [1][2]: > > #3 0x01f03f7c in ExceptionalCondition (conditionName=conditionName@entry=0x2119c4c "pending_since == 0", fileName=fileName@entry=0x2119454"pgstat.c", lineNumber=lineNumber@entry=734) at assert.c:66 > #4 0x01d7994c in pgstat_report_stat (force=force@entry=true) at pgstat.c:734 > #5 0x01d7a248 in pgstat_shutdown_hook (code=<optimized out>, arg=<optimized out>) at pgstat.c:615 > #6 0x01d1e7c0 in shmem_exit (code=code@entry=0) at ipc.c:243 > #7 0x01d1e92c in proc_exit_prepare (code=code@entry=0) at ipc.c:198 > #8 0x01d1e9fc in proc_exit (code=0) at ipc.c:111 > #9 0x01cd5bd0 in ProcessRepliesIfAny () at walsender.c:2294 > #10 0x01cd8084 in WalSndLoop (send_data=send_data@entry=0x1cd7628 <XLogSendPhysical>) at walsender.c:2790 > #11 0x01cd8f40 in StartReplication (cmd=0xfd048210) at walsender.c:960 > #12 exec_replication_command (cmd_string=cmd_string@entry=0xfd53c098 "START_REPLICATION 0/3000000 TIMELINE 2") at walsender.c:2135 > #13 0x01d5bd84 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4767 > > That assert appears to be several years old, and the > 008_min_recovery_point.pl test script that's triggering it hasn't > changed very recently either, so I'm baffled where to start digging. > It has the odor of a timing problem, so maybe we just started hitting > this by chance. Still ... anybody have an idea? See also https://postgr.es/m/dwrkeszz6czvtkxzr5mqlciy652zau5qqnm3cp5f3p2po74ppk%40omg4g3cc6dgq I am fairly certain it's the fault of 039549d70f6 Greetings, Andres Freund
Hi, On 2025-04-15 06:04:54 -0400, Andres Freund wrote: > See also https://postgr.es/m/dwrkeszz6czvtkxzr5mqlciy652zau5qqnm3cp5f3p2po74ppk%40omg4g3cc6dgq Huh, turns out my emails of the last few days had gotten stuck. I only noticed when trying to open the above link and it was dead. I had been wondering why neither Bertrand nor Michael had reacted to the mail... Greetings, Andres Freund