Thread: Recent pg_rewind test failures in buildfarm

Recent pg_rewind test failures in buildfarm

From
Tom Lane
Date:
In the last day or so, both skink and mamba have hit this
in the pg_rewind test suite [1][2]:

#3  0x01f03f7c in ExceptionalCondition (conditionName=conditionName@entry=0x2119c4c "pending_since == 0",
fileName=fileName@entry=0x2119454"pgstat.c", lineNumber=lineNumber@entry=734) at assert.c:66 
#4  0x01d7994c in pgstat_report_stat (force=force@entry=true) at pgstat.c:734
#5  0x01d7a248 in pgstat_shutdown_hook (code=<optimized out>, arg=<optimized out>) at pgstat.c:615
#6  0x01d1e7c0 in shmem_exit (code=code@entry=0) at ipc.c:243
#7  0x01d1e92c in proc_exit_prepare (code=code@entry=0) at ipc.c:198
#8  0x01d1e9fc in proc_exit (code=0) at ipc.c:111
#9  0x01cd5bd0 in ProcessRepliesIfAny () at walsender.c:2294
#10 0x01cd8084 in WalSndLoop (send_data=send_data@entry=0x1cd7628 <XLogSendPhysical>) at walsender.c:2790
#11 0x01cd8f40 in StartReplication (cmd=0xfd048210) at walsender.c:960
#12 exec_replication_command (cmd_string=cmd_string@entry=0xfd53c098 "START_REPLICATION 0/3000000 TIMELINE 2") at
walsender.c:2135
#13 0x01d5bd84 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4767

That assert appears to be several years old, and the
008_min_recovery_point.pl test script that's triggering it hasn't
changed very recently either, so I'm baffled where to start digging.
It has the odor of a timing problem, so maybe we just started hitting
this by chance.  Still ... anybody have an idea?

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-04-13%2018%3A55%3A03
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2025-04-15%2001%3A00%3A04



Re: Recent pg_rewind test failures in buildfarm

From
Alexander Lakhin
Date:
Hello Tom,

15.04.2025 05:58, Tom Lane wrote:
> In the last day or so, both skink and mamba have hit this
> in the pg_rewind test suite [1][2]:
>
> #3  0x01f03f7c in ExceptionalCondition (conditionName=conditionName@entry=0x2119c4c "pending_since == 0",
fileName=fileName@entry=0x2119454"pgstat.c", lineNumber=lineNumber@entry=734) at assert.c:66
 
> ...
>
> That assert appears to be several years old, and the
> 008_min_recovery_point.pl test script that's triggering it hasn't
> changed very recently either, so I'm baffled where to start digging.
> It has the odor of a timing problem, so maybe we just started hitting
> this by chance.  Still ... anybody have an idea?
>
> [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-04-13%2018%3A55%3A03
> [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2025-04-15%2001%3A00%3A04
>

FWIW, there was also another test (028_row_filter.pl) failed due to that
assert triggered: [3].

I've managed to reproduce this failure locally when running 10 instances
of 008_min_recovery_point.pl in parallel under Valgrind.

Will try to investigate the issue during this week.

[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-04-11%2007%3A41%3A36

Best regards,
Alexander Lakhin
Neon (https://neon.tech)



Re: Recent pg_rewind test failures in buildfarm

From
Bertrand Drouvot
Date:
Hi,

On Tue, Apr 15, 2025 at 08:00:00AM +0300, Alexander Lakhin wrote:
> Hello Tom,
> 
> 15.04.2025 05:58, Tom Lane wrote:
> > In the last day or so, both skink and mamba have hit this
> > in the pg_rewind test suite [1][2]:
> > 
> > #3  0x01f03f7c in ExceptionalCondition (conditionName=conditionName@entry=0x2119c4c "pending_since == 0",
fileName=fileName@entry=0x2119454"pgstat.c", lineNumber=lineNumber@entry=734) at assert.c:66
 
> > ...
> > 
> > That assert appears to be several years old, and the
> > 008_min_recovery_point.pl test script that's triggering it hasn't
> > changed very recently either, so I'm baffled where to start digging.
> > It has the odor of a timing problem, so maybe we just started hitting
> > this by chance.  Still ... anybody have an idea?

Sorry, can't look at the details right now but it might be linked to 
039549d70f6 which is recent enough and in this area. Will give it a look once
I've time.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Recent pg_rewind test failures in buildfarm

From
Michael Paquier
Date:
On Mon, Apr 14, 2025 at 10:58:28PM -0400, Tom Lane wrote:
> That assert appears to be several years old, and the
> 008_min_recovery_point.pl test script that's triggering it hasn't
> changed very recently either, so I'm baffled where to start digging.
> It has the odor of a timing problem, so maybe we just started hitting
> this by chance.  Still ... anybody have an idea?

Likely 039549d70f6a.  The proc_exit(0) calls done at this stage are
reporting that we have non-flushed stats, but we should have some
remaining around.  This sequence is part of the pgstat_shutdown_hook
callback, which has been initialized by pgstat_initialize() done for
the WAL senders in BaseInit().  It looks like we are going to need to
be more aggressive with the stats flushes, so as we have nothing left
when reaching this stage of the shutdown in ProcessRepliesIfAny().

(Sorry for the fuzzy reply, I cannot dive into all the details now,
that's just a quick take on the matter so I may be wrong with my
assumptions.)
--
Michael

Attachment

Re: Recent pg_rewind test failures in buildfarm

From
Andres Freund
Date:
Hi,

On 2025-04-14 22:58:28 -0400, Tom Lane wrote:
> In the last day or so, both skink and mamba have hit this
> in the pg_rewind test suite [1][2]:
> 
> #3  0x01f03f7c in ExceptionalCondition (conditionName=conditionName@entry=0x2119c4c "pending_since == 0",
fileName=fileName@entry=0x2119454"pgstat.c", lineNumber=lineNumber@entry=734) at assert.c:66
 
> #4  0x01d7994c in pgstat_report_stat (force=force@entry=true) at pgstat.c:734
> #5  0x01d7a248 in pgstat_shutdown_hook (code=<optimized out>, arg=<optimized out>) at pgstat.c:615
> #6  0x01d1e7c0 in shmem_exit (code=code@entry=0) at ipc.c:243
> #7  0x01d1e92c in proc_exit_prepare (code=code@entry=0) at ipc.c:198
> #8  0x01d1e9fc in proc_exit (code=0) at ipc.c:111
> #9  0x01cd5bd0 in ProcessRepliesIfAny () at walsender.c:2294
> #10 0x01cd8084 in WalSndLoop (send_data=send_data@entry=0x1cd7628 <XLogSendPhysical>) at walsender.c:2790
> #11 0x01cd8f40 in StartReplication (cmd=0xfd048210) at walsender.c:960
> #12 exec_replication_command (cmd_string=cmd_string@entry=0xfd53c098 "START_REPLICATION 0/3000000 TIMELINE 2") at
walsender.c:2135
> #13 0x01d5bd84 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4767
> 
> That assert appears to be several years old, and the
> 008_min_recovery_point.pl test script that's triggering it hasn't
> changed very recently either, so I'm baffled where to start digging.
> It has the odor of a timing problem, so maybe we just started hitting
> this by chance.  Still ... anybody have an idea?

See also https://postgr.es/m/dwrkeszz6czvtkxzr5mqlciy652zau5qqnm3cp5f3p2po74ppk%40omg4g3cc6dgq

I am fairly certain it's the fault of 039549d70f6

Greetings,

Andres Freund



Re: Recent pg_rewind test failures in buildfarm

From
Andres Freund
Date:
Hi,

On 2025-04-15 06:04:54 -0400, Andres Freund wrote:
> See also https://postgr.es/m/dwrkeszz6czvtkxzr5mqlciy652zau5qqnm3cp5f3p2po74ppk%40omg4g3cc6dgq

Huh, turns out my emails of the last few days had gotten stuck. I only noticed
when trying to open the above link and it was dead.  I had been wondering why
neither Bertrand nor Michael had reacted to the mail...

Greetings,

Andres Freund