
From: Andres Freund
Subject: Re: Improving connection scalability: GetSnapshotData()
Msg-id: 20200816181604.l54m6kss5ntd6xow@alap3.anarazel.de
In response to: Re: Improving connection scalability: GetSnapshotData() (Andres Freund <andres@anarazel.de>)
Responses: Re: Improving connection scalability: GetSnapshotData() (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
Hi,

On 2020-08-15 09:42:00 -0700, Andres Freund wrote:
> On 2020-08-15 11:10:51 -0400, Tom Lane wrote:
> > We have two essentially identical buildfarm failures since these patches
> > went in:
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=damselfly&dt=2020-08-15%2011%3A27%3A32
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=peripatus&dt=2020-08-15%2003%3A09%3A14
> >
> > They're both in the same place in the freeze-the-dead isolation test:
> 
> > TRAP: FailedAssertion("!TransactionIdPrecedes(members[i].xid, cutoff_xid)", File: "heapam.c", Line: 6051)
> > 0x9613eb <ExceptionalCondition+0x5b> at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
> > 0x52d586 <heap_prepare_freeze_tuple+0x926> at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
> > 0x53bc7e <heap_vacuum_rel+0x100e> at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
> > 0x6949bb <vacuum_rel+0x25b> at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
> > 0x694532 <vacuum+0x602> at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
> > 0x693d1c <ExecVacuum+0x37c> at /home/pgbuildfarm/buildroot/HEAD/inst/bin/postgres
> > 0x8324b3
> > ...
> > 2020-08-14 22:16:41.783 CDT [78410:4] LOG:  server process (PID 80395) was terminated by signal 6: Abort trap
> > 2020-08-14 22:16:41.783 CDT [78410:5] DETAIL:  Failed process was running: VACUUM FREEZE tab_freeze;
> >
> > peripatus has successes since this failure, so it's not fully reproducible
> > on that machine.  I'm suspicious of a timing problem in computing vacuum's
> > cutoff_xid.
> 
> Hm, maybe it's something around what I observed in
> https://www.postgresql.org/message-id/20200723181018.neey2jd3u7rfrfrn%40alap3.anarazel.de
> 
> I.e. that somehow we end up with hot pruning and freezing coming to a
> different determination, and trying to freeze a hot tuple.
> 
> I'll try to add a few additional asserts here, and burn some cpu tests
> trying to trigger the issue.
> 
> I gotta escape the heat in the house for a few hours though (no AC
> here), so I'll not look at the results till later this afternoon, unless
> it triggers soon.
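
(For anyone following along without the code at hand: the Assert that fired
checks, with wraparound-aware xid comparison, that a multixact member xid is
not older than vacuum's freeze cutoff. The sketch below is a simplified,
self-contained stand-in for that comparison, only meant to illustrate the
semantics; the real TransactionIdPrecedes() lives in
src/backend/access/transam/transam.c.)

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;
#define FirstNormalTransactionId ((TransactionId) 3)

/*
 * Simplified stand-in for TransactionIdPrecedes(): the permanent xids
 * (invalid/bootstrap/frozen) compare linearly, normal xids compare
 * modulo 2^32.
 */
static bool
xid_precedes(TransactionId id1, TransactionId id2)
{
    if (id1 < FirstNormalTransactionId || id2 < FirstNormalTransactionId)
        return id1 < id2;
    return (int32_t) (id1 - id2) < 0;
}

int
main(void)
{
    TransactionId cutoff_xid = 1000;

    /* a member xid even slightly behind the cutoff trips the assert's condition */
    printf("%d\n", xid_precedes(998, cutoff_xid));   /* prints 1 */
    printf("%d\n", xid_precedes(1000, cutoff_xid));  /* prints 0 */
    return 0;
}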

690 successful runs later, it didn't trigger for me :(. Seems pretty clear
that there's some variable at play other than pure chance; otherwise that
number of runs should have hit the issue, given the ratio of buildfarm hits
to buildfarm runs.
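
Back-of-the-envelope, with a purely hypothetical per-run failure probability p:

    P(no hit in 690 runs) = (1 - p)^690, e.g. p = 1% gives 0.99^690 ≈ 0.001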

My current plan is to push a bit of additional instrumentation to help
narrow down the issue. We can decide afterwards which of it we'd like to
keep longer term, and which not.
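
For illustration, the sort of thing I have in mind (a sketch only, assuming
members[i].xid and cutoff_xid from the assertion message are in scope at that
point, and not necessarily what will end up getting pushed) is to turn the
failing Assert into an error that reports the xids involved, so a buildfarm
hit would show how the member xid relates to the cutoff:

    /* sketch only: report the offending xids instead of just aborting */
    if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
        elog(ERROR, "multixact member xid %u unexpectedly precedes freeze cutoff xid %u",
             members[i].xid, cutoff_xid);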

Greetings,

Andres Freund


