RE: Random pg_upgrade test failure on drongo - Mailing list pgsql-hackers

From Hayato Kuroda (Fujitsu)
Subject RE: Random pg_upgrade test failure on drongo
Msg-id TY3PR01MB98894D8BE99AE53217C96C0AF56A2@TY3PR01MB9889.jpnprd01.prod.outlook.com
In response to Re: Random pg_upgrade test failure on drongo  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Random pg_upgrade test failure on drongo
List pgsql-hackers
Dear Amit, Alexander,

> > We get the effect discussed when the background writer process decides to
> > flush a file buffer for pg_largeobject during stage 1.
> > (Thus, if a checkpoint somehow happened to occur during CREATE DATABASE,
> > the result must be the same.)
> > And another important factor is shared_buffers = 1MB (set during the test).
> > With the default setting of 128MB I couldn't see the failure.
> >
> > It can be reproduced easily (on old Windows versions) just by running
> > pg_upgrade in a loop (I've got failures on iterations 22, 37, 17 (with the
> > default cluster)).
> > If an old cluster contains dozens of databases, this increases the failure
> > probability significantly (with 10 additional databases I've got failures
> > on iterations 4, 1, 6).
> >
> 
> I don't have an old Windows environment to test but I agree with your
> analysis and theory. The question is what should we do for these new
> random BF failures? I think we should set bgwriter_lru_maxpages to 0
> and checkpoint_timeout to 1hr for these new tests. Doing some invasive
> fix as part of this doesn't sound reasonable because this is an
> existing problem and there seems to be another patch by Thomas that
> probably deals with the root cause of the existing problem [1] as
> pointed out by you.
> 
> [1] - https://commitfest.postgresql.org/40/3951/

Based on Amit's suggestion, I have created a patch that takes the alternative
approach: it only changes GUC settings for the tests. The failure was reported
only for 003_logical_slots, but the patch also changes the recently added test,
004_subscription. IIUC, 004 could fail in the same way.
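
For reference, the two settings Amit suggested would look roughly like the
fragment below when written in postgresql.conf form. This is only a sketch of
the idea, not the contents of the patch; how the tests actually apply the
settings to the old cluster is an assumption here.

```conf
# Sketch only: settings suggested upthread for the old cluster in the tests.
# Stop the background writer from flushing dirty buffers on its own.
bgwriter_lru_maxpages = 0
# Make a timed checkpoint during the test run very unlikely.
checkpoint_timeout = 1h
```

The intent is that neither the background writer nor a timed checkpoint
flushes a pg_largeobject buffer in the middle of the upgrade steps.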

As far as we understand, this patch should stop the random failures. Alexander,
could you test it to confirm?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


