Hi,
On 2022-05-20 01:25:10 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2022-05-20 00:22:14 -0400, Tom Lane wrote:
> >> There's some fallout in the expected-file, of course, but this
> >> does seem to fix it (20 consecutive successful runs now at
> >> 100/2). Don't see why though ...
>
> > I think what might be happening is that the transactional stats updates get
> > reported by s2 *before* the non-transactional stats updates come in from
> > s1. I.e. the pgstat_report_stat() at the end of s2_commit_prepared_a does a
> > report, because the machine is slow enough for it to be "time to reports stats
> > again". Then s1 reports its non-transactional stats.
>
> Sounds plausible. And I left the test loop running, and it's now past
> 100 consecutive successes, so I think this change definitely "fixes" it.
FWIW, the problem can be reliably reproduced by sticking a
pgstat_force_next_flush() into pgstat_twophase_postcommit(). This is the only
failure when doing so.
> > It looks like our stats maintenance around truncation isn't quite "concurrency
> > safe". That code hasn't meaningfully changed, but it'd not be surprising if
> > it's not 100% precise...
>
> Yeah. Probably not something to try to improve post-beta, especially
> since it's not completely clear how transactional and non-transactional
> cases *should* interact.
Yea. It's also not normally particularly crucial to be accurate down to that
degree.
> Maybe non-transactional updates should be
> pushed immediately? But I'm not sure if that's fully correct, and
> it definitely sounds expensive.
I think that'd be far too expensive - the majority of stats are
non-transactional...
I think what we could do is to model truncates as subtracting the number of
live/dead rows the truncating backend knows about, rather than setting them to
0. But that of course could incur other inaccuracies.
> I'd be good with tweaking this test case as you suggest, and maybe
> revisiting the topic later.
Pushed the change of the test. Christoph, just to make sure, can you confirm
that this fixes the test instability for you?
> Kyotaro-san worried about whether any other places in stats.spec
> have the same issue. I've not seen any evidence of that in my
> tests, but perhaps some other machine with different timing
> could find it.
I tried to find some by putting in forced flushes in a bunch of places before,
and now some more, without finding further cases.
Greetings,
Andres Freund