Re: snapshot too old issues, first around wraparound and then more. - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: snapshot too old issues, first around wraparound and then more.
Date
Msg-id CA+hUKG+h1waWw0MHMkY_cvA2zxo9jn81TOJz9js6on0F4vcPHA@mail.gmail.com
Whole thread Raw
In response to Re: snapshot too old issues, first around wraparound and then more.  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: snapshot too old issues, first around wraparound and then more.
List pgsql-hackers
On Mon, Apr 13, 2020 at 2:58 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Apr 3, 2020 at 2:22 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I think that it's worth considering whether or not there are a
> > significant number of "snapshot too old" users that rarely or never
> > rely on old snapshots used by new queries. Kevin said that this
> > happens "in some cases", but how many cases? Might it be that many
> > "snapshot too old" users could get by with a version of the feature
> > that makes the most conservative possible assumptions, totally giving
> > up on the idea of differentiating which blocks are truly safe to
> > access with an "old" snapshot? (In other words, one that assumes that
> > they're *all* unsafe for an "old" snapshot.)
> >
> > I'm thinking of a version of "snapshot too old" that amounts to a
> > statement timeout that gets applied for xmin horizon type purposes in
> > the conventional way, while only showing an error to the client if and
> > when they access literally any buffer (though not when the relation is
> > a system catalog). Is it possible that something along those lines is
> > appreciably better than nothing to users? If it is, and if we can find
> > a way to manage the transition, then maybe we could tolerate
> > supporting this greatly simplified implementation of "snapshot too
> > old".
>
> Interesting idea.  I'm keen to try prototyping it to see how well it
> works out it practice.  Let me know soon if you already have designs
> on that and I'll get out of your way, otherwise I'll give it a try and
> share what I come up with.

Here's a quick and dirty test patch of that idea (or my understanding
of it), just for experiments.  It introduces snapshot->expire_time and
a new timer SNAPSHOT_TIMEOUT to cause the next CHECK_FOR_INTERRUPTS()
to set snapshot->too_old on any active or registered snapshots whose
time has come, and then try to advance MyPgXact->xmin, without
considering the ones marked too old.  That gets rid of the concept of
"early pruning".  You can use just regular pruning, because the
snapshot is no longer holding the regular xmin back.  Then
TestForOldSnapshot() becomes simply if (snapshot->too_old)
ereport(...).

There are certainly some rough edges, missed details and bugs in here,
not least the fact (pointed out to me by Andres in an off-list chat)
that we sometimes use short-lived snapshots without registering them;
we'd have to fix that.  It also does nothing to ensure that
TestForOldSnapshot() is actually called at all the right places, which
is still required for correct results.

If those problems can be fixed, you'd have a situation where
snapshot-too-old is a coarse grained, blunt instrument that
effectively aborts your transaction even if the whole cluster is
read-only.  I am not sure if that's really truly useful to anyone (ie
if these ODBC cursor users would be satisfied; I'm not sure I
understand that use case).

Hmm.  I suppose it must be possible to put the LSN check back: if
(snapshot->too_old && PageGetLSN(page) > snapshot->lsn) ereport(...).
Then the granularity would be the same as today -- block level -- but
the complexity is transferred from the pruning side (has to deal with
xid time map) to the snapshot-owning side (has to deal with timers,
CFI() and make sure all snapshots are registered).  Maybe not a great
deal, and maybe not easier than fixing the existing bugs.

One problem is all the new setitimer() syscalls.  I feel like that
could be improved, as could statement_timeout, by letting existing
timers run rather than repeatedly rescheduling eagerly, so that eg a 1
minute timeout never gets rescheduled more than once per minute.  I
haven't looked into that, but I guess it's no worse than the existing
implement's overheads anyway.

PS in the patch the GUC is interpreted as milliseconds, which is more
fun for testing but it should really be minutes like before.

Attachment

pgsql-hackers by date:

Previous
From: Fujii Masao
Date:
Subject: Re: backup manifests
Next
From: Kyotaro Horiguchi
Date:
Subject: Re: Race condition in SyncRepGetSyncStandbysPriority