Re: snapshot too old issues, first around wraparound and then more. - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: snapshot too old issues, first around wraparound and then more.
Date
Msg-id CAH2-WzmM5wmuw8xMmapBFkSx5puMg4oU7r8PvncC2e=VUJ-Tcg@mail.gmail.com
Whole thread Raw
In response to Re: snapshot too old issues, first around wraparound and then more.  (Andres Freund <andres@anarazel.de>)
Responses Re: snapshot too old issues, first around wraparound and then more.  (Amit Kapila <amit.kapila16@gmail.com>)
Re: snapshot too old issues, first around wraparound and then more.  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
On Thu, Apr 2, 2020 at 5:17 PM Andres Freund <andres@anarazel.de> wrote:
> Since this is a feature that can result in wrong query results (and
> quite possibly crashes / data corruption), I don't think we can just
> leave this unfixed.  But given the amount of code / infrastructure
> changes required to get this into a working feature, I don't see how we
> can unleash those changes onto the stable branches.

I don't think that the feature can be allowed to remain in anything
like its current form. The current design is fundamentally unsound.

> I don't really know what to do here. Causing problems by neutering a
> feature in the back branch *sucks*. While not quite as bad, removing a
> feature without a replacement in a major release is pretty harsh
> too. But I don't really see any other realistic path forward.

I have an idea that might allow us to insulate some users from the
problem caused by a full revert (or disabling the feature) in the
backbranches. I wouldn't usually make such a radical suggestion, but
the current situation is exceptional. Anything that avoids serious
pain for users deserves to be considered.

Kevin said this about the feature very recently:

"""
Keep in mind that the real goal of this feature is not to eagerly
_see_ "snapshot too old" errors, but to prevent accidental
debilitating bloat due to one misbehaving user connection.  This is
particularly easy to see (and therefore unnervingly common) for those
using ODBC, which in my experience tends to correspond to the largest
companies which are using PostgreSQL.  In some cases, the snapshot
which is preventing removal of the rows will never be used again;
removal of the rows will not actually affect the result of any query,
but only the size and performance of the database.  This is a "soft
limit" -- kinda like max_wal_size.  Where there was a trade-off
between accuracy of the limit and performance, the less accurate way
was intentionally chosen.  I apologize for not making that more clear
in comments.
"""

ODBC uses cursors in rather strange ways, often to implement a kind of
ODBC-level cache. See the description of "Use Declare/Fetch" from
https://odbc.postgresql.org/docs/config.html to get some idea of what
this can look like.

I think that it's worth considering whether or not there are a
significant number of "snapshot too old" users that rarely or never
rely on old snapshots used by new queries. Kevin said that this
happens "in some cases", but how many cases? Might it be that many
"snapshot too old" users could get by with a version of the feature
that makes the most conservative possible assumptions, totally giving
up on the idea of differentiating which blocks are truly safe to
access with an "old" snapshot? (In other words, one that assumes that
they're *all* unsafe for an "old" snapshot.)

I'm thinking of a version of "snapshot too old" that amounts to a
statement timeout that gets applied for xmin horizon type purposes in
the conventional way, while only showing an error to the client if and
when they access literally any buffer (though not when the relation is
a system catalog). Is it possible that something along those lines is
appreciably better than nothing to users? If it is, and if we can find
a way to manage the transition, then maybe we could tolerate
supporting this greatly simplified implementation of "snapshot too
old".

I feel slightly silly for even suggesting this. I have to ask. Maybe
nobody noticed a problem with the feature before now (at least in
part) because they didn't truly care about old snapshots anyway. They
just wanted to avoid a significant impact from buggy code that leaks
cursors and things like that. Or, they were happy as long as they
could still access ODBC's "100 rows in a cache" through the cursor.
The docs say that a old_snapshot_threshold setting in the hours is
about the lowest reasonable setting for production use, which seems
rather high to me. It almost seems as if the feature specifically
targets misbehaving applications already.

-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: WAL usage calculation patch
Next
From: James Coleman
Date:
Subject: Re: [PATCH] Incremental sort (was: PoC: Partial sort)