Re: snapshot too old issues, first around wraparound and then more. - Mailing list pgsql-hackers

From Kevin Grittner
Subject Re: snapshot too old issues, first around wraparound and then more.
Date
Msg-id CACjxUsM70rSgeTMpqkiGQVq6wUU3onJ7L+jSG+kpfRtuD9Z4Cw@mail.gmail.com
Whole thread Raw
In response to Re: snapshot too old issues, first around wraparound and then more.  (Andres Freund <andres@anarazel.de>)
Responses Re: snapshot too old issues, first around wraparound and then more.  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Wed, Apr 1, 2020 at 2:43 PM Andres Freund <andres@anarazel.de> wrote:
The thing that makes me really worried is that the contents of the time
mapping seem very wrong. I've reproduced query results in a REPEATABLE
READ transaction changing (pruned without triggering an error).

That is a very big problem.  On the sort-of bright side (ironic in light of the fact that I'm a big proponent of using serializable transactions), none of the uses that I have personally seen of this feature use anything other than the default READ COMMITTED isolation level.  That might help explain the lack of complaints for those using the feature.  But yeah, I REALLY want to see a solid fix for that!
 
And I've
reproduced rows not getting removed for much longer than than they
should, according to old_snapshot_threshold.

I suspect one reason for users not noticing either is that

a) it's plausible that users of the feature would mostly have
  long-running queries/transactions querying immutable or insert only
  data. Those would not notice that, on other tables, rows are getting
  removed, where access would not trigger the required error.

b) I observe long-ish phases were no cleanup is happening (due to
  oldSnapshotControl->head_timestamp getting updated more often than
  correct). But if old_snapshot_threshold is small enough in relation to
  the time the generated bloat becomes problematic, there will still be
  occasions to actually perform cleanup.

Keep in mind that the real goal of this feature is not to eagerly _see_ "snapshot too old" errors, but to prevent accidental debilitating bloat due to one misbehaving user connection.  This is particularly easy to see (and therefore unnervingly common) for those using ODBC, which in my experience tends to correspond to the largest companies which are using PostgreSQL.  In some cases, the snapshot which is preventing removal of the rows will never be used again; removal of the rows will not actually affect the result of any query, but only the size and performance of the database.  This is a "soft limit" -- kinda like max_wal_size.  Where there was a trade-off between accuracy of the limit and performance, the less accurate way was intentionally chosen.  I apologize for not making that more clear in comments.

While occasional "snapshot too old" errors are an inconvenient side effect of achieving the primary goal, it might be of interest to know that the initial (very large corporate) user of this feature had, under Oracle, intentionally used a cursor that would be held open as long as a user chose to leave a list open for scrolling around.  They used cursor features for as long as the cursor allowed.  This could be left open for days or weeks (or longer?).  Their query ordered by a unique index, and tracked the ends of the currently displayed portion of the list so that if they happened to hit the "snapshot too old" error they could deallocate and restart the cursor and reposition before moving forward or back to the newly requested rows.  They were not willing to convert to PostgreSQL unless this approach continued to work.

In Summary:
(1) It's not urgent that rows always be removed as soon as possible after the threshold is crossed as long as they don't often linger too awfully far past that limit and allow debilitating bloat.
(2) It _is_ a problem if results inconsistent with the snapshot are returned -- a "snapshot too old" error is necessary.
(3) Obviously, wraparound problems need to be solved.

I hope this is helpful.

--
Kevin Grittner
VMware vCenter Server
https://www.vmware.com/

pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: snapshot too old issues, first around wraparound and then more.
Next
From: Robert Haas
Date:
Subject: Re: snapshot too old issues, first around wraparound and then more.