Thread: Allow "snapshot too old" error, to prevent bloat
This patch is related to the "Reduce pinning in btree indexes" patch submitted here:

http://www.postgresql.org/message-id/721615179.3351449.1423959585771.JavaMail.yahoo@mail.yahoo.com

That post describes how the two patches evolved and how they relate; I won't duplicate that here.

Unlike the other patch, this one is more at the "proof of concept" phase, because it requires support in the heap and each index AM to work correctly; so far I have only had time to cover the heap and btree indexes. In spite of that, I have thrown the worst test cases I could think of at it (and only succeeded in uncovering a bug which was already out there in production), and it has shown its value in a two-day test simulating a 300 user load with complex real-world applications (although the only indexes it used were btree indexes). Without the patches the database growth was 39GB per day; with the patches it was 28.5GB per day. (The test does involve more inserts than deletes, so some growth is expected.) At the end of the tests, pgstattuple reported eight times as many dead tuples in the database without the patches. More importantly, without the patches the CPU load started at 60% and showed linear growth to 92% over the course of the first day; with the patches it stayed at a stable 60% throughout the test.

What this patch does is add a GUC called old_snapshot_threshold. It defaults to -1, which leaves behavior matching unpatched code. Above that, it allows tuples to be vacuumed away after the number of transaction IDs specified by the GUC have been consumed. It also saves the current insertion LSN into every snapshot when it is created. When reading from the heap or any index, if the snapshot is vulnerable to showing incorrect data because the threshold has been crossed since it was generated, reading any page with an LSN past the snapshot LSN causes a "snapshot too old" error to be thrown. Since this is LSN-based, the new logic is not used for any relation which is not WAL-logged. Note that if you don't read data from a page which has been modified after your snapshot was taken, the threshold doesn't matter.

All `make installcheck` tests succeed with any setting. With a setting of 0 (the most extreme), `make installcheck-world` sees four isolation tests fail. Those all pass if you raise the setting to 2. The postgres_fdw test needs a setting of 4 to succeed. I would expect most shops would want to tune this to something in the six-digit to eight-digit range. In the tests mentioned above it was set to 150000 (which corresponded to just under 4 minutes of txid consumption) and there were no "snapshot too old" errors, even though some cursors were left open for the entire two-day run.

The patch still lacks (as mentioned above) support for index AMs other than btree, and lacks documentation for the new GUC. I'm sure that there are some comments and README files that need adjustment, too. As I said, this is still POC.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
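A minimal sketch in C of the page-level check described above. The TestForOldSnapshot() name appears later in this thread, but the body shown here, the SnapshotTooOldThresholdCrossed() helper, the snapshot->lsn field, and the error code are illustrative assumptions, not the patch's actual code:

#include "postgres.h"

#include "storage/bufpage.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"

/* The new GUC (a number of transaction IDs in this version of the patch). */
extern int old_snapshot_threshold;

/* Hypothetical helper: has this snapshot aged past the threshold? */
extern bool SnapshotTooOldThresholdCrossed(Snapshot snapshot);

/* Assumed SQLSTATE for the new error; not defined in stock errcodes. */
#define ERRCODE_SNAPSHOT_TOO_OLD  MAKE_SQLSTATE('7','2','0','0','0')

/*
 * Sketch of the page-level check: error out if an old snapshot reads
 * a page stamped with an LSN newer than the LSN the snapshot saved at
 * creation (assumed here to live in snapshot->lsn, a field the patch
 * would add to SnapshotData).
 */
static void
TestForOldSnapshot(Snapshot snapshot, Relation rel, Page page)
{
    if (old_snapshot_threshold < 0)
        return;                 /* -1: unpatched behavior */

    if (!RelationNeedsWAL(rel))
        return;                 /* LSN-based, so WAL-logged relations only */

    if (SnapshotTooOldThresholdCrossed(snapshot) &&
        PageGetLSN(page) > snapshot->lsn)
        ereport(ERROR,
                (errcode(ERRCODE_SNAPSHOT_TOO_OLD),
                 errmsg("snapshot too old")));
}

The -1 and WAL-logged-only cases fall out directly from the description above: with the default setting, or for unlogged relations, the check is a no-op.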
Kevin Grittner <kgrittn@ymail.com> writes:
> What this patch does is add a GUC called old_snapshot_threshold. It
> defaults to -1, which leaves behavior matching unpatched code.
> Above that it allows tuples to be vacuumed away after the number of
> transaction IDs specified by the GUC have been consumed.

TBH, I'm not sure why we'd wish to emulate Oracle's single worst operational feature.

> Unlike the other patch, this one is more at the "proof of concept"
> phase, because it requires support in the heap and each index AM to
> work correctly; so far I have only had time to cover the heap and
> btree indexes.

But, having said that, why would the index AMs care? Seems like what you are describing should be strictly a matter for VACUUM's removal rules. If we're going to have something as ugly as this, I would much rather it had a very small code footprint.

regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Kevin Grittner <kgrittn@ymail.com> writes:
>> What this patch does is add a GUC called old_snapshot_threshold. It
>> defaults to -1, which leaves behavior matching unpatched code.
>> Above that it allows tuples to be vacuumed away after the number of
>> transaction IDs specified by the GUC have been consumed.
>
> TBH, I'm not sure why we'd wish to emulate Oracle's single worst
> operational feature.

I've run into cases where people have suffered horribly bloated databases because of one ill-behaved connection. Some companies don't want to be vulnerable to that and the disruption that recovery from that bloat causes. Note that the default is to maintain legacy PostgreSQL behavior; the postgresql.conf file must be modified to see this error.

>> Unlike the other patch, this one is more at the "proof of concept"
>> phase, because it requires support in the heap and each index AM to
>> work correctly; so far I have only had time to cover the heap and
>> btree indexes.
>
> But, having said that, why would the index AMs care? Seems like what
> you are describing should be strictly a matter for VACUUM's removal
> rules. If we're going to have something as ugly as this, I would much
> rather it had a very small code footprint.

If a tuple is vacuumed away, you would not notice that if you were running an index scan, because you would not visit the page it used to be on to see that the LSN had changed. We tried to sell the idea that *any* scan using a snapshot past the threshold should error out (which is the only way I can see to avoid making the index AMs aware of this), but they insisted that there was no reason to have scans fail when reading an unmodified table using an old snapshot, or even an unmodified page of a modified table.

Due to the cost to this company of modifying their software to deal with different behavior, they will not move without the behavior I'm shooting for with this patch. Due to the value of this customer to us, we will put this (or something to accomplish this behavior) into our fork. We're happy to share it with the community. If the community does not feel they need this behavior, or wants to generate the error more aggressively to avoid touching the index AMs, I understand.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Kevin Grittner <kgrittn@ymail.com> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Kevin Grittner <kgrittn@ymail.com> writes:
>>> What this patch does is add a GUC called old_snapshot_threshold. It
>>> defaults to -1, which leaves behavior matching unpatched code.
>>> Above that it allows tuples to be vacuumed away after the number of
>>> transaction IDs specified by the GUC have been consumed.

>> TBH, I'm not sure why we'd wish to emulate Oracle's single worst
>> operational feature.

> I've run into cases where people have suffered horribly bloated
> databases because of one ill-behaved connection. Some companies
> don't want to be vulnerable to that and the disruption that
> recovery from that bloat causes.

No doubt, preventing bloat is a good thing, but that doesn't mean this is the best API we could create for the issue. The proposition this patch offers to DBAs is: "You can turn this knob to reduce bloat by some hard-to-quantify factor. The cost is that some long-running transactions might fail. You won't know which ones are at risk, the failures won't be the same from time to time, and you won't be able to do anything to spare high-value transactions from that fate except by turning that knob back again globally (which requires a database restart)." Maybe refugees from Oracle will think that sounds good, but nobody else will.

I wonder if we couldn't achieve largely the same positive effects through adding a simple transaction-level timeout option. That would be far easier for users to understand and manage, it would be trivial to allow specific high-value transactions to run with a laxer setting, it does not create any implementation-detail-dependent behavior that we'd be having to explain to users forevermore, and (not incidentally) it would be a lot simpler and more trustworthy to implement. There's no well-defined correlation between your setting and the net effect on database bloat, but that's true with the "snapshot too old" approach as well.

regards, tom lane
On 2/15/15 10:36 AM, Tom Lane wrote:
> Kevin Grittner <kgrittn@ymail.com> writes:
>> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Kevin Grittner <kgrittn@ymail.com> writes:
>>>> What this patch does is add a GUC called old_snapshot_threshold. It
>>>> defaults to -1, which leaves behavior matching unpatched code.
>>>> Above that it allows tuples to be vacuumed away after the number of
>>>> transaction IDs specified by the GUC have been consumed.
>
>>> TBH, I'm not sure why we'd wish to emulate Oracle's single worst
>>> operational feature.
>
>> I've run into cases where people have suffered horribly bloated
>> databases because of one ill-behaved connection. Some companies
>> don't want to be vulnerable to that and the disruption that
>> recovery from that bloat causes.
>
> No doubt, preventing bloat is a good thing, but that doesn't mean this
> is the best API we could create for the issue. The proposition this
> patch offers to DBAs is: "You can turn this knob to reduce bloat by some
> hard-to-quantify factor. The cost is that some long-running transactions
> might fail. You won't know which ones are at risk, the failures won't be
> the same from time to time, and you won't be able to do anything to spare
> high-value transactions from that fate except by turning that knob back
> again globally (which requires a database restart)." Maybe refugees from
> Oracle will think that sounds good, but nobody else will.
>
> I wonder if we couldn't achieve largely the same positive effects through
> adding a simple transaction-level timeout option. That would be far
> easier for users to understand and manage, it would be trivial to allow
> specific high-value transactions to run with a laxer setting, it does not
> create any implementation-detail-dependent behavior that we'd be having to
> explain to users forevermore, and (not incidentally) it would be a lot
> simpler and more trustworthy to implement. There's no well-defined
> correlation between your setting and the net effect on database bloat,
> but that's true with the "snapshot too old" approach as well.

A common use-case is long-running reports hitting relatively stable data in a database that also has tables with a high churn rate (ie: queue tables). In those scenarios your only options right now are to suffer huge amounts of bloat in the high-churn tables or not do your reporting. A simple transaction timeout only "solves" this by denying you reporting queries.

An idea that I've had on this would be some way to "lock down" the tables that a long-running transaction could access. That would allow vacuum to ignore any snapshots that transaction had for tables it wasn't accessing. That's something that would be deterministic.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
Jim Nasby <Jim.Nasby@bluetreble.com> writes:
> On 2/15/15 10:36 AM, Tom Lane wrote:
>> I wonder if we couldn't achieve largely the same positive effects through
>> adding a simple transaction-level timeout option.

> A common use-case is long-running reports hitting relatively stable data
> in a database that also has tables with a high churn rate (ie: queue
> tables). In those scenarios your only options right now are to suffer
> huge amounts of bloat in the high-churn tables or not do your reporting.
> A simple transaction timeout only "solves" this by denying you reporting
> queries.

Agreed, but Kevin's proposal has exactly the same problem only worse, because (a) the reporting transactions might or might not fail (per Murphy, they'll fail consistently only when you're on deadline), and (b) if they do fail, there's nothing you can do short of increasing the slack db-wide.

> An idea that I've had on this would be some way to "lock down" the
> tables that a long-running transaction could access. That would allow
> vacuum to ignore any snapshots that transaction had for tables it wasn't
> accessing. That's something that would be deterministic.

There might be something in that, but again it's not much like this patch. The key point I think we're both making is that nondeterministic failures are bad, especially when you're talking about long-running, expensive-to-retry transactions.

regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jim Nasby <Jim.Nasby@bluetreble.com> writes:
>> On 2/15/15 10:36 AM, Tom Lane wrote:
>>> I wonder if we couldn't achieve largely the same positive effects through
>>> adding a simple transaction-level timeout option.

We suggested this to our customer and got out of the meeting with it looking like it *might* fly. In the next meeting, however, they said they had run it by others and reviewed the code and it was completely unacceptable -- they would not consider pg with this as the solution.

>> A common use-case is long-running reports hitting relatively stable data
>> in a database that also has tables with a high churn rate (ie: queue
>> tables). In those scenarios your only options right now are to suffer
>> huge amounts of bloat in the high-churn tables or not do your reporting.
>> A simple transaction timeout only "solves" this by denying you reporting
>> queries.
>
> Agreed, but Kevin's proposal has exactly the same problem only worse,
> because (a) the reporting transactions might or might not fail (per
> Murphy, they'll fail consistently only when you're on deadline), and
> (b) if they do fail, there's nothing you can do short of increasing the
> slack db-wide.

These they were comfortable with, and did *not* consider to be unpredictable or something they could not do something about. I really don't feel I can say more than that, though, without disclosing more than I should.

>> An idea that I've had on this would be some way to "lock down" the
>> tables that a long-running transaction could access. That would allow
>> vacuum to ignore any snapshots that transaction had for tables it wasn't
>> accessing. That's something that would be deterministic.

While this option was not specifically suggested, based on their reasons that numerous other options we suggested were unacceptable, I feel sure that this would not be acceptable. I think Tom hit the nail on the head when he said "Maybe refugees from Oracle will think that sounds good..." It is precisely those with very large code bases which have been modified over long periods of time to work with Oracle that would find this solution attractive, or perhaps *necessary* to consider a move to Postgres. That's a potential market we don't care to write off.

> There might be something in that, but again it's not much like this patch.
> The key point I think we're both making is that nondeterministic failures
> are bad, especially when you're talking about long-running, expensive-to-
> retry transactions.

What the customer most doesn't want to be "nondeterministic" is whether the error is generated only when the snapshot has been used to read a page which has been modified since the snapshot was taken. If tables or materialized views are set up before a query and then not subsequently modified during the execution of the query, that query must not be canceled even if it runs for days, but it must not cause unconstrained bloat during the run. So far I don't see any way to avoid including the LSN with the snapshot or modifying the index AMs. Let's be clear on the footprint for that; for the btree implementation it is:

 src/backend/access/nbtree/nbtinsert.c |  7 ++++---
 src/backend/access/nbtree/nbtpage.c   |  2 +-
 src/backend/access/nbtree/nbtsearch.c | 43 ++++++++++++++++++++++++++++++++++---------
 src/include/access/nbtree.h           |  7 ++++---
 4 files changed, 43 insertions(+), 16 deletions(-)

What this discussion has made me reconsider is the metric for considering a transaction "too old". The number of transaction IDs consumed seems inferior to LSN or time as the unit of measure for that.

It looks to me to be pretty trivial (on the order of maybe 30 lines of code) to specify this GUC in minutes rather than transaction IDs. At first glance this seems like it would be vulnerable to the usual complaints about mismanaged clocks, but those are easily answered by using a cached time in shared memory that we populate in such a way that it can never move backward. Done right, this could even allow the GUC to be changeable on reload rather than only at restart. A badly mismanaged system clock could not cause a query to generate incorrect results; the worst that could happen is that this feature would fail to control bloat as much as expected, or reads of modified data could generate the "snapshot too old" error around the time of the clock adjustment.

As before, this would default to a magic value to mean that you want the historical PostgreSQL behavior.

If that makes the idea more palatable to anyone, I can submit a patch to that effect within the next 24 hours.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
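The "cached time that can never move backward" idea might be sketched like this in C; the shared-memory struct, its initialization, and the function name are assumptions for illustration only, not code from any patch:

#include "postgres.h"

#include "storage/spin.h"
#include "utils/timestamp.h"

/*
 * Sketch only: a shared-memory cached timestamp that can never move
 * backward, so a badly adjusted system clock cannot make snapshot
 * aging misbehave.
 */
typedef struct OldSnapshotControlData
{
    slock_t     mutex;          /* protects cached_ts */
    TimestampTz cached_ts;      /* latest time used for snapshot aging */
} OldSnapshotControlData;

static OldSnapshotControlData *oldSnapshotControl;  /* lives in shared memory */

static TimestampTz
GetSnapshotCurrentTimestamp(void)
{
    TimestampTz now = GetCurrentTimestamp();

    SpinLockAcquire(&oldSnapshotControl->mutex);
    if (now > oldSnapshotControl->cached_ts)
        oldSnapshotControl->cached_ts = now;    /* normal case: advance */
    else
        now = oldSnapshotControl->cached_ts;    /* clock moved back: hold */
    SpinLockRelease(&oldSnapshotControl->mutex);

    return now;
}

Because every reader takes the maximum of the wall clock and the cached value, a clock stepped backward merely freezes the aging computation until real time catches up, which matches the worst-case behavior described above.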
On 2015-02-16 05:19:11 +0000, Kevin Grittner wrote:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Jim Nasby <Jim.Nasby@bluetreble.com> writes:
>>> On 2/15/15 10:36 AM, Tom Lane wrote:
>>>> I wonder if we couldn't achieve largely the same positive effects through
>>>> adding a simple transaction-level timeout option.
>
> We suggested this to our customer and got out of the meeting with
> it looking like it *might* fly. In the next meeting, however, they
> said they had run it by others and reviewed the code and it was
> completely unacceptable -- they would not consider pg with this as
> the solution.
>
>>> A common use-case is long-running reports hitting relatively stable data
>>> in a database that also has tables with a high churn rate (ie: queue
>>> tables). In those scenarios your only options right now are to suffer
>>> huge amounts of bloat in the high-churn tables or not do your reporting.
>>> A simple transaction timeout only "solves" this by denying you reporting
>>> queries.
>>
>> Agreed, but Kevin's proposal has exactly the same problem only worse,
>> because (a) the reporting transactions might or might not fail (per
>> Murphy, they'll fail consistently only when you're on deadline), and
>> (b) if they do fail, there's nothing you can do short of increasing the
>> slack db-wide.
>
> These they were comfortable with, and did *not* consider to be
> unpredictable or something they could not do something about.
> I really don't feel I can say more than that, though, without
> disclosing more than I should.

I agree that we need to do something about the dangers of long snapshots; I'm not sure this is it. I'm not sure of the contrary either. What I'm definitely not a fan of, though, is this implementation. Having to add checks to a large number of places is a recipe for silently wrong query results.

One thing I was wondering about recently was introducing an optimization for repeatedly updated rows into the vacuum code: a row that has xmin = xmax, where these have committed, can be removed even if the xid is above the xmin horizon -- no other transaction is ever going to see it. While there's some hairy-ness implementing that, it doesn't seem too hard. And there's a fair number of cases where that'd allow less bloat to accumulate. Obviously it'd be better if we had logic to do that for other patterns as well (where the updates aren't in the same xact), but it seems like a good start.

>> There might be something in that, but again it's not much like this patch.
>> The key point I think we're both making is that nondeterministic failures
>> are bad, especially when you're talking about long-running, expensive-to-
>> retry transactions.
>
> What the customer most doesn't want to be "nondeterministic" is
> whether the error is generated only when the snapshot has been used
> to read a page which has been modified since the snapshot was
> taken. If tables or materialized views are set up before a query
> and then not subsequently modified during the execution of the
> query, that query must not be canceled even if it runs for days,
> but it must not cause unconstrained bloat during the run. So far I
> don't see any way to avoid including the LSN with the snapshot or
> modifying the index AMs. Let's be clear on the footprint for that;
> for the btree implementation it is:

IIUC you never would want cancellations when accessing the tables these long-running backends actually access. The errors about too-old snapshots are just a stopgap because we can't compute the xmin per table in a meaningful way atm. Right? Is the bloat caused by rows these transactions actually can see, or are the not-removed rows newer than the transaction's xmax?

Since you actually don't want cancellations for the long-running reporting queries, it very much might be sensible to switch to a more complicated version of HeapTupleSatisfiesVacuum() if there are long-running queries -- one that can recognize whether rows are actually visible to any backend, or whether there are just snapshots that see older rows. I've previously wondered how hard this would be, but I don't think it's *that* hard. And it'd be a much more general solution.

Greetings,

Andres Freund
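The xmin = xmax idea might be sketched as follows, deliberately ignoring the hairy details alluded to above (subtransactions, multixacts, combo CIDs, locked-only xmax); this is an illustration of the idea, not vacuum code:

#include "postgres.h"

#include "access/htup_details.h"
#include "access/transam.h"

/*
 * Sketch: a committed row version created and deleted by the same
 * transaction was never visible to any other transaction, so vacuum
 * could remove it even when its xid is above the global xmin horizon.
 */
static bool
DeadToAllRegardlessOfHorizon(HeapTupleHeader tuple)
{
    TransactionId xmin = HeapTupleHeaderGetXmin(tuple);
    TransactionId xmax = HeapTupleHeaderGetRawXmax(tuple);

    return TransactionIdIsNormal(xmin) &&
           TransactionIdEquals(xmin, xmax) &&
           TransactionIdDidCommit(xmin);
}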
On 2/16/15 2:41 AM, Andres Freund wrote:
> Since you actually don't want cancellations for the long-running
> reporting queries, it very much might be sensible to switch to a more
> complicated version of HeapTupleSatisfiesVacuum() if there are
> long-running queries -- one that can recognize whether rows are
> actually visible to any backend, or whether there are just snapshots
> that see older rows. I've previously wondered how hard this would be,
> but I don't think it's *that* hard. And it'd be a much more general
> solution.

Josh Berkus mentioned on IRC that they're working on improving vacuum, and I think it was something along these lines.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Sun, Feb 15, 2015 at 11:36:50AM -0500, Tom Lane wrote:
> I wonder if we couldn't achieve largely the same positive effects through
> adding a simple transaction-level timeout option. That would be far
> easier for users to understand and manage, it would be trivial to allow
> specific high-value transactions to run with a laxer setting, it does not
> create any implementation-detail-dependent behavior that we'd be having to
> explain to users forevermore, and (not incidentally) it would be a lot
> simpler and more trustworthy to implement. There's no well-defined
> correlation between your setting and the net effect on database bloat,
> but that's true with the "snapshot too old" approach as well.

While we have prided ourselves on not generating query cancel messages due to snapshot-too-old, there is the downside of bloat. I understand the need to allow users to make the trade-off between bloat and query cancellation.

It seems we already have a mechanism in place that allows tuning of query cancel on standbys vs. preventing standby queries from seeing old data, specifically max_standby_streaming_delay/max_standby_archive_delay. We obsessed about how users were going to react to these odd variables, but there has been little negative feedback.

--
Bruce Momjian <bruce@momjian.us>        http://momjian.us
EnterpriseDB                            http://enterprisedb.com

+ Everyone has their own god. +
On 2015-02-16 16:35:46 -0500, Bruce Momjian wrote:
> It seems we already have a mechanism in place that allows tuning of
> query cancel on standbys vs. preventing standby queries from seeing old
> data, specifically
> max_standby_streaming_delay/max_standby_archive_delay. We obsessed
> about how users were going to react to these odd variables, but there
> has been little negative feedback.

FWIW, I think that's a somewhat skewed perception. I think it was right to introduce those, because we didn't really have any alternatives.

But max_standby_streaming_delay, max_standby_archive_delay and hot_standby_feedback are among the most frequent triggers for questions and complaints that I/we see.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
<p dir="ltr"><br /> On Feb 17, 2015 12:26 AM, "Andres Freund" <<a href="mailto:andres@2ndquadrant.com">andres@2ndquadrant.com</a>>wrote:<br /> ><br /> > On 2015-02-16 16:35:46 -0500,Bruce Momjian wrote:<br /> > > It seems we already have a mechanism in place that allows tuning of<br /> >> query cancel on standbys vs. preventing standby queries from seeing old<br /> > > data, specifically<br />> > max_standby_streaming_delay/max_standby_archive_delay. We obsessed<br /> > > about how users were goingto react to these odd variables, but there<br /> > > has been little negative feedback.<br /> ><br /> >FWIW, I think that's a somewhat skewed perception. I think it was right to<br /> > introduce those, because we didn'treally have any alternatives.<br /> ><br /> > But max_standby_streaming_delay, max_standby_archive_delay and<br/> > hot_standby_feedback are among the most frequent triggers for questions<br /> > and complaints that I/wesee.<br /> ><p dir="ltr">Agreed.<p dir="ltr">And a really bad one used to be vacuum_defer_cleanup_age, because ofconfusing units amongst other things. Which in terms seems fairly close to Kevins suggestions, unfortunately. <br /><pdir="ltr">/Magnus <br />
Magnus Hagander <magnus@hagander.net> wrote:
> On Feb 17, 2015 12:26 AM, "Andres Freund" <andres@2ndquadrant.com> wrote:
>> On 2015-02-16 16:35:46 -0500, Bruce Momjian wrote:
>>> It seems we already have a mechanism in place that allows
>>> tuning of query cancel on standbys vs. preventing standby
>>> queries from seeing old data, specifically
>>> max_standby_streaming_delay/max_standby_archive_delay. We
>>> obsessed about how users were going to react to these odd
>>> variables, but there has been little negative feedback.
>>
>> FWIW, I think that's a somewhat skewed perception. I think it
>> was right to introduce those, because we didn't really have any
>> alternatives.

As far as I can see, the "alternatives" suggested so far are all about causing heap bloat to progress more slowly, but still without limit. I suggest, based on a lot of user feedback (from the customer I've talked about at some length on this thread, as well as numerous others), that unlimited bloat based on the activity of one connection is a serious deficiency in the product; and that there is no real alternative to something like a "snapshot too old" error if we want to fix that deficiency. Enhancements to associate a snapshot with a database and use a separate vacuum xmin per database, to avoid limiting vacuum of a particular object because of snapshots that cannot see that object, etc., are all Very Good Things and I hope those changes are made, but none of that fixes a very fundamental flaw.

>> But max_standby_streaming_delay, max_standby_archive_delay and
>> hot_standby_feedback are among the most frequent triggers for
>> questions and complaints that I/we see.
>
> Agreed.
>
> And a really bad one used to be vacuum_defer_cleanup_age, because
> of confusing units amongst other things. Which in turn seems
> fairly close to Kevin's suggestions, unfortunately.

Particularly my initial suggestion, which was to base snapshot "age" on the number of transaction IDs assigned. Does this look any better to you if it is something that can be set to '20min' or '1h'? Just to restate, that would not automatically cancel the snapshots past that age; it would allow vacuum of any tuples which became "dead" that long ago, and would cause a "snapshot too old" message for any read of a page modified more than that long ago using a snapshot which was older than that.

Unless this idea gets some +1 votes Real Soon Now, I will mark the patch as Rejected and move on.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Kevin,

* Kevin Grittner (kgrittn@ymail.com) wrote:
> Magnus Hagander <magnus@hagander.net> wrote:
>> On Feb 17, 2015 12:26 AM, "Andres Freund" <andres@2ndquadrant.com> wrote:
>>> On 2015-02-16 16:35:46 -0500, Bruce Momjian wrote:
>>>> It seems we already have a mechanism in place that allows
>>>> tuning of query cancel on standbys vs. preventing standby
>>>> queries from seeing old data, specifically
>>>> max_standby_streaming_delay/max_standby_archive_delay. We
>>>> obsessed about how users were going to react to these odd
>>>> variables, but there has been little negative feedback.
>>>
>>> FWIW, I think that's a somewhat skewed perception. I think it
>>> was right to introduce those, because we didn't really have any
>>> alternatives.
>
> As far as I can see, the "alternatives" suggested so far are all
> about causing heap bloat to progress more slowly, but still without
> limit. I suggest, based on a lot of user feedback (from the
> customer I've talked about at some length on this thread, as well
> as numerous others), that unlimited bloat based on the activity of
> one connection is a serious deficiency in the product; and that
> there is no real alternative to something like a "snapshot too old"
> error if we want to fix that deficiency. Enhancements to associate
> a snapshot with a database and use a separate vacuum xmin per
> database, to avoid limiting vacuum of a particular object because
> of snapshots that cannot see that object, etc., are all Very Good
> Things and I hope those changes are made, but none of that fixes a
> very fundamental flaw.

I also agree with the general idea that it makes sense to provide a way to control bloat, but I think you've missed what Andres was getting at with his suggestion (if I understand correctly, apologies if I don't).

The problem is that we're only looking at the overall xmin / xmax horizon when it comes to deciding if a given tuple is dead. That's not quite right -- the tuple might actually be dead to all *current* transactions by being newer than the oldest transaction but dead for all later transactions. Basically, there exist gaps between our cluster-wide xmin / xmax where we might find actually dead rows. Marking those rows dead and reusable *would* stop the bloat, not just slow it down. In the end, with a single long-running transaction, the worst bloat you would have is double the size of the system at the time the long-running transaction started.

Another way of thinking about it is with this timeline:

  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  /\               /\              /\             /\
  long-running     row created     row deleted    vacuum
  transaction
  starts

Now, if all transactions currently running started either before the row was created or after the row was deleted, then the row is dead and vacuum can reclaim it. Finding those gaps in the transaction space for all of the currently running processes might not be cheap, but for this kind of a use-case, it might be worth the effort. On a high-churn system where the actual set of rows being copied over constantly isn't very high, you could end up with some duplication of rows -- those to satisfy the ancient transaction and the current working set -- but there wouldn't be any further bloat, and it'd almost certainly be a much better situation than what is being seen now.

> Particularly my initial suggestion, which was to base snapshot
> "age" on the number of transaction IDs assigned. Does this look
> any better to you if it is something that can be set to '20min' or
> '1h'? Just to restate, that would not automatically cancel the
> snapshots past that age; it would allow vacuum of any tuples which
> became "dead" that long ago, and would cause a "snapshot too old"
> message for any read of a page modified more than that long ago
> using a snapshot which was older than that.
>
> Unless this idea gets some +1 votes Real Soon Now, I will mark the
> patch as Rejected and move on.

I'm not against having a knob like this, which is defaulted to off, but I do think we'd be a lot better off with a system that could realize when rows are not visible to any currently running transaction and clean them up. If this knob's default is off then I don't think we'd be likely to get the complaints which are being discussed (or, if we did, we could point the individual at the admin who set the knob...).

Thanks!

Stephen
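The gap test Stephen describes might be sketched like this, under heavy simplifying assumptions (each snapshot is reduced to its xmin, used as a proxy for when it was taken; real code would need each snapshot's full xmin/xip/xmax state):

#include "postgres.h"

#include "access/transam.h"

/*
 * Sketch: a committed row version created by 'xmin' and deleted by
 * 'xmax' is dead to everyone if no active snapshot falls inside its
 * lifetime, even when xmax is above the global xmin horizon.
 * 'snap_xmins' holds the xmin of every active snapshot.
 * Illustrative only.
 */
static bool
DeadInAGap(TransactionId xmin, TransactionId xmax,
           const TransactionId *snap_xmins, int nsnaps)
{
    int         i;

    for (i = 0; i < nsnaps; i++)
    {
        /* A snapshot taken during the row's lifetime may still see it. */
        if (TransactionIdFollowsOrEquals(snap_xmins[i], xmin) &&
            TransactionIdPrecedes(snap_xmins[i], xmax))
            return false;
    }

    return true;                /* row lived and died inside a "gap" */
}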
Stephen Frost <sfrost@snowman.net> wrote:
> I also agree with the general idea that it makes sense to provide a way
> to control bloat, but I think you've missed what Andres was getting at
> with his suggestion (if I understand correctly, apologies if I don't).
>
> The problem is that we're only looking at the overall xmin / xmax
> horizon when it comes to deciding if a given tuple is dead. That's
> not quite right -- the tuple might actually be dead to all *current*
> transactions by being newer than the oldest transaction but dead for all
> later transactions. Basically, there exist gaps between our cluster-wide
> xmin / xmax where we might find actually dead rows. Marking those
> rows dead and reusable *would* stop the bloat, not just slow it down.
>
> In the end, with a single long-running transaction, the worst bloat you
> would have is double the size of the system at the time the long-running
> transaction started.

I agree that limiting bloat to one dead tuple for every live one for each old snapshot is a limit that has value, and it was unfair of me to characterize that as not being a limit. Sorry for that.

This possible solution was discussed with the user whose feedback caused me to write this patch, but there were several reasons they dismissed that as a viable total solution for them, two of which I can share:

(1) They have a pool of connections each of which can have several long-running cursors, so the limit from that isn't just doubling the size of their database, it is limiting it to some two-or-three digit multiple of the necessary size.

(2) They are already prepared to deal with "snapshot too old" errors on queries that run more than about 20 minutes and which access tables which are modified. They would rather do that than suffer the bloat beyond that point.

IMO all of these changes people are working on are very valuable, and complement each other. This particular approach is likely to be especially appealing to those moving from Oracle because it is a familiar mechanism, and one which some of them have written their software to be aware of and deal with gracefully.

> I'm not against having a knob like this, which is defaulted to off,

Thanks! I'm not sure that amounts to a +1, but at least it doesn't sound like a -1. :-)

> but I do think we'd be a lot better off with a system that could
> realize when rows are not visible to any currently running transaction
> and clean them up.

+1

But they are not mutually exclusive; I see them as complementary.

> If this knob's default is off then I don't think
> we'd be likely to get the complaints which are being discussed (or, if
> we did, we could point the individual at the admin who set the knob...).

That's how I see it, too. I would not suggest making the default anything other than "off", but there are situations where it would be a nice tool to have in the toolbox.

Thanks for the feedback!

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Feb 15, 2015 at 8:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> There might be something in that, but again it's not much like this patch.
> The key point I think we're both making is that nondeterministic failures
> are bad, especially when you're talking about long-running, expensive-to-
> retry transactions.
But I think Kevin would agree with you there. That's why he's interested in having the errors not occur if you don't read from the volatile tables. Ie, your reporting query reading from read-only tables would be guaranteed to succeed even if some other table had had some rows vacuumed away.
I'm not sure that's really going to work because of things like hint bit updates or HOT pruning. That is, say your table is read-only now but last week some of the records were updated. You might reasonably expect those rows to be safe, and indeed the rows themselves will still be in your report. But the earlier versions of the rows could still be sitting on some pages, invisible to every query but not vacuumable quite yet, and then suddenly they get vacuumed away and the LSN on the page gets bumped.
Fwiw I would strongly suggest that instead of having a number of transactions we have a time difference. I haven't been following the "commit timestamp" thread, but I'm hoping that patch has the infrastructure needed to look up an LSN and find out the timestamp, and vice versa. So when you take a snapshot you could look up the effective cut-off LSN based on the time difference specified in the GUC. As a DBA I would find it infinitely more comforting specifying "old_snapshot_threshold=4h" than specifying some arbitrary large number of transactions and hoping I calculated the transaction rate reasonably accurately.
greg
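Greg's suggestion might be sketched like this at snapshot-creation time. LookupLSNForTimestamp() is hypothetical and does not exist; the premise is that the commit-timestamp infrastructure could supply the raw material for such a mapping, and old_snapshot_threshold_ms is an assumed reinterpretation of the GUC as a duration:

#include "postgres.h"

#include "access/xlogdefs.h"
#include "utils/timestamp.h"

/* Assumed: the GUC reinterpreted as a duration, in milliseconds. */
extern int old_snapshot_threshold_ms;

/* Hypothetical mapping from a timestamp to a known-safe LSN. */
extern XLogRecPtr LookupLSNForTimestamp(TimestampTz ts);

/* Sketch: derive a snapshot's cutoff LSN from a time-based setting
 * such as old_snapshot_threshold = '4h'. */
static XLogRecPtr
SnapshotCutoffLSN(void)
{
    TimestampTz cutoff =
        TimestampTzPlusMilliseconds(GetCurrentTimestamp(),
                                    -old_snapshot_threshold_ms);

    return LookupLSNForTimestamp(cutoff);
}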
* Kevin Grittner (kgrittn@ymail.com) wrote:
> Stephen Frost <sfrost@snowman.net> wrote:
>> I also agree with the general idea that it makes sense to provide a way
>> to control bloat, but I think you've missed what Andres was getting at
>> with his suggestion (if I understand correctly, apologies if I don't).
>>
>> The problem is that we're only looking at the overall xmin / xmax
>> horizon when it comes to deciding if a given tuple is dead. That's
>> not quite right -- the tuple might actually be dead to all *current*
>> transactions by being newer than the oldest transaction but dead for all
>> later transactions. Basically, there exist gaps between our cluster-wide
>> xmin / xmax where we might find actually dead rows. Marking those
>> rows dead and reusable *would* stop the bloat, not just slow it down.
>>
>> In the end, with a single long-running transaction, the worst bloat you
>> would have is double the size of the system at the time the long-running
>> transaction started.
>
> I agree that limiting bloat to one dead tuple for every live one
> for each old snapshot is a limit that has value, and it was unfair
> of me to characterize that as not being a limit. Sorry for that.
>
> This possible solution was discussed with the user whose feedback
> caused me to write this patch, but there were several reasons they
> dismissed that as a viable total solution for them, two of which I
> can share:
>
> (1) They have a pool of connections each of which can have several
> long-running cursors, so the limit from that isn't just doubling
> the size of their database, it is limiting it to some two-or-three
> digit multiple of the necessary size.

This strikes me as a bit off-the-cuff; was an analysis done which determined that would be the result? If there is overlap between the long-running cursors then there would be less bloat, and most systems which I'm familiar with don't turn the entire database over in 20 minutes, 20 hours, or even 20 days except in pretty specific cases. Perhaps this is one of those, and if so then I'm all wet, but the feeling I get is that this is a way to dismiss this solution because it's not what's wanted, which is "what Oracle did."

> (2) They are already prepared to deal with "snapshot too old"
> errors on queries that run more than about 20 minutes and which
> access tables which are modified. They would rather do that than
> suffer the bloat beyond that point.

That, really, is the crux here -- they've already got support for dealing with it the way Oracle did and they'd like PG to do that too. Unfortunately, that, by itself, isn't a good reason for a particular capability (we certainly aren't going to be trying to duplicate PL/SQL in PG any time soon). That said, there are capabilities in other RDBMSs which are valuable and which we *do* want, so the fact that Oracle does this also isn't a reason to not include it.

> IMO all of these changes people are working on are very valuable, and
> complement each other. This particular approach is likely to be
> especially appealing to those moving from Oracle because it is a
> familiar mechanism, and one which some of them have written their
> software to be aware of and deal with gracefully.

For my 2c, I'd much rather provide them with a system where they don't have to deal with broken snapshots than give them a way to have them the way Oracle provided them. :)

That said, even the approach Andres outlined will cause bloat, and it may be beyond what's acceptable in some environments, and it's certainly more complicated and unlikely to get done in the short term.

>> I'm not against having a knob like this, which is defaulted to off,
>
> Thanks! I'm not sure that amounts to a +1, but at least it doesn't
> sound like a -1. :-)

So, at the time I wrote that, I wasn't sure if it was a +1 or not myself. I've been thinking about it since then, however, and I'm leaning more towards having the capability than not, so perhaps that's a +1, but it doesn't excuse the need to come up with an implementation that everyone can be happy with, and what you've come up with so far doesn't have a lot of appeal, based on the feedback (I've only glanced through it myself, but I agree with Andres and Tom that it's a larger footprint than we'd want for this, and the number of places having to be touched is concerning as it could lead to future bugs).

A lot of that would go away if there was a way to avoid having to mess with the index AMs, I'd think, but I wonder if we'd actually need more done there -- it's not immediately obvious to me how an index-only scan is safe with this. Whenever an actual page is visited, we can check the LSN, and the relation can't be truncated by vacuum since the transaction will still have a lock on the table which prevents it, but does the visibility-map update check make sure to never mark pages all-visible when one of these old transactions is running around? On what basis?

>> but I do think we'd be a lot better off with a system that could
>> realize when rows are not visible to any currently running transaction
>> and clean them up.
>
> +1
>
> But they are not mutually exclusive; I see them as complementary.

I can see how they would be, provided we can be confident that we're going to actually throw an error when the snapshot is out of date and not end up returning incorrect results. We need to be darn sure of that, both now and in a few years from now when many of us may have forgotten about this knob.. ;)

>> If this knob's default is off then I don't think
>> we'd be likely to get the complaints which are being discussed (or, if
>> we did, we could point the individual at the admin who set the knob...).
>
> That's how I see it, too. I would not suggest making the default
> anything other than "off", but there are situations where it would
> be a nice tool to have in the toolbox.

Agreed. Thanks!

Stephen
On Wed, Feb 18, 2015 at 10:57 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> Magnus Hagander <magnus@hagander.net> wrote:
>> On Feb 17, 2015 12:26 AM, "Andres Freund" <andres@2ndquadrant.com> wrote:
>>> On 2015-02-16 16:35:46 -0500, Bruce Momjian wrote:
>>>
>>> But max_standby_streaming_delay, max_standby_archive_delay and
>>> hot_standby_feedback are among the most frequent triggers for
>>> questions and complaints that I/we see.
>>
>> Agreed.
>>
>> And a really bad one used to be vacuum_defer_cleanup_age, because
>> of confusing units amongst other things. Which in turn seems
>> fairly close to Kevin's suggestions, unfortunately.
>
> Particularly my initial suggestion, which was to base snapshot
> "age" on the number of transaction IDs assigned. Does this look
> any better to you if it is something that can be set to '20min' or
> '1h'? Just to restate, that would not automatically cancel the
> snapshots past that age; it would allow vacuum of any tuples which
> became "dead" that long ago, and would cause a "snapshot too old"
> message for any read of a page modified more than that long ago
> using a snapshot which was older than that.
Yes, it would definitely look much better. My reference per above was exactly that - having a setting in the unit "number of xids" confused a lot of users and made it really hard to tune. Having something in time units is a lot easier to understand and tune for most people.
Greg Stark <stark@mit.edu> wrote:
> On Sun, Feb 15, 2015 at 8:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> There might be something in that, but again it's not much like
>> this patch. The key point I think we're both making is that
>> nondeterministic failures are bad, especially when you're
>> talking about long-running, expensive-to-retry transactions.
>
> But I think Kevin would agree with you there. That's why he's
> interested in having the errors not occur if you don't read from
> the volatile tables. Ie, your reporting query reading from
> read-only tables would be guaranteed to succeed even if some
> other table had had some rows vacuumed away.
>
> I'm not sure that's really going to work because of things like
> hint bit updates or HOT pruning. That is, say your table is
> read-only now but last week some of the records were updated. You
> might reasonably expect those rows to be safe and indeed the rows
> themselves will still be in your report. But the earlier versions
> of the rows could still be sitting on some pages invisible to
> every query but not vacuumable quite yet and then suddenly they
> get vacuumed away and the LSN on the page gets bumped.

That's a very interesting point. In the case that the reporting tables are like that (versus being created right before the run, for the report), it would argue for either very aggressive autovacuum settings or an explicit vacuum on those tables before starting the report.

> Fwiw I would strongly suggest that instead of having a number of
> transactions to have a time difference.

On the 15th I said this:

| What this discussion has made me reconsider is the metric for
| considering a transaction "too old". The number of transaction IDs
| consumed seems inferior to LSN or time as the unit of measure for
| that.
|
| It looks to me to be pretty trivial (on the order of maybe 30 lines
| of code) to specify this GUC in minutes rather than transaction
| IDs. At first glance this seems like it would be vulnerable to the
| usual complaints about mismanaged clocks, but those are easily
| answered by using a cached time in shared memory that we populate
| in such a way that it can never move backward. Done right, this
| could even allow the GUC to be changeable on reload rather than
| only at restart. A badly mismanaged system clock could not cause a
| query to generate incorrect results; the worst that could happen is
| that this feature would fail to control bloat as much as expected,
| or reads of modified data could generate the "snapshot too old"
| error around the time of the clock adjustment.
|
| As before, this would default to a magic value to mean that you
| want the historical PostgreSQL behavior.
|
| If that makes the idea more palatable to anyone, I can submit a
| patch to that effect within the next 24 hours.

Until yesterday I didn't get any feedback suggesting that such a change would make the patch more palatable. Now that I have had some, I'll try to post a patch to that effect today.

Thanks!

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Stephen Frost <sfrost@snowman.net> wrote:
> Kevin Grittner (kgrittn@ymail.com) wrote:
>> Stephen Frost <sfrost@snowman.net> wrote:
>>> In the end, with a single long-running transaction, the worst bloat you
>>> would have is double the size of the system at the time the long-running
>>> transaction started.
>>
>> I agree that limiting bloat to one dead tuple for every live one
>> for each old snapshot is a limit that has value, and it was unfair
>> of me to characterize that as not being a limit. Sorry for that.
>>
>> This possible solution was discussed with the user whose feedback
>> caused me to write this patch, but there were several reasons they
>> dismissed that as a viable total solution for them, two of which I
>> can share:
>>
>> (1) They have a pool of connections each of which can have several
>> long-running cursors, so the limit from that isn't just doubling
>> the size of their database, it is limiting it to some two-or-three
>> digit multiple of the necessary size.
>
> This strikes me as a bit off-the-cuff; was an analysis done which
> determined that would be the result?

If it sounds off-the-cuff, it is because I am summarizing the mountain of data which has been accumulated over weeks of discussions and many rounds of multi-day tests using their actual software, driven by a software test driver that simulates users, with "think time" and all. To be usable, we need to run on a particular set of hardware with a simulated load of 3000 users. In a two-day test without the patches I'm proposing, their actual, working production software running against PostgreSQL bloated the database by 37% and was on a linear rate of increase. Unpatched, CPU time was consumed at a linear rate due to the bloat, until in the second day it saturated the CPUs and the driver was unable to get acceptable user performance. Because of that we haven't moved on to test what the bloat rate would be like with 3000 users, but I assume it would be higher than with 300 users.

With the two patches I submitted, bloat stabilized at less than 5%, except for some btree indexes which followed a pattern of inserting at the end and deleting most (but not all) of the entries over time. I've been working on a separate patch for that, but it's not ready to present, so I probably won't post that until the first CF of the next release -- unless someone's interested enough to want to discuss it before then.

>> (2) They are already prepared to deal with "snapshot too old"
>> errors on queries that run more than about 20 minutes and which
>> access tables which are modified. They would rather do that than
>> suffer the bloat beyond that point.
>
> That, really, is the crux here -- they've already got support for dealing
> with it the way Oracle did and they'd like PG to do that too.
> Unfortunately, that, by itself, isn't a good reason for a particular
> capability (we certainly aren't going to be trying to duplicate PL/SQL
> in PG any time soon).

I understand that, and if the community doesn't want this style of limitation on bloat we can make it part of PPAS (which *does* duplicate PL/SQL, among other things). We are offering it to the community first.

> That said, there are capabilities in other RDBMSs which are
> valuable and which we *do* want, so the fact that Oracle does this
> also isn't a reason to not include it.

:-)

>>> I'm not against having a knob like this, which is defaulted to off,
>>
>> Thanks! I'm not sure that amounts to a +1, but at least it doesn't
>> sound like a -1. :-)
>
> So, at the time I wrote that, I wasn't sure if it was a +1 or not
> myself. I've been thinking about it since then, however, and I'm
> leaning more towards having the capability than not, so perhaps that's a
> +1, but it doesn't excuse the need to come up with an implementation
> that everyone can be happy with, and what you've come up with so far
> doesn't have a lot of appeal, based on the feedback (I've only glanced
> through it myself, but I agree with Andres and Tom that it's a larger
> footprint than we'd want for this, and the number of places having to be
> touched is concerning as it could lead to future bugs).

This conversation has gotten a little confusing because some of my posts seem to have been held up in the email systems for some reason, and are arriving out of order (and at least one I don't see yet). But I have been responding to these points. For one thing, I stated four days ago that I now feel that the metric for determining that a snapshot is "old" should be *time*, not transaction IDs consumed, and offered to modify the patch in that direction if people agreed. I now see one person agreeing and several people arguing that I should do just that (as though they had not seen me propose it), so I'll try to get a new version out today like that. In any event, I am sure it is possible and feel that it would be better. (Having posted that now at least 4 times, hopefully people will pick up on it.)

For another thing, as mentioned earlier, this is the btree footprint:

 src/backend/access/nbtree/nbtinsert.c |  7 ++++---
 src/backend/access/nbtree/nbtpage.c   |  2 +-
 src/backend/access/nbtree/nbtsearch.c | 43 ++++++++++++++++++++++++++++++++++---------
 src/include/access/nbtree.h           |  7 ++++---
 4 files changed, 43 insertions(+), 16 deletions(-)

That consists mainly of two things. First, there are 13 places (all in nbtsearch.c) immediately following a call to BufferGetPage() where a line like this was added:

 TestForOldSnapshot(snapshot, rel, page);

Second, some functions that were not previously passed a snapshot had that added to the function signature, and the snapshot was passed from the caller where needed. C files other than nbtsearch.c were only modified to pass a NULL in to calls to the functions with modified signatures -- in a total of four places. That is it. To be sure, I had to initially spend some time figuring out which BufferGetPage() calls needed to be followed with this test, but it's not rocket science.

> A lot of that would go away if there was a way to avoid having to mess
> with the index AMs, I'd think, but I wonder if we'd actually need more
> done there -- it's not immediately obvious to me how an index-only scan is
> safe with this. Whenever an actual page is visited, we can check the
> LSN, and the relation can't be truncated by vacuum since the transaction
> will still have a lock on the table which prevents it, but does the
> visibility-map update check make sure to never mark pages all-visible
> when one of these old transactions is running around? On what basis?

I will review that; it may indeed need some attention. The index entry is removed before the line pointer can be freed, but the question is whether there is a window where pruning or vacuum has removed the tuple and changed the page visibility map bit to say that everything on the page is visible to all before the index entry is removed. There may be a need to, as you say, suppress setting the all-visible flag in some particular cases.

>>> but I do think we'd be a lot better off with a system that could
>>> realize when rows are not visible to any currently running transaction
>>> and clean them up.
>>
>> +1
>>
>> But they are not mutually exclusive; I see them as complementary.
>
> I can see how they would be, provided we can be confident that we're
> going to actually throw an error when the snapshot is out of date and
> not end up returning incorrect results. We need to be darn sure of
> that, both now and in a few years from now when many of us may have
> forgotten about this knob.. ;)

I get that this is not zero-cost to maintain, and that it might not be of interest to the community largely for that reason. If you have any ideas about how to improve that, I'm all ears. But do consider the actual scope of what was needed for the btree coverage (above). (Note that the index-only scan issue doesn't look like anything AM-specific; if something is needed for that it would be in the pruning/vacuum area.)

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
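For concreteness, the calling pattern described might look like the following. This is an illustration, not an actual hunk from nbtsearch.c; the example function is invented, and TestForOldSnapshot() is the check the patch adds:

#include "postgres.h"

#include "storage/bufmgr.h"
#include "utils/rel.h"
#include "utils/snapshot.h"

/* Added by the patch; see the sketch earlier in the thread. */
extern void TestForOldSnapshot(Snapshot snapshot, Relation rel, Page page);

/*
 * Each BufferGetPage() in a scan path is immediately followed by the
 * check, which may throw the "snapshot too old" error before any page
 * contents are examined.
 */
static void
example_examine_page(Relation rel, Buffer buf, Snapshot snapshot)
{
    Page        page = BufferGetPage(buf);

    TestForOldSnapshot(snapshot, rel, page);

    /* ... existing page-examination logic continues unchanged ... */
}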
On 02/19/2015 09:44 AM, Kevin Grittner wrote:
> On the 15th I said this:
>
> | What this discussion has made me reconsider is the metric for
> | considering a transaction "too old". The number of transaction IDs
> | consumed seems inferior to LSN or time as the unit of measure for
> | that.
> |
> | It looks to me to be pretty trivial (on the order of maybe 30 lines
> | of code) to specify this GUC in minutes rather than transaction
> | IDs. At first glance this seems like it would be vulnerable to the
> | usual complaints about mismanaged clocks, but those are easily
> | answered by using a cached time in shared memory that we populate
> | in such a way that it can never move backward. Done right, this
> | could even allow the GUC to be changeable on reload rather than
> | only at restart. A badly mismanaged system clock could not cause a
> | query to generate incorrect results; the worst that could happen is
> | that this feature would fail to control bloat as much as expected,
> | or reads of modified data could generate the "snapshot too old"
> | error around the time of the clock adjustment.
> |
> | As before, this would default to a magic value to mean that you
> | want the historical PostgreSQL behavior.
> |
> | If that makes the idea more palatable to anyone, I can submit a
> | patch to that effect within the next 24 hours.
>
> Until yesterday I didn't get any feedback suggesting that such a
> change would make the patch more palatable. Now that I have had
> some, I'll try to post a patch to that effect today.
>
> Thanks!

I understand why this makes people nervous. I wonder if it might be more palatable if there were a per-table setting that could enable it? If we could ensure that this was only applied to high-churn queue tables, say, while tables touched by the report writer would not have it applied, that would calm a lot of my fears.

I'm also interested in handling the case Stephen Frost described, where a tuple is effectively dead but we don't currently have the means of discovering the fact, because there is an older long-running transaction which is never in fact going to be able to see the tuple.

cheers

andrew
On Wed, Feb 18, 2015 at 4:57 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
>> But max_standby_streaming_delay, max_standby_archive_delay and
>> hot_standby_feedback are among the most frequent triggers for
>> questions and complaints that I/we see.
>>
> Agreed.
> And a really bad one used to be vacuum_defer_cleanup_age, because
> of confusing units amongst other things. Which in terms of units
> seems fairly close to Kevin's suggestions, unfortunately.
Particularly my initial suggestion, which was to base snapshot
"age" on the number of transaction IDs assigned. Does this look
any better to you if it is something that can be set to '20min' or
'1h'? Just to restate, that would not automatically cancel the
snapshots past that age; it would allow vacuum of any tuples which
became "dead" that long ago, and would cause a "snapshot too old"
message for any read of a page modified more than that long ago
using a snapshot which was older than that.
I like this thought. One of the first things I do in a new Pg environment is to set up a cronjob that watches pg_stat_activity and terminates most backends over N minutes in age (about 5x the length of normal work), with an exception for a handful of accounts doing backups and other maintenance operations. This prevents a stuck client from jamming up the database.
Would pg_dump be able to opt-out of such a restriction?
regards,
Rod
On Thu, Feb 19, 2015 at 6:01 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
>> I can see how they would be, provided we can be confident that we're
>> going to actually throw an error when the snapshot is out of date and
>> not end up returning incorrect results. We need to be darn sure of
>> that, both now and in a few years from now when many of us may have
>> forgotten about this knob.. ;)
>
> I get that this is not zero-cost to maintain, and that it might not
> be of interest to the community largely for that reason. If you
> have any ideas about how to improve that, I'm all ears. But do
> consider the actual scope of what was needed for the btree coverage
> (above). (Note that the index-only scan issue doesn't look like
> anything AM-specific; if something is needed for that it would be
> in the pruning/vacuum area.)

If I understood the issue correctly, you have long-running snapshots
accessing one part of the data, while you have high churn on a
disjoint part of the data. We would need to enable vacuum on the
high-churn data while still being able to run those long queries. The
"snapshot too old" solution limits the maximum age of the global xmin
at the cost of having to abort transactions that are older than this
and happen to touch a page that was modified after they were started.

What about having the long-running snapshots declare their working
set, and then only take them into account for global xmin for
relations that are in the working set? Like a SET TRANSACTION WORKING
SET command. This way the error is deterministic, vacuum on the
high-churn tables doesn't have to wait for the old-transaction delay
to expire, and we avoid a hard-to-tune GUC (what do I need to set
old_snapshot_threshold to, to reduce bloat while not having "normal"
transactions abort?).

Regards,
Ants Aasma

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
Andrew Dunstan <andrew@dunslane.net> wrote:
> On 02/19/2015 09:44 AM, Kevin Grittner wrote:

> I understand why this makes people nervous. I wonder if it might be
> more palatable if there were a per-table setting that could enable
> it? If we could ensure that this was only applied to high-churn
> queue tables, say, while tables touched by the report writer would
> not have it applied, that would calm a lot of my fears.

That's an interesting idea. I think I should switch the unit of
measure for "too old" to a time-based value first, since there seems
to be universal agreement that it would be better than a number of
transactions. Once I've cleared that hurdle I'll see what this would
take.

> I'm also interested in handling the case Stephen Frost described,
> where a tuple is effectively dead but we don't currently have the
> means of discovering the fact, because there is an older long-running
> transaction which is never in fact going to be able to see the tuple.

Absolutely. That's one of several other issues that I've been looking
at over the last few weeks. It sounds like people are already working
on that one, which is great. My personal priority list included that,
but after the two patches I submitted here and a patch to allow
combining near-empty btree pages, so that btree bloat is constrained
without having to reindex periodically for the cases where index
tuples are added in index order (at one or a few insertion points) and
most-but-not-all are deleted. You can currently wind up with a worst
case of one index tuple per page, with no action by vacuum processes
to reduce that bloat.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Feb 19, 2015 at 3:44 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
>> I'm also interested in handling the case Stephen Frost described,
>> where a tuple is effectively dead but we don't currently have the
>> means of discovering the fact, because there is an older long-running
>> transaction which is never in fact going to be able to see the tuple.
>
> Absolutely. That's one of several other issues that I've been
> looking at over the last few weeks. It sounds like people are
> already working on that one, which is great. My personal priority
> list included that, but after the two patches I submitted here and a
> patch to allow combining near-empty btree pages, so that btree bloat
> is constrained without having to reindex periodically for the cases
> where index tuples are added in index order (at one or a few
> insertion points) and most-but-not-all are deleted. You can
> currently wind up with a worst case of one index tuple per page,
> with no action by vacuum processes to reduce that bloat.

I'd be willing to test that patch. I have a big database that does
that, and a script that fills the disk with said bloat. That's forced
me to do periodic reindexing, which sucks.
Rod Taylor <rod.taylor@gmail.com> wrote:

> Would pg_dump be able to opt-out of such a restriction?

I don't see how, since vacuum would be removing recently dead tuples
that are still visible; the alternative to getting a "snapshot too
old" error when reading a page which could be affected is to return
incorrect results, and nobody wants that. The best you could do if
you wanted to run pg_dump (or similar) and it might take more time
than your old_snapshot_threshold would be to increase the threshold,
reload, dump, set it back to the "normal" setting, and reload again.
Andrew's suggestion of setting this per table would not help here.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Ants Aasma <ants@cybertec.at> wrote:

> If I understood the issue correctly, you have long-running snapshots
> accessing one part of the data, while you have high churn on a
> disjoint part of the data. We would need to enable vacuum on the
> high-churn data while still being able to run those long queries.
> The "snapshot too old" solution limits the maximum age of the global
> xmin at the cost of having to abort transactions that are older than
> this and happen to touch a page that was modified after they were
> started.

Pretty much. It's not quite as "black and white" regarding high-churn
and stable tables, though -- at least for the customer whose feedback
drove my current work on containing bloat. They are willing to
tolerate an occasional "snapshot too old" error on a cursor, and they
can automatically pick up again by restarting the cursor with a new
"lower limit" and a new snapshot.

> What about having the long-running snapshots declare their working
> set, and then only take them into account for global xmin for
> relations that are in the working set? Like a SET TRANSACTION WORKING
> SET command. This way the error is deterministic, vacuum on the
> high-churn tables doesn't have to wait for the old-transaction delay
> to expire, and we avoid a hard-to-tune GUC (what do I need to set
> old_snapshot_threshold to, to reduce bloat while not having "normal"
> transactions abort?).

Let me make sure I understand your suggestion. You are suggesting
that within a transaction you can declare a list of tables which
should get the "snapshot too old" error (or something like it) if a
page is read which was modified after the snapshot was taken?
Snapshots within that transaction would not constrain the effective
global xmin for purposes of vacuuming those particular tables?

If I'm understanding your proposal, that would help some of the cases
the "snapshot too old" approach addresses, but it would not handle the
accidental "idle in transaction" case (e.g., start a repeatable read
transaction, run one query, then sit idle indefinitely). I don't
think it would work quite as well for some of the other cases I've
seen, but perhaps it could be made to fit if we could figure out the
accidental "idle in transaction" case.

I also think it might be hard to develop a correct global xmin for
vacuuming a particular table with that solution. How do you see that
working?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 02/19/2015 03:31 PM, Kevin Grittner wrote:
>> What about having the long-running snapshots declare their working
>> set, and then only take them into account for global xmin for
>> relations that are in the working set? Like a SET TRANSACTION WORKING
>> SET command. This way the error is deterministic, vacuum on the
>> high-churn tables doesn't have to wait for the old-transaction delay
>> to expire, and we avoid a hard-to-tune GUC (what do I need to set
>> old_snapshot_threshold to, to reduce bloat while not having "normal"
>> transactions abort?).
>
> Let me make sure I understand your suggestion. You are suggesting
> that within a transaction you can declare a list of tables which
> should get the "snapshot too old" error (or something like it) if a
> page is read which was modified after the snapshot was taken?
> Snapshots within that transaction would not constrain the effective
> global xmin for purposes of vacuuming those particular tables?

I thought it meant that the declared tables would only be vacuumed
conservatively, so the transaction would expect not to see "snapshot
too old" from them.

cheers

andrew
<p dir="ltr">On Feb 19, 2015 10:31 PM, "Kevin Grittner" <<a href="mailto:kgrittn@ymail.com">kgrittn@ymail.com</a>>wrote:<br /> > > What about having the long running snapshotsdeclare their working<br /> > > set, and then only take them into account for global xmin for<br /> > >relations that are in the working set? Like a SET TRANSACTION WORKING<br /> > > SET command. This way the erroris deterministic, vacuum on the high<br /> > > churn tables doesn't have to wait for the old transaction delayto<br /> > > expire and we avoid a hard to tune GUC (what do I need to set<br /> > > old_snapshot_thresholdto, to reduce bloat while not having "normal"<br /> > > transactions abort).<br /> ><br />> Let me make sure I understand your suggestion. You are suggesting<br /> > that within a transaction you can declarea list of tables which<br /> > should get the "snapshot too old" error (or something like it) if a<br /> > pageis read which was modified after the snapshot was taken?<br /> > Snapshots within that transaction would not constrainthe effective<br /> > global xmin for purposes of vacuuming those particular tables?<p dir="ltr">Sorry, I shouldhave been clearer. I'm proposing that a transaction can declare what tables it will access. After that the transactionwill constrain xmin for only those tables. Accessing any other table would give an error immediately.<p dir="ltr">>If I'm understanding your proposal, that would help some of the<br /> > cases the "snapshot too old" caseaddresses, but it would not<br /> > handle the accidental "idle in transaction" case (e.g., start a<br /> > repeatableread transaction, run one query, then sit idle<br /> > indefinitely). I don't think it would work quite aswell for some<br /> > of the other cases I've seen, but perhaps it could be made to fit<br /> > if we could figureout the accidental "idle in transaction" case.<p dir="ltr">Accidental idle in transaction seems better handled by justterminating transactions older than some timeout. This is already doable externally, but an integrated solution wouldbe nice if we could figure out how the permissions for setting such a timeout work.<p dir="ltr">> I also think itmight be hard to develop a correct global xmin for<br /> > vacuuming a particular table with that solution. How doyou see<br /> > that working?<p dir="ltr">I'm imagining the long running transaction would register it's xmin in a sharedmap keyed by relation for each referenced relation and remove the xmin from procarray. Vacuum would access this mapby relation, determine the minimum and use that if it's earlier than the global xmin. I'm being deliberately vague hereabout the datastructure in shared memory as I don't have a great idea what to use there. It's somewhat similar to thelock table in that in theory the size is unbounded, but in practice it's expected to be relatively tiny.<p dir="ltr">Regards,<br/> Ants Aasma<br />
Kevin,

* Kevin Grittner (kgrittn@ymail.com) wrote:
> Stephen Frost <sfrost@snowman.net> wrote:
> > Kevin Grittner (kgrittn@ymail.com) wrote:
> >> (1) They have a pool of connections each of which can have several
> >> long-running cursors, so the limit from that isn't just doubling
> >> the size of their database, it is limiting it to some two-or-three
> >> digit multiple of the necessary size.
> >
> > This strikes me as a bit off-the-cuff; was an analysis done which
> > determined that would be the result?
>
> If it sounds like off-the-cuff it is because I am summarizing the
> mountain of data which has been accumulated over weeks of
> discussions and many rounds of multi-day tests using their actual
> software driven by a software test driver that simulates users,
> with "think time" and all.

My apologies for, unintentionally, implying that you hadn't done
sufficient research or testing. That certainly was *not* my intent
and it's absolutely clear to me that you've spent quite a bit of time
analyzing this problem.

What I was trying to get at is that the approach which I outlined was,
perhaps, not closely analyzed. Now, analyzing a problem based on a
solution which doesn't actually exist isn't exactly trivial, and while
it might be possible to do in this case it'd probably be easier to
simply write the code and then review the results than to perform an
independent analysis. I do feel that the solution might have been
dismissed on the basis of "doubling the database for each long-running
transaction is not acceptable" or even "having bloat because of a
long-running transaction is not acceptable," but really analyzing what
would actually happen is a much more difficult exercise.

> With the two patches I submitted, bloat stabilized at less than 5%
> except for some btree indexes which followed a pattern of inserting
> at the end and deleting most (but not all) of the entries over
> time. I've been working on a separate patch for that, but it's not
> ready to present, so I probably won't post that until the first CF
> of the next release -- unless someone's interested enough to want
> to discuss it before then.

Understood. I'd encourage you to post your thoughts on it- I'm
certainly interested in it, even if it isn't something we'll be able
to really work on for 9.5.

> > That, really, is the crux here- they've already got support for
> > dealing with it the way Oracle did and they'd like PG to do that
> > too. Unfortunately, that, by itself, isn't a good reason for a
> > particular capability (we certainly aren't going to be trying to
> > duplicate PL/SQL in PG any time soon).
>
> I understand that, and if the community doesn't want this style of
> limitation on bloat we can make it part of PPAS (which *does*
> duplicate PL/SQL, among other things). We are offering it to the
> community first.

That is certainly appreciated and I'd encourage you to continue to do
so. I do feel that the community, in general, is interested in these
kinds of knobs- it's just a question of implementation and making sure
that it's a carefully considered solution and not a knee-jerk reaction
to match what another RDBMS does.

> >>> I'm not against having a knob like this, which is defaulted to off,
> >>
> >> Thanks! I'm not sure that amounts to a +1, but at least it doesn't
> >> sound like a -1. :-)
> >
> > So, at the time I wrote that, I wasn't sure if it was a +1 or not
> > myself. I've been thinking about it since then, however, and I'm
> > leaning more towards having the capability than not, so perhaps
> > that's a +1, but it doesn't excuse the need to come up with an
> > implementation that everyone can be happy with, and what you've
> > come up with so far doesn't have a lot of appeal, based on the
> > feedback (I've only glanced through it myself, but I agree with
> > Andres and Tom that it's a larger footprint than we'd want for this
> > and the number of places having to be touched is concerning, as it
> > could lead to future bugs).
>
> This conversation has gotten a little confusing because some of my
> posts seem to have been held up in the email systems for some reason,
> and are arriving out of order (and at least one I don't see yet). But
> I have been responding to these points. For one thing, I stated four
> days ago that I now feel that the metric for determining that a
> snapshot is "old" should be *time*, not transaction IDs consumed, and
> offered to modify the patch in that direction if people agreed. I
> now see one person agreeing and several people arguing that I should
> do just that (as though they had not seen me propose it), so I'll try
> to get a new version out today along those lines. In any event, I am
> sure it is possible and feel that it would be better. (Having posted
> that now at least four times, hopefully people will pick up on it.)

I agree with others that having it be time-based would be better. I
did see that, but it's really an independent consideration from the
footprint of the patch.

> For another thing, as mentioned earlier, this is the btree footprint:
>
>  src/backend/access/nbtree/nbtinsert.c |  7 ++++---
>  src/backend/access/nbtree/nbtpage.c   |  2 +-
>  src/backend/access/nbtree/nbtsearch.c | 43 ++++++++++++++++++++++++++++++++++---------
>  src/include/access/nbtree.h           |  7 ++++---
>  4 files changed, 43 insertions(+), 16 deletions(-)

Understood- but that's not really the issue here, it's...

> That consists mainly of two things. First, there are 13 places (all
> in nbtsearch.c) where, immediately following a call to
> BufferGetPage(), a line like this was added:
>
>     TestForOldSnapshot(snapshot, rel, page);

This. That's quite a few different places which had to be updated,
and while they are obvious now, it might not be obvious in the future
which BufferGetPage() calls need to be followed by that
TestForOldSnapshot() call.

> Second, some functions that were not previously passed a snapshot had
> that added to the function signature, and the snapshot was passed
> from the caller where needed. C files other than nbtsearch.c were
> only modified to pass a NULL in to calls to the functions with
> modified signatures -- in a total of four places.

This part doesn't bother me nearly as much as having to add those
tests into a lot of different places in the code, and having to get
the index AMs involved.

> > A lot of that would go away if there was a way to avoid having to
> > mess with the index AMs, I'd think, but I wonder if we'd actually
> > need more done there- it's not immediately obvious to me how an
> > index-only scan is safe with this. Whenever an actual page is
> > visited, we can check the LSN, and the relation can't be truncated
> > by vacuum since the transaction will still have a lock on the
> > table which prevents it, but does the visibility-map update check
> > make sure to never mark pages all-visible when one of these old
> > transactions is running around? On what basis?
>
> I will review that; it may indeed need some attention. The index
> entry is removed before the line pointer can be freed, but the
> question is whether there is a window where pruning or vacuum has
> removed the tuple and changed the page's visibility map bit to say
> that everything on the page is visible to all before the index entry
> is removed. There may be a need to, as you say, suppress setting the
> all-visible flag in some particular cases.

Right, there may be a period of time where the flag is set to
all-visible and the index entry still exists. Isn't there a concern
about the index entry disappearing also, though? If there is a
specific query against a particular value in the index and the index
no longer has that value, then the query would return zero results-
but this could happen without visiting the heap, right? Therefore, how
do you know whether the value was, or wasn't, visible at the time that
the original long-running snapshot was taken?

> >>> but I do think we'd be a lot better off with a system that could
> >>> realize when rows are not visible to any currently running
> >>> transaction and clean them up.
> >>
> >> +1
> >>
> >> But they are not mutually exclusive; I see them as complementary.
> >
> > I can see how they would be, provided we can be confident that
> > we're going to actually throw an error when the snapshot is out of
> > date and not end up returning incorrect results. We need to be
> > darn sure of that, both now and in a few years from now when many
> > of us may have forgotten about this knob.. ;)
>
> I get that this is not zero-cost to maintain, and that it might not
> be of interest to the community largely for that reason. If you have
> any ideas about how to improve that, I'm all ears. But do consider
> the actual scope of what was needed for the btree coverage (above).
> (Note that the index-only scan issue doesn't look like anything
> AM-specific; if something is needed for that it would be in the
> pruning/vacuum area.)

Maintenance is certainly one concern among many, but I'm not sure that
we're quite ready to write this capability off due to the
maintainability of this particular patch- it needs to be considered
against the alternatives proposed and the risk of breaking this
particular capability in the future. I agree that the index issue is
not AM-specific, but it's still a consideration.

Thanks!

Stephen
On 2/19/15 1:54 PM, Kevin Grittner wrote:
> Rod Taylor <rod.taylor@gmail.com> wrote:
>
>> Would pg_dump be able to opt-out of such a restriction?
>
> I don't see how, since vacuum would be removing recently dead
> tuples that are still visible; the alternative to getting a
> "snapshot too old" error when reading a page which could be
> affected is to return incorrect results, and nobody wants that.
> The best you could do if you wanted to run pg_dump (or similar) and
> it might take more time than your old_snapshot_threshold would be
> to increase the threshold, reload, dump, set it back to the
> "normal" setting, and reload again.

While I think pg_dump is a great solution for small to medium
installations, there are a number of excellent file-based backup
options available. Anyone who is seriously worried about bloat (or
locking) should be looking to those solutions.

--
- David Steele
david@pgmasters.net
Stephen Frost <sfrost@snowman.net> wrote:
> Kevin Grittner (kgrittn@ymail.com) wrote:
>> With the two patches I submitted, bloat stabilized at less than 5%
>> except for some btree indexes which followed a pattern of inserting
>> at the end and deleting most (but not all) of the entries over
>> time. I've been working on a separate patch for that, but it's not
>> ready to present, so I probably won't post that until the first CF
>> of the next release -- unless someone's interested enough to want
>> to discuss it before then.
>
> Understood. I'd encourage you to post your thoughts on it- I'm
> certainly interested in it, even if it isn't something we'll be able
> to really work on for 9.5.

OK, if you're up for discussing it even while it's in early
development, I'll start a separate thread to outline the idea. As a
preliminary "teaser" in very abbreviated form, the idea is basically
to use the "page split" logic, but splitting all tuples from multiple
adjacent pages onto a new page to the right. The old tuples maintain
their pointers (to support scans positioned in that range), but all
pointers *to* those pages will be redirected to the new page. Let's
not bifurcate this thread by discussing it further here; please wait
until I start the new thread, which will be after I post the next
version of this "snapshot too old" patch.

> Isn't there a concern about the index entry disappearing also
> though? If there is a specific query against a particular value in
> the index and the index no longer has that value then the query
> would return zero results- but this could happen without visiting
> the heap, right?

Right; that's why I don't see how to avoid touching the index AMs.
The index pages also contain the LSN, and if you are not visiting the
heap it is because some page in the index was changed, and you will
see the "too recent" LSN there.

> Therefore, how do you know that the value was, or wasn't, visible at
> the time that the original long-running snapshot was taken.

Strictly speaking, you don't know anything that specific. You know
that changes were made that *might* have removed tuples that would
have been visible to the snapshot. Once a snapshot passes the
threshold for being considered "old", then viewing any page changed
after the snapshot was built has to trigger the error. That means
some false positives, but as long as you can't get false negatives
there is never a silently incorrect query result.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Feb 16, 2015 at 10:49 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
>
> What this discussion has made me reconsider is the metric for
> considering a transaction "too old". The number of transaction IDs
> consumed seems inferior as the unit of measure for that to LSN or
> time.
>
> It looks to me to be pretty trivial (on the order of maybe 30 lines
> of code) to specify this GUC in minutes rather than transaction
> IDs.
It seems to me that SQL Server also uses a similar mechanism to
avoid bloat in the version store (the place to store previous
versions of records).
The mechanism they use is that during the shrink process, the
longest-running transactions that have not yet generated row
versions are marked as victims. If a transaction is marked as a
victim, it can no longer read the row versions in the version store.
When it attempts to read row versions, an error message is
generated and the transaction is rolled back.
I am not sure how much weight it adds to the usefulness of this
patch; however, I think that if other leading databases provide a
way to control the bloat, it indicates that most customers with
write-intensive workloads would like to see such an option.
Amit Kapila <amit.kapila16@gmail.com> wrote:

> It seems to me that SQL Server also uses a similar mechanism to
> avoid bloat in the version store (the place to store previous
> versions of records).

> I think that if other leading databases provide a way to control
> the bloat, it indicates that most customers with write-intensive
> workloads would like to see such an option.

Yeah, I think many users are surprised by the damage that can be done
to a database by an idle connection or a long-running read-only query,
especially when coming from other databases which have protections
against that.

The bad news is that changing to specifying the limit in time rather
than transaction IDs is far more complicated than I thought.
Basically, we need to associate times in the past with the value of
latestCompletedXid (or derived values like snapshot xmin) at those
times. I was initially thinking that by adding a timestamp to the
snapshot we could derive this from active snapshots. For any one
connection, it's not that hard for it to scan the pairing heap of
snapshots and pick out the right values; but the problem is that it is
process-local memory, and this needs to work while the connection is
sitting idle for long periods of time. I've got a couple of ideas on
how to approach it, but neither seems entirely satisfying:

(1) Use a new timer ID to interrupt the process whenever its oldest
snapshot which hasn't yet crossed the "old snapshot" threshold does
so. The problem is that this seems as though it would need to do an
awful lot more work (scanning the pairing heap for the best match)
than anything else is currently doing in a timeout handler. One
alternative would be to keep a second pairing heap to track snapshots
which have not crossed the threshold, removing items as they get old
-- that would make the work needed in the timeout handler closer to
other handlers.

(2) Use a coarse enough granularity on time and a short enough
maximum for the GUC to just keep a circular buffer of the mappings in
memory. We might be able to make this dense enough that one-minute
resolution for up to 60 days could fit in 338kB. Obviously that could
be reduced with coarser granularity or a shorter maximum.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
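For scale, option (2) pencils out as follows: 60 days of one-minute
buckets is 86,400 entries, and at four bytes per TransactionId that is
337.5kB -- the 338kB figure above. A hypothetical layout, with all
names illustrative and locking omitted:

    #define TIME_MAP_ENTRIES   (60 * 24 * 60)  /* one minute for 60 days */

    typedef struct OldSnapshotTimeMap
    {
        int             head;           /* index of the newest entry */
        int             count;          /* number of valid entries */
        TimestampTz     head_timestamp; /* minute the newest entry covers */
        TransactionId   xid_by_minute[TIME_MAP_ENTRIES];   /* ~337.5kB */
    } OldSnapshotTimeMap;

    /*
     * Translate "N minutes ago" into the newest xid recorded at or
     * before that time, so vacuum can use it as a cutoff.
     */
    static TransactionId
    XidFromTimeMap(OldSnapshotTimeMap *map, TimestampTz when)
    {
        long    minutes_back;

        minutes_back = (map->head_timestamp - when) / USECS_PER_MINUTE;
        if (minutes_back < 0)
            minutes_back = 0;               /* future time: newest entry */
        if (minutes_back >= map->count)
            return InvalidTransactionId;    /* older than anything kept */

        return map->xid_by_minute[(map->head - (int) minutes_back +
                                   TIME_MAP_ENTRIES) % TIME_MAP_ENTRIES];
    }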
On 02/22/2015 11:48 AM, Kevin Grittner wrote:
>
> (2) Use a coarse enough granularity on time and a short enough
> maximum for the GUC to just keep a circular buffer of the mappings
> in memory. We might be able to make this dense enough that
> one-minute resolution for up to 60 days could fit in 338kB.
> Obviously that could be reduced with coarser granularity or a
> shorter maximum.

This doesn't sound too bad to me. Presumably these would be tunable.
I think one-minute granularity would be fine for most purposes, but I
can imagine people who would want it finer, like 10 seconds, say.

cheers

andrew
On Sun, Feb 22, 2015 at 10:18 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > It seems to me that SQL Server also uses a similar mechanism to
> > avoid bloat in the version store (the place to store previous
> > versions of records).
>
> > I think that if other leading databases provide a way to control
> > the bloat, it indicates that most customers with write-intensive
> > workloads would like to see such an option.
>
> Yeah, I think many users are surprised by the damage that can be
> done to a database by an idle connection or long-running read-only
> query, especially when coming from other databases which have
> protections against that.
>
> The bad news is that changing to specifying the limit in time
> rather than transaction IDs is far more complicated than I thought.
> Basically, we need to associate times in the past with the value of
> latestCompletedXid (or derived values like snapshot xmin) at those
> times. I was initially thinking that by adding a timestamp to the
> snapshot we could derive this from active snapshots. For any one
> connection, it's not that hard for it to scan the pairing heap of
> snapshots and pick out the right values; but the problem is that it
> is process-local memory and this needs to work while the connection
> is sitting idle for long periods of time.
Could you please explain in slightly more detail why it can't work
if we use a timestamp instead of snapshot->xmin in your patch in
the function TestForOldSnapshot()?
> I've got a couple of ideas
> on how to approach it, but neither seems entirely satisfying:
>
> (1) Use a new timer ID to interrupt the process whenever its
> oldest snapshot which hasn't yet crossed the "old snapshot"
> threshold does so. The problem is that this seems as though it
> would need to do an awful lot more work (scanning the pairing heap
> for the best match) than anything else is currently doing in a
> timeout handler. One alternative would be to keep a second pairing
> heap to track snapshots which have not crossed the threshold,
> removing items as they get old -- that would make the work needed
> in the timeout handler closer to other handlers.
>
> (2) Use a coarse enough granularity on time and a short enough
> maximum for the GUC to just keep a circular buffer of the mappings
> in memory. We might be able to make this dense enough that
> one-minute resolution for up to 60 days could fit in 338kB. Obviously
> that could be reduced with coarser granularity or a shorter
> maximum.
>
How exactly will this allow the "snapshot too old" error to be
returned? Do we need a check for the page (its LSN) as in your
current patch?
Amit Kapila <amit.kapila16@gmail.com> wrote:

> Could you please explain in slightly more detail why it can't work
> if we use a timestamp instead of snapshot->xmin in your patch in
> the function TestForOldSnapshot()?

It works fine for the additional visibility checking in scans, but it
doesn't cover the vacuuming -- that needs to use a transaction ID for
its cutoff.

> How exactly will this allow the "snapshot too old" error to be
> returned? Do we need a check for the page (its LSN) as in your
> current patch?

The change to the part where it actually throws the error is very
small: just checking time instead of xmin.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
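In code terms, the page-LSN comparison in TestForOldSnapshot() stays
as it is; only the threshold test changes from an xid comparison to a
timestamp comparison. A minimal sketch of the time-based test,
assuming a whenTaken timestamp saved in the snapshot and a GUC in
minutes (both assumptions for illustration):

    /*
     * Minimal sketch of a time-based "is this snapshot old?" test.
     * The whenTaken field and minute-based GUC units are assumptions;
     * a real version would also use a monotonic cached timestamp, as
     * discussed up-thread, rather than the raw system clock.
     */
    static bool
    SnapshotTooOld(Snapshot snapshot)
    {
        if (old_snapshot_threshold < 0)
            return false;       /* disabled: match unpatched behavior */

        return GetCurrentTimestamp() - snapshot->whenTaken >
            (TimestampTz) old_snapshot_threshold * USECS_PER_MINUTE;
    }

The vacuum side still needs a TransactionId cutoff, which is what the
time-to-xid mapping discussed in the preceding messages is for.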
On 17 February 2015 at 06:35, Magnus Hagander <magnus@hagander.net> wrote:
>
> On Feb 17, 2015 12:26 AM, "Andres Freund" <andres@2ndquadrant.com> wrote:
>>
>> On 2015-02-16 16:35:46 -0500, Bruce Momjian wrote:
>>> It seems we already have a mechanism in place that allows tuning of
>>> query cancel on standbys vs. preventing standby queries from seeing
>>> old data, specifically
>>> max_standby_streaming_delay/max_standby_archive_delay. We obsessed
>>> about how users were going to react to these odd variables, but
>>> there has been little negative feedback.
>>
>> FWIW, I think that's a somewhat skewed perception. I think it was
>> right to introduce those, because we didn't really have any
>> alternatives.
>>
>> But max_standby_streaming_delay, max_standby_archive_delay and
>> hot_standby_feedback are among the most frequent triggers for
>> questions and complaints that I/we see.
>
> Agreed.
>
> And a really bad one used to be vacuum_defer_cleanup_age, because of
> confusing units amongst other things. Which in terms of units seems
> fairly close to Kevin's suggestions, unfortunately.

I agree with Bruce that we already have a mechanism similar to this
for Hot Standby, so it should be OK to have it for normal running when
users desire it.

The key difference here is that in this patch we use the LSN to look
for changed data blocks, allowing queries to proceed for longer than
they would do on Hot Standby, where they currently just get blown away
regardless. I like the proposal in this patch, but it is more extreme
than the mechanism Hot Standby provides. (If we implement this then I
would want to see it work for Hot Standby also, so we can keep the
mechanisms in step; I think it is too late in the day to add that for
9.5.)

I also agree with Andres and Magnus in saying that with the current
definition the tunable parameter would be hard to use.

In response to Tom's perspective that it is the single most annoying
feature of Oracle, I agree. It just happens we have a similar problem:
bloat. Having a parameter to define "maximum acceptable bloat" would
be a very useful thing in PostgreSQL. IIRC it took other DBMSs a lot
of work to make "snapshot too old" occur in controllable
circumstances. So I would call having a very crap parameter like
old_snapshot_threshold a good start, but clearly not the end point of
an initiative to improve our control of bloat.

+1 to include this patch in 9.5, as long as there is agreement to
improve this in the future.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Would a parameter to auto-terminate long-running transactions be a
better solution? Terminating the long-running transaction would allow
the normal vacuuming process to clean up the deleted records, thus
avoiding database bloat without introducing new semantics to
Postgres's MVCC. I would also recommend that the default be disabled.
On 16/03/15 13:04, Rui DeSousa wrote:
> Would a parameter to auto-terminate long-running transactions be a
> better solution? Terminating the long-running transaction would allow
> the normal vacuuming process to clean up the deleted records, thus
> avoiding database bloat without introducing new semantics to
> Postgres's MVCC. I would also recommend that the default be disabled.

Hmm... Better per transaction, or even per statement. If I fire off a
query I expect to complete within 30 seconds, then I probably don't
mind it aborting after 5 minutes. However, if I fire off a query that
I expect to take between an hour and 3 hours, then I definitely don't
want it aborted after 5 minutes!

It could be that 99.99% of transactions will complete within a minute,
but a small minority might reasonably be expected to take much longer.

Cheers,
Gavin
On Sun, Mar 15, 2015 at 8:04 PM, Rui DeSousa <rui@crazybean.net> wrote:
> Would a parameter to auto-terminate long-running transactions be a
> better solution? Terminating the long-running transaction would allow
> the normal vacuuming process to clean up the deleted records, thus
> avoiding database bloat without introducing new semantics to
> Postgres's MVCC. I would also recommend that the default be disabled.

An advantage of Kevin's approach is that you don't have to abort *all*
long-running transactions, only those that touch data which has been
modified since then. If your system is read-mostly, that could
dramatically reduce the number of aborts.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
<div dir="ltr"><br /><div class="gmail_extra"><br /><div class="gmail_quote">On Mon, Mar 16, 2015 at 11:46 AM, Robert Haas<span dir="ltr"><<a href="mailto:robertmhaas@gmail.com" target="_blank">robertmhaas@gmail.com</a>></span> wrote:<br/><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">OnSun, Mar 15, 2015 at 8:04 PM, Rui DeSousa <<a href="mailto:rui@crazybean.net">rui@crazybean.net</a>> wrote:<br/> > Would a parameter to auto terminate long running transactions be a better solution? Terminating the longrunning transaction would allow for the normal vacuuming process to cleanup the deleted records thus avoiding databasebloat without introducing new semantics to Postgres's MVCC. I would also recommend that the default be disabled.<br/><br /></span>An advantage of Kevin's approach is that you don't have to abort *all*<br /> long-running transactions,only those that touch data which has been<br /> modified since then. If your system is read-mostly, that could<br/> dramatically reduce the number of aborts.<br /><span class="HOEnZb"><font color="#888888"></font></span><br /></blockquote></div><br/></div><div class="gmail_extra">Indeed. The suggestion of auto-terminating long running transactionsmisses the point, ISTM. Most of the use cases I can see for this involve vacuuming tables that the long runningtransaction will have no interest in at all (which is why I suggested being able to set it on a per table basis).I certainly don't want to be terminating my long running report transaction so that my queue table which is only evertouched by very short-lived transactions can be reasonably vacuumed. But that's what we have to do now.<br /><br /></div><divclass="gmail_extra">cheers<br /><br /></div><div class="gmail_extra">andrew<br /></div></div>
On Sun, Nov 8, 2020 at 5:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jim Nasby <Jim.Nasby@bluetreble.com> writes:
>> On 2/15/15 10:36 AM, Tom Lane wrote:
>>> I wonder if we couldn't achieve largely the same positive effects
>>> through adding a simple transaction-level timeout option.
>
>> A common use-case is long-running reports hitting relatively stable
>> data in a database that also has tables with a high churn rate (ie:
>> queue tables). In those scenarios your only options right now are to
>> suffer huge amounts of bloat in the high-churn tables or not do your
>> reporting. A simple transaction timeout only "solves" this by denying
>> you reporting queries.
>
> Agreed, but Kevin's proposal has exactly the same problem only worse,
> because (a) the reporting transactions might or might not fail (per
> Murphy, they'll fail consistently only when you're on deadline), and
> (b) if they do fail, there's nothing you can do short of increasing
> the slack db-wide.
>
>> An idea that I've had on this would be some way to "lock down" the
>> tables that a long-running transaction could access. That would allow
>> vacuum to ignore any snapshots that transaction had for tables it
>> wasn't accessing. That's something that would be deterministic.
>
> There might be something in that, but again it's not much like this
> patch. The key point I think we're both making is that
> nondeterministic failures are bad,
Based on my experience, long transactions caused by long-running
queries are not that common. A more common case I have met is that a
user starts a transaction but forgets to close it; I am thinking about
whether we can do something for this situation.

Since most users will use the Read Committed isolation level, after a
user completes a query the next query will use a fresh snapshot, so
there is no need to hold back the oldest xmin on its account. Would it
be correct to advance the oldest xmin in this case? If yes, what would
be the blockers in implementing this feature?
Best Regards
Andy Fan