Thread: snapshot too old, configured by time

snapshot too old, configured by time

From
Kevin Grittner
Date:
As discussed when the "proof of concept" patch was submitted during
9.5 development, here is a version intended to be considered for
commit to 9.6, with the following changes:

1. It is configured using time rather than number of transactions.
Not only was there unanimous agreement here that this was better,
but the EDB customer who had requested this feature and who had
been testing it independently made the same request.

2. The "proof of concept" patch only supported heap and btree
checking; this supports all index types.

3. Documentation has been added.

4. Tests have been added.  They are currently somewhat minimal,
since this uses a whole new testing technique, different from any
existing committed tests -- I wanted to make sure that this
approach to testing was OK with everyone before expanding it.  If
it is, I assume we will want to move some of the more generic
portions to a .pm file to make it available for other tests.

Basically, this patch aims to limit bloat when there are snapshots
that are kept registered for prolonged periods.  The immediate
reason for this is a customer application that keeps read-only
cursors against fairly static data open for prolonged periods, and
automatically fields SQLSTATE 72000 to re-open them if necessary.
When used, it should also provide some protections against extreme
bloat from forgotten "idle in transaction" connections which are
left holding a snapshot.

Once a snapshot reaches the age threshold, the transaction using it
can be cancelled if it reads data modified after the snapshot was
built.  It is expected
that useful ranges will normally be somewhere from a few hours to a
few days.

By default old_snapshot_threshold is set to -1, which disables the
new behavior.
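As a sketch of what using the feature would look like (the values here are illustrative, not from the thread; the GUC is PGC_POSTMASTER per the discussion below, so it takes effect at server start):

```
# postgresql.conf
old_snapshot_threshold = '2h'    # snapshots older than ~2 hours become cancellable
#old_snapshot_threshold = -1    # the default; feature disabled
```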

The customer has been testing a preliminary version of this
time-based patch for several weeks, and is happy with the results
-- it is preventing bloat for them and not generating "snapshot too
old" errors at unexpected times.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: snapshot too old, configured by time

From
Steve Singer
Date:
On 08/31/2015 10:07 AM, Kevin Grittner wrote:


Kevin,

I've started to do a review on this patch but I am a bit confused with
some of what I am seeing.

The attached testcase fails if I replace the cursor in your test case
with direct selects from the table.
I would have expected this to generate the snapshot too old error as
well, but it doesn't.



#   Failed test 'expect "snapshot too old" error'
#   at t/002_snapshot_too_old_select.pl line 64.
#          got: ''
#     expected: '72000'
# Looks like you failed 1 test of 9.
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/9 subtests

Am I misunderstanding something or is the patch not working as expected?




> As discussed when the "proof of concept" patch was submitted during
> 9.5 development, here is a version intended to be considered for
> commit to 9.6, with the following changes:
>
> 1. It is configured using time rather than number of transactions.
> Not only was there unanimous agreement here that this was better,
> but the EDB customer who had requested this feature and who had
> been testing it independently made the same request.
>
> 2. The "proof of concept" patch only supported heap and btree
> checking; this supports all index types.
>
> 3. Documentation has been added.
>
> 4. Tests have been added.  They are currently somewhat minimal,
> since this is using a whole new technique for testing from any
> existing committed tests -- I wanted to make sure that this
> approach to testing was OK with everyone before expanding it.  If
> it is, I assume we will want to move some of the more generic
> portions to a .pm file to make it available for other tests.
>
> Basically, this patch aims to limit bloat when there are snapshots
> that are kept registered for prolonged periods.  The immediate
> reason for this is a customer application that keeps read-only
> cursors against fairly static data open for prolonged periods, and
> automatically fields SQLSTATE 72000 to re-open them if necessary.
> When used, it should also provide some protections against extreme
> bloat from forgotten "idle in transaction" connections which are
> left holding a snapshot.
>
> Once a snapshot reaches the age threshold, it can be terminated if
> reads data modified after the snapshot was built.  It is expected
> that useful ranges will normally be somewhere from a few hours to a
> few days.
>
> By default old_snapshot_threshold is set to -1, which disables the
> new behavior.
>
> The customer has been testing a preliminary version of this
> time-based patch for several weeks, and is happy with the results
> -- it is preventing bloat for them and not generating "snapshot too
> old" errors at unexpected times.
>
> --
> Kevin Grittner
> EDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>


Attachment

Re: snapshot too old, configured by time

From
Alvaro Herrera
Date:
I'm starting to read through this and have a few questions.

Kevin Grittner wrote:

> 4. Tests have been added.  They are currently somewhat minimal,
> since this is using a whole new technique for testing from any
> existing committed tests -- I wanted to make sure that this
> approach to testing was OK with everyone before expanding it.  If
> it is, I assume we will want to move some of the more generic
> portions to a .pm file to make it available for other tests.

I think your test module is a bit unsure about whether it's called tso
or sto.  I would place it inside src/test/modules, so that buildfarm
runs it automatically; I'm not sure it will pick up things in src/test.
(This is the whole point of src/test/modules, after all).  I haven't
written or reviewed our TestLib stuff so I can't comment on whether the
test code is good or not.  How long is the test supposed to last?

It bothers me a bit to add #include rel.h to snapmgr.h because of the
new macro definition.  It seems out of place.  I would instead move the
macro closer to bufmgr headers or rel.h itself perhaps.  Also, the
definition isn't accompanied by an explanatory comment (other than why
is it a macro instead of a function).

So if I understand correctly, every call to BufferGetPage needs to have
a TestForOldSnapshot immediately afterwards?  It seems easy to later
introduce places that fail to test for old snapshots.  What happens if
they do?  Does vacuum remove tuples anyway and then the query returns
wrong results?  That seems pretty dangerous.  Maybe the snapshot could
be an argument of BufferGetPage?

Please have the comments be clearer on what "align to minute boundaries"
means.  Is it floor or ceil?  Also, in the OldSnapshotControlData big
comment you talk about "length" and then the variable is called "used".
Maybe that should be more consistent, for clarity.

How large is the struct kept in shmem?  I don't think the size is
configurable, is it?  I would have guessed that the size would depend on
the GUC param ...

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: snapshot too old, configured by time

From
Kevin Grittner
Date:
Thanks to both Steve and Álvaro for review and comments.  I won't
be able to address all of those within the time frame of the
current CF, so I have moved it to the next CF and flagged it as
"Waiting on Author".  I will post a patch to address all comments
before that CF starts.  I will discuss now, though.


Steve Singer <steve@ssinger.info> wrote:

> The attached testcase fails I  replace the cursor in your test
> case with direct selects from the table.
> I would have expected this to generate the snapshot too old error
> as well but it doesn't.

Good idea for an additional test; it does indeed show a case where
the patch could do better.  I've reviewed the code and see how I
can adjust the calculations to fix this.


Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Kevin Grittner wrote:

>> 4. Tests have been added.  They are currently somewhat minimal,
>> since this is using a whole new technique for testing from any
>> existing committed tests -- I wanted to make sure that this
>> approach to testing was OK with everyone before expanding it.
>> If it is, I assume we will want to move some of the more generic
>> portions to a .pm file to make it available for other tests.

> I think your test module is a bit unsure about whether it's
> called tso or sto.

Oops.  Yeah, will fix.  They all should be sto for "snapshot too
old".  There is heavy enough use of timestamp fields named ts in the
code that my fingers got confused.

> I would place it inside src/test/modules, so that buildfarm
> runs it automatically; I'm not sure it will pick up things in
> src/test.

As it stands now, the test is getting run as part of `make
check-world`, and it seems like src/test/modules is about testing
separate executables, so I don't think it makes sense to move the
tests -- but I could be convinced that I'm missing something.

> I haven't written or reviewed our TestLib stuff so I can't
> comment on whether the test code is good or not.  How long is the
> test supposed to last?


Well, I can't see how to get a good test of some code with a
setting of 0 (snapshot can become "too old" immediately).  That
setting may keep parts of the test fast, but to exercise much of
the important code you need to configure to '1min' and have the
time pass the 0 second mark for a minute at various points in the
test; so it will be hard to avoid having this test take a few
minutes.

> It bothers me a bit to add #include rel.h to snapmgr.h because of
> the new macro definition.  It seems out of place.

Yeah, I couldn't find anywhere to put the macro that I entirely
liked.

> I would instead move the macro closer to bufmgr headers or rel.h
> itself perhaps.

I'll try that, or something similar.

> Also, the definition isn't accompanied by an explanatory comment
> (other than why is it a macro instead of a function).

Will fix.

> So if I understand correctly, every call to BufferGetPage needs
> to have a TestForOldSnapshot immediately afterwards?

No, every call that is part of a scan.  Access for internal
purposes (such as adding an index entry or vacuuming) does not need
this treatment.

> It seems easy to later introduce places that fail to test for old
> snapshots.  What happens if they do?

That could cause incorrect results for those using the "snapshot
too old" feature, because a query might fail to recognize that one
or more rows it should see are missing.

> Does vacuum remove tuples anyway and then the query returns
> wrong results?

Right.  That.

> That seems pretty dangerous.  Maybe the snapshot could be an
> argument of BufferGetPage?

It would need that, the Relation (or go find it from the Buffer),
and an indication of whether this access is due to a scan.

> Please have the comments be clearer on what "align to minute
> boundaries" means.  Is it floor or ceil?

ok

> Also, in the OldSnapshotControlData big comment you talk about
> "length" and then the variable is called "used".  Maybe that
> should be more consistent, for clarity.

Will work on that comment.

> How large is the struct kept in shmem?

To allow a setting up to 60 days, 338kB.

> I don't think the size is configurable, is it?

Not in the posted patch.

> I would have guessed that the size would depend on the GUC param ...

I had been trying to make this GUC PGC_SIGHUP, but found a lot of
devils hiding in the details of that.  When I gave up on that and
went back to PGC_POSTMASTER I neglected to change this allocation
back.  I agree that it should be changed, and that will make the
size depend on the configuration.  If the feature is not used, no
significant RAM will be allocated.  If, for example, someone wants
to configure to '5h' they would need only about 1.2kB for this
structure.


Again, thanks to both of you for the review!


--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: snapshot too old, configured by time

From
Alvaro Herrera
Date:
Kevin Grittner wrote:

> Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> > I would place it inside src/test/modules, so that buildfarm
> > runs it automatically; I'm not sure it will pick up things in
> > src/test.
> 
> As it stands now, the test is getting run as part of `make
> check-world`, and it seems like src/test/modules is about testing
> separate executables, so I don't think it makes sense to move the
> tests -- but I could be convinced that I'm missing something.

It's not conceived just as a way to test separate executables; maybe it
is at the moment (though I don't think it is?) but if so that's just an
accident.  The intention is to have modules that get tested without them
being installed, which wasn't the case when they were in contrib.

The problem with check-world is that buildfarm doesn't run it.  We don't 
want to set up separate buildfarm modules for each subdir in src/test;
that would be pretty tedious.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: snapshot too old, configured by time

From
Kevin Grittner
Date:
On Tuesday, September 15, 2015 12:07 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Kevin Grittner wrote:
>> Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

>>> I would place it inside src/test/modules, so that buildfarm
>>> runs it automatically; I'm not sure it will pick up things in
>>> src/test.
>>
>> As it stands now, the test is getting run as part of `make
>> check-world`, and it seems like src/test/modules is about testing
>> separate executables, so I don't think it makes sense to move the
>> tests -- but I could be convinced that I'm missing something.
>
> It's not conceived just as a way to test separate executables; maybe it
> is at the moment (though I don't think it is?) but if so that's just an
> accident.  The intention is to have modules that get tested without them
> being installed, which wasn't the case when they were in contrib.
>
> The problem with check-world is that buildfarm doesn't run it.  We don't
> want to set up separate buildfarm modules for each subdir in src/test;
> that would be pretty tedious.

OK, moved.

All other issues raised by Álvaro and Steve have been addressed,
except for this one, which I will argue against:

> So if I understand correctly, every call to BufferGetPage needs to have
> a TestForOldSnapshot immediately afterwards?  It seems easy to later
> introduce places that fail to test for old snapshots.  What happens if
> they do?  Does vacuum remove tuples anyway and then the query returns
> wrong results?  That seems pretty dangerous.  Maybe the snapshot could
> be an argument of BufferGetPage?

There are 486 occurrences of BufferGetPage in the source code, and
this patch follows 36 of them with TestForOldSnapshot.  This only
needs to be done when a page is going to be used for a scan which
produces user-visible results.  That is, it is *not* needed for
positioning within indexes to add or vacuum away entries, for heap
inserts or deletions (assuming the row to be deleted has already
been found).  It seems wrong to modify about 450 BufferGetPage
references to add a NULL parameter; and if we do want to make that
noop change as insurance, it seems like it should be a separate
patch, since the substance of this patch would be buried under the
volume of that.

I will add this to the November CF.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: snapshot too old, configured by time

From
Steve Singer
Date:
On 10/15/2015 05:47 PM, Kevin Grittner wrote:
> All other issues raised by Álvaro and Steve have been addressed,
> except for this one, which I will argue against:

I've been looking through the updated patch


In snapmgr.c


+ * XXX: If we can trust a read of an int64 value to be atomic, we can skip the
+ * spinlock here.
+ */
+int64
+GetOldSnapshotThresholdTimestamp(void)


Was your intent with the XXX for it to be a TODO to only acquire the
lock on platforms without atomic 64-bit operations?

On a more general note:

I've tried various manual tests of this feature and it sometimes works
as expected and sometimes doesn't.
I'm getting the feeling that how I expect it to work isn't quite in sync
with how it does work.

I'd expect the following to be sufficient to trigger the error:

T1: Obtains a snapshot that can see some rows
T2: Waits 60 seconds and performs an update on those rows
T2: Performs a vacuum
T1: Waits 60 seconds, tries to select from the table.  The snapshot
should be too old


For example, it seems that in test 002 the select_ok on conn1,
following the vacuum but right before the final sleep, is critical to
the snapshot too old error showing up (i.e. if I remove that select_ok
but leave in the sleep, I don't get the error).

Is this intended and if so is there a better way we can explain how
things work?


Also, is 0 intended to be an acceptable value for
old_snapshot_threshold, and if so, what should we expect to see?




>> So if I understand correctly, every call to BufferGetPage needs to have
>> a TestForOldSnapshot immediately afterwards?  It seems easy to later
>> introduce places that fail to test for old snapshots.  What happens if
>> they do?  Does vacuum remove tuples anyway and then the query returns
>> wrong results?  That seems pretty dangerous.  Maybe the snapshot could
>> be an argument of BufferGetPage?
> There are 486 occurences of BufferGetPage in the source code, and
> this patch follows 36 of them with TestForOldSnapshot.  This only
> needs to be done when a page is going to be used for a scan which
> produces user-visible results.  That is, it is *not* needed for
> positioning within indexes to add or vacuum away entries, for heap
> inserts or deletions (assuming the row to be deleted has already
> been found).  It seems wrong to modify about 450 BufferGetPage
> references to add a NULL parameter; and if we do want to make that
> noop change as insurance, it seems like it should be a separate
> patch, since the substance of this patch would be buried under the
> volume of that.
>
> I will add this to the November CF.
>
> --
> Kevin Grittner
> EDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>




Re: snapshot too old, configured by time

From
Michael Paquier
Date:
On Mon, Nov 9, 2015 at 8:07 AM, Steve Singer <steve@ssinger.info> wrote:
> On 10/15/2015 05:47 PM, Kevin Grittner wrote:
>>
>> All other issues raised by Álvaro and Steve have been addressed, except
>> for this one, which I will argue against:
>
>
> I've been looking through the updated patch
>
>
> In snapmgr.c
>
>
> + * XXX: If we can trust a read of an int64 value to be atomic, we can skip
> the
> + * spinlock here.
> + */
> +int64
> +GetOldSnapshotThresholdTimestamp(void)
>
>
> Was your intent with the XXX for it to be a TODO to only aquire the lock on
> platforms without the atomic 64bit operations?
>
> On a more general note:
>
> I've tried various manual tests of this feature and it sometimes works as
> expected and sometimes doesn't.
> I'm getting the feeling that how I expect it to work isn't quite in sync
> with how it does work.
>
> I'd expect the following to be sufficient to generate the test
>
> T1: Obtains a snapshot that can see some rows
> T2: Waits 60 seconds and performs an update on those rows
> T2: Performs a vacuum
> T1: Waits 60 seconds, tries to select from the table.  The snapshot should
> be too old
>
>
> For example it seems that in test 002 the select_ok on conn1 following the
> vacuum but right before the final sleep is critical to the snapshot too old
> error showing up (ie if I remove that select_ok but leave in the sleep I
> don't get the error)
>
> Is this intended and if so is there a better way we can explain how things
> work?
>
>
> Also is 0 intended to be an acceptable value for old_snapshot_threshold and
> if so what should we expect to see then?

There has been a review but no replies for more than 1 month. Returned
with feedback?
--
Michael



Re: snapshot too old, configured by time

From
Kevin Grittner
Date:
On Wed, Dec 2, 2015 at 12:39 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Mon, Nov 9, 2015 at 8:07 AM, Steve Singer <steve@ssinger.info> wrote:

>> In snapmgr.c
>>
>>
>> + * XXX: If we can trust a read of an int64 value to be atomic, we can skip the
>> + * spinlock here.
>> + */
>> +int64
>> +GetOldSnapshotThresholdTimestamp(void)
>>
>>
>> Was your intent with the XXX for it to be a TODO to only aquire the lock on
>> platforms without the atomic 64bit operations?

I'm not sure whether we can safely assume a read of an int64 to be
atomic on any platform; if we actually can on some platforms, and
we have a #define to tell us whether we are in such an environment,
we could condition the spinlock calls on that.  Are we there yet?

>> On a more general note:
>>
>> I've tried various manual tests of this feature and it sometimes works as
>> expected and sometimes doesn't.
>> I'm getting the feeling that how I expect it to work isn't quite in sync
>> with how it does work.
>>
>> I'd expect the following to be sufficient to generate the test
>>
>> T1: Obtains a snapshot that can see some rows
>> T2: Waits 60 seconds and performs an update on those rows
>> T2: Performs a vacuum
>> T1: Waits 60 seconds, tries to select from the table.  The snapshot should
>> be too old
>>
>>
>> For example it seems that in test 002 the select_ok on conn1 following the
>> vacuum but right before the final sleep is critical to the snapshot too old
>> error showing up (ie if I remove that select_ok but leave in the sleep I
>> don't get the error)
>>
>> Is this intended and if so is there a better way we can explain how things
>> work?

At every phase I took a conservative approach toward deferring
pruning of tuples still visible to any snapshot -- often reducing
the overhead of tracking by letting things go to the next minute
boundary.  The idea is that an extra minute or two probably isn't
going to be a big deal in terms of bloat, so if we can save any
effort on the bookkeeping by letting things go a little longer, it
is a worthwhile trade-off.  That does make it hard to give a
precise statement of exactly when a transaction *will* be subject
to cancellation based on this feature, so I have emphasized the
minimum guaranteed time that a transaction will be *safe*.  In
reviewing what you describe, I think I still don't have it as
aggressive as I can (and probably should).  My biggest concern is
that a long-running transaction which gets a transaction ID
matching the xmin on a snapshot it will hold for a long time may
not be subject to cancellation.  That's probably not too hard to
fix, but the bigger problem is the testing.

People have said that issuing SQL commands directly from a TAP test
via DBD::Pg is not acceptable for a core feature, and (despite
assertions to the contrary) I see no way to test this feature with
existing testing mechanisms.  The bigger set of work here, if we
don't want this feature to go in without any testing scripts (which
is not acceptable IMO), is to enhance the isolation tester or
hybridize TAP testing with the isolation tester.

>> Also is 0 intended to be an acceptable value for old_snapshot_threshold and
>> if so what should we expect to see then?

The docs in the patch say this:

+        <para>
+         A value of <literal>-1</> disables this feature, and is the default.
+         Useful values for production work probably range from a small number
+         of hours to a few days.  The setting will be coerced to a granularity
+         of minutes, and small numbers (such as <literal>0</> or
+         <literal>1min</>) are only allowed because they may sometimes be
+         useful for testing.  While a setting as high as <literal>60d</> is
+         allowed, please note that in many workloads extreme bloat or
+         transaction ID wraparound may occur in much shorter time frames.
+        </para>

Once we can agree on a testing methodology I expect that I will be
adding a number of tests based on a cluster started with
old_snapshot_threshold = 0, but as I said in my initial post of the
patch I was keeping the tests in the patch thin until it was
confirmed whether this testing methodology was acceptable.  Since
it isn't, that was just as well.  The time put into learning enough
about perl and TAP tests to create those tests already exceeds the
time to develop the actual patch, and it looks like even more will
be needed for a test methodology that doesn't require adding a
package or downloading a CPAN module.  C'est la vie.  I did expand
my perl and TAP knowledge considerably, for what that's worth.

> There has been a review but no replies for more than 1 month. Returned
> with feedback?

I do intend to post another version of the patch to tweak the
calculations again, after I can get a patch in to expand the
testing capabilities to allow an acceptable way to test the patch
-- so I put it into the next CF instead.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: snapshot too old, configured by time

From
Michael Paquier
Date:
On Thu, Dec 3, 2015 at 5:48 AM, Kevin Grittner wrote:
>> There has been a review but no replies for more than 1 month. Returned
>> with feedback?
>
> I do intend to post another version of the patch to tweak the
> calculations again, after I can get a patch in to expand the
> testing capabilities to allow an acceptable way to test the patch
> -- so I put it into the next CF instead.

OK, thanks.
-- 
Michael



Re: snapshot too old, configured by time

From
Andres Freund
Date:
On 2015-12-02 14:48:24 -0600, Kevin Grittner wrote:
> On Wed, Dec 2, 2015 at 12:39 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > On Mon, Nov 9, 2015 at 8:07 AM, Steve Singer <steve@ssinger.info> wrote:
> 
> >> In snapmgr.c
> >>
> >>
> >> + * XXX: If we can trust a read of an int64 value to be atomic, we can skip the
> >> + * spinlock here.
> >> + */
> >> +int64
> >> +GetOldSnapshotThresholdTimestamp(void)
> >>
> >>
> >> Was your intent with the XXX for it to be a TODO to only aquire the lock on
> >> platforms without the atomic 64bit operations?
> 
> I'm not sure whether we can safely assume a read of an int64 to be
> atomic on any platform; if we actually can on some platforms, and
> we have a #define to tell us whether we are in such an environment,
> we could condition the spinlock calls on that.  Are we there yet?

We currently don't assume it's atomic.  And there are platforms, e.g.
32-bit ARM, where that's not the case
(cf. https://wiki.postgresql.org/wiki/Atomics).  It'd be rather useful
to abstract that knowledge into a macro...

Andres



Re: snapshot too old, configured by time

From
Alvaro Herrera
Date:
Kevin Grittner wrote:

> People have said that issuing SQL commands directly from a TAP test
> via DBD::Pg is not acceptable for a core feature, and (despite
> assertions to the contrary) I see no way to test this feature with
> existing testing mechanisms.  The bigger set of work here, if we
> don't want this feature to go in without any testing scripts (which
> is not acceptable IMO), is to enhance the isolation tester or
> hybridize TAP testing with the isolation tester.

Is it possible to use the PostgresNode stuff to test this?  If not,
perhaps if you restate what additional capabilities you need we could
look into adding them there.  I suspect that what you need is the
ability to keep more than one session open and feed them commands;
perhaps we could have the framework have a function that opens a psql
process and returns a FD to which the test program can write, using the
IPC::Run stuff (start / pump / finish).

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: snapshot too old, configured by time

From
Alvaro Herrera
Date:
Kevin Grittner wrote:

> > There has been a review but no replies for more than 1 month. Returned
> > with feedback?
> 
> I do intend to post another version of the patch to tweak the
> calculations again, after I can get a patch in to expand the
> testing capabilities to allow an acceptable way to test the patch
> -- so I put it into the next CF instead.

Two months passed since this, and no activity.  I'm closing this as
returned-with-feedback now; you're of course free to resubmit to
2016-03.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: snapshot too old, configured by time

From
Kevin Grittner
Date:
On Fri, Jan 8, 2016 at 5:22 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

>> People have said that issuing SQL commands directly from a TAP test
>> via DBD::Pg is not acceptable for a core feature, and (despite
>> assertions to the contrary) I see no way to test this feature with
>> existing testing mechanisms.  The bigger set of work here, if we
>> don't want this feature to go in without any testing scripts (which
>> is not acceptable IMO), is to enhance the isolation tester or
>> hybridize TAP testing with the isolation tester.
>
> Is it possible to use the PostgresNode stuff to test this?  If not,
> perhaps if you restate what additional capabilities you need we could
> look into adding them there.  I suspect that what you need is the
> ability to keep more than one session open and feed them commands;
> perhaps we could have the framework have a function that opens a psql
> process and returns a FD to which the test program can write, using the
> IPC::Run stuff (start / pump / finish).

Resubmitting for the March CF.

The main thing that changed is that I can now run all the
regression and isolation tests using installcheck with
old_snapshot_threshold = 0 and get a clean run.  That probably gets
better overall coverage than specific tests to demonstrate the
"snapshot too old" error, but of course we need those, too.  While
I can do that with hand-run psql sessions or through connectors
from different languages, I have not been able to wrangle the
testing tools we support through the build system into working for
this purpose.  (I had been hoping that the recent improvements to
the TAP testing libraries would give me the traction to get there,
but either it's still not there or my perl-fu is just too weak to
figure out how to use those features -- suggestions welcome.)

Basically, a connection needs to remain open and interleave
commands with other connections, which the isolation tester does
just fine; but it needs to do that using a custom postgresql.conf
file, which TAP does just fine.  I haven't been able to see the
right way to get a TAP test to set up a customized installation to
run isolation tests against.  If I can get that working, I have
additional tests I can drop into that.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: snapshot too old, configured by time

From
Andres Freund
Date:
On 2016-02-29 18:30:27 -0600, Kevin Grittner wrote:
> Basically, a connection needs to remain open and interleave
> commands with other connections, which the isolation tester does
> just fine; but it needs to do that using a custom postgresql.conf
> file, which TAP does just fine.  I haven't been able to see the
> right way to get a TAP test to set up a customized installation to
> run isolation tests against.  If I can get that working, I have
> additional tests I can drop into that.

Check contrib/test_decoding's makefile. It does just that with
isolationtester.

Andres



Re: snapshot too old, configured by time

From
Michael Paquier
Date:
On Tue, Mar 1, 2016 at 9:35 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-02-29 18:30:27 -0600, Kevin Grittner wrote:
>> Basically, a connection needs to remain open and interleave
>> commands with other connections, which the isolation tester does
>> just fine; but it needs to do that using a custom postgresql.conf
>> file, which TAP does just fine.  I haven't been able to see the
>> right way to get a TAP test to set up a customized installation to
>> run isolation tests against.  If I can get that working, I have
>> additional tests I can drop into that.

Launching psql from PostgresNode does not hold the connection open;
we would need to reinvent or integrate the logic of existing drivers
to hold the connection context properly, but that's utterly
complicated for not much gain.

> Check contrib/test_decoding's makefile. It does just that with
> isolationtester.

pg_isolation_regress --temp-config is the key item here; it lets you
force a test to run on a server with the configuration you want.
-- 
Michael



Re: snapshot too old, configured by time

From
Kevin Grittner
Date:
On Tue, Mar 1, 2016 at 12:58 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Tue, Mar 1, 2016 at 9:35 AM, Andres Freund <andres@anarazel.de> wrote:
>> On 2016-02-29 18:30:27 -0600, Kevin Grittner wrote:
>>> Basically, a connection needs to remain open and interleave
>>> commands with other connections, which the isolation tester does
>>> just fine; but it needs to do that using a custom postgresql.conf
>>> file, which TAP does just fine.  I haven't been able to see the
>>> right way to get a TAP test to set up a customized installation to
>>> run isolation tests against.  If I can get that working, I have
>>> additional tests I can drop into that.

>> Check contrib/test_decoding's makefile. It does just that with
>> isolationtester.
>
> pg_isolation_regress --temp-config is the key item here, you can
> enforce a test to run on a server with a wanted configuration set.

Thanks for the tips.  Attached is a minimal set of isolation tests.
I can expand on it if needed, but wanted:

(1) to confirm that this is the right way to do this, and

(2) how long people were willing to tolerate these tests running.

Since we're making this time-based (by popular demand), there must
be delays to see the new behavior.  This very minimal pair of tests
runs in just under one minute on my i7.  Decent coverage of all the
index AMs would probably require tests which run for at least 10
minutes, and probably double that.  I don't recall any satisfactory
resolution to prior discussions about long-running tests.

This is a follow-on patch, just to add isolation testing; the prior
patch must be applied, too.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: snapshot too old, configured by time

From
Robert Haas
Date:
On Thu, Mar 3, 2016 at 2:40 PM, Kevin Grittner <kgrittn@gmail.com> wrote:
> Thanks for the tips.  Attached is a minimal set of isolation tests.
> I can expand on it if needed, but wanted:
>
> (1) to confirm that this is the right way to do this, and
>
> (2) how long people were willing to tolerate these tests running.
>
> Since we're making this time-based (by popular demand), there must
> be delays to see the new behavior.  This very minimal pair of tests
> runs in just under one minute on my i7.  Decent coverage of all the
> index AMs would probably require tests which run for at least 10
> minutes, and probably double that.  I don't recall any satisfactory
> resolution to prior discussions about long-running tests.
>
> This is a follow-on patch, just to add isolation testing; the prior
> patch must be applied, too.

Michael, any chance that you could take a look at what Kevin did here
and see if it looks good?

I'm sure the base patch could use more review too, if anyone can find the time.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: snapshot too old, configured by time

From
Michael Paquier
Date:
On Fri, Mar 11, 2016 at 2:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Mar 3, 2016 at 2:40 PM, Kevin Grittner <kgrittn@gmail.com> wrote:
>> Thanks for the tips.  Attached is a minimal set of isolation tests.
>> I can expand on it if needed, but wanted:
>>
>> (1) to confirm that this is the right way to do this, and
>>
>> (2) how long people were willing to tolerate these tests running.
>>
>> Since we're making this time-based (by popular demand), there must
>> be delays to see the new behavior.  This very minimal pair of tests
>> runs in just under one minute on my i7.  Decent coverage of all the
>> index AMs would probably require tests which run for at least 10
>> minutes, and probably double that.  I don't recall any satisfactory
>> resolution to prior discussions about long-running tests.
>>
>> This is a follow-on patch, just to add isolation testing; the prior
>> patch must be applied, too.
>
> Michael, any chance that you could take a look at what Kevin did here
> and see if it looks good?

OK, I am marking this email. Just don't expect any updates from my
side until mid/end of next week.

> I'm sure the base patch could use more review too, if anyone can find the time.

I guess I am going to need to look at the patch if feedback for the
tests is needed.  There is no point in looking at the tests without
poking at the patch.
-- 
Michael



Re: snapshot too old, configured by time

From
Kevin Grittner
Date:
New patch just to merge in recent commits -- it was starting to
show some bit-rot.  Tests folded in with main patch.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: snapshot too old, configured by time

From
Jeff Janes
Date:
On Thu, Mar 17, 2016 at 2:15 PM, Kevin Grittner <kgrittn@gmail.com> wrote:
> New patch just to merge in recent commits -- it was starting to
> show some bit-rot.  Tests folded in with main patch.

I'm not sure if this is operating as expected.

I set the value to 1min.

I set up a test like this:

pgbench -i

pgbench -c4 -j4 -T 3600 &

### watch the size of branches table
while (true) ; do psql -c "\dt+" | fgrep _branches; sleep 10; done &

### set up a long lived snapshot.
psql -c 'begin; set transaction isolation level repeatable read;
select sum(bbalance) from pgbench_branches; select pg_sleep(300);
select sum(bbalance) from pgbench_branches;'

As this runs, I can see the size of the pgbench_branches bloating once
the snapshot is taken, and continues bloating at a linear rate for the
full 300 seconds.

Once the 300 second pg_sleep is up, the long-lived snapshot holder
receives an error once it tries to access the table again, and then
the bloat stops increasing.  But shouldn't the bloat have stopped
increasing as soon as the snapshot became doomed, which would be after
a minute or so?


Cheers,

Jeff



Re: snapshot too old, configured by time

From
Peter Geoghegan
Date:
On Thu, Mar 17, 2016 at 2:15 PM, Kevin Grittner <kgrittn@gmail.com> wrote:
> New patch just to merge in recent commits -- it was starting to
> show some bit-rot.  Tests folded in with main patch.

I haven't read the patch, but I wonder: What are the implications here
for B-Tree page recycling by VACUUM? I know that you understand this
topic well, so I don't assume that you didn't address it.

Offhand, I imagine that there'd be some special considerations. Why is
it okay that an index scan could land on a deleted page with no
interlock against VACUUM's page recycling? Or, what prevents that from
happening in the first place?

I worry that something weird could happen there. For example, perhaps
the page LSN on what is actually a newly recycled page could be set
such that the backend following a stale right spuriously raises a
"snapshot too old" error.

I suggest you consider making amcheck [1] a part of your testing
strategy. I think that this patch is a good idea, and I'd be happy to
take feedback from you on how to make amcheck more effective for
testing this patch in particular.

[1] https://commitfest.postgresql.org/9/561/
-- 
Peter Geoghegan



Re: snapshot too old, configured by time

From
Peter Geoghegan
Date:
On Sun, Mar 20, 2016 at 4:25 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I worry that something weird could happen there. For example, perhaps
> the page LSN on what is actually a newly recycled page could be set
> such that the backend following a stale right spuriously raises a
> "snapshot too old" error.

I mean a stale right-link, of course.

-- 
Peter Geoghegan



Re: snapshot too old, configured by time

From
Thom Brown
Date:
On 17 March 2016 at 21:15, Kevin Grittner <kgrittn@gmail.com> wrote:
> New patch just to merge in recent commits -- it was starting to
> show some bit-rot.  Tests folded in with main patch.

In session 1, I've run:

# begin transaction isolation level repeatable read ;
BEGIN

*# declare stuff scroll cursor for select * from test where num between 5 and 9;
DECLARE CURSOR

*# fetch forward 5 from stuff;
 id  | num |               thing
-----+-----+------------------------------------
   2 |   8 | hellofji djf odsjfiojdsif ojdsiof
   3 |   7 | hellofji djf odsjfiojdsif ojdsiof
 112 |   9 | hellofji djf odsjfiojdsif ojdsiof
 115 |   6 | hellofji djf odsjfiojdsif ojdsiof
 119 |   8 | hellofji djf odsjfiojdsif ojdsiof
(5 rows)


In session 2, over a min later:

# update test set num = 12 where num between 5 and 9 and id between 120 and 250;
ERROR:  snapshot too old


Then back to session 1:

*# fetch forward 5 from stuff;
ERROR:  snapshot too old


Should session 2 be getting that error?

Thom



Re: snapshot too old, configured by time

From
Kevin Grittner
Date:
Thanks to all for the feedback; I will try to respond later this
week.  First I'm trying to get my reviews for other patches posted.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: snapshot too old, configured by time

From
Michael Paquier
Date:
On Tue, Mar 22, 2016 at 5:05 AM, Kevin Grittner <kgrittn@gmail.com> wrote:
> Thanks to all for the feedback; I will try to respond later this
> week.  First I'm trying to get my reviews for other patches posted.

I have been looking at 4a, the test module, and things are looking
good IMO. One thing that I think would be worthwhile is to define
the options for isolation tests in a variable, like ISOLATION_OPTS,
to allow the MSVC scripts to fetch those option values more easily.

+submake-test_decoding:
+   $(MAKE) -C $(top_builddir)/src/test/modules/snapshot_too_old
The target name here is incorrect. This should refer to snapshot_too_old.

Regarding the main patch:
+        <primary><varname>old_snapshot_threshold</> configuration
parameter</primary>
snapshot_valid_limit?
   page = BufferGetPage(buf);
+   TestForOldSnapshot(scan->xs_snapshot, rel, page);
This is a sequence repeated many times in this patch, a new routine,
say BufferGetPageExtended with a uint8 flag, one flag being used to
test old snapshots would be more adapted. But this would require
creating a header dependency between the buffer manager and
SnapshotData.. Or more simply we may want a common code path when
fetching a page that a backend is going to use to fetch tuples. I am
afraid of TestForOldSnapshot() being something that could be easily
forgotten in code paths introduced in future patches...

+   if (whenTaken < 0)
+   {
+       elog(DEBUG1,
+            "MaintainOldSnapshotTimeMapping called with negative whenTaken = %ld",
+            (long) whenTaken);
+       return;
+   }
+   if (!TransactionIdIsNormal(xmin))
+   {
+       elog(DEBUG1,
+            "MaintainOldSnapshotTimeMapping called with xmin = %lu",
+            (unsigned long) xmin);
+       return;
+   }
Material for two assertions?
-- 
Michael



Re: snapshot too old, configured by time

From
David Steele
Date:
Hi Kevin,

On 3/21/16 4:05 PM, Kevin Grittner wrote:

> Thanks to all for the feedback; I will try to respond later this
> week.  First I'm trying to get my reviews for other patches posted.

We're getting to the end of the CF now.  Do you know when you'll have an 
updated patch ready?

Thanks,
-- 
-David
david@pgmasters.net



Re: snapshot too old, configured by time

From
Kevin Grittner
Date:
On Tue, Mar 29, 2016 at 8:58 AM, David Steele <david@pgmasters.net> wrote:

> We're getting to the end of the CF now.  Do you know when you'll have an
> updated patch ready?

I am working on it right now.  Hopefully I can get it all sorted today.

-- 
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: snapshot too old, configured by time

From
Alvaro Herrera
Date:
Michael Paquier wrote:

>     page = BufferGetPage(buf);
> +   TestForOldSnapshot(scan->xs_snapshot, rel, page);
> This is a sequence repeated many times in this patch, a new routine,
> say BufferGetPageExtended with a uint8 flag, one flag being used to
> test old snapshots would be more adapted. But this would require
> creating a header dependency between the buffer manager and
> SnapshotData.. Or more simply we may want a common code path when
> fetching a page that a backend is going to use to fetch tuples. I am
> afraid of TestForOldSnapshot() being something that could be easily
> forgotten in code paths introduced in future patches...

I said exactly the same thing, and Kevin dismissed it.

I would be worried about your specific proposal though, because it's
easy to just call BufferGetPage() (i.e. the not-extended version) and
forget the old-snapshot protection completely.

I think a safer proposition would be to replace all current
BufferGetPage() calls (there are about 500) by adding the necessary
arguments: buffer, snapshot, rel, and an integer "flags".  All this
without adding the feature.  Then a subsequent commit would add the
TestForOldSnapshot inside BufferGetPage, *except* when a
BUFFER_NO_SNAPSHOT_TEST flag is passed.  That way, new code always gets
the snapshot test by default.

I don't like the new header dependency either, though.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: snapshot too old, configured by time

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> I think a safer proposition would be to replace all current
> BufferGetPage() calls (there are about 500) by adding the necessary
> arguments: buffer, snapshot, rel, and an integer "flags".  All this
> without adding the feature.  Then a subsequent commit would add the
> TestForOldSnapshot inside BufferGetPage, *except* when a
> BUFFER_NO_SNAPSHOT_TEST flag is passed.  That way, new code always gets
> the snapshot test by default.

That seems awfully invasive, not to mention performance-killing if
the expectation is that most such calls are going to need a snapshot
check.  (Quite aside from the calls themselves, are they all in
routines that are being passed the right snapshot today?)

TBH, I think that shoving in something like this at the end of the last
commitfest would be a bad idea even if there were widespread consensus
that we wanted the feature ... which I am not sure there is.

I think it might be time to bounce this one to 9.7.
        regards, tom lane



Re: snapshot too old, configured by time

From
Kevin Grittner
Date:
On Wed, Mar 30, 2016 at 11:37 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> I think a safer proposition would be to replace all current
>> BufferGetPage() calls (there are about 500) by adding the necessary
>> arguments: buffer, snapshot, rel, and an integer "flags".  All this
>> without adding the feature.  Then a subsequent commit would add the
>> TestForOldSnapshot inside BufferGetPage, *except* when a
>> BUFFER_NO_SNAPSHOT_TEST flag is passed.  That way, new code always gets
>> the snapshot test by default.
>
> That seems awfully invasive,

That's the argument I made, which Álvaro described as "dismissing"
his suggestion.  In this post from October of 2015, I pointed out
that there are 36 calls where we need a snapshot and 450 where we
don't.

http://www.postgresql.org/message-id/56479263.1140984.1444945639606.JavaMail.yahoo@mail.yahoo.com

> not to mention performance-killing if the expectation is that
> most such calls are going to need a snapshot check.

This patch is what allowed a customer whose performance
requirements we previously could not meet to pass them.  It is the
opposite of performance-killing.

> (Quite aside from the calls themselves, are they all in
> routines that are being passed the right snapshot today?)

I went over that very carefully when the patch was first proposed
in January of 2015, and have kept an eye on things to try to avoid
bit-rot which might introduce new calls which need to be touched.
The customer for which this was initially developed uses a 30 day
test run with very complex production releases driven by a
simulated user load with large numbers of users.  EDB has
back-patched it to 9.4 where an earlier version of it is being used
in production by this (large) customer.

> TBH, I think that shoving in something like this at the end of the last
> commitfest would be a bad idea even if there were widespread consensus
> that we wanted the feature ... which I am not sure there is.

I don't recall anyone opposing the feature itself except you,
and it has had enthusiastic support from many.  Before I was made
aware of a relevant isolation tester feature, there were many
objections to my efforts at regression testing, and Álvaro has
argued for touching 450 locations in the code that otherwise don't
need it, just as "reminders" to people to consider whether newly
added calls might need a snapshot, and he doesn't like the new
header dependencies.  Simon seemed to want it in 9.5, but that was
clearly premature, IMO.

Characterizing this as being shoved in at the last moment seems
odd, since the big hang-up from the November CF was the testing
methodology in the patch.  It has been feature-complete since the
September CF, and modified based on feedback.  Granted, some
additional testing in this CF brought up a couple things that merit
a look, but this patch is hardly unique in that regard.

> I think it might be time to bounce this one to 9.7.

If there is a consensus for that, sure, or if I can't sort out the
latest issues by feature freeze (which is admittedly looming).

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: snapshot too old, configured by time

From
Kevin Grittner
Date:
On Sun, Mar 20, 2016 at 6:25 PM, Peter Geoghegan <pg@heroku.com> wrote:

> I haven't read the patch, but I wonder: What are the implications here
> for B-Tree page recycling by VACUUM?

> Offhand, I imagine that there'd be some special considerations. Why is
> it okay that an index scan could land on a deleted page with no
> interlock against VACUUM's page recycling? Or, what prevents that from
> happening in the first place?

When the initial "proof of concept" patch was tested by the
customer, it was not effective due to issues related to what you
raise.  Autovacuum workers were blocking due to the page pins for
scans using these old snapshots, causing the bloat to accumulate in
spite of this particular patch.  This was addressed, at least to a
degree sufficient for this customer, with this patch:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2ed5b87f96d473962ec5230fd820abfeaccb2069

Basically, for most common cases the "super-exclusive" locking has
been eliminated from btree logic; and I have been happy to see that
Simon has been working on dealing with the corner cases where I
hadn't rooted it out.

> I worry that something weird could happen there. For example, perhaps
> the page LSN on what is actually a newly recycled page could be set
> such that the backend following a stale right spuriously raises a
> "snapshot too old" error.

That particular detail doesn't seem to be a realistic concern,
though -- if a page has been deleted, assigned to a new place in
the index with an LSN corresponding to that action, it would be a
pretty big bug if a right-pointer still referenced it.

> I suggest you consider making amcheck [1] a part of your testing
> strategy. I think that this patch is a good idea, and I'd be happy to
> take feedback from you on how to make amcheck more effective for
> testing this patch in particular.

I'm not sure how that would fit in; could you elaborate?

The biggest tension making testing hard is that everyone (and
I mean everyone!) preferred to see this configured by time rather
than number of transactions, so there is no change in behavior
without some sort of wait for elapsed time.  But nobody wants to
drive time needed for running regression tests too high.  Testing
was far easier when a transaction count was used for configuration.
old_snapshot_threshold = -1 (the default) completely disables the
new behavior, and I basically abused a configuration setting of 0
to mean a few seconds so I could get some basic testing added to
make check-world while keeping the additional time for the tests
(barely) below one minute.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: snapshot too old, configured by time

From
Peter Geoghegan
Date:
On Wed, Mar 30, 2016 at 11:53 AM, Kevin Grittner <kgrittn@gmail.com> wrote:
> When the initial "proof of concept" patch was tested by the
> customer, it was not effective due to issues related to what you
> raise.  Autovacuum workers were blocking due to the page pins for
> scans using these old snapshots, causing the bloat to accumulate in
> spite of this particular patch.  This was addressed, at least to a
> degree sufficient for this customer, with this patch:
>
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2ed5b87f96d473962ec5230fd820abfeaccb2069
>
> Basically, for most common cases the "super-exclusive" locking has
> been eliminated from btree logic; and I have been happy to see that
> Simon has been working on dealing with the corner cases where I
> hadn't rooted it out.
>
>> I worry that something weird could happen there. For example, perhaps
>> the page LSN on what is actually a newly recycled page could be set
>> such that the backend following a stale right spuriously raises a
>> "snapshot too old" error.
>
> That particular detail doesn't seem to be a realistic concern,
> though -- if a page has been deleted, assigned to a new place in
> the index with an LSN corresponding to that action, it would be a
> pretty big bug if a right-pointer still referenced it.

Yes, that would be a big bug. But I wasn't talking about
"super-exclusive" locking. Rather, I was talking about the general way
in which index scans are guaranteed to not land on an already-recycled
page (not a half-dead page, and not a fully deleted page -- a fully
reclaimed/recycled page). This works without buffer pins or buffer
locks needing to be held at all -- there is a global interlock against
page *recycling* based on RecentGlobalXmin, per the nbtree README. So,
this vague concern of mine is about VACUUM's B-Tree page recycling.

During an index scan, we expect to be able to land on the next page,
and to be at least able to reason about it being deleted, even though
we don't hold a pin on anything for a period. We certainly are shy
about explaining all this, but if you look at a routine like
_bt_search() carefully (the routine that is used to descend a B-Tree),
it doesn't actually hold a pin concurrently, as we drop a level (and
certainly not a buffer lock, either). The next page/block should be
substantively the same page as expected from the downlink (or right
link) we followed, entirely because of the RecentGlobalXmin interlock.
Backwards scans also rely on this.

This is just a vague concern, and possibly this is completely
irrelevant. I haven't read the patch.

>> I suggest you consider making amcheck [1] a part of your testing
>> strategy. I think that this patch is a good idea, and I'd be happy to
>> take feedback from you on how to make amcheck more effective for
>> testing this patch in particular.
>
> I'm not sure how that would fit in; could you elaborate?

Well, amcheck is a tool that in essence makes sure that B-Trees look
structurally sound, and respect invariants like having every item on
each page in logical order. That could catch a bug of the kind I just
described, because it's quite likely that the recycled page would
happen to have items that didn't comport with the ordering on the
page. The block has been reused essentially at random. Importantly,
amcheck can catch this without anything more than an AccessShareLock,
so you have some hope of catching this kind of race condition (the
stale downlink that you followed to get to the
spuriously-recycled-early page doesn't stay stale for long). Or, maybe
it would happen to catch some other random problem. Difficult to say.

Again, this is based on a speculation that might be wrong. But it's
worth considering.

-- 
Peter Geoghegan



Re: snapshot too old, configured by time

From
Alvaro Herrera
Date:
Kevin Grittner wrote:
> On Wed, Mar 30, 2016 at 11:37 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> >> I think a safer proposition would be to replace all current
> >> BufferGetPage() calls (there are about 500) by adding the necessary
> >> arguments: buffer, snapshot, rel, and an integer "flags".  All this
> >> without adding the feature.  Then a subsequent commit would add the
> >> TestForOldSnapshot inside BufferGetPage, *except* when a
> >> BUFFER_NO_SNAPSHOT_TEST flag is passed.  That way, new code always gets
> >> the snapshot test by default.
> >
> > That seems awfully invasive,
> 
> That's the argument I made, which Álvaro described as "dismissing"
> his suggestion.  In this post from October of 2015, I pointed out
> that there are 36 calls where we need a snapshot and 450 where we
> don't.
> 
> http://www.postgresql.org/message-id/56479263.1140984.1444945639606.JavaMail.yahoo@mail.yahoo.com

I understand the invasiveness argument, but to me the danger of
introducing new bugs trumps that.  The problem is not the current code,
but future patches: it is just too easy to make the mistake of not
checking the snapshot in new additions of BufferGetPage.  And you said
that the result of missing a check is silent wrong results from queries
that should instead be cancelled, which seems fairly bad to me.  My
impression was that you were actually considering doing something about
that -- sorry for the lack of clarity.

We have made similarly invasive changes in the past -- the
SearchSysCache API for instance.

> > not to mention performance-killing if the expectation is that
> > most such calls are going to need a snapshot check.
> 
> This patch is what allowed a customer whose performance
> requirements we previously could not meet to pass them.  It is the
> opposite of performance-killing.

I think Tom misunderstood what I said and you misunderstood what Tom
said.  Let me attempt to set things straight.

I said that we should change BufferGetPage into having the snapshot
check built-in, except in the cases where a flag is passed; and the flag
would be passed in all cases except those 30-something you identified.
In other words, the behavior in all the current callsites would be
identical to what's there today; we could have a macro do the first
check so that we don't introduce the overhead of a function call in the
450 cases where it's not needed.

Tom said that my proposal would be performance-killing, not that your
patch would be performance-killing.  But as I argue above, with my
proposal performance would stay the same, so we're actually okay.


I don't think anybody disputes that your patch is good in general.
I would be happy with it in 9.6, but I have my reservations about the
aforementioned problem.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: snapshot too old, configured by time

From
Peter Geoghegan
Date:
On Wed, Mar 30, 2016 at 12:21 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Well, amcheck is a tool that in essence makes sure that B-Trees look
> structurally sound, and respect invariants like having every item on
> each page in logical order. That could catch a bug of the kind I just
> described, because it's quite likely that the recycled page would
> happen to have items that didn't comport with the ordering on the
> page.

I mean: didn't comport with the ordering of the last page "on the same
level" (but, due to this issue, maybe not actually on the same level).
We check whether the first item on the "right page" (in actuality,
due to this bug, the new page image following a spuriously early
recycle) is greater than or equal to the last item on the previous
page (the page whose right-link we followed). On each level,
everything should be in order -- that's an invariant that (I posit)
we can safely check with only an AccessShareLock.

Making the relation lock only a AccessShareLock is not just about
reducing the impact on production systems. It's also about making the
tool more effective at finding these kinds of transient races, with
subtle user-visible symptoms.

Again, I don't want to prejudice anyone against your patch, which I
haven't read.

-- 
Peter Geoghegan



Re: snapshot too old, configured by time

From
Kevin Grittner
Date:
On Sat, Mar 19, 2016 at 1:27 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

> I'm not sure if this is operating as expected.
>
> I set the value to 1min.
>
> I set up a test like this:
>
> pgbench -i
>
> pgbench -c4 -j4 -T 3600 &
>
> ### watch the size of branches table
> while (true) ; do psql -c "\dt+" | fgrep _branches; sleep 10; done &
>
> ### set up a long lived snapshot.
> psql -c 'begin; set transaction isolation level repeatable read;
> select sum(bbalance) from pgbench_branches; select pg_sleep(300);
> select sum(bbalance) from pgbench_branches;'
>
> As this runs, I can see the size of the pgbench_branches bloating once
> the snapshot is taken, and continues bloating at a linear rate for the
> full 300 seconds.
>
> Once the 300 second pg_sleep is up, the long-lived snapshot holder
> receives an error once it tries to access the table again, and then
> the bloat stops increasing.  But shouldn't the bloat have stopped
> increasing as soon as the snapshot became doomed, which would be after
> a minute or so?

This is actually operating as intended, not a bug.  Try running a
manual VACUUM command about two minutes after the snapshot is taken
and you should get a handle on what's going on.  The old tuples
become eligible for vacuuming after one minute, but that doesn't
necessarily mean that autovacuum jumps in and that the space starts
getting reused.  The manual vacuum will allow that, as you should
see on your monitoring window.  A connection should not get the
error just because it is using a snapshot that tries to look at
data that might be wrong, and the connection holding the long-lived
snapshot doesn't do that until it awakes from the sleep and runs
the next SELECT command.

All is well as far as I can see here.

Thanks for checking, though!  It is an interesting test!

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: snapshot too old, configured by time

From
Kevin Grittner
Date:
On Wed, Mar 30, 2016 at 2:29 PM, Peter Geoghegan <pg@heroku.com> wrote:

> [Does the patch allow dangling page pointers?]

> Again, I don't want to prejudice anyone against your patch, which I
> haven't read.

I don't believe that the way the patch does its business opens any
new vulnerabilities of this type.  If you see such after looking at
the patch, let me know.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: snapshot too old, configured by time

From
Kevin Grittner
Date:
On Wed, Mar 30, 2016 at 2:34 PM, Kevin Grittner <kgrittn@gmail.com> wrote:

> A connection should not get the
> error just because it is using a snapshot that tries to look at
> data that might be wrong, and the connection holding the long-lived
> snapshot doesn't do that until it awakes from the sleep and runs
> the next SELECT command.

Well, that came out wrong.

A connection should not get the "snapshot too old" error just
because it *holds* an old snapshot; it must actually attempt to
read an affected page *using* an old snapshot, and the connection
holding the long-lived snapshot doesn't do that until it awakes
from the sleep and runs the next SELECT command.

Sorry for any confusion from the sloppy editing.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: snapshot too old, configured by time

From
Kevin Grittner
Date:
On Wed, Mar 30, 2016 at 2:22 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

> I understand the invasiveness argument, but to me the danger of
> introducing new bugs trumps that.  The problem is not the current code,
> but future patches: it is just too easy to make the mistake of not
> checking the snapshot in new additions of BufferGetPage.  And you said
> that the result of missing a check is silent wrong results from queries
> that should instead be cancelled, which seems fairly bad to me.

Fair point.

> I said that we should change BufferGetPage into having the snapshot
> check built-in, except in the cases where a flag is passed; and the flag
> would be passed in all cases except those 30-something you identified.
> In other words, the behavior in all the current callsites would be
> identical to what's there today; we could have a macro do the first
> check so that we don't introduce the overhead of a function call in the
> 450 cases where it's not needed.

In many of the places that BufferGetPage is called there is not a
snapshot available.  I assume that you would be OK with an Assert
that the flag was passed if the snapshot is NULL?  I had been
picturing what you were requesting as just adding a snapshot
parameter and assuming that NULL meant not to check; adding two
parameters where the flag explicitly calls out that the check is not
needed might do more to prevent accidents, but I do wonder how much
it would help during copy/paste frenzy.  Touching all spots to use
the new function signature would be a mechanical job with the
compiler catching any errors, so it doesn't seem crazy to refactor
that now, but I would like to hear what some others think about
this.
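
To make the trade-off concrete, here is a self-contained toy of that
forced-choice signature (the stub types, the enum values, and all names
here are an illustrative sketch, not the actual patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stub types standing in for the real PostgreSQL definitions. */
typedef int Buffer;
typedef char *Page;
typedef struct SnapshotData { int dummy; } SnapshotData, *Snapshot;
typedef struct RelationData *Relation;

typedef enum
{
    BGP_NO_SNAPSHOT_TEST,      /* caller will not read tuples via a snapshot */
    BGP_TEST_FOR_OLD_SNAPSHOT  /* check the snapshot against the threshold */
} BufferGetPageAgeTest;

static bool test_for_old_snapshot_ran = false;

/* Stand-in for TestForOldSnapshot(); the real one raises SQLSTATE 72000. */
static void
TestForOldSnapshot(Snapshot snapshot, Relation relation, Page page)
{
    (void) snapshot;
    (void) relation;
    (void) page;
    test_for_old_snapshot_ran = true;
}

static char fake_block[8192];

/*
 * The forced-choice signature: every call site must state explicitly
 * whether the old-snapshot check applies, so a copied-and-pasted new
 * caller cannot silently skip it.
 */
static inline Page
BufferGetPage(Buffer buffer, Snapshot snapshot, Relation relation,
              BufferGetPageAgeTest agetest)
{
    (void) buffer;
    /* A NULL snapshot is only legal when the caller opts out of the test. */
    assert(agetest == BGP_NO_SNAPSHOT_TEST || snapshot != NULL);
    if (agetest == BGP_TEST_FOR_OLD_SNAPSHOT)
        TestForOldSnapshot(snapshot, relation, fake_block);
    return fake_block;
}
```

With this shape a NULL snapshot combined with a request for the check
trips the assertion immediately, which is the accident-prevention being
discussed.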

> Tom said that my proposal would be performance-killing, not that your
> patch would be performance-killing.  But as I argue above, with my
> proposal performance would stay the same, so we're actually okay.
>
> I don't think nobody disputes that your patch is good in general.
> I would be happy with it in 9.6, but I have my reservations about the
> aforementioned problem.

We have a lot of places in our code where people need to know
things that they are not reminded of by the surrounding code, but
I'm not about to argue that's a good thing; if the consensus is
that this would help prevent future bugs when new BufferGetPage
calls are added, I can go with the flow.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: snapshot too old, configured by time

From
Alvaro Herrera
Date:
Kevin Grittner wrote:
> On Wed, Mar 30, 2016 at 2:22 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:

> > I said that we should change BufferGetPage into having the snapshot
> > check built-in, except in the cases where a flag is passed; and the flag
> > would be passed in all cases except those 30-something you identified.
> > In other words, the behavior in all the current callsites would be
> > identical to what's there today; we could have a macro do the first
> > check so that we don't introduce the overhead of a function call in the
> > 450 cases where it's not needed.
> 
> In many of the places that BufferGetPage is called there is not a
> snapshot available.  I assume that you would be OK with an Assert
> that the flag was passed if the snapshot is NULL?

Sure, that's fine.

BTW I said "a macro" but I was forgetting that we have static inline
functions in headers now, which means you can avoid the horrors of
actually writing a macro.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: snapshot too old, configured by time

From
Kevin Grittner
Date:
On Thu, Mar 24, 2016 at 2:24 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

> I have been looking at 4a, the test module, and things are looking
> good IMO. Something that I think would be adapted would be to define
> the options for isolation tests in a variable, like ISOLATION_OPTS to
> allow MSVC scripts to fetch those option values more easily.

Maybe, but that seems like material for a separate patch.

> +submake-test_decoding:
> +   $(MAKE) -C $(top_builddir)/src/test/modules/snapshot_too_old
> The target name here is incorrect. This should refer to snapshot_too_old.

Good catch.  Fixed.

So far, pending the resolution of the suggestion to add three new
parameters to BufferGetPage in 450 places that otherwise don't need
to be touched, this is the only change from the flurry of recent
testing and review, so I'm holding off on posting a new patch just
for this.

> Regarding the main patch:
> +        <primary><varname>old_snapshot_threshold</> configuration
> parameter</primary>
> snapshot_valid_limit?

There have already been responses supporting
old_snapshot_threshold, so I would need to hear a few more votes to
consider a change.

>     page = BufferGetPage(buf);
> +   TestForOldSnapshot(scan->xs_snapshot, rel, page);
> This is a sequence repeated many times in this patch, a new routine,
> say BufferGetPageExtended with a uint8 flag, one flag being used to
> test old snapshots would be more adapted. But this would require
> creating a header dependency between the buffer manager and
> SnapshotData.. Or more simply we may want a common code path when
> fetching a page that a backend is going to use to fetch tuples. I am
> afraid of TestForOldSnapshot() being something that could be easily
> forgotten in code paths introduced in future patches...

Let's keep that discussion on the other branch of this thread.

> +   if (whenTaken < 0)
> +   {
> +       elog(DEBUG1,
> +            "MaintainOldSnapshotTimeMapping called with negative
> whenTaken = %ld",
> +            (long) whenTaken);
> +       return;
> +   }
> +   if (!TransactionIdIsNormal(xmin))
> +   {
> +       elog(DEBUG1,
> +            "MaintainOldSnapshotTimeMapping called with xmin = %lu",
> +            (unsigned long) xmin);
> +       return;
> +   }
> Material for two assertions?

You omitted the immediately preceding comment block:

+    /*
+     * We don't want to do something stupid with unusual values, but we don't
+     * want to litter the log with warnings or break otherwise normal
+     * processing for this feature; so if something seems unreasonable, just
+     * log at DEBUG level and return without doing anything.
+     */

I'm not clear that more drastic action is a good idea, since the
"fallback" is existing behavior.  I fear that doing something more
aggressive might force other logic to become more precise about
aggressive clean-up, adding overhead beyond the value gained.
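
As a toy illustration of that "log at DEBUG and fall back" guard style
(the function name and boolean return here are hypothetical
simplifications of the quoted code; FirstNormalTransactionId is 3 in
PostgreSQL):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define FirstNormalTransactionId ((TransactionId) 3)
#define TransactionIdIsNormal(xid) ((xid) >= FirstNormalTransactionId)

/*
 * Sketch of the soft-failure guard style: unreasonable inputs do not
 * abort the backend or raise a user-visible error; the mapping update
 * is simply skipped, leaving the pre-patch (fallback) behavior intact.
 */
static bool
maintain_old_snapshot_time_mapping(int64_t whenTaken, TransactionId xmin)
{
    if (whenTaken < 0)
        return false;           /* real code: elog(DEBUG1, ...); return; */
    if (!TransactionIdIsNormal(xmin))
        return false;           /* real code: elog(DEBUG1, ...); return; */
    /* ... record (whenTaken, xmin) in the time/xmin map here ... */
    return true;
}
```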

Thanks for the review!

-- 
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: snapshot too old, configured by time

From
Michael Paquier
Date:
On Thu, Mar 31, 2016 at 5:09 AM, Kevin Grittner <kgrittn@gmail.com> wrote:
> On Wed, Mar 30, 2016 at 2:22 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> I understand the invasiveness argument, but to me the danger of
>> introducing new bugs trumps that.  The problem is not the current code,
>> but future patches: it is just too easy to make the mistake of not
>> checking the snapshot in new additions of BufferGetPage.  And you said
>> that the result of missing a check is silent wrong results from queries
>> that should instead be cancelled, which seems fairly bad to me.
>
> Fair point.

That's my main concern after going through the patch, and the patch
as written does not do much to help future developers.  This could
easily be forgotten by committers as well.

>> I said that we should change BufferGetPage into having the snapshot
>> check built-in, except in the cases where a flag is passed; and the flag
>> would be passed in all cases except those 30-something you identified.
>> In other words, the behavior in all the current callsites would be
>> identical to what's there today; we could have a macro do the first
>> check so that we don't introduce the overhead of a function call in the
>> 450 cases where it's not needed.
>
> In many of the places that BufferGetPage is called there is not a
> snapshot available.  I assume that you would be OK with an Assert
> that the flag was passed if the snapshot is NULL?  I had been
> picturing what you were requesting as just adding a snapshot
> parameter and assuming that NULL meant not to check; adding two
> parameters where the flag explicitly calls out that the check is not
> needed might do more to prevent accidents, but I do wonder how much
> it would help during copy/paste frenzy.  Touching all spots to use
> the new function signature would be a mechanical job with the
> compiler catching any errors, so it doesn't seem crazy to refactor
> that now, but I would like to hear what some others think about
> this.

That's better than the existing patch, for sure.  When calling
BufferGetPage() one could be tempted to forget to set the snapshot to NULL,
though. It should be clearly documented in the header of
BufferGetPage() where and for which purpose a snapshot should be set,
and in which code paths it is expected to be used. In our case, that's
mainly when a page is fetched from shared buffers and that it is used
for reading tuples from it.

Just a note: I began looking at the tests, but finished looking at the
patch entirely at the end by curiosity. Regarding the integration of
this patch for 9.6, I think that bumping that to 9.7 would be wiser
because the patch needs to be re-written largely, and that's never a
good sign at this point of the development cycle.
-- 
Michael



Re: snapshot too old, configured by time

From
Alvaro Herrera
Date:
Michael Paquier wrote:

> Just a note: I began looking at the tests, but finished looking at the
> patch entirely at the end by curiosity. Regarding the integration of
> this patch for 9.6, I think that bumping that to 9.7 would be wiser
> because the patch needs to be re-written largely, and that's never a
> good sign at this point of the development cycle.

Not rewritten, surely?  It will need a very large mechanical change to
existing BufferGetPage calls, but to me that doesn't equate "rewriting"
it.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: snapshot too old, configured by time

From
Kevin Grittner
Date:
On Wed, Mar 30, 2016 at 9:19 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Michael Paquier wrote:
>
>> Just a note: I began looking at the tests, but finished looking at the
>> patch entirely at the end by curiosity. Regarding the integration of
>> this patch for 9.6, I think that bumping that to 9.7 would be wiser
>> because the patch needs to be re-written largely, and that's never a
>> good sign at this point of the development cycle.
>
> Not rewritten, surely?  It will need a very large mechanical change to
> existing BufferGetPage calls, but to me that doesn't equate "rewriting"
> it.

I'll submit patches later today to make the mechanical change to
the nearly 500 BufferGetPage() calls and to tweak to the 36 places
to use the new "test" flag with the new signature rather than
adding a line for the test.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: snapshot too old, configured by time

From
Kevin Grittner
Date:
On Wed, Mar 30, 2016 at 3:26 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Kevin Grittner wrote:
>> On Wed, Mar 30, 2016 at 2:22 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
>> > I said that we should change BufferGetPage into having the snapshot
>> > check built-in, except in the cases where a flag is passed; and the flag
>> > would be passed in all cases except those 30-something you identified.
>> > In other words, the behavior in all the current callsites would be
>> > identical to what's there today; we could have a macro do the first
>> > check so that we don't introduce the overhead of a function call in the
>> > 450 cases where it's not needed.
>>
>> In many of the places that BufferGetPage is called there is not a
>> snapshot available.  I assume that you would be OK with an Assert
>> that the flag was passed if the snapshot is NULL?
>
> Sure, that's fine.
>
> BTW I said "a macro" but I was forgetting that we have static inline
> functions in headers now, which means you can avoid the horrors of
> actually writing a macro.

Attached is what I think you're talking about for the first patch.
AFAICS this should generate identical executable code to unpatched.
Then the patch to actually implement the feature would, instead
of adding 30-some lines with TestForOldSnapshot() would implement
that as the behavior for the other enum value, and alter those
30-some BufferGetPage() calls.

Álvaro and Michael, is this what you were looking for?

Is everyone else OK with this approach?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: snapshot too old, configured by time

From
Alvaro Herrera
Date:
Kevin Grittner wrote:

> Attached is what I think you're talking about for the first patch.
> AFAICS this should generate identical executable code to unpatched.
> Then the patch to actually implement the feature would, instead
> of adding 30-some lines with TestForOldSnapshot() would implement
> that as the behavior for the other enum value, and alter those
> 30-some BufferGetPage() calls.
> 
> Álvaro and Michael, is this what you were looking for?

Yes, this is what I was thinking, thanks.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: snapshot too old, configured by time

From
Michael Paquier
Date:
On Fri, Apr 1, 2016 at 11:45 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Kevin Grittner wrote:
>
>> Attached is what I think you're talking about for the first patch.
>> AFAICS this should generate identical executable code to unpatched.
>> Then the patch to actually implement the feature would, instead
>> of adding 30-some lines with TestForOldSnapshot() would implement
>> that as the behavior for the other enum value, and alter those
>> 30-some BufferGetPage() calls.
>>
>> Álvaro and Michael, is this what you were looking for?
>
> Yes, this is what I was thinking, thanks.

A small thing:
$ git diff master --check
src/include/storage/bufmgr.h:181: trailing whitespace.
+#define BufferGetPage(buffer, snapshot, relation, agetest)
((Page)BufferGetBlock(buffer))

-   Page        page = BufferGetPage(buf);
+   Page        page = BufferGetPage(buf, NULL, NULL, BGP_NO_SNAPSHOT_TEST);
Having a BufferGetPageExtended() with some flags and a default
corresponding to NO_SNAPSHOT_TEST would reduce the diff impact. And as
long as the check is integrated with BufferGetPage[Extended]() I would
not complain, the patch as proposed being 174kB...
--
Michael



Re: snapshot too old, configured by time

From
Kevin Grittner
Date:
On Sat, Apr 2, 2016 at 7:12 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, Apr 1, 2016 at 11:45 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> Kevin Grittner wrote:
>>
>>> Attached is what I think you're talking about for the first patch.
>>> AFAICS this should generate identical executable code to unpatched.
>>> Then the patch to actually implement the feature would, instead
>>> of adding 30-some lines with TestForOldSnapshot() would implement
>>> that as the behavior for the other enum value, and alter those
>>> 30-some BufferGetPage() calls.
>>>
>>> Álvaro and Michael, is this what you were looking for?
>>
>> Yes, this is what I was thinking, thanks.
>
> A small thing:
> $ git diff master --check
> src/include/storage/bufmgr.h:181: trailing whitespace.
> +#define BufferGetPage(buffer, snapshot, relation, agetest)
> ((Page)BufferGetBlock(buffer))
>
> -   Page        page = BufferGetPage(buf);
> +   Page        page = BufferGetPage(buf, NULL, NULL, BGP_NO_SNAPSHOT_TEST);
> Having a BufferGetPageExtended() with some flags and a default
> corresponding to NO_SNAPSHOT_TEST would reduce the diff impact. And as
> long as the check is integrated with BufferGetPage[Extended]() I would
> not complain, the patch as proposed being 174kB...

If you are saying that the 450 places that don't need the check
would remain unchanged, and the only difference would be to use
BufferGetPageExtended() instead of BufferGetPage() followed by
TestForOldSnapshot() in the 30-some places that need the check, I
don't see the point.  That would eliminate the "forced choice"
aspect of what Álvaro is asking for, and it seems to me that it
would do next to nothing to prevent the errors of omission that are
the concern here.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: snapshot too old, configured by time

From
Jeff Janes
Date:
On Wed, Mar 30, 2016 at 12:34 PM, Kevin Grittner <kgrittn@gmail.com> wrote:
> On Sat, Mar 19, 2016 at 1:27 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>
>> I'm not sure if this is operating as expected.
>>
>> I set the value to 1min.
>>
>> I set up a test like this:
>>
>> pgbench -i
>>
>> pgbench -c4 -j4 -T 3600 &
>>
>> ### watch the size of branches table
>> while (true) ; do psql -c "\dt+" | fgrep _branches; sleep 10; done &
>>
>> ### set up a long lived snapshot.
>> psql -c 'begin; set transaction isolation level repeatable read;
>> select sum(bbalance) from pgbench_branches; select pg_sleep(300);
>> select sum(bbalance) from pgbench_branches;'
>>
>> As this runs, I can see the size of the pgbench_branches bloating once
>> the snapshot is taken, and continues bloating at a linear rate for the
>> full 300 seconds.
>>
>> Once the 300 second pg_sleep is up, the long-lived snapshot holder
>> receives an error once it tries to access the table again, and then
>> the bloat stops increasing.  But shouldn't the bloat have stopped
>> increasing as soon as the snapshot became doomed, which would be after
>> a minute or so?
>
> This is actually operating as intended, not a bug.  Try running a
> manual VACUUM command about two minutes after the snapshot is taken
> and you should get a handle on what's going on.  The old tuples
> become eligible for vacuuming after one minute, but that doesn't
> necessarily mean that autovacuum jumps in and that the space starts
> getting reused.

I can verify that a manual vacuum does stop the bloat from continuing
to increase.  But I don't see why autovacuum is not already stopping
the bloat.  It is running often enough that it really ought to do so
(as verified by setting log_autovacuum_min_duration = 0 and looking in
the log files to see that it is vacuuming the table once per nap-time,
although it is not accomplishing much by doing so as no tuples can be
removed.)

Also, HOT-cleanup should stop the bloat increase once the snapshot
crosses the old_snapshot_threshold without even needing to wait until
the next autovac runs.

Does the code intentionally only work for manual vacuums?  If so, that
seems quite surprising.  Or perhaps I am missing something else here.

Thanks,

Jeff



Re: snapshot too old, configured by time

From
Peter Geoghegan
Date:
On Wed, Mar 30, 2016 at 12:46 PM, Kevin Grittner <kgrittn@gmail.com> wrote:
> On Wed, Mar 30, 2016 at 2:29 PM, Peter Geoghegan <pg@heroku.com> wrote:
>
>> [Does the patch allow dangling page pointers?]
>
>> Again, I don't want to prejudice anyone against your patch, which I
>> haven't read.
>
> I don't believe that the way the patch does its business opens any
> new vulnerabilities of this type.  If you see such after looking at
> the patch, let me know.

Okay, let me be more concrete about this. The patch does this:

> --- a/src/backend/access/heap/pruneheap.c
> +++ b/src/backend/access/heap/pruneheap.c
> @@ -92,12 +92,21 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
>      * need to use the horizon that includes slots, otherwise the data-only
>      * horizon can be used. Note that the toast relation of user defined
>      * relations are *not* considered catalog relations.
> +    *
> +    * It is OK to apply the old snapshot limit before acquiring the cleanup
> +    * lock because the worst that can happen is that we are not quite as
> +    * aggressive about the cleanup (by however many transaction IDs are
> +    * consumed between this point and acquiring the lock).  This allows us to
> +    * save significant overhead in the case where the page is found not to be
> +    * prunable.
>      */
>     if (IsCatalogRelation(relation) ||
>         RelationIsAccessibleInLogicalDecoding(relation))
>         OldestXmin = RecentGlobalXmin;
>     else
> -       OldestXmin = RecentGlobalDataXmin;
> +       OldestXmin =
> +               TransactionIdLimitedForOldSnapshots(RecentGlobalDataXmin,
> +                                                   relation);

This new intermediary function TransactionIdLimitedForOldSnapshots()
is called to decide what OldestXmin actually gets to be above, based
in part on the new GUC old_snapshot_threshold:

> +/*
> + * TransactionIdLimitedForOldSnapshots
> + *
> + * Apply old snapshot limit, if any.  This is intended to be called for page
> + * pruning and table vacuuming, to allow old_snapshot_threshold to override
> + * the normal global xmin value.  Actual testing for snapshot too old will be
> + * based on whether a snapshot timestamp is prior to the threshold timestamp
> + * set in this function.
> + */
> +TransactionId
> +TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
> +                                   Relation relation)

It might not be RecentGlobalDataXmin that is usually returned as
OldestXmin as it is today, which is exactly the point of this patch:
VACUUM can be more aggressive in cleaning up bloat, not unlike the
non-catalog logical decoding case, on the theory that we can reliably
detect when that causes failures for old snapshots, and just raise a
"snapshot too old" error. (RecentGlobalDataXmin is morally about the
same as RecentGlobalXmin, as far as this patch goes).

So far, so good. It's okay that _bt_page_recyclable() never got the
memo about any of this...:

/*
 *  _bt_page_recyclable() -- Is an existing page recyclable?
 *
 * This exists to make sure _bt_getbuf and btvacuumscan have the same
 * policy about whether a page is safe to re-use.
 */
bool
_bt_page_recyclable(Page page)
{
    BTPageOpaque opaque;
    ...
    /*
     * Otherwise, recycle if deleted and too old to have any processes
     * interested in it.
     */
    opaque = (BTPageOpaque) PageGetSpecialPointer(page);
    if (P_ISDELETED(opaque) &&
        TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin))
        return true;
    return false;
}

...because this patch does nothing to advance RecentGlobalXmin (or
RecentGlobalDataXmin) itself more aggressively. It does make
vacuum_set_xid_limits() get a more aggressive cutoff point, but we
don't see that being passed back down by lazy vacuum here; within
_bt_page_recyclable(), we rely on the more conservative
RecentGlobalXmin, which is not subject to any clever optimization in
the patch.
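
For reference, the conservative test that _bt_page_recyclable() leans
on is the circular xid comparison; here is a self-contained sketch of
its core logic (simplified from what transam.c does; the helper name
is mine, not PostgreSQL's):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define FirstNormalTransactionId ((TransactionId) 3)
#define TransactionIdIsNormal(xid) ((xid) >= FirstNormalTransactionId)

/*
 * Circular comparison of 32-bit xids: "id1 precedes id2" means id1 is
 * older modulo 2^32, so the test keeps working across xid wraparound.
 * Permanent (non-normal) xids compare by plain magnitude.
 */
static bool
xid_precedes(TransactionId id1, TransactionId id2)
{
    if (!TransactionIdIsNormal(id1) || !TransactionIdIsNormal(id2))
        return id1 < id2;
    return (int32_t) (id1 - id2) < 0;
}
```

This is why simply comparing btpo.xact against a more aggressively
advanced cutoff would be the natural lever for earlier recycling, but
the patch leaves RecentGlobalXmin alone here.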

Fortunately, this seems correct, since index scans will always succeed
in finding a deleted page, per nbtree README notes on
RecentGlobalXmin. Unfortunately, this does stop recycling from
happening early for B-Tree pages, even though that's probably safe in
principle. This is probably not so bad -- it just needs to be
considered when reviewing this patch (the same is true of logical
decoding's RecentGlobalDataXmin; it also doesn't appear in
_bt_page_recyclable(), and I guess that that was never a problem).
Index relations will not get smaller in some important cases, but they
will be made less bloated by VACUUM in a sense that's still probably
very useful. Maybe that explains some of what Jeff talked about.

I think another part of the problems that Jeff mentioned (with
pruning) could be this existing code within heap_hot_search_buffer():
        /*
         * If we can't see it, maybe no one else can either.  At caller
         * request, check whether all chain members are dead to all
         * transactions.
         */
        if (all_dead && *all_dead &&
            !HeapTupleIsSurelyDead(heapTuple, RecentGlobalXmin))
            *all_dead = false;
 

This is used within routines like btgettuple(), to do the LP_DEAD
thing to kill index tuples (not HOT chain pruning).

Aside: Not sure offhand why it might be okay, performance-wise, that
this code doesn't care about RecentGlobalDataXmin; pruning was a big
part of why RecentGlobalDataXmin was added for logical decoding, I
thought, although I guess the _bt_killitems() stuff doesn't count as
pruning.

-- 
Peter Geoghegan



Re: snapshot too old, configured by time

From
Peter Geoghegan
Date:
On Sun, Apr 3, 2016 at 2:09 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> Also, HOT-cleanup should stop the bloat increase once the snapshot
> crosses the old_snapshot_threshold without even needing to wait until
> the next autovac runs.
>
> Does the code intentionally only work for manual vacuums?  If so, that
> seems quite surprising.  Or perhaps I am missing something else here.

What proportion of the statements in your simulated workload were
updates? Per my last mail to this thread, I'm interested in knowing if
this was a delete heavy workload.

-- 
Peter Geoghegan



Re: snapshot too old, configured by time

From
Jeff Janes
Date:
On Mon, Apr 4, 2016 at 8:38 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Sun, Apr 3, 2016 at 2:09 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> Also, HOT-cleanup should stop the bloat increase once the snapshot
>> crosses the old_snapshot_threshold without even needing to wait until
>> the next autovac runs.
>>
>> Does the code intentionally only work for manual vacuums?  If so, that
>> seems quite surprising.  Or perhaps I am missing something else here.
>
> What proportion of the statements in your simulated workload were
> updates? Per my last mail to this thread, I'm interested in knowing if
> this was a delete heavy workload.

It was pgbench's built-in TPC-B-like script, so 3 UPDATEs, 1 SELECT,
and 1 INSERT per transaction.

So I would say that it is ridiculously update heavy compared to almost
any real-world use patterns.

That is the active workload.  The long-term snapshot holder just does
a sum(abalance) at the repeatable read level in order to force a
snapshot to be taken and held, and then goes idle for a long time.

Cheers,

Jeff



Re: snapshot too old, configured by time

From
Kevin Grittner
Date:
On Sun, Apr 3, 2016 at 4:09 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Wed, Mar 30, 2016 at 12:34 PM, Kevin Grittner <kgrittn@gmail.com> wrote:
>> On Sat, Mar 19, 2016 at 1:27 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

>>> I set the value to 1min.
>>>
>>> I set up a test like this:
>>>
>>> pgbench -i
>>>
>>> pgbench -c4 -j4 -T 3600 &
>>>
>>> ### watch the size of branches table
>>> while (true) ; do psql -c "\dt+" | fgrep _branches; sleep 10; done &
>>>
>>> ### set up a long lived snapshot.
>>> psql -c 'begin; set transaction isolation level repeatable read;
>>> select sum(bbalance) from pgbench_branches; select pg_sleep(300);
>>> select sum(bbalance) from pgbench_branches;'
>>>
>>> As this runs, I can see the size of the pgbench_branches bloating once
>>> the snapshot is taken, and continues bloating at a linear rate for the
>>> full 300 seconds.

I'm not seeing that on my i7 box.

>>> Once the 300 second pg_sleep is up, the long-lived snapshot holder
>>> receives an error once it tries to access the table again, and then
>>> the bloat stops increasing.  But shouldn't the bloat have stopped
>>> increasing as soon as the snapshot became doomed, which would be after
>>> a minute or so?

It will, limited by how well your autovacuum can keep up with the
workload on your system.  See attached graph.  I ran 5 times each
in 3 configurations and graphed the median and average of each.

master: development checkout from today, no config changes

patch: patch with no config changes except:
  old_snapshot_threshold = '1min'

patch + av: patch with these config changes:
  old_snapshot_threshold = '1min'
  autovacuum_max_workers = 8
  autovacuum_vacuum_cost_limit = 2000
  autovacuum_naptime = '10s'
  autovacuum_work_mem = '1GB'

As expected, differences are minimal at first, then the patch
starts to win, and wins even better with more aggressive
autovacuum.

> I can verify that a manual vacuum does stop the bloat from continuing
> to increase.  But I don't see why autovacuum is not already stopping
> the bloat.  It is running often enough that it really ought to do so
> (as verified by setting log_autovacuum_min_duration = 0 and looking in
> the log files to see that it is vacuuming the table once per nap-time,
> although it is not accomplishing much by doing so as no tuples can be
> removed.)

Perhaps the CPUs I have or the way I have my machine tuned allows
autovacuum to be more effective in the face of the pgbench load
than on yours?

> Also, HOT-cleanup should stop the bloat increase once the snapshot
> crosses the old_snapshot_threshold without even needing to wait until
> the next autovac runs.

It should help some, but you really need a vacuum in there to take
care of things thoroughly.

> Does the code intentionally only work for manual vacuums?  If so, that
> seems quite surprising.  Or perhaps I am missing something else here.

Perhaps it is that VACUUM tries harder to get the work done, while
autovacuum steps out of the way when it detects that it is blocking
something.  This is a pretty small table (it never gets to 1MB even
when bloating) with multiple clients pounding on it.  It might just
be that your system doesn't allow much autovacuum activity before
it finds itself blocking a pgbench process.

FWIW, our customer's 30-day test runs were on databases of hundreds
of GB and showed similar benefits -- linear growth indefinitely
without the patch, settling in to a pretty steady state after a few
hours with the patch.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: snapshot too old, configured by time

From
Kevin Grittner
Date:
On Mon, Apr 4, 2016 at 9:15 PM, Peter Geoghegan <pg@heroku.com> wrote:

> The patch does this:
>
>> --- a/src/backend/access/heap/pruneheap.c
>> +++ b/src/backend/access/heap/pruneheap.c
>> @@ -92,12 +92,21 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
>>      * need to use the horizon that includes slots, otherwise the data-only
>>      * horizon can be used. Note that the toast relation of user defined
>>      * relations are *not* considered catalog relations.
>> +    *
>> +    * It is OK to apply the old snapshot limit before acquiring the cleanup
>> +    * lock because the worst that can happen is that we are not quite as
>> +    * aggressive about the cleanup (by however many transaction IDs are
>> +    * consumed between this point and acquiring the lock).  This allows us to
>> +    * save significant overhead in the case where the page is found not to be
>> +    * prunable.
>>      */
>>     if (IsCatalogRelation(relation) ||
>>         RelationIsAccessibleInLogicalDecoding(relation))
>>         OldestXmin = RecentGlobalXmin;
>>     else
>> -       OldestXmin = RecentGlobalDataXmin;
>> +       OldestXmin =
>> +               TransactionIdLimitedForOldSnapshots(RecentGlobalDataXmin,
>> +                                                   relation);
>
> This new intermediary function TransactionIdLimitedForOldSnapshots()
> is called to decide what OldestXmin actually gets to be above, based
> in part on the new GUC old_snapshot_threshold:
>
>> +/*
>> + * TransactionIdLimitedForOldSnapshots
>> + *
>> + * Apply old snapshot limit, if any.  This is intended to be called for page
>> + * pruning and table vacuuming, to allow old_snapshot_threshold to override
>> + * the normal global xmin value.  Actual testing for snapshot too old will be
>> + * based on whether a snapshot timestamp is prior to the threshold timestamp
>> + * set in this function.
>> + */
>> +TransactionId
>> +TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
>> +                                   Relation relation)
>
> It might not be RecentGlobalDataXmin that is usually returned as
> OldestXmin as it is today, which is exactly the point of this patch:

Right.

> VACUUM can be more aggressive in cleaning up bloat, [...], on the
> theory that we can reliably detect when that causes failures for
> old snapshots, and just raise a "snapshot too old" error.

Right.
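(For readers following the thread: the limiting rule under discussion can be modeled in a few lines. This is a hedged, standalone sketch of the idea only; the function name, types, and the simple timestamp arithmetic are stand-ins, not the committed TransactionIdLimitedForOldSnapshots code.)

```c
#include <time.h>

typedef unsigned int TransactionId;

/*
 * Model of the limiting rule: once the oldest registered snapshot is
 * older than old_snapshot_threshold, vacuum and pruning may use a more
 * aggressive xmin.  Any later read through that snapshot of data cleaned
 * up under the raised xmin must then fail with "snapshot too old"
 * (SQLSTATE 72000).
 */
TransactionId
limited_xmin_for_old_snapshots(TransactionId recent_xmin,
                               TransactionId latest_completed_xid,
                               time_t oldest_snapshot_ts,
                               time_t now,
                               int threshold_minutes)
{
    /* threshold of -1 disables the feature (the default) */
    if (threshold_minutes < 0)
        return recent_xmin;

    /* snapshot has crossed the age threshold: stop honoring it */
    if (now - oldest_snapshot_ts >= (time_t) threshold_minutes * 60)
        return latest_completed_xid;

    return recent_xmin;
}
```

With the threshold disabled the normal xmin is always returned, so cleanup behavior is unchanged; with it enabled, cleanup can proceed past snapshots that have aged out.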

> ...because this patch does nothing to advance RecentGlobalXmin (or
> RecentGlobalDataXmin) itself more aggressively. It does make
> vacuum_set_xid_limits() get a more aggressive cutoff point, but we
> don't see that being passed back down by lazy vacuum here; within
> _bt_page_recyclable(), we rely on the more conservative
> RecentGlobalXmin, which is not subject to any clever optimization in
> the patch.

Right.

> Fortunately, this seems correct, since index scans will always succeed
> in finding a deleted page, per nbtree README notes on
> RecentGlobalXmin.

Right.

> Unfortunately, this does stop recycling from
> happening early for B-Tree pages, even though that's probably safe in
> principle. This is probably not so bad -- it just needs to be
> considered when reviewing this patch (the same is true of logical
> decoding's RecentGlobalDataXmin; it also doesn't appear in
> _bt_page_recyclable(), and I guess that that was never a problem).
> Index relations will not get smaller in some important cases, but they
> will be made less bloated by VACUUM in a sense that's still probably
> very useful.

As I see it, if the long-running transaction(s) have written (and
thus acquired transaction IDs), we can't safely advance the global
Xmin values until they complete.  If the long-running transactions
with the old snapshots don't have transaction IDs, the bloat will
be contained.

> Maybe that explains some of what Jeff talked about.

I think he just didn't have autovacuum configured to where it was
being very effective on the tiny tables involved.  See my reply to
Jeff and the graphs from running his test on my system.  I don't
think the lines could get much more flat than what I'm seeing with
the patch.

It may be that this general approach could be made more aggressive
and effective by pushing things in the direction you suggest, but
we are far past the time to consider that sort of change for 9.6.
This patch has been through several rounds of 30-day testing; a
change like you propose would require that those tests be redone.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: snapshot too old, configured by time

From
Peter Geoghegan
Date:
On Thu, Apr 7, 2016 at 3:56 PM, Kevin Grittner <kgrittn@gmail.com> wrote:
>> Unfortunately, this does stop recycling from
>> happening early for B-Tree pages, even though that's probably safe in
>> principle. This is probably not so bad -- it just needs to be
>> considered when reviewing this patch (the same is true of logical
>> decoding's RecentGlobalDataXmin; it also doesn't appear in
>> _bt_page_recyclable(), and I guess that that was never a problem).
>> Index relations will not get smaller in some important cases, but they
>> will be made less bloated by VACUUM in a sense that's still probably
>> very useful.
>
> As I see it, if the long-running transaction(s) have written (and
> thus acquired transaction IDs), we can't safely advance the global
> Xmin values until they complete.  If the long-running transactions
> with the old snapshots don't have transaction IDs, the bloat will
> be contained.

I'm not really that concerned about it. I'm mostly just explaining my
thought process.

>> Maybe that explains some of what Jeff talked about.
>
> I think he just didn't have autovacuum configured to where it was
> being very effective on the tiny tables involved.  See my reply to
> Jeff and the graphs from running his test on my system.  I don't
> think the lines could get much more flat than what I'm seeing with
> the patch.
>
> It may be that this general approach could be made more aggressive
> and effective by pushing things in the direction you suggest, but
> we are far past the time to consider that sort of change for 9.6.
> This patch has been through several rounds of 30-day testing; a
> change like you propose would require that those tests be redone.

I think that there is a good argument in favor of this patch that you
may have failed to make yourself, which is: it limits bloat in a way
that's analogous to how RecentGlobalDataXmin can do so for logical
decoding (i.e. where wal_level = logical, and RecentGlobalXmin and
RecentGlobalDataXmin could actually differ). Therefore, it benefits to
a significant degree from the testing that Andres did to make sure
logical decoding doesn't cause excessive bloat when RecentGlobalXmin
is pinned to make historic MVCC catalog snapshots work (he did so at
my insistence at the time; pruning turned out to be very important for
many common workloads, and Andres got that right). I can't really
imagine a way that what you have here could be any less effective than
what Andres did for logical decoding. This is reassuring, since that
mechanism has to be pretty well battle-hardened by now.

-- 
Peter Geoghegan



Re: snapshot too old, configured by time

From
Kevin Grittner
Date:
On Thu, Apr 7, 2016 at 6:12 PM, Peter Geoghegan <pg@heroku.com> wrote:

> I think that there is a good argument in favor of this patch that you
> may have failed to make yourself, which is: it limits bloat in a way
> that's analogous to how RecentGlobalDataXmin can do so for logical
> decoding (i.e. where wal_level = logical, and RecentGlobalXmin and
> RecentGlobalDataXmin could actually differ). Therefore, it benefits to
> a significant degree from the testing that Andres did to make sure
> logical decoding doesn't cause excessive bloat when RecentGlobalXmin
> is pinned to make historic MVCC catalog snapshots work (he did so at
> my insistence at the time; pruning turned out to be very important for
> many common workloads, and Andres got that right). I can't really
> imagine a way that what you have here could be any less effective than
> what Andres did for logical decoding. This is reassuring, since that
> mechanism has to be pretty well battle-hardened by now.

Interesting.  I had not noticed that relationship.

Anyway, pushed as two patches -- the no-op patch to create the "forced
choice" on whether to do the test at each BufferGetPage point, and the
actual feature.

Sadly, I forgot to include the reviewer information when writing the
commit messages.  :-(

-- 
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: snapshot too old, configured by time

From
Peter Geoghegan
Date:
On Fri, Apr 8, 2016 at 12:55 PM, Kevin Grittner <kgrittn@gmail.com> wrote:
> Sadly, I forgot to include the reviewer information when writing the
> commit messages.  :-(

Oh well. I'm just glad we got the patch over the line. I think that
there are some types of users that will very significantly benefit
from this patch.

I am reminded of this blog post, written by a friend and former
co-worker: https://brandur.org/postgres-queues


-- 
Peter Geoghegan



Re: snapshot too old, configured by time

From
David Steele
Date:
On 4/8/16 4:30 PM, Peter Geoghegan wrote:
> On Fri, Apr 8, 2016 at 12:55 PM, Kevin Grittner <kgrittn@gmail.com> wrote:
>> Sadly, I forgot to include the reviewer information when writing the
>> commit messages.  :-(
> 
> Oh well. I'm just glad we got the patch over the line. I think that
> there are some types of users that will very significantly benefit
> from this patch.

I'm also very happy to see this go in.  While I used to dread the
"snapshot too old" error back in my O****e architect days, it's nice to
have the option, especially when it can be configured by time rather
than by size.

-- 
-David
david@pgmasters.net



Re: snapshot too old, configured by time

From
Tom Lane
Date:
Kevin Grittner <kgrittn@gmail.com> writes:
> On Wed, Mar 30, 2016 at 3:26 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> Kevin Grittner wrote:
>>> On Wed, Mar 30, 2016 at 2:22 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>> I said that we should change BufferGetPage into having the snapshot
>>> check built-in, except in the cases where a flag is passed; and the flag
>>> would be passed in all cases except those 30-something you identified.
>>> In other words, the behavior in all the current callsites would be
>>> identical to what's there today; we could have a macro do the first
>>> check so that we don't introduce the overhead of a function call in the
>>> 450 cases where it's not needed.

> Attached is what I think you're talking about for the first patch.
> AFAICS this should generate identical executable code to unpatched.
> Then the patch to actually implement the feature would, instead
> of adding 30-some lines with TestForOldSnapshot(), implement
> that as the behavior for the other enum value, and alter those
> 30-some BufferGetPage() calls.

> Álvaro and Michael, is this what you were looking for?

> Is everyone else OK with this approach?

After struggling with back-patching a GIN bug fix, I wish to offer up the
considered opinion that this was an impressively bad idea.  It's inserted
450 or so pain points for back-patching, which we will have to deal with
for the next five years.  Moreover, I do not believe that it will do a
damn thing for ensuring that future calls of BufferGetPage think about
what to do; they'll most likely be copied-and-pasted from nearby calls,
just as people have always done.  With luck, the nearby calls will have
the right semantics, but this change won't help very much at all if they
don't.

I think we should revert BufferGetPage to be what it was before (with
no snapshot test) and invent BufferGetPageExtended or similar to be
used in the small number of places that need a snapshot test.
        regards, tom lane



Re: snapshot too old, configured by time

From
Kevin Grittner
Date:
On Sun, Apr 17, 2016 at 5:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Kevin Grittner <kgrittn@gmail.com> writes:
>> On Wed, Mar 30, 2016 at 3:26 PM, Alvaro Herrera
>> <alvherre@2ndquadrant.com> wrote:
>>> Kevin Grittner wrote:
>>>> On Wed, Mar 30, 2016 at 2:22 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>>> I said that we should change BufferGetPage into having the snapshot
>>>> check built-in, except in the cases where a flag is passed; and the flag
>>>> would be passed in all cases except those 30-something you identified.
>>>> In other words, the behavior in all the current callsites would be
>>>> identical to what's there today; we could have a macro do the first
>>>> check so that we don't introduce the overhead of a function call in the
>>>> 450 cases where it's not needed.
>
>> Attached is what I think you're talking about for the first patch.
>> AFAICS this should generate identical executable code to unpatched.
>> Then the patch to actually implement the feature would, instead
>> of adding 30-some lines with TestForOldSnapshot(), implement
>> that as the behavior for the other enum value, and alter those
>> 30-some BufferGetPage() calls.
>
>> Álvaro and Michael, is this what you were looking for?
>
>> Is everyone else OK with this approach?
>
> After struggling with back-patching a GIN bug fix, I wish to offer up the
> considered opinion that this was an impressively bad idea.  It's inserted
> 450 or so pain points for back-patching, which we will have to deal with
> for the next five years.  Moreover, I do not believe that it will do a
> damn thing for ensuring that future calls of BufferGetPage think about
> what to do; they'll most likely be copied-and-pasted from nearby calls,
> just as people have always done.  With luck, the nearby calls will have
> the right semantics, but this change won't help very much at all if they
> don't.
>
> I think we should revert BufferGetPage to be what it was before (with
> no snapshot test) and invent BufferGetPageExtended or similar to be
> used in the small number of places that need a snapshot test.

I'm not sure what BufferGetPageExtended() buys us over simply
inserting TestForOldSnapshot() where it is needed.  Other than that
question, I have no objections to the course outlined, but figure I
should not jump on it without allowing at least a couple days for
discussion.  That also may give me time to perform the benchmarks I
wanted -- VPN issues have blocked me from the big test machines so
far.  I think I see where the time may be going when the feature is
disabled, and if I'm right I have a fix; but without a big NUMA
machine there is no way to confirm it.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: snapshot too old, configured by time

From
Michael Paquier
Date:
On Mon, Apr 18, 2016 at 9:52 AM, Kevin Grittner <kgrittn@gmail.com> wrote:
> On Sun, Apr 17, 2016 at 5:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Kevin Grittner <kgrittn@gmail.com> writes:
>>> On Wed, Mar 30, 2016 at 3:26 PM, Alvaro Herrera
>>> <alvherre@2ndquadrant.com> wrote:
>>>> Kevin Grittner wrote:
>> I think we should revert BufferGetPage to be what it was before (with
>> no snapshot test) and invent BufferGetPageExtended or similar to be
>> used in the small number of places that need a snapshot test.
>
> I'm not sure what BufferGetPageExtended() buys us over simply
> inserting TestForOldSnapshot() where it is needed.  Other than that
> question, I have no objections to the course outlined, but figure I
> should not jump on it without allowing at least a couple days for
> discussion.  That also may give me time to perform the benchmarks I
> wanted -- VPN issues have blocked me from the big test machines so
> far.  I think I see where the time may be going when the feature is
> disabled, and if I'm right I have a fix; but without a big NUMA
> machine there is no way to confirm it.

TBH, BufferGetPageExtended() still looks like a good idea to me.
Back-patching those code paths is going to make maintenance far
harder, on top of forcing extensions to be recompiled for perhaps no
good reason. Even though this is a low-level change, if this feature
goes in with 9.6 it would be really good to mention that callers of
BufferGetPage should update their calls accordingly if they care about
the old-snapshot checks. This is a routine used a lot in many plugins
and extensions. Usually such low-level things are not mentioned in the
release notes, but this time I think it is really important to say it
loudly.
-- 
Michael



Re: snapshot too old, configured by time

From
Alvaro Herrera
Date:
Tom Lane wrote:

> After struggling with back-patching a GIN bug fix, I wish to offer up the
> considered opinion that this was an impressively bad idea.  It's inserted
> 450 or so pain points for back-patching, which we will have to deal with
> for the next five years.  Moreover, I do not believe that it will do a
> damn thing for ensuring that future calls of BufferGetPage think about
> what to do; they'll most likely be copied-and-pasted from nearby calls,
> just as people have always done.  With luck, the nearby calls will have
> the right semantics, but this change won't help very much at all if they
> don't.

I disagree.  A developer that sees an unadorned BufferGetPage() call
doesn't stop to think twice about whether they need to add a snapshot
test.  Many reviewers will miss the necessary addition also.  A
developer that sees BufferGetPage(NO_SNAPSHOT_TEST) will at least
consider the idea that the flag might be right; if that developer
doesn't think about it, some reviewer may notice a new call with the
flag and consider the idea that the flag may be wrong.

I understand the backpatching pain argument, but my opinion was the
contrary of yours even so.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: snapshot too old, configured by time

From
Kevin Grittner
Date:
On Sun, Apr 17, 2016 at 10:38 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Tom Lane wrote:
>
>> After struggling with back-patching a GIN bug fix, I wish to offer up the
>> considered opinion that this was an impressively bad idea.  It's inserted
>> 450 or so pain points for back-patching, which we will have to deal with
>> for the next five years.

> I understand the backpatching pain argument, but my opinion was the
> contrary of yours even so.

The other possibility would be to backpatch the no-op patch which
just uses the new syntax without any change in semantics.

I'm not arguing for that; just putting it on the table....

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: snapshot too old, configured by time

From
Tom Lane
Date:
Kevin Grittner <kgrittn@gmail.com> writes:
> On Sun, Apr 17, 2016 at 10:38 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> I understand the backpatching pain argument, but my opinion was the
>> contrary of yours even so.

> The other possibility would be to backpatch the no-op patch which
> just uses the new syntax without any change in semantics.

That would break 3rd-party extensions in a minor release, wouldn't it?
Or do I misunderstand your suggestion?
        regards, tom lane



Re: snapshot too old, configured by time

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> I disagree.  A developer that sees an unadorned BufferGetPage() call
> doesn't stop to think twice about whether they need to add a snapshot
> test.  Many reviewers will miss the necessary addition also.  A
> developer that sees BufferGetPage(NO_SNAPSHOT_TEST) will at least
> consider the idea that the flag might be right; if that developer
> doesn't think about it, some reviewer may notice a new call with the
> flag and consider the idea that the flag may be wrong.

I'm unconvinced ...

> I understand the backpatching pain argument, but my opinion was the
> contrary of yours even so.

I merely point out that the problem came up less than ten days after
that patch hit the tree.  If that does not give you pause about the
size of the back-patching problem we've just introduced, it should.

TBH, there is nothing that I like about this feature: not the underlying
concept, not the invasiveness of the implementation, nothing.  I would
dearly like to see it reverted altogether.  I do not think it is worth
the pain that the current implementation will impose, both on developers
and on potential users.  Surely there was another way to get a similar
end result without mucking with things at the level of BufferGetPage.
        regards, tom lane



Re: snapshot too old, configured by time

From
Kevin Grittner
Date:
On Mon, Apr 18, 2016 at 8:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Kevin Grittner <kgrittn@gmail.com> writes:
>> On Sun, Apr 17, 2016 at 10:38 PM, Alvaro Herrera
>> <alvherre@2ndquadrant.com> wrote:
>>> I understand the backpatching pain argument, but my opinion was the
>>> contrary of yours even so.
>
>> The other possibility would be to backpatch the no-op patch which
>> just uses the new syntax without any change in semantics.
>
> That would break 3rd-party extensions in a minor release, wouldn't it?
> Or do I misunderstand your suggestion?

With a little bit of a change to the headers I think we could avoid
that breakage.

The original no-op patch didn't change the executable code, but it
would have interfered with 3rd-party compiles; with a minor
adjustment (using a modified name for the BufferGetPage variant with
the extra parameters), we could avoid that problem.  That would seem to
address Álvaro's concern while avoiding five years of backpatch
nightmares.

I don't claim it's an *elegant* solution, but it might be a workable compromise.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: snapshot too old, configured by time

From
Robert Haas
Date:
On Sun, Apr 17, 2016 at 6:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> After struggling with back-patching a GIN bug fix, I wish to offer up the
> considered opinion that this was an impressively bad idea.  It's inserted
> 450 or so pain points for back-patching, which we will have to deal with
> for the next five years.  Moreover, I do not believe that it will do a
> damn thing for ensuring that future calls of BufferGetPage think about
> what to do; they'll most likely be copied-and-pasted from nearby calls,
> just as people have always done.  With luck, the nearby calls will have
> the right semantics, but this change won't help very much at all if they
> don't.

I hit this problem over the weekend, too, when I tried to rebase a
patch a colleague of mine is working on.  So I tend to agree.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: snapshot too old, configured by time

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > I disagree.  A developer that sees an unadorned BufferGetPage() call
> > doesn't stop to think twice about whether they need to add a snapshot
> > test.  Many reviewers will miss the necessary addition also.  A
> > developer that sees BufferGetPage(NO_SNAPSHOT_TEST) will at least
> > consider the idea that the flag might be right; if that developer
> > doesn't think about it, some reviewer may notice a new call with the
> > flag and consider the idea that the flag may be wrong.
> 
> I'm unconvinced ...

Well, nobody opposed this when I proposed it originally.  Robert just
stated that it caused a problem for him while backpatching but didn't
state opinion on reverting that change or not.  Maybe we should call for
a vote here.

> > I understand the backpatching pain argument, but my opinion was the
> > contrary of yours even so.
> 
> I merely point out that the problem came up less than ten days after
> that patch hit the tree.  If that does not give you pause about the
> size of the back-patching problem we've just introduced, it should.

Understood.  Kevin's idea of applying a no-op syntax change is on the
table.  I don't like it either, but ISTM better than the other options
so far.

> TBH, there is nothing that I like about this feature: not the underlying
> concept, not the invasiveness of the implementation, nothing.  I would
> dearly like to see it reverted altogether.  I do not think it is worth
> the pain that the current implementation will impose, both on developers
> and on potential users.  Surely there was another way to get a similar
> end result without mucking with things at the level of BufferGetPage.

Ah well, that's a completely different angle, and perhaps we should
explore this before doing anything in the back branches.

So it seems to me we have these options

1) revert the whole feature
2) revert the BufferGetPage syntax change
3) apply a no-op syntax change so that BufferGetPage looks the same on
   backbranches as it does on master today, keeping API and ABI
   compatibility with existing code
4) do nothing

Any others?  (If we decide to call for a vote, I suggest we open a new
thread)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: snapshot too old, configured by time

From
Kevin Grittner
Date:
On Mon, Apr 18, 2016 at 8:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Surely there was another way to get a similar end result without
> mucking with things at the level of BufferGetPage.

To get the feature that some customers have been demanding, a check
has to be made somewhere near where any page is read in a scan.  It
didn't take me long in working on this to notice that grepping for
BufferGetPage() calls was a good way to find candidate spots to
insert the check (even if only 7% of BufferGetPage() calls need to
be followed by such a check) -- but the BufferGetPage() itself
clearly does *not* need to be modified to implement the feature.

We could:

(1) Add calls to a check function where needed, and just document
that addition of a BufferGetPage() call should be considered a clue
that a new check might be needed.  (original plan)

(2) Replace the 7% of the BufferGetPage() calls that need to check
the age of the snapshot with something that wraps the two function
calls, and does nothing but call one and then the other.  (favored
by Michael)

(3) Add parameters to BufferGetPage() to specify whether the check
is needed and provide sufficient information to perform the check
if it is.  (current master)

(4) Replace BufferGetPage() with some other function name having
the characteristics of (3) to minimize back-patch pain.
(grudgingly favored by Álvaro)

(5) Revert from community code, leaving it as an EDB value-add
Advanced Server feature.

Does someone see another (better) alternative?
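(As a rough illustration of how options (1) and (3) differ at call sites, here is a hedged, standalone sketch; the struct, function, and flag names, including the BGP_* enum values, are stand-ins modeled on this discussion, not the actual PostgreSQL definitions.)

```c
#include <stdbool.h>
#include <stddef.h>

typedef int Buffer;
typedef char *Page;

/* Stand-ins for the real snapshot/relation structs. */
typedef struct { bool too_old; } Snapshot;
typedef struct { int unused; } Relation;

static char page_storage[2][8192];
static bool snapshot_error_raised;      /* models ereport(ERROR, ... 72000) */

/* Option (1): a plain accessor, plus a separate check call where needed. */
static Page
buffer_get_page(Buffer buf)
{
    return page_storage[buf];
}

static void
test_for_old_snapshot(Snapshot *snap, Relation *rel)
{
    (void) rel;
    if (snap != NULL && snap->too_old)
        snapshot_error_raised = true;   /* would raise "snapshot too old" */
}

/* Option (3): every call site must state its intent via a flag. */
typedef enum
{
    BGP_NO_SNAPSHOT_TEST,
    BGP_TEST_FOR_OLD_SNAPSHOT
} BufferGetPageAgenda;

static Page
buffer_get_page_checked(Buffer buf, Snapshot *snap, Relation *rel,
                        BufferGetPageAgenda agenda)
{
    if (agenda == BGP_TEST_FOR_OLD_SNAPSHOT)
        test_for_old_snapshot(snap, rel);
    return buffer_get_page(buf);
}
```

Under option (3), a call site cannot fetch a page without stating whether the old-snapshot test applies, which is the "forced choice" mentioned upthread; under option (1), forgetting the separate check call compiles silently.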

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: snapshot too old, configured by time

From
Tom Lane
Date:
Kevin Grittner <kgrittn@gmail.com> writes:
> On Mon, Apr 18, 2016 at 8:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Surely there was another way to get a similar end result without
>> mucking with things at the level of BufferGetPage.

> To get the feature that some customers have been demanding, a check
> has to be made somewhere near where any page is read in a scan.

I'm not really convinced that we need to define it exactly like that,
though.  In particular, why not just kill transactions as soon as their
oldest snapshot is too old?  That might not work exactly like this does,
but it would have some pretty substantial benefits --- for one, that the
timeout could be configured locally per session rather than having to be
PGC_POSTMASTER.  And it would likely be far easier to limit the
performance side-effects.

I complained about this back when the feature was first discussed,
and you insisted that that answer was no good, and I figured I'd hold
my nose and look the other way as long as the final patch wasn't too
invasive.  Well, now we've seen the end result, and it's very invasive
and has got performance issues as well.  It's time to reconsider.

Or in short: this is a whole lot further than I'm prepared to go to
satisfy one customer with a badly-designed application.  And from what
I can tell from the Feb 2015 discussion, that's what this has been
written for.
        regards, tom lane



Re: snapshot too old, configured by time

From
Michael Paquier
Date:
On Tue, Apr 19, 2016 at 3:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Or in short: this is a whole lot further than I'm prepared to go to
> satisfy one customer with a badly-designed application.  And from what
> I can tell from the Feb 2015 discussion, that's what this has been
> written for.

This holds true. I imagine that a lot of people, at least on this
list, have already spent time tracking down long-running transactions
in someone's application and tuning the application so that bloat is
reduced and shorter transactions perform better, all without needing
this feature.
-- 
Michael



Re: snapshot too old, configured by time

From
Robert Haas
Date:
On Tue, Apr 19, 2016 at 1:23 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Tue, Apr 19, 2016 at 3:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Or in short: this is a whole lot further than I'm prepared to go to
>> satisfy one customer with a badly-designed application.  And from what
>> I can tell from the Feb 2015 discussion, that's what this has been
>> written for.
>
> This holds true. I imagine that a lot of people at least on this list
> have already spent some time in tracking down long-running
> transactions in someone's application and actually tuned the
> application so as the bloat gets reduced and things perform better for
> other transactions taking a shorter time. Without the need of this
> feature.

So I don't want to be too vociferous in defending a feature that (a)
was written by a colleague and (b) obviously isn't perfect, but I will
point out that:

1. There was a surprising amount of positive reaction when Kevin first
proposed this.  I expected a lot more people to say this kind of thing
at the beginning, but in fact a number of people wrote in to say
they'd really like to have this.  Those positive reactions shouldn't
be forgotten just because those people aren't wading into a discussion
about the merits of adding arguments to BufferGetPage.

2. Without this feature, you can kill sessions or transactions to
control bloat, but this feature is properly thought of as a way to
avoid bloat *without* killing sessions or transactions.  You can let
the session live, without having it generate bloat, just so long as it
doesn't try to touch any data that has been recently modified.  We
have no other feature in PostgreSQL that does something like that.

At the moment, what I see happening is that Tom, the one person who
has hated this feature since the beginning, still hates it, and we're
on the verge of asking Kevin to revert it because (1) Tom hates it and
(2) Kevin changed the BufferGetPage stuff in the way that Alvaro
requested.  I think that's not quite fair.  If we want to demand that
this feature be reverted because it causes a performance loss even
when turned off, I get that.  If we think that it's badly implemented,
fine, I get that, too.  But asking somebody to revert a patch because
the author adjusted things to match what the reviewer wanted is not
fair.  The right thing to do about that is just change it back to the
way Kevin had it originally.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: snapshot too old, configured by time

From
Kevin Grittner
Date:
On Tue, Apr 19, 2016 at 6:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:

> The right thing to do about that is just change it back to the
> way Kevin had it originally.

Since this change to BufferGetPage() has caused severe back-patch
pain for at least two committers so far, I will revert that (very
recent) change to this patch later today unless I hear any
objections.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: snapshot too old, configured by time

From
Alvaro Herrera
Date:
Kevin Grittner wrote:
> On Tue, Apr 19, 2016 at 6:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> 
> > The right thing to do about that is just change it back to the
> > way Kevin had it originally.
> 
> Since this change to BufferGetPage() has caused severe back-patch
> pain for at least two committers so far, I will revert that (very
> recent) change to this patch later today unless I hear any
> objections.

I vote for back-patching a no-op change instead, as discussed elsewhere.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: snapshot too old, configured by time

From
Andres Freund
Date:
On 2016-04-19 12:03:22 -0300, Alvaro Herrera wrote:
> Kevin Grittner wrote:
> > On Tue, Apr 19, 2016 at 6:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > 
> > > The right thing to do about that is just change it back to the
> > > way Kevin had it originally.
> > 
> > Since this change to BufferGetPage() has caused severe back-patch
> > pain for at least two committers so far, I will revert that (very
> > recent) change to this patch later today unless I hear an
> > objections.
> 
> I vote for back-patching a no-op change instead, as discussed elsewhere.

What about Tom's argument that that'd be problematic for external code?



Re: snapshot too old, configured by time

From
Robert Haas
Date:
On Tue, Apr 19, 2016 at 11:03 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Kevin Grittner wrote:
>> On Tue, Apr 19, 2016 at 6:38 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> > The right thing to do about that is just change it back to the
>> > way Kevin had it originally.
>>
>> Since this change to BufferGetPage() has caused severe back-patch
>> pain for at least two committers so far, I will revert that (very
>> recent) change to this patch later today unless I hear an
>> objections.
>
> I vote for back-patching a no-op change instead, as discussed elsewhere.

That wouldn't have fixed my problem, which involved rebasing a patch.
I really think it's also a bad precedent to back-patch things into
older branches that are not themselves bug fixes.  Users count on us
not to destabilize older branches, and that means being minimalist
about what we put into them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: snapshot too old, configured by time

From
Alvaro Herrera
Date:
Andres Freund wrote:
> On 2016-04-19 12:03:22 -0300, Alvaro Herrera wrote:

> > > Since this change to BufferGetPage() has caused severe back-patch
> > > pain for at least two committers so far, I will revert that (very
> > > recent) change to this patch later today unless I hear an
> > > objections.
> > 
> > I vote for back-patching a no-op change instead, as discussed elsewhere.
> 
> What about Tom's argument that that'd be problematic for external code?

Kevin offered to code it in a way that maintains ABI and API
compatibility with some trickery.


Robert Haas wrote:

> That wouldn't have fixed my problem, which involved rebasing a patch.

True.  I note that it's possible to munge a patch mechanically to sort
out this situation.

> I really think it's also a bad precedent to back-patch things into
> older branches that are not themselves bug fixes.  Users count on us
> not to destabilize older branches, and that means being minimalist
> about what we put into them.

Well, this wouldn't change the inner workings of the code at all, only
how it looks, so it wouldn't affect users.  I grant that it would affect
developers of forks.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: snapshot too old, configured by time

From
Kevin Grittner
Date:
On Tue, Apr 19, 2016 at 11:02 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Andres Freund wrote:
>> On 2016-04-19 12:03:22 -0300, Alvaro Herrera wrote:
>
>>>> Since this change to BufferGetPage() has caused severe back-patch
>>>> pain for at least two committers so far, I will revert that (very
>>>> recent) change to this patch later today unless I hear an
>>>> objections.
>>>
>>> I vote for back-patching a no-op change instead, as discussed elsewhere.
>>
>> What about Tom's argument that that'd be problematic for external code?
>
> Kevin offered to code it in a way that maintains ABI and API
> compatibility with some trickery.

I pointed out that it would be possible to do so, but specifically
said I wasn't arguing for that.  We would need to create a new name
for what BufferGetPage() does on master, and have that call the old
BufferGetPage() on back-branches.  That seems pretty ugly.

I tend to think that the original approach, while it puts the
burden on coders to recognize when TestForOldSnapshot() must be
called, is no more onerous than many existing issues coders must
worry about -- like whether to add something to outfuncs.c, as an
example.  I have been skeptical of the nanny approach all along,
and after seeing the impact of having it in the tree for a few
days, I really am inclined to pull back and put this on the same
footing as the other things hackers need to learn and tend to as
they code.

> Robert Haas wrote:
>
>> That wouldn't have fixed my problem, which involved rebasing a patch.
>
> True.  I note that it's possible to munge a patch mechanically to sort
> out this situation.

I admit it is possible.  I'm becoming more convinced with each post
that it's the wrong approach.  I feel like I have been in the
modern version of an Æsop fable here:

http://www.bartleby.com/17/1/62.html

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: snapshot too old, configured by time

From
Alvaro Herrera
Date:
Kevin Grittner wrote:
> On Tue, Apr 19, 2016 at 11:02 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:

> > Robert Haas wrote:
> >
> >> That wouldn't have fixed my problem, which involved rebasing a patch.
> >
> > True.  I note that it's possible to munge a patch mechanically to sort
> > out this situation.
> 
> I admit it is possible.  I'm becoming more convinced with each post
> that it's the wrong approach.  I feel like I have been in the
> modern version of an Æsop fable here:
> 
> http://www.bartleby.com/17/1/62.html

LOL.

Well, it seems I'm outvoted.  I leave you to do with your donkey as you
please.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: snapshot too old, configured by time

From
Kevin Grittner
Date:
On Tue, Apr 19, 2016 at 12:49 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

> Well, it seems I'm outvoted.

The no-op change to attempt to force an explicit choice of whether
to test for snapshot age after calling BufferGetPage() has been
reverted.  This eliminates about 500 back-patching pain points in
65 files.

In case anyone notices some code left at the bottom of bufmgr.h
related to inline functions, that was left on purpose, because I am
pretty sure that the fix for the performance regression observed
when the "snapshot too old" feature is disabled will involve making
at least part of TestForOldSnapshot() an inline function -- so it
seemed dumb to rip that out now only to put it back again right
away.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: snapshot too old, configured by time

From
Kevin Grittner
Date:
On Wed, Apr 20, 2016 at 8:50 AM, Kevin Grittner <kgrittn@gmail.com> wrote:

> In case anyone notices some code left at the bottom of bufmgr.h
> related to inline functions, that was left on purpose, because I am
> pretty sure that the fix for the performance regression observed
> when the "snapshot too old" feature is disabled will involve making
> at least part of TestForOldSnapshot() an inline function -- so it
> seemed dumb to rip that out now only to put it back again right
> away.

I pushed something along those lines.  I didn't want to inline the
whole function because IsCatalogRelation() and
RelationIsAccessibleInLogicalDecoding() seemed kinda big to inline
and require rel.h to be included; so bringing them into bufmgr.h
would have spread that around too far.  Putting the quick tests in
an inline function which calls a non-inlined _impl function seemed
like the best compromise.

My connectivity problems to our big NUMA machines have not yet
been resolved, so I didn't have a better test case for this than
200 read-only clients at saturation on my single-socket i7, which
was only a 2.2% to 2.3% regression -- so I encourage anyone who was
able to create something more significant with
old_snapshot_threshold = -1 to try with the latest and report the
impact for your environment.  I'm not sure whether any more is
needed here.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: snapshot too old, configured by time

From
Bruce Momjian
Date:
On Tue, Apr 19, 2016 at 07:38:04AM -0400, Robert Haas wrote:
> 2. Without this feature, you can kill sessions or transactions to
> control bloat, but this feature is properly thought of as a way to
> avoid bloat *without* killing sessions or transactions.  You can let
> the session live, without having it generate bloat, just so long as it
> doesn't try to touch any data that has been recently modified.  We
> have no other feature in PostgreSQL that does something like that.

I kind of agreed with Tom about just aborting transactions that held
snapshots for too long, and liked the idea this could be set per
session, but the idea that we abort only if a backend actually touches
the old data is very nice.  I can see why the patch author worked hard
to do that.

How does/did Oracle handle this?  I assume we can't find a way to set
this per session, right?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +



Re: snapshot too old, configured by time

From
Amit Kapila
Date:
On Sat, Apr 23, 2016 at 8:34 AM, Bruce Momjian <bruce@momjian.us> wrote:
>
> On Tue, Apr 19, 2016 at 07:38:04AM -0400, Robert Haas wrote:
> > 2. Without this feature, you can kill sessions or transactions to
> > control bloat, but this feature is properly thought of as a way to
> > avoid bloat *without* killing sessions or transactions.  You can let
> > the session live, without having it generate bloat, just so long as it
> > doesn't try to touch any data that has been recently modified.  We
> > have no other feature in PostgreSQL that does something like that.
>
> I kind of agreed with Tom about just aborting transactions that held
> snapshots for too long, and liked the idea this could be set per
> session, but the idea that we abort only if a backend actually touches
> the old data is very nice.  I can see why the patch author worked hard
> to do that.
>
> How does/did Oracle handle this?
>

IIRC, Oracle gives this error when the space in the undo tablespace (aka
rollback segment) is low.  When the rollback segment gets full, it
overwrites the changed data which might be required by some old snapshot,
and when that old snapshot's statement tries to access the data (which is
already overwritten), it gets a "snapshot too old" error.  Assuming there
is enough space in the rollback segment, Oracle seems to provide a way to
control this via ALTER SYSTEM SET undo_retention = <time_in_secs>.

Now, if the above understanding of mine is correct, then I think the
current implementation done by Kevin is closer to what Oracle provides.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: snapshot too old, configured by time

From
Bruce Momjian
Date:
On Sat, Apr 23, 2016 at 12:48:08PM +0530, Amit Kapila wrote:
> On Sat, Apr 23, 2016 at 8:34 AM, Bruce Momjian <bruce@momjian.us> wrote:
> >
> > I kind of agreed with Tom about just aborting transactions that held
> > snapshots for too long, and liked the idea this could be set per
> > session, but the idea that we abort only if a backend actually touches
> > the old data is very nice.  I can see why the patch author worked hard
> > to do that.
> >
> > How does/did Oracle handle this?
> >
> 
> IIRC then Oracle gives this error when the space in undo tablespace (aka
> rollback segment) is low.  When the rollback segment gets full, it overwrites
> the changed data which might be required by some old snapshot and when that old
> snapshot statement tries to access the data (which is already overwritten), it
> gets "snapshot too old" error.  Assuming there is enough space in rollback
> segment, Oracle seems to provide a way via Alter System set undo_retention =
> <time_in_secs>. 
> 
> Now, if the above understanding of mine is correct, then I think the current
> implementation done by Kevin is closer to what Oracle provides.

But does the rollback only happen if the long-running Oracle transaction
tries to _access_ specific data that was in the undo segment, or _any_
data that potentially could have been in the undo segment?  If the
latter, it seems Kevin's approach is better, because you would actually
have to access the old data that was there in order to be canceled, not
just any data that could have been overwritten based on the xid.

Also, it seems we have similar behavior already in applying WAL on the
standby --- we delay WAL replay when there is a long-running
transaction.  Once the time expires, we apply the WAL.  Do we cancel the
long-running transaction at that time, or wait for the long-running
transaction to touch some WAL we just applied?  If the former, does
Kevin's new code allow us to do the latter?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +



Re: snapshot too old, configured by time

From
Amit Kapila
Date:
On Sat, Apr 23, 2016 at 7:50 PM, Bruce Momjian <bruce@momjian.us> wrote:
>
> On Sat, Apr 23, 2016 at 12:48:08PM +0530, Amit Kapila wrote:
> > On Sat, Apr 23, 2016 at 8:34 AM, Bruce Momjian <bruce@momjian.us> wrote:
> > >
> > > I kind of agreed with Tom about just aborting transactions that held
> > > snapshots for too long, and liked the idea this could be set per
> > > session, but the idea that we abort only if a backend actually touches
> > > the old data is very nice.  I can see why the patch author worked hard
> > > to do that.
> > >
> > > How does/did Oracle handle this?
> > >
> >
> > IIRC then Oracle gives this error when the space in undo tablespace (aka
> > rollback segment) is low.  When the rollback segment gets full, it overwrites
> > the changed data which might be required by some old snapshot and when that old
> > snapshot statement tries to access the data (which is already overwritten), it
> > gets "snapshot too old" error.  Assuming there is enough space in rollback
> > segment, Oracle seems to provide a way via Alter System set undo_retention =
> > <time_in_secs>.
> >
> > Now, if the above understanding of mine is correct, then I think the current
> > implementation done by Kevin is closer to what Oracle provides.
>
> But does the rollback only happen if the long-running Oracle transaction
> tries to _access_ specific data that was in the undo segment, or _any_
> data that potentially could have been in the undo segment?
>

It does so only when the long-running transaction tries to access the
specific data.  If you want to know more detail, you can read slides 7~29
of the attached presentation (with focus on slides 28 and 29).

>  If the
> later, it seems Kevin's approach is better because you would have to
> actually need to access old data that was there to be canceled, not just
> any data that could have been overwritten based on the xid.
>
> Also, it seems we have similar behavior already in applying WAL on the
> standby --- we delay WAL replay when there is a long-running
> transaction.  Once the time expires, we apply the WAL.  Do we cancel the
> long-running transaction at that time, or wait for the long-running
> transaction to touch some WAL we just applied?
>

As per my understanding, the error is given when any transaction tries to access the data.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: snapshot too old, configured by time

From
Alexander Korotkov
Date:
On Sat, Apr 23, 2016 at 5:20 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Sat, Apr 23, 2016 at 12:48:08PM +0530, Amit Kapila wrote:
> > On Sat, Apr 23, 2016 at 8:34 AM, Bruce Momjian <bruce@momjian.us> wrote:
> > >
> > > I kind of agreed with Tom about just aborting transactions that held
> > > snapshots for too long, and liked the idea this could be set per
> > > session, but the idea that we abort only if a backend actually touches
> > > the old data is very nice.  I can see why the patch author worked hard
> > > to do that.
> > >
> > > How does/did Oracle handle this?
> > >
> >
> > IIRC then Oracle gives this error when the space in undo tablespace (aka
> > rollback segment) is low.  When the rollback segment gets full, it overwrites
> > the changed data which might be required by some old snapshot and when that old
> > snapshot statement tries to access the data (which is already overwritten), it
> > gets "snapshot too old" error.  Assuming there is enough space in rollback
> > segment, Oracle seems to provide a way via Alter System set undo_retention =
> > <time_in_secs>.
> >
> > Now, if the above understanding of mine is correct, then I think the current
> > implementation done by Kevin is closer to what Oracle provides.
>
> But does the rollback only happen if the long-running Oracle transaction
> tries to _access_ specific data that was in the undo segment, or _any_
> data that potentially could have been in the undo segment?  If the
> latter, it seems Kevin's approach is better because you would have to
> actually need to access old data that was there to be canceled, not just
> any data that could have been overwritten based on the xid.

I'm not sure that we should rely that much on Oracle behavior.  It has a
very different MVCC model, so we can't apply the same features one-by-one:
they would have different pros and cons for us.

> Also, it seems we have similar behavior already in applying WAL on the
> standby --- we delay WAL replay when there is a long-running
> transaction.  Once the time expires, we apply the WAL.  Do we cancel the
> long-running transaction at that time, or wait for the long-running
> transaction to touch some WAL we just applied?  If the former, does
> Kevin's new code allow us to do the latter?

That makes sense to me.  If we could improve read-only queries on slaves
this way, Kevin's new code becomes much more justified.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: snapshot too old, configured by time

From
Bruce Momjian
Date:
On Sat, Apr 23, 2016 at 10:20:19AM -0400, Bruce Momjian wrote:
> On Sat, Apr 23, 2016 at 12:48:08PM +0530, Amit Kapila wrote:
> > On Sat, Apr 23, 2016 at 8:34 AM, Bruce Momjian <bruce@momjian.us> wrote:
> > >
> > > I kind of agreed with Tom about just aborting transactions that held
> > > snapshots for too long, and liked the idea this could be set per
> > > session, but the idea that we abort only if a backend actually touches
> > > the old data is very nice.  I can see why the patch author worked hard
> > > to do that.

As I understand it, a transaction trying to access a shared buffer
aborts if there was a cleanup on the page that removed rows it might be
interested in.  How does this handle cases where vacuum removes _pages_
from the table?  Does vacuum avoid this when there are running
transactions?

> Also, it seems we have similar behavior already in applying WAL on the
> standby --- we delay WAL replay when there is a long-running
> transaction.  Once the time expires, we apply the WAL.  Do we cancel the
> long-running transaction at that time, or wait for the long-running
> transaction to touch some WAL we just applied?  If the former, does
> Kevin's new code allow us to do the later?

Is this a TODO item?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +



Re: snapshot too old, configured by time

From
Kevin Grittner
Date:
On Sun, May 1, 2016 at 11:54 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Sat, Apr 23, 2016 at 10:20:19AM -0400, Bruce Momjian wrote:
>> On Sat, Apr 23, 2016 at 12:48:08PM +0530, Amit Kapila wrote:
>>> On Sat, Apr 23, 2016 at 8:34 AM, Bruce Momjian <bruce@momjian.us> wrote:
>>>>
>>>> I kind of agreed with Tom about just aborting transactions that held
>>>> snapshots for too long, and liked the idea this could be set per
>>>> session, but the idea that we abort only if a backend actually touches
>>>> the old data is very nice.  I can see why the patch author worked hard
>>>> to do that.
>
> As I understand it, a transaction trying to access a shared buffer
> aborts if there was a cleanup on the page that removed rows it might be
> interested in.  How does this handle cases where vacuum removes _pages_
> from the table?

(1)  When the "snapshot too old" feature is enabled
(old_snapshot_threshold >= 0) relations are not truncated, so pages
cannot be removed that way.  This mainly protects against a seq
scan having the problem you describe.

(2)  Other than a seq scan, you could miss a page when descending
through an index or following sibling pointers within an index.  In
either case you can't remove a page without modifying the page
pointing to it to no longer do so, so the modified LSN on the
parent or sibling will trigger the error.

Note that a question has recently been raised regarding hash
indexes (which should perhaps be generalized to any non-WAL-logged
index on a permanent table).  Since that is a correctness issue,
versus a performance issue affecting only how many people will find
the feature useful, I will add that to the release blockers list
and prioritize it ahead of those issues only affecting how many
people will find it useful.

> Does vacuum avoid this when there are running transactions?

I'm not sure I understand the question.

>> Also, it seems we have similar behavior already in applying WAL on the
>> standby --- we delay WAL replay when there is a long-running
>> transaction.  Once the time expires, we apply the WAL.  Do we cancel the
>> long-running transaction at that time, or wait for the long-running
>> transaction to touch some WAL we just applied?  If the former, does
>> Kevin's new code allow us to do the later?
>
> Is this a TODO item?

I'm not aware of any TODO items existing or needed here.  The
feature operates by adjusting the xmin used by vacuum and pruning,
and leaving all the other mechanisms functioning as they were.
That looked to me like it should interact with replication streams
correctly.  If someone sees something that needs adjustment please
speak up Real Soon Now.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: snapshot too old, configured by time

From
Bruce Momjian
Date:
On Mon, May  2, 2016 at 03:50:36PM -0500, Kevin Grittner wrote:
> >> Also, it seems we have similar behavior already in applying WAL on the
> >> standby --- we delay WAL replay when there is a long-running
> >> transaction.  Once the time expires, we apply the WAL.  Do we cancel the
> >> long-running transaction at that time, or wait for the long-running
> >> transaction to touch some WAL we just applied?  If the former, does
> >> Kevin's new code allow us to do the later?
> >
> > Is this a TODO item?
> 
> I'm not aware of any TODO items existing or needed here.  The
> feature operates by adjusting the xmin used by vacuum and pruning,
> and leaving all the other mechanisms functioning as they were.
> That looked to me like it should interact with replication streams
> correctly.  If someone sees something that needs adjustment please
> speak up Real Soon Now.

My question is whether this method could also be used to avoid read-only
query cancellation when we force replay of a conflicting WAL record.
Could we wait for the read-only query to try to _access_ some old data
before cancelling it?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +