Thread: GetOldestXmin going backwards is dangerous after all
I've been able to reproduce the problem reported by Pius Chan in bug #7819. With some logging added that prints the OldestXmin values used by vacuum and cluster operations, the reason is fairly clear:

2013-02-01 13:41:12 EST 8011 LOG: vacuuming "test" with OldestXmin 1760160 FreezeLimit 4246727456
2013-02-01 13:41:14 EST 8011 LOG: automatic vacuum of table "test.public.test": index scans: 1 pages: 0 removed, 635 remain tuples: 48895 removed, 91 remain system usage: CPU 0.00s/0.03u sec elapsed 1.64 sec
2013-02-01 13:41:14 EST 8011 LOG: automatic analyze of table "test.public.test" system usage: CPU 0.00s/0.00u sec elapsed 0.09 sec
2013-02-01 13:41:14 EST 8011 LOG: vacuuming "pg_toast_115435" with OldestXmin 1761518 FreezeLimit 4246728814
2013-02-01 13:41:57 EST 8011 LOG: automatic vacuum of table "test.pg_toast.pg_toast_115435": index scans: 1 pages: 0 removed, 12904 remain tuples: 7624 removed, 43992 remain system usage: CPU 0.21s/0.13u sec elapsed 42.79 sec
2013-02-01 13:41:58 EST 7025 LOG: clustering "test" with OldestXmin 1752719 FreezeXid 1630979
2013-02-01 13:41:58 EST 7025 ERROR: missing chunk number 0 for toast value 215828 in pg_toast_115435

IOW, autovacuum worker 8011 vacuumed "test" with one xmin, and then a bit later vacuumed its toast table with a slightly newer xmin. This means that certain tuples' toast tuples could get cleaned out of the toast table while the parent tuples are still there in the heap. Normally this is fine because the parent tuples are certainly dead and beyond snapshot range, so no one will make any attempt to fetch their toast values. But then, here comes CLUSTER with an older xmin --- so it thinks it needs to copy over some of those dead-but-possibly-still-visible-to-somebody tuples, and kaboom.

Now, CLUSTER is very careful to make sure that it gets a lock on the toast table before computing OldestXmin, with the express intent of preventing this type of problem (cf comment in copy_heap_data).
However, if the result of GetOldestXmin() can ever go backwards, that doesn't help. And guess what ... it can. The header comment for GetOldestXmin() contains a long rationalization for why it's safe for the result to sometimes go backwards, but now that I've seen this example I'm pretty sure that that's all horsepucky.

I think we need to either ensure that GetOldestXmin is strictly nondecreasing, or find some more-direct way of interlocking vacuuming of toast tables with their parents. We could possibly do the former by tracking the last-known value in shared memory, but it'd be complicated because the value can (legitimately) vary depending on which database you ask about. The only backpatchable way I can think of to do the latter is to undo the decoupling of main and toast-table vacuuming, and then say that we simply don't change the value of OldestXmin when going to vacuum the toast table. This is a bit unpleasant.

In any case, I no longer have much faith in the idea that letting GetOldestXmin go backwards is really safe.

The one bright spot is that AFAICS this isn't a true data loss hazard; no possibly-still-needed data can be lost. Any failures would be transient failures of CLUSTER or VACUUM FULL.

Thoughts?

regards, tom lane
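Tom's first option --- tracking the last-known value so the published horizon is strictly nondecreasing --- can be sketched in miniature. This is a standalone model, not backend code: `last_known_xmin` stands in for the per-database slot in shared memory that a real fix would need (which is exactly the complication noted above), and the comparison follows PostgreSQL's modulo-2^32 XID ordering.

```c
#include <stdint.h>

typedef uint32_t TransactionId;

/* Modulo-2^32 ordering, as in TransactionIdPrecedes(): true if id1 < id2. */
static int
xid_precedes(TransactionId id1, TransactionId id2)
{
    return (int32_t) (id1 - id2) < 0;
}

/* Hypothetical last-known horizon; a real fix would need one slot per
 * database in shared memory. */
static TransactionId last_known_xmin = 0;

/* Clamp a freshly computed OldestXmin so the published value never
 * moves backwards within this model. */
static TransactionId
clamp_oldest_xmin(TransactionId computed)
{
    if (last_known_xmin != 0 && xid_precedes(computed, last_known_xmin))
        computed = last_known_xmin;     /* would have gone backwards */
    last_known_xmin = computed;
    return computed;
}
```

With the values from the log above, the vacuum of "test" publishes 1760160, the toast-table vacuum advances that to 1761518, and CLUSTER's later computation of 1752719 would be clamped up to 1761518 instead of regressing.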
On Fri, Feb 1, 2013 at 2:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> In any case, I no longer have much faith in the idea that letting
> GetOldestXmin go backwards is really safe.

That is admittedly kind of weird behavior, but I think one could equally blame this on CLUSTER. This is hardly the first time we've had to patch CLUSTER's handling of TOAST tables (cf commits 21b446dd0927f8f2a187d9461a0d3f11db836f77, 7b0d0e9356963d5c3e4d329a917f5fbb82a2ef05, 83b7584944b3a9df064cccac06822093f1a83793) and it doesn't seem unlikely that we might go the full ten rounds.

Having said that, I agree that a fix in GetOldestXmin() would be nice if we could find one, but since the comment describes at least three different ways the value can move backwards, I'm not sure that there's really a practical solution there, especially if you want something we can back-patch. I'm more inclined to think that we need to find a way to make CLUSTER more robust against xmin regressions, but I confess I don't have any good ideas about how to do that, either.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Feb 1, 2013 at 2:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In any case, I no longer have much faith in the idea that letting
>> GetOldestXmin go backwards is really safe.

> That is admittedly kind of weird behavior, but I think one could
> equally blame this on CLUSTER. This is hardly the first time we've
> had to patch CLUSTER's handling of TOAST tables (cf commits
> 21b446dd0927f8f2a187d9461a0d3f11db836f77,
> 7b0d0e9356963d5c3e4d329a917f5fbb82a2ef05,
> 83b7584944b3a9df064cccac06822093f1a83793) and it doesn't seem unlikely
> that we might go the full ten rounds.

Yeah, but I'm not sure whether CLUSTER is the appropriate blamee or whether it's more like the canary in the coal mine, first to expose problematic behaviors elsewhere. The general problem here is really that we're cleaning out toast tuples while the referencing main-heap tuple still physically exists. How safe do you think that is? That did not ever happen before we decoupled autovacuuming of main and toast tables, either --- so a good case could be made that that idea is fundamentally broken.

> Having said that, I agree that a fix in GetOldestXmin() would be nice
> if we could find one, but since the comment describes at least three
> different ways the value can move backwards, I'm not sure that there's
> really a practical solution there, especially if you want something we
> can back-patch.

Well, if we were tracking the latest value in shared memory, we could certainly clamp to that to ensure it didn't go backwards. The problem is where to find storage for a per-DB value. I thought about storing each session's latest value in its PGPROC and taking the max over same-DB sessions during GetOldestXmin's ProcArray scan, but that doesn't work because an autovacuum process might disappear and thus destroy the needed info just before CLUSTER looks for it.

regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
> Having said that, I agree that a fix in GetOldestXmin() would be nice
> if we could find one, but since the comment describes at least three
> different ways the value can move backwards, I'm not sure that there's
> really a practical solution there, especially if you want something we
> can back-patch.

Actually, wait a second. As you say, the comment describes three known ways to make it go backwards. It strikes me that all three are fixable:

* if allDbs is FALSE and there are no transactions running in the current
* database, GetOldestXmin() returns latestCompletedXid. If a transaction
* begins after that, its xmin will include in-progress transactions in other
* databases that started earlier, so another call will return a lower value.

The reason this is a problem is that GetOldestXmin ignores XIDs of processes that are connected to other DBs. It now seems to me that this is a flat-out bug. It can ignore their xmins, but it should include their XIDs, because the point of considering those XIDs is that they may contribute to the xmins of snapshots computed in the future by processes in our own DB. And snapshots never exclude any XIDs on the basis of which DB they're in. (They can't really, since we can't know when the snap is taken whether it might be used to examine shared catalogs.)

* There are also replication-related effects: a walsender
* process can set its xmin based on transactions that are no longer running
* in the master but are still being replayed on the standby, thus possibly
* making the GetOldestXmin reading go backwards. In this case there is a
* possibility that we lose data that the standby would like to have, but
* there is little we can do about that --- data is only protected if the
* walsender runs continuously while queries are executed on the standby.
* (The Hot Standby code deals with such cases by failing standby queries
* that needed to access already-removed data, so there's no integrity bug.)
This is just bogus. Why don't we make it a requirement on walsenders that they never move their advertised xmin backwards (or initially set it to less than the prevailing global xmin)? There's no real benefit to allowing them to try to move the global xmin backwards, because any data that they might hope to protect that way could be gone already.

* The return value is also adjusted with vacuum_defer_cleanup_age, so
* increasing that setting on the fly is another easy way to make
* GetOldestXmin() move backwards, with no consequences for data integrity.

And as for that, it's been pretty clear for awhile that allowing vacuum_defer_cleanup_age to change on the fly was a bad idea we'd eventually have to undo. The day of reckoning has arrived: it needs to be PGC_POSTMASTER.

regards, tom lane
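The proposed repair of the first case --- keep ignoring other-database xmins but stop ignoring other-database XIDs --- can be modeled with a toy scan. The `ProcSlot` struct and the function below are illustrative stand-ins for the ProcArray and GetOldestXmin(), not the real code, and wraparound of latestCompletedXid is deliberately glossed over.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t TransactionId;
typedef uint32_t Oid;

#define InvalidTransactionId ((TransactionId) 0)

/* Modulo-2^32 ordering, per TransactionIdPrecedes(). */
static int
xid_precedes(TransactionId id1, TransactionId id2)
{
    return (int32_t) (id1 - id2) < 0;
}

/* Illustrative stand-in for one ProcArray entry. */
typedef struct
{
    Oid           dbid;  /* database the backend is connected to */
    TransactionId xid;   /* its own transaction, if any */
    TransactionId xmin;  /* oldest xid its snapshot can still see */
} ProcSlot;

/* Sketch of the fixed scan: other-DB xmins stay ignored, but other-DB
 * XIDs are folded in, because they can reappear in the xmin of a
 * snapshot taken later by a backend in our own database. */
static TransactionId
oldest_xmin(const ProcSlot *slots, size_t n, Oid mydb,
            TransactionId latest_completed)
{
    TransactionId result = latest_completed + 1;   /* ignoring wraparound */

    for (size_t i = 0; i < n; i++)
    {
        if (slots[i].xid != InvalidTransactionId &&
            xid_precedes(slots[i].xid, result))
            result = slots[i].xid;                 /* any database */

        if (slots[i].dbid == mydb &&
            slots[i].xmin != InvalidTransactionId &&
            xid_precedes(slots[i].xmin, result))
            result = slots[i].xmin;                /* our database only */
    }
    return result;
}
```

With no transactions in our own database but XID 90 running in another one, this returns 90 rather than latestCompletedXid + 1, so a later call cannot come back lower once that transaction shows up in some same-DB snapshot's xmin.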
On 1 February 2013 23:56, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Well, if we were tracking the latest value in shared memory, we could
> certainly clamp to that to ensure it didn't go backwards. The problem
> is where to find storage for a per-DB value.

Adding new data columns to catalogs in backbranches seems like a great reason to have an hstore column on every catalog table. That way we can just add anything we need without causing other problems. Obviously something for the future.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 1 February 2013 23:56, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Feb 1, 2013 at 2:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> In any case, I no longer have much faith in the idea that letting
>>> GetOldestXmin go backwards is really safe.
>
>> That is admittedly kind of weird behavior, but I think one could
>> equally blame this on CLUSTER. This is hardly the first time we've
>> had to patch CLUSTER's handling of TOAST tables (cf commits
>> 21b446dd0927f8f2a187d9461a0d3f11db836f77,
>> 7b0d0e9356963d5c3e4d329a917f5fbb82a2ef05,
>> 83b7584944b3a9df064cccac06822093f1a83793) and it doesn't seem unlikely
>> that we might go the full ten rounds.
>
> Yeah, but I'm not sure whether CLUSTER is the appropriate blamee or
> whether it's more like the canary in the coal mine, first to expose
> problematic behaviors elsewhere. The general problem here is really
> that we're cleaning out toast tuples while the referencing main-heap
> tuple still physically exists. How safe do you think that is?

Agree that CLUSTER is just first to hit the problem. As long as the problem exists, we have issues.

> That
> did not ever happen before we decoupled autovacuuming of main and toast
> tables, either --- so a good case could be made that that idea is
> fundamentally broken.

It's broken, but not fundamentally. We can decouple the autovacuuming, as long as we don't decouple the xid used.

"Fixing" GetOldestXmin() seems like the easiest thing to do for backbranches, but seems a little heavy-handed for a general fix - though doing that will probably close any future thinkos of similar nature.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-02-01 19:24:02 -0500, Tom Lane wrote:
> * There are also replication-related effects: a walsender
> * process can set its xmin based on transactions that are no longer running
> * in the master but are still being replayed on the standby, thus possibly
> * making the GetOldestXmin reading go backwards. In this case there is a
> * possibility that we lose data that the standby would like to have, but
> * there is little we can do about that --- data is only protected if the
> * walsender runs continuously while queries are executed on the standby.
> * (The Hot Standby code deals with such cases by failing standby queries
> * that needed to access already-removed data, so there's no integrity bug.)
>
> This is just bogus. Why don't we make it a requirement on walsenders
> that they never move their advertised xmin backwards (or initially set
> it to less than the prevailing global xmin)? There's no real benefit to
> allowing them to try to move the global xmin backwards, because any data
> that they might hope to protect that way could be gone already.

The problem imo has multiple sides.

a) We currently don't have a way to assign the minimal valid xmin to the current backend. I already need that for logical replication, so we might just as well add it now: http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=commitdiff;h=84f6ed9550b65b0584589977877e9a997daeb61d With that it's just a matter of taking ProcArrayLock exclusively before calling GetOldestXmin and assigning it to MyPgXact->xmin before releasing it.

b) We don't assign the xmin early enough; we only set it when the first feedback message arrives, but we should set it when walsender starts streaming.

c) After a disconnect the feedback message will rather likely ask for an xmin horizon that's not valid anymore on the primary. If the disconnect was short enough, often enough that doesn't matter because nothing has been cleared out, but it doesn't really work all that well.
That's still better than setting it to the currently valid minimal xmin horizon, because it prevents cleanup from that moment on. I don't see how this can be significantly improved without persistent knowledge about standbys.

> * The return value is also adjusted with vacuum_defer_cleanup_age, so
> * increasing that setting on the fly is another easy way to make
> * GetOldestXmin() move backwards, with no consequences for data integrity.
>
> And as for that, it's been pretty clear for awhile that allowing
> vacuum_defer_cleanup_age to change on the fly was a bad idea we'd
> eventually have to undo. The day of reckoning has arrived: it needs
> to be PGC_POSTMASTER.

Not sure if that would really solve anything; similar problems seem to exist even if the postmaster has been restarted in between.

Btw, I wonder if that comment in GetOldestXmin still has good justification, now that we allow cascading standbys:

> Also note that we intentionally don't apply vacuum_defer_cleanup_age
> on standby servers.

Greetings, Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Feb 1, 2013 at 7:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Having said that, I agree that a fix in GetOldestXmin() would be nice
>> if we could find one, but since the comment describes at least three
>> different ways the value can move backwards, I'm not sure that there's
>> really a practical solution there, especially if you want something we
>> can back-patch.
>
> Actually, wait a second. As you say, the comment describes three known
> ways to make it go backwards. It strikes me that all three are fixable:
>
> * if allDbs is FALSE and there are no transactions running in the current
> * database, GetOldestXmin() returns latestCompletedXid. If a transaction
> * begins after that, its xmin will include in-progress transactions in other
> * databases that started earlier, so another call will return a lower value.
>
> The reason this is a problem is that GetOldestXmin ignores XIDs of
> processes that are connected to other DBs. It now seems to me that this
> is a flat-out bug. It can ignore their xmins, but it should include
> their XIDs, because the point of considering those XIDs is that they may
> contribute to the xmins of snapshots computed in the future by processes
> in our own DB. And snapshots never exclude any XIDs on the basis of
> which DB they're in. (They can't really, since we can't know when the
> snap is taken whether it might be used to examine shared catalogs.)

This idea is superficially appealing, but I can't get excited about it, because it will cause GetOldestXmin() to compute older (that is, more conservative) XIDs than it does today, which means we'll vacuum away less. Considering the amount of pain we have around vacuum today, that's surely got to be going the wrong direction. I'm inclined to think that the root of the problem is that the computation is too conservative already.
I mean, what do we really need to include in the computation in order not to vacuum away any data that's still needed? There are basically two concerns:

- You can't remove data that any snapshot in the current database might try to read. So you have to include the xmins of those snapshots in your oldest-xmin value. But, not really, because those xmins were computed by looking at all running XIDs across the whole database, and if their oldest xmin value came from some XID in another database, it's not relevant. What you want (but of course, can't get) is the db-specific xmin horizon for each other snapshot.

- You have to worry about new snapshots that are about to appear but haven't yet. But, really, how big of a concern is this? No transactions can exit the running set while we hold ProcArrayLock, so it seems to me that there ought to be a pretty tight limit on what can show up here. For example, if we've already taken a snapshot, surely a subsequently taken snapshot can't really need an older xmin horizon than the one we computed for our own snapshot.

In theory, it seems like we might be able to solve these problems by having each backend advertise TWO xmin horizons: one representing the oldest xmin it might be able to see in a system catalog, and the other representing the oldest xmin it might be able to see in its own database. This would allow significantly more precise tracking of what is really needed in the current database, but there's an obvious implementation concern, which is that GetSnapshotData is hot enough that individual instructions matter.

> * There are also replication-related effects: a walsender
> * process can set its xmin based on transactions that are no longer running
> * in the master but are still being replayed on the standby, thus possibly
> * making the GetOldestXmin reading go backwards. In this case there is a
> * possibility that we lose data that the standby would like to have, but
> * there is little we can do about that --- data is only protected if the
> * walsender runs continuously while queries are executed on the standby.
> * (The Hot Standby code deals with such cases by failing standby queries
> * that needed to access already-removed data, so there's no integrity bug.)
>
> This is just bogus. Why don't we make it a requirement on walsenders
> that they never move their advertised xmin backwards (or initially set
> it to less than the prevailing global xmin)? There's no real benefit to
> allowing them to try to move the global xmin backwards, because any data
> that they might hope to protect that way could be gone already.

I think there is already some code that aims to do this, but I might be misremembering. IOW, this comment might already be incorrect.

> * The return value is also adjusted with vacuum_defer_cleanup_age, so
> * increasing that setting on the fly is another easy way to make
> * GetOldestXmin() move backwards, with no consequences for data integrity.
>
> And as for that, it's been pretty clear for awhile that allowing
> vacuum_defer_cleanup_age to change on the fly was a bad idea we'd
> eventually have to undo. The day of reckoning has arrived: it needs
> to be PGC_POSTMASTER.

We might even want to consider getting rid of it altogether at this point, but neither that nor a change to PGC_POSTMASTER seems a likely candidate for back-patching.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
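The two-horizon thought-experiment can be sketched the same way as the other proposals in the thread. Everything here is hypothetical --- no such per-backend advertisement exists in the tree --- but it shows why tracking a catalog horizon and a database horizon separately would let vacuum be *less* conservative for ordinary tables, not more.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t TransactionId;
typedef uint32_t Oid;

/* Modulo-2^32 ordering, per TransactionIdPrecedes(). */
static int
xid_precedes(TransactionId id1, TransactionId id2)
{
    return (int32_t) (id1 - id2) < 0;
}

/* Hypothetical per-backend advertisement: two horizons instead of one. */
typedef struct
{
    Oid           dbid;
    TransactionId catalog_xmin;  /* oldest xid it might see in shared catalogs */
    TransactionId db_xmin;       /* oldest xid it might see in its own database */
} BackendHorizons;

/* Oldest xmin relevant to vacuuming a given relation: a shared catalog
 * must respect every backend's catalog horizon, while an ordinary table
 * only needs to respect backends in the same database. */
static TransactionId
oldest_for_relation(const BackendHorizons *b, size_t n,
                    Oid reldb, int rel_is_shared,
                    TransactionId latest_completed)
{
    TransactionId result = latest_completed + 1;   /* ignoring wraparound */

    for (size_t i = 0; i < n; i++)
    {
        TransactionId h = rel_is_shared ? b[i].catalog_xmin : b[i].db_xmin;

        if (!rel_is_shared && b[i].dbid != reldb)
            continue;                    /* other DB: irrelevant for this table */
        if (h != 0 && xid_precedes(h, result))
            result = h;
    }
    return result;
}
```

Given one backend per database with catalog horizons 50 and 60 but db horizons 80 and 70, vacuuming an ordinary table in the first database could use 80, while a shared catalog would still have to use 50.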
On Fri, Feb 1, 2013 at 6:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> That is admittedly kind of weird behavior, but I think one could
>> equally blame this on CLUSTER. This is hardly the first time we've
>> had to patch CLUSTER's handling of TOAST tables (cf commits
>> 21b446dd0927f8f2a187d9461a0d3f11db836f77,
>> 7b0d0e9356963d5c3e4d329a917f5fbb82a2ef05,
>> 83b7584944b3a9df064cccac06822093f1a83793) and it doesn't seem unlikely
>> that we might go the full ten rounds.
>
> Yeah, but I'm not sure whether CLUSTER is the appropriate blamee or
> whether it's more like the canary in the coal mine, first to expose
> problematic behaviors elsewhere. The general problem here is really
> that we're cleaning out toast tuples while the referencing main-heap
> tuple still physically exists. How safe do you think that is? That
> did not ever happen before we decoupled autovacuuming of main and toast
> tables, either --- so a good case could be made that that idea is
> fundamentally broken.

I confess I'm not entirely sanguine about that. A similar issue arises with index tuples referencing heap tuples, and we go to great lengths to make sure that the last vestiges of the heap tuple can't be removed until the index tuples are well and truly dead. If we were to adopt the same approach here, it would mean that, instead of consulting XIDs at all, we'd remove TOAST pointers only when there were provably no main-table tuples pointing to them. I'm not really proposing that, but it's an interesting thought-experiment. What we're actually doing is relying on values to match exactly that were never intended as more than an approximation.

Can you point me to the commit that decoupled this? What was the motivation for it?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2 February 2013 14:24, Andres Freund <andres@2ndquadrant.com> wrote:
> b) We don't assign the xmin early enough, we only set it when the first
> feedback message arrives, but we should set it when walsender starts
> streaming.

That's easy to fix.

> c) After a disconnect the feedback message will rather likely ask for an
> xmin horizon thats not valid anymore on the primary. If the disconnect
> was short enough often enough that doesn't matter because nothing has
> been cleared out, but it doesn't really work all that well.
> Thats still better than setting it to the currently valid minimal xmin
> horizon because it prevents cleanup from that moment on.
> I don't see how this can be significantly improved without persistent
> knowledge about standbys.

We could delay startup of the standby until the xmin on the standby reaches the xmin on the master.

So when the standby has hot_standby_feedback = on, at standby connection we set the xmin of the walsender to be the current value on the master, then we disallow connections on standby until we have replayed up to that xmin on the standby. That way the xmin of the walsender never goes backwards nor do we get cancelations on the standby.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2 February 2013 00:24, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> * The return value is also adjusted with vacuum_defer_cleanup_age, so
> * increasing that setting on the fly is another easy way to make
> * GetOldestXmin() move backwards, with no consequences for data integrity.
>
> And as for that, it's been pretty clear for awhile that allowing
> vacuum_defer_cleanup_age to change on the fly was a bad idea we'd
> eventually have to undo. The day of reckoning has arrived: it needs
> to be PGC_POSTMASTER.

Agreed, will fix.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
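For reference, the adjustment being discussed is essentially a subtraction guarded against landing on the special XIDs, along the lines of what GetOldestXmin() does (this is a paraphrase, not the verbatim source). Since nothing remembers the pre-subtraction value, raising the GUC between two calls simply makes the second result older --- which is the backwards movement at issue.

```c
#include <stdint.h>

typedef uint32_t TransactionId;

#define FirstNormalTransactionId ((TransactionId) 3)

/* Paraphrase of the vacuum_defer_cleanup_age adjustment: push the
 * horizon back by the configured number of XIDs, taking care not to
 * land on the special XIDs below FirstNormalTransactionId. */
static TransactionId
apply_defer_cleanup(TransactionId result, uint32_t defer_age)
{
    result -= defer_age;                   /* wraps modulo 2^32 */
    if (result < FirstNormalTransactionId)
        result = FirstNormalTransactionId; /* landed on a special XID */
    return result;
}
```

Raising defer_age from 0 to 10000 between two calls yields a result roughly 10000 XIDs older even if no transactions ended in between; making the GUC PGC_POSTMASTER removes the on-the-fly change, though (as noted later in the thread) not the restart case.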
On 2 February 2013 00:24, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> * if allDbs is FALSE and there are no transactions running in the current
> * database, GetOldestXmin() returns latestCompletedXid. If a transaction
> * begins after that, its xmin will include in-progress transactions in other
> * databases that started earlier, so another call will return a lower value.
>
> The reason this is a problem is that GetOldestXmin ignores XIDs of
> processes that are connected to other DBs. It now seems to me that this
> is a flat-out bug. It can ignore their xmins, but it should include
> their XIDs, because the point of considering those XIDs is that they may
> contribute to the xmins of snapshots computed in the future by processes
> in our own DB. And snapshots never exclude any XIDs on the basis of
> which DB they're in. (They can't really, since we can't know when the
> snap is taken whether it might be used to examine shared catalogs.)

Agree that's a bug.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-02-02 18:32:44 +0000, Simon Riggs wrote:
> On 2 February 2013 14:24, Andres Freund <andres@2ndquadrant.com> wrote:
>
> > b) We don't assign the xmin early enough, we only set it when the first
> > feedback message arrives, but we should set it when walsender starts
> > streaming.
>
> That's easy to fix.

Not trivial, but not too hard, yes. When the standby initially connects we don't yet know which xid will be required because consistency hasn't yet been achieved.

> > c) After a disconnect the feedback message will rather likely ask for an
> > xmin horizon thats not valid anymore on the primary. If the disconnect
> > was short enough often enough that doesn't matter because nothing has
> > been cleared out, but it doesn't really work all that well.
> > Thats still better than setting it to the currently valid minimal xmin
> > horizon because it prevents cleanup from that moment on.
> > I don't see how this can be significantly improved without persistent
> > knowledge about standbys.
>
> We could delay startup of the standby until the xmin on the standby
> reaches the xmin on the master.
>
> So when the standby has hot_standby_feedback = on, at standby
> connection we set the xmin of the walsender to be the current value on
> the master, then we disallow connections on standby until we have
> replayed up to that xmin on the standby. That way the xmin of the
> walsender never goes backwards nor do we get cancelations on the
> standby.

That's easy enough when the standby is initially started, but doesn't work that well if just the connection between both failed (or the master was restarted) and a reconnect worked. To make it work similarly in that case we would have to throw everyone out, which would kind of counteract the whole idea.

Greetings, Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2 February 2013 18:38, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-02-02 18:32:44 +0000, Simon Riggs wrote:
>> On 2 February 2013 14:24, Andres Freund <andres@2ndquadrant.com> wrote:
>>
>> > b) We don't assign the xmin early enough, we only set it when the first
>> > feedback message arrives, but we should set it when walsender starts
>> > streaming.
>>
>> That's easy to fix.
>
> Not trivial, but not too hard, yes. When the standby initially connects
> we don't yet know which xid will be required because consistency hasn't
> yet been achieved.

We don't need to change the protocol, we just need to issue a hot_standby_feedback message immediately after connection. That looks trivial to me.

>> > c) After a disconnect the feedback message will rather likely ask for an
>> > xmin horizon thats not valid anymore on the primary. If the disconnect
>> > was short enough often enough that doesn't matter because nothing has
>> > been cleared out, but it doesn't really work all that well.
>> > Thats still better than setting it to the currently valid minimal xmin
>> > horizon because it prevents cleanup from that moment on.
>> > I don't see how this can be significantly improved without persistent
>> > knowledge about standbys.
>>
>> We could delay startup of the standby until the xmin on the standby
>> reaches the xmin on the master.
>>
>> So when the standby has hot_standby_feedback = on, at standby
>> connection we set the xmin of the walsender to be the current value on
>> the master, then we disallow connections on standby until we have
>> replayed up to that xmin on the standby. That way the xmin of the
>> walsender never goes backwards nor do we get cancelations on the
>> standby.
>
> Thats easy enough when the standby is initially started but doesn't work
> that well if just the connection between both failed (or the master was
> restarted) and a reconnect worked. To make it work similarly in that
> case we would have to throw everyone out which would kind of counteract
> the whole idea.

Agreed. It was worth considering, at least.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-02-01 19:24:02 -0500, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > Having said that, I agree that a fix in GetOldestXmin() would be nice
> > if we could find one, but since the comment describes at least three
> > different ways the value can move backwards, I'm not sure that there's
> > really a practical solution there, especially if you want something we
> > can back-patch.
>
> Actually, wait a second. As you say, the comment describes three known
> ways to make it go backwards. It strikes me that all three are fixable:
>
> * if allDbs is FALSE and there are no transactions running in the current
> * database, GetOldestXmin() returns latestCompletedXid. If a transaction
> * begins after that, its xmin will include in-progress transactions in other
> * databases that started earlier, so another call will return a lower value.
>
> The reason this is a problem is that GetOldestXmin ignores XIDs of
> processes that are connected to other DBs. It now seems to me that this
> is a flat-out bug. It can ignore their xmins, but it should include
> their XIDs, because the point of considering those XIDs is that they may
> contribute to the xmins of snapshots computed in the future by processes
> in our own DB. And snapshots never exclude any XIDs on the basis of
> which DB they're in. (They can't really, since we can't know when the
> snap is taken whether it might be used to examine shared catalogs.)
>
> * The return value is also adjusted with vacuum_defer_cleanup_age, so
> * increasing that setting on the fly is another easy way to make
> * GetOldestXmin() move backwards, with no consequences for data integrity.
>
> And as for that, it's been pretty clear for awhile that allowing
> vacuum_defer_cleanup_age to change on the fly was a bad idea we'd
> eventually have to undo. The day of reckoning has arrived: it needs
> to be PGC_POSTMASTER.

ISTM that the original problem can still occur, even after Simon's commit.
1) start with -c vacuum_defer_cleanup_age=0
2) autovacuum vacuums "test";
3) restart with -c vacuum_defer_cleanup_age=10000
4) autovacuum vacuums "test"'s toast table;

should result in about the same ERROR, shouldn't it?

Given that there seemingly isn't yet a way to fix that people agree on, and that it "only" results in a transient error, I think the fix for this should be pushed after the next point release.

Greetings, Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
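The interleaving in those four steps reduces to a one-line predicate: trouble is possible whenever the horizon CLUSTER (or VACUUM FULL) computes precedes the horizon the toast table was last vacuumed with. A minimal model of that condition, with an invented function name:

```c
#include <stdint.h>

typedef uint32_t TransactionId;

/* Modulo-2^32 ordering, per TransactionIdPrecedes(). */
static int
xid_precedes(TransactionId id1, TransactionId id2)
{
    return (int32_t) (id1 - id2) < 0;
}

/* True when CLUSTER may try to copy a dead-but-not-yet-prunable heap
 * tuple whose toast chunks were already removed --- the "missing chunk
 * number 0" condition from the original report.  The restart scenario
 * triggers it because raising vacuum_defer_cleanup_age shrinks
 * cluster_xmin by the new defer age. */
static int
cluster_at_risk(TransactionId toast_vacuum_xmin, TransactionId cluster_xmin)
{
    return xid_precedes(cluster_xmin, toast_vacuum_xmin);
}
```

With the numbers from the original log, cluster_at_risk(1761518, 1752719) holds, matching the observed failure.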
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-02-01 19:24:02 -0500, Tom Lane wrote:
>> And as for that, it's been pretty clear for awhile that allowing
>> vacuum_defer_cleanup_age to change on the fly was a bad idea we'd
>> eventually have to undo. The day of reckoning has arrived: it needs
>> to be PGC_POSTMASTER.

> ISTM that the original problem can still occur, even after Simon's
> commit.
> 1) start with -c vacuum_defer_cleanup_age=0
> 2) autovacuum vacuums "test";
> 3) restart with -c vacuum_defer_cleanup_age=10000
> 4) autovacuum vacuums "test"'s toast table;

> should result in about the same ERROR, shouldn't it?

Hm ... yeah, you're right. So that's still not bulletproof.

> Given that there seemingly isn't yet a way to fix that people agree on
> and that it "only" result in a transient error I think the fix for this
> should be pushed after the next point release.

Agreed, we can let this go until we have a more complete solution.

Simon, would you revert the vacuum_defer_cleanup_age changes? We should wait till we have a complete fix before forcing that redefinition on users.

regards, tom lane
On 4 February 2013 16:07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon, would you revert the vacuum_defer_cleanup_age changes?

Will do

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-02-04 11:07:17 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2013-02-01 19:24:02 -0500, Tom Lane wrote:
> >> And as for that, it's been pretty clear for awhile that allowing
> >> vacuum_defer_cleanup_age to change on the fly was a bad idea we'd
> >> eventually have to undo.  The day of reckoning has arrived: it needs
> >> to be PGC_POSTMASTER.
>
> > ISTM that the original problem can still occur, even after Simon's
> > commit.
> > 1) start with -c vacuum_defer_cleanup_age=0
> > 2) autovacuum vacuums "test";
> > 3) restart with -c vacuum_defer_cleanup_age=10000
> > 4) autovacuum vacuums "test"'s toast table;
>
> > should result in about the same ERROR, shouldn't it?
>
> Hm ... yeah, you're right.  So that's still not bulletproof.

ISTM that, unless we can guarantee that this won't cause GetOldestXmin
to go backwards, there is no point in having GetOldestXmin try to make
any guarantees not to do so, especially if that comes with a loss of
VACUUM's effectiveness.

I absolutely hate to suggest it, but what about storing the last
vacuum's xmin horizon in the main table's pg_class.options in the back
branches? During copy_heap_data the assumed xmin horizon can be

Max(OldestXmin, main_table_last_vacuum_horizon, toast_table_last_vacuum_horizon)

That should solve the problem, right?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> I absolutely hate to suggest it, but what about storing the last
> vacuum's xmin horizon in the main table's pg_class.options in the back
> branches?

Not workable.  This would require a non-in-place update of the table's
pg_class row (at least in cases where the option wasn't stored already).
Which would require VACUUM to have an XID, which would make a whole lot
of assumptions fall over.

			regards, tom lane
On 2013-02-04 11:52:05 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > I absolutely hate to suggest it, but what about storing the last
> > vacuum's xmin horizon in the main table's pg_class.options in the back
> > branches?
>
> Not workable.  This would require a non-in-place update of the table's
> pg_class row (at least in cases where the option wasn't stored already).
> Which would require VACUUM to have an XID, which would make a whole lot
> of assumptions fall over.

Don't get me wrong, I hate that solution, but that specific issue seems
relatively easy to solve by pre-creating the option initially in
vacuum_rel(), in a separate transaction, before entering
lazy_vacuum_rel().

I unfortunately don't yet see a robust way without storing the last used
horizon :(.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 4 February 2013 17:02, Andres Freund <andres@2ndquadrant.com> wrote:
> I unfortunately don't yet see a robust way without storing the last used
> horizon :(.

We can't go backwards, but we can go forwards.

We can move the next xid forwards by an amount equal to the increase in
vacuum_defer_cleanup_age.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-02-04 17:21:50 +0000, Simon Riggs wrote:
> On 4 February 2013 17:02, Andres Freund <andres@2ndquadrant.com> wrote:
> > I unfortunately don't yet see a robust way without storing the last used
> > horizon :(.
>
> We can't go backwards, but we can go forwards.
>
> We can move the next xid forwards by an amount equal to the increase
> in vacuum_defer_cleanup_age.

Don't think that helps; the problem is not new writes but already
removed rows in the toast table.

Besides, advancing the next xid by vacuum_defer_cleanup_age every
restart could get expensive, unless we find a space to store the old
value, which we haven't really got in the back branches (control
file?)...

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services