Re: GetOldestXmin going backwards is dangerous after all - Mailing list pgsql-hackers

From Robert Haas
Subject Re: GetOldestXmin going backwards is dangerous after all
Date
Msg-id CA+TgmobX8fO1mXXWMbr_TpvHyVH9K2i5xjz=Xf48CxcWjD1UrA@mail.gmail.com
Whole thread Raw
In response to Re: GetOldestXmin going backwards is dangerous after all  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Fri, Feb 1, 2013 at 7:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Having said that, I agree that a fix in GetOldestXmin() would be nice
>> if we could find one, but since the comment describes at least three
>> different ways the value can move backwards, I'm not sure that there's
>> really a practical solution there, especially if you want something we
>> can back-patch.
>
> Actually, wait a second.  As you say, the comment describes three known
> ways to make it go backwards.  It strikes me that all three are fixable:
>
>  * if allDbs is FALSE and there are no transactions running in the current
>  * database, GetOldestXmin() returns latestCompletedXid. If a transaction
>  * begins after that, its xmin will include in-progress transactions in other
>  * databases that started earlier, so another call will return a lower value.
>
> The reason this is a problem is that GetOldestXmin ignores XIDs of
> processes that are connected to other DBs.  It now seems to me that this
> is a flat-out bug.  It can ignore their xmins, but it should include
> their XIDs, because the point of considering those XIDs is that they may
> contribute to the xmins of snapshots computed in the future by processes
> in our own DB.  And snapshots never exclude any XIDs on the basis of
> which DB they're in.  (They can't really, since we can't know when the
> snap is taken whether it might be used to examine shared catalogs.)

This idea is superficially appealing, but I can't get excited about
it, because it will cause GetOldestXmin() to compute older (that is,
more conservative) XIDs than it does today, which means we'll vacuum
away less.  Considering the amount of pain we have around vacuum
today, that's surely got to be going the wrong direction.  I'm
inclined to think that the root of the problem is that the computation
is too conservative already.  I mean, what do we really need to
include in the computation in order not to vacuum away any data that's
still needed?  There are basically two concerns:

- You can't remove data that any snapshot in the current database
might try to read.  So you have to include the xmins of those
snapshots in your oldest-xmin value.  But, not really, because those
xmins were computed by looking at all running XIDs across the whole
database, and if their oldest xmin value came from some XID in another
database, it's not relevant.  What you want (but of course, can't get)
is the db-specific xmin horizon for each other snapshot.

- You have to worry about new snapshots that are about to appear but
haven't yet.  But, really, how big of a concern is this?  No
transactions can exit the running set while we hold ProcArrayLock, so
it seems to me that there ought to be a pretty tight limit on what can
show up here.  For example, if we've already taken a snapshot, surely
a subsequently taken snapshot can't really need an older xmin horizon
than the one we computed for our own snapshot.

In theory, it seems like we might be able to solve these problems by
having each backend advertise TWO xmin horizons: one representing the
oldest xmin it might be able to see in a system catalog, and the other
representing the oldest xmin it might be able to see in its own
database.  This would allow significantly more precise tracking of
what is really needed in the current database, but there's an obvious
implementation concern, which is that GetSnapshotData is hot enough
that individual instructions matter.

>  * There are also replication-related effects: a walsender
>  * process can set its xmin based on transactions that are no longer running
>  * in the master but are still being replayed on the standby, thus possibly
>  * making the GetOldestXmin reading go backwards.  In this case there is a
>  * possibility that we lose data that the standby would like to have, but
>  * there is little we can do about that --- data is only protected if the
>  * walsender runs continuously while queries are executed on the standby.
>  * (The Hot Standby code deals with such cases by failing standby queries
>  * that needed to access already-removed data, so there's no integrity bug.)
>
> This is just bogus.  Why don't we make it a requirement on walsenders
> that they never move their advertised xmin backwards (or initially set
> it to less than the prevailing global xmin)?  There's no real benefit to
> allowing them to try to move the global xmin backwards, because any data
> that they might hope to protect that way could be gone already.

I think there is already some code that aims to do this, but I might
be misremembering.  IOW, this comment might already be incorrect.

>  * The return value is also adjusted with vacuum_defer_cleanup_age, so
>  * increasing that setting on the fly is another easy way to make
>  * GetOldestXmin() move backwards, with no consequences for data integrity.
>
> And as for that, it's been pretty clear for awhile that allowing
> vacuum_defer_cleanup_age to change on the fly was a bad idea we'd
> eventually have to undo.  The day of reckoning has arrived: it needs
> to be PGC_POSTMASTER.

We might even want to consider getting rid of it altogether at this
point, but neither that nor a change to PGC_POSTMASTER seems a likely
candidate for back-patching.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Vlad Bailescu
Date:
Subject: Re: Turning auto-analyze off (was Re: [GENERAL] Unusually high IO for autovacuum worker)
Next
From: Robert Haas
Date:
Subject: Re: GetOldestXmin going backwards is dangerous after all