Thread: GetOldestXmin going backwards is dangerous after all
I've been able to reproduce the problem reported by Pius Chan in bug #7819. With some logging added that prints the OldestXmin values used by vacuum and cluster operations, the reason is fairly clear:

2013-02-01 13:41:12 EST 8011 LOG: vacuuming "test" with OldestXmin 1760160 FreezeLimit 4246727456
2013-02-01 13:41:14 EST 8011 LOG: automatic vacuum of table "test.public.test": index scans: 1 pages: 0 removed, 635 remain tuples: 48895 removed, 91 remain system usage: CPU 0.00s/0.03u sec elapsed 1.64 sec
2013-02-01 13:41:14 EST 8011 LOG: automatic analyze of table "test.public.test" system usage: CPU 0.00s/0.00u sec elapsed 0.09 sec
2013-02-01 13:41:14 EST 8011 LOG: vacuuming "pg_toast_115435" with OldestXmin 1761518 FreezeLimit 4246728814
2013-02-01 13:41:57 EST 8011 LOG: automatic vacuum of table "test.pg_toast.pg_toast_115435": index scans: 1 pages: 0 removed, 12904 remain tuples: 7624 removed, 43992 remain system usage: CPU 0.21s/0.13u sec elapsed 42.79 sec
2013-02-01 13:41:58 EST 7025 LOG: clustering "test" with OldestXmin 1752719 FreezeXid 1630979
2013-02-01 13:41:58 EST 7025 ERROR: missing chunk number 0 for toast value 215828 in pg_toast_115435

IOW, autovacuum worker 8011 vacuumed "test" with one xmin, and then a bit later vacuumed its toast table with a slightly newer xmin. This means that certain tuples' toast tuples could get cleaned out of the toast table while the parent tuples are still there in the heap. Normally this is fine because the parent tuples are certainly dead and beyond snapshot range, so no one will make any attempt to fetch their toast values. But then, here comes CLUSTER with an older xmin --- so it thinks it needs to copy over some of those dead-but-possibly-still-visible-to-somebody tuples, and kaboom.

Now, CLUSTER is very careful to make sure that it gets a lock on the toast table before computing OldestXmin, with the express intent of preventing this type of problem (cf comment in copy_heap_data).
However, if the result of GetOldestXmin() can ever go backwards, that doesn't help. And guess what ... it can. The header comment for GetOldestXmin() contains a long rationalization for why it's safe for the result to sometimes go backwards, but now that I've seen this example I'm pretty sure that that's all horsepucky.

I think we need to either ensure that GetOldestXmin is strictly nondecreasing, or find some more-direct way of interlocking vacuuming of toast tables with their parents. We could possibly do the former by tracking the last-known value in shared memory, but it'd be complicated because the value can (legitimately) vary depending on which database you ask about. The only backpatchable way I can think of to do the latter is to undo the decoupling of main and toast-table vacuuming, and then say that we simply don't change the value of OldestXmin when going to vacuum the toast table. This is a bit unpleasant.

In any case, I no longer have much faith in the idea that letting GetOldestXmin go backwards is really safe.

The one bright spot is that AFAICS this isn't a true data loss hazard; no possibly-still-needed data can be lost. Any failures would be transient failures of CLUSTER or VACUUM FULL.

Thoughts?

regards, tom lane
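Tom's first option --- tracking the last-known value so the published horizon is strictly nondecreasing --- can be sketched in miniature. This is a standalone model, not backend code: `last_known_xmin` stands in for the per-database slot in shared memory that a real fix would need (which is exactly the complication noted above), and the comparison follows PostgreSQL's modulo-2^32 XID ordering.

```c
#include <stdint.h>

typedef uint32_t TransactionId;

/* Modulo-2^32 ordering, as in TransactionIdPrecedes(): true if id1 < id2. */
static int
xid_precedes(TransactionId id1, TransactionId id2)
{
    return (int32_t) (id1 - id2) < 0;
}

/* Hypothetical last-known horizon; a real fix would need one slot per
 * database in shared memory. */
static TransactionId last_known_xmin = 0;

/* Clamp a freshly computed OldestXmin so the published value never
 * moves backwards within this model. */
static TransactionId
clamp_oldest_xmin(TransactionId computed)
{
    if (last_known_xmin != 0 && xid_precedes(computed, last_known_xmin))
        computed = last_known_xmin;     /* would have gone backwards */
    last_known_xmin = computed;
    return computed;
}
```

With the values from the log above, the vacuum of "test" publishes 1760160, the toast-table vacuum advances that to 1761518, and CLUSTER's later computation of 1752719 would be clamped up to 1761518 instead of regressing.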
On Fri, Feb 1, 2013 at 2:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> In any case, I no longer have much faith in the idea that letting
> GetOldestXmin go backwards is really safe.

That is admittedly kind of weird behavior, but I think one could equally blame this on CLUSTER. This is hardly the first time we've had to patch CLUSTER's handling of TOAST tables (cf commits 21b446dd0927f8f2a187d9461a0d3f11db836f77, 7b0d0e9356963d5c3e4d329a917f5fbb82a2ef05, 83b7584944b3a9df064cccac06822093f1a83793) and it doesn't seem unlikely that we might go the full ten rounds.

Having said that, I agree that a fix in GetOldestXmin() would be nice if we could find one, but since the comment describes at least three different ways the value can move backwards, I'm not sure that there's really a practical solution there, especially if you want something we can back-patch. I'm more inclined to think that we need to find a way to make CLUSTER more robust against xmin regressions, but I confess I don't have any good ideas about how to do that, either.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Feb 1, 2013 at 2:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In any case, I no longer have much faith in the idea that letting
>> GetOldestXmin go backwards is really safe.

> That is admittedly kind of weird behavior, but I think one could
> equally blame this on CLUSTER. This is hardly the first time we've
> had to patch CLUSTER's handling of TOAST tables (cf commits
> 21b446dd0927f8f2a187d9461a0d3f11db836f77,
> 7b0d0e9356963d5c3e4d329a917f5fbb82a2ef05,
> 83b7584944b3a9df064cccac06822093f1a83793) and it doesn't seem unlikely
> that we might go the full ten rounds.

Yeah, but I'm not sure whether CLUSTER is the appropriate blamee or whether it's more like the canary in the coal mine, first to expose problematic behaviors elsewhere. The general problem here is really that we're cleaning out toast tuples while the referencing main-heap tuple still physically exists. How safe do you think that is? That did not ever happen before we decoupled autovacuuming of main and toast tables, either --- so a good case could be made that that idea is fundamentally broken.

> Having said that, I agree that a fix in GetOldestXmin() would be nice
> if we could find one, but since the comment describes at least three
> different ways the value can move backwards, I'm not sure that there's
> really a practical solution there, especially if you want something we
> can back-patch.

Well, if we were tracking the latest value in shared memory, we could certainly clamp to that to ensure it didn't go backwards. The problem is where to find storage for a per-DB value. I thought about storing each session's latest value in its PGPROC and taking the max over same-DB sessions during GetOldestXmin's ProcArray scan, but that doesn't work because an autovacuum process might disappear and thus destroy the needed info just before CLUSTER looks for it.

regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
> Having said that, I agree that a fix in GetOldestXmin() would be nice
> if we could find one, but since the comment describes at least three
> different ways the value can move backwards, I'm not sure that there's
> really a practical solution there, especially if you want something we
> can back-patch.

Actually, wait a second. As you say, the comment describes three known ways to make it go backwards. It strikes me that all three are fixable:

* if allDbs is FALSE and there are no transactions running in the current
* database, GetOldestXmin() returns latestCompletedXid. If a transaction
* begins after that, its xmin will include in-progress transactions in other
* databases that started earlier, so another call will return a lower value.

The reason this is a problem is that GetOldestXmin ignores XIDs of processes that are connected to other DBs. It now seems to me that this is a flat-out bug. It can ignore their xmins, but it should include their XIDs, because the point of considering those XIDs is that they may contribute to the xmins of snapshots computed in the future by processes in our own DB. And snapshots never exclude any XIDs on the basis of which DB they're in. (They can't really, since we can't know when the snap is taken whether it might be used to examine shared catalogs.)

* There are also replication-related effects: a walsender
* process can set its xmin based on transactions that are no longer running
* in the master but are still being replayed on the standby, thus possibly
* making the GetOldestXmin reading go backwards. In this case there is a
* possibility that we lose data that the standby would like to have, but
* there is little we can do about that --- data is only protected if the
* walsender runs continuously while queries are executed on the standby.
* (The Hot Standby code deals with such cases by failing standby queries
* that needed to access already-removed data, so there's no integrity bug.)
This is just bogus. Why don't we make it a requirement on walsenders that they never move their advertised xmin backwards (or initially set it to less than the prevailing global xmin)? There's no real benefit to allowing them to try to move the global xmin backwards, because any data that they might hope to protect that way could be gone already.

* The return value is also adjusted with vacuum_defer_cleanup_age, so
* increasing that setting on the fly is another easy way to make
* GetOldestXmin() move backwards, with no consequences for data integrity.

And as for that, it's been pretty clear for awhile that allowing vacuum_defer_cleanup_age to change on the fly was a bad idea we'd eventually have to undo. The day of reckoning has arrived: it needs to be PGC_POSTMASTER.

regards, tom lane
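The proposed repair of the first case --- keep ignoring other-database xmins but stop ignoring other-database XIDs --- can be modeled with a toy scan. The `ProcSlot` struct and the function below are illustrative stand-ins for the ProcArray and GetOldestXmin(), not the real code, and wraparound of latestCompletedXid is deliberately glossed over.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t TransactionId;
typedef uint32_t Oid;

#define InvalidTransactionId ((TransactionId) 0)

/* Modulo-2^32 ordering, per TransactionIdPrecedes(). */
static int
xid_precedes(TransactionId id1, TransactionId id2)
{
    return (int32_t) (id1 - id2) < 0;
}

/* Illustrative stand-in for one ProcArray entry. */
typedef struct
{
    Oid           dbid;  /* database the backend is connected to */
    TransactionId xid;   /* its own transaction, if any */
    TransactionId xmin;  /* oldest xid its snapshot can still see */
} ProcSlot;

/* Sketch of the fixed scan: other-DB xmins stay ignored, but other-DB
 * XIDs are folded in, because they can reappear in the xmin of a
 * snapshot taken later by a backend in our own database. */
static TransactionId
oldest_xmin(const ProcSlot *slots, size_t n, Oid mydb,
            TransactionId latest_completed)
{
    TransactionId result = latest_completed + 1;   /* ignoring wraparound */

    for (size_t i = 0; i < n; i++)
    {
        if (slots[i].xid != InvalidTransactionId &&
            xid_precedes(slots[i].xid, result))
            result = slots[i].xid;                 /* any database */

        if (slots[i].dbid == mydb &&
            slots[i].xmin != InvalidTransactionId &&
            xid_precedes(slots[i].xmin, result))
            result = slots[i].xmin;                /* our database only */
    }
    return result;
}
```

With no transactions in our own database but XID 90 running in another one, this returns 90 rather than latestCompletedXid + 1, so a later call cannot come back lower once that transaction shows up in some same-DB snapshot's xmin.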
On 1 February 2013 23:56, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Well, if we were tracking the latest value in shared memory, we could
> certainly clamp to that to ensure it didn't go backwards. The problem
> is where to find storage for a per-DB value.

Adding new data columns to catalogs in backbranches seems like a great reason to have an hstore column on every catalog table. That way we can just add anything we need without causing other problems. Obviously something for the future.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 1 February 2013 23:56, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Feb 1, 2013 at 2:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> In any case, I no longer have much faith in the idea that letting
>>> GetOldestXmin go backwards is really safe.
>
>> That is admittedly kind of weird behavior, but I think one could
>> equally blame this on CLUSTER. This is hardly the first time we've
>> had to patch CLUSTER's handling of TOAST tables (cf commits
>> 21b446dd0927f8f2a187d9461a0d3f11db836f77,
>> 7b0d0e9356963d5c3e4d329a917f5fbb82a2ef05,
>> 83b7584944b3a9df064cccac06822093f1a83793) and it doesn't seem unlikely
>> that we might go the full ten rounds.
>
> Yeah, but I'm not sure whether CLUSTER is the appropriate blamee or
> whether it's more like the canary in the coal mine, first to expose
> problematic behaviors elsewhere. The general problem here is really
> that we're cleaning out toast tuples while the referencing main-heap
> tuple still physically exists. How safe do you think that is?

Agree that CLUSTER is just first to hit the problem. As long as the problem exists, we have issues.

> That
> did not ever happen before we decoupled autovacuuming of main and toast
> tables, either --- so a good case could be made that that idea is
> fundamentally broken.

It's broken, but not fundamentally. We can decouple the autovacuuming, as long as we don't decouple the xid used.

"Fixing" GetOldestXmin() seems like the easiest thing to do for backbranches, but seems a little heavy-handed for a general fix - though doing that will probably close any future thinkos of similar nature.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-02-01 19:24:02 -0500, Tom Lane wrote:
> * There are also replication-related effects: a walsender
> * process can set its xmin based on transactions that are no longer running
> * in the master but are still being replayed on the standby, thus possibly
> * making the GetOldestXmin reading go backwards. In this case there is a
> * possibility that we lose data that the standby would like to have, but
> * there is little we can do about that --- data is only protected if the
> * walsender runs continuously while queries are executed on the standby.
> * (The Hot Standby code deals with such cases by failing standby queries
> * that needed to access already-removed data, so there's no integrity bug.)
>
> This is just bogus. Why don't we make it a requirement on walsenders
> that they never move their advertised xmin backwards (or initially set
> it to less than the prevailing global xmin)? There's no real benefit to
> allowing them to try to move the global xmin backwards, because any data
> that they might hope to protect that way could be gone already.

The problem imo has multiple sides.

a) We currently don't have a way to assign the minimal valid xmin to the current backend. I already need that for logical replication, so we might just as well add it now: http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=commitdiff;h=84f6ed9550b65b0584589977877e9a997daeb61d With that it's just a matter of taking ProcArrayLock exclusively before calling GetOldestXmin and assigning it to MyPgXact->xmin before releasing it.

b) We don't assign the xmin early enough; we only set it when the first feedback message arrives, but we should set it when walsender starts streaming.

c) After a disconnect the feedback message will rather likely ask for an xmin horizon that's not valid anymore on the primary. If the disconnect was short enough, often enough that doesn't matter because nothing has been cleared out, but it doesn't really work all that well.
That's still better than setting it to the currently valid minimal xmin horizon, because it prevents cleanup from that moment on. I don't see how this can be significantly improved without persistent knowledge about standbys.

> * The return value is also adjusted with vacuum_defer_cleanup_age, so
> * increasing that setting on the fly is another easy way to make
> * GetOldestXmin() move backwards, with no consequences for data integrity.
>
> And as for that, it's been pretty clear for awhile that allowing
> vacuum_defer_cleanup_age to change on the fly was a bad idea we'd
> eventually have to undo. The day of reckoning has arrived: it needs
> to be PGC_POSTMASTER.

Not sure if that would really solve anything; similar problems seem to exist even if the postmaster has been restarted in between.

Btw, I wonder if that comment in GetOldestXmin still has good justification, now that we allow cascading standbys:

> Also note that we intentionally don't apply vacuum_defer_cleanup_age
> on standby servers.

Greetings, Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Feb 1, 2013 at 7:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Having said that, I agree that a fix in GetOldestXmin() would be nice
>> if we could find one, but since the comment describes at least three
>> different ways the value can move backwards, I'm not sure that there's
>> really a practical solution there, especially if you want something we
>> can back-patch.
>
> Actually, wait a second. As you say, the comment describes three known
> ways to make it go backwards. It strikes me that all three are fixable:
>
> * if allDbs is FALSE and there are no transactions running in the current
> * database, GetOldestXmin() returns latestCompletedXid. If a transaction
> * begins after that, its xmin will include in-progress transactions in other
> * databases that started earlier, so another call will return a lower value.
>
> The reason this is a problem is that GetOldestXmin ignores XIDs of
> processes that are connected to other DBs. It now seems to me that this
> is a flat-out bug. It can ignore their xmins, but it should include
> their XIDs, because the point of considering those XIDs is that they may
> contribute to the xmins of snapshots computed in the future by processes
> in our own DB. And snapshots never exclude any XIDs on the basis of
> which DB they're in. (They can't really, since we can't know when the
> snap is taken whether it might be used to examine shared catalogs.)

This idea is superficially appealing, but I can't get excited about it, because it will cause GetOldestXmin() to compute older (that is, more conservative) XIDs than it does today, which means we'll vacuum away less. Considering the amount of pain we have around vacuum today, that's surely got to be going the wrong direction. I'm inclined to think that the root of the problem is that the computation is too conservative already.
I mean, what do we really need to include in the computation in order not to vacuum away any data that's still needed? There are basically two concerns:

- You can't remove data that any snapshot in the current database might try to read. So you have to include the xmins of those snapshots in your oldest-xmin value. But, not really, because those xmins were computed by looking at all running XIDs across the whole database, and if their oldest xmin value came from some XID in another database, it's not relevant. What you want (but of course, can't get) is the db-specific xmin horizon for each other snapshot.

- You have to worry about new snapshots that are about to appear but haven't yet. But, really, how big of a concern is this? No transactions can exit the running set while we hold ProcArrayLock, so it seems to me that there ought to be a pretty tight limit on what can show up here. For example, if we've already taken a snapshot, surely a subsequently taken snapshot can't really need an older xmin horizon than the one we computed for our own snapshot.

In theory, it seems like we might be able to solve these problems by having each backend advertise TWO xmin horizons: one representing the oldest xmin it might be able to see in a system catalog, and the other representing the oldest xmin it might be able to see in its own database. This would allow significantly more precise tracking of what is really needed in the current database, but there's an obvious implementation concern, which is that GetSnapshotData is hot enough that individual instructions matter.

> * There are also replication-related effects: a walsender
> * process can set its xmin based on transactions that are no longer running
> * in the master but are still being replayed on the standby, thus possibly
> * making the GetOldestXmin reading go backwards. In this case there is a
> * possibility that we lose data that the standby would like to have, but
> * there is little we can do about that --- data is only protected if the
> * walsender runs continuously while queries are executed on the standby.
> * (The Hot Standby code deals with such cases by failing standby queries
> * that needed to access already-removed data, so there's no integrity bug.)
>
> This is just bogus. Why don't we make it a requirement on walsenders
> that they never move their advertised xmin backwards (or initially set
> it to less than the prevailing global xmin)? There's no real benefit to
> allowing them to try to move the global xmin backwards, because any data
> that they might hope to protect that way could be gone already.

I think there is already some code that aims to do this, but I might be misremembering. IOW, this comment might already be incorrect.

> * The return value is also adjusted with vacuum_defer_cleanup_age, so
> * increasing that setting on the fly is another easy way to make
> * GetOldestXmin() move backwards, with no consequences for data integrity.
>
> And as for that, it's been pretty clear for awhile that allowing
> vacuum_defer_cleanup_age to change on the fly was a bad idea we'd
> eventually have to undo. The day of reckoning has arrived: it needs
> to be PGC_POSTMASTER.

We might even want to consider getting rid of it altogether at this point, but neither that nor a change to PGC_POSTMASTER seems a likely candidate for back-patching.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
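The two-horizon thought-experiment can be sketched the same way as the other proposals in the thread. Everything here is hypothetical --- no such per-backend advertisement exists in the tree --- but it shows why tracking a catalog horizon and a database horizon separately would let vacuum be *less* conservative for ordinary tables, not more.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t TransactionId;
typedef uint32_t Oid;

/* Modulo-2^32 ordering, per TransactionIdPrecedes(). */
static int
xid_precedes(TransactionId id1, TransactionId id2)
{
    return (int32_t) (id1 - id2) < 0;
}

/* Hypothetical per-backend advertisement: two horizons instead of one. */
typedef struct
{
    Oid           dbid;
    TransactionId catalog_xmin;  /* oldest xid it might see in shared catalogs */
    TransactionId db_xmin;       /* oldest xid it might see in its own database */
} BackendHorizons;

/* Oldest xmin relevant to vacuuming a given relation: a shared catalog
 * must respect every backend's catalog horizon, while an ordinary table
 * only needs to respect backends in the same database. */
static TransactionId
oldest_for_relation(const BackendHorizons *b, size_t n,
                    Oid reldb, int rel_is_shared,
                    TransactionId latest_completed)
{
    TransactionId result = latest_completed + 1;   /* ignoring wraparound */

    for (size_t i = 0; i < n; i++)
    {
        TransactionId h = rel_is_shared ? b[i].catalog_xmin : b[i].db_xmin;

        if (!rel_is_shared && b[i].dbid != reldb)
            continue;                    /* other DB: irrelevant for this table */
        if (h != 0 && xid_precedes(h, result))
            result = h;
    }
    return result;
}
```

Given one backend per database with catalog horizons 50 and 60 but db horizons 80 and 70, vacuuming an ordinary table in the first database could use 80, while a shared catalog would still have to use 50.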
On Fri, Feb 1, 2013 at 6:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> That is admittedly kind of weird behavior, but I think one could
>> equally blame this on CLUSTER. This is hardly the first time we've
>> had to patch CLUSTER's handling of TOAST tables (cf commits
>> 21b446dd0927f8f2a187d9461a0d3f11db836f77,
>> 7b0d0e9356963d5c3e4d329a917f5fbb82a2ef05,
>> 83b7584944b3a9df064cccac06822093f1a83793) and it doesn't seem unlikely
>> that we might go the full ten rounds.
>
> Yeah, but I'm not sure whether CLUSTER is the appropriate blamee or
> whether it's more like the canary in the coal mine, first to expose
> problematic behaviors elsewhere. The general problem here is really
> that we're cleaning out toast tuples while the referencing main-heap
> tuple still physically exists. How safe do you think that is? That
> did not ever happen before we decoupled autovacuuming of main and toast
> tables, either --- so a good case could be made that that idea is
> fundamentally broken.

I confess I'm not entirely sanguine about that. A similar issue arises with index tuples referencing heap tuples, and we go to great lengths to make sure that the last vestiges of the heap tuple can't be removed until the index tuples are well and truly dead. If we were to adopt the same approach here, it would mean that, instead of consulting XIDs at all, we'd remove TOAST pointers only when there were provably no main-table tuples pointing to them. I'm not really proposing that, but it's an interesting thought-experiment. What we're actually doing is relying on values to match exactly that were never intended as more than an approximation.

Can you point me to the commit that decoupled this? What was the motivation for it?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2 February 2013 14:24, Andres Freund <andres@2ndquadrant.com> wrote:
> b) We don't assign the xmin early enough, we only set it when the first
> feedback message arrives, but we should set it when walsender starts
> streaming.

That's easy to fix.

> c) After a disconnect the feedback message will rather likely ask for an
> xmin horizon thats not valid anymore on the primary. If the disconnect
> was short enough often enough that doesn't matter because nothing has
> been cleared out, but it doesn't really work all that well.
> Thats still better than setting it to the currently valid minimal xmin
> horizon because it prevents cleanup from that moment on.
> I don't see how this can be significantly improved without persistent
> knowledge about standbys.

We could delay startup of the standby until the xmin on the standby reaches the xmin on the master.

So when the standby has hot_standby_feedback = on, at standby connection we set the xmin of the walsender to be the current value on the master, then we disallow connections on standby until we have replayed up to that xmin on the standby. That way the xmin of the walsender never goes backwards nor do we get cancelations on the standby.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2 February 2013 00:24, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> * The return value is also adjusted with vacuum_defer_cleanup_age, so
> * increasing that setting on the fly is another easy way to make
> * GetOldestXmin() move backwards, with no consequences for data integrity.
>
> And as for that, it's been pretty clear for awhile that allowing
> vacuum_defer_cleanup_age to change on the fly was a bad idea we'd
> eventually have to undo. The day of reckoning has arrived: it needs
> to be PGC_POSTMASTER.

Agreed, will fix.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
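For reference, the adjustment being discussed is essentially a subtraction guarded against landing on the special XIDs, along the lines of what GetOldestXmin() does (this is a paraphrase, not the verbatim source). Since nothing remembers the pre-subtraction value, raising the GUC between two calls simply makes the second result older --- which is the backwards movement at issue.

```c
#include <stdint.h>

typedef uint32_t TransactionId;

#define FirstNormalTransactionId ((TransactionId) 3)

/* Paraphrase of the vacuum_defer_cleanup_age adjustment: push the
 * horizon back by the configured number of XIDs, taking care not to
 * land on the special XIDs below FirstNormalTransactionId. */
static TransactionId
apply_defer_cleanup(TransactionId result, uint32_t defer_age)
{
    result -= defer_age;                   /* wraps modulo 2^32 */
    if (result < FirstNormalTransactionId)
        result = FirstNormalTransactionId; /* landed on a special XID */
    return result;
}
```

Raising defer_age from 0 to 10000 between two calls yields a result roughly 10000 XIDs older even if no transactions ended in between; making the GUC PGC_POSTMASTER removes the on-the-fly change, though (as noted later in the thread) not the restart case.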
On 2 February 2013 00:24, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> * if allDbs is FALSE and there are no transactions running in the current
> * database, GetOldestXmin() returns latestCompletedXid. If a transaction
> * begins after that, its xmin will include in-progress transactions in other
> * databases that started earlier, so another call will return a lower value.
>
> The reason this is a problem is that GetOldestXmin ignores XIDs of
> processes that are connected to other DBs. It now seems to me that this
> is a flat-out bug. It can ignore their xmins, but it should include
> their XIDs, because the point of considering those XIDs is that they may
> contribute to the xmins of snapshots computed in the future by processes
> in our own DB. And snapshots never exclude any XIDs on the basis of
> which DB they're in. (They can't really, since we can't know when the
> snap is taken whether it might be used to examine shared catalogs.)

Agree that's a bug.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-02-02 18:32:44 +0000, Simon Riggs wrote:
> On 2 February 2013 14:24, Andres Freund <andres@2ndquadrant.com> wrote:
>
> > b) We don't assign the xmin early enough, we only set it when the first
> > feedback message arrives, but we should set it when walsender starts
> > streaming.
>
> That's easy to fix.

Not trivial, but not too hard, yes. When the standby initially connects we don't yet know which xid will be required because consistency hasn't yet been achieved.

> > c) After a disconnect the feedback message will rather likely ask for an
> > xmin horizon thats not valid anymore on the primary. If the disconnect
> > was short enough often enough that doesn't matter because nothing has
> > been cleared out, but it doesn't really work all that well.
> > Thats still better than setting it to the currently valid minimal xmin
> > horizon because it prevents cleanup from that moment on.
> > I don't see how this can be significantly improved without persistent
> > knowledge about standbys.
>
> We could delay startup of the standby until the xmin on the standby
> reaches the xmin on the master.
>
> So when the standby has hot_standby_feedback = on, at standby
> connection we set the xmin of the walsender to be the current value on
> the master, then we disallow connections on standby until we have
> replayed up to that xmin on the standby. That way the xmin of the
> walsender never goes backwards nor do we get cancelations on the
> standby.

That's easy enough when the standby is initially started, but doesn't work that well if just the connection between both failed (or the master was restarted) and a reconnect worked. To make it work similarly in that case we would have to throw everyone out, which would kind of counteract the whole idea.

Greetings, Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2 February 2013 18:38, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-02-02 18:32:44 +0000, Simon Riggs wrote:
>> On 2 February 2013 14:24, Andres Freund <andres@2ndquadrant.com> wrote:
>>
>> > b) We don't assign the xmin early enough, we only set it when the first
>> > feedback message arrives, but we should set it when walsender starts
>> > streaming.
>>
>> That's easy to fix.
>
> Not trivial, but not too hard, yes. When the standby initially connects
> we don't yet know which xid will be required because consistency hasn't
> yet been achieved.

We don't need to change the protocol, we just need to issue a hot_standby_feedback message immediately after connection. That looks trivial to me.

>> > c) After a disconnect the feedback message will rather likely ask for an
>> > xmin horizon thats not valid anymore on the primary. If the disconnect
>> > was short enough often enough that doesn't matter because nothing has
>> > been cleared out, but it doesn't really work all that well.
>> > Thats still better than setting it to the currently valid minimal xmin
>> > horizon because it prevents cleanup from that moment on.
>> > I don't see how this can be significantly improved without persistent
>> > knowledge about standbys.
>>
>> We could delay startup of the standby until the xmin on the standby
>> reaches the xmin on the master.
>>
>> So when the standby has hot_standby_feedback = on, at standby
>> connection we set the xmin of the walsender to be the current value on
>> the master, then we disallow connections on standby until we have
>> replayed up to that xmin on the standby. That way the xmin of the
>> walsender never goes backwards nor do we get cancelations on the
>> standby.
>
> Thats easy enough when the standby is initially started but doesn't work
> that well if just the connection between both failed (or the master was
> restarted) and a reconnect worked. To make it work similarly in that
> case we would have to throw everyone out which would kind of counteract
> the whole idea.

Agreed. It was worth considering, at least.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-02-01 19:24:02 -0500, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > Having said that, I agree that a fix in GetOldestXmin() would be nice
> > if we could find one, but since the comment describes at least three
> > different ways the value can move backwards, I'm not sure that there's
> > really a practical solution there, especially if you want something we
> > can back-patch.
>
> Actually, wait a second. As you say, the comment describes three known
> ways to make it go backwards. It strikes me that all three are fixable:
>
> * if allDbs is FALSE and there are no transactions running in the current
> * database, GetOldestXmin() returns latestCompletedXid. If a transaction
> * begins after that, its xmin will include in-progress transactions in other
> * databases that started earlier, so another call will return a lower value.
>
> The reason this is a problem is that GetOldestXmin ignores XIDs of
> processes that are connected to other DBs. It now seems to me that this
> is a flat-out bug. It can ignore their xmins, but it should include
> their XIDs, because the point of considering those XIDs is that they may
> contribute to the xmins of snapshots computed in the future by processes
> in our own DB. And snapshots never exclude any XIDs on the basis of
> which DB they're in. (They can't really, since we can't know when the
> snap is taken whether it might be used to examine shared catalogs.)
>
> * The return value is also adjusted with vacuum_defer_cleanup_age, so
> * increasing that setting on the fly is another easy way to make
> * GetOldestXmin() move backwards, with no consequences for data integrity.
>
> And as for that, it's been pretty clear for awhile that allowing
> vacuum_defer_cleanup_age to change on the fly was a bad idea we'd
> eventually have to undo. The day of reckoning has arrived: it needs
> to be PGC_POSTMASTER.

ISTM that the original problem can still occur, even after Simon's commit.
1) start with -c vacuum_defer_cleanup_age=0
2) autovacuum vacuums "test";
3) restart with -c vacuum_defer_cleanup_age=10000
4) autovacuum vacuums "test"'s toast table;

should result in about the same ERROR, shouldn't it?

Given that there seemingly isn't yet a way to fix that people agree on, and that it "only" results in a transient error, I think the fix for this should be pushed after the next point release.

Greetings, Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
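The interleaving in those four steps reduces to a one-line predicate: trouble is possible whenever the horizon CLUSTER (or VACUUM FULL) computes precedes the horizon the toast table was last vacuumed with. A minimal model of that condition, with an invented function name:

```c
#include <stdint.h>

typedef uint32_t TransactionId;

/* Modulo-2^32 ordering, per TransactionIdPrecedes(). */
static int
xid_precedes(TransactionId id1, TransactionId id2)
{
    return (int32_t) (id1 - id2) < 0;
}

/* True when CLUSTER may try to copy a dead-but-not-yet-prunable heap
 * tuple whose toast chunks were already removed --- the "missing chunk
 * number 0" condition from the original report.  The restart scenario
 * triggers it because raising vacuum_defer_cleanup_age shrinks
 * cluster_xmin by the new defer age. */
static int
cluster_at_risk(TransactionId toast_vacuum_xmin, TransactionId cluster_xmin)
{
    return xid_precedes(cluster_xmin, toast_vacuum_xmin);
}
```

With the numbers from the original log, cluster_at_risk(1761518, 1752719) holds, matching the observed failure.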
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-02-01 19:24:02 -0500, Tom Lane wrote:
>> And as for that, it's been pretty clear for awhile that allowing
>> vacuum_defer_cleanup_age to change on the fly was a bad idea we'd
>> eventually have to undo. The day of reckoning has arrived: it needs
>> to be PGC_POSTMASTER.

> ISTM that the original problem can still occur, even after Simon's
> commit.
> 1) start with -c vacuum_defer_cleanup_age=0
> 2) autovacuum vacuums "test";
> 3) restart with -c vacuum_defer_cleanup_age=10000
> 4) autovacuum vacuums "test"'s toast table;

> should result in about the same ERROR, shouldn't it?

Hm ... yeah, you're right. So that's still not bulletproof.

> Given that there seemingly isn't yet a way to fix that people agree on
> and that it "only" result in a transient error I think the fix for this
> should be pushed after the next point release.

Agreed, we can let this go until we have a more complete solution.

Simon, would you revert the vacuum_defer_cleanup_age changes? We should wait till we have a complete fix before forcing that redefinition on users.

regards, tom lane
On 4 February 2013 16:07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon, would you revert the vacuum_defer_cleanup_age changes?

Will do

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-02-04 11:07:17 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2013-02-01 19:24:02 -0500, Tom Lane wrote:
> >> And as for that, it's been pretty clear for awhile that allowing
> >> vacuum_defer_cleanup_age to change on the fly was a bad idea we'd
> >> eventually have to undo.  The day of reckoning has arrived: it needs
> >> to be PGC_POSTMASTER.
>
> > ISTM that the original problem can still occur, even after Simon's
> > commit.
> > 1) start with -c vacuum_defer_cleanup_age=0
> > 2) autovacuum vacuums "test";
> > 3) restart with -c vacuum_defer_cleanup_age=10000
> > 4) autovacuum vacuums "test"'s toast table;
>
> > should result in about the same ERROR, shouldn't it?
>
> Hm ... yeah, you're right.  So that's still not bulletproof.

ISTM that, unless we can guarantee that this won't cause GetOldestXmin
to go backwards, there is no point in having GetOldestXmin try to make
any guarantees not to do so, especially if that comes with a loss of
VACUUM's effectiveness.

I absolutely hate to suggest it, but what about storing the last
vacuum's xmin horizon in the main table's pg_class.options in the back
branches? During copy_heap_data the assumed xmin horizon can be

Max(OldestXmin, main_table_last_vacuum_horizon, toast_table_last_vacuum_horizon)

That should solve the problem, right?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> I absolutely hate to suggest it, but what about storing the last
> vacuum's xmin horizon in the main table's pg_class.options in the back
> branches?

Not workable.  This would require a non-in-place update of the table's
pg_class row (at least in cases where the option wasn't stored already).
Which would require VACUUM to have an XID, which would make a whole lot
of assumptions fall over.

			regards, tom lane
On 2013-02-04 11:52:05 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > I absolutely hate to suggest it, but what about storing the last
> > vacuum's xmin horizon in the main table's pg_class.options in the back
> > branches?
>
> Not workable.  This would require a non-in-place update of the table's
> pg_class row (at least in cases where the option wasn't stored already).
> Which would require VACUUM to have an XID, which would make a whole lot
> of assumptions fall over.

Don't get me wrong, I hate that solution, but that specific issue seems
relatively easy to solve by pre-creating the option initially in
vacuum_rel(), in a separate transaction, before entering
lazy_vacuum_rel().

I unfortunately don't yet see a robust way without storing the last used
horizon :(.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 4 February 2013 17:02, Andres Freund <andres@2ndquadrant.com> wrote:
> I unfortunately don't yet see a robust way without storing the last used
> horizon :(.

We can't go backwards, but we can go forwards.

We can move the next xid forwards by an amount equal to the increase in
vacuum_defer_cleanup_age.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-02-04 17:21:50 +0000, Simon Riggs wrote:
> On 4 February 2013 17:02, Andres Freund <andres@2ndquadrant.com> wrote:
> > I unfortunately don't yet see a robust way without storing the last used
> > horizon :(.
>
> We can't go backwards, but we can go forwards.
>
> We can move the next xid forwards by an amount equal to the increase
> in vacuum_defer_cleanup_age.

Don't think that helps; the problem is not new writes but already
removed rows in the toast table.

Besides, advancing the next xid by vacuum_defer_cleanup_age every
restart could get expensive, unless we find a space to store the old
value, which we haven't really got in the back branches (control
file?)...

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services