Thread: Toast issues with OldestXmin going backwards
Various comments in GetOldestXmin mention the possibility of the oldest xmin going backward, and assert that this is actually safe. It's not.

Consider:

A table has a toastable column. A row is updated in a way that changes the toasted value. There are now two row versions pointing to different toast values, one live and one dead.

Now suppose the toast table - but not the main table - is vacuumed; the dead toast entries are removed, even though they are still referenced by the dead main-table row. Autovacuum treats the main table and toast table separately, so this can happen.

Now suppose that OldestXmin goes backwards so that the older main table row version is no longer dead, but merely recently-dead.

At this point, VACUUM FULL (or similar rewrites) on the table will fail with "missing chunk number 0 for ..." toast errors, because it tries to copy the recently-dead row, but that row's toasted values have been vacuumed away already.

(I've been working for a while with someone on IRC to track down the source of unexplained semi-repeatable "missing chunk number 0 ..." errors from VACUUM FULL. I don't know at this stage whether this is the actual problem, but it matches the symptoms.)

-- 
Andrew (irc:RhodiumToad)
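For anyone who wants to see the shape of the failure, here is a rough sketch of the sequence (not a tested reproducer - the table name is made up, and the OldestXmin retreat itself can't be forced from plain SQL; it has to come from something like a standby reconnecting with hot_standby_feedback and an older xmin, so that step is only a comment):

  -- a toastable column, with compression disabled so values go out of line
  CREATE TABLE toast_demo (id int PRIMARY KEY, v text);
  ALTER TABLE toast_demo ALTER COLUMN v SET STORAGE EXTERNAL;
  INSERT INTO toast_demo VALUES (1, repeat('x', 10000));

  -- one live and one dead row version, pointing at different toast values
  UPDATE toast_demo SET v = repeat('y', 10000) WHERE id = 1;

  -- vacuum ONLY the toast table (as autovacuum may do); substitute the real
  -- pg_toast_NNN name for this table
  VACUUM pg_toast.pg_toast_NNNNN;

  -- ... now assume OldestXmin retreats, so the old main-table row version is
  -- merely recently-dead instead of dead ...

  VACUUM FULL toast_demo;   -- would fail: missing chunk number 0 for toast value ...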
On Thu, Apr 19, 2018 at 4:07 PM, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
> Various comments in GetOldestXmin mention the possibility of the oldest
> xmin going backward, and assert that this is actually safe. It's not.
>
> Consider:
>
> A table has a toastable column. A row is updated in a way that changes
> the toasted value. There are now two row versions pointing to different
> toast values, one live and one dead.
>
> Now suppose the toast table - but not the main table - is vacuumed; the
> dead toast entries are removed, even though they are still referenced by
> the dead main-table row. Autovacuum treats the main table and toast
> table separately, so this can happen.
>
> Now suppose that OldestXmin goes backwards so that the older main table
> row version is no longer dead, but merely recently-dead.
>
> At this point, VACUUM FULL (or similar rewrites) on the table will fail
> with "missing chunk number 0 for ..." toast errors, because it tries to
> copy the recently-dead row, but that row's toasted values have been
> vacuumed away already.
>

I haven't tried to reproduce it, but I can see the possibility of the problem described by you. What should we do next? I could see few possibilities:

(a) Vacuum for main and toast table should always use same OldestXmin,

(b) Before removing the row from toast table, we should ensure that row in the main table is removed,

(c) Change the logic during rewrite such that it can detect this situation and skip copying the tuple in the main table; on a quick look, this seems tricky, because the toast table TID might have been reused by that time, basically I am not sure if this can be a viable solution, or

(d) Ensure that GetOldestXmin doesn't move backwards or write a new API similar to it which doesn't allow OldestXmin to move backwards and use that for the purpose of vacuum.

Any better ideas?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
>>>>> "Amit" == Amit Kapila <amit.kapila16@gmail.com> writes: Amit> I haven't tried to reproduce it, but I can see the possibility of Amit> the problem described by you. What should we do next? I could see Amit> few possibilities: (a) Vacuum for main and toast table should Amit> always use same OldestXmin, I don't see how this could be arranged without either disabling the ability to vacuum the toast table independently, or storing the xmin somewhere. Amit> (b) Before removing the row from toast table, we should ensure Amit> that row in the main table is removed, We can't find the main table row version(s) without scanning the whole main table, so this amounts (again) to disabling vacuum on the toast table only. Amit> (c) Change the logic during rewrite such that it can detect this Amit> situation and skip copying the tuple in the main table, This seems more promising, but the problem is how to detect the error safely (since currently, it's just ereport()ed from deep inside the toast code). How about: 1) add a toast_datum_exists() call or similar that returns false if the (first block of) the specified external toast item is missing 2) when copying a RECENTLY_DEAD tuple, check all its external column values using this function beforehand, and if any are missing, treat the tuple as DEAD instead. Amit> on a quick look, this seems tricky, because the toast table TID Amit> might have been reused by that time, Toast pointers don't point to TIDs fortunately, they point to OIDs. The OID could have been reused (if there's been an OID wraparound since the toast item was created), but that should be harmless: it means that we'd incorrectly copy the new value with the old tuple, but the old tuple is never going to be visible to anybody ever again so nothing will see that. Amit> or (d) Ensure that GetOldestXmin doesn't move backwards or write Amit> a new API similar to it which doesn't allow OldestXmin to move Amit> backwards and use that for the purpose of vacuum. This would presumably require storing a reference OldestXmin somewhere (in shared memory? but there'd have to be a global one and one per db, and I don't think we have any shared memory structures at present that are per-db). We'd even have to preserve an oldestXmin for databases with no currently active connections, though we wouldn't have to preserve it across restarts. Amit> Any better ideas? It turns out this issue has come up before (5 years ago!): https://www.postgresql.org/message-id/20362.1359747327%40sss.pgh.pa.us -- Andrew (irc:RhodiumToad)
>>>>> "Amit" == Amit Kapila <amit.kapila16@gmail.com> writes: Amit> (c) Change the logic during rewrite such that it can detect this Amit> situation and skip copying the tuple in the main table, So I dug into this one and it looks to me like the best approach. Here's a draft patch against 10-stable, if nobody can think of any showstoppers then I'll do the rest of the versions and commit them. -- Andrew (irc:RhodiumToad) diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c index e3d09db0a8..fd091f2de5 100644 --- a/src/backend/access/heap/tuptoaster.c +++ b/src/backend/access/heap/tuptoaster.c @@ -1455,6 +1455,50 @@ toast_get_valid_index(Oid toastoid, LOCKMODE lock) /* ---------- + * toast_validate_tuple - + * + * Given one HeapTuple, test whether all of its external toast rows (if any) + * exist and are the correct length. + * + * tup: tuple + * tupleDesc: tupdesc + */ +bool +toast_validate_tuple(HeapTuple tup, TupleDesc tupleDesc) +{ + Form_pg_attribute *att = tupleDesc->attrs; + int numAttrs = tupleDesc->natts; + int i; + Datum toast_values[MaxTupleAttributeNumber]; + bool toast_isnull[MaxTupleAttributeNumber]; + + /* + * Caller can check this if they like, but it doesn't hurt to do it here + * too. + */ + if (!HeapTupleHasExternal(tup)) + return true; + + /* + * Break down the tuple into fields. + */ + Assert(numAttrs <= MaxTupleAttributeNumber); + heap_deform_tuple(tup, tupleDesc, toast_values, toast_isnull); + + for (i = 0; i < numAttrs; i++) + { + if (!toast_isnull[i] && + att[i]->attlen == -1 && + !att[i]->attisdropped && + !toast_validate_datum(toast_values[i])) + return false; + } + + return true; +} + + +/* ---------- * toast_save_datum - * * Save one single datum into the secondary relation and return @@ -2037,6 +2081,141 @@ toast_fetch_datum(struct varlena *attr) } /* ---------- + * toast_validate_datum - + * + * Test whether a Datum's external toast rows (if any) exist with the correct + * lengths. + * + * This should track toast_fetch_datum to the extent that it returns false if + * toast_fetch_datum would have errored out due to missing chunks or incorrect + * lengths. (Other kinds of "can't happen" errors can still be thrown.) 
+ * ---------- + */ +bool +toast_validate_datum(Datum value) +{ + bool result = true; + Relation toastrel; + Relation *toastidxs; + ScanKeyData toastkey; + SysScanDesc toastscan; + HeapTuple ttup; + TupleDesc toasttupDesc; + struct varlena *attr = (struct varlena *) DatumGetPointer(value); + struct varatt_external toast_pointer; + int32 ressize; + int32 residx, + nextidx; + int32 numchunks; + Pointer chunk; + bool isnull; + int32 chunksize; + int num_indexes; + int validIndex; + SnapshotData SnapshotToast; + + if (!VARATT_IS_EXTERNAL_ONDISK(attr)) + return true; + + /* Must copy to access aligned fields */ + VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr); + + ressize = toast_pointer.va_extsize; + numchunks = ((ressize - 1) / TOAST_MAX_CHUNK_SIZE) + 1; + + /* + * Open the toast relation and its indexes + */ + toastrel = heap_open(toast_pointer.va_toastrelid, AccessShareLock); + toasttupDesc = toastrel->rd_att; + + /* Look for the valid index of the toast relation */ + validIndex = toast_open_indexes(toastrel, + AccessShareLock, + &toastidxs, + &num_indexes); + + /* + * Setup a scan key to fetch from the index by va_valueid + */ + ScanKeyInit(&toastkey, + (AttrNumber) 1, + BTEqualStrategyNumber, F_OIDEQ, + ObjectIdGetDatum(toast_pointer.va_valueid)); + + /* + * Read the chunks by index + * + * Note that because the index is actually on (valueid, chunkidx) we will + * see the chunks in chunkidx order, even though we didn't explicitly ask + * for it. + */ + nextidx = 0; + + init_toast_snapshot(&SnapshotToast); + toastscan = systable_beginscan_ordered(toastrel, toastidxs[validIndex], + &SnapshotToast, 1, &toastkey); + while ((ttup = systable_getnext_ordered(toastscan, ForwardScanDirection)) != NULL) + { + /* + * Have a chunk, extract the sequence number and the data + */ + residx = DatumGetInt32(fastgetattr(ttup, 2, toasttupDesc, &isnull)); + Assert(!isnull); + chunk = DatumGetPointer(fastgetattr(ttup, 3, toasttupDesc, &isnull)); + Assert(!isnull); + if (!VARATT_IS_EXTENDED(chunk)) + { + chunksize = VARSIZE(chunk) - VARHDRSZ; + } + else if (VARATT_IS_SHORT(chunk)) + { + /* could happen due to heap_form_tuple doing its thing */ + chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT; + } + else + { + /* should never happen */ + elog(ERROR, "found toasted toast chunk for toast value %u in %s", + toast_pointer.va_valueid, + RelationGetRelationName(toastrel)); + chunksize = 0; /* keep compiler quiet */ + } + + /* + * Some checks on the data we've found + */ + if (residx != nextidx || + (residx < numchunks - 1 && + chunksize != TOAST_MAX_CHUNK_SIZE) || + (residx == numchunks - 1 && + ((residx * TOAST_MAX_CHUNK_SIZE + chunksize) != ressize)) || + (residx >= numchunks)) + { + result = false; + break; + } + + nextidx++; + } + + /* + * Final checks + */ + if (result && nextidx != numchunks) + result = false; + + /* + * End scan and close relations + */ + systable_endscan_ordered(toastscan); + toast_close_indexes(toastidxs, num_indexes, AccessShareLock); + heap_close(toastrel, AccessShareLock); + + return result; +} + +/* ---------- * toast_fetch_datum_slice - * * Reconstruct a segment of a Datum from the chunks saved diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c index dfd089f63c..b241b67a5e 100644 --- a/src/backend/commands/cluster.c +++ b/src/backend/commands/cluster.c @@ -945,6 +945,7 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose, HeapTuple tuple; Buffer buf; bool isdead; + bool validate_toast = false; CHECK_FOR_INTERRUPTS(); @@ -979,6 
+980,7 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose, break; case HEAPTUPLE_RECENTLY_DEAD: tups_recently_dead += 1; + validate_toast = true; /* fall through */ case HEAPTUPLE_LIVE: /* Live or recently dead, must copy it */ @@ -1022,6 +1024,20 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose, LockBuffer(buf, BUFFER_LOCK_UNLOCK); + /* + * OldestXmin going backwards can mean we have a RECENTLY_DEAD row that + * contains toast pointers whose toast rows have already been vacuumed + * away (or in the worst but unlikely case, recycled). If so, then the + * row must really be dead to all snapshots that could access it, so + * treat it as DEAD instead. + */ + if (validate_toast && + !toast_validate_tuple(tuple, oldTupDesc)) + { + isdead = true; + tups_recently_dead -= 1; + } + if (isdead) { tups_vacuumed += 1; diff --git a/src/include/access/tuptoaster.h b/src/include/access/tuptoaster.h index fd9f83ac44..83b59257ef 100644 --- a/src/include/access/tuptoaster.h +++ b/src/include/access/tuptoaster.h @@ -236,4 +236,22 @@ extern Size toast_datum_size(Datum value); */ extern Oid toast_get_valid_index(Oid toastoid, LOCKMODE lock); +/* ---------- + * toast_validate_tuple - + * + * Given one HeapTuple, test whether all of its external toast rows (if any) + * exist and are the correct length. + * ---------- + */ +extern bool toast_validate_tuple(HeapTuple tup, TupleDesc tupleDesc); + +/* ---------- + * toast_validate_datum - + * + * Test whether a Datum's external toast rows (if any) exist with the correct + * lengths. + * ---------- + */ +extern bool toast_validate_datum(Datum value); + #endif /* TUPTOASTER_H */
On Sat, 21 Apr 2018 at 6:46 PM, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
So I dug into this one and it looks to me like the best approach. Here's
a draft patch against 10-stable, if nobody can think of any showstoppers
then I'll do the rest of the versions and commit them.
How does this guard against the case when the same OID gets inserted in the toast table again, with matching lengths etc? Rare but seems possible, no?
I think we should look for a more complete solution, given how hard these bugs are to detect and fix.
Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,

On 2018-04-21 14:16:27 +0100, Andrew Gierth wrote:
> >>>>> "Amit" == Amit Kapila <amit.kapila16@gmail.com> writes:
>
> Amit> (c) Change the logic during rewrite such that it can detect this
> Amit> situation and skip copying the tuple in the main table,
>
> So I dug into this one and it looks to me like the best approach. Here's
> a draft patch against 10-stable, if nobody can think of any showstoppers
> then I'll do the rest of the versions and commit them.

Please wait for a bit. This isn't a trivial change, and I would like to wrap my head around it. At the very least this seems like it could cause extreme slowdowns for large tables that have the right update/delete patterns?

> +	/*
> +	 * OldestXmin going backwards can mean we have a RECENTLY_DEAD row that
> +	 * contains toast pointers whose toast rows have already been vacuumed
> +	 * away (or in the worst but unlikely case, recycled). If so, then the
> +	 * row must really be dead to all snapshots that could access it, so
> +	 * treat it as DEAD instead.
> +	 */

How is it guaranteed that the recheck doesn't find a different toast tuple at the relevant position?

Greetings,

Andres Freund
>>>>> "Andres" == Andres Freund <andres@anarazel.de> writes: Andres> Please wait for a bit. This isn't a trivial change, and I would Andres> like to wrap my head around it. Sure. Andres> At the very least this seems like it could cause extreme Andres> slowdowns for large tables that have the right update/delete Andres> patterns? I'm working on the basis that fetching the same set of toast rows twice in succession will at least mean they are likely in cache for the second time, so we shouldn't be doing any more actual I/O than before. (This is assuming that a system with really big toast values is going to also have plenty of RAM, which is a generally safe assumption since such values are copied many times in memory by pg_dump.) There is still some overhead in the check, of course; I will see about doing some benchmarks. >> + /* >> + * OldestXmin going backwards can mean we have a RECENTLY_DEAD row that >> + * contains toast pointers whose toast rows have already been vacuumed >> + * away (or in the worst but unlikely case, recycled). If so, then the >> + * row must really be dead to all snapshots that could access it, so >> + * treat it as DEAD instead. >> + */ Andres> How is it guaranteed that the recheck doesn't find a different Andres> toast tuple at the relevant position? It's not, but it doesn't matter as long as the new toast rows can pass through toast_fetch_datum without throwing an error (which is why toast_validate_datum has to scan all the entries and check all their lengths). Firstly, we can assume the following statements (or at least if we can't, we have far worse problems than this): 1. OldestXmin going backwards cannot make any tuple visible in any (non-vacuum) snapshot that it wasn't visible in before; 2. If a tuple is non-vacuumable itself, it won't have its toast entries vacuumed away; 3. Once a tuple is vacuumable at any single instant, it will never become visible in any snapshot. So the only cases where a recently-dead tuple can have pointers to a now-recycled toast entry OID is if it was at some point vacuumable, such that a toast-table-only vacuum removed its toast entries, and a new insert or update subsequently inserted a value of the same length that was assigned the OID. Since the tuple was vacuumable at that point, it will never again become visible in any snapshot, so the only code that should ever access its toast values is vacuum full/cluster, which only ever copies the value and does not attempt to decompress or otherwise interpret it. So the only possible consequences for copying the wrong value, as long as we don't throw an error in the process, are that (a) some space is possibly wasted in the toast table - which actually happens already in far more common circumstances - and (b) someone doing forensics or other non-standard access to dead tuples might (in this vanishingly rare case) see something unexpected. At no time would incorrect data be visible in any query. (regarding wasted space in the toast table: vacuum full / cluster will insert live toast entries for copied recently-dead tuples, and if those toast entries are not also referenced from any live tuple, they will never be removed by subsequent normal vacuums, since nothing can ever mark those entries dead; only a later vacuum full / cluster can remove them.) -- Andrew (irc:RhodiumToad)
>>>>> "Andrew" == Andrew Gierth <andrew@tao11.riddles.org.uk> writes: Andrew> There is still some overhead in the check, of course; I will Andrew> see about doing some benchmarks. Preliminary result suggests that the user-CPU cost of vacuum full increases by ~16% when the entire table is recently-dead rows (with a toasted column of ~10k length) compared to the same table when all rows are live. Since I doubt the practical value of vacuum full on a table which is 100% recently-dead, and that I would expect the live:recently-dead ratio to not normally be much worse than 1:1 (making the overhead 8%) and more likely up around 10:1 (making the overhead 1.5%), I think this is not an issue. (When you have a lot of recently-dead rows is exactly the _wrong_ time to be doing a vacuum full, since it wouldn't save you any space and you'd bloat the toast table as mentioned previously.) -- Andrew (irc:RhodiumToad)
On Sat, Apr 21, 2018 at 1:26 AM, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
>>>>>> "Amit" == Amit Kapila <amit.kapila16@gmail.com> writes:
>
> Amit> I haven't tried to reproduce it, but I can see the possibility of
> Amit> the problem described by you. What should we do next? I could see
> Amit> few possibilities: (a) Vacuum for main and toast table should
> Amit> always use same OldestXmin,
>
> I don't see how this could be arranged without either disabling the
> ability to vacuum the toast table independently, or storing the xmin
> somewhere.
>

I think we can use the same xmin for both main heap and toast by computing it before scanning the main heap (lazy_vacuum_rel) and then pass it down. I have tried it and came up with the attached patch.

> Amit> (b) Before removing the row from toast table, we should ensure
> Amit> that row in the main table is removed,
>
> We can't find the main table row version(s) without scanning the whole
> main table, so this amounts (again) to disabling vacuum on the toast
> table only.
>
> Amit> (c) Change the logic during rewrite such that it can detect this
> Amit> situation and skip copying the tuple in the main table,
>
> This seems more promising, but the problem is how to detect the error
> safely (since currently, it's just ereport()ed from deep inside the
> toast code). How about:
>
> 1) add a toast_datum_exists() call or similar that returns false if the
> (first block of) the specified external toast item is missing
>
> 2) when copying a RECENTLY_DEAD tuple, check all its external column
> values using this function beforehand, and if any are missing, treat the
> tuple as DEAD instead.
>
> Amit> on a quick look, this seems tricky, because the toast table TID
> Amit> might have been reused by that time,
>
> Toast pointers don't point to TIDs fortunately, they point to OIDs. The
> OID could have been reused (if there's been an OID wraparound since the
> toast item was created), but that should be harmless: it means that we'd
> incorrectly copy the new value with the old tuple, but the old tuple is
> never going to be visible to anybody ever again so nothing will see that.
>

Yeah, that's right, but it gives some uneasy feeling that we are attaching the wrong toast value. If you think there is some basic flaw in Approach-1 (as in attached patch) or it requires some major surgery then we can look further into this.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment
>>>>> "Amit" == Amit Kapila <amit.kapila16@gmail.com> writes: Amit> I haven't tried to reproduce it, but I can see the possibility of Amit> the problem described by you. What should we do next? I could see Amit> few possibilities: (a) Vacuum for main and toast table should Amit> always use same OldestXmin, >> I don't see how this could be arranged without either disabling the >> ability to vacuum the toast table independently, or storing the xmin >> somewhere. Amit> I think we can use the same xmin for both main heap and toast But you're completely missing the point here, which is that the main heap might not be vacuumed AT ALL. Autovacuum can and will call vacuum() with the relid of a toast table - this is a deliberate decision to cope with access patterns that would otherwise bloat the toast table. (It's also currently legal to do VACUUM pg_toast.pg_toast_nnnn; manually, but that's of secondary importance compared to the fact that autovacuum does it.) Amit> by computing it before scanning the main heap (lazy_vacuum_rel) Amit> and then pass it down. I have tried it and came up with the Amit> attached patch. Your patch would actually be needed if (and only if) autovacuum was changed back to its old behavior of never vacuuming toast tables independently, and if manual VACUUM pg_toast.*; was disabled. But in the presence of either of those two possibilities, it does nothing useful. >> Toast pointers don't point to TIDs fortunately, they point to OIDs. >> The OID could have been reused (if there's been an OID wraparound >> since the toast item was created), but that should be harmless: it >> means that we'd incorrectly copy the new value with the old tuple, >> but the old tuple is never going to be visible to anybody ever again >> so nothing will see that. Amit> Yeah, that's right, but it gives some uneasy feeling that we are Amit> attaching the wrong toast value. If you think there is some basic Amit> flaw in Approach-1 (as in attached patch) or it requires some Amit> major surgery then we can look further into this. The basic flaw is that it doesn't even reach the problem. -- Andrew (irc:RhodiumToad)
>>>>> "Amit" == Amit Kapila <amit.kapila16@gmail.com> writes: Amit> Yeah, that's right, but it gives some uneasy feeling that we are Amit> attaching the wrong toast value. Oh, forgot to stress this before: referencing the wrong toast value in this case is something that can already happen, my patch does not affect either the probability of it occurring or the consequences of it. -- Andrew (irc:RhodiumToad)
>>>>> "Andrew" == Andrew Gierth <andrew@tao11.riddles.org.uk> writes: Andrew> Since the tuple was vacuumable at that point, it will never Andrew> again become visible in any snapshot, so the only code that Andrew> should ever access its toast values is vacuum full/cluster, and, unfortunately, CREATE INDEX (non-concurrently), including the index build done by vacfull/cluster itself, which rather messes things up because the toasted column might be a parameter to a functional index, or it might itself be an indexed column (within a narrow size range for btree, or more generally for other index types); this means the index AM or the indexed function may try and actually detoast and inspect the value. So right now it's possible (I think) in the worst case to crash the server if you can set up exactly the right circumstances (re-use of a prematurely vacuumed toast oid after wraparound, with the new value matching the old value's raw length but not its compression state). (My changes so far don't address this case and don't make it either more or less likely) -- Andrew (irc:RhodiumToad)
On Mon, Apr 23, 2018 at 1:55 AM, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
>>>>>> "Amit" == Amit Kapila <amit.kapila16@gmail.com> writes:
>
> Amit> by computing it before scanning the main heap (lazy_vacuum_rel)
> Amit> and then pass it down. I have tried it and came up with the
> Amit> attached patch.
>
> Your patch would actually be needed if (and only if) autovacuum was
> changed back to its old behavior of never vacuuming toast tables
> independently, and if manual VACUUM pg_toast.*; was disabled. But in the
> presence of either of those two possibilities, it does nothing useful.
>

Yeah, right, I have missed the point that they can be vacuumed separately, however, I think that decision is somewhat questionable. Basically, it doesn't appear logical to leave the sort-of dangling toast pointer in the heap. I think this is somewhat akin to our current index and heap binding where we never let a heap itemid go away till all index entries pointing to it are taken care of.

I have gone through the commit (3ccde312ec8ee47f5f797b070d34a675799448ae) which appears to have decoupled this mechanism, but it doesn't provide any reason for doing so. I have also gone through the related thread [1] which is not much help either. I think there must be some patterns where toast table bloat causes much more harm than heap bloat, as toast rows are much larger in size. Are you aware of what were the concrete reasons behind this change?

I think it would have been better if along with decoupling of vacuum for main heap and toast tables, we would have come up with a way to selectively remove the corresponding rows from the main heap, say by just vacuuming heap pages/rows which have toast pointers but maybe that is not viable or involves much more work without equivalent benefit. Also, we can think along the lines of another idea suggested by Andres [2] on the thread mentioned by you.

[1] - https://www.postgresql.org/message-id/20080808165809.GB3800%40alvh.no-ip.org
[2] - https://www.postgresql.org/message-id/20130204164341.GB22226%40awork2.anarazel.de

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
>>>>> "Amit" == Amit Kapila <amit.kapila16@gmail.com> writes: >> Your patch would actually be needed if (and only if) autovacuum was >> changed back to its old behavior of never vacuuming toast tables >> independently, and if manual VACUUM pg_toast.*; was disabled. But in >> the presence of either of those two possibilities, it does nothing >> useful. Amit> Yeah, right, I have missed the point that they can be vacuumed Amit> separately, however, I think that decision is somewhat Amit> questionable. Some previous discussion links for reference, for the background to the thread containing the patch: https://www.postgresql.org/message-id/flat/87y7gpiqx3.fsf%40oxford.xeocode.com https://www.postgresql.org/message-id/flat/20080608230348.GD11028%40alvh.no-ip.org Amit> I think it would have been better if along with decoupling of Amit> vacuum for main heap and toast tables, we would have come up with Amit> a way to selectively remove the corresponding rows from the main Amit> heap, say by just vacuuming heap pages/rows which have toast Amit> pointers but maybe that is not viable or involves much more work Amit> without equivalent benefit. It should be fairly obvious why this is unworkable - most toast-using tables will have toast pointers on every page, but without making a whole new index of toast pointer OIDs (unacceptable overhead), it would be impossible to find the toast pointers pointing to a specific item without searching the whole rel (in which case we might just as well have vacuumed it). Amit> Also, we can think along the lines of another idea suggested by Amit> Andres [2] on the thread mentioned by you. That one is tricky for various reasons (locking, order of operations in vacuum_rel, having to mess with the API of vacuum(), etc.) -- Andrew (irc:RhodiumToad)
On Mon, Apr 23, 2018 at 1:34 PM, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
>>>>>> "Amit" == Amit Kapila <amit.kapila16@gmail.com> writes:
>
> >> Your patch would actually be needed if (and only if) autovacuum was
> >> changed back to its old behavior of never vacuuming toast tables
> >> independently, and if manual VACUUM pg_toast.*; was disabled. But in
> >> the presence of either of those two possibilities, it does nothing
> >> useful.
>
> Amit> Yeah, right, I have missed the point that they can be vacuumed
> Amit> separately, however, I think that decision is somewhat
> Amit> questionable.
>
> Some previous discussion links for reference, for the background to the
> thread containing the patch:
>
> https://www.postgresql.org/message-id/flat/87y7gpiqx3.fsf%40oxford.xeocode.com
> https://www.postgresql.org/message-id/flat/20080608230348.GD11028%40alvh.no-ip.org
>

If I read correctly, it seems one of the main reasons [1] is to save the extra pass over the heap and improve the code. Now, ideally, the extra pass over the heap should also free up some space (occupied by the rows that contain old toast pointers corresponding to which we are going to remove rows from the toast table), but it is quite possible that it is already clean due to a previous separate vacuum pass over the heap. I think it is good to save an extra pass over the heap which might not be as useful as we expect, but that can cost us correctness issues in boundary cases as in the case being discussed in this thread.

> Amit> I think it would have been better if along with decoupling of
> Amit> vacuum for main heap and toast tables, we would have come up with
> Amit> a way to selectively remove the corresponding rows from the main
> Amit> heap, say by just vacuuming heap pages/rows which have toast
> Amit> pointers but maybe that is not viable or involves much more work
> Amit> without equivalent benefit.
>
> It should be fairly obvious why this is unworkable - most toast-using
> tables will have toast pointers on every page, but without making a
> whole new index of toast pointer OIDs (unacceptable overhead), it would
> be impossible to find the toast pointers pointing to a specific item
> without searching the whole rel (in which case we might just as well
> have vacuumed it).
>

Okay, such an optimization might not be very effective, and it would anyway lead to an extra pass over the heap; however, it would resolve the correctness issue. Now, I understand that it is not advisable to go back to the previous behavior for performance reasons, but I think it would be better if we find a bullet-proof way to fix this symptom, rather than just fixing the issue reported in this thread.

[1] - From one of the emails: "We go certain lengths in autovacuum to make sure tables are vacuumed when their toast table needs vacuuming and the main table does not, which is all quite kludgy. So we already look at their stats and make decisions about them. But what we do after that is force a vacuum to the main table, even if that one does not need any vacuuming, which is dumb."

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 4/23/18 00:33, Amit Kapila wrote:
> Yeah, right, I have missed the point that they can be vacuumed
> separately, however, I think that decision is somewhat questionable.

Manually vacuuming the TOAST table was a way to repair the recently fixed TOAST bug, so it's kind of useful.

-- 
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Apr 23, 2018 at 11:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> If I read correctly, it seems one of the main reasons [1] is to save
> the extra pass over the heap and improve the code. Now, ideally, the
> extra pass over the heap should also free up some space (occupied by
> the rows that contain old toast pointers corresponding to which we are
> going to remove rows from the toast table), but it is quite possible
> that it is already clean due to a previous separate vacuum pass over
> the heap. I think it is good to save an extra pass over the heap which
> might not be as useful as we expect, but that can cost us correctness
> issues in boundary cases as in the case being discussed in this thread.

I don't think anybody disagrees with that, but it doesn't answer the question of how best to fix it. Making VACUUM a lot more expensive will not win us any awards, and an extra pass over the heap could be very expensive. You seem to think Andrew's fix isn't really addressing the root of the problem, but I think that just depends on how you define the problem. If you define the problem as "the table should never have dangling pointers to the TOAST table", then Andrew's approach is only fixing the symptom. But if you define the problem as "we shouldn't try to follow TOAST pointers in heap tuples that are not visible to any current or future MVCC snapshot", then it's fixing the root issue.

I wonder if perhaps get_actual_variable_range() has a hazard of this type as well. If OldestXmin retreats, then the special snapshot type which it uses could see a row whose TOAST entry has been removed meanwhile.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> I wonder if perhaps get_actual_variable_range() has a hazard of this
> type as well. If OldestXmin retreats, then the special snapshot type
> which it uses could see a row whose TOAST entry has been removed
> meanwhile.

Hm, yeah. I wonder if we could fix that by extracting the value from the index rather than recomputing it from the heap tuple, as we do now. We'd still have to visit the heap tuple to check liveness, but we'd not need to extract anything from its data part.

Have we given up on the angle of "prevent OldestXmin from retreating"? That seems like it'd be far more bulletproof than trying to work around places that break one-by-one.

			regards, tom lane
On Thu, Apr 26, 2018 at 03:09:01PM -0400, Tom Lane wrote:
> Have we given up on the angle of "prevent OldestXmin from retreating"?
> That seems like it'd be far more bulletproof than trying to work
> around places that break one-by-one.

Strong +1 on that.

-- 
Michael
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes: Tom> Have we given up on the angle of "prevent OldestXmin from Tom> retreating"? I haven't, and I'm thinking about it, but I don't have an immediate solution. Thinking-out-loud mode: 1) we could store minimum values for OldestXmin for each database in some kind of shared-memory structure. Downside is that we can't easily make this fixed-size, since we don't know how many DBs there are - we can't size it based on connections since we need to be able to track values for dbs with no currently active connections. (Or do we need to track it across restarts? maybe we do, to deal with replication slaves without slots, or changes in parameters) 2) in-place updates of a new column in pg_database? and a control file field for the global values? not back-patchable, but would it work going forward? 3) there was the previous suggestion to store it per-table in the main table's reloptions and inplace-update that; this isn't impossible, but it would be fiddly to do because toast-table-only vacuums would need to locate the main table, and lock order issues might get interesting. (Maybe it would be sneakier to store it in the toast table's otherwise-unused reloptions, and have vacuum/create index on the main table consult that? This assumes that we only care about the issue when dealing with toast tables) 4) something else?? -- Andrew (irc:RhodiumToad)
>>>>> "Robert" == Robert Haas <robertmhaas@gmail.com> writes: Robert> I wonder if perhaps get_actual_variable_range() has a hazard of Robert> this type as well. If OldestXmin retreats, then the special Robert> snapshot type which it uses could see a row whose TOAST entry Robert> has been removed meanwhile. Actually get_actual_variable_range may have a hazard even if OldestXmin does not retreat; it uses RecentGlobalXmin, which is quite likely to be older than an OldestXmin value used by a vacuum. -- Andrew (irc:RhodiumToad)
On Fri, Apr 27, 2018 at 12:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Apr 23, 2018 at 11:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> If I read correctly, it seems one of the main reasons [1] is to save
>> the extra pass over the heap and improve the code. Now, ideally, the
>> extra pass over the heap should also free up some space (occupied by
>> the rows that contain old toast pointers corresponding to which we are
>> going to remove rows from the toast table), but it is quite possible
>> that it is already clean due to a previous separate vacuum pass over
>> the heap. I think it is good to save an extra pass over the heap which
>> might not be as useful as we expect, but that can cost us correctness
>> issues in boundary cases as in the case being discussed in this thread.
>
> I don't think anybody disagrees with that, but it doesn't answer the
> question of how best to fix it. Making VACUUM a lot more expensive
> will not win us any awards, and an extra pass over the heap could be
> very expensive. You seem to think Andrew's fix isn't really
> addressing the root of the problem, but I think that just depends on
> how you define the problem. If you define the problem as "the table
> should never have dangling pointers to the TOAST table", then Andrew's
> approach is only fixing the symptom. But if you define the problem as
> "we shouldn't try to follow TOAST pointers in heap tuples that are not
> visible to any current or future MVCC snapshot", then it's fixing the
> root issue.
>

Hmm, in some rare cases where an OID gets reused, it can still lead to problems (when Create Index is performed in parallel to Cluster), as mentioned by Andrew. I am not sure if that is acceptable on the grounds that it is not a problem newly introduced by his proposed patch. I think if we can come up with a solution which prevents OldestXmin from retreating, it will be good; otherwise, we might have to think harder about whether there is any other solution which attacks the root cause of the problem, and if nothing better comes out, then probably we can go with his proposed patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Apr 26, 2018 at 9:03 PM, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
> (Or do we need to track it across restarts? maybe we do, to deal with
> replication slaves without slots, or changes in parameters)

Yeah, I'm worried that it might need to be persistent across restarts.

One idea that occurred to me is to somehow record -- I guess in pg_class using non-transactional updates -- the last cutoff XID used to vacuum any given table. Then we could just make a rule that you can't vacuum the TOAST table with an XID that's newer than the last one used for the main table. That would preserve the property that you can vacuum the tables separately while avoiding dangling pointers. But that's obviously not back-patchable, and it sounds finicky to get right. It's really too bad that heap tables don't include a metapage where we could store details like this...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>>>>> "Robert" == Robert Haas <robertmhaas@gmail.com> writes: Robert> One idea that occurred to me is to somehow record -- I guess in Robert> pg_class using non-transactional updates -- the last cutoff XID Robert> used to vacuum any given table. Then we could just make a rule Robert> that you can't vacuum the TOAST table with an XID that's newer Robert> than the last one used for the main table. That would preserve Robert> the property that you can vacuum the tables separately while Robert> avoiding dangling pointers. But that's obviously not Robert> back-patchable, The suggestion made previously (in a historical thread) was to use an entry in the reloptions field for this, at least in back branches. It would be necessary for vacuum to add the entry initially in a normal transactional update, after which it could be updated inplace. -- Andrew (irc:RhodiumToad)
On Fri, Apr 27, 2018 at 11:35 AM, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
>>>>>> "Robert" == Robert Haas <robertmhaas@gmail.com> writes:
> Robert> One idea that occurred to me is to somehow record -- I guess in
> Robert> pg_class using non-transactional updates -- the last cutoff XID
> Robert> used to vacuum any given table. Then we could just make a rule
> Robert> that you can't vacuum the TOAST table with an XID that's newer
> Robert> than the last one used for the main table. That would preserve
> Robert> the property that you can vacuum the tables separately while
> Robert> avoiding dangling pointers. But that's obviously not
> Robert> back-patchable,
>
> The suggestion made previously (in a historical thread) was to use an
> entry in the reloptions field for this, at least in back branches. It
> would be necessary for vacuum to add the entry initially in a normal
> transactional update, after which it could be updated inplace.

Yeah, I suppose. Sounds pretty rickety to me, though. Maybe I'm just a pessimist.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Fri, Apr 27, 2018 at 11:35 AM, Andrew Gierth
> <andrew@tao11.riddles.org.uk> wrote:
> >>>>>> "Robert" == Robert Haas <robertmhaas@gmail.com> writes:
> > Robert> One idea that occurred to me is to somehow record -- I guess in
> > Robert> pg_class using non-transactional updates -- the last cutoff XID
> > Robert> used to vacuum any given table. Then we could just make a rule
> > Robert> that you can't vacuum the TOAST table with an XID that's newer
> > Robert> than the last one used for the main table. That would preserve
> > Robert> the property that you can vacuum the tables separately while
> > Robert> avoiding dangling pointers. But that's obviously not
> > Robert> back-patchable,
> >
> > The suggestion made previously (in a historical thread) was to use an
> > entry in the reloptions field for this, at least in back branches. It
> > would be necessary for vacuum to add the entry initially in a normal
> > transactional update, after which it could be updated inplace.
>
> Yeah, I suppose. Sounds pretty rickety to me, though. Maybe I'm just
> a pessimist.

I tend to agree.. However, this isn't something that's been happening a lot, from what I gather, and if we actually add a proper column into pg_class for future versions (not really sure how I feel about if that means "v11" or "v12" right now...) and reloptions for back-branches then perhaps it's not so bad.

As far as a metadata page, it'd be pretty overkill, but maybe a fork for it..? I'm trying to think if there might be anything else we'd be able to put into such a fork since adding another inode to every relation that'll only ever likely be 8k definitely wouldn't win us any fans. Not sure, just brainstorming.

Thanks!

Stephen
>>>>> "Stephen" == Stephen Frost <sfrost@snowman.net> writes: Stephen> I tend to agree.. However, this isn't something that's been Stephen> happening a lot, from what I gather, Right now I can't prove it happens in the wild at all - I think I've ruled it out as an explanation for the original problem report - but it's very easy to demonstrate in tests. -- Andrew (irc:RhodiumToad)
On Fri, Apr 27, 2018 at 8:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Apr 26, 2018 at 9:03 PM, Andrew Gierth
> <andrew@tao11.riddles.org.uk> wrote:
>> (Or do we need to track it across restarts? maybe we do, to deal with
>> replication slaves without slots, or changes in parameters)
>
> Yeah, I'm worried that it might need to be persistent across restarts.
>
> One idea that occurred to me is to somehow record -- I guess in
> pg_class using non-transactional updates -- the last cutoff XID used
> to vacuum any given table. Then we could just make a rule that you
> can't vacuum the TOAST table with an XID that's newer than the last
> one used for the main table. That would preserve the property that
> you can vacuum the tables separately while avoiding dangling pointers.
>

Won't this lead to a bloat in toast tables when there is a big difference between the cutoff XID of the main heap table and the latest values of OldestXmin? As both the tables can be vacuumed separately, it doesn't sound impossible, but maybe some later pass on the main heap and toast table will fix the situation. Even if the difference is not big, vacuum might still not be able to remove rows from the toast table in some situations.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
>>>>> "Amit" == Amit Kapila <amit.kapila16@gmail.com> writes: >>> (Or do we need to track it across restarts? maybe we do, to deal with >>> replication slaves without slots, or changes in parameters) >> Yeah, I'm worried that it might need to be persistent across restarts. >> >> One idea that occurred to me is to somehow record -- I guess in >> pg_class using non-transactional updates -- the last cutoff XID used >> to vacuum any given table. Then we could just make a rule that you >> can't vacuum the TOAST table with an XID that's newer than the last >> one used for the main table. That would preserve the property that >> you can vacuum the tables separately while avoiding dangling pointers. Amit> Won't this lead to a bloat in toast tables when there is a big Amit> difference between the cutoff XID of the main heap table and the Amit> latest values of OldestXmin? Yes. What we need is actually the reverse of what Robert describes - when we vacuum the _main_ table, we must use the _later_ of the currently calculated OldestXmin or the OldestXmin last used to vacuum the toast table. -- Andrew (irc:RhodiumToad)
On Sun, Apr 29, 2018 at 11:56 AM, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
>>>>>> "Amit" == Amit Kapila <amit.kapila16@gmail.com> writes:
>
> >>> (Or do we need to track it across restarts? maybe we do, to deal with
> >>> replication slaves without slots, or changes in parameters)
>
> >> Yeah, I'm worried that it might need to be persistent across restarts.
> >>
> >> One idea that occurred to me is to somehow record -- I guess in
> >> pg_class using non-transactional updates -- the last cutoff XID used
> >> to vacuum any given table. Then we could just make a rule that you
> >> can't vacuum the TOAST table with an XID that's newer than the last
> >> one used for the main table. That would preserve the property that
> >> you can vacuum the tables separately while avoiding dangling pointers.
>
> Amit> Won't this lead to a bloat in toast tables when there is a big
> Amit> difference between the cutoff XID of the main heap table and the
> Amit> latest values of OldestXmin?
>
> Yes. What we need is actually the reverse of what Robert describes -
> when we vacuum the _main_ table, we must use the _later_ of the
> currently calculated OldestXmin or the OldestXmin last used to vacuum
> the toast table.
>

I think then that the same formula needs to be used during cluster as well. Also, what about get_actual_variable_range()? It will also need a similar change; is it okay to add an additional lookup of pg_class in that code path?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com