Thread: Toast issues with OldestXmin going backwards

Toast issues with OldestXmin going backwards

From
Andrew Gierth
Date:
Various comments in GetOldestXmin mention the possibility of the oldest
xmin going backward, and assert that this is actually safe. It's not.

Consider:

A table has a toastable column. A row is updated in a way that changes
the toasted value. There are now two row versions pointing to different
toast values, one live and one dead.

Now suppose the toast table - but not the main table - is vacuumed; the
dead toast entries are removed, even though they are still referenced by
the dead main-table row. Autovacuum treats the main table and toast
table separately, so this can happen.

Now suppose that OldestXmin goes backwards so that the older main table
row version is no longer dead, but merely recently-dead.

At this point, VACUUM FULL (or similar rewrites) on the table will fail
with "missing chunk number 0 for ..." toast errors, because it tries to
copy the recently-dead row, but that row's toasted values have been
vacuumed away already.

(I've been working for a while with someone on IRC to track down the
source of unexplained semi-repeatable "missing chunk number 0 ..."
errors from VACUUM FULL. I don't know at this stage whether this is the
actual problem, but it matches the symptoms.)

-- 
Andrew (irc:RhodiumToad)


Re: Toast issues with OldestXmin going backwards

From
Amit Kapila
Date:
On Thu, Apr 19, 2018 at 4:07 PM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
> Various comments in GetOldestXmin mention the possibility of the oldest
> xmin going backward, and assert that this is actually safe. It's not.
>
> Consider:
>
> A table has a toastable column. A row is updated in a way that changes
> the toasted value. There are now two row versions pointing to different
> toast values, one live and one dead.
>
> Now suppose the toast table - but not the main table - is vacuumed; the
> dead toast entries are removed, even though they are still referenced by
> the dead main-table row. Autovacuum treats the main table and toast
> table separately, so this can happen.
>
> Now suppose that OldestXmin goes backwards so that the older main table
> row version is no longer dead, but merely recently-dead.
>
> At this point, VACUUM FULL (or similar rewrites) on the table will fail
> with "missing chunk number 0 for ..." toast errors, because it tries to
> copy the recently-dead row, but that row's toasted values have been
> vacuumed away already.
>

I haven't tried to reproduce it, but I can see the possibility of the
problem described by you.  What should we do next?  I could see few
possibilities: (a) Vacuum for main and toast table should always use
same OldestXmin, (b) Before removing the row from toast table, we
should ensure that row in the main table is removed, (c) Change the
logic during rewrite such that it can detect this situation and skip
copying the tuple in the main table, on a quick look, this seems
tricky, because the toast table TID might have been reused by that
time, basically I am not sure if this can be a viable solution or (d)
Ensure that GetOldestXmin doesn't move backwards or write a new API
similar to it which doesn't allow OldestXmin to move backwards and use
that for the purpose of vacuum.

Any better ideas?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Toast issues with OldestXmin going backwards

From
Andrew Gierth
Date:
>>>>> "Amit" == Amit Kapila <amit.kapila16@gmail.com> writes:

 Amit> I haven't tried to reproduce it, but I can see the possibility of
 Amit> the problem described by you. What should we do next? I could see
 Amit> few possibilities: (a) Vacuum for main and toast table should
 Amit> always use same OldestXmin,

I don't see how this could be arranged without either disabling the
ability to vacuum the toast table independently, or storing the xmin
somewhere.

 Amit> (b) Before removing the row from toast table, we should ensure
 Amit> that row in the main table is removed,

We can't find the main table row version(s) without scanning the whole
main table, so this amounts (again) to disabling vacuum on the toast
table only.

 Amit> (c) Change the logic during rewrite such that it can detect this
 Amit> situation and skip copying the tuple in the main table,

This seems more promising, but the problem is how to detect the error
safely (since currently, it's just ereport()ed from deep inside the
toast code). How about:

1) add a toast_datum_exists() call or similar that returns false if the
(first block of) the specified external toast item is missing

2) when copying a RECENTLY_DEAD tuple, check all its external column
values using this function beforehand, and if any are missing, treat the
tuple as DEAD instead.

 Amit> on a quick look, this seems tricky, because the toast table TID
 Amit> might have been reused by that time,

Toast pointers don't point to TIDs fortunately, they point to OIDs. The
OID could have been reused (if there's been an OID wraparound since the
toast item was created), but that should be harmless: it means that we'd
incorrectly copy the new value with the old tuple, but the old tuple is
never going to be visible to anybody ever again so nothing will see that.

 Amit> or (d) Ensure that GetOldestXmin doesn't move backwards or write
 Amit> a new API similar to it which doesn't allow OldestXmin to move
 Amit> backwards and use that for the purpose of vacuum.

This would presumably require storing a reference OldestXmin somewhere
(in shared memory? but there'd have to be a global one and one per db,
and I don't think we have any shared memory structures at present that
are per-db). We'd even have to preserve an oldestXmin for databases with
no currently active connections, though we wouldn't have to preserve it
across restarts.

 Amit> Any better ideas?

It turns out this issue has come up before (5 years ago!):

https://www.postgresql.org/message-id/20362.1359747327%40sss.pgh.pa.us

-- 
Andrew (irc:RhodiumToad)


Re: Toast issues with OldestXmin going backwards

From
Andrew Gierth
Date:
>>>>> "Amit" == Amit Kapila <amit.kapila16@gmail.com> writes:

 Amit> (c) Change the logic during rewrite such that it can detect this
 Amit> situation and skip copying the tuple in the main table,

So I dug into this one and it looks to me like the best approach. Here's
a draft patch against 10-stable, if nobody can think of any showstoppers
then I'll do the rest of the versions and commit them.

-- 
Andrew (irc:RhodiumToad)

diff --git a/src/backend/access/heap/tuptoaster.c b/src/backend/access/heap/tuptoaster.c
index e3d09db0a8..fd091f2de5 100644
--- a/src/backend/access/heap/tuptoaster.c
+++ b/src/backend/access/heap/tuptoaster.c
@@ -1455,6 +1455,50 @@ toast_get_valid_index(Oid toastoid, LOCKMODE lock)
 
 
 /* ----------
+ * toast_validate_tuple -
+ *
+ *  Given one HeapTuple, test whether all of its external toast rows (if any)
+ *  exist and are the correct length.
+ *
+ * tup: tuple
+ * tupleDesc: tupdesc
+ */
+bool
+toast_validate_tuple(HeapTuple tup, TupleDesc tupleDesc)
+{
+    Form_pg_attribute *att = tupleDesc->attrs;
+    int            numAttrs = tupleDesc->natts;
+    int            i;
+    Datum        toast_values[MaxTupleAttributeNumber];
+    bool        toast_isnull[MaxTupleAttributeNumber];
+
+    /*
+     * Caller can check this if they like, but it doesn't hurt to do it here
+     * too.
+     */
+    if (!HeapTupleHasExternal(tup))
+        return true;
+
+    /*
+     * Break down the tuple into fields.
+     */
+    Assert(numAttrs <= MaxTupleAttributeNumber);
+    heap_deform_tuple(tup, tupleDesc, toast_values, toast_isnull);
+
+    for (i = 0; i < numAttrs; i++)
+    {
+        if (!toast_isnull[i] &&
+            att[i]->attlen == -1 &&
+            !att[i]->attisdropped &&
+            !toast_validate_datum(toast_values[i]))
+            return false;
+    }
+
+    return true;
+}
+
+
+/* ----------
  * toast_save_datum -
  *
  *    Save one single datum into the secondary relation and return
@@ -2037,6 +2081,141 @@ toast_fetch_datum(struct varlena *attr)
 }
 
 /* ----------
+ * toast_validate_datum -
+ *
+ *  Test whether a Datum's external toast rows (if any) exist with the correct
+ *  lengths.
+ *
+ *  This should track toast_fetch_datum to the extent that it returns false if
+ *  toast_fetch_datum would have errored out due to missing chunks or incorrect
+ *  lengths. (Other kinds of "can't happen" errors can still be thrown.)
+ * ----------
+ */
+bool
+toast_validate_datum(Datum value)
+{
+    bool        result = true;
+    Relation    toastrel;
+    Relation   *toastidxs;
+    ScanKeyData toastkey;
+    SysScanDesc toastscan;
+    HeapTuple    ttup;
+    TupleDesc    toasttupDesc;
+    struct varlena *attr = (struct varlena *) DatumGetPointer(value);
+    struct varatt_external toast_pointer;
+    int32        ressize;
+    int32        residx,
+                nextidx;
+    int32        numchunks;
+    Pointer        chunk;
+    bool        isnull;
+    int32        chunksize;
+    int            num_indexes;
+    int            validIndex;
+    SnapshotData SnapshotToast;
+
+    if (!VARATT_IS_EXTERNAL_ONDISK(attr))
+        return true;
+
+    /* Must copy to access aligned fields */
+    VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+
+    ressize = toast_pointer.va_extsize;
+    numchunks = ((ressize - 1) / TOAST_MAX_CHUNK_SIZE) + 1;
+
+    /*
+     * Open the toast relation and its indexes
+     */
+    toastrel = heap_open(toast_pointer.va_toastrelid, AccessShareLock);
+    toasttupDesc = toastrel->rd_att;
+
+    /* Look for the valid index of the toast relation */
+    validIndex = toast_open_indexes(toastrel,
+                                    AccessShareLock,
+                                    &toastidxs,
+                                    &num_indexes);
+
+    /*
+     * Setup a scan key to fetch from the index by va_valueid
+     */
+    ScanKeyInit(&toastkey,
+                (AttrNumber) 1,
+                BTEqualStrategyNumber, F_OIDEQ,
+                ObjectIdGetDatum(toast_pointer.va_valueid));
+
+    /*
+     * Read the chunks by index
+     *
+     * Note that because the index is actually on (valueid, chunkidx) we will
+     * see the chunks in chunkidx order, even though we didn't explicitly ask
+     * for it.
+     */
+    nextidx = 0;
+
+    init_toast_snapshot(&SnapshotToast);
+    toastscan = systable_beginscan_ordered(toastrel, toastidxs[validIndex],
+                                           &SnapshotToast, 1, &toastkey);
+    while ((ttup = systable_getnext_ordered(toastscan, ForwardScanDirection)) != NULL)
+    {
+        /*
+         * Have a chunk, extract the sequence number and the data
+         */
+        residx = DatumGetInt32(fastgetattr(ttup, 2, toasttupDesc, &isnull));
+        Assert(!isnull);
+        chunk = DatumGetPointer(fastgetattr(ttup, 3, toasttupDesc, &isnull));
+        Assert(!isnull);
+        if (!VARATT_IS_EXTENDED(chunk))
+        {
+            chunksize = VARSIZE(chunk) - VARHDRSZ;
+        }
+        else if (VARATT_IS_SHORT(chunk))
+        {
+            /* could happen due to heap_form_tuple doing its thing */
+            chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT;
+        }
+        else
+        {
+            /* should never happen */
+            elog(ERROR, "found toasted toast chunk for toast value %u in %s",
+                 toast_pointer.va_valueid,
+                 RelationGetRelationName(toastrel));
+            chunksize = 0;        /* keep compiler quiet */
+        }
+
+        /*
+         * Some checks on the data we've found
+         */
+        if (residx != nextidx ||
+            (residx < numchunks - 1 &&
+             chunksize != TOAST_MAX_CHUNK_SIZE) ||
+            (residx == numchunks - 1 &&
+             ((residx * TOAST_MAX_CHUNK_SIZE + chunksize) != ressize)) ||
+            (residx >= numchunks))
+        {
+            result = false;
+            break;
+        }
+
+        nextidx++;
+    }
+
+    /*
+     * Final checks
+     */
+    if (result && nextidx != numchunks)
+        result = false;
+
+    /*
+     * End scan and close relations
+     */
+    systable_endscan_ordered(toastscan);
+    toast_close_indexes(toastidxs, num_indexes, AccessShareLock);
+    heap_close(toastrel, AccessShareLock);
+
+    return result;
+}
+
+/* ----------
  * toast_fetch_datum_slice -
  *
  *    Reconstruct a segment of a Datum from the chunks saved
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index dfd089f63c..b241b67a5e 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -945,6 +945,7 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
         HeapTuple    tuple;
         Buffer        buf;
         bool        isdead;
+        bool        validate_toast = false;
 
         CHECK_FOR_INTERRUPTS();
 
@@ -979,6 +980,7 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
                 break;
             case HEAPTUPLE_RECENTLY_DEAD:
                 tups_recently_dead += 1;
+                validate_toast = true;
                 /* fall through */
             case HEAPTUPLE_LIVE:
                 /* Live or recently dead, must copy it */
@@ -1022,6 +1024,20 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 
         LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
+        /*
+         * OldestXmin going backwards can mean we have a RECENTLY_DEAD row that
+         * contains toast pointers whose toast rows have already been vacuumed
+         * away (or in the worst but unlikely case, recycled). If so, then the
+         * row must really be dead to all snapshots that could access it, so
+         * treat it as DEAD instead.
+         */
+        if (validate_toast &&
+            !toast_validate_tuple(tuple, oldTupDesc))
+        {
+            isdead = true;
+            tups_recently_dead -= 1;
+        }
+
         if (isdead)
         {
             tups_vacuumed += 1;
diff --git a/src/include/access/tuptoaster.h b/src/include/access/tuptoaster.h
index fd9f83ac44..83b59257ef 100644
--- a/src/include/access/tuptoaster.h
+++ b/src/include/access/tuptoaster.h
@@ -236,4 +236,22 @@ extern Size toast_datum_size(Datum value);
  */
 extern Oid    toast_get_valid_index(Oid toastoid, LOCKMODE lock);
 
+/* ----------
+ * toast_validate_tuple -
+ *
+ *  Given one HeapTuple, test whether all of its external toast rows (if any)
+ *  exist and are the correct length.
+ *  ----------
+ */
+extern bool toast_validate_tuple(HeapTuple tup, TupleDesc tupleDesc);
+
+/* ----------
+ * toast_validate_datum -
+ *
+ *  Test whether a Datum's external toast rows (if any) exist with the correct
+ *  lengths.
+ *  ----------
+ */
+extern bool toast_validate_datum(Datum value);
+
 #endif                            /* TUPTOASTER_H */

Re: Toast issues with OldestXmin going backwards

From
Pavan Deolasee
Date:

On Sat, 21 Apr 2018 at 6:46 PM, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:


So I dug into this one and it looks to me like the best approach. Here's
a draft patch against 10-stable, if nobody can think of any showstoppers
then I'll do the rest of the versions and commit them.

How does this guard against the case when the same OID gets inserted in the toast table again, with matching lengths etc? Rare but seems possible, no?

I think we should look for a more complete solution how hard these bugs are to detect and fix.

Thanks,
Pavan


--
Andrew (irc:RhodiumToad)

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Toast issues with OldestXmin going backwards

From
Andres Freund
Date:
Hi,

On 2018-04-21 14:16:27 +0100, Andrew Gierth wrote:
> >>>>> "Amit" == Amit Kapila <amit.kapila16@gmail.com> writes:
> 
>  Amit> (c) Change the logic during rewrite such that it can detect this
>  Amit> situation and skip copying the tuple in the main table,
> 
> So I dug into this one and it looks to me like the best approach. Here's
> a draft patch against 10-stable, if nobody can think of any showstoppers
> then I'll do the rest of the versions and commit them.

Please wait for a bit. This isn't a trivial change, and I would like to
wrap my head around it.

At the very least this seems like it could cause extreme slowdowns for
large tables that have the right update/delete patterns?


> +        /*
> +         * OldestXmin going backwards can mean we have a RECENTLY_DEAD row that
> +         * contains toast pointers whose toast rows have already been vacuumed
> +         * away (or in the worst but unlikely case, recycled). If so, then the
> +         * row must really be dead to all snapshots that could access it, so
> +         * treat it as DEAD instead.
> +         */

How is it guaranteed that the recheck doesn't find a different toast
tuple at the relevant position?

Greetings,

Andres Freund


Re: Toast issues with OldestXmin going backwards

From
Andrew Gierth
Date:
>>>>> "Andres" == Andres Freund <andres@anarazel.de> writes:

 Andres> Please wait for a bit. This isn't a trivial change, and I would
 Andres> like to wrap my head around it.

Sure.

 Andres> At the very least this seems like it could cause extreme
 Andres> slowdowns for large tables that have the right update/delete
 Andres> patterns?

I'm working on the basis that fetching the same set of toast rows twice
in succession will at least mean they are likely in cache for the second
time, so we shouldn't be doing any more actual I/O than before. (This is
assuming that a system with really big toast values is going to also
have plenty of RAM, which is a generally safe assumption since such
values are copied many times in memory by pg_dump.) There is still some
overhead in the check, of course; I will see about doing some
benchmarks.

 >> +        /*
 >> +         * OldestXmin going backwards can mean we have a RECENTLY_DEAD row that
 >> +         * contains toast pointers whose toast rows have already been vacuumed
 >> +         * away (or in the worst but unlikely case, recycled). If so, then the
 >> +         * row must really be dead to all snapshots that could access it, so
 >> +         * treat it as DEAD instead.
 >> +         */

 Andres> How is it guaranteed that the recheck doesn't find a different
 Andres> toast tuple at the relevant position?

It's not, but it doesn't matter as long as the new toast rows can pass
through toast_fetch_datum without throwing an error (which is why
toast_validate_datum has to scan all the entries and check all their
lengths).

Firstly, we can assume the following statements (or at least if we
can't, we have far worse problems than this):

1. OldestXmin going backwards cannot make any tuple visible in any
(non-vacuum) snapshot that it wasn't visible in before;

2. If a tuple is non-vacuumable itself, it won't have its toast
entries vacuumed away;

3. Once a tuple is vacuumable at any single instant, it will never
become visible in any snapshot.

So the only cases where a recently-dead tuple can have pointers to a
now-recycled toast entry OID is if it was at some point vacuumable, such
that a toast-table-only vacuum removed its toast entries, and a new insert
or update subsequently inserted a value of the same length that was
assigned the OID.

Since the tuple was vacuumable at that point, it will never again become
visible in any snapshot, so the only code that should ever access its
toast values is vacuum full/cluster, which only ever copies the value
and does not attempt to decompress or otherwise interpret it. So the
only possible consequences for copying the wrong value, as long as we
don't throw an error in the process, are that (a) some space is possibly
wasted in the toast table - which actually happens already in far more
common circumstances - and (b) someone doing forensics or other
non-standard access to dead tuples might (in this vanishingly rare case)
see something unexpected. At no time would incorrect data be visible in
any query.

(regarding wasted space in the toast table: vacuum full / cluster will
insert live toast entries for copied recently-dead tuples, and if those
toast entries are not also referenced from any live tuple, they will
never be removed by subsequent normal vacuums, since nothing can ever
mark those entries dead; only a later vacuum full / cluster can remove
them.)

-- 
Andrew (irc:RhodiumToad)


Re: Toast issues with OldestXmin going backwards

From
Andrew Gierth
Date:
>>>>> "Andrew" == Andrew Gierth <andrew@tao11.riddles.org.uk> writes:

 Andrew> There is still some overhead in the check, of course; I will
 Andrew> see about doing some benchmarks.

Preliminary result suggests that the user-CPU cost of vacuum full
increases by ~16% when the entire table is recently-dead rows (with a
toasted column of ~10k length) compared to the same table when all rows
are live.

Since I doubt the practical value of vacuum full on a table which is
100% recently-dead, and that I would expect the live:recently-dead ratio
to not normally be much worse than 1:1 (making the overhead 8%) and more
likely up around 10:1 (making the overhead 1.5%), I think this is not an
issue. (When you have a lot of recently-dead rows is exactly the _wrong_
time to be doing a vacuum full, since it wouldn't save you any space and
you'd bloat the toast table as mentioned previously.)

-- 
Andrew (irc:RhodiumToad)


Re: Toast issues with OldestXmin going backwards

From
Amit Kapila
Date:
On Sat, Apr 21, 2018 at 1:26 AM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
>>>>>> "Amit" == Amit Kapila <amit.kapila16@gmail.com> writes:
>
>  Amit> I haven't tried to reproduce it, but I can see the possibility of
>  Amit> the problem described by you. What should we do next? I could see
>  Amit> few possibilities: (a) Vacuum for main and toast table should
>  Amit> always use same OldestXmin,
>
> I don't see how this could be arranged without either disabling the
> ability to vacuum the toast table independently, or storing the xmin
> somewhere.
>

I think we can use the same xmin for both main heap and toast by
computing it before scanning the main heap (lazy_vacuum_rel) and then
pass it down.  I have tried it and came up with the attached patch.

>  Amit> (b) Before removing the row from toast table, we should ensure
>  Amit> that row in the main table is removed,
>
> We can't find the main table row version(s) without scanning the whole
> main table, so this amounts (again) to disabling vacuum on the toast
> table only.
>
>  Amit> (c) Change the logic during rewrite such that it can detect this
>  Amit> situation and skip copying the tuple in the main table,
>
> This seems more promising, but the problem is how to detect the error
> safely (since currently, it's just ereport()ed from deep inside the
> toast code). How about:
>
> 1) add a toast_datum_exists() call or similar that returns false if the
> (first block of) the specified external toast item is missing
>
> 2) when copying a RECENTLY_DEAD tuple, check all its external column
> values using this function beforehand, and if any are missing, treat the
> tuple as DEAD instead.
>
>  Amit> on a quick look, this seems tricky, because the toast table TID
>  Amit> might have been reused by that time,
>
> Toast pointers don't point to TIDs fortunately, they point to OIDs. The
> OID could have been reused (if there's been an OID wraparound since the
> toast item was created), but that should be harmless: it means that we'd
> incorrectly copy the new value with the old tuple, but the old tuple is
> never going to be visible to anybody ever again so nothing will see that.
>

Yeah, that's right, but it gives some uneasy feeling that we are
attaching the wrong toast value.  If you think there is some basic
flaw in Approach-1 (as in attached patch) or it requires some major
surgery then we can look further into this.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Toast issues with OldestXmin going backwards

From
Andrew Gierth
Date:
>>>>> "Amit" == Amit Kapila <amit.kapila16@gmail.com> writes:

 Amit> I haven't tried to reproduce it, but I can see the possibility of
 Amit> the problem described by you. What should we do next? I could see
 Amit> few possibilities: (a) Vacuum for main and toast table should
 Amit> always use same OldestXmin,
 
 >> I don't see how this could be arranged without either disabling the
 >> ability to vacuum the toast table independently, or storing the xmin
 >> somewhere.

 Amit> I think we can use the same xmin for both main heap and toast

But you're completely missing the point here, which is that the main
heap might not be vacuumed AT ALL. Autovacuum can and will call vacuum()
with the relid of a toast table - this is a deliberate decision to cope
with access patterns that would otherwise bloat the toast table.

(It's also currently legal to do VACUUM pg_toast.pg_toast_nnnn;
manually, but that's of secondary importance compared to the fact that
autovacuum does it.)

 Amit> by computing it before scanning the main heap (lazy_vacuum_rel)
 Amit> and then pass it down. I have tried it and came up with the
 Amit> attached patch.

Your patch would actually be needed if (and only if) autovacuum was
changed back to its old behavior of never vacuuming toast tables
independently, and if manual VACUUM pg_toast.*; was disabled. But in the
presence of either of those two possibilities, it does nothing useful.

 >> Toast pointers don't point to TIDs fortunately, they point to OIDs.
 >> The OID could have been reused (if there's been an OID wraparound
 >> since the toast item was created), but that should be harmless: it
 >> means that we'd incorrectly copy the new value with the old tuple,
 >> but the old tuple is never going to be visible to anybody ever again
 >> so nothing will see that.

 Amit> Yeah, that's right, but it gives some uneasy feeling that we are
 Amit> attaching the wrong toast value. If you think there is some basic
 Amit> flaw in Approach-1 (as in attached patch) or it requires some
 Amit> major surgery then we can look further into this.

The basic flaw is that it doesn't even reach the problem.

-- 
Andrew (irc:RhodiumToad)


Re: Toast issues with OldestXmin going backwards

From
Andrew Gierth
Date:
>>>>> "Amit" == Amit Kapila <amit.kapila16@gmail.com> writes:

 Amit> Yeah, that's right, but it gives some uneasy feeling that we are
 Amit> attaching the wrong toast value.

Oh, forgot to stress this before: referencing the wrong toast value in
this case is something that can already happen, my patch does not affect
either the probability of it occurring or the consequences of it.

-- 
Andrew (irc:RhodiumToad)


Re: Toast issues with OldestXmin going backwards

From
Andrew Gierth
Date:
>>>>> "Andrew" == Andrew Gierth <andrew@tao11.riddles.org.uk> writes:

 Andrew> Since the tuple was vacuumable at that point, it will never
 Andrew> again become visible in any snapshot, so the only code that
 Andrew> should ever access its toast values is vacuum full/cluster,

and, unfortunately, CREATE INDEX (non-concurrently), including the index
build done by vacfull/cluster itself, which rather messes things up
because the toasted column might be a parameter to a functional index, or it
might itself be an indexed column (within a narrow size range for btree,
or more generally for other index types); this means the index AM or the
indexed function may try and actually detoast and inspect the value.

So right now it's possible (I think) in the worst case to crash the
server if you can set up exactly the right circumstances (re-use of a
prematurely vacuumed toast oid after wraparound, with the new value
matching the old value's raw length but not its compression state).

(My changes so far don't address this case and don't make it either more
or less likely)

-- 
Andrew (irc:RhodiumToad)


Re: Toast issues with OldestXmin going backwards

From
Amit Kapila
Date:
On Mon, Apr 23, 2018 at 1:55 AM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
>>>>>> "Amit" == Amit Kapila <amit.kapila16@gmail.com> writes:
>
>  Amit> by computing it before scanning the main heap (lazy_vacuum_rel)
>  Amit> and then pass it down. I have tried it and came up with the
>  Amit> attached patch.
>
> Your patch would actually be needed if (and only if) autovacuum was
> changed back to its old behavior of never vacuuming toast tables
> independently, and if manual VACUUM pg_toast.*; was disabled. But in the
> presence of either of those two possibilities, it does nothing useful.
>

Yeah, right, I have missed the point that they can be vacuumed
separately, however, I think that decision is somewhat questionable.
Basically, it doesn't appear logical to leave the sort-of dangling
toast pointer in heap.  I think this is somewhat akin to our current
index and heap binding where we never let heap itemid to go off till
all index entries pointing to it are taken care of.  I have gone
through the commit (3ccde312ec8ee47f5f797b070d34a675799448ae) which
appears to have decoupled this mechanism, but it doesn't provide any
reason for doing so.  I have also gone through the related thread [1]
which is not much help either.  I think there must be some patterns
where toast table bloat causes much more harm than heap as toast rows
are much larger in size.  Are you aware of what were the concrete
reasons behind this change?  I think it would have been better if
along with decoupling of vacuum for main heap and toast tables, we
would have come up with a way to selectively remove the corresponding
rows from the main heap, say by just vacuuming heap pages/rows which
have toast pointers but maybe that is not viable or involves much more
work without equivalent benefit.

Also, we can think along the lines of another idea suggested by Andres
[2] on the thread mentioned by you.


[1] - https://www.postgresql.org/message-id/20080808165809.GB3800%40alvh.no-ip.org
[2] - https://www.postgresql.org/message-id/20130204164341.GB22226%40awork2.anarazel.de

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Toast issues with OldestXmin going backwards

From
Andrew Gierth
Date:
>>>>> "Amit" == Amit Kapila <amit.kapila16@gmail.com> writes:

 >> Your patch would actually be needed if (and only if) autovacuum was
 >> changed back to its old behavior of never vacuuming toast tables
 >> independently, and if manual VACUUM pg_toast.*; was disabled. But in
 >> the presence of either of those two possibilities, it does nothing
 >> useful.

 Amit> Yeah, right, I have missed the point that they can be vacuumed
 Amit> separately, however, I think that decision is somewhat
 Amit> questionable.

Some previous discussion links for reference, for the background to the
thread containing the patch:

https://www.postgresql.org/message-id/flat/87y7gpiqx3.fsf%40oxford.xeocode.com
https://www.postgresql.org/message-id/flat/20080608230348.GD11028%40alvh.no-ip.org

 Amit> I think it would have been better if along with decoupling of
 Amit> vacuum for main heap and toast tables, we would have come up with
 Amit> a way to selectively remove the corresponding rows from the main
 Amit> heap, say by just vacuuming heap pages/rows which have toast
 Amit> pointers but maybe that is not viable or involves much more work
 Amit> without equivalent benefit.

It should be fairly obvious why this is unworkable - most toast-using
tables will have toast pointers on every page, but without making a
whole new index of toast pointer OIDs (unacceptable overhead), it would
be impossible to find the toast pointers pointing to a specific item
without searching the whole rel (in which case we might just as well
have vacuumed it).

 Amit> Also, we can think along the lines of another idea suggested by
 Amit> Andres [2] on the thread mentioned by you.

That one is tricky for various reasons (locking, order of operations in
vacuum_rel, having to mess with the API of vacuum(), etc.)

-- 
Andrew (irc:RhodiumToad)


Re: Toast issues with OldestXmin going backwards

From
Amit Kapila
Date:
On Mon, Apr 23, 2018 at 1:34 PM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
>>>>>> "Amit" == Amit Kapila <amit.kapila16@gmail.com> writes:
>
>  >> Your patch would actually be needed if (and only if) autovacuum was
>  >> changed back to its old behavior of never vacuuming toast tables
>  >> independently, and if manual VACUUM pg_toast.*; was disabled. But in
>  >> the presence of either of those two possibilities, it does nothing
>  >> useful.
>
>  Amit> Yeah, right, I have missed the point that they can be vacuumed
>  Amit> separately, however, I think that decision is somewhat
>  Amit> questionable.
>
> Some previous discussion links for reference, for the background to the
> thread containing the patch:
>
> https://www.postgresql.org/message-id/flat/87y7gpiqx3.fsf%40oxford.xeocode.com
> https://www.postgresql.org/message-id/flat/20080608230348.GD11028%40alvh.no-ip.org
>

If I read correctly, it seems one of the main reason [1] is to save
the extra pass over the heap and improve the code.  Now, ideally, the
extra pass over heap should also free up some space (occupied by the
rows that contains old toast pointers corresponding to which we are
going to remove rows from toast table), but it is quite possible that
it is already clean due to a previous separate vacuum pass over the
heap.  I think it is good to save extra pass over heap which might not
be as useful as we expect, but that can cost us correctness issues in
boundary cases as in the case being discussed in this thread.

>  Amit> I think it would have been better if along with decoupling of
>  Amit> vacuum for main heap and toast tables, we would have come up with
>  Amit> a way to selectively remove the corresponding rows from the main
>  Amit> heap, say by just vacuuming heap pages/rows which have toast
>  Amit> pointers but maybe that is not viable or involves much more work
>  Amit> without equivalent benefit.
>
> It should be fairly obvious why this is unworkable - most toast-using
> tables will have toast pointers on every page, but without making a
> whole new index of toast pointer OIDs (unacceptable overhead), it would
> be impossible to find the toast pointers pointing to a specific item
> without searching the whole rel (in which case we might just as well
> have vacuumed it).
>

Okay, such an optimization might not be much effective and it would
anyway lead to extra pass over the heap, however, that will resolve
the correctness issue.   Now, I understand that it is not advisable to
go back to the previous behavior for performance concerns, but I think
it would be better if we find a bullet-proof way to fix this symptom,
rather than just fixing the issue reported in this thread.

[1] - From one of the email: "We go certain lengths in autovacuum to
make sure tables are vacuumed when their toast table needs vacuuming
and the main table does not, which is all quite kludgy. So we already
look at their stats and make decisions about them. But what we do
after that is force a vacuum to the main table, even if that one does
not need any vacuuming, which is dumb."

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Toast issues with OldestXmin going backwards

From
Peter Eisentraut
Date:
On 4/23/18 00:33, Amit Kapila wrote:
> Yeah, right, I have missed the point that they can be vacuumed
> separately, however, I think that decision is somewhat questionable.

Manually vacuuming the TOAST table was a way to repair the recently
fixed TOAST bug, so it's kind of useful.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Toast issues with OldestXmin going backwards

From
Robert Haas
Date:
On Mon, Apr 23, 2018 at 11:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> If I read correctly, it seems one of the main reason [1] is to save
> the extra pass over the heap and improve the code.  Now, ideally, the
> extra pass over heap should also free up some space (occupied by the
> rows that contains old toast pointers corresponding to which we are
> going to remove rows from toast table), but it is quite possible that
> it is already clean due to a previous separate vacuum pass over the
> heap.  I think it is good to save extra pass over heap which might not
> be as useful as we expect, but that can cost us correctness issues in
> boundary cases as in the case being discussed in this thread.

I don't think anybody disagrees with that, but it doesn't answer the
question of how best to fix it.  Making VACUUM a lot more expensive
will not win us any awards, and an extra pass over the heap could be
very expensive.  You seem to think Andrew's fix isn't really
addressing the root of the problem, but I think that just depends on
how you define the problem.  If you define the problem as "the table
should never have dangling pointers to the TOAST table", then Andrew's
approach is only fixing the symptom.  But if you define the problem as
"we shouldn't try to follow TOAST pointers in heap tuples that are not
visible to any current or future MVCC snapshot", then it's fixing the
root issue.

I wonder if perhaps get_actual_variable_range() has a hazard of this
type as well.  If OldestXmin retreats, then the special snapshot type
which it uses could see a row whose TOAST entry has been removed
meanwhile.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Toast issues with OldestXmin going backwards

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I wonder if perhaps get_actual_variable_range() has a hazard of this
> type as well.  If OldestXmin retreats, then the special snapshot type
> which it uses could see a row whose TOAST entry has been removed
> meanwhile.

Hm, yeah.  I wonder if we could fix that by extracting the value from
the index rather than recomputing it from the heap tuple, as we do now.
We'd still have to visit the heap tuple to check liveness, but we'd
not need to extract anything from its data part.

Have we given up on the angle of "prevent OldestXmin from retreating"?
That seems like it'd be far more bulletproof than trying to work
around places that break one-by-one.

            regards, tom lane


Re: Toast issues with OldestXmin going backwards

From
Michael Paquier
Date:
On Thu, Apr 26, 2018 at 03:09:01PM -0400, Tom Lane wrote:
> Have we given up on the angle of "prevent OldestXmin from retreating"?
> That seems like it'd be far more bulletproof than trying to work
> around places that break one-by-one.

Strong +1 on that.
--
Michael

Attachment

Re: Toast issues with OldestXmin going backwards

From
Andrew Gierth
Date:
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

 Tom> Have we given up on the angle of "prevent OldestXmin from
 Tom> retreating"?

I haven't, and I'm thinking about it, but I don't have an immediate
solution.

Thinking-out-loud mode:

1) we could store minimum values for OldestXmin for each database in
   some kind of shared-memory structure. Downside is that we can't
   easily make this fixed-size, since we don't know how many DBs there
   are - we can't size it based on connections since we need to be able
   to track values for dbs with no currently active connections.

   (Or do we need to track it across restarts? maybe we do, to deal with
   replication slaves without slots, or changes in parameters)

2) in-place updates of a new column in pg_database? and a control file
   field for the global values? not back-patchable, but would it work
   going forward?

3) there was the previous suggestion to store it per-table in the main
   table's reloptions and inplace-update that; this isn't impossible,
   but it would be fiddly to do because toast-table-only vacuums would
   need to locate the main table, and lock order issues might get
   interesting. (Maybe it would be sneakier to store it in the toast
   table's otherwise-unused reloptions, and have vacuum/create index on
   the main table consult that? This assumes that we only care about the
   issue when dealing with toast tables)

4) something else??

-- 
Andrew (irc:RhodiumToad)


Re: Toast issues with OldestXmin going backwards

From
Andrew Gierth
Date:
>>>>> "Robert" == Robert Haas <robertmhaas@gmail.com> writes:

 Robert> I wonder if perhaps get_actual_variable_range() has a hazard of
 Robert> this type as well. If OldestXmin retreats, then the special
 Robert> snapshot type which it uses could see a row whose TOAST entry
 Robert> has been removed meanwhile.

Actually get_actual_variable_range may have a hazard even if OldestXmin
does not retreat; it uses RecentGlobalXmin, which is quite likely to be
older than an OldestXmin value used by a vacuum.

-- 
Andrew (irc:RhodiumToad)


Re: Toast issues with OldestXmin going backwards

From
Amit Kapila
Date:
On Fri, Apr 27, 2018 at 12:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Apr 23, 2018 at 11:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> If I read correctly, it seems one of the main reason [1] is to save
>> the extra pass over the heap and improve the code.  Now, ideally, the
>> extra pass over heap should also free up some space (occupied by the
>> rows that contains old toast pointers corresponding to which we are
>> going to remove rows from toast table), but it is quite possible that
>> it is already clean due to a previous separate vacuum pass over the
>> heap.  I think it is good to save extra pass over heap which might not
>> be as useful as we expect, but that can cost us correctness issues in
>> boundary cases as in the case being discussed in this thread.
>
> I don't think anybody disagrees with that, but it doesn't answer the
> question of how best to fix it.  Making VACUUM a lot more expensive
> will not win us any awards, and an extra pass over the heap could be
> very expensive.  You seem to think Andrew's fix isn't really
> addressing the root of the problem, but I think that just depends on
> how you define the problem.  If you define the problem as "the table
> should never have dangling pointers to the TOAST table", then Andrew's
> approach is only fixing the symptom.  But if you define the problem as
> "we shouldn't try to follow TOAST pointers in heap tuples that are not
> visible to any current or future MVCC snapshot", then it's fixing the
> root issue.
>

Hmm, in some rare cases when OID will be reused, it can still lead to
problems (when Create Index is performed in parallel to Cluster) as
mentioned by Andrew.  I am not sure if that is acceptable on the
grounds that it is not a newly introduced problem by his proposed
patch.  I think if we can come up with a solution which prevents
OldestXmin from retreating, it will be good, otherwise, we might have
to think harder whether there is any other solution to this problem
which will attack the root cause of the problem and if nothing better
comes out, then probably we can go with his proposed patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Toast issues with OldestXmin going backwards

From
Robert Haas
Date:
On Thu, Apr 26, 2018 at 9:03 PM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
>    (Or do we need to track it across restarts? maybe we do, to deal with
>    replication slaves without slots, or changes in parameters)

Yeah, I'm worried that it might need to be persistent across restarts.

One idea that occurred to me is to somehow record -- I guess in
pg_class using non-transactional updates -- the last cutoff XID used
to vacuum any given table.  Then we could just make a rule that you
can't vacuum the TOAST table with an XID that's newer than the last
one used for the main table.  That would preserve the property that
you can vacuum the tables separately while avoiding dangling pointers.
But that's obviously not back-patchable, and it sounds finicky to get
right.  It's really too bad that heap tables don't include a metapage
where we could store details like this...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Toast issues with OldestXmin going backwards

From
Andrew Gierth
Date:
>>>>> "Robert" == Robert Haas <robertmhaas@gmail.com> writes:

 Robert> One idea that occurred to me is to somehow record -- I guess in
 Robert> pg_class using non-transactional updates -- the last cutoff XID
 Robert> used to vacuum any given table. Then we could just make a rule
 Robert> that you can't vacuum the TOAST table with an XID that's newer
 Robert> than the last one used for the main table. That would preserve
 Robert> the property that you can vacuum the tables separately while
 Robert> avoiding dangling pointers. But that's obviously not
 Robert> back-patchable,

The suggestion made previously (in a historical thread) was to use an
entry in the reloptions field for this, at least in back branches. It
would be necessary for vacuum to add the entry initially in a normal
transactional update, after which it could be updated inplace.

-- 
Andrew (irc:RhodiumToad)


Re: Toast issues with OldestXmin going backwards

From
Robert Haas
Date:
On Fri, Apr 27, 2018 at 11:35 AM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
>>>>>> "Robert" == Robert Haas <robertmhaas@gmail.com> writes:
>  Robert> One idea that occurred to me is to somehow record -- I guess in
>  Robert> pg_class using non-transactional updates -- the last cutoff XID
>  Robert> used to vacuum any given table. Then we could just make a rule
>  Robert> that you can't vacuum the TOAST table with an XID that's newer
>  Robert> than the last one used for the main table. That would preserve
>  Robert> the property that you can vacuum the tables separately while
>  Robert> avoiding dangling pointers. But that's obviously not
>  Robert> back-patchable,
>
> The suggestion made previously (in a historical thread) was to use an
> entry in the reloptions field for this, at least in back branches. It
> would be necessary for vacuum to add the entry initially in a normal
> transactional update, after which it could be updated inplace.

Yeah, I suppose.  Sounds pretty rickety to me, though.  Maybe I'm just
a pessimist.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Toast issues with OldestXmin going backwards

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Fri, Apr 27, 2018 at 11:35 AM, Andrew Gierth
> <andrew@tao11.riddles.org.uk> wrote:
> >>>>>> "Robert" == Robert Haas <robertmhaas@gmail.com> writes:
> >  Robert> One idea that occurred to me is to somehow record -- I guess in
> >  Robert> pg_class using non-transactional updates -- the last cutoff XID
> >  Robert> used to vacuum any given table. Then we could just make a rule
> >  Robert> that you can't vacuum the TOAST table with an XID that's newer
> >  Robert> than the last one used for the main table. That would preserve
> >  Robert> the property that you can vacuum the tables separately while
> >  Robert> avoiding dangling pointers. But that's obviously not
> >  Robert> back-patchable,
> >
> > The suggestion made previously (in a historical thread) was to use an
> > entry in the reloptions field for this, at least in back branches. It
> > would be necessary for vacuum to add the entry initially in a normal
> > transactional update, after which it could be updated inplace.
>
> Yeah, I suppose.  Sounds pretty rickety to me, though.  Maybe I'm just
> a pessimist.

I tend to agree..  However, this isn't something that's been happening a
lot, from what I gather, and if we actually add a proper column into
pg_class for future versions (not really sure how I feel about if that
means "v11" or "v12" right now...) and reloptions for back-branches then
perhaps it's not so bad.

As far as a metadata page, it'd be pretty overkill, but maybe a fork for
it..?  I'm trying to think if there might be anything else we'd be able
to put into such a fork since adding another inode to every relation
that'll only ever likely be 8k definitely wouldn't win us any fans.  Not
sure, just brainstorming.

Thanks!

Stephen

Attachment

Re: Toast issues with OldestXmin going backwards

From
Andrew Gierth
Date:
>>>>> "Stephen" == Stephen Frost <sfrost@snowman.net> writes:

 Stephen> I tend to agree.. However, this isn't something that's been
 Stephen> happening a lot, from what I gather,

Right now I can't prove it happens in the wild at all - I think I've
ruled it out as an explanation for the original problem report - but
it's very easy to demonstrate in tests.

-- 
Andrew (irc:RhodiumToad)


Re: Toast issues with OldestXmin going backwards

From
Amit Kapila
Date:
On Fri, Apr 27, 2018 at 8:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Apr 26, 2018 at 9:03 PM, Andrew Gierth
> <andrew@tao11.riddles.org.uk> wrote:
>>    (Or do we need to track it across restarts? maybe we do, to deal with
>>    replication slaves without slots, or changes in parameters)
>
> Yeah, I'm worried that it might need to be persistent across restarts.
>
> One idea that occurred to me is to somehow record -- I guess in
> pg_class using non-transactional updates -- the last cutoff XID used
> to vacuum any given table.  Then we could just make a rule that you
> can't vacuum the TOAST table with an XID that's newer than the last
> one used for the main table.  That would preserve the property that
> you can vacuum the tables separately while avoiding dangling pointers.
>

Won't this lead to a bloat in toast tables when there is a big
difference between the cutoff XID of the main heap table and the
latest values of OldestXmin?  As both the tables can be vacuumed
separately, it doesn't sound impossible, but maybe in some later pass
on main heap and toast table will fix the situation.  Even if the
difference is not big, vacuum might still be not able to remove rows
from toast table in some situations.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Toast issues with OldestXmin going backwards

From
Andrew Gierth
Date:
>>>>> "Amit" == Amit Kapila <amit.kapila16@gmail.com> writes:

 >>> (Or do we need to track it across restarts? maybe we do, to deal with
 >>> replication slaves without slots, or changes in parameters)
 
 >> Yeah, I'm worried that it might need to be persistent across restarts.
 >> 
 >> One idea that occurred to me is to somehow record -- I guess in
 >> pg_class using non-transactional updates -- the last cutoff XID used
 >> to vacuum any given table.  Then we could just make a rule that you
 >> can't vacuum the TOAST table with an XID that's newer than the last
 >> one used for the main table.  That would preserve the property that
 >> you can vacuum the tables separately while avoiding dangling pointers.

 Amit> Won't this lead to a bloat in toast tables when there is a big
 Amit> difference between the cutoff XID of the main heap table and the
 Amit> latest values of OldestXmin?

Yes. What we need is actually the reverse of what Robert describes -
when we vacuum the _main_ table, we must use the _later_ of the
currently calculated OldestXmin or the OldestXmin last used to vacuum
the toast table.

-- 
Andrew (irc:RhodiumToad)


Re: Toast issues with OldestXmin going backwards

From
Amit Kapila
Date:
On Sun, Apr 29, 2018 at 11:56 AM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
>>>>>> "Amit" == Amit Kapila <amit.kapila16@gmail.com> writes:
>
>  >>> (Or do we need to track it across restarts? maybe we do, to deal with
>  >>> replication slaves without slots, or changes in parameters)
>
>  >> Yeah, I'm worried that it might need to be persistent across restarts.
>  >>
>  >> One idea that occurred to me is to somehow record -- I guess in
>  >> pg_class using non-transactional updates -- the last cutoff XID used
>  >> to vacuum any given table.  Then we could just make a rule that you
>  >> can't vacuum the TOAST table with an XID that's newer than the last
>  >> one used for the main table.  That would preserve the property that
>  >> you can vacuum the tables separately while avoiding dangling pointers.
>
>  Amit> Won't this lead to a bloat in toast tables when there is a big
>  Amit> difference between the cutoff XID of the main heap table and the
>  Amit> latest values of OldestXmin?
>
> Yes. What we need is actually the reverse of what Robert describes -
> when we vacuum the _main_ table, we must use the _later_ of the
> currently calculated OldestXmin or the OldestXmin last used to vacuum
> the toast table.
>

I think then that same formula needs to be used during cluster as
well.  Also what about get_actual_variable_range(), will also need
similar change, is it okay to add additional lookup of pg_class in
that code path?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com