Thread: Are ctid chaining loops safe without relation size checks?

Are ctid chaining loops safe without relation size checks?

From: Andres Freund

Hi,

While looking at [1] I was rephrasing this comment + check in
heap_get_latest_tid():

-     * Since this can be called with user-supplied TID, don't trust the input
-     * too much.  (RelationGetNumberOfBlocks is an expensive check, so we
-     * don't check t_ctid links again this way.  Note that it would not do to
-     * call it just once and save the result, either.)
      */
-    blk = ItemPointerGetBlockNumber(tid);
-    if (blk >= RelationGetNumberOfBlocks(relation))
-        elog(ERROR, "block number %u is out of range for relation \"%s\"",
-             blk, RelationGetRelationName(relation));

Which I dutifully rewrote. But I'm actually not sure it's safe at all
for heap to rely on t_ctid links to be valid. What prevents a ctid link
from pointing to a page that's since been truncated away?

And it's not just heap_get_latest_tid(), afaict. Just about every
ctid chaining loop ought to test the t_ctid link against the relation
size - otherwise it seems entirely possible to get
"could not read block %u in file \"%s\": %m" or
"could not read block %u in file \"%s\": read only 0 of %d bytes"
style errors, no?
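
To be concrete, the kind of per-hop check I have in mind would look
roughly like this (just a sketch, the helper name is made up - it's
essentially the check being removed above, except that it would have to
be repeated for every t_ctid hop, not just for the user-supplied
starting TID):

    /*
     * Hypothetical helper, not existing code: refuse to follow a t_ctid
     * link that points past the current end of the relation.  Note that
     * checking the size just once and caching it would not do - the
     * relation can be extended concurrently, so the size has to be
     * re-checked for every hop.
     */
    static void
    check_ctid_in_bounds(Relation relation, ItemPointer tid)
    {
        BlockNumber blk = ItemPointerGetBlockNumber(tid);

        if (blk >= RelationGetNumberOfBlocks(relation))
            elog(ERROR, "block number %u is out of range for relation \"%s\"",
                 blk, RelationGetRelationName(relation));
    }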

These loops are of such long-standing vintage that I feel like I must
be missing something.

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20190515185447.gno2jtqxyktylyvs%40alap3.anarazel.de



Re: Are ctid chaining loops safe without relation size checks?

From: Alvaro Herrera

On 2019-May-15, Andres Freund wrote:

> -    blk = ItemPointerGetBlockNumber(tid);
> -    if (blk >= RelationGetNumberOfBlocks(relation))
> -        elog(ERROR, "block number %u is out of range for relation \"%s\"",
> -             blk, RelationGetRelationName(relation));
> 
> Which I dutifully rewrote. But I'm actually not sure it's safe at all
> for heap to rely on t_ctid links to be valid. What prevents a ctid link
> from pointing to a page that's since been truncated away?

Umm .. IIUC all index entries for truncated pages should have been
removed prior to the truncation.  Otherwise, how would those index
entries not immediately become data corruption the instant the heap is
re-grown to cover those truncated pages?  So I think if the TID comes
directly from the user then this is a check worth doing, but if the TID
comes from an index, then it isn't.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Are ctid chaining loops safe without relation size checks?

From: Tom Lane

Andres Freund <andres@anarazel.de> writes:
> Which I dutifully rewrote. But I'm actually not sure it's safe at all
> for heap to rely on t_ctid links to be valid. What prevents a ctid link
> from pointing to a page that's since been truncated away?

Nothing, but when would the issue come up?  The updated tuple must be
newer than the one pointing at it, so if it's dead then the one pointing
at it must be too, no?

(If we're not checking liveness of x_max before following the link,
we'd have trouble ...)

            regards, tom lane



Re: Are ctid chaining loops safe without relation size checks?

From: Andres Freund

Hi,

On 2019-05-15 15:09:34 -0400, Tom Lane wrote:
> (If we're not checking liveness of x_max before following the link,
> we'd have trouble ...)

I don't think we do that everywhere - e.g. in the heap_get_latest_tid()
case that made me think about this, the only xmax-based loop
termination is this:

        /*
         * After following a t_ctid link, we might arrive at an unrelated
         * tuple.  Check for XMIN match.
         */
        if (TransactionIdIsValid(priorXmax) &&
            !TransactionIdEquals(priorXmax, HeapTupleHeaderGetXmin(tp.t_data)))
        {
            UnlockReleaseBuffer(buffer);
            break;
        }

but that's after we already followed the link and read the page (and it
obviously isn't a liveness check).

Additionally, note that heap_get_latest_tid() does *not* terminate when
it finds a visible tuple:

        /*
         * Check tuple visibility; if visible, set it as the new result
         * candidate.
         */
        valid = HeapTupleSatisfiesVisibility(&tp, snapshot, buffer);
        CheckForSerializableConflictOut(valid, relation, &tp, buffer, snapshot);
        if (valid)
            *tid = ctid;

it just continues to the next tuple version, if any.
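
For reference, IIRC the tail of the loop, i.e. the step that decides
whether to keep chaining, looks roughly like this (quoting from memory,
so approximate):

        /*
         * If there's a valid t_ctid link, follow it, else we're done.
         */
        if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
            HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
            ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
        {
            UnlockReleaseBuffer(buffer);
            break;
        }

        ctid = tp.t_data->t_ctid;
        priorXmax = HeapTupleHeaderGetUpdateXid(tp.t_data);
        UnlockReleaseBuffer(buffer);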

EvalPlanQualFetch() in <= 11, heapam_tuple_lock() in master, and
heap_lock_updated_tuple_rec() don't have the problem that xmax might
suddenly abort while following the chain, because they have code like:

    /*
     * If tuple is being updated by other transaction then we
     * have to wait for its commit/abort, or die trying.
     */
    if (TransactionIdIsValid(SnapshotDirty.xmax))
    {
        ReleaseBuffer(buffer);
        switch (wait_policy)
        {
            case LockWaitBlock:
                XactLockTableWait(SnapshotDirty.xmax,
                                  relation, &tuple->t_self,
                                  XLTW_FetchUpdated);
                break;
            case LockWaitSkip:
                if (!ConditionalXactLockTableWait(SnapshotDirty.xmax))
                    /* skip instead of waiting */
                    return TM_WouldBlock;
                break;
            case LockWaitError:
                if (!ConditionalXactLockTableWait(SnapshotDirty.xmax))
                    ereport(ERROR,
                            (errcode(ERRCODE_LOCK_NOT_AVAILABLE),
                             errmsg("could not obtain lock on row in relation \"%s\"",
                                    RelationGetRelationName(relation))));
                break;
        }
        continue;   /* loop back to repeat heap_fetch */
    }

but heap_get_latest_tid() doesn't have that logic.


So I think the problem is just that heap_get_latest_tid() is missing
this type of check, presumably because of this piece of intended
functionality:

 * Actually, this gets the latest version that is visible according to
 * the passed snapshot.  You can pass SnapshotDirty to get the very latest,
 * possibly uncommitted version.

which means that neither can it block when xmax is still running, nor
can it terminate when HeapTupleSatisfiesVisibility() returns true.
There's no core code using a !mvcc snapshot for this, however.




Because it's relevant for other work we've talked about for v13, and for
a potential fix, quoting this bit again:

> Andres Freund <andres@anarazel.de> writes:
> > Which I dutifully rewrote. But I'm actually not sure it's safe at all
> > for heap to rely on t_ctid links to be valid. What prevents a ctid link
> > from pointing to a page that's since been truncated away?
>
> Nothing, but when would the issue come up?  The updated tuple must be
> newer than the one pointing at it, so if it's dead then the one pointing
> at it must be too, no?

Well, the current tuple might not be dead, it might be
UPDATE_IN_PROGRESS when we start following the ctid chain. By the time
we get to the next tuple, that UPDATE might have rolled back, vacuum
might have come along and removed the new version of the tuple (which
then becomes DEAD, not RECENTLY_DEAD), and then truncated the relation.
Currently that's not possible in the nodeTidscan.c case, because we'll
have a lock preventing truncations. But if we were to allow truncations
without an AEL, that'd be different.


I went through a few possible ways to fix this:

1) Break out of the heap_get_latest_tid() loop if the ctid to be chained
   to points beyond the current relation size in blocks (roughly as
   sketched after this list). Such a tuple can't be visible by any
   definition except SnapshotAny, and we could just disallow that. As
   there's a lock on the relation heap_get_latest_tid() is operating on,
   we can rely on that value not getting too outdated.

   But I don't think that's correct, because the newest version of the
   tuple *actually* might be beyond the end of the table at the
   beginning of the scan - we *do* allow extension of the table while
   somebody holds a lock on the table after all.

   I also don't like adding more assumptions that depend on preventing
   truncations while any other lock is held.

2) Just disallow SnapshotDirty/SnapshotAny for heap_get_latest_tid(),
   and break out of the loop if the current tuple is visible (also
   sketched after this list). There can't be a newer visible version
   anyway, and as long as the input tid
   for heap_get_latest_tid() points to something the calling transaction
   could see (even if in an earlier snapshot), there has to be a visible
   version somewhere.

   That'd not fix the tid.c callers, but they're essentially unused and
   weird anyway.

   I'm not sure if WHERE CURRENT OF with a WITH HOLD cursor is possible
   / would be a problem. In that case we might need to add an
   XactLockTableWait too.

3) Declare these problems as esoteric, and don't care.
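
To make 1) and 2) a bit more concrete, here's roughly what I have in
mind for heap_get_latest_tid()'s loop (untested sketch, reusing the
variable names of the existing loop). For 1), at the top of each
iteration, before reading the next page:

        /*
         * 1) Give up if the link points past the current end of the
         *    relation.  No buffer is pinned at this point, so we can
         *    just break out.
         */
        if (ItemPointerGetBlockNumber(&ctid) >=
            RelationGetNumberOfBlocks(relation))
            break;

and, for 2), right after the existing visibility check:

        /*
         * 2) With only MVCC snapshots allowed, a visible version cannot
         *    have a newer visible successor, so we can stop chaining as
         *    soon as we've found one.
         */
        if (valid)
        {
            *tid = ctid;
            UnlockReleaseBuffer(buffer);
            break;
        }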

Greetings,

Andres Freund



Re: Are ctid chaining loops safe without relation size checks?

From: Andres Freund

Hi,

On 2019-05-15 15:07:13 -0400, Alvaro Herrera wrote:
> On 2019-May-15, Andres Freund wrote:
> 
> > -    blk = ItemPointerGetBlockNumber(tid);
> > -    if (blk >= RelationGetNumberOfBlocks(relation))
> > -        elog(ERROR, "block number %u is out of range for relation \"%s\"",
> > -             blk, RelationGetRelationName(relation));
> > 
> > Which I dutifully rewrote. But I'm actually not sure it's safe at all
> > for heap to rely on t_ctid links to be valid. What prevents a ctid link
> > from pointing to a page that's since been truncated away?
> 
> Umm .. IIUC all index entries for truncated pages should have been
> removed prior to the truncation.  Otherwise, how would those index
> entries not immediately become data corruption the instant the heap is
> re-grown to cover those truncated pages?  So I think if the TID comes
> directly from the user then this is a check worth doing, but if the TID
> comes from an index, then it isn't.

I'm not sure how indexes come into play here. For one, I don't think
heap_get_latest_tid() is called straight on a tuple returned from an
index scan. But also, I don't think that'd change much - it's not the
tid that's passed to heap_get_latest_tid() that's the problem, it's the
tuples it chains to via t_ctid.

Greetings,

Andres Freund