Visibility bug in tuple lock - Mailing list pgsql-hackers

From Jasper Smit
Subject Visibility bug in tuple lock
Msg-id CAOG+RQ74x0q=kgBBQ=mezuvOeZBfSxM1qu_o0V28bwDz3dHxLw@mail.gmail.com
List pgsql-hackers
Hi,

We found a bug in heap_lock_tuple(): it is unsafe to follow the t_ctid
pointer of an updated tuple in the presence of aborted tuples. VACUUM
can mark aborted tuples LP_UNUSED even while visible tuples still have
a t_ctid pointing at them.

We would like to address this issue in Postgres. Attached is a patch
that modifies VACUUM's behavior. Could you advise us on the best way to
proceed? How can we get this patch included in Postgres, or would you
recommend pursuing an alternative approach?

The following scenario leads to an assertion failure in a debug build,
or to wrong results in a release build.

1. (session 1) A tuple is being UPDATEd in an active transaction.
2. (session 2) Another session locks that tuple, for example with
   SELECT ... FOR UPDATE.
3. heap_lock_tuple() calls HeapTupleSatisfiesUpdate(), which returns
   TM_BeingModified. It records the xmax and t_ctid of the tuple.
4. Since the tuple is actively being updated, session 2 has to wait for
   session 1 to complete.
5. The UPDATE of session 1 aborts, leaving the original tuple with an
   aborted xmax and a t_ctid pointing to a tuple with an aborted xmin.
6. VACUUM marks aborted tuples as unused, regardless of whether they
   are still referenced by a visible tuple through t_ctid. As a result,
   the aborted tuple is marked unused.
7. An INSERT happens and the new tuple is placed at the recycled
   location of the aborted tuple.
8. The newly INSERTed tuple is UPDATEd.
9. heap_lock_tuple() of session 2 resumes and next calls
   heap_lock_updated_tuple() using the t_ctid recorded in step 3. Note
   that the code does not check the status of the transaction it was
   waiting on; it proceeds regardless of whether that transaction
   committed or aborted (see the sketch after this list).
10. heap_lock_updated_tuple() now operates on the tuple inserted in
    step 7. It sees that this tuple has been updated, so the return
    value of heap_lock_updated_tuple() is TM_Updated.
11. This eventually leads to the assertion failure
    (result == TM_WouldBlock) || !(tuple->t_data->t_infomask & HEAP_XMAX_INVALID)

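To make the missing check in step 9 concrete, here is a rough sketch of
the kind of re-check we have in mind once the waiting session resumes,
before it follows the t_ctid recorded in step 3. This is only an
illustration, not a patch: xwait, relation, tuple, ctid, mode and result
stand for heap_lock_tuple()'s local state at that point.

/* Session 2 resumes here after waiting for the updater to finish. */
XactLockTableWait(xwait, relation, &tuple->t_self, XLTW_Lock);

if (!TransactionIdDidCommit(xwait))
{
    /*
     * The updater aborted (or crashed): the update never took effect,
     * and the ctid recorded before the wait may now point at a line
     * pointer that VACUUM has recycled.  Do not chase it; there is no
     * newer version to lock.
     */
    result = TM_Ok;
}
else
{
    /* The update really committed, so following the chain is safe. */
    result = heap_lock_updated_tuple(relation, tuple, &ctid,
                                     GetCurrentTransactionId(), mode);
}
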
These concrete steps reproduce the problem:

Version used: REL_18_0
Run with this configuration setting:
io_combine_limit = '1'

session 1:
drop table if exists t;
create table t (id int);
insert into t (select generate_series(1, 452));
begin;
update t set id = 1000 where id = 1;

session 2:
gdb: b heapam.c:5033
gdb: continue
select * from t where id = 1 for update;

session 1:
abort;
select * from t;
VACUUM (TRUNCATE off);
insert into t values (453) returning ctid; -- Should be (2,1)
update t set id = 454 where id = 453 returning ctid;

session 2: (should now be at the breakpoint)
gdb: continue

We get the assertion failure:
(result == TM_WouldBlock) || !(tuple->t_data->t_infomask & HEAP_XMAX_INVALID).
A stack trace is included as an attachment.

We have also seen the error "t_xmin %u is uncommitted in tuple (%u,%u)
to be updated in table \"%s\"" in our testing. We believe that it has a
similar cause, but we have not been able to come up with reproduction
steps for that error specifically.

We think that this problem can be fixed in two different ways.
1. Check the commit status of the transaction after waiting in
   heap_lock_updated_tuple(), and do not proceed to examine the updated
   tuple when that transaction has aborted (as in the sketch after the
   scenario above). Alternatively, check, as heap_get_latest_tid() does,
   whether the xmin of the updated tuple matches the xmax of the old
   tuple (sketched below).
2. Change VACUUM to delay marking aborted tuples unused until no visible
   tuple can still have a t_ctid pointing to that tuple.
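
For the heap_get_latest_tid()-style variant of 1., that function already
guards against broken update chains with a test of roughly this shape;
something similar could be applied to the tuple reached through the
recorded t_ctid. Again this is only a sketch: priorXmax stands for the
xmax remembered from the old tuple and mytup for the tuple found at the
recorded ctid.

/*
 * Sketch: before treating the tuple found at the recorded t_ctid as the
 * next version in the update chain, verify that its xmin matches the
 * xmax remembered from the old tuple.  If it does not, the line pointer
 * has been recycled for an unrelated tuple and the chain must not be
 * followed any further.
 */
if (TransactionIdIsValid(priorXmax) &&
    !TransactionIdEquals(priorXmax, HeapTupleHeaderGetXmin(mytup.t_data)))
{
    /* Unrelated tuple: stop here instead of reporting it as updated. */
    return TM_Ok;
}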

For option 2 we have attached a patch to this e-mail. The patch alters
the visibility check during VACUUM so that aborted tuples are treated
similarly to RECENTLY_DEAD tuples: they are not recycled immediately;
instead, the xmin is used to decide whether they can be marked unused.
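
In rough terms the idea is the following. This is only a simplified
sketch of the approach, not the attached patch itself; it imagines the
branch of the vacuum visibility check (e.g.
HeapTupleSatisfiesVacuumHorizon(), whose output parameter is dead_after)
that handles an xmin that did not commit, with illustrative variable
names.

/*
 * Sketch: instead of returning HEAPTUPLE_DEAD as soon as xmin is known
 * to have aborted, report the tuple as recently dead and hand the
 * aborted xmin back to the caller.  The caller compares that xid
 * against its removal horizon, the intent being that the slot is only
 * marked unused once no visible tuple can still have a t_ctid pointing
 * at it.
 */
if (!TransactionIdDidCommit(xmin))
{
    *dead_after = xmin;
    return HEAPTUPLE_RECENTLY_DEAD;
}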

Regards,

Jasper Smit
Oleksii Kozlov
Luc Vlaming
Spiros Agathos
ServiceNow

