
From Alexander Korotkov
Subject Concurrency bug with vacuum full (cluster) and toast
Msg-id CAPpHfdu3PJUzHpQrvJ5RC9bEX_bZ6LwX52kBpb0EiD_9e3Np5g@mail.gmail.com
List pgsql-hackers
Hi all,

I've discovered a bug where vacuum full fails with an error because it
couldn't find toast chunks deleted by itself.  That happens because
cluster_rel() sets OldestXmin, but the toast accesses get their snapshot
later and independently.  That causes heap_page_prune_opt() to clean up
chunks which rebuild_relation() expects to exist.  This bug happens very
rarely, on busy systems which actively update toasted values.  But I
found a way to reliably reproduce it using a debugger.

*Setup*

CREATE FUNCTION random_string(seed integer, length integer) RETURNS text
    AS $$
        SELECT substr(
                    string_agg(
                        substr(
                            encode(
                                decode(
                                    md5(seed::text || '-' || i::text),
                                    'hex'),
                                'base64'),
                            1, 21),
                        ''),
                    1, length)
        FROM generate_series(1, (length + 20) / 21) i; $$
LANGUAGE SQL;
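
For reference, this helper just builds a deterministic pseudo-random string
of exactly the requested length out of md5 hashes, e.g.:

SELECT length(random_string(1, 100000));  -- 100000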

CREATE TABLE test (val text);
INSERT INTO test VALUES (random_string(1,100000));
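
The ~100 kB value is essentially incompressible, so it should be stored out
of line in the table's toast relation.  To double-check where it went (the
OID is just from my run below), something like this can be used:

SELECT reltoastrelid::regclass AS toast_rel FROM pg_class WHERE relname = 'test';
-- pg_toast.pg_toast_16387 here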

*Reproduction steps*

s1-s3 are three parallel PostgreSQL sessions.
s3lldb is lldb attached to the backend of s3.

First, s1 acquires a snapshot and holds it.

s1# begin transaction isolation level repeatable read;
s1# select 1;
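
While s1 holds this snapshot, its backend_xmin is what keeps the
soon-to-be-dead toast chunk versions from being pruned.  It can be observed
from any other session, for example:

SELECT pid, backend_xmin FROM pg_stat_activity WHERE backend_xmin IS NOT NULL;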

Then s2 updates our toasted value several times.

s2# update test set val = random_string(2,100000);
s2# update test set val = random_string(3,100000);
s2# update test set val = random_string(4,100000);
s2# update test set val = random_string(5,100000);
s2# update test set val = random_string(6,100000);
s2# update test set val = random_string(7,100000);
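
At this point the toast relation contains the chunks of the current value
plus the chunks of all previous versions, which can't be pruned yet because
of s1's snapshot.  With the pgstattuple extension installed, the buildup can
be seen (toast relation name taken from my run):

SELECT dead_tuple_count FROM pgstattuple('pg_toast.pg_toast_16387');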

Then s3 starts vacuum full, stopping at vacuum_set_xid_limits().

s3lldb# b vacuum_set_xid_limits
s3# vacuum full test;

We let vacuum_set_xid_limits() complete while s1 still holds its snapshot,
making sure the old tuple versions made by s2 will be considered recently
dead by vacuum full.

s3lldb# finish

Then s1 releases its snapshot.  Now heap_page_prune_opt(), called during
the toast accesses, will clean up the toast chunks which vacuum full
expects to still be there as recently dead.

s1# commit;

Finally, we continue our vacuum full and get the error!

s3lldb# continue
s3#
ERROR:  unexpected chunk number 50 (expected 2) for toast value 16429
in pg_toast_16387

The attached patch contains a dirty fix for this bug, which just prevents
heap_page_prune_opt() from cleaning tuples when it's called from
rebuild_relation().  It's not something I'm proposing to commit or even
review; it might just be a starting point for thoughts.

Any ideas?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

