Thread: BUG #17268: Possible corruption in toast index after reindex index concurrently

BUG #17268: Possible corruption in toast index after reindex index concurrently

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17268
Logged by:          Alexey Ermakov
Email address:      alexey.ermakov@dataegret.com
PostgreSQL version: 13.4
Operating system:   Ubuntu 18.04.5 LTS
Description:

Hello,

after running this statement:
2021-11-02 13:15:45.469 UTC 2379 postgres@xx from [local] [vxid:24/0 txid:0]
[REINDEX] LOG:  duration: 1719120.441 ms  statement: REINDEX INDEX
CONCURRENTLY pg_toast.pg_toast_2624976286_index

we got the following errors (here is the first one; the others look exactly the
same, with the same "toast value"):
2021-11-02 20:37:12.305 UTC 27935 yy@xx from [local] [vxid:41/199242673
txid:0] [SELECT] ERROR:  missing chunk number 0 for toast value 4040061139
in pg_toast_2624976286

it turns out one entry in toast index (?) was corrupted:
select  md5(body) from zz where id = ...;
ERROR:  missing chunk number 0 for toast value 4040061139 in
pg_toast_2624976286

According to "created_at" column in linked table row was created at
"2021-11-02 13:04:22.192125", i.e. during reindex concurrently.

I'm wondering if this is a known bug, and how risky it could be to reindex
TOAST indexes. The reindex was run automatically by a tool which monitors
index bloat; the index size was reduced several-fold in this case.

It's possible to get additional details from a backup and the WAL archive, but
it will take some time to restore (the row with the corrupted data has been
removed).

--
Thanks,
Alexey Ermakov


Re: BUG #17268: Possible corruption in toast index after reindex index concurrently

From
Peter Geoghegan
Date:
On Wed, Nov 3, 2021 at 3:26 AM PG Bug reporting form
<noreply@postgresql.org> wrote:
> it turns out one entry in toast index (?) was corrupted:
> select  md5(body) from zz where id = ...;
> ERROR:  missing chunk number 0 for toast value 4040061139 in
> pg_toast_2624976286
>
> According to the "created_at" column in the linked table, the row was created
> at "2021-11-02 13:04:22.192125", i.e. during the REINDEX CONCURRENTLY.

I wonder if it's a coincidence that that number (~4.04 billion) is not
that far from 2^32-1 (~4.294 billion).

Can you run amcheck? Perhaps the output of the following will be interesting:

create extension amcheck;
set client_min_messages=debug1;
select bt_index_check('pg_toast.pg_toast_2624976286_index', true);

(Couldn't hurt to try it, at least.)

> I'm wondering if this is a known bug, and how risky it could be to reindex
> TOAST indexes. The reindex was run automatically by a tool which monitors
> index bloat; the index size was reduced several-fold in this case.

If I had to guess, I'd guess that this is a new and unknown bug.

-- 
Peter Geoghegan



On Thu, Nov 4, 2021 at 3:07 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Nov 3, 2021 at 3:26 AM PG Bug reporting form
> <noreply@postgresql.org> wrote:
> > it turns out one entry in toast index (?) was corrupted:
> > select  md5(body) from zz where id = ...;
> > ERROR:  missing chunk number 0 for toast value 4040061139 in
> > pg_toast_2624976286
> >
> > According to the "created_at" column in the linked table, the row was created
> > at "2021-11-02 13:04:22.192125", i.e. during the REINDEX CONCURRENTLY.
>
> I wonder if it's a coincidence that that number (~4.04 billion) is not
> that far from 2^32-1 (~4.294 billion).
>
> Can you run amcheck? Perhaps the output of the following will be interesting:
>
> create extension amcheck;
> set client_min_messages=debug1;
> select bt_index_check('pg_toast.pg_toast_2624976286_index', true);
>
> (Couldn't hurt to try it, at least.)

Hi Peter,

Please see the result of amcheck:

select bt_index_check('pg_toast.pg_toast_2624976286_index', true);
DEBUG:  verifying consistency of tree structure for index
"pg_toast_2624976286_index"
DEBUG:  verifying level 3 (true root level)
DEBUG:  verifying level 2
DEBUG:  verifying level 1
DEBUG:  verifying level 0 (leaf level)
DEBUG:  leaf block 715360 of index "pg_toast_2624976286_index" has no
first data item
DEBUG:  verifying that tuples from index "pg_toast_2624976286_index"
are present in "pg_toast_2624976286"
ERROR:  heap tuple (59561917,1) from table "pg_toast_2624976286" lacks
matching index tuple within index "pg_toast_2624976286_index"
HINT:  Retrying verification using the function
bt_index_parent_check() might provide a more specific error.



--
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/




Re: BUG #17268: Possible corruption in toast index after reindex index concurrently

From
Peter Geoghegan
Date:
On Thu, Nov 4, 2021 at 11:08 AM Maxim Boguk <maxim.boguk@gmail.com> wrote:
> select bt_index_check('pg_toast.pg_toast_2624976286_index', true);
> DEBUG:  verifying consistency of tree structure for index
> "pg_toast_2624976286_index"
> DEBUG:  verifying level 3 (true root level)
> DEBUG:  verifying level 2
> DEBUG:  verifying level 1
> DEBUG:  verifying level 0 (leaf level)
> DEBUG:  leaf block 715360 of index "pg_toast_2624976286_index" has no
> first data item
> DEBUG:  verifying that tuples from index "pg_toast_2624976286_index"
> are present in "pg_toast_2624976286"
> ERROR:  heap tuple (59561917,1) from table "pg_toast_2624976286" lacks
> matching index tuple within index "pg_toast_2624976286_index"
> HINT:  Retrying verification using the function
> bt_index_parent_check() might provide a more specific error.

That's an unusually large TOAST table. It's at least ~454.42GiB, based
on this error. Is the block number 59561917 near the end of the table?

Structural verification of the B-Tree itself doesn't throw any errors
here. This leaves little doubt that the problem is that the index is
missing entries that it really should have. That's not really new
information, though -- I was pretty sure of that much already.

-- 
Peter Geoghegan



On Thu, Nov 4, 2021 at 8:18 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Thu, Nov 4, 2021 at 11:08 AM Maxim Boguk <maxim.boguk@gmail.com> wrote:
> > select bt_index_check('pg_toast.pg_toast_2624976286_index', true);
> > DEBUG:  verifying consistency of tree structure for index
> > "pg_toast_2624976286_index"
> > DEBUG:  verifying level 3 (true root level)
> > DEBUG:  verifying level 2
> > DEBUG:  verifying level 1
> > DEBUG:  verifying level 0 (leaf level)
> > DEBUG:  leaf block 715360 of index "pg_toast_2624976286_index" has no
> > first data item
> > DEBUG:  verifying that tuples from index "pg_toast_2624976286_index"
> > are present in "pg_toast_2624976286"
> > ERROR:  heap tuple (59561917,1) from table "pg_toast_2624976286" lacks
> > matching index tuple within index "pg_toast_2624976286_index"
> > HINT:  Retrying verification using the function
> > bt_index_parent_check() might provide a more specific error.
>
> That's an unusually large TOAST table. It's at least ~454.42GiB, based
> on this error. Is the block number 59561917 near the end of the table?

select pg_size_pretty(pg_relation_size('pg_toast.pg_toast_2624976286'));
 pg_size_pretty
----------------
 473 GB
now... and yes, at the time of the error page 59561917 was very close
to the end of the table.
There is a high chance (but not 100%) that the corresponding main
table entry was inserted during the REINDEX CONCURRENTLY run on the
toast index.

We have a base backup and the WAL archive, so theoretically it's possible to
reconstruct the sequence of writes which led to the error, but given the huge
size of the relation in question (and the even bigger size of the whole
database, 10+ TB) and the large volume of writes, it's a complicated task
(especially since I'm not really sure what exactly to look for in the
pg_waldump output).
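
A rough sketch of what such a search might look like with pg_waldump, assuming
the relevant WAL segments were restored locally; the path and LSN range below
are placeholders, not values from this incident:

# Dump Heap records for the suspected window and look for the affected block;
# the pg_wal path and start/end LSNs are illustrative placeholders.
pg_waldump -p /restore/pg_wal -s '1234/00000000' -e '1235/00000000' -r Heap \
    | grep -w 59561917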


--
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/




On Thu, Nov 4, 2021 at 9:47 PM Maxim Boguk <maxim.boguk@gmail.com> wrote:
>
> On Thu, Nov 4, 2021 at 8:18 PM Peter Geoghegan <pg@bowt.ie> wrote:
> >
> > On Thu, Nov 4, 2021 at 11:08 AM Maxim Boguk <maxim.boguk@gmail.com> wrote:
> > > select bt_index_check('pg_toast.pg_toast_2624976286_index', true);
> > > DEBUG:  verifying consistency of tree structure for index
> > > "pg_toast_2624976286_index"
> > > DEBUG:  verifying level 3 (true root level)
> > > DEBUG:  verifying level 2
> > > DEBUG:  verifying level 1
> > > DEBUG:  verifying level 0 (leaf level)
> > > DEBUG:  leaf block 715360 of index "pg_toast_2624976286_index" has no
> > > first data item
> > > DEBUG:  verifying that tuples from index "pg_toast_2624976286_index"
> > > are present in "pg_toast_2624976286"
> > > ERROR:  heap tuple (59561917,1) from table "pg_toast_2624976286" lacks
> > > matching index tuple within index "pg_toast_2624976286_index"
> > > HINT:  Retrying verification using the function
> > > bt_index_parent_check() might provide a more specific error.
> >
> > That's an unusually large TOAST table. It's at least ~454.42GiB, based
> > on this error. Is the block number 59561917 near the end of the table?
>
> select pg_size_pretty(pg_relation_size('pg_toast.pg_toast_2624976286'));
>  pg_size_pretty
> ----------------
>  473 GB
> now... and yes, at the time of the error page 59561917 was very close
> to the end of the table.
> There is a high chance (but not 100%) that the corresponding main
> table entry was inserted during the REINDEX CONCURRENTLY run on the
> toast index.

UPDATE: the problem base table entry which had erroneous toast data
definitely had been created during the REINDEX run of toast idx.

2021-11-02 13:15:45.469 UTC 2379 postgres@*** from [local] [vxid:24/0
txid:0] [REINDEX] LOG:  duration: 1719120.441 ms  statement: REINDEX
INDEX CONCURRENTLY pg_toast.pg_toast_2624976286_index
and the problem entry had been created at 2021-11-02 13:04:22.192125 UTC.

So there seems to be some subtle bug with indexing new toast data during
REINDEX INDEX CONCURRENTLY of the toast index.
--
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/




Re: BUG #17268: Possible corruption in toast index after reindex index concurrently

From
Peter Geoghegan
Date:
On Thu, Nov 4, 2021 at 12:47 PM Maxim Boguk <maxim.boguk@gmail.com> wrote:
> now... and yes, at the time of the error page 59561917 was very close
> to the end of the table.
> There is a high chance (but not 100%) that the corresponding main
> table entry was inserted during the REINDEX CONCURRENTLY run on the
> toast index.

It might be useful if you located the leaf page that the missing index
tuple is supposed to be on. It's possible that there is a recognizable
pattern. If you knew the block number of the relevant leaf page in the
index already, then you could easily dump the relevant page to a small
file, and share it with us here. The usual procedure is described
here:


https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#contrib.2Fpageinspect_page_dump
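
Once the block number is known, the dump step itself is small; a minimal
sketch, assuming the pageinspect extension is available (715360, the block
amcheck flagged earlier, is used here purely as an example):

-- Sketch only: write one index page out as hex, following the wiki procedure.
CREATE EXTENSION IF NOT EXISTS pageinspect;
\copy (SELECT encode(get_raw_page('pg_toast.pg_toast_2624976286_index', 715360), 'hex')) TO '/tmp/toast_index_page.hex'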

The tricky part is figuring out which block number is the one of
interest. I can't think of any easy way of doing that in a production
database. The easiest approach I can think of is to use
pg_buffercache. Restart the database (or more likely an instance of
the database that has the problem, but isn't the live production
database). Then write a query like this:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM pg_toast_2624976286 WHERE
chunk_id = 4040061139;

(The BUFFERS stuff is to verify that you got buffer hits.)

Next query the blocks that you see in pg_buffercache:

SELECT relblocknumber FROM pg_buffercache WHERE relfilenode =
pg_relation_filenode('pg_toast.pg_toast_2624976286_index');

Finally, note down the block numbers that the query returns. There
will probably be 3 or 4. Just send me any that are non-0 (that's just
the metapage). I only really care about the leaf page, but I can
figure that part out myself when I have the pages you access.

(It's a pity there isn't a less cumbersome procedure here.)

-- 
Peter Geoghegan



Hi,

On November 4, 2021 1:01:44 PM PDT, Maxim Boguk <maxim.boguk@gmail.com> wrote:
>UPDATE: the problem base table entry which had erroneous toast data
>definitely had been created during the REINDEX run of toast idx.

Do you have any details about the transaction / application that inserted the data?


>2021-11-02 13:15:45.469 UTC 2379 postgres@*** from [local] [vxid:24/0
>txid:0] [REINDEX] LOG:  duration: 1719120.441 ms  statement: REINDEX
>INDEX CONCURRENTLY pg_toast.pg_toast_2624976286_index
>and the problem entry had been created at 2021-11-02 13:04:22.192125 UTC.
>
>So there seems to be some subtle bug with indexing new toast data during
>REINDEX INDEX CONCURRENTLY of the toast index.

Any chance you're using prepared transactions?

Regards,

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



On Sun, Nov 7, 2021 at 2:07 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On November 4, 2021 1:01:44 PM PDT, Maxim Boguk <maxim.boguk@gmail.com> wrote:
> >UPDATE: the problem base table entry which had erroneous toast data
> >definitely had been created during the REINDEX run of toast idx.
>
> Do you have any details about the transaction / application that inserted the data?

Well... jdbc with server side prepared statements off (pgbouncer).


> >2021-11-02 13:15:45.469 UTC 2379 postgres@*** from [local] [vxid:24/0
> >txid:0] [REINDEX] LOG:  duration: 1719120.441 ms  statement: REINDEX
> >INDEX CONCURRENTLY pg_toast.pg_toast_2624976286_index
> >and the problem entry had been created at 2021-11-02 13:04:22.192125 UTC.
> >
> >So there seems to be some subtle bug with indexing new toast data during
> >REINDEX INDEX CONCURRENTLY of the toast index.
>
> Any chance you're using prepared transactions?

Most definitely not; prepared transactions are disabled server-wide:
postgres=# show max_prepared_transactions ;
 max_prepared_transactions
---------------------------
 0


--
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/




Re: BUG #17268: Possible corruption in toast index after reindex index concurrently

From
Alexey Ermakov
Date:
Hello,

On 2021-11-08 02:29, Maxim Boguk wrote:
> On Sun, Nov 7, 2021 at 2:07 AM Andres Freund <andres@anarazel.de> wrote:
>> Hi,
>>
>> On November 4, 2021 1:01:44 PM PDT, Maxim Boguk <maxim.boguk@gmail.com> wrote:
>>> UPDATE: the problem base table entry which had erroneous toast data
>>> definitely had been created during the REINDEX run of toast idx.
>> Do you have any details about the transaction / application that inserted the data?
> Well... jdbc with server side prepared statements off (pgbouncer).
>

I was told that the application inserts rows into the given table with queries like:

INSERT INTO zz (id, body) VALUES (?, ?) ON CONFLICT (id) DO NOTHING

inside a transaction. It's possible for the same id to be inserted several
times within one transaction (in that case that id is usually a new one, i.e.
it was not present in the table before that transaction).
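
A minimal sketch of that pattern (the values are illustrative, not taken from
the application):

BEGIN;
-- first attempt inserts the row (and its TOAST chunks, if body is large enough)
INSERT INTO zz (id, body) VALUES (1001, '...large body...') ON CONFLICT (id) DO NOTHING;
-- a second attempt with the same id inside the same transaction is a no-op
INSERT INTO zz (id, body) VALUES (1001, '...large body...') ON CONFLICT (id) DO NOTHING;
COMMIT;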

Hope it could help in some way.


--

Thanks,

Alexey Ermakov




Hi,

On 2021-11-04 20:07:28 +0200, Maxim Boguk wrote:
> ERROR:  heap tuple (59561917,1) from table "pg_toast_2624976286" lacks
> matching index tuple within index "pg_toast_2624976286_index"
> HINT:  Retrying verification using the function
> bt_index_parent_check() might provide a more specific error.

Could you use https://www.postgresql.org/docs/devel/pageinspect.html
to get the heap items for this page? Best after verifying that that ctid still
shows an amcheck error. Something like

SELECT * FROM page_header(get_raw_page('pg_toast_2624976286', 59561917));
SELECT lp, lp_off, lp_flags, lp_len, t_xmin, t_xmax, t_field3, t_ctid,
       t_infomask2, t_infomask, mask.raw_flags, mask.combined_flags,
       t_hoff, t_bits
FROM heap_page_items(get_raw_page('pg_toast_2624976286', 59561917)),
     heap_tuple_infomask_flags(t_infomask, t_infomask2) AS mask;

Do you have WAL archiving or such set up? If you still have the WAL from that
time it'd be helpful.

Greetings,

Andres Freund



Hi,

On 2021-11-04 22:01:44 +0200, Maxim Boguk wrote:
> 2021-11-02 13:15:45.469 UTC 2379 postgres@*** from [local] [vxid:24/0
> txid:0] [REINDEX] LOG:  duration: 1719120.441 ms  statement: REINDEX
> INDEX CONCURRENTLY pg_toast.pg_toast_2624976286_index
> and the problem entry had been created at 2021-11-02 13:04:22.192125 UTC.

Another question on this. Are you running multiple reindexes concurrently? Other
maintenance commands?

Greetings,

Andres Freund



Hi,

On 2021-11-08 17:04:40 +0600, Alexey Ermakov wrote:
> I was told that the application inserts rows into the given table with queries like:
> 
> INSERT INTO zz (id, body) VALUES (?, ?) ON CONFLICT (id) DO NOTHING
> 
> inside a transaction. It's possible for the same id to be inserted several
> times within one transaction (in that case that id is usually a new one, i.e.
> it was not present in the table before that transaction).
> 
> Hope it could help in some way.

It did. I've managed to reproduce an issue like this. I'm still narrowing down
the reproducer, but here's what I have so far:

DROP TABLE IF EXISTS wide; CREATE TABLE wide(id serial primary key, wide text);
pgbench -n -Mprepared -T1000 -f ~/tmp/insert_cci_bug.sql -c 16 -j 16 postgres

in concurrent psql
SELECT indrelid::regclass, indexrelid::regclass, bt_index_parent_check(indexrelid::regclass, true)
  FROM pg_index
  WHERE indrelid = (SELECT reltoastrelid FROM pg_class WHERE oid = 'wide'::regclass);
REINDEX INDEX CONCURRENTLY pg_toast.pg_toast_331092_index;

usually after 1-3 iterations the index is corrupt. Reproduces both in 13 and
HEAD.
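
For completeness, a plausible content for the insert_cci_bug.sql script
referenced above; this is an assumption based on the single statement shown in
the follow-up message, not the script as actually posted:

-- assumed content of ~/tmp/insert_cci_bug.sql: insert a value wide enough to be TOASTed
INSERT INTO wide(wide)
  SELECT string_agg(g.i::text || random()::text, '')
  FROM generate_series(1, 500) g(i);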

Greetings,

Andres Freund

Hi,

On 2021-11-08 10:36:46 -0800, Andres Freund wrote:
> DROP TABLE IF EXISTS wide; CREATE TABLE wide(id serial primary key, wide text);
> pgbench -n -Mprepared -T1000 -f ~/tmp/insert_cci_bug.sql -c 16 -j 16 postgres
> 
> in concurrent psql
> SELECT indrelid::regclass, indexrelid::regclass, bt_index_parent_check(indexrelid::regclass, true)
>   FROM pg_index
>   WHERE indrelid = (SELECT reltoastrelid FROM pg_class WHERE oid = 'wide'::regclass);
> REINDEX INDEX CONCURRENTLY pg_toast.pg_toast_331092_index;
> 
> usually after 1-3 iterations the index is corrupt. Reproduces both in 13 and
> HEAD.

Oh, wow (or ugh). It narrows down quite a bit. A single pgbench session
running
INSERT INTO wide(wide) SELECT string_agg(g.i::text || random()::text, '') FROM generate_series(1, 500) g(i);

triggers the issue. Sometimes takes quite a few iterations of REINDEX INDEX
CONCURRENTLY, but...

Greetings,

Andres Freund

PS:
to trigger the reindex / validate I use

cat /tmp/ri.sql
\i /tmp/ri.sql
SELECT 'REINDEX INDEX CONCURRENTLY '||indexrelid::regclass
  FROM pg_index
  WHERE indrelid = (SELECT reltoastrelid FROM pg_class WHERE oid = 'wide'::regclass) \gexec

SELECT indrelid::regclass, indexrelid::regclass, bt_index_parent_check(indexrelid::regclass, true)
  FROM pg_index
  WHERE indrelid = (SELECT reltoastrelid FROM pg_class WHERE oid = 'wide'::regclass);
 



Hi,

On 2021-11-08 10:53:17 -0800, Andres Freund wrote:
> Oh, wow (or ugh). It narrows down quite a bit. A single pgbench session
> running
> INSERT INTO wide(wide) SELECT string_agg(g.i::text || random()::text, '') FROM generate_series(1, 500) g(i);
> 
> triggers the issue. Sometimes takes quite a few iterations of REINDEX INDEX
> CONCURRENTLY, but...

After several wrong turns, I found the cause of the problem:

The problem is that toast_save_datum() releases the lock on the toast relation
directly after inserting. That breaks the interlock between
WaitForLockersMultiple() in ReindexRelationConcurrently() and the "toast
inserter". The comments in ReindexRelationConcurrently() don't meaningfully
explain the need for that interlock, but the equivalent case in DefineIndex()
does:

    /*
     * Phase 2 of concurrent index build (see comments for validate_index()
     * for an overview of how this works)
     *
     * Now we must wait until no running transaction could have the table open
     * with the old list of indexes.  Use ShareLock to consider running
     * transactions that hold locks that permit writing to the table.  Note we
     * do not need to worry about xacts that open the table for writing after
     * this point; they will see the new index when they open it.
     *
     * Note: the reason we use actual lock acquisition here, rather than just
     * checking the ProcArray and sleeping, is that deadlock is possible if
     * one of the transactions in question is blocked trying to acquire an
     * exclusive lock on our table.  The lock code will detect deadlock and
     * error out properly.
     */
    WaitForLockers(heaplocktag, ShareLock, true);


The problem that this causes is that the toast-inserting backend will not
necessarily have seen the new index at this point (since there was no need to
process relcache invalidations), but the scan below validate_index() will not
necessarily find new tuples inserted after the start of the scan (particularly
if they're in a new portion of the table).

I verified that if I change the table_close() in toast_save_datum() to not
release the lock the problem doesn't happen anymore.
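
For illustration, a sketch of that experiment at the end of toast_save_datum();
this is not the committed patch, just the idea of keeping the lock until end of
transaction:

/*
 * Close the TOAST relation and its indexes, but pass NoLock so the
 * RowExclusiveLock taken for the insertion is kept until end of
 * transaction, preserving the interlock with WaitForLockersMultiple()
 * during REINDEX CONCURRENTLY.
 */
toast_close_indexes(toastidxs, num_indexes, NoLock);   /* was RowExclusiveLock */
table_close(toastrel, NoLock);                         /* was RowExclusiveLock */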


This problem doesn't exist for CREATE INDEX CONCURRENTLY afaict, because we
don't allow that on system tables, including toast tables. But we *do* reindex
toast indexes concurrently.

One possible way to fix this would be to make ReindexRelationConcurrently()
acquire a lock on the underlying table when reindexing a toast table. Another
to not release the lock in toast_save_datum().

Greetings,

Andres Freund



On Mon, Nov 08, 2021 at 12:36:41PM -0800, Andres Freund wrote:
> One possible way to fix this would be to make ReindexRelationConcurrently()
> acquire a lock on the underlying table when reindexing a toast table. Another
> to not release the lock in toast_save_datum().

The latter is more future-proof.  Does it have material disadvantages?



Hi,

On November 8, 2021 7:56:24 PM PST, Noah Misch <noah@leadboat.com> wrote:
>On Mon, Nov 08, 2021 at 12:36:41PM -0800, Andres Freund wrote:
>> One possible way to fix this would be to make ReindexRelationConcurrently()
>> acquire a lock on the underlying table when reindexing a toast table. Another
>> to not release the lock in toast_save_datum().
>
>The latter is more future-proof.  Does it have material disadvantages?

I don't immediately see any. But I've been long of the opinion, and had plenty discussions around it, that our habit of
releasing locks early is far too widely used.

I do however wonder if there's other path to the problem, besides saving toast datums. We also release those locks
early in other places, and while e.g. r/o locks won't cause a problem with this specific interlock, it could cause
problem around dropping the relation, for example.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: BUG #17268: Possible corruption in toast index after reindex index concurrently

From
Michael Paquier
Date:
On Mon, Nov 08, 2021 at 08:37:58PM -0800, Andres Freund wrote:
> On November 8, 2021 7:56:24 PM PST, Noah Misch <noah@leadboat.com> wrote:
>> On Mon, Nov 08, 2021 at 12:36:41PM -0800, Andres Freund wrote:
>>> One possible way to fix this would be to make ReindexRelationConcurrently()
>>> acquire a lock on the underlying table when reindexing a toast table. Another
>>> to not release the lock in toast_save_datum().

Thanks for the test case.  That reproduces really quickly.

>> The latter is more future-proof.  Does it have material disadvantages?
>
> I don't immediately see any. But I've been long of the opinion, and
> had plenty discussions around it, that our habit of releasing locks
> early is far too widely used.

Yes, I'd agree that not patching the reindex concurrent path would be
safer in the long run.  This feels a bit like e629a01, in spirit, not
in scope.

> I do however wonder if there's other path to the problem, besides
> saving toast datums. We also release those locks early in other
> places, and while e.g. r/o locks won't cause a problem with this
> specific interlock, it could cause problem around dropping the
> relation, for example.

Hmm.  Perhaps there could be some latent issue around
toast_delete_datum()?
--
Michael

Hi,

On 2021-11-09 14:02:19 +0900, Michael Paquier wrote:
> On Mon, Nov 08, 2021 at 08:37:58PM -0800, Andres Freund wrote:
> > On November 8, 2021 7:56:24 PM PST, Noah Misch <noah@leadboat.com> wrote:
> >> On Mon, Nov 08, 2021 at 12:36:41PM -0800, Andres Freund wrote:
> >>> One possible way to fix this would be to make ReindexRelationConcurrently()
> >>> acquire a lock on the underlying table when reindexing a toast table. Another
> >>> to not release the lock in toast_save_datum().
> 
> Thanks for the test case.  That reproduces really quickly.
> 
> >> The latter is more future-proof.  Does it have material disadvantages?
> > 
> > I don't immediately see any. But I've been long of the opinion, and
> > had plenty discussions around it, that our habit of releasing locks
> > early is far too widely used.
> 
> Yes, I'd agree that not patching the reindex concurrent path would be
> safer in the long run.  This feels a bit like e629a01, in spirit, not
> in scope.

I wonder if we should do both...

Greetings,

Andres Freund



On Mon, Nov 08, 2021 at 09:08:48PM -0800, Andres Freund wrote:
> On 2021-11-09 14:02:19 +0900, Michael Paquier wrote:
> > On Mon, Nov 08, 2021 at 08:37:58PM -0800, Andres Freund wrote:
> > > On November 8, 2021 7:56:24 PM PST, Noah Misch <noah@leadboat.com> wrote:
> > >> On Mon, Nov 08, 2021 at 12:36:41PM -0800, Andres Freund wrote:
> > >>> One possible way to fix this would be to make ReindexRelationConcurrently()
> > >>> acquire a lock on the underlying table when reindexing a toast table. Another
> > >>> to not release the lock in toast_save_datum().

> > >> The latter is more future-proof.  Does it have material disadvantages?
> > > 
> > > I don't immediately see any. But I've been long of the opinion, and
> > > had plenty discussions around it, that our habit of releasing locks
> > > early is far too widely used.
> > 
> > Yes, I'd agree that not patching the reindex concurrent path would be
> > safer in the long run.  This feels a bit like e629a01, in spirit, not
> > in scope.
> 
> I wonder if we should do both...

Doing both sounds harmless.  Regarding REINDEX CONCURRENTLY on a system
catalog, I bet that can still reach bugs even if we do both, considering this:

$ git grep 'table_close.*, RowExclusiveLock' | wc -l
288



Hi,

On 2021-11-08 21:46:25 -0800, Noah Misch wrote:
> Doing both sounds harmless.  Regarding REINDEX CONCURRENTLY on a system
> catalog, I bet that can still reach bugs even if we do both, considering this:

Normal catalogs can't be reindexed concurrently. It's just toast tables, which
are kind of system-y but also kind of not, that can be reindexed concurrently.
Or at least I think it's just toast indexes?

Greetings,

Andres Freund



On Mon, Nov 08, 2021 at 10:31:19PM -0800, Andres Freund wrote:
> On 2021-11-08 21:46:25 -0800, Noah Misch wrote:
> > Doing both sounds harmless.  Regarding REINDEX CONCURRENTLY on a system
> > catalog, I bet that can still reach bugs even if we do both, considering this:
> 
> Normal catalogs can't be reindexed concurrently. It's just toast tables, which
> are kind of system-y but also kind of not, that can be reindexed concurrently.

Oh, okay.



Re: BUG #17268: Possible corruption in toast index after reindex index concurrently

From
Michael Paquier
Date:
On Mon, Nov 08, 2021 at 10:42:17PM -0800, Noah Misch wrote:
> On Mon, Nov 08, 2021 at 10:31:19PM -0800, Andres Freund wrote:
>> On 2021-11-08 21:46:25 -0800, Noah Misch wrote:
>>> Doing both sounds harmless.  Regarding REINDEX CONCURRENTLY on a system
>>> catalog, I bet that can still reach bugs even if we do both, considering this:
>>
>> Normal catalogs can't be reindexed concurrently. It's just toast tables, which
>> are kind of system-y but also kind of not, that can be reindexed concurrently.
>
> Oh, okay.

To be clear on this point, users cannot reindex concurrently catalog
indexes and toast indexes associated to catalog tables, just toast
indexes of normal tables.  I don't know if any of you have been
working on a patch, but I was cooking something.  It would be worth
checking if an isolation test could be written.
--
Michael


Re: BUG #17268: Possible corruption in toast index after reindex index concurrently

From
Michael Paquier
Date:
On Thu, Nov 11, 2021 at 06:09:49PM +0900, Michael Paquier wrote:
> To be clear on this point, users cannot reindex concurrently catalog
> indexes and toast indexes associated to catalog tables, just toast
> indexes of normal tables.  I don't know if any of you have been
> working on a patch, but I was cooking something.  It would be worth
> checking if an isolation test could be written.

So, I have worked on this one.  And attached is a patch that
implements the two approaches suggested by Andres which are able to
fix the issues discussed:
1) Switch to a session-level lock on the parent relation if doing a
reindex concurrently on a toast table or on one of its indexes.  This
requires looking back at pg_class.reltoastrelid to find the correct
parent.  This stresses me quite a bit, and I am not sure that I like
that, to be honest, because we don't do anything like that in the rest
of the tree.  I am also getting the feeling that this is an open door
for more issues.
2) Don't release the locks taken when a new toast value is saved until
the end of the transaction.

After more testing, I have been able to extract and write an isolation
test that is able to reproduce the failure.  It relies on a trick as
the toast relation names are not deterministic, and we cannot use
REINDEX CONCURRENTLY in a function context.  So I have used an ALTER
TABLE/INDEX RENAME with a DO block to change the toast relation
names with allow_system_table_mods instead.  There is no need either
for amcheck with this method.
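
For illustration, a sketch of that renaming trick; the table and target names
here are made up, not the ones used in the actual test:

-- Requires allow_system_table_mods; gives the TOAST relation and its index
-- deterministic names so that a later test step can simply run:
--   REINDEX INDEX CONCURRENTLY pg_toast.toasted_tab_idx;
SET allow_system_table_mods = on;
DO $$
DECLARE
    toast_rel text;
    toast_idx text;
BEGIN
    SELECT rel.relname, idx.relname INTO toast_rel, toast_idx
      FROM pg_class tab
      JOIN pg_class rel ON rel.oid = tab.reltoastrelid
      JOIN pg_index i ON i.indrelid = rel.oid
      JOIN pg_class idx ON idx.oid = i.indexrelid
     WHERE tab.relname = 'toast_test_tab';

    EXECUTE format('ALTER TABLE pg_toast.%I RENAME TO toasted_tab', toast_rel);
    EXECUTE format('ALTER INDEX pg_toast.%I RENAME TO toasted_tab_idx', toast_idx);
END;
$$;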

2) is enough to fix the problem, and I'd like to think that we had
better stick with only this method for simplicity's sake.

Comments?
--
Michael


Re: BUG #17268: Possible corruption in toast index after reindex index concurrently

From
Michael Paquier
Date:
On Mon, Nov 15, 2021 at 08:03:04PM +0900, Michael Paquier wrote:
> After more testing, I have been able to extract and write an isolation
> test that is able to reproduce the failure.  It relies on a trick as
> the toast relation names are not deterministic, and we cannot use
> REINDEX CONCURRENTLY in a function context.  So I have used an ALTER
> TABLE/INDEX RENAME with a DO block to change the toast relation
> names with allow_system_table_mods instead.  There is no need either
> for amcheck with this method.
>
> 2) is enough to fix the problem, and I'd like to think that we had
> better stick with only this method for simplicity's sake.

I have been working on this one again for the last couple of days, and
I would still go with the simple solution, releasing the row-level
toast locks only at the end of the transaction when saving and
deleting a toast value.  While testing, I have noticed that the
deletion part is also important, as a REINDEX CONCURRENTLY run in
parallel with a transaction removing a toast value would go through
without that, rather than waiting for the transaction doing the
deletion to commit first.  I have expanded the tests with everything I
could think of, so I'd like to commit the attached.  The test is
fancy with its use of allow_system_table_mods to make the toast
relation names reliable, but it has been really useful.
--
Michael


Re: BUG #17268: Possible corruption in toast index after reindex index concurrently

From
"Bossart, Nathan"
Date:
On 12/5/21, 5:54 PM, "Michael Paquier" <michael@paquier.xyz> wrote:
> I have been working on this one again for the last couple of days, and
> I would still go with the simple solution, releasing the row-level
> toast locks only at the end of the transaction when saving and
> deleting a toast value.  While testing, I have noticed that the
> deletion part is also important, as a REINDEX CONCURRENTLY run in
> parallel with a transaction removing a toast value would go through
> without that, rather than waiting for the transaction doing the
> deletion to commit first.  I have expanded the tests with everything I
> could think of, so I'd like to commit the attached.  The test is
> fancy with its use of allow_system_table_mods to make the toast
> relation names reliable, but it has been really useful.

I confirmed that the new tests reliably produce corruption and that
the suggested fix resolves it.  I also lean towards the simple
solution, but I do wonder if it creates any interesting side effects.
For example, could holding the locks longer impact performance?  (Of
course, performance is probably not a great reason to avoid a
sustainable solution.)

-    toast_close_indexes(toastidxs, num_indexes, RowExclusiveLock);
-    table_close(toastrel, RowExclusiveLock);
+    toast_close_indexes(toastidxs, num_indexes, NoLock);
+    table_close(toastrel, NoLock);

I think it would be good to expand the comments above these changes to
explain why we are keeping the lock.  That might help avoid similar
problems in the future.

Nathan


Re: BUG #17268: Possible corruption in toast index after reindex index concurrently

From
Michael Paquier
Date:
On Tue, Dec 07, 2021 at 10:04:54PM +0000, Bossart, Nathan wrote:
> I confirmed that the new tests reliably produce corruption and that
> the suggested fix resolves it.  I also lean towards the simple
> solution, but I do wonder if it creates any interesting side effects.

Thanks.

> For example, could holding the locks longer impact performance?

Do you have anything particular in mind?  It seems to me that this
boils down to the same lock taken on the parent table based on its
RTE.

> -    toast_close_indexes(toastidxs, num_indexes, RowExclusiveLock);
> -    table_close(toastrel, RowExclusiveLock);
> +    toast_close_indexes(toastidxs, num_indexes, NoLock);
> +    table_close(toastrel, NoLock);
>
> I think it would be good to expand the comments above these changes to
> explain why we are keeping the lock.  That might help avoid similar
> problems in the future.

Yes, I have added a note, and applied the patch after looking at it
again this morning.  The test cannot be used in 12 so I have removed
it from REL_12_STABLE, as allow_system_table_mods is a PGC_POSTMASTER
there.
--
Michael


Re: BUG #17268: Possible corruption in toast index after reindex index concurrently

From
"Bossart, Nathan"
Date:
On 12/7/21, 6:48 PM, "Michael Paquier" <michael@paquier.xyz> wrote:
> On Tue, Dec 07, 2021 at 10:04:54PM +0000, Bossart, Nathan wrote:
>> For example, could holding the locks longer impact performance?
>
> Do you have anything particular in mind?  It seems to me that this
> boils down to the same lock taken on the parent table based on its
> RTE.

I don't have anything concrete in mind.

>> I think it would be good to expand the comments above these changes to
>> explain why we are keeping the lock.  That might help avoid similar
>> problems in the future.
>
> Yes, I have added a note, and applied the patch after looking at it
> again this morning.  The test cannot be used in 12 so I have removed
> it from REL_12_STABLE, as allow_system_table_mods is a PGC_POSTMASTER
> there.

Great!

Nathan