Thread: BUG #17568: unexpected zero page at block 0 during REINDEX CONCURRENTLY
BUG #17568: unexpected zero page at block 0 during REINDEX CONCURRENTLY
From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17568
Logged by:          Sergei Kornilov
Email address:      sk@zsrv.org
PostgreSQL version: 14.4
Operating system:   Ubuntu 20.04
Description:

Hello

I recently ran "REINDEX INDEX CONCURRENTLY i_sess_uuid;" (pg14.4, table around 700gb), but suddenly, after the start of the "index validation: scanning index" phase, insert and update operations started returning an error:

ERROR:  index "i_sess_uuid_ccnew" contains unexpected zero page at block 0
HINT:  Please REINDEX it.

i_sess_uuid_ccnew is exactly the new index that REINDEX CONCURRENTLY is building at this time. It is clear that the errors started after index_set_state_flags set INDEX_CREATE_SET_READY, because insert and update queries now need to update this index too. But it remains unclear how exactly page 0 turned out to be all zeros at this point.

I think some process may have loaded the btree metapage (page 0) into shared buffers prior to the end of _bt_load. In this case, the error is reproducible (14.4, 14 STABLE, HEAD):

create extension pageinspect;
create table test as select generate_series(1,1e4) as id;
create index test_id_idx on test(id);
-- prepare gdb for this backend with a breakpoint on _bt_uppershutdown
reindex index concurrently test_id_idx ;

While gdb is stopped on the breakpoint, run from a second session:

insert into test values (0);
SELECT * FROM bt_metap('test_id_idx_ccnew');
-[ RECORD 1 ]-------------+---
magic                     | 0
version                   | 0
root                      | 0
level                     | 0
fastroot                  | 0
fastlevel                 | 0
last_cleanup_num_delpages | 0
last_cleanup_num_tuples   | -1
allequalimage             | f

Then continue the reindex backend. New inserts, along with the reindex itself, will give the error: index "test_id_idx_ccnew" contains unexpected zero page at block 0.

The metapage on disk after the _bt_uppershutdown call will be written correctly and correctly replicated to the standby. But it is still erroneous in shared buffers on the primary. I still don't know if this is what happened to my database.
Monitoring queries (like pg_total_relation_size, pg_stat_user_indexes, pg_statio_user_indexes) do not load the metapage into shared buffers. A normal select/insert/update/delete should not touch a not-yet-ready index in any way. This database does not have any extensions installed other than those available in contrib.

Thoughts?

regards, Sergei
Re: BUG #17568: unexpected zero page at block 0 during REINDEX CONCURRENTLY
From
Sergei Kornilov
Date:
Hello

Luckily, I found a call of the bt_metap function in one query that was definitely executed while the reindex was running and before these errors started to appear. So this was accidental pilot error.

It would be nice to protect shared buffers from such premature page loading, but that's probably not possible without a performance penalty.

regards, Sergei
Re: BUG #17568: unexpected zero page at block 0 during REINDEX CONCURRENTLY
From
Andres Freund
Date:
Hi,

On 2022-08-03 14:23:34 +0000, PG Bug reporting form wrote:
> I think some process may have loaded btree metapage (page 0) into shared
> buffers prior the end of _bt_load. In this case, the error is reproduced
> (14.4, 14 STABLE, HEAD):
>
> create extension pageinspect;
> create table test as select generate_series(1,1e4) as id;
> create index test_id_idx on test(id);
> # prepare gdb for this backend with breakpoint on _bt_uppershutdown
> reindex index concurrently test_id_idx ;

Worth noting that this doesn't even require REINDEX CONCURRENTLY; it's also an issue for CIC (CREATE INDEX CONCURRENTLY).

The problem basically is that once the first non-meta page of the btree is written (e.g. _bt_blwritepage() calling smgrextend()), concurrent sessions can read the metapage (and potentially other pages that are zero-filled at that point) into shared_buffers. At the end of _bt_uppershutdown() we write the metapage to disk, bypassing shared buffers. And boom, the all-zeroes version read into memory earlier is suddenly out of date.

The easiest fix is likely to force all buffers to be forgotten at the end of index_concurrently_build() or such. I don't immediately see a nicer way to fix this; we can't just lock the new index relation exclusively.

We could of course also stop bypassing s_b for CIC/RIC, but that seems mighty invasive for a bugfix.

Greetings,

Andres Freund
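The race described above can be pictured with a toy buffer-cache model (plain Python, purely illustrative; the class and page names are invented and this is not PostgreSQL code): a concurrent reader caches the still-zeroed metapage, the index build then writes the final metapage directly to "disk" bypassing the cache, and later cached reads keep returning the stale zero page even though the on-disk copy is correct.

```python
# Toy model of the shared-buffers bypass race (illustration only,
# not PostgreSQL code; names are made up for this sketch).

ZERO_PAGE = b"\x00" * 8

class BufferCache:
    """Simplified shared_buffers: caches pages read from disk."""
    def __init__(self, disk):
        self.disk = disk
        self.pages = {}

    def read(self, blkno):
        # First access pulls the page from disk and keeps it cached.
        if blkno not in self.pages:
            self.pages[blkno] = self.disk.get(blkno, ZERO_PAGE)
        return self.pages[blkno]

disk = {}
cache = BufferCache(disk)

# 1. The index build extends the relation; the metapage (block 0)
#    is still all zeros (like _bt_blwritepage() -> smgrextend()).
disk[1] = b"leafpage"

# 2. A concurrent session (e.g. a bt_metap() call) reads block 0
#    *through the cache* and caches the zero page.
assert cache.read(0) == ZERO_PAGE

# 3. The build finishes and writes the real metapage directly to
#    disk, *bypassing the cache* (like _bt_uppershutdown()).
disk[0] = b"metapage"

# 4. Later readers go through the cache and still see the stale
#    all-zero page -> "unexpected zero page at block 0", even
#    though the on-disk (and replicated) copy is correct.
assert cache.read(0) == ZERO_PAGE
assert disk[0] == b"metapage"
```

The sketch also shows why dropping the cached buffers at the end of the build (or never bypassing the cache) would both close the window.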
Re: BUG #17568: unexpected zero page at block 0 during REINDEX CONCURRENTLY
From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> The easiest fix is likely to force all buffers to be forgotten at the end of
> index_concurrently_build() or such.

Race conditions there ...

> I don't immediately see a nicer way to fix
> this, we can't just lock the new index relation exclusively.

Why not? If the index isn't valid yet, other backends have zero business touching it. I'd think about taking an exclusive lock to start with, and releasing it (downgrading to a non-exclusive lock) once the index is valid enough that other backends can access it, which would be just before we set pg_index.indisready to true.

Basically, this is to enforce the previously-implicit contract that other sessions won't touch the index too soon against careless superusers.

regards, tom lane
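The lock-based idea sketched above can be modeled as a tiny state machine (toy Python, with invented names; PostgreSQL's real lock manager works quite differently): the builder holds the new index exclusively from before the first page write until just before indisready is set, so no other backend can read half-built pages through shared buffers.

```python
# Toy state machine for "hold exclusive until the index is ready"
# (invented names; not the real PostgreSQL lmgr API).

class RelationLock:
    def __init__(self):
        self.mode = None          # None, "SHARE", or "EXCLUSIVE"

    def acquire_exclusive(self):
        self.mode = "EXCLUSIVE"

    def downgrade_to_share(self):
        # Only an exclusive holder can downgrade.
        assert self.mode == "EXCLUSIVE"
        self.mode = "SHARE"

    def can_read(self):
        # Other backends may touch the index only once the builder
        # no longer holds it exclusively.
        return self.mode != "EXCLUSIVE"

lock = RelationLock()

lock.acquire_exclusive()      # taken before the first page is written
assert not lock.can_read()    # a bt_metap()-style probe would block here

# ... index build runs, pages written bypassing shared buffers ...

lock.downgrade_to_share()     # just before pg_index.indisready = true
assert lock.can_read()        # now other sessions may open the index
```

As the follow-up reply notes, the cost of this approach is that every code path that opens all indislive indexes would block on the exclusive lock, which is why it was not adopted as-is.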
Re: BUG #17568: unexpected zero page at block 0 during REINDEX CONCURRENTLY
From
Andres Freund
Date:
Hi,

On 2022-08-15 19:56:40 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > The easiest fix is likely to force all buffers to be forgotten at the end of
> > index_concurrently_build() or such.
>
> Race conditions there ...

Not immediately seeing it? New reads from disk will read valid data. But I agree, it's a shitty approach.

> > I don't immediately see a nicer way to fix
> > this, we can't just lock the new index relation exclusively.
>
> Why not? If the index isn't valid yet, other backends have zero
> business touching it. I'd think about taking an exclusive lock
> to start with, and releasing it (downgrading to a non-exclusive
> lock) once the index is valid enough that other backends can
> access it, which would be just before we set pg_index.indisready
> to true.

I'm afraid we'd start blocking in quite a few places, both inside and outside of core PG. E.g. ExecOpenIndices(), ExecInitPartitionInfo(), calculate_toast_table_size(), ... will open all indislive indexes, even if not indisready.

> Basically, this is to enforce the previously-implicit contract
> that other sessions won't touch the index too soon against
> careless superusers.

I suspect this isn't restricted to superusers, fwiw. E.g. pg_prewarm doesn't require superuser.

Greetings,

Andres Freund