Thread: [BUG] Error in BRIN summarization
One of our clients hit the error "failed to find parent tuple for heap-only
tuple at (50661,130) in table "tbl"" in PostgreSQL v12.

Steps to reproduce (REL_12_STABLE):

1) Create a table with a primary key, create a brin index, and fill the
table with some initial data:

create table tbl (id int primary key, a int) with (fillfactor=50);
create index idx on tbl using brin (a) with (autosummarize=on);
insert into tbl select i, i from generate_series(0,100000) as i;

2) Run the script brin_test.sql using pgbench:

pgbench postgres -f ../review/brin_test.sql -n -T 120

The script is a bit messy because I was trying to reproduce a problematic
workload, and I didn't manage to simplify it. The idea is that it inserts
new values into the table to produce unindexed pages, and also updates some
values to trigger HOT updates on those pages.

3) Open a psql session and run brin_summarize_new_values in a loop:

select brin_summarize_new_values('idx'::regclass::oid);
\watch 2

Wait a bit, and psql will report the ERROR.

This error is caused by a problem with the root_offsets array bounds. It
occurs if a new HOT tuple was inserted after we collected root_offsets, so
that we have no root_offset for the tuple's offnum. Concurrent insertions
are possible because brin_summarize_new_values() holds only
ShareUpdateExclusiveLock on the table and no lock (only a pin) on the page.

A draft fix is attached. It saves root_offsets_size and checks that we only
access valid entries. The patch also adds some debug messages, just to
confirm that the problem was caught.

TODO:

- check whether heapam_index_validate_scan() has the same problem
- code cleanup
- test other PostgreSQL versions

[1] https://www.postgresql.org/message-id/flat/CA%2BTgmoYgwjmmjK24Qxb_vWAu8_Hh7gfVFcr3%2BR7ocdLvYOWJXg%40mail.gmail.com

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
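[Editor's note: to make the failure mode concrete, here is a rough sketch of
the shape the draft fix describes. It is illustrative only, not the attached
patch: it assumes heap_get_root_tuples() is changed to return how many
root_offsets[] entries it initialized, and the fragment is assumed to live
inside the index-build heap scan, with declarations of page and heapTuple
omitted.]

/*
 * Sketch of the draft fix (illustration only, not the attached patch).
 * heap_get_root_tuples() is assumed here to return the number of
 * root_offsets[] entries it initialized, so a caller can detect a
 * heap-only tuple whose offnum appeared after the map was collected.
 */
OffsetNumber root_offsets[MaxHeapTuplesPerPage];
OffsetNumber root_offsets_size;
OffsetNumber offnum;

root_offsets_size = heap_get_root_tuples(page, root_offsets);

/* ... later, while mapping a heap-only tuple to its root ... */
offnum = ItemPointerGetOffsetNumber(&heapTuple->t_self);
if (offnum > root_offsets_size ||
    !OffsetNumberIsValid(root_offsets[offnum - 1]))
{
    /* Inserted concurrently after the map was built; rebuild the map. */
    root_offsets_size = heap_get_root_tuples(page, root_offsets);
}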
On 23.07.2020 20:39, Anastasia Lubennikova wrote:
> One of our clients hit the error "failed to find parent tuple for
> heap-only tuple at (50661,130) in table "tbl"" in PostgreSQL v12.
>
> Steps to reproduce (REL_12_STABLE):
>
> 1) Create a table with a primary key, create a brin index, and fill the
> table with some initial data:
>
> create table tbl (id int primary key, a int) with (fillfactor=50);
> create index idx on tbl using brin (a) with (autosummarize=on);
> insert into tbl select i, i from generate_series(0,100000) as i;
>
> 2) Run the script brin_test.sql using pgbench:
>
> pgbench postgres -f ../review/brin_test.sql -n -T 120
>
> The script is a bit messy because I was trying to reproduce a problematic
> workload, and I didn't manage to simplify it. The idea is that it inserts
> new values into the table to produce unindexed pages, and also updates
> some values to trigger HOT updates on those pages.
>
> 3) Open a psql session and run brin_summarize_new_values in a loop:
>
> select brin_summarize_new_values('idx'::regclass::oid);
> \watch 2
>
> Wait a bit, and psql will report the ERROR.
>
> This error is caused by a problem with the root_offsets array bounds. It
> occurs if a new HOT tuple was inserted after we collected root_offsets,
> so that we have no root_offset for the tuple's offnum. Concurrent
> insertions are possible because brin_summarize_new_values() holds only
> ShareUpdateExclusiveLock on the table and no lock (only a pin) on the
> page.
>
> A draft fix is attached. It saves root_offsets_size and checks that we
> only access valid entries. The patch also adds some debug messages, just
> to confirm that the problem was caught.
>
> TODO:
>
> - check whether heapam_index_validate_scan() has the same problem
> - code cleanup
> - test other PostgreSQL versions
>
> [1]
> https://www.postgresql.org/message-id/flat/CA%2BTgmoYgwjmmjK24Qxb_vWAu8_Hh7gfVFcr3%2BR7ocdLvYOWJXg%40mail.gmail.com

Here is an updated version of the fix. The problem can be reproduced on all
supported versions, so I suggest backpatching it. The code changed slightly
in v12, so there are two patches: one for versions 9.5 to 11 and another for
versions 12 to master.

As for heapam_index_validate_scan(), I tried to reproduce the same error
with CREATE INDEX CONCURRENTLY, but found no problem there.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
On 2020-Jul-27, Anastasia Lubennikova wrote:

> Here is an updated version of the fix. The problem can be reproduced on
> all supported versions, so I suggest backpatching it. The code changed
> slightly in v12, so there are two patches: one for versions 9.5 to 11
> and another for versions 12 to master.

Hi Anastasia, thanks for this report and fix. I was considering this last
week and noticed that the patch changes the ABI of heap_get_root_tuples,
which may be problematic in back branches. I suggest that for released
branches (12 and prior) we create a new function with the new signature and
keep heap_get_root_tuples unchanged. In 13 and master we don't need that
trick, so we can keep the code as you have it in this version of the patch.

OffsetNumber
heap_get_root_tuples_new(Page page, OffsetNumber *root_offsets)
{
    ... full implementation ...
}

/* ABI compatibility only */
void
heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
{
    (void) heap_get_root_tuples_new(page, root_offsets);
}

(I was also considering whether it needs to be a loop to reobtain root
tuples, in case a concurrent transaction can create a new item while we're
checking that item; but I don't think that can really happen for one
individual tuple.)

Thanks

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jul 27, 2020 at 10:25 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> (I was also considering whether it needs to be a loop to reobtain root
> tuples, in case a concurrent transaction can create a new item while
> we're checking that item; but I don't think that can really happen for
> one individual tuple.)

I wonder if something like that is the underlying problem in a recent
problem case involving a "REINDEX index pg_class_tblspc_relfilenode_index"
command that runs concurrently with the regression tests:

https://postgr.es/m/CAH2-WzmBxu4o=pMsniur+bwHqCGCmV_AOLkuK6BuU7ngA6evqw@mail.gmail.com

We see a violation of the HOT invariant in this case, though only for a
system catalog index, and only in fairly particular circumstances
involving concurrency.

--
Peter Geoghegan
> On 2020-Jul-27, Anastasia Lubennikova wrote:
>> Here is an updated version of the fix. The problem can be reproduced on
>> all supported versions, so I suggest backpatching it. The code changed
>> slightly in v12, so there are two patches: one for versions 9.5 to 11
>> and another for versions 12 to master.
>
> (I was also considering whether it needs to be a loop to reobtain root
> tuples, in case a concurrent transaction can create a new item while
> we're checking that item; but I don't think that can really happen for
> one individual tuple.)

I don't think we need a recheck for a single tuple, because we only care
about finding its root, which simply must exist somewhere on this page, as
concurrent pruning is not allowed. We may also pick up root_offsets[]
entries for subsequent tuples, but it's okay if we don't; those tuples will
do the same recheck in their turn.
While testing this fix, Alexander Lakhin spotted another problem. I simplified the test case to this:
1) Prepare a table with a brin index:

create table tbl (i int, b char(1000)) with (fillfactor=10);
insert into tbl select i, md5(i::text) from generate_series(0,1000) as i;
create index idx on tbl using brin(i, b) with (pages_per_range=1);
2) Run brin_desummarize_range() in a loop:

echo "
-- desummarize all ranges
SELECT FROM generate_series(0, pg_relation_size('tbl')/8192 - 1) as i,
  lateral (SELECT brin_desummarize_range('idx', i)) as d;
-- summarize them back
VACUUM tbl
" > brin_desum_test.sql

pgbench postgres -f brin_desum_test.sql -n -T 120
3) Run a search that invokes bringetbitmap:

set enable_seqscan to off;
explain analyze select * from tbl where i>10 and i<100;
\watch 1
After a few runs, it will fail with "ERROR: corrupted BRIN index: inconsistent range map"
The problem is caused by a race in page locking in brinGetTupleForHeapBlock
[1] (see the sketch below):

(1) the bitmapscan locks revmap->rm_currBuf, finds the address of the tuple
on a regular page "page", then unlocks revmap->rm_currBuf
(2) in another transaction, desummarize locks both revmap->rm_currBuf and
"page", cleans up the tuple, and unlocks both buffers
(1) the bitmapscan locks the buffer containing "page", attempts to access
the tuple, and fails to find it
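For readers following along, the relevant sequence in
brinGetTupleForHeapBlock is roughly this (a paraphrase of the code at [1],
trimmed to the locking steps, with declarations omitted; not the verbatim
source):

/* (1) look up the revmap entry under the revmap page lock */
LockBuffer(revmap->rm_currBuf, BUFFER_LOCK_SHARE);
/* ... read the summary tuple's TID from the revmap page ... */
blk = ItemPointerGetBlockNumber(iptr);
off = ItemPointerGetOffsetNumber(iptr);
LockBuffer(revmap->rm_currBuf, BUFFER_LOCK_UNLOCK);

/*
 * Window: no lock is held here, so a concurrent
 * brin_desummarize_range() can lock both pages and remove the tuple we
 * are about to fetch.
 */

/* (1) again: lock the regular page and fetch the tuple; buf holds "page" */
LockBuffer(buf, BUFFER_LOCK_SHARE);
lp = PageGetItemId(BufferGetPage(buf), off);
if (!ItemIdIsUsed(lp))
    elog(ERROR, "corrupted BRIN index: inconsistent range map");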
At first, I tried to fix it by holding the lock on revmap->rm_currBuf until
we have locked the regular page, but that causes a deadlock with
brinsummarize(). It can easily be reproduced with the same test as above.
Is there any rule about the order of locking revmap and regular pages in
brin? I haven't found anything in the README.

As an alternative, we can leave the locking as is and add a recheck before
throwing the error.

What do you think?
[1] https://github.com/postgres/postgres/blob/master/src/backend/access/brin/brin_revmap.c#L269
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 30.07.2020 16:40, Anastasia Lubennikova wrote:
> While testing this fix, Alexander Lakhin spotted another problem.
>
> After a few runs, it will fail with "ERROR: corrupted BRIN index:
> inconsistent range map"
>
> The problem is caused by a race in page locking in
> brinGetTupleForHeapBlock [1]:
>
> (1) the bitmapscan locks revmap->rm_currBuf, finds the address of the
> tuple on a regular page "page", then unlocks revmap->rm_currBuf
> (2) in another transaction, desummarize locks both revmap->rm_currBuf
> and "page", cleans up the tuple, and unlocks both buffers
> (1) the bitmapscan locks the buffer containing "page", attempts to
> access the tuple, and fails to find it
>
> At first, I tried to fix it by holding the lock on revmap->rm_currBuf
> until we have locked the regular page, but that causes a deadlock with
> brinsummarize(). It can easily be reproduced with the same test as above.
> Is there any rule about the order of locking revmap and regular pages
> in brin? I haven't found anything in the README.
>
> As an alternative, we can leave the locking as is and add a recheck
> before throwing the error.

Here are the updated patches for both problems.

1) brin_summarize_fix_REL_12_v2 fixes "failed to find parent tuple for
heap-only tuple at (50661,130) in table "tbl""

This patch checks that we only access initialized entries of the
root_offsets[] array and, if necessary, collects the array again. One
recheck is enough here, since concurrent pruning is not possible.

2) brin_pagelock_fix_REL_12_v1.patch fixes "ERROR: corrupted BRIN index:
inconsistent range map"

This patch adds a recheck, as suggested in the previous message. I am not
sure whether one recheck is enough to eliminate the race completely, but
the problem can no longer be reproduced.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
On 2020-Jul-30, Anastasia Lubennikova wrote:

> While testing this fix, Alexander Lakhin spotted another problem. I
> simplified the test case to this:

Ah, good catch.

I think a cleaner way to fix this problem is to just consider the range as
not summarized and return NULL from there, as in the attached patch.
Running your test case with a telltale WARNING added at that point, it's
clear that it's being hit.

By returning NULL, we're forcing the caller to scan the heap, which is not
great. But note that if you retry, and your VACUUM hasn't run yet by the
time we go through the loop again, the same thing would happen. So it
seems to me a good enough answer.

A much more troubling thought is what happens if the range is
desummarized, and then the index item is used for the summary of a
different range. Then the index might end up returning corrupt results.

> At first, I tried to fix it by holding the lock on revmap->rm_currBuf
> until we have locked the regular page, but that causes a deadlock with
> brinsummarize(). It can easily be reproduced with the same test as above.
> Is there any rule about the order of locking revmap and regular pages in
> brin? I haven't found anything in the README.

Umm, I thought that stuff was in the README, but it seems I didn't add it
there. I think I had a .org file with my notes on that ... it must be on
an older laptop disk, because it's not in my worktree for that. I'll see
if I can fish it out.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
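[Editor's note: going by the description above, the attached patch
presumably boils down to something like the following. This is a sketch
with assumed variable names, not the patch text.]

/* In brinGetTupleForHeapBlock, after locking the regular page. */
lp = PageGetItemId(page, off);
if (!ItemIdIsUsed(lp))
{
    /*
     * The revmap pointed at an item that a concurrent
     * brin_desummarize_range() has since removed.  Treat the range as
     * not summarized rather than reporting index corruption; the
     * caller will fall back to scanning the heap for this range.
     */
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    return NULL;
}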
On 2020-Jul-23, Anastasia Lubennikova wrote:

> This error is caused by a problem with the root_offsets array bounds. It
> occurs if a new HOT tuple was inserted after we collected root_offsets,
> so that we have no root_offset for the tuple's offnum. Concurrent
> insertions are possible because brin_summarize_new_values() holds only
> ShareUpdateExclusiveLock on the table and no lock (only a pin) on the
> page.

Excellent detective work, thanks.

> A draft fix is attached. It saves root_offsets_size and checks that we
> only access valid entries.

I think this is more complicated than necessary. It seems easier to solve
this problem by just checking whether the given root pointer is set to
InvalidOffsetNumber, which is already done in the existing coding of
heap_get_root_tuples (only they spell it "0" rather than
InvalidOffsetNumber, which I propose to change). AFAIR this should only
happen in the 'anyvisible' mode, so I added that in an assert.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On 2020-Jul-28, Peter Geoghegan wrote:

> On Mon, Jul 27, 2020 at 10:25 AM Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > (I was also considering whether it needs to be a loop to reobtain root
> > tuples, in case a concurrent transaction can create a new item while
> > we're checking that item; but I don't think that can really happen for
> > one individual tuple.)
>
> I wonder if something like that is the underlying problem in a recent
> problem case involving a "REINDEX index
> pg_class_tblspc_relfilenode_index" command that runs concurrently with
> the regression tests:
>
> https://postgr.es/m/CAH2-WzmBxu4o=pMsniur+bwHqCGCmV_AOLkuK6BuU7ngA6evqw@mail.gmail.com
>
> We see a violation of the HOT invariant in this case, though only for
> a system catalog index, and only in fairly particular circumstances
> involving concurrency.

Hmm. As far as I understand, the bug Anastasia reports can only hit an
index build that occurs concurrently with heap updates, and that cannot
happen for a regular index build, only for CREATE INDEX CONCURRENTLY and
REINDEX CONCURRENTLY. So unless I'm missing something, it's not related to
that other bug.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-Aug-11, Alvaro Herrera wrote:

> I think this is more complicated than necessary. It seems easier to
> solve this problem by just checking whether the given root pointer is
> set to InvalidOffsetNumber, which is already done in the existing coding
> of heap_get_root_tuples (only they spell it "0" rather than
> InvalidOffsetNumber, which I propose to change). AFAIR this should only
> happen in the 'anyvisible' mode, so I added that in an assert.

'anyvisible' mode is not required AFAICS; reading the code, I think this
could also hit REINDEX CONCURRENTLY and CREATE INDEX CONCURRENTLY, which
do not use that flag. I didn't try to reproduce it there, though. Anyway,
I'm going to remove that Assert() I added.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-Aug-11, Alvaro Herrera wrote:

> A much more troubling thought is what happens if the range is
> desummarized, and then the index item is used for the summary of a
> different range. Then the index might end up returning corrupt results.

Actually, this is not a concern, because the brin tuple's bt_blkno is
rechecked before returning it, and if it doesn't match what we're
searching for, the loop is restarted. It becomes an infinite-loop problem
only if the revmap is pointing to a tuple that's labelled with a different
range's blkno.

So I think my patch as posted is a sufficient fix for this problem.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
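[Editor's note: the recheck being referred to looks roughly like this, as a
paraphrase of the retry loop in brin_revmap.c; this is a sketch, not the
verbatim source, and the surrounding lookup code is elided.]

for (;;)
{
    /* ... consult the revmap, lock the regular page, fetch the item ... */
    tup = (BrinTuple *) PageGetItem(page, lp);

    /* Make sure this is the summary tuple for the range we asked for. */
    if (tup->bt_blkno == heapBlk)
        return tup;

    /* The tuple was replaced concurrently; unlock and retry. */
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
}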
On 2020-Aug-12, Alvaro Herrera wrote:

> 'anyvisible' mode is not required AFAICS; reading the code, I think this
> could also hit REINDEX CONCURRENTLY and CREATE INDEX CONCURRENTLY, which
> do not use that flag. I didn't try to reproduce it there, though.
> Anyway, I'm going to remove that Assert() I added.

So this is what I propose as the final form of the fix.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
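[Editor's note: for readers following the thread without the attachment,
the final fix as described presumably has roughly this shape in
heapam_index_build_range_scan. This is a sketch based on the discussion
above, not the attached patch verbatim.]

/* Sketch: handling a heap-only tuple whose root is not in the map. */
offnum = ItemPointerGetOffsetNumber(&heapTuple->t_self);

/*
 * A new HOT chain may have appeared on the page after root_offsets[]
 * was collected; re-obtain the root items once before concluding that
 * anything is corrupted.
 */
if (root_offsets[offnum - 1] == InvalidOffsetNumber)
{
    Page page = BufferGetPage(hscan->rs_cbuf);

    LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
    heap_get_root_tuples(page, root_offsets);
    LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
}

if (!OffsetNumberIsValid(root_offsets[offnum - 1]))
    ereport(ERROR,
            (errcode(ERRCODE_DATA_CORRUPTED),
             errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
                             ItemPointerGetBlockNumber(&heapTuple->t_self),
                             offnum,
                             RelationGetRelationName(heapRelation))));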
On 12.08.2020 22:58, Alvaro Herrera wrote:
> On 2020-Aug-12, Alvaro Herrera wrote:
>
>> 'anyvisible' mode is not required AFAICS; reading the code, I think this
>> could also hit REINDEX CONCURRENTLY and CREATE INDEX CONCURRENTLY, which
>> do not use that flag. I didn't try to reproduce it there, though.
>> Anyway, I'm going to remove that Assert() I added.
>
> So this is what I propose as the final form of the fix.

Cool. This version looks much simpler than mine and passes the tests fine.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 2020-Aug-13, Anastasia Lubennikova wrote:

> Cool. This version looks much simpler than mine and passes the tests
> fine.

Thanks, pushed it to all branches. Thanks for diagnosing this problem!

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
hyrax's latest report suggests that this patch has issues under
CLOBBER_CACHE_ALWAYS:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2020-08-13%2005%3A09%3A58

Hard to tell whether there's an actual bug there or just test instability,
but either way it needs to be resolved.

			regards, tom lane
On 2020-Aug-15, Tom Lane wrote:

> hyrax's latest report suggests that this patch has issues under
> CLOBBER_CACHE_ALWAYS:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2020-08-13%2005%3A09%3A58
>
> Hard to tell whether there's an actual bug there or just test
> instability, but either way it needs to be resolved.

Hmm, the only explanation I can see for this is that autovacuum managed to
summarize the range before the test script did it. So the solution would
simply be to disable autovacuum for the table across the whole script.

I'm running the scripts and dependencies to verify that fix, but under
CLOBBER_CACHE_ALWAYS that takes quite a while.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-Aug-17, Alvaro Herrera wrote:

> Hmm, the only explanation I can see for this is that autovacuum managed
> to summarize the range before the test script did it. So the solution
> would simply be to disable autovacuum for the table across the whole
> script.
>
> I'm running the scripts and dependencies to verify that fix, but under
> CLOBBER_CACHE_ALWAYS that takes quite a while.

I ran a subset of tests a few times, but was unable to reproduce the
problem. I'll just push this to all branches and hope for the best.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services