Thread: BUG #17485: Records missing from Primary Key index when doing REINDEX INDEX CONCURRENTLY

The following bug has been logged on the website:

Bug reference:      17485
Logged by:          Peter Slavov
Email address:      pet.slavov@gmail.com
PostgreSQL version: 14.3
Operating system:   Ubuntu 20.04
Description:

I noticed a problem in our production database when rebuilding the primary
key index. The database has a lot of INSERTs and UPDATEs while this is
happening. I was able to reproduce this on PostgreSQL 14.1/2/3 locally on a
Docker instance and on AWS EC2.
Here is how you can reproduce:
- Load table `test2` from this file, containing the table structure and a small
testing dataset:

https://drive.google.com/file/d/1vCqKOda3IIrHmWDzNNJaDQfCb5eQA_D2/view?usp=sharing
- Start inserting and updating in this table using pgbench:
    SHELL: pgbench -r -T 1000 -f test.sql -c 50 -j 50 receiptbank
    test.sql file:
https://drive.google.com/file/d/1mdduGqu1XcDg01iwSysF7szrIlCKUJQT/view?usp=sharing
- Reindex the PRIMARY KEY (it may be necessary to repeat this 2-3 times):

    SQL: REINDEX INDEX CONCURRENTLY test2_pkey
- Check index with `amcheck` extension for errors:
    SQL: select bt_index_check(index => c.oid, heapallindexed => true) from
pg_class c where oid = 'test2_pkey'::regclass;
    SQL result for me: 
      ERROR:  heap tuple (13134,18) from table "test2" lacks matching index
tuple within index "test2_pkey"
- Check for missing records in the index:
    SQL: WITH missing_check AS (
              SELECT t1.*, t1.ctid,
                     EXISTS(SELECT 1 FROM test2 t2 WHERE t2.id = t1.id) AS flag
              FROM test2 t1)
         SELECT id, ctid
         FROM missing_check
         WHERE flag = false;
    Result for me: several records were found ONLY by the sequential scan...
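The check above can be modeled with a toy example (Python; purely illustrative, not the real amcheck or executor logic, and all values are made up): treat the heap as ctid -> id and the index as id -> ctid, and report heap tuples whose id has no index entry pointing back at them.

```python
# Toy model of the heap-vs-index consistency check (NOT real PostgreSQL code):
# the index is consistent when every heap tuple has an index entry that
# points back at its ctid. Values below are invented for illustration.
heap = {(13134, 18): 101, (1, 1): 1, (1, 2): 2}   # ctid -> id
index = {1: (1, 1), 2: (1, 2)}                    # id -> ctid; entry for 101 lost

missing = [(ctid, row_id) for ctid, row_id in heap.items()
           if index.get(row_id) != ctid]
print(missing)  # tuples reachable only by a sequential scan
```

In the real query this same difference shows up because the EXISTS subquery uses the (corrupt) index while the outer scan reads the heap sequentially.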

This problem is possibly related to BUG #17478

Thanks for the help,
Peter Slavov



> On 18 May 2022, at 15:42, PG Bug reporting form <noreply@postgresql.org> wrote:
> I was able to reproduce this on PostgreSQL 14.1/2/3 locally on
> docker instance and on AWS EC2.


Uhm, that's very...interesting. I'll look closely next week, though I didn't have a chance to reproduce it yet.
We have fixed a similar problem in 14.1. And now we have a very similar TAP test to your reproduction [0]. How do you think,
what's the key difference between the TAP test and your repro?

Thanks! Best regards, Andrey Borodin.

[0] https://github.com/postgres/postgres/blob/REL_14_STABLE/contrib/amcheck/t/002_cic.pl


On Thu, May 19, 2022 at 09:22:44AM +0500, Andrey Borodin wrote:
> Uhm, that's very...interesting. I'll look closely next week. Though I didn't have a chance to reproduce yet.
> We have fixed similar problem in 14.1. And now we have very similar
> TAP test to you reproduction [0]. How do you think, what's the key
> difference between TAP test and your repro?

Interesting, indeed.  Another question I have: is this limited to v14
or are you able to see it in older versions?  REINDEX CONCURRENTLY was
introduced in v12.
--
Michael

On Thu, 19.05.2022 at 7:32, Michael Paquier <michael@paquier.xyz> wrote:
> On Thu, May 19, 2022 at 09:22:44AM +0500, Andrey Borodin wrote:
> > Uhm, that's very...interesting. I'll look closely next week. Though I didn't have a chance to reproduce yet.
> > We have fixed a similar problem in 14.1. And now we have a very similar
> > TAP test to your reproduction [0]. How do you think, what's the key
> > difference between the TAP test and your repro?
>
> Interesting, indeed.  Another question I have: is this limited to v14
> or are you able to see it in older versions?  REINDEX CONCURRENTLY was
> introduced in v12.
> --
> Michael

Hi Michael,
Yes, I ran the same test on PostgreSQL 13.7, but the reindex works as expected there (no issues).
I haven't tested older versions.

Peter
Hi Andrey,
This test looks similar to me, but I cannot say what the difference is. My tests are done under heavier load, maybe.
Also, I couldn't reproduce this with a table with a low number of columns (integer and text). I am not sure if this is relevant...

Peter


On Thu, 19.05.2022 at 7:22, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

> > On 18 May 2022, at 15:42, PG Bug reporting form <noreply@postgresql.org> wrote:
> > I was able to reproduce this on PostgreSQL 14.1/2/3 locally on
> > docker instance and on AWS EC2.
>
> Uhm, that's very...interesting. I'll look closely next week, though I didn't have a chance to reproduce it yet.
> We have fixed a similar problem in 14.1. And now we have a very similar TAP test to your reproduction [0]. How do you think, what's the key difference between the TAP test and your repro?
>
> Thanks! Best regards, Andrey Borodin.
>
> [0] https://github.com/postgres/postgres/blob/REL_14_STABLE/contrib/amcheck/t/002_cic.pl

On Thu, May 19, 2022 at 08:24:27AM +0300, Петър Славов wrote:
> Yes I have made the same test on PostgreSQL 13.7, but the reindex works as
> expected there (no issues).
> I haven't  tested on older versions.

Okay, thanks.  Something that has changed in this area is the
addition of c98763b, where spurious waits are avoided in some of the
phases of REINDEX CONCURRENTLY.  I am wondering if this is related.
--
Michael

On 2022-May-19, Michael Paquier wrote:

> Okay, thanks.  Something that has changed in this area is the
> addition of c98763b, where spurious waits are avoided in some of the
> phases of REINDEX CONCURRENTLY.  I am wondering if this is related.

Hmm, yes, it's definitely possible that it is related.

I'll have a look soon.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Once again, thank you and all of the developers for your hard work on
PostgreSQL.  This is by far the most pleasant management experience of
any database I've worked on."                             (Dan Harris)
http://archives.postgresql.org/pgsql-performance/2006-04/msg00247.php



On Thu, May 19, 2022 at 02:03:56PM +0200, Alvaro Herrera wrote:
> Hmm, yes, it's definitely possible that it is related.
>
> I'll have a look soon.

It took me some time to write a script to bisect that, but I have been
able to establish a correlation with d9d0762 that causes VACUUM to
ignore transactions doing some concurrent reindex operations.  I would
not be surprised to see that this is also related to some of the
reports we have seen lately with reindex operations.  There was one
with logical replication and missing records from a primary key, I
recall.

For the stable branches of 14 and 15, I would tend to play it safe and
revert d9d0762.  I have to admit that f9900df and c98763b stress me a
bit, and that we may not have anticipated all the ramifications of
this set of changes.  Thoughts?
--
Michael

On 2022-May-23, Michael Paquier wrote:

> It took me some time to write a script to bisect that, but I have been
> able to establish a correlation with d9d0762 that causes VACUUM to
> ignore transactions doing some concurrent reindex operations.  I would
> not be surprised to see that this is also related to some of the
> reports we have seen lately with reindex operations.  There was one
> with logical replication and missing records from a primary key, I
> recall.
> 
> For the stable branches of 14 and 15, I would tend to play it safe and
> revert d9d0762.  I have to admit that f9900df and c98763b stress me a
> bit, and that we may not have anticipated all the ramifications of
> this set of changes.  Thoughts?

Wow, thanks for researching that over the weekend.

I think if this is a big enough deal (and I think it may be) then IMO we
should revert as you suggest, make an out-of-schedule release, and then
I can take some time to investigate in more depth and see if the feature
can be salvaged. 

OTOH if we think an out-of-schedule release is not warranted, then
reverting right now is not useful; we can make a decision about that
closer to the next minor release, once we've had time to see if the bug
can be fixed in some other way that doesn't break the whole feature.

Opinions?

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Saca el libro que tu religión considere como el indicado para encontrar la
oración que traiga paz a tu alma. Luego rebootea el computador
y ve si funciona" (Carlos Duclós)




> On 23 May 2022, at 13:07, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> Opinions?

I think revert+release is not really a good idea until we understand how this commit breaks things.
Chances are that it only affects the frequency of the reproduction.

Once we understand the root cause of the bug, it won't take much time to fix things.

Best regards, Andrey Borodin.



> On 23 May 2022, at 10:40, Michael Paquier <michael@paquier.xyz> wrote:
>
> It took me some time to write a script to bisect that, but I have been
> able to establish a correlation with d9d0762 that causes VACUUM to
> ignore transactions doing some concurrent reindex operations.

I've transformed Peter's test into a TAP test that runs in ~20 seconds and reliably reproduces the problem on my laptop.
And I observe that commenting out the condition in the following code fixes the test.
            //if (!(statusFlags & PROC_IN_SAFE_IC))
                h->data_oldest_nonremovable =
                    TransactionIdOlder(h->data_oldest_nonremovable, xmin);
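For readers following along, here is a hedged sketch of what that condition does (a toy Python model of the horizon computation, not the actual procarray.c code; the flag value is made up): when PROC_IN_SAFE_IC is set on a backend, its xmin no longer holds back the oldest-nonremovable horizon, so pruning may remove tuples that backend's snapshot still needs.

```python
PROC_IN_SAFE_IC = 0x01  # hypothetical flag value, for illustration only

def data_oldest_nonremovable(backends):
    """Toy model of the horizon computation: a backend's xmin is skipped
    when PROC_IN_SAFE_IC is set, so tuples newer than the returned horizon
    but still visible to that backend's snapshot can be pruned away."""
    horizon = None
    for xmin, status_flags in backends:
        if not (status_flags & PROC_IN_SAFE_IC):
            horizon = xmin if horizon is None else min(horizon, xmin)
    return horizon

# A CIC backend holding xmin 90 is ignored; the horizon advances to 100.
print(data_oldest_nonremovable([(100, 0), (90, PROC_IN_SAFE_IC)]))  # -> 100
```

Commenting the condition out, as in the snippet above, makes the flagged backend's xmin count again, which is why the test stops failing.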

Best regards, Andrey Borodin.


> On 23 May 2022, at 15:49, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> I've transformed Peter's test into TAP test that runs ~20 seconds and reliably reproduces problem on my laptop.

I found out one interesting thing: the unindexed tuple (that comes from the amcheck scan) does not exist in the heap page at the moment of the check failure.
I've added a ReadBuffer() call for the case where bloom_lacks_element() is true and ItemIdHasStorage() is false.
I understand that this description is too vague, so I attached a patch for amcheck relaxing bt_index_check() so that the test would pass.

Best regards, Andrey Borodin.


On Mon, May 23, 2022 at 2:02 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> I think revert+release is not really a good idea until we understand how this commit breaks things.
> Chances are that it only affects frequency of the reproduction.

+1 -- it's been in a stable release for months now, and we will
probably know the exact nature of the problem in just a few more days.
There is no reason to decide that the feature needs to be reverted
before anything else. Or if there is I would like to hear it.

-- 
Peter Geoghegan



On Mon, May 23, 2022 at 6:06 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> I found out one interesting thing: the unindexed tuple (that comes from the amcheck scan) does not exist in the heap page at the moment of the check failure.
 

That could just be a "downstream problem" from HOT chain corruption.

Maybe you'd get a clearer/earlier failure if you also applied this
patch, on an assertion-enabled build:

https://postgr.es/m/CAH2-Wzk2LeWPwz1wcKNz7Fux4Ogn+PC81H+q7Q7no-5XT0dx3w@mail.gmail.com

-- 
Peter Geoghegan



On Mon, May 23, 2022 at 10:07:44AM +0200, Alvaro Herrera wrote:
> I think if this is a big enough deal (and I think it may be) then IMO we
> should revert as you suggest, make an out-of-schedule release, and then
> I can take some time to investigate in more depth and see if the feature
> can be salvaged.
>
> OTOH if we think an out-of-schedule release is not warranted, then
> reverting right now is not useful; we can make a decision about that
> closer to the next minor release, once we've had time to see if the bug
> can be fixed in some other way that doesn't break the whole feature.

The annoying part is that this can cause silent corruption of
indexes created with REINDEX and CIC, so most users won't know about
the failure until they see that their application is broken.  And we
are just talking about a btree index here; other index AMs may be
similarly impacted.  So that's rather bad IMHO :/

It seems to me that the problem is around the wait phase after the
validation, where the computation of limitXmin coming from the
snapshot used for the validation ignores now the impact of VACUUM,
hence impacting the timing when the index can be safely used.  It also
looks like it is possible to build an isolation test, where we use a
transaction with a snapshot older than the REINDEX to force it to
wait in the first WaitForOlderSnapshots() call.
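As a rough sketch of that wait (a toy model under assumed semantics, not the actual WaitForOlderSnapshots() implementation): the index can only be flagged valid once no remaining snapshot is older than the limitXmin taken from the validation snapshot.

```python
# Toy model (assumed semantics): WaitForOlderSnapshots(limit_xmin) blocks
# until no active snapshot has an xmin older than limit_xmin, since such a
# snapshot might still see tuples the new index does not contain.
def can_mark_index_valid(limit_xmin, active_snapshot_xmins):
    return all(xmin >= limit_xmin for xmin in active_snapshot_xmins)

print(can_mark_index_valid(200, [205, 210]))  # -> True
print(can_mark_index_valid(200, [190, 210]))  # -> False, an older snapshot remains
```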
--
Michael

On Mon, May 23, 2022 at 03:49:02PM +0500, Andrey Borodin wrote:
> I've transformed Peter's test into TAP test that runs ~20 seconds
> and reliably reproduces problem on my laptop.

Thanks for the TAP test.  That's nice.  It actually passes here,
reliably.

> And I observe that commenting out condition in following code fixes the test.
>             //if (!(statusFlags & PROC_IN_SAFE_IC))
>                 h->data_oldest_nonremovable =
>                     TransactionIdOlder(h->data_oldest_nonremovable, xmin);

Well, by doing so, I think that you are just making the CIC/REINDEX
wait again until the index is safe to use, but we want to skip this
wait as of the optimization done in d9d0762.
--
Michael

On Mon, May 23, 2022 at 6:20 PM Michael Paquier <michael@paquier.xyz> wrote:
> > And I observe that commenting out condition in following code fixes the test.
> >             //if (!(statusFlags & PROC_IN_SAFE_IC))
> >                 h->data_oldest_nonremovable =
> >                     TransactionIdOlder(h->data_oldest_nonremovable, xmin);
>
> Well, by doing so, I think that you are just making the CIC/REINDEX
> wait again until the index is safe to use, but we want to skip this
> wait as of the optimization done in d9d0762.

Uh...isn't that exactly the point that Andrey made himself, in posting
the snippet?

You seem to be addressing this PROC_IN_SAFE_IC snippet as if Andrey
formally proposed it as a bugfix, which seems like an odd
interpretation to me. It seems pretty clear to me that Andrey was just
making an observation, in case it helped with debugging.

--
Peter Geoghegan



On 2022-May-23, Peter Geoghegan wrote:

> You seem to be addressing this PROC_IN_SAFE_IC snippet as if Andrey
> formally proposed it as a bugfix, which seems like an odd
> interpretation to me. It seems pretty clear to me that Andrey was just
> making an observation, in case it helped with debugging.

Right.

I approached it yesterday by running the test case with each
set_indexsafe_procflags() callsite commented out, to see which one breaks
things.  I didn't reach any conclusion because I ran into thermal problems
with my laptop, which got me angry, and I couldn't make any further
progress.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"I'm always right, but sometimes I'm more right than other times."
                                                  (Linus Torvalds)




> On 24 May 2022, at 14:02, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> I approached it yesterday by running the test case with each
> set_indexsafe_procflags() callsite commented out, see which one breaks
> things

On my machine, commenting out set_indexsafe_procflags() before "Phase 3 of concurrent index build" fixes the tests.



> On 23 May 2022, at 23:18, Peter Geoghegan <pg@bowt.ie> wrote:
>
> Maybe you'd get a clearer/earlier failure if you also applied this
> patch, on an assertion-enabled build

I've tried this approach, but nothing actually seems to change... BTW, I used a three-way-merge rebase; there is a slight
conflict in comments. But, luckily, comments don't run.



> On 24 May 2022, at 06:20, Michael Paquier <michael@paquier.xyz> wrote:
>
> Thanks for the TAP test.  That's nice.  It actually passes here,
> reliably.

IDK. Maybe if you increase the --time of pgbench you will observe the problem...



> On 24 May 2022, at 07:19, Peter Geoghegan <pg@bowt.ie> wrote:
>
> Andrey was just
> making an observation, in case it helped with debugging.

Yes, I'm not proposing to commit anything so far. All my tests, snippets, diffs here are only debug stuff.


Thank you!

Best regards, Andrey Borodin.


Hi,

On 2022-05-24 11:02:12 +0200, Alvaro Herrera wrote:
> On 2022-May-23, Peter Geoghegan wrote:
> 
> > You seem to be addressing this PROC_IN_SAFE_IC snippet as if Andrey
> > formally proposed it as a bugfix, which seems like an odd
> > interpretation to me. It seems pretty clear to me that Andrey was just
> > making an observation, in case it helped with debugging.
> 
> Right.
> 
> I approached it yesterday by running the test case with each
> set_indexsafe_procflags() callsite commented out, see which one breaks
> things.  Didn't reach any conclusion because I ran into thermal problems
> with my laptop, which got me angry and couldn't make any further
> progress.

Do we have any idea what really causes the corruption?

One thing that'd be worth excluding is the use of parallel index builds.

Greetings,

Andres Freund



On Tue, May 24, 2022 at 9:37 AM Andres Freund <andres@anarazel.de> wrote:
> Do we have any idea what really causes the corruption?

I don't think so.

Andrey's tap test fails for me on 14 as expected, and does so reliably
-- so there is a fairly good reproducer for this.

I don't have time to debug this right now (need to work on my pgCon
talk), but it would probably be straightforward to get an RR recording
of the failure.

> One thing that'd be worth excluding is the use of parallel index builds.

I can rule out a problem with parallel index builds -- disabling them
in the tap test doesn't alter the outcome.

-- 
Peter Geoghegan



Hi,

On 2022-05-24 09:37:02 -0700, Andres Freund wrote:
> On 2022-05-24 11:02:12 +0200, Alvaro Herrera wrote:
> > I approached it yesterday by running the test case with each
> > set_indexsafe_procflags() callsite commented out, see which one breaks
> > things.  Didn't reach any conclusion because I ran into thermal problems
> > with my laptop, which got me angry and couldn't make any further
> > progress.
>
> Do we have any idea what really causes the corruption?

Reproed the problem with the pgbench script, against an existing cluster. One
note: With fsync=on, it's much harder to reproduce.

One instance:

pgbench: error: client 0 script 2 aborted in command 3 query 0: ERROR:  heap tuple (37,56) from table "tbl" lacks matching index tuple within index "idx"
HINT:  Retrying verification using the function bt_index_parent_check() might provide a more specific error.

the last record for that offset is (with waldump enhanced to print offsets):

rmgr: Heap2       len (rec/tot):     60/    60, tx:          0, lsn: 0/8E56D218, prev 0/8E56D1F0, desc: PRUNE latestRemovedXid 1878329 nredirected 1 ndead 0 nunused 1, unused: [178], redirected: [56->53], blkref #0: rel 1663/5/27905 blk 37

the tuple originally was inserted:
rmgr: Heap        len (rec/tot):     95/    95, tx:    1802927, lsn: 0/89776380, prev 0/89776358, desc: INSERT off 56 flags 0x04, blkref #0: rel 1663/5/27905 blk 37
rmgr: Heap        len (rec/tot):     48/    48, tx:    1802927, lsn: 0/89776460, prev 0/89776420, desc: HEAP_CONFIRM off 56, blkref #0: rel 1663/5/27905 blk 37

and updated a "while" back:
rmgr: Heap        len (rec/tot):     54/    54, tx:    1804510, lsn: 0/898B5E68, prev 0/898B5E40, desc: LOCK off 56: xid 1804510: flags 0x00 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/5/27905 blk 37
rmgr: Heap        len (rec/tot):     73/    73, tx:    1804510, lsn: 0/898B5EA0, prev 0/898B5E68, desc: HOT_UPDATE off 56 xmax 1804510 flags 0x60 ; new off 80 xmax 1804510, blkref #0: rel 1663/5/27905 blk 37

since which there have been numerous redirections to other tuples:
rmgr: Heap2       len (rec/tot):    212/   212, tx:          0, lsn: 0/89AC6B40, prev 0/89AC6B00, desc: PRUNE latestRemovedXid 1806705 nredirected 16 ndead 0 nunused 47, unused: [47, 61, 12, 14, 38, 52, 53, 60, 75, 16, 17, 31, 50, 33, 59, 76, 49, 51, 54, 63, 89, 95, 18, 19, 29, 32, 48, 62, 81, 85, 88, 106, 28, 39, 40, 44, 26, 27, 45, 58, 73, 91, 34, 41, 80, 86, 96], redirected: [1->74, 3->93, 4->37, 7->94, 8->72, 9->87, 10->105, 15->108, 21->71, 23->98, 30->79, 35->102, 55->92, 56->107, 67->103, 84->104], blkref #0: rel 1663/5/27905 blk 37
rmgr: Heap2       len (rec/tot):    220/   220, tx:          0, lsn: 0/89DC6240, prev 0/89DC6218, desc: PRUNE latestRemovedXid 1810120 nredirected 27 ndead 0 nunused 29, unused: [74, 88, 93, 19, 94, 53, 59, 105, 16, 29, 118, 108, 71, 26, 38, 98, 79, 52, 102, 31, 54, 92, 107, 50, 113, 103, 104, 60, 85], redirected: [1->122, 3->62, 5->49, 7->91, 10->121, 15->28, 21->51, 23->119, 24->58, 30->73, 33->81, 34->86, 35->89, 36->14, 42->76, 45->112, 48->75, 55->80, 56->116, 67->17, 68->115, 70->61, 78->117, 83->27, 84->18, 97->120, 99->114], blkref #0: rel 1663/5/27905 blk 37
....
rmgr: Heap2       len (rec/tot):     60/    60, tx:          0, lsn: 0/8E56D218, prev 0/8E56D1F0, desc: PRUNE latestRemovedXid 1878329 nredirected 1 ndead 0 nunused 1, unused: [178], redirected: [56->53], blkref #0: rel 1663/5/27905 blk 37


the target of the redirection that's pruned away in 0/8E56D218 is more recent:
rmgr: Heap        len (rec/tot):     54/    54, tx:    1878329, lsn: 0/8E556070, prev 0/8E556020, desc: LOCK off 178: xid 1878329: flags 0x00 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/5/27905 blk 37
rmgr: Heap        len (rec/tot):     74/    74, tx:    1878329, lsn: 0/8E5560A8, prev 0/8E556070, desc: HOT_UPDATE off 178 xmax 1878329 flags 0x60 ; new off 53 xmax 1878329, blkref #0: rel 1663/5/27905 blk 37


I suspect the problem might be related to pruning done during the validation
scan. Once PROC_IN_SAFE_IC is set, the backend itself will not preserve tids
its own snapshot might need. Which will wreak havoc during the validation
scan.

Greetings,

Andres Freund



Hi,

On 2022-05-24 11:15:42 -0700, Andres Freund wrote:
> I suspect the problem might be related to pruning done during the validation
> scan. Once PROC_IN_SAFE_IC is set, the backend itself will not preserve tids
> its own snapshot might need. Which will wreak havoc during the validation
> scan.

Looking at it more, I don't see how PROC_IN_SAFE_IC is safe as defined,
tbh. Afaict it basically completely breaks snapshots - which we rely on to
work for both the initial build scan and then for the validation scan.

It seems safe to make CIC ignore other CICs when deciding whether to wait
for those transactions. But it seems utterly unsafe to ignore CICs when
determining horizons.

Greetings,

Andres Freund




> On 24 May 2022, at 23:15, Andres Freund <andres@anarazel.de> wrote:
>
> With fsync=on, it's much harder to reproduce.
That explains why it's easier to reproduce on macOS: it seems it ignores fsync.



> On 24 May 2022, at 23:15, Andres Freund <andres@anarazel.de> wrote:
>
> I suspect the problem might be related to pruning done during the validation
> scan. Once PROC_IN_SAFE_IC is set, the backend itself will not preserve tids
> its own snapshot might need. Which will wreak havoc during the validation
> scan.

I observe that removing PROC_IN_SAFE_IC for index_validate() fixes the tests.
But why is it not a problem for the index_build() scan?

And I do not understand why it's a problem that a tuple is pruned during the scan... How does this "wreak havoc" happen?

Best regards, Andrey Borodin.


Hi,

On 2022-05-24 10:38:14 -0700, Peter Geoghegan wrote:
> On Tue, May 24, 2022 at 9:37 AM Andres Freund <andres@anarazel.de> wrote:
> > Do we have any idea what really causes the corruption?
> 
> I don't think so.

I think I found it: https://postgr.es/m/20220524183705.cmgbqq32z63qynhe%40alap3.anarazel.de
afaict PROC_IN_SAFE_IC is completely broken right now. Any concurrent prune
can remove rows that are visible to the snapshot held by the
PROC_IN_SAFE_IC backend. Which basically makes them "fair weather snapshots" -
they work only as long as there is no concurrent activity.

Similar behavior is fine for VACUUM - it doesn't use a snapshot / need a
consistent view of the table. But not for CIC - otherwise it could just use
SnapshotAny or such.


I don't really see a realistic alternative other than reverting at this
point. I think this needs to be rethought fairly fundamentally.


> Andrey's tap test fails for me on 14 as expected, and does so reliably
> -- so there is a fairly good reproducer for this.
> 
> I don't have time to debug this right now (...), but it would probably be
> straightforward to get an RR recording of the failure.

I tried that, but it didn't repro under rr within 15min or so.


> (need to work on my pgCon talk)

Good luck :)


> > One thing that'd be worth excluding is the use of parallel index builds.
> 
> I can rule out a problem with parallel index builds -- disabling them
> in the tap test doesn't alter the outcome.

Good. Just to clarify: I was suspicious of PROC_IN_SAFE_IC being set
incoherently in parallel workers or such, not of parallel index builds "in
general".

Greetings,

Andres Freund



Hi,

On 2022-05-24 23:38:07 +0500, Andrey Borodin wrote:
>
>
> > On 24 May 2022, at 23:15, Andres Freund <andres@anarazel.de> wrote:
> >
> > With fsync=on, it's much harder to reproduce.
> That explains why it's easier to reproduce on macOS: it seems it ignores fsync.

Yea, one needs wal_sync_method=fsync_writethrough or such :/


> > On 24 May 2022, at 23:15, Andres Freund <andres@anarazel.de> wrote:
> >
> > I suspect the problem might be related to pruning done during the validation
> > scan. Once PROC_IN_SAFE_IC is set, the backend itself will not preserve tids
> > its own snapshot might need. Which will wreak havoc during the validation
> > scan.
>
> I observe that removing PROC_IN_SAFE_IC for index_validate() fixes tests.
> But why it's not a problem for index_build() scan?

I now suspect it's a problem for both, just more visible for index_validate().


> And I do not understand why it's a problem that a tuple is pruned during the scan... How does this "wreak havoc" happen?

Basically snapshots don't work anymore. If PROC_IN_SAFE_IC is set, that
backend is ignored for the horizon computation for snapshots / on-access HOT
pruning. Which means that rows that are visible to the snapshot can be pruned
away.

One might think that could be safe, after all the row is invisible to all
other backends. The problem is that the validation scan won't see *newer* rows
either, since they're not visible to the snapshot either. And if the new row
version is a HOT tuple, it won't have made an index entry on its own. Boom,
corruption.

Basically:

1) S1 builds index in phase 2
2) S2 inserts tuple t1 (not in the index built in 1, since it's inserted
   after that)
3) S2 hot updates tuple t1->t2
4) S1 sets PROC_IN_SAFE_IC, builds snapshot, starts validation scan (phase 3)
5) S2 hot updates tuple t2->t3
6) Either S1 or S2 performs hot pruning, redirecting t1 to t3, this is only
   possible because PROC_IN_SAFE_IC caused S2's ->xmin to be ignored
7) S1 checks t1->t3, finds that t3 is too new for the snapshot, doesn't create
   an index entry
8) corruption
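The eight steps above can be replayed in a toy Python model (a simplification under assumed visibility rules, not PostgreSQL's actual MVCC code; all xids are invented): a tuple version that the validation snapshot should have indexed is pruned away because the CIC backend's xmin was ignored, and the only surviving version is too new for the snapshot.

```python
SNAP = 105  # S1's validation snapshot (taken in step 4)

def visible(t, snap):
    # Simplified visibility rule: inserted before the snapshot, and not yet
    # replaced/deleted as of the snapshot.
    return t["xmin"] < snap and (t["xmax"] is None or t["xmax"] >= snap)

# HOT chain after step 5: t1 -> t2 -> t3
t1 = {"xmin": 100, "xmax": 101}   # dead
t2 = {"xmin": 101, "xmax": 110}   # visible to SNAP: needs an index entry
t3 = {"xmin": 110, "xmax": None}  # too new for SNAP

# Step 6: pruning ignores S1's xmin (PROC_IN_SAFE_IC) and collapses the
# chain, leaving only the newest version behind the redirect.
chain_after_prune = [t3]

# Step 7: the validation scan indexes only what it can still see.
indexed = [t for t in chain_after_prune if visible(t, SNAP)]
print(visible(t2, SNAP), indexed)  # t2 was visible, yet nothing got indexed
```

With a correctly held-back horizon, t2 would have survived pruning, the scan would have seen it, and the HOT chain's root would have received an index entry.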


Greetings,

Andres Freund



On Tue, 24 May 2022 at 15:02, Andres Freund <andres@anarazel.de> wrote:
>
> Basically:
>
> 1) S1 builds index in phase 2
> 2) S2 inserts tuple t1 (not in the index built in 1, since it's inserted
>    after that)
> 3) S2 hot updates tuple t1->t2

Not that it matters but is this step even necessary?

> 4) S1 sets PROC_IN_SAFE_IC, builds snapshot, starts validation scan (phase 3)
> 5) S2 hot updates tuple t2->t3

That seems like the key observation. But I wonder if it's even the
only flow where this could be an issue. What happens if t2 is deleted,
can it get pruned away completely?

> 6) Either S1 or S2 performs hot pruning, redirecting t1 to t3, this is only
>    possible because PROC_IN_SAFE_IC caused S2's ->xmin to be ignored

Or presumably any other transaction. But ... does the update to t2->t3
not automatically trigger pruning anyways?

> 7) S1 checks t1->t3, finds that t3 is too new for the snapshot, doesn't create
>    an index entry

Just to be clear, it would normally have created an index entry (for
the whole HOT chain) because t2 is in the recheck snapshot and
therefore the whole HOT chain wasn't in the initial snapshot. I'm a
little confused here.

> 8) corruption

Aside from amcheck I wonder if we can come up with any way for users
to tell whether their index is affected or at risk. Like, is there a
way to tell from catalog entries if an index was created with CIC?


-- 
greg



Hi,

On 2022-05-24 17:11:12 -0400, Greg Stark wrote:
> On Tue, 24 May 2022 at 15:02, Andres Freund <andres@anarazel.de> wrote:
> >
> > Basically:
> >
> > 1) S1 builds index in phase 2
> > 2) S2 inserts tuple t1 (not in the index built in 1, since it's inserted
> >    after that)
> > 3) S2 hot updates tuple t1->t2
> 
> Not that it matters but is this step even necessary?

I think it is, but there might be other recipes reproducing the problem.


> > 4) S1 sets PROC_IN_SAFE_IC, builds snapshot, starts validation scan (phase 3)
> > 5) S2 hot updates tuple t2->t3
> 
> That seems like the key observation. But I wonder if it's even the
> only flow where this could be an issue. What happens if t2 is deleted,
> can it get pruned away completely?

Yes it could, but afaics that'd be fine, because then there's no missing index
entry. And the index should only be marked valid once all older snapshots have
ended.


> > 6) Either S1 or S2 performs hot pruning, redirecting t1 to t3, this is only
> >    possible because PROC_IN_SAFE_IC caused S2's ->xmin to be ignored
> 
> Or presumably any other transaction.

Right.


> But ... does the update to t2->t3 not automatically trigger pruning anyways?

We don't prune during updates right now (but do when fetching the row to
update) - I think that's bad, but it's how it is.

When you say "automatically" - do you mean that it'd happen unconditionally,
independent of the horizon? It shouldn't...


> > 7) S1 checks t1->t3, finds that t3 is too new for the snapshot, doesn't create
> >    an index entry
> 
> Just to be clear, it would normally have created an index entry (for
> the whole HOT chain) because t2 is in the recheck snapshot and
> therefore the whole HOT chain wasn't in the initial snapshot. I'm a
> little confused here.

Hm? Why / where would we have done that? It's a HOT update, so the UPDATE
doesn't create an index entry. And the validate scan won't see the HOT chain
because t2 has been pruned away and t3 is too new.

What "recheck snapshot" are you referring to? The one passed to
validate_index()?


> > 8) corruption
> 
> Aside from amcheck I wonder if we can come up with any way for users
> to tell whether their index is affected or at risk. Like, is there a
> way to tell from catalog entries if an index was created with CIC?

Not reliably, afaik. indcheckxmin won't ever be set for a CIC index IIRC, but
it's not reliably set for a non-CIC index.

Greetings,

Andres Freund



On Tue, May 24, 2022 at 03:24:33PM -0700, Andres Freund wrote:
> On 2022-05-24 17:11:12 -0400, Greg Stark wrote:
>> Aside from amcheck I wonder if we can come up with any way for users
>> to tell whether their index is affected or at risk. Like, is there a
>> way to tell from catalog entries if an index was created with CIC?
>
> Not reliably, afaik. indcheckxmin won't ever be set for a CIC index IIRC, but
> it's not reliably set for a non-CIC index.

When it comes to REINDEX, we recreate an entirely new relation for the
concurrent flavor, dropping the old one.  Hence its OID changes, while
all the data from the old relation gets copied over.  That's not much,
and you cannot use that for a CIC, still...
--
Michael


> On 25 May 2022, at 00:01, Andres Freund <andres@anarazel.de> wrote:
>
>>> On 24 May 2022, at 23:15, Andres Freund <andres@anarazel.de> wrote:
>>>
>>> I suspect the problem might be related to pruning done during the validation
>>> scan. Once PROC_IN_SAFE_IC is set, the backend itself will not preserve tids
>>> its own snapshot might need. Which will wreak havoc during the validation
>>> scan.
>>
>> I observe that removing PROC_IN_SAFE_IC for index_validate() fixes tests.
>> But why it's not a problem for index_build() scan?
>
> I now suspect it's a problem for both, just more visible for index_validate().
No, I understand now. index_build() in CIC can be safely replaced by index_build_empty().
index_build() is only called because many AMs can build an index more efficiently than index_build_empty() + inserts.
BTW, for other indexes that are built just by inserts, we could safely optimize CIC by calling index_build_empty()
instead of index_build() in Phase 1.

>> And I do not understand why it's a problem that a tuple is pruned during the scan... How does this "wreak havoc" happen?
>
> Basically snapshots don't work anymore

I got it! (almost)
Thank you!
Now I see why tests in 002_cic.pl did not catch that: there were no HOT updates.
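
As a minimal sketch of how such tests could provoke HOT updates (table and
column names here are invented): update only a non-indexed column, so the new
version can stay on the same heap page without a new index entry.

    CREATE TABLE tbl (i int PRIMARY KEY, payload int) WITH (fillfactor = 50);
    INSERT INTO tbl SELECT g, 0 FROM generate_series(1, 1000) g;
    -- touching only the non-indexed column permits HOT updates
    UPDATE tbl SET payload = payload + 1 WHERE i = 42;

Running such updates concurrently with CREATE INDEX CONCURRENTLY should
exercise the pruning path discussed here.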

So, can we leave the feature enabled for all of the process besides index validation? We don't seem to need a valid
snapshot in CIC anywhere else. Or is it safer to revert the whole commit?

Thanks!

Best regards, Andrey Borodin.


Hi,

On 2022-05-25 09:00:47 +0500, Andrey Borodin wrote:
> > On 25 May 2022, at 00:01, Andres Freund <andres@anarazel.de> wrote:
> > 
> >>> On 24 May 2022, at 23:15, Andres Freund <andres@anarazel.de> wrote:
> >>> 
> >>> I suspect the problem might be related to pruning done during the validation
> >>> scan. Once PROC_IN_SAFE_IC is set, the backend itself will not preserve tids
> >>> its own snapshot might need. Which will wreak havoc during the validation
> >>> scan.
> >> 
> >> I observe that removing PROC_IN_SAFE_IC for index_validate() fixes tests.
> >> But why it's not a problem for index_build() scan?
> > 
> > I now suspect it's a problem for both, just more visible for index_validate().

> No, I understood now. index_build() in CIC can be safely replaced by
> index_build_empty().

I'm not convinced that that is true. I think at least for stuff like indexes
used in constraints (unique or exclusion) we need to do something like the
current approach.

But more importantly, there's a difference between not inserting any index
entries, and inserting entries based on an essentially corrupted
snapshot. Using a too old snapshot (which is what we have here) could lead to
all sorts of oddities when following HOT chains etc. Which might end up
inserting *too much* into the index, I think.


> So, can we leave the feature enabled for all the process besides index
> validation? We don't seem to need a valid snapshot in CIC anywhere else. Or
> it's safer to revert whole commit?

I doubt it's safe in the other places either. We can't just do a table scan,
which expects to be passed a valid snapshot, with a broken snapshot, and
expect no problems. At the very least I expect there to be a potential of
spurious warnings / errors.

Greetings,

Andres Freund



On 2022-May-24, Andres Freund wrote:

> One might think that could be safe, after all the row is invisible to all
> other backends. The problem is that the validation scan won't see *newer* rows
> either, since they're not visible to the snapshot either. And if the new row
> version is a HOT tuple, it won't have made an index entry on its own. Boom,
> corruption.

Whoa.  My oversight here -- I failed to picture a HOT prune concurrently
with CIC or RIC. 

> Basically snapshots don't work anymore. If PROC_IN_SAFE_IC is set,
> that backend is ignored for the horizon computation for snapshots /
> on-access HOT pruning. Which means that rows that are visible to the
> snapshot can be pruned away.

I wondered if we could have different tuple horizons for HOT pruning
than for vacuum, but looking at ComputeXidHorizons() and users of that,
it looks complicated to adapt.

Another possibility (than reverting the commit altogether) might be to
disable HOT pruning while a process is operating on that relation with
PROC_IN_SAFE_IC.  So CIC/RIC processes are still ignored for VACUUM,
while not creating corrupted indexes.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Postgres is bloatware by design: it was built to house
 PhD theses." (Joey Hellerstein, SIGMOD annual conference 2002)



On Wed, May 25, 2022 at 7:45 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > Basically snapshots don't work anymore. If PROC_IN_SAFE_IC is set,
> > that backend is ignored for the horizon computation for snapshots /
> > on-access HOT pruning. Which means that rows that are visible to the
> > snapshot can be pruned away.
>
> I wondered if we could have different tuple horizons for HOT pruning
> than for vacuum, but looking at ComputeXidHorizons() and users of that,
> it looks complicated to adapt.
>
> Another possibility (than reverting the commit altogether) might be to
> disable HOT pruning while a process is operating on that relation with
> PROC_IN_SAFE_IC.  So CIC/RIC processes are still ignored for VACUUM,
> while not creating corrupted indexes.

I'm not sure that would be a win, because HOT pruning is great as long
as the tuples you're pruning are old enough. Also, it seems like it
would require complex new infrastructure that I think we should be
reluctant to invent in back branches.

It seems to me that we should just revert. As far as I can see, and
for sure I might be missing something, this is a classic case of an
idea that seemed good at the time but turns out not to work. When we
look at a recently HOT-updated tuple, we need to know whether the
original insertion happened before or after the index build. We can't
figure that out if we prune away the tuples that store the xmin values
that we need in order to figure that out. So it turns out we need
everyone to respect that snapshot after all. Bummer.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



On 2022-May-25, Robert Haas wrote:

> On Wed, May 25, 2022 at 7:45 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> > Another possibility (than reverting the commit altogether) might be to
> > disable HOT pruning while a process is operating on that relation with
> > PROC_IN_SAFE_IC.  So CIC/RIC processes are still ignored for VACUUM,
> > while not creating corrupted indexes.
> 
> I'm not sure that would be a win, because HOT pruning is great as long
> as the tuples you're pruning are old enough.

Well, the point is that VACUUM could still prune dead tuples that are
newer than the CIC in all *other* tables.  VACUUM cannot run in the
table being processed by CIC/RIC (because of locks), so there's no
change; it's only HOT-pruning in the table being processed that would
change.

> Also, it seems like it would require complex new infrastructure that I
> think we should be reluctant to invent in back branches.

This is definitely true.  And I think this would be expensive, because
we'd have to check in every heap_page_prune call.

> It seems to me that we should just revert.

Deciding to revert makes me sad, because this feature is extremely
valuable for users.  However, I understand the danger and I don't
disagree with the rationale so I can't really object.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"¿Cómo puedes confiar en algo que pagas y que no ves,
y no confiar en algo que te dan y te lo muestran?" (Germán Poo)



Hi,

On 2022-05-25 18:43:22 +0200, Alvaro Herrera wrote:
> On 2022-May-25, Robert Haas wrote:
> > Also, it seems like it would require complex new infrastructure that I
> > think we should be reluctant to invent in back branches.
> 
> This is definitely true.  And I think this would be expensive, because
> we'd have to check in every heap_page_prune call.

I think the cost could be addressed, along the lines of the mechanism I put in
as part of the snapshot scalability work. I.e. don't compute an accurate
horizon when not needed for pruning, only do so when within a certain range of
xids.

But it seems still way too invasive for the back branches. Quite obviously we
need a lot more testing for this etc.


I'm also doubtful it's the right approach. The problem here comes from needing
a snapshot for the entire duration of the validation scan. ISTM that we should
work on not needing that snapshot, rather than trying to reduce the
consequences of holding that snapshot.  I think it might be possible to avoid
it. Random sketch:

We could prevent HOT updates during CIC for rows inserted during the first
scan. If we did that we IIRC could rely on the xids of the last row version to
determine whether an index insertion is needed during the validation scan.


> > It seems to me that we should just revert.
> 
> Deciding to revert makes me sad, because this feature is extremely
> valuable for users.  However, I understand the danger and I don't
> disagree with the rationale so I can't really object.

I sadly don't see how we could develop a reliable reimplementation of this
feature without delaying (or destabilizing) the release...

Greetings,

Andres Freund



On Wed, May 25, 2022 at 12:43 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > It seems to me that we should just revert.
>
> Deciding to revert makes me sad, because this feature is extremely
> valuable for users.  However, I understand the danger and I don't
> disagree with the rationale so I can't really object.

Right, I mean I'm not saying I *like* reverting, and I'm not disputing
it's a good feature. Just that, if we have to choose between this
feature and not having index corruption, we better not have index
corruption.... and I'm not seeing any way that we can just tweak this
and make it work.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



On Wed, May 25, 2022 at 10:18 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Right, I mean I'm not saying I *like* reverting, and I'm not disputing
> it's a good feature. Just that, if we have to choose between this
> feature and not having index corruption, we better not have index
> corruption.... and I'm not seeing any way that we can just tweak this
> and make it work.

Are we any closer to deciding on a timeline, in light of recent discussion?

I'm now convinced that an out-of-schedule release is probably the way
to go. My thinking is: are we really going to make users wait for
August 11th for a fix?

-- 
Peter Geoghegan



On Wed, May 25, 2022 at 10:32:43AM -0700, Peter Geoghegan wrote:
> On Wed, May 25, 2022 at 10:18 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > Right, I mean I'm not saying I *like* reverting, and I'm not disputing
> > it's a good feature. Just that, if we have to choose between this
> > feature and not having index corruption, we better not have index
> > corruption.... and I'm not seeing any way that we can just tweak this
> > and make it work.
> 
> Are we any closer to deciding on a timeline, in light of recent discussion?
> 
> I'm now convinced that an out-of-schedule release is probably the way
> to go. My thinking is: are we really going to make users wait for
> August 11th for a fix?

No, we are unlikely to wait until August.

What do we currently know?

*  Caused by CREATE/REINDEX CONCURRENTLY
*  No way to detect which indexes were created concurrently
*  amcheck will be required to find corrupt indexes (options?)
*  Heap is fine, only index is corrupt
*  Bug since PG 14
*  Not known why reports are only appearing now
*  Feature will likely need to be removed rather than fixed

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson




On Wed, May 25, 2022 at 11:17:33PM -0400, Bruce Momjian wrote:
> *  Caused by CREATE/REINDEX CONCURRENTLY

Yes.

> *  No way to detect which indexes were created concurrently
> *  amcheck will be required to find corrupt indexes (options?)

amcheck would help for btree, not the other AMs.  I am not sure but
there could be other symptoms that may not be detected by amcheck?

> *  Heap is fine, only index is corrupt

Yes.

> *  Bug since PG 14

14.0, yes.

> *  Not known why reports are only appearing now
> *  Feature will likely need to be removed rather than fixed

Trends go in this direction.  It does not seem like there is a
complete conclusion on this point, either.
--
Michael


> On 26 May 2022, at 08:39, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, May 25, 2022 at 11:17:33PM -0400, Bruce Momjian wrote:
>> *  Caused by CREATE/REINDEX CONCURRENTLY
>
> Yes.
>
>> *  No way to detect which indexes were created concurrently
>> *  amcheck will be required to find corrupt indexes (options?)
>
> amcheck would help for btree, not the other AMs.  I am not sure but
> there could be other symptoms that may not be detected by amcheck?

Noah invented a neat way to do this check in pre-heapallindexed versions [0]. I think we can make GiST and some other
AMs be checked this way.

BTW, patches for GiST and GIN verification are circulating on pgsql-hackers. Would anyone be interested in reviewing?
If so, I'll add heapallindexed to GiST verification.
The last GIN version by Heikki was in good condition, though without heapallindexed AFAIK. [1] [2]


Best regards, Andrey Borodin.

[0] https://github.com/postgres/postgres/blob/REL_11_STABLE/src/bin/pgbench/t/022_cic.pl#L28-L40
[1]
https://www.postgresql.org/message-id/flat/a5bc2aad-464b-91bd-061d-28af0f9b634c%40iki.fi#222ef3f6cbdd6f86172b38ff7663e6a3
[2] https://www.postgresql.org/message-id/CAF3eApa07-BajjG8%2BRYx-Dr_cq28ZA0GsZmUQrGu5b2ayRhB5A%40mail.gmail.com




On Thu, May 26, 2022 at 6:17 AM Bruce Momjian <bruce@momjian.us> wrote:
> What do we currently know?
>
> *  Caused by CREATE/REINDEX CONCURRENTLY
> *  No way to detect which indexes were created concurrently
> *  amcheck will be required to find corrupt indexes (options?)
Actually, in the initial bug report I posted an SQL query that compares a sequential scan against an index scan to find the records missing from the index.
This is how we initially found the problem.
I am not sure how these missing records relate to the `amcheck` errors, because the `ctid`s do not match exactly.
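
For reference, the planner toggles give a quick (if crude) way to make that
comparison on `test2` from the original report; this is a sketch, not an
exhaustive check, since the planner is not guaranteed to pick an index-only
scan:

    SET enable_seqscan = off;            -- steer toward an index scan
    SELECT count(*) FROM test2 WHERE id IS NOT NULL;
    RESET enable_seqscan;
    SET enable_indexscan = off;
    SET enable_bitmapscan = off;         -- force a sequential scan
    SELECT count(*) FROM test2;

If the index is missing entries, the first count can come out lower than the
second.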



On Thu, May 26, 2022 at 12:39:20PM +0900, Michael Paquier wrote:
> On Wed, May 25, 2022 at 11:17:33PM -0400, Bruce Momjian wrote:
> > *  Caused by CREATE/REINDEX CONCURRENTLY
> 
> Yes.
> 
> > *  No way to detect which indexes were created concurrently
> > *  amcheck will be required to find corrupt indexes (options?)
> 
> amcheck would help for btree, not the other AMs.  I am not sure but
> there could be other symptoms that may not be detected by amcheck?
> 
> > *  Heap is fine, only index is corrupt
> 
> Yes.
> 
> > *  Bug since PG 14
> 
> 14.0, yes.
> 
> > *  Not known why reports are only appearing now
> > *  Feature will likely need to be removed rather than fixed
> 
> Trends go in this direction.  It does not seem like there is a
> complete conclusion on this point, either.

Uh, if people don't know if they have used CREATE/REINDEX CONCURRENTLY
in PG 14.0+, they are going to need to run amcheck on all btree indexes
and reindex all non-btree indexes?  That is going to be a painful
message to deliver, and hear.  :-(
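
As a sketch of what that sweep could look like (assuming the amcheck
extension is installed; system catalogs can be skipped since CONCURRENTLY is
not supported on them):

    CREATE EXTENSION IF NOT EXISTS amcheck;
    SELECT c.oid::regclass AS index,
           bt_index_check(index => c.oid, heapallindexed => true)
    FROM pg_class c
    JOIN pg_am am ON am.oid = c.relam
    WHERE am.amname = 'btree'
      AND c.relkind = 'i'
      AND c.relnamespace <> 'pg_catalog'::regnamespace;

heapallindexed => true makes the check considerably slower, but it is what
detects the missing-entry corruption discussed in this thread.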

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson





> On May 26, 2022, at 07:40, Bruce Momjian <bruce@momjian.us> wrote:
>
> Uh, if people don't know if they have used CREATE/REINDEX CONCURRENTLY
> in PG 14.0+, they are going to need to run amcheck on all btree indexes
> and reindex all non-btree indexes?  That is going to be a painful
> message to deliver, and hear.  :-(

Unless we are 100% sure that amcheck will never return a false negative in this situation, we may have to recommend
reindexing everything.  Ouch.


On Thu, May 26, 2022 at 07:41:28AM -0700, Christophe Pettus wrote:
> 
> 
> > On May 26, 2022, at 07:40, Bruce Momjian <bruce@momjian.us> wrote:
> > 
> > Uh, if people don't know if they have used CREATE/REINDEX CONCURRENTLY
> > in PG 14.0+, they are going to need to run amcheck on all btree indexes
> > and reindex all non-btree indexes?  That is going to be a painful
> > message to deliver, and hear.  :-(
> 
> Unless we are 100% sure that amcheck will never return a false negative in this situation, we may have to recommend
> reindexing everything.  Ouch.

I assume we are then looking at reindexdb, which has a parallel option. 
I assume we can also recommend --concurrently on a patched PG 14.4,
which might not be terrible since it would not cause downtime.  I guess
the command would be:

    $ reindexdb --all --concurrently --jobs ##

Without reindexdb's features, this would be much more painful.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson




On Thu, May 26, 2022 at 10:48:27AM -0400, Bruce Momjian wrote:
> I assume we are then looking at reindexdb, which has a parallel option.
> I assume we can also recommend --concurrently on a patched PG 14.4,
> which might not be terrible since it would not cause downtime.  I guess
> the command would be:
>
>     $ reindexdb --all --concurrently --jobs ##
>
> Without reindexdb's features, this would be much more painful.

Worth noting that CONCURRENTLY is not supported on system catalogs so
these are safe from the start, and that reindexdb skips the parallel
processing of catalogs when using --jobs and --concurrently.  So that
would be non-intrusive.
--
Michael

Hi,

On 2022-05-25 18:43:22 +0200, Alvaro Herrera wrote:
> On 2022-May-25, Robert Haas wrote:
> > It seems to me that we should just revert.
> 
> Deciding to revert makes me sad, because this feature is extremely
> valuable for users.  However, I understand the danger and I don't
> disagree with the rationale so I can't really object.

Since we haven't come up with a better plan, let's do this?

Seems like we should add a test for HOT vs CONCURRENTLY? Seems easy enough to
screw up some way in the future...

Greetings,

Andres Freund




> On 28 May 2022, at 04:13, Andres Freund <andres@anarazel.de> wrote:
>
> Seems like we should add a test for HOT vs CONCURRENTLY?

This test fails on my machine with 35K transactions. And fails in ~50% of cases with 30K transactions.
And this seems like a lot of work, ~10 CPU seconds. How can we reduce the time to reproduce?

Best regards, Andrey Borodin.


Hi,

On 2022-05-28 11:26:55 +0500, Andrey Borodin wrote:
> > On 28 May 2022, at 04:13, Andres Freund <andres@anarazel.de> wrote:
> > 
> > Seems like we should add a test for HOT vs CONCURRENTLY?
> 
> This test fails on my machine with 35K transactions. And fails in ~50% of cases with 30K transactions.
> And this seems like a lot of work, ~10 CPU seconds. How can we reduce the time to reproduce?

I think you basically need to force some, but not all, of the modifying
transactions to be open for a bit longer, so that it's more likely that
there's a chance to prune vs CIC waiting. Might also be helpful to update rows
multiple times within an xact.

Greetings,

Andres Freund




> On 28 May 2022, at 12:02, Andres Freund <andres@anarazel.de> wrote:
>
> I think you basically need to force some, but not all, of the modifying
> transactions to be open for a bit longer, so that it's more likely that
> there's a chance to prune vs CIC waiting. Might also be helpful to update rows
> multiple times within an xact.
Now I've got 2 different versions of the test for the master branch. Both fail in 50% of cases on my machine. Both take
approximately 4 seconds of wallclock time and 1 second of CPU time.

v3: wait with a fraction of waiting transactions.
This test fails with
0   postgres                            0x00000001049ec508 ExceptionalCondition + 124
1   postgres                            0x00000001045ea284 heap_page_prune + 2992
2   postgres                            0x00000001045e9670 heap_page_prune_opt + 424
3   postgres                            0x00000001045e25c0 heapam_index_fetch_tuple + 140
4   postgres                            0x0000000100272d60 index_fetch_heap + 104
5   postgres                            0x0000000100272e18 index_getnext_slot + 88
6   postgres                            0x00000001003bbf4c check_exclusion_or_unique_constraint + 440
7   postgres                            0x00000001003bc360 ExecCheckIndexConstraints + 232
8   postgres                            0x00000001003ea30c ExecInsert + 1024
9   postgres                            0x00000001003e90cc ExecModifyTable + 1536
10  postgres                            0x00000001003bd0cc standard_ExecutorRun + 268
11  postgres                            0x0000000100542d94 ProcessQuery + 160
12  postgres                            0x00000001005423c8 PortalRunMulti + 396
13  postgres                            0x0000000100541cfc PortalRun + 476

And reverting d9d0762 does not fix the issue. I'm not sure if I'm observing some other problem here.

v4 of the test does not use pg_sleep() and fails with a regular amcheck failure. Reverting d9d0762 fixes the test.
Unless I execute the test for 1 million transactions; then it fails even with the revert...

I suspect that v3 and v4 triggers different problems.

Best regards, Andrey Borodin.



Hi,

On 2022-05-28 19:46:40 +0500, Andrey Borodin wrote:
> > On 28 May 2022, at 12:02, Andres Freund <andres@anarazel.de> wrote:
> > 
> > I think you basically need to force some, but not all, of the modifying
> > transactions to be open for a bit longer, so that it's more likely that
> > there's a chance to prune vs CIC waiting. Might also be helpful to update rows
> > multiple times within an xact.
> Now I've got 2 different versions of test for master branch. Both fail in 50% of cases on my machine. Both take
> approximately 4 seconds of wallclock time and 1 second of CPU time.
> 
> v3: wait with a fraction of waiting transactions.
> This test fails with
> 0   postgres                            0x00000001049ec508 ExceptionalCondition + 124
> 1   postgres                            0x00000001045ea284 heap_page_prune + 2992
> 2   postgres                            0x00000001045e9670 heap_page_prune_opt + 424
> 3   postgres                            0x00000001045e25c0 heapam_index_fetch_tuple + 140
> 4   postgres                            0x0000000100272d60 index_fetch_heap + 104
> 5   postgres                            0x0000000100272e18 index_getnext_slot + 88
> 6   postgres                            0x00000001003bbf4c check_exclusion_or_unique_constraint + 440
> 7   postgres                            0x00000001003bc360 ExecCheckIndexConstraints + 232
> 8   postgres                            0x00000001003ea30c ExecInsert + 1024
> 9   postgres                            0x00000001003e90cc ExecModifyTable + 1536
> 10  postgres                            0x00000001003bd0cc standard_ExecutorRun + 268
> 11  postgres                            0x0000000100542d94 ProcessQuery + 160
> 12  postgres                            0x00000001005423c8 PortalRunMulti + 396
> 13  postgres                            0x0000000100541cfc PortalRun + 476
> 
> And reverting d9d0762 does not fix the issue. I'm not sure if I'm observing some other problem here.

I've not been able to reproduce this issue. Even after increasing the number
of clients and transactions, and running the test a number of times. With
d9d0762 reverted, the problem doesn't happen anymore for me.

Any chance you hit this with d9d0762 reverted? It's easy to e.g. revert and
run the tests without recreating the temp-install, to reduce cycle times.

Was there anything else running on the system? c98763bf51bf also needs to be
reverted, of course.


> v4 of a test not use pg_sleep() and fails with regular amcheck failure. Reverting d9d0762 fixes the test. Unless I
> execute the test for 1 million transactions, then it fail even with a revert...

What you're saying is that the revert might not actually fix the problem, or
that you're hitting a separate bug. Correct?

If if it fails after 1 mio xact, what are the symptoms?

It might be worth trying to repro the problem in 13 or such.

Greetings,

Andres Freund




> On 29 May 2022, at 00:34, Andres Freund <andres@anarazel.de> wrote:
>
> c98763bf51bf also needs to
> reverted, of course.
Reverting c98763bf51bf fixed both tests. I did not realize that we were going to revert it too.

Thanks!

Best regards, Andrey Borodin.


On Sat, May 28, 2022 at 12:34:13PM -0700, Andres Freund wrote:
> I've not been able to reproduce this issue. Even after increasing the number
> of clients and transactions, and running the test a number of times. With
> d9d0762 reverted, the problem doesn't happen anymore for me.

Neither was I able to reproduce this issue after a couple of hours
making my laptop hotter, using either the TAP tests or some equivalent
manual tests, and the revert of d9d0762 helps.

> Any chance you hit this with d9d0762 reverted? It's easy to e.g. revert and
> run the tests without recreating the temp-install, to reduce cycle times.
>
> Was there anything else running on the system? c98763bf51bf also needs to
> reverted, of course.

Yeah, I agree that we'd better revert c98763bf for the time being.
And f9900df on top of that?

I was trying to think of ways to get an isolation test out of that,
but that proves to be sort of tricky as we need to manipulate the HOT
chains after the validation phase has begun with the snapshot from the
build phase.  It is easy to block before the validation transaction
starts, like in WaitForLockersMultiple() beforehand, though.
--
Michael

On 2022-May-30, Michael Paquier wrote:

> > Was there anything else running on the system? c98763bf51bf also needs to
> > reverted, of course.
> 
> Yeah, I agree that we'd better revert c98763bf for the time being.
> And f9900df on top of that?

Yeah, both commits need to be reverted, since the latter depends
critically on the former.  I'll get the revert pushed soon.

> I was trying to think of ways to get an isolation test out of that,
> but that proves to be sort of tricky as we need to manipulate the HOT
> chains after the validation phase has begun with the snapshot from the
> build phase.  It is easy to block before the validation transaction
> starts, like in WaitForLockersMultiple() beforehand, though.

Hmm.  I suppose for the next try of implementing a feature like this,
we'll definitely want to incorporate some tests that can catch problems
of this sort.  But I don't think we need to come up with something right
now.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Hi,

On 2022-05-29 09:15:15 +0500, Andrey Borodin wrote:
> > On 29 May 2022, at 00:34, Andres Freund <andres@anarazel.de> wrote:
> > 
> > c98763bf51bf also needs to
> > reverted, of course.
> Reverting c98763bf51bf fixed both tests. I did not understand that we are going to revert it too.

We're going to need to revert both, but I don't understand why c98763bf51bf
would affect your tests. AFAICS it should just affect CIC, not RIC. Could
you make sure you can reproduce the issue without c98763bf51bf reverted?

Greetings,

Andres Freund



On 2022-05-30 15:54:08 +0900, Michael Paquier wrote:
> Yeah, I agree that we'd better revert c98763bf for the time being.
> And f9900df on top of that?

Well, f9900df needs to be reverted, because it caused the problem at hand, and
is ontop of c98763bf...


> I was trying to think of ways to get an isolation test out of that,
> but that proves to be sort of tricky as we need to manipulate the HOT
> chains after the validation phase has begun with the snapshot from the
> build phase.  It is easy to block before the validation transaction
> starts, like in WaitForLockersMultiple() beforehand, though.

I think it's ok if we have a heuristic test for this kind of thing. It
sometimes can even be good, because it means we'll get different schedulings
over time, hitting "unknown" bugs.

Greetings,

Andres Freund



On Mon, May 30, 2022 at 12:10 PM Andres Freund <andres@anarazel.de> wrote:
> > I was trying to think of ways to get an isolation test out of that,
> > but that proves to be sort of tricky as we need to manipulate the HOT
> > chains after the validation phase has begun with the snapshot from the
> > build phase.  It is easy to block before the validation transaction
> > starts, like in WaitForLockersMultiple() beforehand, though.
>
> I think it's ok if we have a heuristic test for this kind of thing. It
> sometimes can even be good, because it means we'll get different schedulings
> over time, hitting "unknown" bugs.

As long as it has a reasonably good chance of failing with the bug,
it's still a valid test IMV. As you say, there may be some value in
not over-specifying what the problem is -- that could actually bring
unknown issues to light, especially if the test uses amcheck.

Clearly there is never any strict guarantee that writing a test will
avoid even one bug in the future. You're always working off some
intuition about what related problems might happen in the future,
weighed against the costs (mostly the added test cycles). And so
adding an imprecise test really isn't very different to adding a
precise test that reliably catches the bug that the test was written
to catch.

-- 
Peter Geoghegan



On 2022-May-30, Andres Freund wrote:

> On 2022-05-30 15:54:08 +0900, Michael Paquier wrote:
> > Yeah, I agree that we'd better revert c98763bf for the time being.
> > And f9900df on top of that?
> 
> Well, f9900df needs to be reverted, because it caused the problem at hand, and
> is ontop of c98763bf...

I'm not sure I understand what you're saying here.  I understood that
both RIC and CIC are affected, so the reversal is of the following
commits:

c98763bf51bf Avoid spurious waits in concurrent indexing
f9900df5f949 Avoid spurious wait in concurrent reindex
d9d076222f5b VACUUM: ignore indexing operations with CONCURRENTLY


-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Use it up, wear it out, make it do, or do without"



On 2022-May-30, Alvaro Herrera wrote:

> I'm not sure I understand what you're saying here.  I understood that
> both RIC and CIC are affected, so the reversal is of the following
> commits:
> 
> c98763bf51bf Avoid spurious waits in concurrent indexing
> f9900df5f949 Avoid spurious wait in concurrent reindex
> d9d076222f5b VACUUM: ignore indexing operations with CONCURRENTLY

This is the full reversal in branch master.

I'm unable to reproduce a problem with the v3 patch Andrey last
provided, however.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Tiene valor aquel que admite que es un cobarde" (Fernandel)

On 2022-05-30 22:40:39 +0200, Alvaro Herrera wrote:
> On 2022-May-30, Andres Freund wrote:
> 
> > On 2022-05-30 15:54:08 +0900, Michael Paquier wrote:
> > > Yeah, I agree that we'd better revert c98763bf for the time being.
> > > And f9900df on top of that?
> > 
> > Well, f9900df needs to be reverted, because it caused the problem at hand, and
> > is ontop of c98763bf...
> 
> I'm not sure I understand what you're saying here.  I understood that
> both RIC and CIC are affected, so the reversal is of the following
> commits:

I was just a bit confused about Michael's phrasing of reverting f9900df "on
top of" c98763bf. Not important...


> c98763bf51bf Avoid spurious waits in concurrent indexing
> f9900df5f949 Avoid spurious wait in concurrent reindex
> d9d076222f5b VACUUM: ignore indexing operations with CONCURRENTLY

That looks right, yes.

Greetings,

Andres Freund



On Mon, May 30, 2022 at 12:10:04PM -0700, Andres Freund wrote:
> I think it's ok if we have a heuristic test for this kind of thing. It
> sometimes can even be good, because it means we'll get different schedulings
> over time, hitting "unknown" bugs.

Yeah, the tricky part is how to parameterize that to be cheap yet still
useful.  Having a larger number of attributes makes the particular
problem of this thread easier to hit because it enlarges the
validation window in the concurrent build, and generating such dummy
data with a simple schema should be quick enough, but I would not
accept in the tree a test that takes ~5s to run without a rather good
hit rate.  For this particular issue, I would say that something able
to fail up to 20%~25% of the time for a short run-time would be quite
decent, actually, even if that sounds low in a single run because the
odds of detecting one failure increase a "lot" after a few repeated
runs.
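To put rough numbers on that (purely illustrative arithmetic, not a claim
about the actual hit rate): with a per-run hit rate p, the chance that at
least one of n independent runs detects the problem is 1 - (1-p)^n:

```sql
-- Chance of at least one detection in 5 runs at a 20% per-run hit rate:
-- 1 - (1 - 0.20)^5
SELECT 1 - power(1 - 0.20, 5) AS p_detect_in_5_runs;  -- ≈ 0.67
```

So even a 20%~25% single-run rate gets quite reliable after a handful of runs.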

On my own laptop, for example, the reporter's original test reproduces the
issue every time within 3~4s, so that's good :)
--
Michael

On Mon, May 30, 2022 at 10:43:41PM +0200, Alvaro Herrera wrote:
> This is the full reversal in branch master.

This looks fine to me, thanks.  I can see that you have kept the
reference to ComputeXidHorizons for vacuum_defer_cleanup_age in
guc.c.
--
Michael


> 31 мая 2022 г., в 00:08, Andres Freund <andres@anarazel.de> написал(а):
>
> We're going to need to revert both, but I don't understand why c98763bf51bf
> would affect this your tests. Afaics it should just affect CIC, not RIC. Could
> you make sure you can reproduce the issue without c98763bf51bf reverted?

Here's the revision with the v3 tests and the revert of d9d0762 [0]. In CI
you can see the tests pass on Linux and FreeBSD but fail on macOS [1].

Best regards, Andrey Borodin.

[0] https://github.com/x4m/postgres_g/commits/test
[1] https://cirrus-ci.com/task/4580203115577344?logs=test_world#L1566


On 2022-May-31, Michael Paquier wrote:

> On Mon, May 30, 2022 at 10:43:41PM +0200, Alvaro Herrera wrote:
> > This is the full reversal in branch master.
> 
> This looks fine to me, thanks.  I can see that you have kept the
> reference to ComputeXidHorizons for vacuum_defer_cleanup_age in
> guc.c.

Yeah, there was a typo fix there that I also kept, as well as one newline
change.  Basically I scanned the reversal patch manually and removed from
it the changes I thought were better kept.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"El miedo atento y previsor es la madre de la seguridad" (E. Burke)



On 2022-May-31, Michael Paquier wrote:

> Yeah, the tricky part is how to parameterize that to be cheap yet still
> useful.  Having a larger number of attributes makes the particular
> problem of this thread easier to hit because it enlarges the
> validation window in the concurrent build,

Hmm, I wonder if it's possible to use an index ON tab (pg_sleep(100ms))
or something like that (perhaps even use an expression involving an
advisory lock acquisition, which is being held by the other session), to
widen the window.
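Something along these lines, perhaps (sketch only; the function and table
names are made up, and lying about immutability like this is strictly test
scaffolding):

```sql
-- Deliberately mislabeled IMMUTABLE so it can appear in an index
-- expression; the volatile pg_sleep() inside widens the validation
-- window during CREATE/REINDEX INDEX CONCURRENTLY.
CREATE FUNCTION slow_identity(i int) RETURNS int
LANGUAGE plpgsql IMMUTABLE AS $$
BEGIN
    PERFORM pg_sleep(0.1);  -- or wait on an advisory lock held elsewhere
    RETURN i;
END;
$$;

CREATE INDEX CONCURRENTLY tab_slow_idx ON tab (slow_identity(i));
```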

The stop events patch that A. Korotkov posted [0] would likely be very
useful here.

[0] https://postgr.es/m/CAPpHfdtSEOHX8dSk9Qp+Z++i4BGQoffKip6JDWngEA+g7Z-XmQ@mail.gmail.com


> On my own laptop, for example, the reporter's original test reproduces the
> issue every time within 3~4s, so that's good :)

Sadly, my laptop is somehow broken and I haven't reproduced the failure
even a single time, and my other regular computers are at high seas (I hope).

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/



On Tue, May 31, 2022 at 11:19:01AM +0200, Alvaro Herrera wrote:
> Hmm, I wonder if it's possible to use an index ON tab (pg_sleep(100ms))
> or something like that (perhaps even use an expression involving an
> advisory lock acquisition, which is being held by the other session), to
> widen the window.

I have not considered the use of expressions to enforce a lock at this
stage.  I'll give it a shot, even if it comes to the abuse of an
immutable function with volatile internals :)

> The stop events patch that A. Korotkov posted [0] would likely be very
> useful here.
>
> [0] https://postgr.es/m/CAPpHfdtSEOHX8dSk9Qp+Z++i4BGQoffKip6JDWngEA+g7Z-XmQ@mail.gmail.com

Thanks for the reference.  I've wanted to resurrect this stuff,
actually, after dealing with 6dced63 to be able to have a stop point
in the middle of a restart point.
--
Michael

On 2022-May-26, Christophe Pettus wrote:

> Unless we are 100% sure that amcheck will never return a false
> negative in this situation, we may have to recommend reindexing
> everything.

I am not aware of any reason why amcheck would return a false negative.
Is there any evidence behind this assertion?

If there is none, the recommendation should be to use amcheck on all
btree indexes and reindex those that have the problem; and reindex all
indexes of other AMs that could have been reindexed or created
concurrently in 14beta1 or later (implying: an index that was created in
13 and pg_upgraded but not touched afterwards is not at risk).
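Concretely, a query along the lines of the example in the amcheck
documentation could drive the check over every btree index (a sketch; the
heapallindexed => true part is the expensive bit, and temp indexes of other
sessions must be skipped):

```sql
-- Check all btree indexes, including the heap/index correspondence.
SELECT c.relname, bt_index_check(index => c.oid, heapallindexed => true)
FROM pg_index i
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_am am ON c.relam = am.oid
WHERE am.amname = 'btree'
  AND c.relpersistence <> 't';  -- skip temporary indexes
```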

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



On 2022-May-31, Alvaro Herrera wrote:

> If there is none, the recommendation should be to use amcheck on all
> btree indexes and reindex those that have the problem; and reindex all
> indexes of other AMs that could have been reindexed or created
> concurrently in 14beta1 or later (implying: an index that was created in
> 13 and pg_upgraded but not touched afterwards is not at risk).

Another possibility for very large indexes may be to disable all types
of index scans, then apply no-op UPDATEs to the unindexed rows until they
migrate to some other heap block, then vacuum.  After that, amcheck
should issue a clean report.  This is much less intrusive than
reindexing possibly several TB of data.
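Roughly like this (sketch only; the ctid is the one from the original report,
and the exact UPDATE needed to move a row off its page depends on fillfactor
and HOT applicability):

```sql
-- Make sure the planner cannot use the corrupt index meanwhile:
SET enable_indexscan = off;
SET enable_indexonlyscan = off;
SET enable_bitmapscan = off;

-- No-op UPDATE of a row known (via amcheck) to lack an index entry;
-- repeat until its ctid moves to a different block, then vacuum.
UPDATE test2 SET id = id WHERE ctid = '(13134,18)';
VACUUM test2;
```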

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



On Tue, May 31, 2022 at 7:43 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I am not aware of any reason why amcheck would return a false negative.
> Is there any evidence behind this assertion?

With heapallindexed verification, there is a theoretical risk of false
negatives from the Bloom filter itself, due to hash collisions. But the chance
of that happening is rather low. And there is a random seed, so absent
an extreme shortage of memory the collision shouldn't recur on a
second or a third attempt at verification. Or with a second or third
missing tuple in any single attempt at verifying a corrupt index.
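For a sense of scale (illustrative parameters only; amcheck actually sizes
its filter from the index tuple count and available memory), the standard
Bloom approximation puts the per-tuple collision probability at about
(1 - e^(-kn/m))^k:

```sql
-- Illustrative: k = 2 hash functions, n = 1e6 tuples, m = 1e7 bits.
SELECT power(1 - exp(-2.0 * 1000000 / 10000000), 2) AS approx_collision_rate;
-- ≈ 0.033 at these (made-up) parameters
```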

> If there is none, the recommendation should be to use amcheck on all
> btree indexes and reindex those that have the problem; and reindex all
> indexes of other AMs that could have been reindexed or created
> concurrently in 14beta1 or later (implying: an index that was created in
> 13 and pg_upgraded but not touched afterwards is not at risk).

I'm in favor of this.

We agonized about these false negatives back in 2017, when the feature
went in, but I think that they're a negligible issue compared to
everything else. It just doesn't need to be discussed in the release
notes. The large majority of people that do this will find nothing at
all, because there is no corruption.

Lots of people that already use amcheck probably don't use
heapallindexed verification at all. For one thing it's not the default
with pg_amcheck. It's also significantly more expensive.

-- 
Peter Geoghegan



On 2022-May-28, Andrey Borodin wrote:

> And reverting d9d0762 does not fix the issue. I'm not sure if I'm observing some other problem here.
> 
> v4 of the test does not use pg_sleep() and fails with a regular amcheck
> failure. Reverting d9d0762 fixes the test, unless I execute the test
> for 1 million transactions; then it fails even with the revert...
> 
> I suspect that v3 and v4 trigger different problems.

Hmm, the only failure I see with v4 is a deadlock of this sort:

2022-05-31 17:26:13.400 CEST [130375] 004_rc.pl ERROR:  deadlock detected
2022-05-31 17:26:13.400 CEST [130375] 004_rc.pl DETAIL:  Process 130375 waits for ShareLock on transaction 36108; blocked by process 130372.
    Process 130372 waits for ShareLock on transaction 36107; blocked by process 130375.
    Process 130375: INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
                    on conflict(i) do update set updated_at = now();
    Process 130372: INSERT INTO tbl VALUES(random()*1000,0,0,0,now())
                    on conflict(i) do update set updated_at = now();

which is of course not very interesting.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
<Schwern> It does it in a really, really complicated way
<crab> why does it need to be complicated?
<Schwern> Because it's MakeMaker.




> On 31 May 2022, at 20:30, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2022-May-28, Andrey Borodin wrote:
>
>> And reverting d9d0762 does not fix the issue. I'm not sure if I'm observing some other problem here.
>>
>> v4 of the test does not use pg_sleep() and fails with a regular amcheck
>> failure. Reverting d9d0762 fixes the test, unless I execute the test
>> for 1 million transactions; then it fails even with the revert...
>>
>> I suspect that v3 and v4 trigger different problems.
>
> Hmm, the only failure I see with v4 is a deadlock of this sort:

As seen in CI with reverting only d9d0762:
 v3 fails only on MacOS [0,1]
 v4 fails on Linux and FreeBSD [2,3]

So it's kind of system-dependent...


> On 31 May 2022, at 19:43, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> (implying: an index that was created in
> 13 and pg_upgraded but not touched afterwards is not at risk).

13.5+, to be accurate. Earlier 13 releases contained a CIC/RC bug fixed by fdd965d0.


Best regards, Andrey Borodin.

[0] https://github.com/x4m/postgres_g/runs/6665396918
[1] https://cirrus-ci.com/task/4580203115577344?logs=test_world#L1571
[2] https://github.com/x4m/postgres_g/runs/6675153300
[3] https://cirrus-ci.com/task/6540481319403520?logs=test_world#L842


On 2022-May-30, Andres Freund wrote:

> On 2022-05-30 22:40:39 +0200, Alvaro Herrera wrote:

> > c98763bf51bf Avoid spurious waits in concurrent indexing
> > f9900df5f949 Avoid spurious wait in concurrent reindex
> > d9d076222f5b VACUUM: ignore indexing operations with CONCURRENTLY
> 
> That looks right, yes.

Sorry, after going through these commits again, this is a
misunderstanding on my part.  The first two commits need not be
reverted; only the third one does, which is the one that changes the way
the system-wide Xmin is determined (the commit message says it affects
VACUUM, but naturally it affects HOT pruning as well).

The point of the first two changes is to remove snapshot waits in CIC
and RC.  Before those commits, each CIC would wait for all other CICs
and RCs in the system.  AFAICS this is unrelated to the HOT-pruning bug.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Para tener más hay que desear menos"



On 2022-May-31, Andrey Borodin wrote:

> As seen in CI with reverting only d9d0762:
>  v3 fails only on MacOS [0,1]
>  v4 fails on Linux and FreeBSD [2,3]

> [0] https://github.com/x4m/postgres_g/runs/6665396918
> [1] https://cirrus-ci.com/task/4580203115577344?logs=test_world#L1571
> [2] https://github.com/x4m/postgres_g/runs/6675153300
> [3] https://cirrus-ci.com/task/6540481319403520?logs=test_world#L842

Hmm ... Exactly what code is being tested here?  I see in one of these
backtraces that there is a reference to a function called
heap_prune_from_root, but I can't find that in the Postgres sources.

At this point, I'm clear that d9d076222f5b needs to be reverted because
of the interactions with HOT prune, but it's not at all clear to me what
is the relationship of the other two commits to bugs.  I am going to
revert that one now, so that we can continue to diagnose any lingering
problem that may be related to the other two commits -- but perhaps it
is something else.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/



On Tue, May 31, 2022 at 11:42 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Hmm ... Exactly what code is being tested here?  I see in one of these
> backtraces that there is a reference to a function called
> heap_prune_from_root, but I can't find that in the Postgres sources.

This includes a patch from me, to make pruning more robust. It could
technically be a bug in my patch, though a bug revealed by my patch
seems quite possible too.

-- 
Peter Geoghegan



On 2022-May-31, Peter Geoghegan wrote:

> On Tue, May 31, 2022 at 11:42 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > Hmm ... Exactly what code is being tested here?  I see in one of these
> > backtraces that there is a reference to a function called
> > heap_prune_from_root, but I can't find that in the Postgres sources.
> 
> This includes a patch from me, to make pruning more robust. It could
> technically be a bug in my patch, though a bug revealed by my patch
> seems quite possible too.

Ah, I see.  Well, the explanation that depends on Xmin doesn't apply to
this case.  I'm not saying there are no bugs in the other two commits,
just that such bug(s) have a different explanation, so it makes sense to
revert separately.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"La rebeldía es la virtud original del hombre" (Arthur Schopenhauer)



On Tue, May 31, 2022 at 11:54 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2022-May-31, Peter Geoghegan wrote:
> > This includes a patch from me, to make pruning more robust. It could
> > technically be a bug in my patch, though a bug revealed by my patch
> > seems quite possible too.
>
> Ah, I see.  Well, the explanation that depends on Xmin doesn't apply to
> this case.  I'm not saying there are no bugs in the other two commits,
> just that such bug(s) have a different explanation, so it makes sense to
> revert separately.

The assertion failure takes place in heapam_index_build_range_scan(),
at the point that it is called by amcheck itself -- during an
opportunistic prune that occurs in passing. This is the specific
assertion that fails:


https://github.com/x4m/postgres_g/blob/09d76bdc2528b476ef2f04c793061dcbf23372b5/src/backend/access/heap/pruneheap.c#L762

Another bug in HEAD definitely seems possible here, but it's hard to
know without careful testing. The failure that we see is limited to
the "macOS - Monterey" run, which only fails on this new test.

-- 
Peter Geoghegan



Hi,

On 2022-05-31 12:09:04 -0700, Peter Geoghegan wrote:
> On Tue, May 31, 2022 at 11:54 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > On 2022-May-31, Peter Geoghegan wrote:
> > > This includes a patch from me, to make pruning more robust. It could
> > > technically be a bug in my patch, though a bug revealed by my patch
> > > seems quite possible too.
> >
> > Ah, I see.  Well, the explanation that depends on Xmin doesn't apply to
> > this case.  I'm not saying there are no bugs in the other two commits,
> > just that such bug(s) have a different explanation, so it makes sense to
> > revert separately.
> 
> The assertion failure takes place in heapam_index_build_range_scan(),
> at the point that it is called by amcheck itself -- during an
> opportunistic prune that occurs in passing. This is the specific
> assertion that fails:
> 
>
> https://github.com/x4m/postgres_g/blob/09d76bdc2528b476ef2f04c793061dcbf23372b5/src/backend/access/heap/pruneheap.c#L762

I don't think that assertion is correct.

Consider transactions aborting concurrently with heap pruning. You could have
done a HTSV for one chain element, a concurrent abort happened, then you did
the HTSV for another chain element. If the HTSVs were not in the order of the
HOT chain you could see HEAPTUPLE_DEAD for an earlier chain element, while
seeing HEAPTUPLE_INSERT_IN_PROGRESS in a later one.  There's several other
scenarios with subtransaction aborts as well, I think.

Greetings,

Andres Freund



On 2022-May-31, Peter Geoghegan wrote:

> The assertion failure takes place in heapam_index_build_range_scan(),
> at the point that it is called by amcheck itself -- during an
> opportunistic prune that occurs in passing. This is the specific
> assertion that fails:
> 
>
> https://github.com/x4m/postgres_g/blob/09d76bdc2528b476ef2f04c793061dcbf23372b5/src/backend/access/heap/pruneheap.c#L762
> 
> Another bug in HEAD definitely seems possible here, but it's hard to
> know without careful testing. The failure that we see is limited to
> the "macOS - Monterey" run, which only fails on this new test.

Actually, it fails on Linux, FreeBSD and macOS -- it's just that one
version of the test fails on macOS and the other fails on the other two
systems.  But both of those failures occur in this assertion.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/



On Tue, May 31, 2022 at 12:19 PM Andres Freund <andres@anarazel.de> wrote:
> I don't think that assertion is correct.
>
> Consider transactions aborting concurrently with heap pruning. You could have
> done a HTSV for one chain element, a concurrent abort happened, then you did
> the HTSV for another chain element. If the HTSVs were not in the order of the
> HOT chain you could see HEAPTUPLE_DEAD for an earlier chain element, while
> seeing HEAPTUPLE_INSERT_IN_PROGRESS in a later one.  There's several other
> scenarios with subtransaction aborts as well, I think.

I think that it probably was correct before I rebased the patch on top
of your bugfix commit 18b87b201f. The original version would have
actually called HTSV directly, at the point that it accessed each
tuple from a HOT chain. If nothing else this suggests that the patch
should be clear on this point about not calling HTSV in HOT chain
order.

Offhand I think that it probably would still work if it was limited to
HEAPTUPLE_LIVE (no more asserting in the HEAPTUPLE_INSERT_IN_PROGRESS
case). Not sure if that's worth it. A topic for another time.

-- 
Peter Geoghegan



On 2022-May-31, Alvaro Herrera wrote:

> At this point, I'm clear that d9d076222f5b needs to be reverted because
> of the interactions with HOT prune, but it's not at all clear to me what
> is the relationship of the other two commits to bugs.  I am going to
> revert that one now, so that we can continue to diagnose any lingering
> problem that may be related to the other two commits -- but perhaps it
> is something else.

Reverted.  (This is the same commit that Michaël found to blame at the
start of the thread.)

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/



Hi,

On 2022-05-31 16:43:00 +0200, Alvaro Herrera wrote:
> On 2022-May-26, Christophe Pettus wrote:
>
> > Unless we are 100% sure that amcheck will never return a false
> > negative in this situation, we may have to recommend reindexing
> > everything.
>
> I am not aware of any reason why amcheck would return a false negative.
> Is there any evidence behind this assertion?

I think it's very likely to be possible to have false negatives:

1) It's pretty obvious that you could have constraint violations that went
   unnoticed due to the index corruption, where the missing heap tuple has
   since been removed. Consider e.g. a foreign key cascading deletion that
   missed the tuple due to the missing index entry - if the
   wrongly-not-deleted tuple is subsequently updated, without changing the
   fkey-relevant columns, everything would look fine from amcheck's view now.

2) It looks like we may not just end up with too few index entries, but can
   also end up with multiple index entries for a HOT chain. I don't think
   amcheck can detect this right now.

3) I think it's possible to end up with bogus chain root pointers being used
   when inserting index entries due to the bug. Those indexes will pass
   heapallindexed verification, but the index entries will point to bogus tuples.

Greetings,

Andres Freund