Thread: contrib/pg_visibility fails regression under CLOBBER_CACHE_ALWAYS
husky just reported $SUBJECT:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=husky&dt=2021-06-05%2013%3A42%3A17

and I find I can reproduce that locally:

diff -U3 /home/postgres/pgsql/contrib/pg_visibility/expected/pg_visibility.out /home/postgres/pgsql/contrib/pg_visibility/results/pg_visibility.out
--- /home/postgres/pgsql/contrib/pg_visibility/expected/pg_visibility.out	2021-01-20 11:12:24.854346717 -0500
+++ /home/postgres/pgsql/contrib/pg_visibility/results/pg_visibility.out	2021-06-06 22:12:07.948890104 -0400
@@ -215,7 +215,8 @@
  0 | f | f
  1 | f | f
  2 | t | t
-(3 rows)
+ 3 | t | t
+(4 rows)
 
 select * from pg_check_frozen('copyfreeze');
  t_ctid
@@ -235,7 +236,8 @@
  0 | t | t
  1 | f | f
  2 | t | t
-(3 rows)
+ 3 | t | t
+(4 rows)
 
 select * from pg_check_frozen('copyfreeze');
  t_ctid

The test cases that are failing date back to January (7db0cd2145f),
so I think this is some side-effect of a recent commit, but I have
no idea which one.

			regards, tom lane
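As background for the queries in the diff above: pg_visibility_map() returns one row per heap page with that page's visibility-map bits, and pg_check_frozen() lists tuples that sit on pages marked all-frozen without actually being frozen. A minimal, self-contained sketch of that kind of check (the vm_demo table is invented here for illustration; it is not the test's copyfreeze table):

create extension if not exists pg_visibility;
create table vm_demo (a int, b text);
insert into vm_demo select g, g::text from generate_series(1, 1000) g;
vacuum (freeze) vm_demo;                    -- sets the all-visible/all-frozen bits
select blkno, all_visible, all_frozen
from pg_visibility_map('vm_demo');          -- one row per heap page
select * from pg_check_frozen('vm_demo');   -- expect zero rows on a healthy table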
On Mon, Jun 7, 2021 at 11:15 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> husky just reported $SUBJECT:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=husky&dt=2021-06-05%2013%3A42%3A17
>
> and I find I can reproduce that locally:
>
> [...]
>
> The test cases that are failing date back to January (7db0cd2145f),
> so I think this is some side-effect of a recent commit, but I have
> no idea which one.

It seems like the recent revert (8e03eb92e9a) is relevant.

After committing 7db0cd2145f we had the same regression test failure
in January[1]. Then we fixed that issue by 39b66a91b. But since we
recently reverted most of 39b66a91b, the same issue happened again.

Regards,

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2021-01-19+20%3A27%3A46

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Mon, Jun 7, 2021 at 4:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> [...]
>
> It seems like the recent revert (8e03eb92e9a) is relevant.
>
> After committing 7db0cd2145f we had the same regression test failure
> in January[1]. Then we fixed that issue by 39b66a91b. But since we
> recently reverted most of 39b66a91b, the same issue happened again.

So the cause of this failure seems to be the same as before. The failing test is:

begin;
truncate copyfreeze;
copy copyfreeze from stdin freeze;
1	'1'
2	'2'
3	'3'
4	'4'
5	'5'
\.
copy copyfreeze from stdin;
6	'6'
\.
copy copyfreeze from stdin freeze;
7	'7'
8	'8'
9	'9'
10	'10'
11	'11'
12	'12'
\.
commit;

If the cached target block is invalidated before the third COPY, we
start inserting the frozen tuples into a new page, so the third COPY
ends up adding two blocks in total. I think we still need the
following part of the reverted code, so that we don't leave the page
partially empty after a relcache invalidation:

--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -407,19 +407,19 @@ RelationGetBufferForTuple(Relation relation, Size len,
 		 * target.
 		 */
 		targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
-	}
 
-	/*
-	 * If the FSM knows nothing of the rel, try the last page before we give
-	 * up and extend.  This avoids one-tuple-per-page syndrome during
-	 * bootstrapping or in a recently-started system.
-	 */
-	if (targetBlock == InvalidBlockNumber)
-	{
-		BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
+		/*
+		 * If the FSM knows nothing of the rel, try the last page before we
+		 * give up and extend.  This avoids one-tuple-per-page syndrome during
+		 * bootstrapping or in a recently-started system.
+		 */
+		if (targetBlock == InvalidBlockNumber)
+		{
+			BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
 
-		if (nblocks > 0)
-			targetBlock = nblocks - 1;
+			if (nblocks > 0)
+				targetBlock = nblocks - 1;
+		}
 	}

Attached is a patch that brings back the above change.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
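To make the symptom concrete, here is one way to check the heap page count directly (a sketch, not part of the attached patch; it assumes the test's copyfreeze table right after the transaction above has committed). The expected layout is three heap pages; the CLOBBER_CACHE_ALWAYS failure shows four:

select pg_relation_size('copyfreeze') / current_setting('block_size')::int
       as heap_pages;                      -- expected 3, failure shows 4
select count(*) as vm_pages
from pg_visibility_map('copyfreeze');      -- same count, one row per heap page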
On 6/7/21 2:11 PM, Masahiko Sawada wrote:
> [...]
>
> If the cached target block is invalidated before the third COPY, we
> start inserting the frozen tuples into a new page, so the third COPY
> ends up adding two blocks in total. I think we still need the
> following part of the reverted code, so that we don't leave the page
> partially empty after a relcache invalidation:
>
> [...]
>
> Attached is a patch that brings back the above change.

Thanks for the analysis! I think you're right - this bit should have
been kept. Partial reverts are tricky :-(

I'll get this fixed / pushed later today, after a bit more testing. I'd
swear I ran tests with CCA, but it's possible I skipped contrib.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jun 7, 2021 at 10:01 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> [...]
>
> Thanks for the analysis! I think you're right - this bit should have
> been kept. Partial reverts are tricky :-(
>
> I'll get this fixed / pushed later today, after a bit more testing. I'd
> swear I ran tests with CCA, but it's possible I skipped contrib.

I had missed this mail. Thank you for pushing the fix!

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/