Thread: contrib/pg_visibility fails regression under CLOBBER_CACHE_ALWAYS
husky just reported $SUBJECT:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=husky&dt=2021-06-05%2013%3A42%3A17

and I find I can reproduce that locally:

diff -U3 /home/postgres/pgsql/contrib/pg_visibility/expected/pg_visibility.out /home/postgres/pgsql/contrib/pg_visibility/results/pg_visibility.out
--- /home/postgres/pgsql/contrib/pg_visibility/expected/pg_visibility.out	2021-01-20 11:12:24.854346717 -0500
+++ /home/postgres/pgsql/contrib/pg_visibility/results/pg_visibility.out	2021-06-06 22:12:07.948890104 -0400
@@ -215,7 +215,8 @@
  0 | f | f
  1 | f | f
  2 | t | t
-(3 rows)
+ 3 | t | t
+(4 rows)
 
 select * from pg_check_frozen('copyfreeze');
  t_ctid
@@ -235,7 +236,8 @@
  0 | t | t
  1 | f | f
  2 | t | t
-(3 rows)
+ 3 | t | t
+(4 rows)
 
 select * from pg_check_frozen('copyfreeze');
  t_ctid

The test cases that are failing date back to January (7db0cd2145f),
so I think this is some side-effect of a recent commit, but I have
no idea which one.

			regards, tom lane
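As background for the queries in the diff above: pg_visibility_map() returns one row per heap page with that page's visibility-map bits, and pg_check_frozen() lists tuples that sit on pages marked all-frozen without actually being frozen. A minimal, self-contained sketch of that kind of check (the vm_demo table is invented here for illustration; it is not the test's copyfreeze table):

create extension if not exists pg_visibility;
create table vm_demo (a int, b text);
insert into vm_demo select g, g::text from generate_series(1, 1000) g;
vacuum (freeze) vm_demo;                    -- sets the all-visible/all-frozen bits
select blkno, all_visible, all_frozen
from pg_visibility_map('vm_demo');          -- one row per heap page
select * from pg_check_frozen('vm_demo');   -- expect zero rows on a healthy table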
On Mon, Jun 7, 2021 at 11:15 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> husky just reported $SUBJECT:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=husky&dt=2021-06-05%2013%3A42%3A17
>
> and I find I can reproduce that locally:
>
> [...]
>
> The test cases that are failing date back to January (7db0cd2145f),
> so I think this is some side-effect of a recent commit, but I have
> no idea which one.

It seems like the recent revert (8e03eb92e9a) is relevant.

After committing 7db0cd2145f we had the same regression test failure
in January[1]. Then we fixed that issue by 39b66a91b. But since we
recently reverted most of 39b66a91b, the same issue happened again.

Regards,

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hyrax&dt=2021-01-19+20%3A27%3A46

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Mon, Jun 7, 2021 at 4:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> [...]
>
> It seems like the recent revert (8e03eb92e9a) is relevant.
>
> After committing 7db0cd2145f we had the same regression test failure
> in January[1]. Then we fixed that issue by 39b66a91b. But since we
> recently reverted most of 39b66a91b, the same issue happened again.

So the cause of this failure seems to be the same as before. The failing test is:

begin;
truncate copyfreeze;
copy copyfreeze from stdin freeze;
1	'1'
2	'2'
3	'3'
4	'4'
5	'5'
\.
copy copyfreeze from stdin;
6	'6'
\.
copy copyfreeze from stdin freeze;
7	'7'
8	'8'
9	'9'
10	'10'
11	'11'
12	'12'
\.
commit;

If the cached target block is invalidated before the third COPY, we
start inserting the frozen tuples into a new page, so the third COPY
ends up adding two blocks in total. I think we still need the
following part of the reverted code, so that we don't leave the page
partially empty after a relcache invalidation:

--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -407,19 +407,19 @@ RelationGetBufferForTuple(Relation relation, Size len,
 		 * target.
 		 */
 		targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
-	}
 
-	/*
-	 * If the FSM knows nothing of the rel, try the last page before we give
-	 * up and extend.  This avoids one-tuple-per-page syndrome during
-	 * bootstrapping or in a recently-started system.
-	 */
-	if (targetBlock == InvalidBlockNumber)
-	{
-		BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
+		/*
+		 * If the FSM knows nothing of the rel, try the last page before we
+		 * give up and extend.  This avoids one-tuple-per-page syndrome during
+		 * bootstrapping or in a recently-started system.
+		 */
+		if (targetBlock == InvalidBlockNumber)
+		{
+			BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
 
-		if (nblocks > 0)
-			targetBlock = nblocks - 1;
+			if (nblocks > 0)
+				targetBlock = nblocks - 1;
+		}
 	}

Attached is a patch that brings back the above change.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
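To make the symptom concrete, here is one way to check the heap page count directly (a sketch, not part of the attached patch; it assumes the test's copyfreeze table right after the transaction above has committed). The expected layout is three heap pages; the CLOBBER_CACHE_ALWAYS failure shows four:

select pg_relation_size('copyfreeze') / current_setting('block_size')::int
       as heap_pages;                      -- expected 3, failure shows 4
select count(*) as vm_pages
from pg_visibility_map('copyfreeze');      -- same count, one row per heap page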
On 6/7/21 2:11 PM, Masahiko Sawada wrote:
> [...]
>
> If the cached target block is invalidated before the third COPY, we
> start inserting the frozen tuples into a new page, so the third COPY
> ends up adding two blocks in total. I think we still need the
> following part of the reverted code, so that we don't leave the page
> partially empty after a relcache invalidation:
>
> [...]
>
> Attached is a patch that brings back the above change.

Thanks for the analysis! I think you're right - this bit should have
been kept. Partial reverts are tricky :-(

I'll get this fixed / pushed later today, after a bit more testing. I'd
swear I ran tests with CCA, but it's possible I skipped contrib.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jun 7, 2021 at 10:01 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> [...]
>
> Thanks for the analysis! I think you're right - this bit should have
> been kept. Partial reverts are tricky :-(
>
> I'll get this fixed / pushed later today, after a bit more testing. I'd
> swear I ran tests with CCA, but it's possible I skipped contrib.

I had missed this mail. Thank you for pushing the fix!

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/