Thread: Flaky vacuum truncate test in reloptions.sql

Flaky vacuum truncate test in reloptions.sql

From
Arseny Sher
Date:
Hi,

I rarely observe failure of vacuum with truncation test in
reloptions.sql, i.e. the truncation doesn't happen:

--- ../../src/test/regress/expected/reloptions.out      2020-04-16 12:37:17.749547401 +0300
+++ ../../src/test/regress/results/reloptions.out       2020-04-17 00:14:58.999211750 +0300
@@ -131,7 +131,7 @@
 SELECT pg_relation_size('reloptions_test') = 0;
  ?column?
 ----------
- t
+ f
 (1 row)

Intimate reading of lazy_scan_heap says that the failure indeed might
happen; if ConditionalLockBufferForCleanup couldn't lock the buffer and
either the buffer doesn't need freezing or vacuum is not aggressive, we
don't insist on close inspection of the page contents and count it as
nonempty according to lazy_check_needs_freeze. It means the page is
regarded as such even if it contains only garbage (but occupied) ItemIds,
which is the case of the test. And of course this allegedly nonempty
page prevents the truncation. Obvious competitors for the page are
bgwriter/checkpointer; the chances of a simultaneous attack are small
but they exist.

A simple fix is to perform aggressive VACUUM FREEZE, as attached.

I'm a bit puzzled that I've ever seen this only when running regression
tests under our multimaster. While multimaster contains a fair amount of
C code, I don't see how any of it can interfere with the vacuuming
business here. I can't say I did my best to create the repoduction
though -- the explanation above seems to be enough.


--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


diff --git a/src/test/regress/expected/reloptions.out b/src/test/regress/expected/reloptions.out
index 44c130409ff..fa1c223c87a 100644
--- a/src/test/regress/expected/reloptions.out
+++ b/src/test/regress/expected/reloptions.out
@@ -127,7 +127,8 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
-VACUUM reloptions_test;
+-- do aggressive vacuum to be sure we won't skip the page
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
  ?column? 
 ----------
diff --git a/src/test/regress/sql/reloptions.sql b/src/test/regress/sql/reloptions.sql
index cac5b0bcb0d..a84aae5093b 100644
--- a/src/test/regress/sql/reloptions.sql
+++ b/src/test/regress/sql/reloptions.sql
@@ -71,7 +71,8 @@ SELECT reloptions FROM pg_class WHERE oid =
 ALTER TABLE reloptions_test RESET (vacuum_truncate);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
-VACUUM reloptions_test;
+-- do aggressive vacuum to be sure we won't skip the page
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
 
 -- Test toast.* options

Re: Flaky vacuum truncate test in reloptions.sql

From
Michael Paquier
Date:
On Tue, Mar 30, 2021 at 01:58:50AM +0300, Arseny Sher wrote:
> Intimate reading of lazy_scan_heap says that the failure indeed might
> happen; if ConditionalLockBufferForCleanup couldn't lock the buffer and
> either the buffer doesn't need freezing or vacuum is not aggressive, we
> don't insist on close inspection of the page contents and count it as
> nonempty according to lazy_check_needs_freeze. It means the page is
> regarded as such even if it contains only garbage (but occupied) ItemIds,
> which is the case of the test. And of course this allegedly nonempty
> page prevents the truncation. Obvious competitors for the page are
> bgwriter/checkpointer; the chances of a simultaneous attack are small
> but they exist.

Yep, this is the same problem as the one discussed for c2dc1a7, where
a concurrent checkpoint may cause a page to be skipped, breaking the
test.

> I'm a bit puzzled that I've ever seen this only when running regression
> tests under our multimaster. While multimaster contains a fair amount of
> C code, I don't see how any of it can interfere with the vacuuming
> business here. I can't say I did my best to create the repoduction
> though -- the explanation above seems to be enough.

Why not just using DISABLE_PAGE_SKIPPING instead here?
--
Michael

Attachment

Re: Flaky vacuum truncate test in reloptions.sql

From
Arseny Sher
Date:
On 3/30/21 10:12 AM, Michael Paquier wrote:

 > Yep, this is the same problem as the one discussed for c2dc1a7, where
 > a concurrent checkpoint may cause a page to be skipped, breaking the
 > test.

Indeed, Alexander Lakhin pointed me to that commit after I wrote the 
message.

 > Why not just using DISABLE_PAGE_SKIPPING instead here?

I think this is not enough. DISABLE_PAGE_SKIPPING disables vm consulting 
(sets
aggressive=true in the routine); however, if the page is locked and
lazy_check_needs_freeze says there is nothing to freeze on it, we again 
don't
look at its contents closely.


-- cheers, arseny



Re: Flaky vacuum truncate test in reloptions.sql

From
Masahiko Sawada
Date:
On Tue, Mar 30, 2021 at 10:22 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
>
> On 3/30/21 10:12 AM, Michael Paquier wrote:
>
>  > Yep, this is the same problem as the one discussed for c2dc1a7, where
>  > a concurrent checkpoint may cause a page to be skipped, breaking the
>  > test.
>
> Indeed, Alexander Lakhin pointed me to that commit after I wrote the
> message.
>
>  > Why not just using DISABLE_PAGE_SKIPPING instead here?
>
> I think this is not enough. DISABLE_PAGE_SKIPPING disables vm consulting
> (sets
> aggressive=true in the routine); however, if the page is locked and
> lazy_check_needs_freeze says there is nothing to freeze on it, we again
> don't
> look at its contents closely.

Right.

Is it better to add FREEZE to the first "VACUUM reloptions_test;" as well?

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Flaky vacuum truncate test in reloptions.sql

From
Arseny Sher
Date:
On 3/31/21 4:17 PM, Masahiko Sawada wrote:

 > Is it better to add FREEZE to the first "VACUUM reloptions_test;" as 
well?

I don't think this matters much, as it tests the contrary and the 
probability of
successful test passing (in case of theoretical bug making vacuum to 
truncate
non-empty relation) becomes stunningly small. But adding it wouldn't hurt
either.

-- cheers, arseny



Re: Flaky vacuum truncate test in reloptions.sql

From
Masahiko Sawada
Date:
On Wed, Mar 31, 2021 at 10:39 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
>
>
> On 3/31/21 4:17 PM, Masahiko Sawada wrote:
>
>  > Is it better to add FREEZE to the first "VACUUM reloptions_test;" as
> well?
>
> I don't think this matters much, as it tests the contrary and the
> probability of
> successful test passing (in case of theoretical bug making vacuum to
> truncate
> non-empty relation) becomes stunningly small. But adding it wouldn't hurt
> either.

I was concerned a bit that without FREEZE in the first VACUUM we could
not test it properly because the table could not be truncated because
either vacuum_truncate is off or the page is skipped.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Flaky vacuum truncate test in reloptions.sql

From
Arseny Sher
Date:
Masahiko Sawada <sawada.mshk@gmail.com> writes:

>> I don't think this matters much, as it tests the contrary and the
>> probability of
>> successful test passing (in case of theoretical bug making vacuum to
>> truncate
>> non-empty relation) becomes stunningly small. But adding it wouldn't hurt
>> either.
>
> I was concerned a bit that without FREEZE in the first VACUUM we could
> not test it properly because the table could not be truncated because
> either vacuum_truncate is off

FREEZE won't help us there.

> or the page is skipped.

You mean at the same time there is a potential bug in vacuum which would
force the truncation of non-empy relation if the page wasn't locked?
That would mean the chance of test getting passed even single time is
close to 0, as currently the chance of its failure is close to 1.


-- cheers, arseny



Re: Flaky vacuum truncate test in reloptions.sql

From
Arseny Sher
Date:
Arseny Sher <a.sher@postgrespro.ru> writes:

> as currently the chance of its failure is close to 1.

A typo, to 0 too, of course.



Re: Flaky vacuum truncate test in reloptions.sql

From
Masahiko Sawada
Date:
On Thu, Apr 1, 2021 at 12:08 PM Arseny Sher <a.sher@postgrespro.ru> wrote:
>
>
> Masahiko Sawada <sawada.mshk@gmail.com> writes:
>
> >> I don't think this matters much, as it tests the contrary and the
> >> probability of
> >> successful test passing (in case of theoretical bug making vacuum to
> >> truncate
> >> non-empty relation) becomes stunningly small. But adding it wouldn't hurt
> >> either.
> >
> > I was concerned a bit that without FREEZE in the first VACUUM we could
> > not test it properly because the table could not be truncated because
> > either vacuum_truncate is off
>
> FREEZE won't help us there.
>
> > or the page is skipped.
>
> You mean at the same time there is a potential bug in vacuum which would
> force the truncation of non-empy relation if the page wasn't locked?

Just to be clear the context, I’m mentioning the following test case:

CREATE TABLE reloptions_test(i INT NOT NULL, j text)
        WITH (vacuum_truncate=false,
        toast.vacuum_truncate=false,
        autovacuum_enabled=false);
SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
VACUUM reloptions_test;
SELECT pg_relation_size('reloptions_test') > 0;

What I meant is that without FREEZE option, there are two possible
cases where the table is not truncated (i.g.,
pg_relation_size('reloptions_test') > 0 is true): the page got empty
by vacuum but is not truncated because of vacuum_truncate = false, and
the page could not be vacuumed (i.g., tuples remain in the page)
because the page is skipped due to conflict on cleanup lock on the
page. This test is intended to test the former case. I guess adding
FREEZE will prevent the latter case.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Flaky vacuum truncate test in reloptions.sql

From
Michael Paquier
Date:
On Thu, Apr 01, 2021 at 12:52:21PM +0900, Masahiko Sawada wrote:
> Just to be clear the context, I’m mentioning the following test case:

(Coming back a couple of emails later, where indeed I forgot about the
business with lazy_check_needs_freeze() that could cause a page to be
skipped even if DISABLE_PAGE_SKIPPING is used.)

> What I meant is that without FREEZE option, there are two possible
> cases where the table is not truncated (i.g.,
> pg_relation_size('reloptions_test') > 0 is true): the page got empty
> by vacuum but is not truncated because of vacuum_truncate = false, and
> the page could not be vacuumed (i.g., tuples remain in the page)
> because the page is skipped due to conflict on cleanup lock on the
> page. This test is intended to test the former case. I guess adding
> FREEZE will prevent the latter case.

What you are writing here makes sense to me.  Looking at the test, it
is designed to test vacuum_truncate, aka that the behavior we want to
stress (your former case here) gets stressed all the time, so adding
the options to avoid the latter case all the time is an improvement.
And this, even if the latter case does not actually cause a diff and
it has a small chance to happen in practice.

It would be good to add a comment explaining why the options are
added (aka just don't skip any pages).
--
Michael

Attachment

Re: Flaky vacuum truncate test in reloptions.sql

From
Arseny Sher
Date:
Michael Paquier <michael@paquier.xyz> writes:

> On Thu, Apr 01, 2021 at 12:52:21PM +0900, Masahiko Sawada wrote:
>> Just to be clear the context, I’m mentioning the following test case:

Sorry, I misremembered the test and assumed the table is non-empty there
while it is empty but vacuum_truncate is disabled. Still, this doesn't
change my conclusion of freezing being not a big deal there due to small
chance of locked page. Anyway, let's finish with this.

> What you are writing here makes sense to me.  Looking at the test, it
> is designed to test vacuum_truncate, aka that the behavior we want to
> stress (your former case here) gets stressed all the time, so adding
> the options to avoid the latter case all the time is an improvement.
> And this, even if the latter case does not actually cause a diff and
> it has a small chance to happen in practice.
>
> It would be good to add a comment explaining why the options are
> added (aka just don't skip any pages).

How about the attached?


-- cheers, arseny


diff --git a/src/test/regress/expected/reloptions.out b/src/test/regress/expected/reloptions.out
index 44c130409ff..d6f5bf98114 100644
--- a/src/test/regress/expected/reloptions.out
+++ b/src/test/regress/expected/reloptions.out
@@ -102,7 +102,8 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
-VACUUM reloptions_test;
+-- do aggressive vacuum to be sure we won't skip the page even if it is locked
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
  ?column? 
 ----------
@@ -127,7 +128,8 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
-VACUUM reloptions_test;
+-- do aggressive vacuum to be sure we won't skip the page even if it is locked
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
  ?column? 
 ----------
diff --git a/src/test/regress/sql/reloptions.sql b/src/test/regress/sql/reloptions.sql
index cac5b0bcb0d..61638c3bcae 100644
--- a/src/test/regress/sql/reloptions.sql
+++ b/src/test/regress/sql/reloptions.sql
@@ -61,7 +61,8 @@ CREATE TABLE reloptions_test(i INT NOT NULL, j text)
     autovacuum_enabled=false);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
-VACUUM reloptions_test;
+-- do aggressive vacuum to be sure we won't skip the page even if it is locked
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') > 0;
 
 SELECT reloptions FROM pg_class WHERE oid =
@@ -71,7 +72,8 @@ SELECT reloptions FROM pg_class WHERE oid =
 ALTER TABLE reloptions_test RESET (vacuum_truncate);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
-VACUUM reloptions_test;
+-- do aggressive vacuum to be sure we won't skip the page even if it is locked
+VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
 
 -- Test toast.* options

Re: Flaky vacuum truncate test in reloptions.sql

From
Michael Paquier
Date:
On Thu, Apr 01, 2021 at 10:58:25AM +0300, Arseny Sher wrote:
> How about the attached?

Sounds fine to me.  Sawada-san?
--
Michael

Attachment

Re: Flaky vacuum truncate test in reloptions.sql

From
Masahiko Sawada
Date:
On Thu, Apr 1, 2021 at 5:49 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Apr 01, 2021 at 10:58:25AM +0300, Arseny Sher wrote:
> > How about the attached?

Thank you for updating the patch!

> Sounds fine to me.  Sawada-san?

Looks good to me too.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Flaky vacuum truncate test in reloptions.sql

From
Michael Paquier
Date:
On Thu, Apr 01, 2021 at 10:54:29PM +0900, Masahiko Sawada wrote:
> Looks good to me too.

Okay, applied and back-patched down to 12 then.
--
Michael

Attachment

Re: Flaky vacuum truncate test in reloptions.sql

From
Masahiko Sawada
Date:
On Fri, Apr 2, 2021 at 9:46 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Apr 01, 2021 at 10:54:29PM +0900, Masahiko Sawada wrote:
> > Looks good to me too.
>
> Okay, applied and back-patched down to 12 then.

Thank you!


Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



Re: Flaky vacuum truncate test in reloptions.sql

From
Arseny Sher
Date:
On Fri, Apr 2, 2021 at 9:46 AM Michael Paquier <michael@paquier.xyz> wrote:

> Okay, applied and back-patched down to 12 then.

Thank you both. Unfortunately and surprisingly, the test still fails
(perhaps even rarer, once in several hundred runs) under
multimaster. After scratching the head for some more time, it seems to
me the following happens: not only vacuum encounters locked page, but
also there exist a concurrent backend (as the parallel schedule is run)
who holds back oldestxmin keeping it less than xid of transaction which
did the insertion

INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);

FreezeLimit can't be higher than oldestxmin, so lazy_check_needs_freeze
decides there is nothing to freeze on the page. multimaster commits are
quite heavy, which apparently shifts the timings making the issue more
likely.

Currently we are testing the rather funny attached patch which forces
all such old-snapshot-holders to finish. It is crutchy, but I doubt we
want to change vacuum logic (e.g. checking tuple liveness in
lazy_check_needs_freeze) due to this issue. (it is especially crutchy in
xid::bigint casts, but wraparound is hardly expected in regression tests
run).


-- cheers, arseny

diff --git a/src/test/regress/expected/reloptions.out b/src/test/regress/expected/reloptions.out
index bb7bd6e1e7e..78bbf4a5255 100644
--- a/src/test/regress/expected/reloptions.out
+++ b/src/test/regress/expected/reloptions.out
@@ -128,6 +128,20 @@ SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 DETAIL:  Failing row contains (null, null).
+do $$
+declare
+  my_xid bigint;
+  oldest_xmin bigint;
+begin
+  my_xid := txid_current();
+  while true loop
+    oldest_xmin := min(backend_xmin::text::bigint) from pg_stat_activity where pid != pg_backend_pid();
+    exit when oldest_xmin is null or oldest_xmin >= my_xid;
+    perform pg_sleep(0.1);
+    perform pg_stat_clear_snapshot();
+  end loop;
+end
+$$;
 -- Do an aggressive vacuum to prevent page-skipping.
 VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;
diff --git a/src/test/regress/sql/reloptions.sql b/src/test/regress/sql/reloptions.sql
index 95f7ab4189e..96fb59d16ad 100644
--- a/src/test/regress/sql/reloptions.sql
+++ b/src/test/regress/sql/reloptions.sql
@@ -72,6 +72,20 @@ SELECT reloptions FROM pg_class WHERE oid =
 ALTER TABLE reloptions_test RESET (vacuum_truncate);
 SELECT reloptions FROM pg_class WHERE oid = 'reloptions_test'::regclass;
 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
+do $$
+declare
+  my_xid bigint;
+  oldest_xmin bigint;
+begin
+  my_xid := txid_current();
+  while true loop
+    oldest_xmin := min(backend_xmin::text::bigint) from pg_stat_activity where pid != pg_backend_pid();
+    exit when oldest_xmin is null or oldest_xmin >= my_xid;
+    perform pg_sleep(0.1);
+    perform pg_stat_clear_snapshot();
+  end loop;
+end
+$$;
 -- Do an aggressive vacuum to prevent page-skipping.
 VACUUM FREEZE reloptions_test;
 SELECT pg_relation_size('reloptions_test') = 0;

Re: Flaky vacuum truncate test in reloptions.sql

From
Masahiko Sawada
Date:
On Mon, Apr 5, 2021 at 5:00 AM Arseny Sher <a.sher@postgrespro.ru> wrote:
>
>
> On Fri, Apr 2, 2021 at 9:46 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> > Okay, applied and back-patched down to 12 then.
>
> Thank you both. Unfortunately and surprisingly, the test still fails
> (perhaps even rarer, once in several hundred runs) under
> multimaster. After scratching the head for some more time, it seems to
> me the following happens: not only vacuum encounters locked page, but
> also there exist a concurrent backend (as the parallel schedule is run)
> who holds back oldestxmin keeping it less than xid of transaction which
> did the insertion
>
> INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
>
> FreezeLimit can't be higher than oldestxmin, so lazy_check_needs_freeze
> decides there is nothing to freeze on the page. multimaster commits are
> quite heavy, which apparently shifts the timings making the issue more
> likely.
>
> Currently we are testing the rather funny attached patch which forces
> all such old-snapshot-holders to finish. It is crutchy, but I doubt we
> want to change vacuum logic (e.g. checking tuple liveness in
> lazy_check_needs_freeze) due to this issue. (it is especially crutchy in
> xid::bigint casts, but wraparound is hardly expected in regression tests
> run).

Or maybe we can remove reloptions.sql test from the parallel group.
BTW I wonder if the following tests in vacuum.sql test also have the
same issue (page skipping and oldestxmin):

-- TRUNCATE option
CREATE TABLE vac_truncate_test(i INT NOT NULL, j text)
        WITH (vacuum_truncate=true, autovacuum_enabled=false);
INSERT INTO vac_truncate_test VALUES (1, NULL), (NULL, NULL);
VACUUM (TRUNCATE FALSE) vac_truncate_test;
SELECT pg_relation_size('vac_truncate_test') > 0;
VACUUM vac_truncate_test;
SELECT pg_relation_size('vac_truncate_test') = 0;
VACUUM (TRUNCATE FALSE, FULL TRUE) vac_truncate_test;
DROP TABLE vac_truncate_test;

Should we add FREEZE to those tests as well?

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/