Thread: Odd, intermittent failure in contrib/pageinspect

Odd, intermittent failure in contrib/pageinspect

From
Tom Lane
Date:
whelk failed today [1] with this surprising symptom:

--- snip ---
diff -w -U3 C:/buildfarm/buildenv/HEAD/pgsql.build/contrib/pageinspect/expected/page.out
C:/buildfarm/buildenv/HEAD/pgsql.build/contrib/pageinspect/results/page.out
--- C:/buildfarm/buildenv/HEAD/pgsql.build/contrib/pageinspect/expected/page.out    2020-03-08 09:00:35.036254700 +0100
+++ C:/buildfarm/buildenv/HEAD/pgsql.build/contrib/pageinspect/results/page.out    2021-01-18 22:10:10.889655500 +0100
@@ -90,8 +90,8 @@
 FROM heap_page_items(get_raw_page('test1', 0)),
      LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2);
  t_infomask | t_infomask2 |                         raw_flags                         |   combined_flags
-------------+-------------+-----------------------------------------------------------+--------------------
-       2816 |           2 | {HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID,HEAP_XMAX_INVALID} | {HEAP_XMIN_FROZEN}
+------------+-------------+-----------------------------------------+----------------
+       2304 |           2 | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (1 row)

 -- output the decoded flag HEAP_XMIN_FROZEN instead
@@ -99,8 +99,8 @@
 FROM heap_page_items(get_raw_page('test1', 0)),
      LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2);
  t_infomask | t_infomask2 |                         raw_flags                         |   combined_flags
-------------+-------------+-----------------------------------------------------------+--------------------
-       2816 |           2 | {HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID,HEAP_XMAX_INVALID} | {HEAP_XMIN_FROZEN}
+------------+-------------+-----------------------------------------+----------------
+       2304 |           2 | {HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID} | {}
 (1 row)

 -- tests for decoding of combined flags
--- snip ---

Searching the buildfarm logs turned up exactly one previous occurrence,
also on whelk [2].  So I'm not sure what to make of it.  Could the
immediately preceding VACUUM FREEZE command have silently skipped this
page for some reason?  That'd be a bug I should think.

Also, not really a bug, but why is this test script running exactly
the same query twice in a row?  If that's of value, and not just a
copy-and-paste error, the comments sure don't explain why.  But what
it looks like is that these queries were different when first added,
and then 58b4cb30a5b made them the same when it probably should have
removed one.

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=whelk&dt=2021-01-18%2020%3A42%3A13
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=whelk&dt=2020-04-17%2023%3A42%3A10



Re: Odd, intermittent failure in contrib/pageinspect

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> On 2021-Jan-18, Tom Lane wrote:
>> Searching the buildfarm logs turned up exactly one previous occurrence,
>> also on whelk [2].  So I'm not sure what to make of it.  Could the
>> immediately preceding VACUUM FREEZE command have silently skipped this
>> page for some reason?  That'd be a bug I should think.

> Hmm, doesn't vacuum skip pages when they are pinned?  I don't think
> VACUUM FREEZE would be treated especially -- only "aggressive"
> wraparound would be an exception, IIRC.

Right.  If that's the explanation, then adding DISABLE_PAGE_SKIPPING
to the test's VACUUM options should fix it.  However, to believe that
theory you have to have some reason to think that some other process
might have the page pinned.  What would that be?  test1 only has one
small tuple in it, so it doesn't seem credible that autovacuum or
autoanalyze would have fired on it.

[ thinks for a bit... ]  Does the checkpointer pin pages it's writing
out?  I guess it'd have to ...

            regards, tom lane



Re: Odd, intermittent failure in contrib/pageinspect

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> On 2021-Jan-18, Tom Lane wrote:
>> [ thinks for a bit... ]  Does the checkpointer pin pages it's writing
>> out?  I guess it'd have to ...

> It does, per SyncOneBuffer(), called from BufferSync(), called from
> CheckPointBuffers().

Right, then we don't need any strange theories about autovacuum,
just bad timing luck.  whelk does seem pretty slow, so it's not
much of a stretch to imagine that it's more susceptible to this
corner case than faster machines.

So, do we have any other tests that are invoking a manual vacuum
and assuming it won't skip any pages?  By this theory, they'd
all be failures waiting to happen.

            regards, tom lane



Re: Odd, intermittent failure in contrib/pageinspect

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> On Mon, Jan 18, 2021 at 05:47:40PM -0500, Tom Lane wrote:
>> So, do we have any other tests that are invoking a manual vacuum
>> and assuming it won't skip any pages?  By this theory, they'd
>> all be failures waiting to happen.

> check_heap.sql and heap_surgery.sql have one VACUUM FREEZE each and it
> seems to me that we had better be sure that no pages are skipped for
> their cases?

It looks to me like heap_surgery ought to be okay, because it's operating
on a temp table; if there are any page access conflicts on that, we've
got BIG trouble ;-)

Poking around, I found a few other places where it looked like a skipped
page could produce diffs in the expected output:
contrib/amcheck/t/001_verify_heapam.pl
contrib/pg_visibility/sql/pg_visibility.sql

There are lots of other vacuums of course, but they don't look like
a missed page would have any effect on the visible results, so I think
we should leave them alone.

In short I propose the attached patch, which also gets rid of
that duplicate query.

            regards, tom lane

diff --git a/contrib/amcheck/expected/check_heap.out b/contrib/amcheck/expected/check_heap.out
index 882f853d56..1fb3823142 100644
--- a/contrib/amcheck/expected/check_heap.out
+++ b/contrib/amcheck/expected/check_heap.out
@@ -109,7 +109,7 @@ ERROR:  ending block number must be between 0 and 0
 SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 10000, endblock := 11000);
 ERROR:  starting block number must be between 0 and 0
 -- Vacuum freeze to change the xids encountered in subsequent tests
-VACUUM FREEZE heaptest;
+VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) heaptest;
 -- Check that valid options are not rejected nor corruption reported
 -- for a non-empty frozen table
 SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'none');
diff --git a/contrib/amcheck/sql/check_heap.sql b/contrib/amcheck/sql/check_heap.sql
index c10a25f21c..298de6886a 100644
--- a/contrib/amcheck/sql/check_heap.sql
+++ b/contrib/amcheck/sql/check_heap.sql
@@ -51,7 +51,7 @@ SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 0, endblock :=
 SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 10000, endblock := 11000);

 -- Vacuum freeze to change the xids encountered in subsequent tests
-VACUUM FREEZE heaptest;
+VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) heaptest;

 -- Check that valid options are not rejected nor corruption reported
 -- for a non-empty frozen table
diff --git a/contrib/amcheck/t/001_verify_heapam.pl b/contrib/amcheck/t/001_verify_heapam.pl
index 1581e51f3c..a2f65b826d 100644
--- a/contrib/amcheck/t/001_verify_heapam.pl
+++ b/contrib/amcheck/t/001_verify_heapam.pl
@@ -46,7 +46,7 @@ detects_heap_corruption(
 # Check a corrupt table with all-frozen data
 #
 fresh_test_table('test');
-$node->safe_psql('postgres', q(VACUUM FREEZE test));
+$node->safe_psql('postgres', q(VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) test));
 corrupt_first_page('test');
 detects_heap_corruption("verify_heapam('test')",
     "all-frozen corrupted table");
diff --git a/contrib/pageinspect/expected/page.out b/contrib/pageinspect/expected/page.out
index 4cd0db8018..4da28f0a1d 100644
--- a/contrib/pageinspect/expected/page.out
+++ b/contrib/pageinspect/expected/page.out
@@ -1,7 +1,7 @@
 CREATE EXTENSION pageinspect;
 CREATE TABLE test1 (a int, b int);
 INSERT INTO test1 VALUES (16777217, 131584);
-VACUUM test1;  -- set up FSM
+VACUUM (DISABLE_PAGE_SKIPPING) test1;  -- set up FSM
 -- The page contents can vary, so just test that it can be read
 -- successfully, but don't keep the output.
 SELECT octet_length(get_raw_page('test1', 'main', 0)) AS main_0;
@@ -87,18 +87,8 @@ SELECT * FROM fsm_page_contents(get_raw_page('test1', 'fsm', 0));
 (1 row)

 -- If we freeze the only tuple on test1, the infomask should
--- always be the same in all test runs. we show raw flags by
--- default: HEAP_XMIN_COMMITTED and HEAP_XMIN_INVALID.
-VACUUM FREEZE test1;
-SELECT t_infomask, t_infomask2, raw_flags, combined_flags
-FROM heap_page_items(get_raw_page('test1', 0)),
-     LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2);
- t_infomask | t_infomask2 |                         raw_flags                         |   combined_flags
-------------+-------------+-----------------------------------------------------------+--------------------
-       2816 |           2 | {HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID,HEAP_XMAX_INVALID} | {HEAP_XMIN_FROZEN}
-(1 row)
-
--- output the decoded flag HEAP_XMIN_FROZEN instead
+-- always be the same in all test runs.
+VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) test1;
 SELECT t_infomask, t_infomask2, raw_flags, combined_flags
 FROM heap_page_items(get_raw_page('test1', 0)),
      LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2);
diff --git a/contrib/pageinspect/sql/page.sql b/contrib/pageinspect/sql/page.sql
index 01844cb629..d333b763d7 100644
--- a/contrib/pageinspect/sql/page.sql
+++ b/contrib/pageinspect/sql/page.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION pageinspect;
 CREATE TABLE test1 (a int, b int);
 INSERT INTO test1 VALUES (16777217, 131584);

-VACUUM test1;  -- set up FSM
+VACUUM (DISABLE_PAGE_SKIPPING) test1;  -- set up FSM

 -- The page contents can vary, so just test that it can be read
 -- successfully, but don't keep the output.
@@ -34,15 +34,9 @@ SELECT tuple_data_split('test1'::regclass, t_data, t_infomask, t_infomask2, t_bi
 SELECT * FROM fsm_page_contents(get_raw_page('test1', 'fsm', 0));

 -- If we freeze the only tuple on test1, the infomask should
--- always be the same in all test runs. we show raw flags by
--- default: HEAP_XMIN_COMMITTED and HEAP_XMIN_INVALID.
-VACUUM FREEZE test1;
+-- always be the same in all test runs.
+VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) test1;

-SELECT t_infomask, t_infomask2, raw_flags, combined_flags
-FROM heap_page_items(get_raw_page('test1', 0)),
-     LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2);
-
--- output the decoded flag HEAP_XMIN_FROZEN instead
 SELECT t_infomask, t_infomask2, raw_flags, combined_flags
 FROM heap_page_items(get_raw_page('test1', 0)),
      LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2);
diff --git a/contrib/pg_visibility/expected/pg_visibility.out b/contrib/pg_visibility/expected/pg_visibility.out
index 0017e3415c..315633bfea 100644
--- a/contrib/pg_visibility/expected/pg_visibility.out
+++ b/contrib/pg_visibility/expected/pg_visibility.out
@@ -105,7 +105,7 @@ ERROR:  "test_foreign_table" is not a table, materialized view, or TOAST table
 create table regular_table (a int, b text);
 alter table regular_table alter column b set storage external;
 insert into regular_table values (1, repeat('one', 1000)), (2, repeat('two', 1000));
-vacuum regular_table;
+vacuum (disable_page_skipping) regular_table;
 select count(*) > 0 from pg_visibility('regular_table');
  ?column?
 ----------
@@ -132,7 +132,7 @@ select count(*) > 0 from pg_visibility((select reltoastrelid from pg_class where
 (1 row)

 create materialized view matview_visibility_test as select * from regular_table;
-vacuum matview_visibility_test;
+vacuum (disable_page_skipping) matview_visibility_test;
 select count(*) > 0 from pg_visibility('matview_visibility_test');
  ?column?
 ----------
@@ -149,7 +149,7 @@ select count(*) > 0 from pg_visibility('matview_visibility_test');

 -- regular tables which are part of a partition *do* have visibility maps
 insert into test_partition values (1);
-vacuum test_partition;
+vacuum (disable_page_skipping) test_partition;
 select count(*) > 0 from pg_visibility('test_partition', 0);
  ?column?
 ----------
diff --git a/contrib/pg_visibility/sql/pg_visibility.sql b/contrib/pg_visibility/sql/pg_visibility.sql
index ec1afd4906..ff3538f996 100644
--- a/contrib/pg_visibility/sql/pg_visibility.sql
+++ b/contrib/pg_visibility/sql/pg_visibility.sql
@@ -71,7 +71,7 @@ select pg_truncate_visibility_map('test_foreign_table');
 create table regular_table (a int, b text);
 alter table regular_table alter column b set storage external;
 insert into regular_table values (1, repeat('one', 1000)), (2, repeat('two', 1000));
-vacuum regular_table;
+vacuum (disable_page_skipping) regular_table;
 select count(*) > 0 from pg_visibility('regular_table');
 select count(*) > 0 from pg_visibility((select reltoastrelid from pg_class where relname = 'regular_table'));
 truncate regular_table;
@@ -79,7 +79,7 @@ select count(*) > 0 from pg_visibility('regular_table');
 select count(*) > 0 from pg_visibility((select reltoastrelid from pg_class where relname = 'regular_table'));

 create materialized view matview_visibility_test as select * from regular_table;
-vacuum matview_visibility_test;
+vacuum (disable_page_skipping) matview_visibility_test;
 select count(*) > 0 from pg_visibility('matview_visibility_test');
 insert into regular_table values (1), (2);
 refresh materialized view matview_visibility_test;
@@ -87,7 +87,7 @@ select count(*) > 0 from pg_visibility('matview_visibility_test');

 -- regular tables which are part of a partition *do* have visibility maps
 insert into test_partition values (1);
-vacuum test_partition;
+vacuum (disable_page_skipping) test_partition;
 select count(*) > 0 from pg_visibility('test_partition', 0);
 select count(*) > 0 from pg_visibility_map('test_partition');
 select count(*) > 0 from pg_visibility_map_summary('test_partition');

Re: Odd, intermittent failure in contrib/pageinspect

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> I think you don't event need checkpointer to be involved, normal buffer
> replacement would do the trick. We briefly pin the page in BufferAlloc()
> even if the page is clean. Longer when it's dirty, of course.

True, but it seems unlikely that the pages in question here would be
chosen as replacement victims.  These are non-parallel tests, so
there's little competitive pressure.  I could believe that a background
autovacuum is active, but not that it's dirtied so many pages that
tables the test script just created need to get swapped out.

The checkpointer theory seems good because it requires no assumptions
at all about competing demand for buffers.  If the clock sweep gets
to the table page (which we know is recently dirtied) at just the right
time, we'll see a failure.

            regards, tom lane



Re: Odd, intermittent failure in contrib/pageinspect

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> On Tue, Jan 19, 2021 at 05:03:49PM -0500, Tom Lane wrote:
>> In short I propose the attached patch, which also gets rid of
>> that duplicate query.

> Agreed, +1.

Pushed, thanks for looking at it.

            regards, tom lane