Thread: Why the buildfarm is all pink

From: Tom Lane
I was surprised to see that my back-patches of the recent SubLink
unpleasantness were failing on many of the buildfarm members, but
only in the 9.1 and 9.0 branches.  The difficulty appears to be
that the EXPLAIN output for the new test query changes depending on
whether or not "tenk1" has been analyzed yet.  In 9.2 and up,
it reliably has been, because create_index runs first and that script
does this:

create_index.sql:901:vacuum analyze tenk1;        -- ensure we get consistent plans here

But the older branches lack that.  Running the tests serially
usually fails in 9.1 and 9.0, and likely would fail in 8.4 except
that that branch isn't printing the selected plan for lack of
EXPLAIN (COSTS OFF).  Parallel tests sometimes succeed (and
did for me), because the subselect test runs concurrently with
"aggregates" and "join", which have

aggregates.sql:211:analyze tenk1;        -- ensure we get consistent plans here
join.sql:333:analyze tenk1;        -- ensure we get consistent plans here

so depending on timing, one of those might have gotten the job done,
or maybe autovacuum would show up in time to save the day.
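
For illustration (a made-up query, not the one in subselect.sql, but the
same mechanism), the plan for a simple query flips depending on whether
stats exist yet:

explain (costs off)
select * from tenk1 where unique1 < 10;
-- with no stats, the planner falls back on default selectivity
-- estimates and will often pick a seqscan here
analyze tenk1;
explain (costs off)
select * from tenk1 where unique1 < 10;
-- with real stats it sees ~10 rows and typically picks a bitmap
-- or index scan on tenk1_unique1 instead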

We need a more consistent strategy for this :-(

The minimum-change strategy for getting the buildfarm green again
would be to insert another ad-hoc "analyze tenk1" into subselect.sql
in the back branches.  I don't particularly want to fix it that way,
though, because it'd just be a problem waiting to happen anytime
someone back-patches a bug fix that includes EXPLAIN output.
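
(Concretely, that'd be one more line in the same style as the existing
ones, just ahead of the new test query:

analyze tenk1;        -- ensure we get consistent plans here

)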

What I think would be the best strategy, on the whole, is to put
a whole-database "ANALYZE;" at the end of the "copy" regression test,
which is the one that loads up tenk1 and the other large test tables.
It also comes after the tests that load up small static tables such
as int4_tbl.  This would ensure that all the tables that we typically
use for one-off EXPLAIN tests are analyzed early in the proceedings.
Then we could get rid of the various ad-hoc analyzes that have snuck
into various tests.
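
(Sketch of the idea --- just this at the tail end of copy.source:

-- bring stats up to date for everything the tests have loaded so far
ANALYZE;

plus corresponding deletions of the scattered per-table analyzes.)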

While I'm on the subject ... I noticed that the recently-added 
matview test has this:

matview.sql:133:VACUUM ANALYZE;

This doesn't make me happy.  Aside from the sheer waste of cycles
involved in re-analyzing the entire regression database, this
test runs in parallel with half a dozen others, and it could cause
plan instability in those.  Of course, if it does, then most likely
those tests have a hazard from autovacuum anyway.  But this still
looks to me like a poor bit of test design.
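
(If the point is just fresh stats on what that test itself created, a
targeted form would do the job without the collateral damage ---
something like, with a hypothetical table name:

VACUUM ANALYZE mvtest_t;    -- only this test's own objects

instead of hitting the whole database.)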

Anyway, bottom line is that I think we need to institute, and
back-patch, some consistent scheme for when to analyze the standard
tables during the regression tests, so that we don't have hazards
like this for tests that want to check what plan gets selected.

Comments?
        regards, tom lane



Re: Why the buildfarm is all pink

From: Robert Haas
On Tue, Dec 10, 2013 at 7:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This doesn't make me happy.  Aside from the sheer waste of cycles
> involved in re-analyzing the entire regression database, this
> test runs in parallel with half a dozen others, and it could cause
> plan instability in those.  Of course, if it does, then most likely
> those tests have a hazard from autovacuum anyway.  But this still
> looks to me like a poor bit of test design.

Agreed.

> Anyway, bottom line is that I think we need to institute, and
> back-patch, some consistent scheme for when to analyze the standard
> tables during the regression tests, so that we don't have hazards
> like this for tests that want to check what plan gets selected.
>
> Comments?

Everything you're saying sounds reasonable from here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why the buildfarm is all pink

From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Dec 10, 2013 at 7:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Anyway, bottom line is that I think we need to institute, and
>> back-patch, some consistent scheme for when to analyze the standard
>> tables during the regression tests, so that we don't have hazards
>> like this for tests that want to check what plan gets selected.
>>
>> Comments?

> Everything you're saying sounds reasonable from here.

I experimented with this and found out that an across-the-board ANALYZE
in the copy test causes a number of existing plan results to change,
because actually the very small tables like int4_tbl have never been
analyzed at all in the past.  Now, that might be just fine, or it might
be that those tests would no longer test the code path they were meant
to test.  I lack the energy right now to examine each one and decide
whether the change is OK.  What I propose instead is that the copy test
only analyze the tables that it itself loaded.  This lets us get rid
of most of the ad-hoc ANALYZEs of pre-existing tables, as in the attached
proposed patch against HEAD.  I ended up leaving in the "vacuum analyze
tenk1" in create_index.sql, because it turns out what that is actually
accomplishing is to fill the visibility map so that an index-only scan
will be chosen.  Maybe we should change all the added ANALYZEs in the
copy test to VACUUM ANALYZEs?  That would be redundant though with the
database-wide VACUUM in sanity_check.sql, which unfortunately runs after
create_index.sql.
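
To be concrete, the dependency in create_index.sql is approximately this
(paraphrased, not the literal test):

vacuum analyze tenk1;   -- the VACUUM part fills the visibility map
explain (costs off)
select unique1 from tenk1 where unique1 < 50;
-- expects an Index Only Scan; without the visibility-map bits set,
-- this degrades to a plain index or bitmap scan and the test fails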

I haven't touched matview.sql here; that seems like a distinct issue.

            regards, tom lane

diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index e17881c..c05b39c 100644
*** a/src/test/regress/expected/aggregates.out
--- b/src/test/regress/expected/aggregates.out
*************** FROM bool_test;
*** 506,512 ****
  -- Test cases that should be optimized into indexscans instead of
  -- the generic aggregate implementation.
  --
- analyze tenk1;        -- ensure we get consistent plans here
  -- Basic cases
  explain (costs off)
    select min(unique1) from tenk1;
--- 506,511 ----
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index 232a233..0f0c638 100644
*** a/src/test/regress/expected/alter_table.out
--- b/src/test/regress/expected/alter_table.out
*************** ALTER INDEX tmp_onek_unique1 RENAME TO o
*** 140,146 ****
  CREATE VIEW tmp_view (unique1) AS SELECT unique1 FROM tenk1;
  ALTER TABLE tmp_view RENAME TO tmp_view_new;
  -- hack to ensure we get an indexscan here
- ANALYZE tenk1;
  set enable_seqscan to off;
  set enable_bitmapscan to off;
  -- 5 values, sorted
--- 140,145 ----
diff --git a/src/test/regress/expected/arrays.out b/src/test/regress/expected/arrays.out
index d47861e..23b3902 100644
*** a/src/test/regress/expected/arrays.out
--- b/src/test/regress/expected/arrays.out
*************** SELECT 0 || ARRAY[1,2] || 3 AS "{0,1,2,3
*** 421,427 ****
   {0,1,2,3}
  (1 row)

- ANALYZE array_op_test;
  SELECT * FROM array_op_test WHERE i @> '{32}' ORDER BY seqno;
 seqno |                i                |                                                                 t
-------+---------------------------------+------------------------------------------------------------------------------------------------------------------------------------
--- 421,426 ----
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index c3598e2..85113a9 100644
*** a/src/test/regress/expected/join.out
--- b/src/test/regress/expected/join.out
*************** on (x1 = xx1) where (xx2 is not null);
*** 2129,2135 ****
  -- regression test: check for bug with propagation of implied equality
  -- to outside an IN
  --
- analyze tenk1;        -- ensure we get consistent plans here
  select count(*) from tenk1 a where unique1 in
    (select unique1 from tenk1 b join tenk1 c using (unique1)
     where b.unique2 = 42);
--- 2129,2134 ----
diff --git a/src/test/regress/input/copy.source b/src/test/regress/input/copy.source
index ab3f508..80f911f 100644
*** a/src/test/regress/input/copy.source
--- b/src/test/regress/input/copy.source
*************** COPY array_op_test FROM '@abs_srcdir@/da
*** 60,65 ****
--- 60,90 ----

  COPY array_index_op_test FROM '@abs_srcdir@/data/array.data';

+ -- analyze all the data we just loaded, to ensure plan consistency
+ -- in later tests
+
+ ANALYZE aggtest;
+ ANALYZE onek;
+ ANALYZE tenk1;
+ ANALYZE slow_emp4000;
+ ANALYZE person;
+ ANALYZE emp;
+ ANALYZE student;
+ ANALYZE stud_emp;
+ ANALYZE road;
+ ANALYZE real_city;
+ ANALYZE hash_i4_heap;
+ ANALYZE hash_name_heap;
+ ANALYZE hash_txt_heap;
+ ANALYZE hash_f8_heap;
+ ANALYZE test_tsvector;
+ ANALYZE bt_i4_heap;
+ ANALYZE bt_name_heap;
+ ANALYZE bt_txt_heap;
+ ANALYZE bt_f8_heap;
+ ANALYZE array_op_test;
+ ANALYZE array_index_op_test;
+
  --- test copying in CSV mode with various styles
  --- of embedded line ending characters

diff --git a/src/test/regress/output/copy.source b/src/test/regress/output/copy.source
index febca71..e8951c5 100644
*** a/src/test/regress/output/copy.source
--- b/src/test/regress/output/copy.source
*************** COPY bt_txt_heap FROM '@abs_srcdir@/data
*** 34,39 ****
--- 34,62 ----
  COPY bt_f8_heap FROM '@abs_srcdir@/data/hash.data';
  COPY array_op_test FROM '@abs_srcdir@/data/array.data';
  COPY array_index_op_test FROM '@abs_srcdir@/data/array.data';
+ -- analyze all the data we just loaded, to ensure plan consistency
+ -- in later tests
+ ANALYZE aggtest;
+ ANALYZE onek;
+ ANALYZE tenk1;
+ ANALYZE slow_emp4000;
+ ANALYZE person;
+ ANALYZE emp;
+ ANALYZE student;
+ ANALYZE stud_emp;
+ ANALYZE road;
+ ANALYZE real_city;
+ ANALYZE hash_i4_heap;
+ ANALYZE hash_name_heap;
+ ANALYZE hash_txt_heap;
+ ANALYZE hash_f8_heap;
+ ANALYZE test_tsvector;
+ ANALYZE bt_i4_heap;
+ ANALYZE bt_name_heap;
+ ANALYZE bt_txt_heap;
+ ANALYZE bt_f8_heap;
+ ANALYZE array_op_test;
+ ANALYZE array_index_op_test;
  --- test copying in CSV mode with various styles
  --- of embedded line ending characters
  create temp table copytest (
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 1bbe073..f99a07d 100644
*** a/src/test/regress/sql/aggregates.sql
--- b/src/test/regress/sql/aggregates.sql
*************** FROM bool_test;
*** 220,226 ****
  -- Test cases that should be optimized into indexscans instead of
  -- the generic aggregate implementation.
  --
- analyze tenk1;        -- ensure we get consistent plans here

  -- Basic cases
  explain (costs off)
--- 220,225 ----
diff --git a/src/test/regress/sql/alter_table.sql b/src/test/regress/sql/alter_table.sql
index b555430..87973c1 100644
*** a/src/test/regress/sql/alter_table.sql
--- b/src/test/regress/sql/alter_table.sql
*************** CREATE VIEW tmp_view (unique1) AS SELECT
*** 178,184 ****
  ALTER TABLE tmp_view RENAME TO tmp_view_new;

  -- hack to ensure we get an indexscan here
- ANALYZE tenk1;
  set enable_seqscan to off;
  set enable_bitmapscan to off;
  -- 5 values, sorted
--- 178,183 ----
diff --git a/src/test/regress/sql/arrays.sql b/src/test/regress/sql/arrays.sql
index 5a5a582..e4f9f31 100644
*** a/src/test/regress/sql/arrays.sql
--- b/src/test/regress/sql/arrays.sql
*************** SELECT ARRAY[[1,2],[3,4]] || ARRAY[5,6]
*** 196,203 ****
  SELECT ARRAY[0,0] || ARRAY[1,1] || ARRAY[2,2] AS "{0,0,1,1,2,2}";
  SELECT 0 || ARRAY[1,2] || 3 AS "{0,1,2,3}";

- ANALYZE array_op_test;
-
  SELECT * FROM array_op_test WHERE i @> '{32}' ORDER BY seqno;
  SELECT * FROM array_op_test WHERE i && '{32}' ORDER BY seqno;
  SELECT * FROM array_op_test WHERE i @> '{17}' ORDER BY seqno;
--- 196,201 ----
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index 6ae37f5..718175b 100644
*** a/src/test/regress/sql/join.sql
--- b/src/test/regress/sql/join.sql
*************** on (x1 = xx1) where (xx2 is not null);
*** 330,337 ****
  -- regression test: check for bug with propagation of implied equality
  -- to outside an IN
  --
- analyze tenk1;        -- ensure we get consistent plans here
-
  select count(*) from tenk1 a where unique1 in
    (select unique1 from tenk1 b join tenk1 c using (unique1)
     where b.unique2 = 42);
--- 330,335 ----

Re: Why the buildfarm is all pink

From: Andres Freund
On 2013-12-10 19:55:12 -0500, Tom Lane wrote:
> I was surprised to see that my back-patches of the recent SubLink
> unpleasantness were failing on many of the buildfarm members, but
> only in the 9.1 and 9.0 branches.  The difficulty appears to be
> that the EXPLAIN output for the new test query changes depending on
> whether or not "tenk1" has been analyzed yet.  In 9.2 and up,
> it reliably has been, because create_index runs first and that script
> does this:

> create_index.sql:901:vacuum analyze tenk1;        -- ensure we get consistent plans here

> so depending on timing, one of those might have gotten the job done,
> or maybe autovacuum would show up in time to save the day.

> We need a more consistent strategy for this :-(

Agreed, although I have no clue what it should look like. As a further
datapoint I'll add that installcheck already regularly fails in HEAD if
you have a HS standby connected via SR and hot_standby_feedback=on on
the standby. Some plans just change from index(only) scans to sequential
scans, presumably because the lower xmin horizon changed the
stats. Since there's nothing running on the standby in those cases,
there has to be a pretty damn tiny window here somewhere.

I've wondered whether we could fix that by a) more explicit
vacuum/analyzes or b) a function that waits for a quiescent state.
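
(For (b), I'm imagining something entirely hypothetical along these
lines --- note it only waits, something else still has to actually set
the visibility-map bits in the meantime:

do $$
declare
  tries int := 0;
begin
  loop
    -- wait until every page of tenk1 is known all-visible
    exit when (select relallvisible = relpages
                 from pg_class where relname = 'tenk1');
    tries := tries + 1;
    exit when tries > 100;   -- don't hang forever
    perform pg_sleep(0.1);
  end loop;
end $$;
)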

Arguably we could just define that as unsupported, but given that
recovery gets no testing at all apart from this, that doesn't seem like
a good idea.

There's also fun in running with vacuum_defer_cleanup_age > 0, but I
don't think there's much chance of supporting that.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Why the buildfarm is all pink

From: Kevin Grittner
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I haven't touched matview.sql here; that seems like a distinct issue.


I'll fix that.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why the buildfarm is all pink

From: Kevin Grittner
Kevin Grittner <kgrittn@ymail.com> wrote:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> I haven't touched matview.sql here; that seems like a distinct
>> issue.
>
> I'll fix that.

Done.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why the buildfarm is all pink

From: Tom Lane
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-12-10 19:55:12 -0500, Tom Lane wrote:
>> We need a more consistent strategy for this :-(

> Agreed, although I have no clue what it should look like. As a further
> datapoint I'll add that installcheck already regularly fails in HEAD if
> you have a HS standby connected via SR and hot_standby_feedback=on on
> the standby. Some plans just change from index(only) scans to sequential
> scans, presumably because the lower xmin horizon changed the
> stats. Since there's nothing running on the standby in those cases,
> there has to be a pretty damn tiny window here somewhere.

The case in create_index does a "vacuum analyze tenk1" and expects
to get an index-only scan in the very next SQL command.  So any delay
in considering the table all-visible could break that test.  I'm not
sure if that's what you're talking about though.  We could easily
create some more delay for that case, for instance by moving the
vacuum step to copy.sql as I was idly speculating about upthread.
Do you remember offhand where the failures are?
        regards, tom lane



Re: Why the buildfarm is all pink

From: Andres Freund
On 2013-12-11 10:07:19 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2013-12-10 19:55:12 -0500, Tom Lane wrote:
> >> We need a more consistent strategy for this :-(
>
> > Agreed, although I have no clue what it should look like. As a further
> > datapoint I'll add that installcheck already regularly fails in HEAD if
> > you have a HS standby connected via SR and hot_standby_feedback=on on
> > the standby. Some plans just change from index(only) scans to sequential
> > scans, presumably because the lower xmin horizon changed the
> > stats. Since there's nothing running on the standby in those cases,
> > there has to be a pretty damn tiny window here somewhere.
>
> The case in create_index does a "vacuum analyze tenk1" and expects
> to get an index-only scan in the very next SQL command.  So any delay
> in considering the table all-visible could break that test.  I'm not
> sure if that's what you're talking about though.  We could easily
> create some more delay for that case, for instance by moving the
> vacuum step to copy.sql as I was idly speculating about upthread.
> Do you remember offhand where the failures are?

No, but they are easy enough to reproduce. Out of 10 runs, I've attached
the one with the most failures and checked that it seems to contain all
the failures from other runs. All of them probably could be fixed by
moving things around, but I am not sure how maintainable that approach
is :/

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Why the buildfarm is all pink

From: Tom Lane
Kevin Grittner <kgrittn@ymail.com> writes:
> Kevin Grittner <kgrittn@ymail.com> wrote:
>> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I haven't touched matview.sql here; that seems like a distinct issue.

>> I'll fix that.

> Done.

Thanks.
        regards, tom lane



Re: Why the buildfarm is all pink

From: Tom Lane
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-12-11 10:07:19 -0500, Tom Lane wrote:
>> Do you remember offhand where the failures are?

> No, but they are easy enough to reproduce. Out of 10 runs, I've attached
> the one with the most failures and checked that it seems to contain all
> the failures from other runs. All of them probably could be fixed by
> moving things around, but I am not sure how maintainable that approach
> is :/

Thanks for doing the legwork.  These all seem to be cases where the
planner decided against doing an index-only scan on tenk1, which is
presumably because its relallvisible fraction is too low.  But these are
later in the test series than the "vacuum analyze tenk1" that's currently
present in create_index, and most of them are even later than the
database-wide VACUUM in sanity_check.  So those vacuums are failing to
mark the table as all-visible, even though it's not changed since the COPY
test.  This seems odd.  Do you know why your slave server is holding back
the xmin horizon so much?
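
(The quick check on the standby would be something like

select relname, relpages, relallvisible
from pg_class where relname = 'tenk1';

--- if relallvisible stays well below relpages after the vacuums, that's
the smoking gun.)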

After looking at this, I conclude that moving the vacuums earlier would
probably make things worse not better, because the critical interval seems
to be from the "COPY tenk1 FROM" command to the vacuum command.  So the idea
of putting vacuums into the COPY test is a bad one, and I'll proceed with
the patch I posted yesterday for moving the ANALYZE steps around.  I think
fixing what you're seeing is going to be a different issue.
        regards, tom lane