Intermittent regression test failures from index-only plan changes

From: Tom Lane
Date: Fri, Jan 6, 2012
Another regression test failure that I've been seeing lately is a change
from index-only scan to seqscan in create_index, as for instance here:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguar&dt=2012-01-02%2023%3A05%3A02
I've managed to duplicate and debug this one too.  What I find is
that the planner switches from preferring index-only scan to preferring
seqscan because normally it sees the table as being fully all-visible
after the immediately preceding VACUUM, but in the failure cases the
table is seen as having no all-visible pages.  That inflates the
estimated cost of an index-only scan by enough to make it look like a
loser.  The create_index test is usually running by itself at this point
(since the concurrently started create_view test is much shorter), but
I had autovacuum logging on and found that there was a concurrent
auto-ANALYZE on another table when the manual VACUUM was running.
So failure to set the all-visible flags is expected given that context.
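
What the planner is consulting here is pg_class.relallvisible, which
VACUUM maintains from the visibility map.  A rough sketch of what this
looks like by hand, using tenk1 just as a convenient regression table
(exact plans and numbers will vary):

    -- after a clean VACUUM, relallvisible should be close to relpages
    VACUUM tenk1;
    SELECT relpages, relallvisible FROM pg_class WHERE relname = 'tenk1';

    -- with relallvisible high, this can cost out as
    --   Index Only Scan using tenk1_unique1 on tenk1
    EXPLAIN (COSTS OFF) SELECT unique1 FROM tenk1 WHERE unique1 < 100;

    -- but with relallvisible = 0, the index-only scan is charged full
    -- heap-visit costs and loses to a seqscan or plain index scan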

This is a bit troublesome because, AFAICS, this means that every single
one of the fifty-odd regression test cases that expect to see an Index
Only Scan plan might transiently fail, if there happens to be a
background auto-ANALYZE running at just the moment that the previous
vacuum would've otherwise set the all-visible bits.  It might be that
all the other ones are safe in practice for various reasons, but even
if they are, there's no guarantee that new regression tests added in
the future will be reliable.

Background auto-VACUUMs shouldn't cause this problem because they don't
take snapshots, and ideally it'd be nice if auto-ANALYZE couldn't create
the issue either, but ANALYZE does need a snapshot so it's hard to see
how to avoid having it trip the all-visible logic.  Anybody have any
ideas?
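
For what it's worth, the failure mode is easy to demonstrate by hand by
holding a snapshot the way the concurrent auto-ANALYZE does.  A sketch
(table name invented):

    -- session 1: take a snapshot and sit on it
    BEGIN ISOLATION LEVEL REPEATABLE READ;
    SELECT 1;                      -- the snapshot is acquired here

    -- session 2: rows committed after that snapshot can't yet be
    -- visible to all transactions, so VACUUM must leave the
    -- all-visible bits unset
    CREATE TABLE t (a int PRIMARY KEY);
    INSERT INTO t SELECT generate_series(1, 100000);
    VACUUM t;
    SELECT relallvisible FROM pg_class WHERE relname = 't';   -- 0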

If nothing else comes to mind I'll probably just remove the
sometimes-failing EXPLAIN test case, but I'm worried about the prospects
of future failures of the same ilk.
        regards, tom lane


Re: Intermittent regression test failures from index-only plan changes

From: Simon Riggs
On Fri, Jan 6, 2012 at 10:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Background auto-VACUUMs shouldn't cause this problem because they don't
> take snapshots, and ideally it'd be nice if auto-ANALYZE couldn't create
> the issue either, but ANALYZE does need a snapshot so it's hard to see
> how to avoid having it trip the all-visible logic.  Anybody have any
> ideas?

That seems like a huge production problem looming over us, not just a
regression test issue.

I feel like this is a trick question, but I'll ask anyway: Can't we
just ignore ANALYZE?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Intermittent regression test failures from index-only plan changes

From: Tom Lane
Date: Sat, Jan 7, 2012
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Fri, Jan 6, 2012 at 10:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Background auto-VACUUMs shouldn't cause this problem because they don't
>> take snapshots, and ideally it'd be nice if auto-ANALYZE couldn't create
>> the issue either, but ANALYZE does need a snapshot so it's hard to see
>> how to avoid having it trip the all-visible logic.  Anybody have any
>> ideas?

> That seems like a huge production problem looming over us, not just a
> regression test issue.

It could be mildly annoying, but I'm not sure it rises to the level of a
"huge production problem".

> I feel like this is a trick question, but I'll ask anyway: Can't we
> just ignore ANALYZE?

AFAICS, no.  ANALYZE will run user-defined code: not only user-supplied
stats collection functions, but user-defined index expressions.  We
cannot assume that none of that ever requires a snapshot.

BTW, if you think this is a fatal issue, then I'm afraid it's also
a fatal objection to your proposal to turn SnapshotNow into an MVCC
snapshot.  Even if we could assume ANALYZE doesn't run any user-defined
code, it most certainly must be able to do system catalog accesses,
and that will mean SnapshotNow which would become an MVCC snapshot.
What's worse, VACUUM would also begin taking MVCC snapshots for
its catalog accesses, and thus could no longer be ignored for the
purpose of determining global xmin.

I have an uneasy feeling that the assumption that VACUUM doesn't take
MVCC snapshots is wired into more places than just GetSnapshotData,
but can't think of any other places just now.
        regards, tom lane


Re: Intermittent regression test failures from index-only plan changes

From: Simon Riggs
On Sat, Jan 7, 2012 at 5:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> BTW, if you think this is a fatal issue, then I'm afraid it's also
> a fatal objection to your proposal to turn SnapshotNow into an MVCC
> snapshot.

Not just because of this, but I think it's the wrong time in the cycle
to be completely rethinking catalog access semantics. Or put another
way, I'm too late and it's too risky because the patch was becoming
enormous.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Intermittent regression test failures from index-only plan changes

From: Robert Haas
Date: Fri, Jan 27, 2012
On Sat, Jan 7, 2012 at 12:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I feel like this is a trick question, but I'll ask anyway: Can't we
>> just ignore ANALYZE?
>
> AFAICS, no.  ANALYZE will run user-defined code: not only user-supplied
> stats collection functions, but user-defined index expressions.  We
> cannot assume that none of that ever requires a snapshot.

The question is: Why would it matter if we expunged tuples from table
A while ANALYZE was running on table B?  I guess the problem is that
the index on B might involve a user-defined function which (under the
covers) peeks at table A, possibly now seeing an inconsistent view of
the database.
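
Concretely, the hazard looks something like this (all names invented,
and the function is only nominally immutable, which is exactly the kind
of thing users do):

    CREATE TABLE a (id int);
    CREATE TABLE b (val int);

    -- declared immutable so it's allowed in an index definition,
    -- but it secretly reads table a
    CREATE FUNCTION peek(x int) RETURNS bigint
        LANGUAGE sql IMMUTABLE
        AS 'SELECT $1 + (SELECT count(*) FROM a)';

    CREATE INDEX b_peek_idx ON b (peek(val));

    -- ANALYZE evaluates peek() over the sampled rows to gather
    -- statistics for the expression index, so it needs a snapshot
    -- under which reading table a gives a consistent answer
    ANALYZE b;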

It's pretty unfortunate to have to cater to that situation, though,
because most of the time an ANALYZE on table A is only going to look
at table A and the system catalogs.  In fact, it wouldn't even be
disastrous (in most cases) if we removed tuples from the table being
analyzed - we're engaged in an inherently statistical process anyway,
so who really cares if things change on us in medias res?

Could we easily detect the cases where user code is being run and
ignore ANALYZE when none is?

A probably crazy idea is to add an option to vacuum that would cause
it, upon discovering that it can't set PD_ALL_VISIBLE on a page
because the global xmin is too old, to wait for all of the virtual
transaction IDs that might not be able to see every tuple on the page.
This would allow us to get into a state where all the PD_ALL_VISIBLE
bits are known to be set.  But that seems a bit complex for something
that we probably don't care about much outside of the regression
tests.

If none of the above is feasible (and I suspect it isn't), we might
just want to tweak the queries to do something that will preclude
using an index-only scan, like including tableoid::regclass in the
target list.
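
That is, something along these lines (query shape invented for
illustration):

    -- as written, plannable as an index-only scan:
    SELECT unique1 FROM tenk1 WHERE unique1 < 100;

    -- referencing the tableoid system column takes the index-only
    -- path off the table, so the plan no longer depends on the
    -- visibility-map state:
    SELECT unique1, tableoid::regclass FROM tenk1 WHERE unique1 < 100;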

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Intermittent regression test failures from index-only plan changes

From: Bruce Momjian
Date: Mon, Aug 27, 2012
On Fri, Jan 27, 2012 at 01:45:28PM -0500, Robert Haas wrote:
> On Sat, Jan 7, 2012 at 12:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I feel like this is a trick question, but I'll ask anyway: Can't we
> >> just ignore ANALYZE?
> >
> > AFAICS, no.  ANALYZE will run user-defined code: not only user-supplied
> > stats collection functions, but user-defined index expressions.  We
> > cannot assume that none of that ever requires a snapshot.
> 
> The question is: Why would it matter if we expunged tuples from table
> A while ANALYZE was running on table B?  I guess the problem is that
> the index on B might involve a user-defined function which (under the
> covers) peeks at table A, possibly now seeing an inconsistent view of
> the database.
> 
> It's pretty unfortunate to have to cater to that situation, though,
> because most of the time an ANALYZE on table A is only going to look
> at table A and the system catalogs.  In fact, it wouldn't even be
> disastrous (in most cases) if we removed tuples from the table being
> analyzed - we're engaged in an inherently statistical process anyway,
> so who really cares if things change on us in medias res?
> 
> Could we easily detect the cases where user code is being run and
> ignore ANALYZE when none is?
> 
> A probably crazy idea is to add an option to vacuum that would cause
> it, upon discovering that it can't set PD_ALL_VISIBLE on a page
> because the global xmin is too old, to wait for all of the virtual
> transaction IDs that might not be able to see every tuple on the page.
> This would allow us to get into a state where all the PD_ALL_VISIBLE
> bits are known to be set.  But that seems a bit complex for something
> that we probably don't care about much outside of the regression
> tests.
> 
> If none of the above is feasible (and I suspect it isn't), we might
> just want to tweak the queries to do something that will preclude
> using an index-only scan, like including tableoid::regclass in the
> target list.

Was this addressed?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Intermittent regression test failures from index-only plan changes

From: Robert Haas
On Mon, Aug 27, 2012 at 9:18 AM, Bruce Momjian <bruce@momjian.us> wrote:
> Was this addressed?

See commit d6d5f67b5b98b1685f9158e9d00a726afb2ae789.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company