Thread: Intermittent regression test failures from index-only plan changes
Another regression test failure that I've been seeing lately is a change from index-only scan to seqscan in create_index, as for instance here: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguar&dt=2012-01-02%2023%3A05%3A02 I've managed to duplicate and debug this one too. What I find is that the planner switches from preferring index-only scan to preferring seqscan because normally it sees the table as being fully all-visible after the immediately preceding VACUUM, but in the failure cases the table is seen as having no all-visible pages. That inflates the estimated cost of an index-only scan by enough to make it look like a loser. The create_index test is usually running by itself at this point (since the concurrently started create_view test is much shorter), but I had autovacuum logging on and found that there was a concurrent auto-ANALYZE on another table when the manual VACUUM was running. So failure to set the all-visible flags is expected given that context. This is a bit troublesome because, AFAICS, this means that every single one of the fifty-odd regression test cases that expect to see an Index Only Scan plan might transiently fail, if there happens to be a background auto-ANALYZE running at just the moment that the previous vacuum would've otherwise set the all-visible bits. It might be that all the other ones are safe in practice for various reasons, but even if they are there's no guarantee that new regression tests added in future will be reliable. Background auto-VACUUMs shouldn't cause this problem because they don't take snapshots, and ideally it'd be nice if auto-ANALYZE couldn't create the issue either, but ANALYZE does need a snapshot so it's hard to see how to avoid having it trip the all-visible logic. Anybody have any ideas? If nothing else comes to mind I'll probably just remove the sometimes-failing EXPLAIN test case, but I'm worried about the prospects of future failures of the same ilk. regards, tom lane
On Fri, Jan 6, 2012 at 10:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Background auto-VACUUMs shouldn't cause this problem because they don't > take snapshots, and ideally it'd be nice if auto-ANALYZE couldn't create > the issue either, but ANALYZE does need a snapshot so it's hard to see > how to avoid having it trip the all-visible logic. Anybody have any > ideas? That seems like a huge production problem looming over us, not just a regression test issue. I feel like this is a trick question, but I'll ask anyway: Can't we just ignore ANALYZE? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> writes: > On Fri, Jan 6, 2012 at 10:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Background auto-VACUUMs shouldn't cause this problem because they don't >> take snapshots, and ideally it'd be nice if auto-ANALYZE couldn't create >> the issue either, but ANALYZE does need a snapshot so it's hard to see >> how to avoid having it trip the all-visible logic. Anybody have any >> ideas? > That seems like a huge production problem looming over us, not just a > regression test issue. It could be mildly annoying, but I'm not sure it rises to the level of a "huge production problem". > I feel like this is a trick question, but I'll ask anyway: Can't we > just ignore ANALYZE? AFAICS, no. ANALYZE will run user-defined code: not only user-supplied stats collection functions, but user-defined index expressions. We cannot assume that none of that ever requires a snapshot. BTW, if you think this is a fatal issue, then I'm afraid it's also a fatal objection to your proposal to turn SnapshotNow into an MVCC snapshot. Even if we could assume ANALYZE doesn't run any user-defined code, it most certainly must be able to do system catalog accesses, and that will mean SnapshotNow which would become an MVCC snapshot. What's worse, VACUUM would also begin taking MVCC snapshots for its catalog accesses, and thus could no longer be ignored for the purpose of determining global xmin. I have an uneasy feeling that the assumption that VACUUM doesn't take MVCC snapshots is wired into more places than just GetSnapshotData, but can't think of any other places just now. regards, tom lane
On Sat, Jan 7, 2012 at 5:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > BTW, if you think this is a fatal issue, then I'm afraid it's also > a fatal objection to your proposal to turn SnapshotNow into an MVCC > snapshot. Not just because of this, but I think its the wrong time in the cycle to be completely rethinking catalog access semantics. Or put another way, I'm too late and its too risky because the patch was becoming enormous. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Jan 7, 2012 at 12:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I feel like this is a trick question, but I'll ask anyway: Can't we >> just ignore ANALYZE? > > AFAICS, no. ANALYZE will run user-defined code: not only user-supplied > stats collection functions, but user-defined index expressions. We > cannot assume that none of that ever requires a snapshot. The question is: Why would it matter if we expunged tuples from table A while ANALYZE was running on table B? I guess the problem is that the index on B might involve a user-defined function which (under the covers) peeks at table A, possibly now seeing an inconsistent view of the database. It's pretty unfortunate to have to cater to that situation, though, because most of the time an ANALYZE on table A is only going to look at table A and the system catalogs. In fact, it wouldn't even be disastrous (in most cases) if we removed tuples from the table being analyzed - we're engaged in an inherently statistical process anyway, so who really cares if things change on us in medias res? Could we easily detect the cases where user code is being run and ignore ANALYZE when none is? A probably crazy idea is to add an option to vacuum that would cause it, upon discovering that it can't set PD_ALL_VISIBLE on a page because the global xmin is too old, to wait for all of the virtual transaction IDs who might not be able to see every tuple on the page. This would allow us to get into a state where all the PD_ALL_VISIBLE bits are known to be set. But that seems a bit complex for something that we probably don't care about much outside of the regression tests. If none of the above is feasible (and I suspect it isn't), we might just want to tweak the queries to do something that will preclude using an index-only scan, like including tableoid::regclass in the target list. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jan 27, 2012 at 01:45:28PM -0500, Robert Haas wrote: > On Sat, Jan 7, 2012 at 12:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> I feel like this is a trick question, but I'll ask anyway: Can't we > >> just ignore ANALYZE? > > > > AFAICS, no. ANALYZE will run user-defined code: not only user-supplied > > stats collection functions, but user-defined index expressions. We > > cannot assume that none of that ever requires a snapshot. > > The question is: Why would it matter if we expunged tuples from table > A while ANALYZE was running on table B? I guess the problem is that > the index on B might involve a user-defined function which (under the > covers) peeks at table A, possibly now seeing an inconsistent view of > the database. > > It's pretty unfortunate to have to cater to that situation, though, > because most of the time an ANALYZE on table A is only going to look > at table A and the system catalogs. In fact, it wouldn't even be > disastrous (in most cases) if we removed tuples from the table being > analyzed - we're engaged in an inherently statistical process anyway, > so who really cares if things change on us in medias res? > > Could we easily detect the cases where user code is being run and > ignore ANALYZE when none is? > > A probably crazy idea is to add an option to vacuum that would cause > it, upon discovering that it can't set PD_ALL_VISIBLE on a page > because the global xmin is too old, to wait for all of the virtual > transaction IDs who might not be able to see every tuple on the page. > This would allow us to get into a state where all the PD_ALL_VISIBLE > bits are known to be set. But that seems a bit complex for something > that we probably don't care about much outside of the regression > tests. > > If none of the above is feasible (and I suspect it isn't), we might > just want to tweak the queries to do something that will preclude > using an index-only scan, like including tableoid::regclass in the > target list. Was this addressed? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Mon, Aug 27, 2012 at 9:18 AM, Bruce Momjian <bruce@momjian.us> wrote: > Was this addressed? See commit d6d5f67b5b98b1685f9158e9d00a726afb2ae789. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company