Re: views much slower in 9.3 than 8.4 - Mailing list pgsql-performance

From Tom Lane
Subject Re: views much slower in 9.3 than 8.4
Date 2015-03-28
Msg-id 1174.1427562844@sss.pgh.pa.us
In response to Re: views much slower in 9.3 than 8.4  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: views much slower in 9.3 than 8.4  (Kevin Grittner <kgrittn@ymail.com>)
List pgsql-performance
I wrote:
> "Carson, Leonard" <lcarson@sdsc.edu> writes:
>> Here are the 3 views and some timing notes:
>> http://pgsql.privatepaste.com/decae31693#

> That doesn't really leave us any wiser than before, unfortunately.

> It's clear that the root of the problem is the drastic underestimation of
> the size of the rq/a join, but it's not clear why that's happening, nor
> why 8.4 would not have fallen into the same trap.

Leonard was kind enough to provide the problematic data off-list, and
here's what I find after some poking around: 8.4 is not, in fact, any
smarter than the more recent versions, it just happens to get lucky
on this particular query.  The core of the problem is this aspect of
the projv view:

SELECT ...
   FROM
    allocations a,
    ... other relations ...
  WHERE
    a.initial_start_date = (SELECT max(allocations.initial_start_date)
                            FROM allocations
                            WHERE allocations.account_id = a.account_id AND
                                  allocations.resource_id = a.resource_id)
    AND ... a bunch of other conditions ...

(There's a similar consider-only-the-latest-row construction for
accounts_history, which doubles the problem, but let's just look at this
one for now.)  Now there are two things that are bad about this: first,
the construct requires executing the MAX-subselect for every row of "a",
which is expensive and the planner knows it.  Second, it's very very hard
for the planner to guess right about the selectivity of this condition on
a.initial_start_date.  It ends up using DEFAULT_EQ_SEL, which is 0.005,
but given Leonard's data set the actual selectivity is just about 1.0;
ie, there are no records that aren't the latest for their
account_id/resource_id combination, and thus no rows are eliminated by
the condition anyway.
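
To see the first problem in isolation, here is a stripped-down sketch
(the toy table here is hypothetical, pared down from the real view);
EXPLAIN shows the correlated aggregate as a SubPlan hanging off the scan
of "a", to be re-executed for each outer row:

CREATE TEMP TABLE allocations (
    account_id         int,
    resource_id        int,
    initial_start_date date
);

EXPLAIN
SELECT *
  FROM allocations a
 WHERE a.initial_start_date = (SELECT max(x.initial_start_date)
                               FROM allocations x
                               WHERE x.account_id = a.account_id AND
                                     x.resource_id = a.resource_id);

And with nothing better to go on, the selectivity of that "=" condition
gets guessed at per the above.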

So we have an expensive scan on "a" that is going to produce many more
rows than the planner thinks.  By the time we get done joining to
accounts_history, which has a similar problem, the planner is estimating
only one row out of the join (vs. nearly 24000 in reality), and it's
setting the total cost estimate at 148163 cost units.  This just totally
bollixes planning of the joins to the remaining half-dozen tables.
The single-row estimate is nearly fatal in itself, because it encourages
nestloop joining which is pretty inappropriate here.  But the other
problem is that the planner considers less-than-1% differences in cost
estimates to be "in the noise", which means that it's not going to
consider cost differences of less than about 1480 units (1% of 148163)
in the remaining join
steps to be significant.  This is how come we end up with the apparently
brain-dead decisions to use seqscans on some of the other tables such as
"pi" and "ac": comparing the seqscan to a potential inner indexscan, the
total cost of the join is "the same" according to the 1% rule, and then
the first tiebreaker is startup cost, and the indexscan has a higher
startup cost.
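
(A quick way to double-check that diagnosis on an affected installation,
assuming the view is still named projv as above, is to take seqscans off
the table for one test and compare plans:

SET enable_seqscan = off;   -- testing only, don't leave this set
EXPLAIN ANALYZE SELECT * FROM projv;
RESET enable_seqscan;

If the forced plan wins by a wide margin despite its nominally higher
cost, the blame lies with the estimates, not the plan search.)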

Now, 8.4 also had the 1% rule, but it had slightly different tiebreaking
procedures, which caused it to end up picking the inner indexscans over
the seqscans; and in this particular data set inner indexscans do far
better than repeated seqscans when the rq/a/ah join turns out to produce
24000 times more tuples than predicted.  But I can't persuade myself that
the tiebreaking changes amount to a bug.  (I did experiment with varying
the tiebreaking rules a bit, but I think that would just be moving the
pain around.)

Long-term planner fixes for this type of problem might include improving
the statistics enough that we could get better rowcount estimates.
(Cross-column stats would help, since a contributing factor is that some
of the joins are on two join columns that are pretty heavily correlated.)
Another thing we've discussed is making risk estimates, whereby we could
realize that the nestloop-plus-seqscan plans are going to be a lot worse
if our rowcount estimates are off at all.  But both of those things are
research projects.
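
(For the record, the extended-statistics feature that eventually appeared
in PostgreSQL 10 addresses a piece of the first idea; the sketch below
shows roughly what it would look like against this schema, though such
stats chiefly improve restriction-clause estimates rather than join
selectivities, so treat it as illustration, not a tested fix.)

CREATE STATISTICS alloc_acct_res (ndistinct, dependencies)
    ON account_id, resource_id FROM acct.allocations;
ANALYZE acct.allocations;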

What seems like a potential near-term fix for Leonard is to recast his
views to do the latest-row selection more intelligently.  I experimented
with redoing the projv view like this to eliminate the
subselects-in-WHERE:

SELECT ...
   FROM
    acct.requests rq,
    acct.fields_of_science fos,
    acct.accounts ac,
    acct.allocations a,
    (SELECT account_id, resource_id, max(initial_start_date) AS initial_start_date
     FROM acct.allocations GROUP BY 1, 2) a_latest,
    acct.transaction_types tt,
    acct.resources ar,
    acct.accounts_history ah,
    (SELECT account_id, resource_id, max(activity_time) AS activity_time
     FROM acct.accounts_history GROUP BY 1, 2) ah_latest,
    acct.allocation_states sx,
    acct.principal_investigators pi,
    acct.people p
  WHERE
    a.account_id = ac.account_id AND
    a.account_id = a_latest.account_id AND
    a.resource_id = a_latest.resource_id AND
    a.initial_start_date = a_latest.initial_start_date AND
    rq.account_id = a.account_id AND
    rq.start_date = a.initial_start_date AND
    ar.resource_id = a.resource_id AND
    a.allocation_type_id = tt.transaction_type_id AND
    ah.account_id = a.account_id AND
    ah.resource_id = a.resource_id AND
    ah.account_id = ah_latest.account_id AND
    ah.resource_id = ah_latest.resource_id AND
    ah.activity_time = ah_latest.activity_time AND
    sx.state_id = ah.state_id AND
    rq.primary_fos_id = fos.field_of_science_id AND
    pi.request_id = rq.request_id AND
    p.person_id = pi.person_id
;

That produces significantly better plans.  It doesn't look like the
rowcount estimates are better :-( ... but the total estimated cost
is now down in the range of 3000 or so cost units, which means that
the 1% rule doesn't keep us from adopting the inner indexscans.  And
this is fundamentally a better way to do latest-row selection, anyhow.
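
(Another latest-row idiom that tends to work well in Postgres, untested
against this particular data set, is DISTINCT ON:

SELECT DISTINCT ON (account_id, resource_id) *
  FROM acct.allocations
 ORDER BY account_id, resource_id, initial_start_date DESC;

which could stand in for the join of "a" against "a_latest" above.)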

(I guess another potential research project is to do this sort of
aggregated-subselect transformation automatically.  But don't hold
your breath for that to happen, either.)

            regards, tom lane

