Thread: SQL select query becomes slow when using limit (with no offset)
Hi folks,
We have problems with performance of a simple SQL statement.
If we add a LIMIT 50, the query is about 6 times slower than without a limit (query returns 2 rows).
I have read this discussion: http://archives.postgresql.org/pgsql-performance/2008-09/msg00005.php but there seems to be no solution in it.
I tried the things from http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server but changing settings doesn't have a significant effect.
The DDL statements (create tables, indices) are attached.
The events_events table contains 375K rows, the events_event_types contains 71 rows.
The query:
select events_events.id FROM events_events
left join events_event_types on events_events.eventType_id=events_event_types.id
where events_event_types.severity=70
and events_events.cleared='f'
order by events_events.dateTime DESC
It takes 155ms to run this query (returning 2 rows)
After adding LIMIT 10, it takes 950 ms to run.
Query plan: without limit:
"Sort (cost=20169.62..20409.50 rows=95952 width=16)"
" Sort Key: events_events.datetime"
" -> Hash Join (cost=2.09..12229.58 rows=95952 width=16)"
" Hash Cond: (events_events.eventtype_id = events_event_types.id)"
" -> Seq Scan on events_events (cost=0.00..9918.65 rows=359820 width=24)"
" Filter: (NOT cleared)"
" -> Hash (cost=1.89..1.89 rows=16 width=8)"
" -> Seq Scan on events_event_types (cost=0.00..1.89 rows=16 width=8)"
" Filter: (severity = 70)"
Query plan: with limit:
"Limit (cost=0.00..12.50 rows=10 width=16)"
" -> Nested Loop (cost=0.00..119932.21 rows=95952 width=16)"
" -> Index Scan Backward using events_events_datetime_ind on events_events (cost=0.00..18242.28 rows=359820 width=24)"
" Filter: (NOT cleared)"
" -> Index Scan using events_event_types_pkey on events_event_types (cost=0.00..0.27 rows=1 width=8)"
" Index Cond: (events_event_types.id = events_events.eventtype_id)"
" Filter: (events_event_types.severity = 70)"
So Postgres seems to handle a query with LIMIT differently internally. I tried setting default_statistics_target to 10, 100 and 200, but saw no significant differences.
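(For reference, the change was along these lines; note that a new statistics target only affects statistics gathered after the change, so the tables need a fresh ANALYZE:)
SET default_statistics_target = 200;  -- also tried 10 and 100
ANALYZE events_events;
ANALYZE events_event_types;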
This problem appears on both Postgres 8.3 and 8.4.
Any suggestions?
Thanks in advance!
Best regards,
Kees van Dieren
--
Squins | IT, Honestly
Oranjestraat 23
2983 HL Ridderkerk
The Netherlands
Phone: +31 (0)180 414520
Mobile: +31 (0)6 30413841
www.squins.com
Chamber of commerce Rotterdam: 22048547
On Fri, Jul 31, 2009 at 1:11 PM, Kees van Dieren <keesvandieren@gmail.com> wrote:
> It takes 155ms to run this query (returning 2 rows)
>
> Query plan: without limit:
> "Sort (cost=20169.62..20409.50 rows=95952 width=16)"

Could you send the results of EXPLAIN ANALYZE for both queries? Evidently the planner is expecting a lot more rows than the 2 rows you're expecting but it's not clear where it's gone wrong.

--
greg
http://mit.edu/~gsstark/resume.pdf
> The query:
> select events_events.id FROM events_events
> left join events_event_types on events_events.eventType_id=events_event_types.id
> where events_event_types.severity=70
> and events_events.cleared='f'
> order by events_events.dateTime DESC

The main problem seems to be lack of a suitable index...

- Try creating an index on events_events( eventType_id, cleared )
- Or the other way around: events_events( cleared, eventType_id )
  (depends on your other queries)

Please try both and report EXPLAIN ANALYZE.
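For clarity, the two suggested indexes would look something like this (the index names are just placeholders):

CREATE INDEX events_events_eventtype_cleared ON events_events (eventtype_id, cleared);
-- or, depending on what the other queries filter on first:
CREATE INDEX events_events_cleared_eventtype ON events_events (cleared, eventtype_id);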
Hi Folks,
Thanks for your response.
I have added the following index (suggested by other post):
CREATE INDEX events_events_cleared_eventtype
ON events_events
USING btree
(eventtype_id, cleared)
WHERE cleared = false;
Also with columns in reversed order.
No changes in response time noticed.
An index on the cleared column is already there (the indices are in the SQL file attached to the initial post). eventtype_id has a foreign key constraint, which I believe adds an index automatically?
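(A query like this can be used to double-check which indexes are actually defined on the table:)

SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'events_events';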
The explain analyze results for both queries:
explain analyze select events_events.id FROM events_events
left join events_event_types on events_events.eventType_id=events_event_types.id
where events_event_types.severity=70
and not events_events.cleared
order by events_events.dateTime DESC LIMIT 100
>>>
"Limit (cost=0.00..125.03 rows=100 width=16) (actual time=0.046..3897.094 rows=77 loops=1)"
" -> Nested Loop (cost=0.00..120361.40 rows=96269 width=16) (actual time=0.042..3896.881 rows=77 loops=1)"
" -> Index Scan Backward using events_events_datetime_ind on events_events (cost=0.00..18335.76 rows=361008 width=24) (actual time=0.025..720.345 rows=360637 loops=1)"
" Filter: (NOT cleared)"
" -> Index Scan using events_event_types_pkey on events_event_types (cost=0.00..0.27 rows=1 width=8) (actual time=0.003..0.003 rows=0 loops=360637)"
" Index Cond: (events_event_types.id = events_events.eventtype_id)"
" Filter: (events_event_types.severity = 70)"
"Total runtime: 3897.268 ms"
explain analyze select events_events.id FROM events_events
left join events_event_types on events_events.eventType_id=events_event_types.id
where events_event_types.severity=70
and not events_events.cleared
order by events_events.dateTime DESC
>>>
"Sort (cost=20255.18..20495.85 rows=96269 width=16) (actual time=1084.842..1084.951 rows=77 loops=1)"
" Sort Key: events_events.datetime"
" Sort Method: quicksort Memory: 20kB"
" -> Hash Join (cost=2.09..12286.62 rows=96269 width=16) (actual time=1080.789..1084.696 rows=77 loops=1)"
" Hash Cond: (events_events.eventtype_id = events_event_types.id)"
" -> Seq Scan on events_events (cost=0.00..9968.06 rows=361008 width=24) (actual time=0.010..542.946 rows=360637 loops=1)"
" Filter: (NOT cleared)"
" -> Hash (cost=1.89..1.89 rows=16 width=8) (actual time=0.077..0.077 rows=16 loops=1)"
" -> Seq Scan on events_event_types (cost=0.00..1.89 rows=16 width=8) (actual time=0.010..0.046 rows=16 loops=1)"
" Filter: (severity = 70)"
"Total runtime: 1085.145 ms"
Any suggestions?
Thanks in advance!
Best regards,
Kees van Dieren
--
Squins | IT, Honestly
Oranjestraat 23
2983 HL Ridderkerk
The Netherlands
Phone: +31 (0)180 414520
Mobile: +31 (0)6 30413841
www.squins.com
Chamber of commerce Rotterdam: 22048547
Kees van Dieren wrote:
> Hi Folks,
>
> Thanks for your response.
>
> I have added the following index (suggested by other post):
>
> CREATE INDEX events_events_cleared_eventtype
>   ON events_events
>   USING btree
>   (eventtype_id, cleared)
>   WHERE cleared = false;
>
> Also with columns in reversed order.
>
> No changes in response time noticed.
>
> An index on the cleared column is already there (the indices are in the SQL file attached to the initial post). eventtype_id has a foreign key constraint, which I believe adds an index automatically?
>
> The explain analyze results for both queries:
>
> explain analyze select events_events.id FROM events_events
> left join events_event_types on events_events.eventType_id=events_event_types.id
> where events_event_types.severity=70
> and not events_events.cleared
> order by events_events.dateTime DESC LIMIT 100
> >>>
> "Limit (cost=0.00..125.03 rows=100 width=16) (actual time=0.046..3897.094 rows=77 loops=1)"
> " -> Nested Loop (cost=0.00..120361.40 rows=96269 width=16) (actual time=0.042..3896.881 rows=77 loops=1)"
> " -> Index Scan Backward using events_events_datetime_ind on events_events (cost=0.00..18335.76 rows=361008 width=24) (actual time=0.025..720.345 rows=360637 loops=1)"
> " Filter: (NOT cleared)"
> " -> Index Scan using events_event_types_pkey on events_event_types (cost=0.00..0.27 rows=1 width=8) (actual time=0.003..0.003 rows=0 loops=360637)"
> " Index Cond: (events_event_types.id = events_events.eventtype_id)"
> " Filter: (events_event_types.severity = 70)"
> "Total runtime: 3897.268 ms"

The plan here is guessing that we will find the 100 rows we want pretty quickly by scanning the dateTime index, as we aren't expecting to have to look through many rows to find 100 that match the criteria. With no cross-column statistics it's more a guess than a good calculation. So the guess is bad and we end up scanning 360k rows from the index before we find what we want. My skills are not up to giving specific advice on how to avert this problem. Maybe somebody else can help there.

> explain analyze select events_events.id FROM events_events
> left join events_event_types on events_events.eventType_id=events_event_types.id
> where events_event_types.severity=70
> and not events_events.cleared
> order by events_events.dateTime DESC
> >>>
> "Sort (cost=20255.18..20495.85 rows=96269 width=16) (actual time=1084.842..1084.951 rows=77 loops=1)"
> " Sort Key: events_events.datetime"
> " Sort Method: quicksort Memory: 20kB"
> " -> Hash Join (cost=2.09..12286.62 rows=96269 width=16) (actual time=1080.789..1084.696 rows=77 loops=1)"
> " Hash Cond: (events_events.eventtype_id = events_event_types.id)"
> " -> Seq Scan on events_events (cost=0.00..9968.06 rows=361008 width=24) (actual time=0.010..542.946 rows=360637 loops=1)"
> " Filter: (NOT cleared)"
> " -> Hash (cost=1.89..1.89 rows=16 width=8) (actual time=0.077..0.077 rows=16 loops=1)"
> " -> Seq Scan on events_event_types (cost=0.00..1.89 rows=16 width=8) (actual time=0.010..0.046 rows=16 loops=1)"
> " Filter: (severity = 70)"
> "Total runtime: 1085.145 ms"
>
> Any suggestions?

This plan is faster because you avoid the index scan. The planner is preferring to do a tablescan to find what it needs. This is much faster than the 360k random I/O index lookups.
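Rough arithmetic, if I'm reading the plan right (so treat this as a sketch): the planner prorates the LIMIT cost linearly from the node below it, roughly

  limit cost ~ nested loop total cost * (limit rows / estimated rows)
             ~ 120361.40 * (100 / 96269) ~ 125

i.e. it expects to stop after reading only a tiny fraction of the dateTime index. Because only 77 rows actually qualify, the Limit never fills up and the scan ends up walking all ~360k index entries instead.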
You can force this type of plan with a subquery and the OFFSET 0 trick, but I'm not sure it's the best solution, e.g.

explain analyze SELECT * FROM
  (SELECT events_events.id FROM events_events
   LEFT JOIN events_event_types on events_events.eventType_id=events_event_types.id
   WHERE events_event_types.severity=70
   AND not events_events.cleared
   ORDER BY events_events.dateTime DESC OFFSET 0) AS a
LIMIT 100

Regards

Russell
Thanks for your response.
I think your analysis is correct. When there are more than 100 rows that match this query, LIMIT 100 is fast.
However, we often have fewer than a hundred rows, so this is not sufficient for us.
The 'OFFSET 0' trick did not make a difference in response time (the query runs in 947 ms).
One thing that helps is limiting the set by adding this to the WHERE clause:
and events_events.dateTime > '2009-07-24'.
(query now runs in approx 500ms)
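(The bounded query, for completeness; the date literal is just an example cut-off:)

select events_events.id FROM events_events
left join events_event_types on events_events.eventType_id=events_event_types.id
where events_event_types.severity=70
and not events_events.cleared
and events_events.dateTime > '2009-07-24'
order by events_events.dateTime DESC LIMIT 100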
The workaround we implemented is query caching in our application (Java, with the JPA / Hibernate second-level query cache). This actually solves the problem for us, but I'd prefer to get this query performing well in Postgres itself. I'd think that the PostgreSQL query planner should be smarter in handling LIMIT clauses.
Would it get attention if I submit this to http://www.postgresql.org/support/submitbug ? (in fact it is not really a bug, but an improvement request).
Best regards,
Kees
--
Squins | IT, Honestly
Oranjestraat 23
2983 HL Ridderkerk
The Netherlands
Phone: +31 (0)180 414520
Mobile: +31 (0)6 30413841
www.squins.com
Chamber of commerce Rotterdam: 22048547
On Fri, Aug 7, 2009 at 4:00 AM, Kees van Dieren <keesvandieren@gmail.com> wrote:
> Would it get attention if I submit this to
> http://www.postgresql.org/support/submitbug ? (in fact it is not really a
> bug, but an improvement request).

I think that many of the people who read that mailing list also read this one, including, critically, Tom Lane, and you're not the first person to run into a problem caused by lack of cross-column statistics. I don't think you're going to get very far by submitting this as a bug. There are already several people interested in this problem, but as most of us don't get paid to hack on PostgreSQL, it's a question of finding enough round tuits; this is not an easy thing to fix.

If you are sufficiently bothered by this problem that you are willing to pay someone to fix it for you, there are several companies with whom you can contract to get this feature developed and committed for the next release of PostgreSQL.

...Robert
On 8/7/09 5:53 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> On Fri, Aug 7, 2009 at 4:00 AM, Kees van Dieren <keesvandieren@gmail.com> wrote:
>> Would it get attention if I submit this to
>> http://www.postgresql.org/support/submitbug ? (in fact it is not really a
>> bug, but an improvement request).
>
> I think that many of the people who read that mailing list also read
> this one, including, critically, Tom Lane, and you're not the first
> person to run into a problem caused by lack of cross-column
> statistics.

Critically, it should be understood that this general problem is not just born from lack of cross-column statistics.

It is also one of using the statistical expected value to calculate cost without consideration of the variance. Often, the cost of a plan varies widely and nonlinearly with a small change in the expected value of the stats used to estimate cost.

The problem is acute with LIMIT and various other boundary conditions where there is a steep 'cost cliff' for certain plan types. When LIMIT is applied, the planner changes its estimates, but does not take into account the _greatly_ increased uncertainty of those estimates.

Imagine a case where the planner's estimate is 100% correct, and on average one side of a join will have 2 tuples. The planner chooses nested loops to do that join. But the distribution of the number of tuples at this node is skewed, so although the expected value is 2, values of 10 and 0 are both common. When values of 10 occur, the execution time goes up significantly. An alternate plan might be slower for the case where the actual values in execution equal the expected values, but faster in the average case!

The average cost of a plan is NOT the cost of the query with average statistics, due to variance, nonlinearity, and skew. And even if they were equal, it might not be preferable to choose the plan that is best on average over the one that is best at the 90th percentile.

Getting back to the case above with the nestloop -- if the planner estimated the cost of the nestloop join versus other joins with some idea of the variance in the estimations, it could favor more 'stable' execution plans.

So in short, cross-column statistics don't have to be gathered and used to make this problem less acute. The planner just has to be more aware of variance and the things that lead to it, such as column correlation. Thus with a LIMIT applied, the expected value for the number of tuples scanned before a match will shrink, but the uncertainty of this estimate grows significantly as well, so the right plan to choose is one that hedges against the uncertainty, not the one that assumes the expected value is correct.

Gathering and using cross-column statistics will change the expected value for some plans, but more importantly will allow the uncertainty to be reduced. Better stats go hand in hand with the uncertainty analysis, because one can never have all cross-column statistics, across all tables and all join-function spaces, analyzed. Stats gathering can never be complete or without flaws. The planner can never be perfect.
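A small made-up illustration of the percentile point (the numbers are invented purely for the example): suppose the row count feeding a nestloop is 0 for 80% of executions and 10 for the other 20%, so the expected value is 2. If the nestloop costs 10 units per row and a hash join costs a flat 35 units, the nestloop wins on expected cost (0.8*0 + 0.2*100 = 20 versus 35), but at the 90th percentile it costs 100 versus the hash join's 35. A planner that only compares expected costs will always pick the riskier plan here.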
On Fri, Aug 7, 2009 at 5:09 PM, Scott Carey <scott@richrelevance.com> wrote:
> [...]
> Thus with a LIMIT applied, the expected value for the number of tuples
> scanned before a match will shrink, but the uncertainty of this estimate
> grows significantly as well, so the right plan to choose is one that hedges
> against the uncertainty, not the one that assumes the expected value is
> correct.

This is a good analysis. I think I proposed some kind of idea about tracking uncertainty in the planner a while back, but I never got very far with it. The problem is, first, figuring out how to estimate the uncertainty, and second, figuring out what to do with the result once you've estimated it. The concept of hedging against uncertainty is the right one, I think, but it's not obvious how to fit that into the cost-comparison algorithm that the planner uses. A sticking point here too is that the comparison of path costs is already a hot spot; making it more complex will likely have a noticeable negative impact on query planning time. For queries against huge databases that may not matter much, but for OLTP queries it certainly does.

There are a couple of methods that have been proposed to deal with this in the past. The most obvious one that's been talked about a couple of times is switching a nested loop to a hash join if the number of iterations exceeds some bound, which would require some executor support.

Empirically, it seems to me that the planner generally follows a pretty consistent pattern. If the inner relation is tiny or the number of loops is estimated to be very small, it uses a nested loop. When the inner rel is a bit bigger, or the number of iterations is nontrivial, it switches to a hash join. When the inner relation gets too big to fit in work_mem, or just big enough that hashing it looks too slow, it switches to a nested loop with inner index-scan or, especially if a useful sort is available, a merge join.

Just handling better the case where we pick a straight nested loop rather than a hash join would help a lot of people. Some basic conservatism about the number of outer rows would be useful here (in particular, we should probably assume that there will be at least 2 when costing a nest loop, unless the outer side is known unique), and it's also worth thinking about the fact that a hash join won't build the table unless there is at least 1 outer row, which I don't think the current costing algorithm takes into account. Or maybe we should estimate the amount by which the nested loop figures to beat out the hash join and only accept the nested loop plan if the win exceeds some threshold percentage.

...Robert
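(Tangential, and not something suggested upthread as a fix, but for anyone who wants to see what the alternative plan does for the original query, nested loops can be disabled for a session as a purely diagnostic measure:)

SET enable_nestloop = off;
explain analyze select events_events.id FROM events_events
left join events_event_types on events_events.eventType_id=events_event_types.id
where events_event_types.severity=70
and not events_events.cleared
order by events_events.dateTime DESC LIMIT 100;
RESET enable_nestloop;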
Robert Haas <robertmhaas@gmail.com> wrote:
> Just handling better the case where we pick a straight nested loop
> rather than a hash join would help a lot of people. Some basic
> conservatism about the number of outer rows would be useful here (in
> particular, we should probably assume that there will be at least 2
> when costing a nest loop, unless the outer side is known unique),
> and it's also worth thinking about the fact that a hash join won't
> build the table unless there is at least 1 outer row, which I don't
> think the current costing algorithm takes into account. Or maybe we
> should estimate the amount by which the nested loop figures to beat
> out the hash join and only accept the nested loop plan if the win
> exceeds some threshold percentage.

But in our environment the most common cause of a sub-optimal planning choice is over-estimating the cost of a nested loop. We've never been able to get good plans overall without dropping random_page_cost to twice the seq_page_cost or less -- in highly cached situations we routinely drop both to 0.1. Creating further bias against nested loop plans just means we'll have to push the numbers further from what they purportedly represent. It seems to me a significant unsolved problem is properly accounting for caching effects.

That said, I have no really clear idea on how best to solve that problem. The only ideas which recur when facing these issues are:

(1) The planner refuses to deal with fractional estimates of how many rows will be returned in a loop -- it treats 0.01 as 1 on the basis that you can't read a fractional row, rather than as a 1% chance that you will read a row and need to do the related work. I have always thought that changing this might allow more accurate estimates; perhaps I should hack a version which behaves that way and test it as a "proof of concept." Note that this is diametrically opposed to your suggestion that we always assume at least two rows in the absence of a unique index.

(2) Somehow use effective_cache_size in combination with some sort of current activity metrics to dynamically adjust random access costs. (I know, that one's total hand-waving, but it seems to have some possibility of better modeling reality than what we currently do.)

-Kevin
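(For concreteness, the kind of settings described above -- illustrative values only, they have to be tuned per system:)

-- highly cached database: push both page costs toward the cost of a cache hit
SET random_page_cost = 0.1;
SET seq_page_cost = 0.1;

-- more conservative: keep random reads at roughly twice sequential
SET random_page_cost = 2.0;
SET seq_page_cost = 1.0;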
On Mon, Aug 10, 2009 at 11:19 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
>> [...]
>
> But in our environment the most common cause of a sub-optimal planning
> choice is over-estimating the cost of a nested loop. We've never been
> able to get good plans overall without dropping random_page_cost to
> twice the seq_page_cost or less -- in highly cached situations we
> routinely drop both to 0.1.

Even if our statistics were perfect, you'd still need to do this if your database is mostly cached. That's routine tuning.

> Creating further bias against nested loop
> plans just means we'll have to push the numbers further from what they
> purportedly represent.

Not at all. You'd have to set them closer to their real values, i.e. the cost of reading a page from cache.

> It seems to me a significant unsolved problem
> is properly accounting for caching effects.

Definitely true.

> That said, I have no really clear idea on how best to solve that
> problem. The only ideas which recur when facing these issues are:
>
> (1) The planner refuses to deal with fractional estimates of how
> many rows will be returned in a loop -- it treats 0.01 as 1 on the
> basis that you can't read a fractional row, rather than as a 1% chance
> that you will read a row and need to do the related work. I have
> always thought that changing this might allow more accurate estimates;
> perhaps I should hack a version which behaves that way and test it as
> a "proof of concept." Note that this is diametrically opposed to your
> suggestion that we always assume at least two rows in the absence of a
> unique index.

You're right. The problem is that you can get hosed in both directions. My queries are always referencing a column that they have a foreign key towards, so this never happens to me. But it does happen to other people.

> (2) Somehow use effective_cache_size in combination with some sort of
> current activity metrics to dynamically adjust random access costs.
> (I know, that one's total hand-waving, but it seems to have some
> possibility of better modeling reality than what we currently do.)

Yeah, I gave a lightning talk on this at PGcon, but I haven't had time to do anything with it. There are a couple of problems. One is that you have to have a source for your current activity metrics. Since a lot of the pages of interest will be in the OS buffer pool rather than PG shared buffers, there's no easy way to handle this, and you also have to keep in mind that plans can be cached and reused, so you need the estimates not to change too fast or you'll have horrible plan stability problems. The other is that right now, the page costs are constant. Making them per-relation will mean that they require syscache lookups. I'm not sure whether that's expensive enough to impact planning time, and if so what to do about it.

...Robert
Robert Haas wrote:
> On Mon, Aug 10, 2009 at 11:19 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
>> (2) Somehow use effective_cache_size in combination with some sort of
>> current activity metrics to dynamically adjust random access costs.
>> (I know, that one's total hand-waving, but it seems to have some
>> possibility of better modeling reality than what we currently do.)

I was disappointed when I learned that effective_cache_size doesn't get generally used to predict the likelihood of a buffer fetch requiring physical io.

> Yeah, I gave a lightning talk on this at PGcon, but I haven't had time
> to do anything with it. There are a couple of problems. One is that
> you have to have a source for your current activity metrics. Since a
> lot of the pages of interest will be in the OS buffer pool rather than
> PG shared buffers, there's no easy way to handle this

While there are portability concerns, mmap + mincore works across BSD, Linux, Solaris and will return a vector of file pages in the OS buffer pool. So it's certainly possible that on supported systems, an activity monitor can have direct knowledge of OS caching effectiveness on a per relation/index basis.

--
-Devin