Re: Planner's choice - Mailing list pgsql-general
From | Nigel J. Andrews |
---|---|
Subject | Re: Planner's choice |
Date | |
Msg-id | Pine.LNX.4.21.0211131907450.14887-200000@ponder.fairway2k.co.uk Whole thread Raw |
In response to | Re: Planner's choice (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-general |
On Wed, 13 Nov 2002, Tom Lane wrote: > "Nigel J. Andrews" <nandrews@investsystems.co.uk> writes: > > where chat_post_timeuser_idx is defined on the columns (time,poster_id) > > and chat_post_usertime_idx is defined on the columns (poster_id,time) > > > Why is the planner not choosing the user_time index [for] > > > avid_chat_archive=> explain analyze select * from chat_post where poster_id = '1600' order by time desc limit 2; > > NOTICE: QUERY PLAN: > > > Limit (cost=0.00..32.40 rows=2 width=46) (actual time=96204.53..96204.71 rows=2 loops=1) > > -> Index Scan Backward using chat_post_time_idx on chat_post (cost=0.00..42370.93 rows=2616 width=46) (actual time=96204.49..96204.64rows=3 loops=1) > > Total runtime: 96205.18 msec > > If you'd said "order by poster_id desc, time desc" then that index would be > considered to match the ORDER BY clause, and so would be usable in this > same type of plan. As-is, the index is only useful for matching > poster_id and not for obtaining the required order, so the only plan > type considered for it involves an explicit sort step, which isn't > considered a win for the estimated number of rows matching the poster_id. Ok, I see. Thinking about this a similar question may have been on the list several months ago. I was just thinking in terms that the first column selects out of a btree on only that column and that each leaf node in that tree contains a btree of the second column. Given that structure I thought it would be reasonable for the planner to choose what I was expecting it to, but then I was guessing. > > > My plan now is to maintain my own set of poster_id stats and use one > > of several query variants depending on what they say but this requires > > at least some understanding of the choices made by the planner. > > Rather than maintaining your own stats, consider boosting the statistics > target for the poster_id column. You probably want the pg_stats info to > cover all the poster_ids that account for more than 1% of the entries. > The n_distinct value should improve too, producing a better estimate for > the infrequent poster_ids even though they're not explicitly stored. > For completeness I've attached the real [rewritten] query. The original was much more simply of the form: select collist from chat_post p, chat_user u where p.poster_id = u.id and p.time > X and p.time < Y and u.name = 'someone' order by p.time, p.post_number I tried increasing the stats gathering. 2000 and 1000 targets ran out of memory and 500 made no difference. The lowest of the top 500 targets contained way less than 1%. The problem being that the original or rewritten forms either start at time X and scan the index until time Y or do a seqscan. In the rewritten form with the limit this is a big win where the result 'quota' is fill early in the time interval but a huge loss when not. This is because the index scan results in the entire table being scanned anyway, with the exists subselect running for each tuple. So what I am trying to achieve is a query where both high and low occuring poster_ids can be retrieved in a reasonable time and I can't see how to do that without having more than one form of the query and submitting which ever one is going to force use of whatever index I think most appropiate. This I think is reasonable, despite requiring an extra query to get the stats, since as the DBA I know more about the data than the general usage optimser does. I'm happy to take other suggests of course and you've already made me think of something I haven't tried. -- Nigel J. Andrews
Attachment
pgsql-general by date: