Re: Planner's choice - Mailing list pgsql-general

From Nigel J. Andrews
Subject Re: Planner's choice
Date
Msg-id Pine.LNX.4.21.0211131907450.14887-200000@ponder.fairway2k.co.uk
Whole thread Raw
In response to Re: Planner's choice  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
On Wed, 13 Nov 2002, Tom Lane wrote:

> "Nigel J. Andrews" <nandrews@investsystems.co.uk> writes:
> > where chat_post_timeuser_idx is defined on the columns (time,poster_id)
> > and chat_post_usertime_idx is defined on the columns (poster_id,time)
>
> > Why is the planner not choosing the user_time index [for]
>
> > avid_chat_archive=> explain analyze select * from chat_post where poster_id = '1600' order by time desc limit 2;
> > NOTICE:  QUERY PLAN:
>
> > Limit  (cost=0.00..32.40 rows=2 width=46) (actual time=96204.53..96204.71 rows=2 loops=1)
> >   ->  Index Scan Backward using chat_post_time_idx on chat_post  (cost=0.00..42370.93 rows=2616 width=46) (actual
time=96204.49..96204.64rows=3 loops=1) 
> > Total runtime: 96205.18 msec
>
> If you'd said "order by poster_id desc, time desc" then that index would be
> considered to match the ORDER BY clause, and so would be usable in this
> same type of plan.  As-is, the index is only useful for matching
> poster_id and not for obtaining the required order, so the only plan
> type considered for it involves an explicit sort step, which isn't
> considered a win for the estimated number of rows matching the poster_id.

Ok, I see. Thinking about this a similar question may have been on the list
several months ago. I was just thinking in terms that the first column selects
out of a btree on only that column and that each leaf node in that tree
contains a btree of the second column. Given that structure I thought it would
be reasonable for the planner to choose what I was expecting it to, but then I
was guessing.

>
> > My plan now is to maintain my own set of poster_id stats and use one
> > of several query variants depending on what they say but this requires
> > at least some understanding of the choices made by the planner.
>
> Rather than maintaining your own stats, consider boosting the statistics
> target for the poster_id column.  You probably want the pg_stats info to
> cover all the poster_ids that account for more than 1% of the entries.
> The n_distinct value should improve too, producing a better estimate for
> the infrequent poster_ids even though they're not explicitly stored.
>

For completeness I've attached the real [rewritten] query. The original was
much more simply of the form:

select collist
 from chat_post p, chat_user u
 where p.poster_id = u.id and p.time > X and p.time < Y and u.name = 'someone'
 order by p.time, p.post_number


I tried increasing the stats gathering. 2000 and 1000 targets ran out of memory
and 500 made no difference. The lowest of the top 500 targets contained way
less than 1%.

The problem being that the original or rewritten forms either start at time X
and scan the index until time Y or do a seqscan. In the rewritten form with the
limit this is a big win where the result 'quota' is fill early in the time
interval but a huge loss when not. This is because the index scan results in
the entire table being scanned anyway, with the exists subselect running for
each tuple.

So what I am trying to achieve is a query where both high and low occuring
poster_ids can be retrieved in a reasonable time and I can't see how to do that
without having more than one form of the query and submitting which ever one is
going to force use of whatever index I think most appropiate. This I think is
reasonable, despite requiring an extra query to get the stats, since as the DBA
I know more about the data than the general usage optimser does. I'm happy to
take other suggests of course and you've already made me think of something I
haven't tried.


--
Nigel J. Andrews

Attachment

pgsql-general by date:

Previous
From: Richard Huxton
Date:
Subject: Re: readline help
Next
From: "Mark Wilson"
Date:
Subject: Re: Invalid Dependancies