Re: The standard 'why does it take so long' question - Mailing list pgsql-general
From | Nigel J. Andrews |
---|---|
Subject | Re: The standard 'why does it take so long' question |
Date | |
Msg-id | Pine.LNX.4.21.0208091136240.3235-100000@ponder.fairway2k.co.uk Whole thread Raw |
In response to | Re: The standard 'why does it take so long' question (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: The standard 'why does it take so long' question
|
List | pgsql-general |
First I'll just say thanks to Joe Conway for the suggestion of an index on the poster_id column. Although I didn't show it I do have an index on that field as well, however, as I understand it the index on (poster_id,time) should fulfil this same function. On Fri, 9 Aug 2002, Tom Lane wrote: > "Nigel J. Andrews" <nandrews@investsystems.co.uk> writes: > > The first question is why would the index scan on chat_user take > > significantly longer than before? > > I think the indexscan is having to skip over more irrelevant data in the > second case. Think about it: you can scan an index range consisting of > a single user's posts between times T1 and T2, or you can scan an index > range consisting of all posts between times T1 and T2. The second > column of the index will save you from actually going to the heap for > posts from other users, but you'll still have to pass over those index > entries, because the contiguous range of index entries that covers the > data you want will include a a lot of posts from other users. That was the idea that popped into my head while I was writing the questions so that certainly seems reasonable to me. > > What interests me is why the planner chose the second index when it > had a choice; I'd have thought its cost models were good enough to > handle this subtlety. Can you post the pg_stats rows for the columns > in question? I really don't know sufficient to be able to see why the stats would favour one index over the other. Although looking at the pg_stats entries below now I notice that the correlation for the time column is 1, compared to 0.058 for the poster_id. As there is a natural order to the data, i.e. it's loaded in time order, I am not entirely surprised that there's a high correlation, just slightly surprised the stats gatherer picked up on it. Note that the poster_ids for the example query are also shown below and are not listed in the most_common_vals array. I've also shown the human understandable values for the time limits in the query and due to the nature/distribution of this column it's unsurprising to find that I managed to pick a time period that didn't include one of the more common times. So, what's making the planner choose the slower index is the correlation figure right? archive=> select * from pg_stats where attname = 'time' or attname = 'poster_id' and tablename = 'chat_post'; tablename | attname | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation -----------+-----------+-----------+-----------+------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------- chat_post | poster_id | 0 | 2 | 338 | {2149,731,1130,2595,1879,1656,688,1842,2473,1747} | {0.0603333,0.0306667,0.0306667,0.03,0.0286667,0.0276667,0.0273333,0.025,0.02,0.0196667}| {4,252,521,819,1112,1423,1639,1820,2096,2427,2879} | 0.0583958 chat_post | time | 0 | 8 | -0.350654 | {"1998-03-26 11:09:00-05","1999-05-13 00:49:00-04","2000-12-1413:05:00-05","2001-01-25 12:18:00-05","2002-04-04 11:37:00-05","2002-05-28 15:21:00-04","2002-07-1009:53:00-04"} | {0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667} | {"1998-03-06 01:26:00-05","1998-08-2111:00:00-04","1999-02-22 15:15:00-05","1999-07-22 05:37:00-04","2000-01-24 09:53:00-05","2000-10-2317:45:00-04","2001-02-26 16:22:00-05","2001-06-28 20:19:00-04","2001-12-06 14:06:00-05","2002-04-2616:05:00-04","2002-08-09 00:10:00-04"} | 1 (2 rows) archive=> select * from chat_user where name in ('thrifty', 'hope1'); id | name | lower_name ------+---------+------------ 2494 | thrifty | thrifty 1112 | hope1 | hope1 (2 rows) archive=> select 'epoch'::timestamptz + '959904000 seconds'::interval as from, 'epoch'::timestamptz + '1023667200 seconds'::intervalas to; from | to ------------------------+------------------------ 2000-06-02 01:00:00+01 | 2002-06-10 01:00:00+01 (1 row) Thanks for the input Tom. -- Nigel J. Andrews Director --- Logictree Systems Limited Computer Consultants
pgsql-general by date: