Thread: overzealous sorting?
In Mammoth Replicator (PG 8.3) I have a table described as

                                 Table "public.tevent_cdr"
     Column     |           Type           |                         Modifiers
----------------+--------------------------+------------------------------------------------------------
 event_id       | integer                  | not null default nextval(('event_id_seq'::text)::regclass)
 timestamp      | timestamp with time zone | not null
 classification | character varying        | not null
 area           | character varying        | not null
 kind           | character varying        |
 device_id      | integer                  |
 device_name    | character varying        |
 fleet_id       | integer                  |
 fleet_name     | character varying        |
 customer_id    | integer                  |
 customer_name  | character varying        |
 event          | text                     |
Indexes:
    "tevent_cdr_event_id" UNIQUE, btree (event_id)
    "tevent_cdr_timestamp" btree ("timestamp")
Check constraints:
    "tevent_cdr_classification_check" CHECK (classification::text = 'cdr'::text)
Inherits: tevent

This simple query puzzles me. Why does it need to sort the records?
Don't they come from the index in order?

explain analyze select * from tevent_cdr
    where timestamp >= '2011-09-09 12:00:00.000000+0'
      and timestamp <  '2011-09-09 13:00:00.000000+0'
      and classification = 'cdr'
    order by timestamp;

                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=9270.93..9277.12 rows=2477 width=588) (actual time=9.219..11.489 rows=2480 loops=1)
   Sort Key: "timestamp"
   Sort Method:  quicksort  Memory: 2564kB
   ->  Bitmap Heap Scan on tevent_cdr  (cost=57.93..9131.30 rows=2477 width=588) (actual time=0.440..3.923 rows=2480 loops=1)
         Recheck Cond: (("timestamp" >= '2011-09-09 22:00:00+10'::timestamp with time zone) AND ("timestamp" < '2011-09-09 23:00:00+10'::timestamp with time zone))
         Filter: ((classification)::text = 'cdr'::text)
         ->  Bitmap Index Scan on tevent_cdr_timestamp  (cost=0.00..57.31 rows=2477 width=0) (actual time=0.404..0.404 rows=2480 loops=1)
               Index Cond: (("timestamp" >= '2011-09-09 22:00:00+10'::timestamp with time zone) AND ("timestamp" < '2011-09-09 23:00:00+10'::timestamp with time zone))
 Total runtime: 13.847 ms
(9 rows)

--
Anthony Shipman               | flailover systems: When one goes down it
Anthony.Shipman@symstream.com | flails about until the other goes down too.
On Mon, 26 Sep 2011 16:28:15 +1000, anthony.shipman@symstream.com wrote:

> In Mammoth Replicator (PG 8.3) I have a table described as
> [...]
> This simple query puzzles me. Why does it need to sort the records?
> Don't they come from the index in order?
> [...]

Because Index Scans return rows in index order; Bitmap Index Scans do not.
A Bitmap Index Scan builds a list of pages to visit, which the Bitmap Heap
Scan step then reads in physical order, so the result has to be sorted
afterwards.

Marc.
On Monday 26 September 2011 19:39, Marc Cousin wrote:
> Because Index Scans return rows in index order; Bitmap Index Scans do not.
> A Bitmap Index Scan builds a list of pages to visit, which the Bitmap Heap
> Scan step then reads in physical order, so the result has to be sorted
> afterwards.
>
> Marc.

Where does this bitmap index scan come from? It seems to negate the
advantages of b-tree indexes described in the "Indexes and ORDER BY"
section of the manual. If I do "set enable_bitmapscan = off;" the query
runs a bit faster, although with a larger time range it reverts to a
sequential scan.

--
Anthony Shipman               | Consider the set of blacklists that
Anthony.Shipman@symstream.com | do not blacklist themselves...
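Spelled out, the test Anthony describes looks like this - a sketch only,
reusing the query from the start of the thread; enable_bitmapscan is a
per-session diagnostic switch, not something to change globally:

    -- Diagnosis only: forbid bitmap scans for this session, then compare plans.
    set enable_bitmapscan = off;

    explain analyze select * from tevent_cdr
        where timestamp >= '2011-09-09 12:00:00.000000+0'
          and timestamp <  '2011-09-09 13:00:00.000000+0'
          and classification = 'cdr'
        order by timestamp;
    -- With bitmap scans disabled, the planner should fall back to a plain
    -- Index Scan on tevent_cdr_timestamp, which returns rows already in
    -- timestamp order, so the Sort node disappears from the plan.

    reset enable_bitmapscan;  -- restore the default afterwards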
On Tue, 27 Sep 2011 12:45:00 +1000, anthony.shipman@symstream.com wrote:

> Where does this bitmap index scan come from? It seems to negate the
> advantages of b-tree indexes described in the "Indexes and ORDER BY"
> section of the manual. If I do "set enable_bitmapscan = off;" the query
> runs a bit faster, although with a larger time range it reverts to a
> sequential scan.

A Bitmap Index Scan is just another way to use a btree index. It is
typically chosen when a bigger part of the table is required: it costs
more than a plain index scan to retrieve a few records, but less when a
lot of records are needed.

Your tests show that index scans are a bit faster on this query. But that
is probably only true while most of the needed data is cached, which is
likely your situation, as you are running the same query all the time.
The bitmap index scan is probably cheaper when the data isn't in cache.
You could also see the bitmap index scan as the safer choice: it won't
perform as badly when data is not cached (less random I/O). :)

The thing is, the optimizer doesn't know whether your data will be in
cache when you run your query. If you are sure most of your data is in
cache most of the time, you could lower random_page_cost to tell the
planner that the data is cached. But if the win on this query is small,
it may not be worth it.
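Marc's random_page_cost suggestion can be tried per session before
touching the server configuration. A sketch, assuming the default of 4.0
is in effect; the value 2.0 is just an illustrative starting point:

    -- Tell the planner that random I/O is cheaper than the default assumes,
    -- which is reasonable when most of the table is already cached.
    set random_page_cost = 2.0;

    explain analyze select * from tevent_cdr
        where timestamp >= '2011-09-09 12:00:00.000000+0'
          and timestamp <  '2011-09-09 13:00:00.000000+0'
          and classification = 'cdr'
        order by timestamp;
    -- A low enough value should make the ordered Index Scan win on cost,
    -- removing the Sort without disabling bitmap scans outright.

    reset random_page_cost;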
On Tuesday 27 September 2011 18:54, Marc Cousin wrote:
> The thing is, the optimizer doesn't know whether your data will be in
> cache when you run your query. If you are sure most of your data is in
> cache most of the time, you could lower random_page_cost to tell the
> planner that the data is cached. But if the win on this query is small,
> it may not be worth it.

What I really want is to just read a sequence of records in timestamp
order between two timestamps. The number of records to be read may be in
the millions, totalling more than 1GB of data, so I'm trying to read them
a slice at a time, but I can't get PG to do just this.

If I use offset and limit to grab a slice of the records from a large
timestamp range, then PG will grab all of the records in the range, sort
them on disk and return just the slice I want. This is absurdly slow.

The query that I've shown is one of a sequence of queries with the
timestamp range progressing in steps of 1 hour through the full range.
All I want PG to do is find the range in the index, find the matching
records in the table and return them. All of the planner's cleverness
just seems to get in the way.

--
Anthony Shipman               | Consider the set of blacklists that
Anthony.Shipman@symstream.com | do not blacklist themselves...
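For contrast, the two slicing strategies Anthony describes look roughly
like this (a sketch; the offset, limit and range values are made up for
illustration):

    -- Slow: OFFSET forces PG to fetch and sort every record in the large
    -- range, only to throw away all the rows before the requested slice.
    select * from tevent_cdr
        where timestamp >= '2011-09-09 00:00:00+0'
          and timestamp <  '2011-09-10 00:00:00+0'
          and classification = 'cdr'
        order by timestamp
        offset 100000 limit 10000;

    -- The hour-step pattern: each query narrows the range itself, so only
    -- one hour's worth of records is ever fetched and sorted.
    select * from tevent_cdr
        where timestamp >= '2011-09-09 12:00:00+0'
          and timestamp <  '2011-09-09 13:00:00+0'
          and classification = 'cdr'
        order by timestamp;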
On 27/09/11 22:05, anthony.shipman@symstream.com wrote:
>
> What I really want is to just read a sequence of records in timestamp
> order between two timestamps. [...] All of the planner's cleverness
> just seems to get in the way.

It is not immediately clear that the planner is making the wrong choices
here. Index scans are not always the best choice; it depends heavily on
how well the column concerned correlates with the physical order of the
table's heap file. I suspect the planner is choosing the bitmap scan
because that correlation is low (consult pg_stats to see). Now if you
think that the table's heap data is cached anyway, then this is not such
an issue - but you have to tell the planner that by reducing
random_page_cost (as advised previously). Give it a try and report back!

regards

Mark
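Checking the correlation Mark mentions is a single query against the
pg_stats view - a sketch, using the table and column names from the start
of the thread:

    -- correlation runs from -1 to 1; values near 0 mean index order and
    -- heap order are unrelated, which makes a plain index scan expensive.
    select tablename, attname, correlation
        from pg_stats
        where tablename = 'tevent_cdr'
          and attname = 'timestamp';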
On Tue, 27 Sep 2011 19:05:09 +1000, anthony.shipman@symstream.com wrote:

> What I really want is to just read a sequence of records in timestamp
> order between two timestamps. The number of records to be read may be in
> the millions, totalling more than 1GB of data, so I'm trying to read them
> a slice at a time, but I can't get PG to do just this.
> [...]

Maybe you should try using a cursor, if you don't know in advance where
you'll stop. Combined with a very low cursor_tuple_fraction, this will
probably give you what you want (a fast-start plan).
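Marc's cursor suggestion, sketched out. The cursor name, range and fetch
size here are arbitrary, and note that cursor_tuple_fraction only became
a settable parameter in 8.4, so on 8.3 the cursor alone has to provide
the fast-start behaviour:

    begin;
    -- 8.4+: bias the planner towards plans that return the first rows quickly.
    set cursor_tuple_fraction = 0.01;

    declare cdr_slice cursor for
        select * from tevent_cdr
            where timestamp >= '2011-09-09 12:00:00+0'
              and timestamp <  '2011-09-10 12:00:00+0'
              and classification = 'cdr'
            order by timestamp;

    fetch 5000 from cdr_slice;  -- repeat until it returns no rows
    close cdr_slice;
    commit;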
On Tuesday 27 September 2011 19:22, Mark Kirkwood wrote:
> It is not immediately clear that the planner is making the wrong choices
> here. Index scans are not always the best choice; it depends heavily on
> how well the column concerned correlates with the physical order of the
> table's heap file. I suspect the planner is choosing the bitmap scan
> because that correlation is low (consult pg_stats to see). Now if you
> think that the table's heap data is cached anyway, then this is not such
> an issue - but you have to tell the planner that by reducing
> random_page_cost (as advised previously). Give it a try and report back!

I don't expect that any of it is cached. It is supposed to be a
once-a-day linear scan of a slice of the table. The correlation on the
timestamp is reported as 0.0348395. I can't use CLUSTER since it would
lock the table for too long.

I would try a cursor, but my framework for this case doesn't support
cursors. In a later version of the framework I have tried cursors and, in
the tests I've done, haven't found them to be faster than reading in
slices. Anyway, at the moment it is fast enough.

Thanks

--
Anthony Shipman               | It's caches all the way
Anthony.Shipman@symstream.com | down.