Re: indexing for distinct search in timestamp based table - Mailing list pgsql-performance
From: David Rowley
Subject: Re: indexing for distinct search in timestamp based table
Date:
Msg-id: 20B8B02003014E48B263F657E9CEFAE3@amd64
In response to: indexing for distinct search in timestamp based table ("Rainer Mager" <rainer@vanten.com>)
Responses: Re: indexing for distinct search in timestamp based table
List: pgsql-performance
I once had a similar performance problem when looking for all matching rows between two timestamps; in fact, that's why I'm here today. The problem was with MySQL. I had some tables of around 10 million rows and all my searching was timestamp based, and MySQL didn't do what I wanted. I found that using a CLUSTERed index with PostgreSQL was lightning quick, although the sets of matching rows I was working with were mostly not much over the 100k mark.

I'm wondering if clustering the table on ad_log_start_time will help cut down on random reads -- that is, if you can afford to block the users while PostgreSQL clusters the table. If you're inserting in order of the start_time column (as I was), then the cluster should almost maintain itself (I think), provided you're not updating or deleting, which I'd assume is the case since it looks like a log table.

David.

-----Original Message-----
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Rainer Mager
Sent: 28 August 2008 09:06
To: pgsql-performance@postgresql.org
Subject: [PERFORM] indexing for distinct search in timestamp based table

I'm looking for some help in speeding up searches. My table is pretty simple (see below), but somewhat large and continuously growing. Currently it has about 50 million rows.

The table is (I know I have excessive indexes; I'm trying to work out the appropriate ones and drop the extras):

                                  Table "public.ad_log"
    Column    |            Type             |                         Modifiers
--------------+-----------------------------+------------------------------------------------------------
 ad_log_id    | integer                     | not null default nextval('ad_log_ad_log_id_seq'::regclass)
 channel_name | text                        | not null
 player_name  | text                        | not null
 ad_name      | text                        | not null
 start_time   | timestamp without time zone | not null
 end_time     | timestamp without time zone | not null
Indexes:
    "ad_log_pkey" PRIMARY KEY, btree (ad_log_id)
    "ad_log_channel_name_key" UNIQUE, btree (channel_name, player_name, ad_name, start_time, end_time)
    "ad_log_ad_and_start" btree (ad_name, start_time)
    "ad_log_ad_name" btree (ad_name)
    "ad_log_all" btree (channel_name, player_name, start_time, ad_name)
    "ad_log_channel_name" btree (channel_name)
    "ad_log_end_time" btree (end_time)
    "ad_log_player_and_start" btree (player_name, start_time)
    "ad_log_player_name" btree (player_name)
    "ad_log_start_time" btree (start_time)

The query I'm trying to speed up is below. In it, the <field> tag can be one of channel_name, player_name, or ad_name. I'm actually trying to return the distinct values, and I found GROUP BY to be slightly faster than DISTINCT. Also, any of those fields may be unspecified in the WHERE clause, in which case we use '%', but it seems Postgres optimizes that pretty well.
SELECT <field>
FROM ad_log
WHERE channel_name LIKE :channel_name
  AND player_name LIKE :player_name
  AND ad_name LIKE :ad_name
  AND start_time BETWEEN :start_date AND (date(:end_date) + 1)
GROUP BY <field>
ORDER BY <field>

A typical query is:

explain analyze
SELECT channel_name
FROM ad_log
WHERE channel_name LIKE '%'
  AND ad_name LIKE '%'
  AND start_time BETWEEN '2008-07-01' AND (date('2008-07-28') + 1)
GROUP BY channel_name
ORDER BY channel_name;

with the result being:

                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=1163169.02..1163169.03 rows=5 width=10) (actual time=75460.187..75460.192 rows=15 loops=1)
   Sort Key: channel_name
   Sort Method:  quicksort  Memory: 17kB
   ->  HashAggregate  (cost=1163168.91..1163168.96 rows=5 width=10) (actual time=75460.107..75460.114 rows=15 loops=1)
         ->  Bitmap Heap Scan on ad_log  (cost=285064.30..1129582.84 rows=13434427 width=10) (actual time=8506.250..65771.597 rows=13701296 loops=1)
               Recheck Cond: ((start_time >= '2008-07-01 00:00:00'::timestamp without time zone) AND (start_time <= '2008-07-29'::date))
               Filter: ((channel_name ~~ '%'::text) AND (ad_name ~~ '%'::text))
               ->  Bitmap Index Scan on ad_log_start_time  (cost=0.00..281705.70 rows=13434427 width=0) (actual time=8488.443..8488.443 rows=13701296 loops=1)
                     Index Cond: ((start_time >= '2008-07-01 00:00:00'::timestamp without time zone) AND (start_time <= '2008-07-29'::date))
 Total runtime: 75460.361 ms

It seems to me there should be some way to create an index to speed this up, but the various ones I've tried so far haven't helped. Any suggestions would be greatly appreciated.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
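A minimal sketch of the clustering approach suggested above, using the ad_log table and ad_log_start_time index from the quoted message; the CLUSTER ... USING form assumes PostgreSQL 8.3, with the older spelling shown for earlier releases, and the pg_stats check is an optional extra for seeing how close the heap already is to start_time order:

-- One-off physical reorder of ad_log by start_time; takes an
-- exclusive lock on the table for the duration.
CLUSTER ad_log USING ad_log_start_time;   -- 8.3 syntax
-- CLUSTER ad_log_start_time ON ad_log;   -- pre-8.3 equivalent
-- Refresh planner statistics afterwards.
ANALYZE ad_log;

-- A correlation near 1.0 means the heap is already nearly in
-- start_time order, i.e. in-order inserts are keeping the
-- clustering intact on their own.
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'ad_log' AND attname = 'start_time';

Note that CLUSTER is a one-off reorder rather than a maintained property; new rows are simply appended afterwards, which is why inserting in start_time order (and not updating or deleting) preserves the benefit over time.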