Indexing question - Mailing list pgsql-performance
From | Stan Bielski |
---|---|
Subject | Indexing question |
Date | |
Msg-id | Pine.LNX.3.96L.1040615172733.5306P-100000@elysium.pdl.cmu.edu Whole thread Raw |
Responses |
Re: Indexing question
|
List | pgsql-performance |
Hello all, I have a rather large table (~20 GB) of network logs taken over the period of a month and stored under postgres 7.3. I'm trying to create an indexing scheme so that I can do a query with a "where time > foo and time < bar" and get the results without too much waiting. With this in mind, I created a new database ordering each log entry by the time values (seconds, and microseconds), and then created an index on the time value for seconds. I didn't use the clusterdb command, but I did do a "select * into foo_tbl order by time", which I understand to be the functional equivalent. The result looks something like this: Table "public.allflow_tv_mydoom" Column | Type | Modifiers ------------+---------+----------- tv_s | bigint | tv_us | bigint | src | inet | dst | inet | sport | integer | dport | integer | bytes_in | bigint | bytes_out | bigint | isn_in | bigint | isn_out | bigint | num_starts | integer | result | integer | flags | integer | age_s | bigint | age_us | bigint | ttl_left | bigint | end_s | bigint | end_us | bigint | Indexes: allflow_tv_mydoom_x btree (tv_s) with tv_s, of course, being my value in seconds. I followed this up with an ANALYZE so postrgres could absorb this structure into its logic. However, whenever I want to do my query, it *insists* on doing a sequential scan, even when I explicitly turn sequential scan off in postgres.conf... here's an example: standb=# explain select * from allflow_tv_mydoom where tv_s < 1074200099 and tv_s > 107506499; QUERY PLAN ----------------------------------------------------------------------------------- Seq Scan on allflow_tv_mydoom (cost=100000000.00..102303307.94 rows=1 width=132) Filter: ((tv_s < 1074200099) AND (tv_s > 107506499)) (2 rows) In this query I'm basically asking for a day's worth of data. It should be straightforward: it's all laid out in order and indexed, but it still insists on (what I believe to be) tearing through the entire DB and filtering out the other time values as it sees them. Even if I want just the data from a particular second, it does the same thing: standb=# explain select * from allflow_tv_mydoom where tv_s = 1074200099; QUERY PLAN ------------------------------------------------------------------------------------- Seq Scan on allflow_tv_mydoom (cost=100000000.00..102140682.45 rows=145 width=132) Filter: (tv_s = 1074200099) (2 rows) Naturally, this is incredibly frustrating because it takes forever, regardless of the query. The funny thing though is that I have the same data in another table where it is ordered and indexed by IP address, and the queries use the index and work the way I want them to. Example: standb=# explain select * from flow_ip_mydoom where src = '10.0.5.5'; QUERY PLAN --------------------------------------------------------------------------------------------- Index Scan using flow_ip_mydoom_x on flow_ip_mydoom (cost=0.00..333.41 rows=376 width=132) Index Cond: (src = '10.0.5.5'::inet) (2 rows) Can someone please explain to me what's going on and how to fix it? If there's an easier way to 'jump' to different time regions in the table without indexing all the different seconds values, I'd like to know this as well. Thanks a bunch, -S
pgsql-performance by date: