Indexing question - Mailing list pgsql-performance

From Stan Bielski
Subject Indexing question
Date
Msg-id Pine.LNX.3.96L.1040615172733.5306P-100000@elysium.pdl.cmu.edu
Whole thread Raw
Responses Re: Indexing question
List pgsql-performance
Hello all,

I have a rather large table (~20 GB) of network logs taken over the period
of a month and stored under postgres 7.3. I'm trying to create an indexing
scheme so that I can do a query with a "where time > foo and time < bar"
and get the results without too much waiting. With this in mind, I created
a new database ordering each log entry by the time values (seconds, and
microseconds), and then created an index on the time value for seconds. I
didn't use the clusterdb command, but I did do a "select * into foo_tbl
order by time", which I understand to be the functional equivalent. The
result looks something like this:

 Table "public.allflow_tv_mydoom"
   Column   |  Type   | Modifiers
------------+---------+-----------
 tv_s       | bigint  |
 tv_us      | bigint  |
 src        | inet    |
 dst        | inet    |
 sport      | integer |
 dport      | integer |
 bytes_in   | bigint  |
 bytes_out  | bigint  |
 isn_in     | bigint  |
 isn_out    | bigint  |
 num_starts | integer |
 result     | integer |
 flags      | integer |
 age_s      | bigint  |
 age_us     | bigint  |
 ttl_left   | bigint  |
 end_s      | bigint  |
 end_us     | bigint  |
Indexes: allflow_tv_mydoom_x btree (tv_s)

with tv_s, of course, being my value in seconds. I followed this up with
an ANALYZE so postrgres could absorb this structure into its logic.
However, whenever I want to do my query, it *insists* on doing a
sequential scan, even when I explicitly turn sequential scan off in
postgres.conf...  here's an example:

standb=# explain select * from allflow_tv_mydoom where tv_s < 1074200099
and tv_s > 107506499;
                                    QUERY PLAN
-----------------------------------------------------------------------------------
 Seq Scan on allflow_tv_mydoom  (cost=100000000.00..102303307.94 rows=1
width=132)
   Filter: ((tv_s < 1074200099) AND (tv_s > 107506499))
(2 rows)

In this query I'm basically asking for a day's worth of data. It should be
straightforward: it's all laid out in order and indexed, but it still
insists on (what I believe to be) tearing through the entire DB and
filtering out the other time values as it sees them. Even if I want just
the data from a particular second, it does the same thing:

standb=# explain select * from allflow_tv_mydoom where tv_s = 1074200099;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Seq Scan on allflow_tv_mydoom  (cost=100000000.00..102140682.45 rows=145
width=132)
   Filter: (tv_s = 1074200099)
(2 rows)

Naturally, this is incredibly frustrating because it takes forever,
regardless of the query.

The funny thing though is that I have the same data in another table
where it is ordered and indexed by IP address, and the queries use the
index and work the way I want them to. Example:

standb=# explain select * from flow_ip_mydoom where src = '10.0.5.5';
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Index Scan using flow_ip_mydoom_x on flow_ip_mydoom  (cost=0.00..333.41
rows=376 width=132)
   Index Cond: (src = '10.0.5.5'::inet)
(2 rows)

Can someone please explain to me what's going on and how to fix it? If
there's an easier way to 'jump' to different time regions in the table
without indexing all the different seconds values, I'd like to know this
as well.

Thanks a bunch,
-S





pgsql-performance by date:

Previous
From: "Janio Rosa da Silva"
Date:
Subject: Hi!
Next
From: "Thanks"
Date:
Subject: pg_fetch_array