Thread: Indexing question

Indexing question

From
Stan Bielski
Date:
Hello all,

I have a rather large table (~20 GB) of network logs taken over the period
of a month and stored under postgres 7.3. I'm trying to create an indexing
scheme so that I can do a query with a "where time > foo and time < bar"
and get the results without too much waiting. With this in mind, I created
a new database ordering each log entry by the time values (seconds, and
microseconds), and then created an index on the time value for seconds. I
didn't use the clusterdb command, but I did do a "select * into foo_tbl
order by time", which I understand to be the functional equivalent. The
result looks something like this:

 Table "public.allflow_tv_mydoom"
   Column   |  Type   | Modifiers
------------+---------+-----------
 tv_s       | bigint  |
 tv_us      | bigint  |
 src        | inet    |
 dst        | inet    |
 sport      | integer |
 dport      | integer |
 bytes_in   | bigint  |
 bytes_out  | bigint  |
 isn_in     | bigint  |
 isn_out    | bigint  |
 num_starts | integer |
 result     | integer |
 flags      | integer |
 age_s      | bigint  |
 age_us     | bigint  |
 ttl_left   | bigint  |
 end_s      | bigint  |
 end_us     | bigint  |
Indexes: allflow_tv_mydoom_x btree (tv_s)

with tv_s, of course, being my value in seconds. I followed this up with
an ANALYZE so postrgres could absorb this structure into its logic.
However, whenever I want to do my query, it *insists* on doing a
sequential scan, even when I explicitly turn sequential scan off in
postgres.conf...  here's an example:

standb=# explain select * from allflow_tv_mydoom where tv_s < 1074200099
and tv_s > 107506499;
                                    QUERY PLAN
-----------------------------------------------------------------------------------
 Seq Scan on allflow_tv_mydoom  (cost=100000000.00..102303307.94 rows=1
width=132)
   Filter: ((tv_s < 1074200099) AND (tv_s > 107506499))
(2 rows)

In this query I'm basically asking for a day's worth of data. It should be
straightforward: it's all laid out in order and indexed, but it still
insists on (what I believe to be) tearing through the entire DB and
filtering out the other time values as it sees them. Even if I want just
the data from a particular second, it does the same thing:

standb=# explain select * from allflow_tv_mydoom where tv_s = 1074200099;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Seq Scan on allflow_tv_mydoom  (cost=100000000.00..102140682.45 rows=145
width=132)
   Filter: (tv_s = 1074200099)
(2 rows)

Naturally, this is incredibly frustrating because it takes forever,
regardless of the query.

The funny thing though is that I have the same data in another table
where it is ordered and indexed by IP address, and the queries use the
index and work the way I want them to. Example:

standb=# explain select * from flow_ip_mydoom where src = '10.0.5.5';
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Index Scan using flow_ip_mydoom_x on flow_ip_mydoom  (cost=0.00..333.41
rows=376 width=132)
   Index Cond: (src = '10.0.5.5'::inet)
(2 rows)

Can someone please explain to me what's going on and how to fix it? If
there's an easier way to 'jump' to different time regions in the table
without indexing all the different seconds values, I'd like to know this
as well.

Thanks a bunch,
-S





Re: Indexing question

From
Tom Lane
Date:
Stan Bielski <bielski@ece.cmu.edu> writes:
>  Table "public.allflow_tv_mydoom"
>    Column   |  Type   | Modifiers
> ------------+---------+-----------
>  tv_s       | bigint  |
                ^^^^^^
> Indexes: allflow_tv_mydoom_x btree (tv_s)

> standb=# explain select * from allflow_tv_mydoom where tv_s < 1074200099
> and tv_s > 107506499;
> [ gives seqscan ]

This is a FAQ :-(.  Unadorned integer constants are taken to be int4
not int8 (unless they are too large for int4), and cross-data-type
comparisons are not indexable in existing releases.  So you have to
explicitly cast the comparison values to int8:

explain select * from allflow_tv_mydoom where tv_s < 1074200099::bigint
and tv_s > 107506499::bigint;

(or use the standard CAST syntax if you prefer).

7.5 will have a fix for this ancient annoyance.

BTW, is there a reason to be using tv_s+tv_us and not just a single
timestamptz column?

            regards, tom lane