Thread: Optimizing suggestions
Ok, I have a table which is currently a bit of a conglomerate. I intend to
break it down into multiple tables, as better befits an RDBMS, but here is
the scoop so far. I have a table on which I need to count up statistics
based on the timestamp of each entry.

bmilter=> \d history ;
                    Table "history"
     Column      |           Type           |   Modifiers
-----------------+--------------------------+---------------
 bmilter_host_id | inet                     | not null
 timestamp       | timestamp with time zone | default now()
 size            | bigint                   | not null
 content_type    | smallint                 | default '0'
 message_id      | character varying        | not null
Indexes: history_idx

The index is a btree on the timestamp column.

What I need to do is get a count of the number of rows for a given day for
the last N days. Typically 7, 14, one month, etc. At present this takes
just under one second per query, ~680ms. Each query returns the count for
one day, so for a week of data this comes out to 7 queries and about 4
seconds worth of time. An example query for two days ago is:

bmilter=> select count(*) from history where timestamp::date = now()::date - '2 days'::interval;
 count
-------
  3513

I am currently accumulating about 4K rows per day, but I expect this to
nearly double. I'm looking for enlightenment and suggestions on how to
improve this. As stated above, I'd like to fetch a count of rows per day
for the last N days. The explain output is here:

bmilter=> explain analyze select count(*) from history where timestamp::date = now()::date - '2 days'::interval;
NOTICE:  QUERY PLAN:

Aggregate  (cost=690.66..690.66 rows=1 width=0) (actual time=674.95..674.95 rows=1 loops=1)
  ->  Seq Scan on history  (cost=0.00..690.46 rows=83 width=0) (actual time=219.03..663.33 rows=3513 loops=1)
Total runtime: 675.11 msec

In the near future I'm going to split this table up into many sub tables
and introduce several more columns, i.e.:

create table hx_id (
    serial          int8,
    bmilter_host_id inet NOT NULL,
    timestamp       timestamp DEFAULT 'now'::timestamp primary key
);
create table hx_stats (
    serial          int8,
    size            int8 DEFAULT '0',
    content_type    int2 DEFAULT '0',
    message_id      character varying
);
create table hx_recipients (
    serial          int8,
    recipients      character varying
);
create table hx_spam_tagged (
    serial          int8,
    spam_id         int8 references spam_ids (id) on update cascade
);

I believe I've laid it out now, so hit me with it if you care to. What can
I do to improve the speed of these queries? I don't have the fastest of
machines, so I do need to tune as best I can. Normal postmaster tuning
aside (I've already increased the shared buffers etc.), would touching
anything in the optimizer section help? I'm sure my concepts need
improvement.

Thank you,
David
David Ford <david+cert@blue-labs.org> writes:
> As stated above, I'd like to fetch a count of rows per day for the last
> N days.

GROUP BY seems like the obvious answer...

> In the near future I'm going to split this table up into many sub tables

Why? Aren't you just going to have a need for lots of joins, if you do
that?

			regards, tom lane
Tom Lane wrote:

>David Ford <david+cert@blue-labs.org> writes:
>
>>As stated above, I'd like to fetch a count of rows per day for the last
>>N days.
>
>GROUP BY seems like the obvious answer...

So it does:

bmilter=> explain analyze select count(*) from history group by timestamp::date;
NOTICE:  QUERY PLAN:

Aggregate  (cost=1784.68..1878.46 rows=1876 width=8) (actual time=515.60..661.22 rows=8 loops=1)
  ->  Group  (cost=1784.68..1831.57 rows=18755 width=8) (actual time=510.75..623.29 rows=19310 loops=1)
        ->  Sort  (cost=1784.68..1784.68 rows=18755 width=8) (actual time=510.73..535.28 rows=19310 loops=1)
              ->  Seq Scan on history  (cost=0.00..453.55 rows=18755 width=8) (actual time=0.05..354.04 rows=19310 loops=1)
Total runtime: 667.70 msec

EXPLAIN

and..

bmilter=> explain analyze select count(*) from history where timestamp::date > (now()::date - '7 days'::interval) group by timestamp::date ;
NOTICE:  QUERY PLAN:

Aggregate  (cost=1175.93..1207.19 rows=625 width=8) (actual time=1233.34..1349.87 rows=7 loops=1)
  ->  Group  (cost=1175.93..1191.56 rows=6252 width=8) (actual time=1204.19..1313.24 rows=18693 loops=1)
        ->  Sort  (cost=1175.93..1175.93 rows=6252 width=8) (actual time=1204.17..1227.90 rows=18693 loops=1)
              ->  Seq Scan on history  (cost=0.00..781.76 rows=6252 width=8) (actual time=8.52..1049.05 rows=18693 loops=1)
Total runtime: 1356.16 msec

EXPLAIN

bmilter=> select count(*) from history where timestamp::date > (now()::date - '7 days'::interval) group by timestamp::date ;
 count
-------
  3743
  3414
  3513
  3077
  2642
  1553
   751
(7 rows)

Now.. :)  How can I further improve on this? Would this be faster if I
used two separate 'date' and 'time' columns?

>>In the near future I'm going to split this table up into many sub tables
>
>Why? Aren't you just going to have a need for lots of joins, if you do
>that?
>
>			regards, tom lane

No, actually most of this data will be drawn into focused reporting for
graphing metrics. Only when I drill down to a specific (small) set of
message serial numbers will I be doing a join of all this data.

In my naivety, which is more performance-effective? Most of the time I
will be searching for a count of statistics on a single column.

Thank you for your assistance; the knowledge is appreciated,
David

p.s. I am assuming count(*) vs. count(column) is always faster because
it's doing a generic row count vs. accumulating stats on a column inside a
row?
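(A side note on the query itself rather than its speed: the GROUP BY form
above returns only the bare counts, so nothing in the output says which day
each row belongs to. A minimal variation, assuming the same history table,
that labels each count with its date:

    select timestamp::date, count(*)
      from history
     where timestamp::date > (now()::date - '7 days'::interval)
     group by timestamp::date
     order by 1;

This doesn't change the plan; it only adds the grouping column to the
output.)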
David Ford <david+cert@blue-labs.org> writes:
> bmilter=> select count(*) from history where timestamp::date >
> (now()::date - '7 days'::interval) group by timestamp::date ;

> Now.. :)  How can I further improve on this?

The date-restriction clause is poorly phrased: you're taking a timestamp
from now(), coercing to date, coercing back to timestamp (because there is
no date - interval operator, only timestamp - interval), subtracting an
interval to give a timestamp ... and then on the other side of the
comparison, coercing to date and then back to timestamp to do the
comparison (unless there's a date > timestamp operator, which I doubt).
Aside from all the wasted datatype conversion operators, there is no hope
of using this clause with an index on the timestamp column.

What I'd try is

	where timestamp >= (current_date - 6)

(which I think has the same boundary-condition behavior as your given
expression; adjust to taste). As of current development sources (7.3 to
be) the above WHERE clause is directly indexscannable. In existing
releases you have to play some games to get the planner to realize that
(current_date - 6) can be treated as a constant for the purposes of making
an indexscan. One way is to make a function called, say, "ago(int)" that
returns current_date minus the parameter, and mark it isCachable.

> In the near future I'm going to split this table up into many sub tables
>>
>> Why? Aren't you just going to have a need for lots of joins, if you do
>> that?

> No, actually most of this data will be drawn into focused reporting for
> graphing metrics. Only when I drill down to a specific (small) set of
> message serial numbers will I be doing a join of all this data.

> In my naivety, which is more performance-effective? Most of the time I
> will be searching for a count of statistics on a single column.

I'd guess that you're better off with the single table anyway, just on
grounds of simplicity. I really doubt that the split-up could provide
enough performance gain to be worth the trouble...

> p.s. I am assuming count(*) vs. count(column) is always faster because
> it's doing a generic row count vs. accumulating stats on a column inside
> a row?

Marginally faster, yes --- the count(column) form has to take the time to
see whether the column is null or not.

			regards, tom lane
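(For reference, a sketch of what that suggestion might look like on a
pre-7.3 server. The function body follows Tom's description of ago(int),
i.e. current_date minus the parameter; the old-style "with (iscachable)"
marking shown here may need adjusting for your exact version:

    create function ago(integer) returns date as
        'select current_date - $1'
        language 'sql' with (iscachable);

    select timestamp::date, count(*)
      from history
     where timestamp >= ago(6)
     group by timestamp::date;

Because ago(6) can be folded to a constant at plan time, the comparison
against the plain timestamp column should be able to use the existing
btree index instead of the sequential scans shown in the plans above.)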