Thread: Optimizing suggestions

Optimizing suggestions

From
David Ford
Date:
Ok, I have a table which is currently a bit of a conglomerate.  I intend
to break it down into multiple tables, as is more befitting an RDBMS, but
here is the scoop so far.  I have a table on which I need to count up
statistics based on the timestamp of each entry.

bmilter=> \d history ;
                      Table "history"
     Column      |           Type           |   Modifiers
-----------------+--------------------------+---------------
 bmilter_host_id | inet                     | not null
 timestamp       | timestamp with time zone | default now()
 size            | bigint                   | not null
 content_type    | smallint                 | default '0'
 message_id      | character varying        | not null
Indexes: history_idx

The index is of a btree type on column timestamp.
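
For reference, the index definition would have been something along
these lines (the exact statement isn't shown above):

    create index history_idx on history (timestamp);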

What I need to do is get a count of rows per day for the last N days.
Typically 7, 14, a month, etc.  At present each query takes just under
one second, ~680ms, and returns the count for one day, so a week of
data comes out to 7 queries and about 4 seconds.  An example query for
two days ago is:

bmilter=> select count(*) from history where timestamp::date =
now()::date - '2 days'::interval;
 count
-------
  3513

I am currently accumulating about 4K rows per day but I expect this to
nearly double.  I'm looking for enlightenment, suggestions on how to
improve this etc.  As stated above, I'd like to fetch a count of rows
per day for the last N days.

The explain return is here:

bmilter=> explain analyze select count(*) from history where
timestamp::date = now()::date - '2 days'::interval;
NOTICE:  QUERY PLAN:

Aggregate  (cost=690.66..690.66 rows=1 width=0) (actual
time=674.95..674.95 rows=1 loops=1)
  ->  Seq Scan on history  (cost=0.00..690.46 rows=83 width=0) (actual
time=219.03..663.33 rows=3513 loops=1)
Total runtime: 675.11 msec

In the near future I'm going to split this table up into many sub tables
and introduce several more columns, i.e.:

      create table hx_id (
         serial            int8,
         bmilter_host_id   inet NOT NULL,
         timestamp         timestamp DEFAULT now() primary key
      );

      create table hx_stats (
         serial            int8,
         size              int8 DEFAULT '0',
         content_type      int2 DEFAULT '0',
         message_id        character varying
      );

      create table hx_recipients (
         serial            int8,
         recipients        character varying
      );

      create table hx_spam_tagged (
         serial            int8,
         spam_id           int8 references spam_ids (id) on update cascade
      );


I believe I've laid it out now, so hit me with it if you care to.  What
can I do to improve the speed of these queries?  I don't have the
fastest of machines, so I do need to tune as best I can.  Normal
postmaster tuning aside (I've already increased the shared buffers,
etc.), would touching any of the optimizer settings help?  I'm sure my
concepts need improvement.

Thank you,
David



Re: Optimizing suggestions

From
Tom Lane
Date:
David Ford <david+cert@blue-labs.org> writes:
> As stated above, I'd like to fetch a count of rows
> per day for the last N days.

GROUP BY seems like the obvious answer...
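
Something along these lines, say (untested), which also gives you the
date alongside each count:

    select timestamp::date as day, count(*)
    from history
    group by timestamp::date
    order by timestamp::date;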

> In the near future I'm going to split this table up into many sub tables

Why?  Aren't you just going to have a need for lots of joins, if you
do that?
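
(For instance, just pulling one message's record back together from the
proposed layout would need something on the order of

    select i.timestamp, s.size, s.content_type, s.message_id, r.recipients
    from hx_id i
    join hx_stats s using (serial)
    join hx_recipients r using (serial);

assuming serial is the common key, and that's before hx_spam_tagged
enters the picture.)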

            regards, tom lane

Re: Optimizing suggestions

From
David Ford
Date:

Tom Lane wrote:

>David Ford <david+cert@blue-labs.org> writes:
>
>>As stated above, I'd like to fetch a count of rows
>>per day for the last N days.
>
>GROUP BY seems like the obvious answer...
>

So it does,

bmilter=> explain analyze select count(*) from history group by
timestamp::date;
NOTICE:  QUERY PLAN:

Aggregate  (cost=1784.68..1878.46 rows=1876 width=8) (actual
time=515.60..661.22 rows=8 loops=1)
  ->  Group  (cost=1784.68..1831.57 rows=18755 width=8) (actual
time=510.75..623.29 rows=19310 loops=1)
        ->  Sort  (cost=1784.68..1784.68 rows=18755 width=8) (actual
time=510.73..535.28 rows=19310 loops=1)
              ->  Seq Scan on history  (cost=0.00..453.55 rows=18755
width=8) (actual time=0.05..354.04 rows=19310 loops=1)
Total runtime: 667.70 msec

EXPLAIN

and..

bmilter=> explain analyze select count(*) from history where
timestamp::date > (now()::date - '7 days'::interval) group by
timestamp::date ;
NOTICE:  QUERY PLAN:

Aggregate  (cost=1175.93..1207.19 rows=625 width=8) (actual
time=1233.34..1349.87 rows=7 loops=1)
  ->  Group  (cost=1175.93..1191.56 rows=6252 width=8) (actual
time=1204.19..1313.24 rows=18693 loops=1)
        ->  Sort  (cost=1175.93..1175.93 rows=6252 width=8) (actual
time=1204.17..1227.90 rows=18693 loops=1)
              ->  Seq Scan on history  (cost=0.00..781.76 rows=6252
width=8) (actual time=8.52..1049.05 rows=18693 loops=1)
Total runtime: 1356.16 msec

EXPLAIN

bmilter=> select count(*) from history where timestamp::date >
(now()::date - '7 days'::interval) group by timestamp::date ;
 count
-------
  3743
  3414
  3513
  3077
  2642
  1553
   751
(7 rows)


Now.. :)  How can I further improve on this?  This would be faster if I
used two columns of 'date' and 'time', yes?

>>In the near future I'm going to split this table up into many sub tables
>
>Why?  Aren't you just going to have a need for lots of joins, if you
>do that?
>
>            regards, tom lane
>

No, actually most of this data will be drawn into focused reporting for
graphing metrics.  Only when I drill down to a specific (small) set of
message serial numbers, will I be doing a join of all this data.

In my naivety, which is more performance effective?  Most of the time I
will be searching for a count of statistics on a single column.

Thank you for your assistance, knowledge is appreciated,
David

p.s. I am assuming count(*) is always faster than count(column) because
it's doing a generic row count vs. accumulating stats on a column
inside a row?



Re: Optimizing suggestions

From
Tom Lane
Date:
David Ford <david+cert@blue-labs.org> writes:
> bmilter=> select count(*) from history where timestamp::date >
> (now()::date - '7 days'::interval) group by timestamp::date ;

> Now.. :)  How can I further improve on this?

The date-restriction clause is poorly phrased: you're taking a timestamp
from now(), coercing to date, coercing back to timestamp (because there
is no date - interval operator, only timestamp - interval), subtracting
an interval to give a timestamp ... and then on the other side of the
comparison, coercing to date and then back to timestamp to do the
comparison (unless there's a date > timestamp operator, which I doubt).

Aside from all the wasted datatype conversion operators, there is no
hope of using this clause with an index on the timestamp column.

What I'd try is

    where timestamp >= (current_date - 6)

(which I think has the same boundary-condition behavior as your given
expression; adjust to taste).

As of current development sources (7.3 to be) the above WHERE clause is
directly indexscannable.  In existing releases you have to play some
games to get the planner to realize that (current_date - 6) can be
treated as a constant for the purposes of making an indexscan.  One
way is to make a function called, say, "ago(int)" that returns
current_date minus the parameter, and mark it isCachable.
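
In other words, something like this (untested, 7.2-era syntax):

    create function ago(int) returns date as
        'select current_date - $1'
        language 'sql' with (iscachable);

    select timestamp::date as day, count(*)
    from history
    where timestamp >= ago(6)
    group by timestamp::date;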

> In the near future I'm going to split this table up into many sub tables
>>
>> Why?  Aren't you just going to have a need for lots of joins, if you
>> do that?

> No, actually most of this data will be drawn into focused reporting for
> graphing metrics.  Only when I drill down to a specific (small) set of
> message serial numbers, will I be doing a join of all this data.

> In my naivety, which is more performance effective?  Most of the time I
> will be searching for a count of statistics on a single column.

I'd guess that you're better off with the single table anyway, just on
grounds of simplicity.  I really doubt that the split-up could provide
enough performance gain to be worth the trouble...

> p.s. I am assuming count(*) is always faster than count(column) because
> it's doing a generic row count vs. accumulating stats on a column
> inside a row?

Marginally faster, yes --- the count(column) form has to take the time
to see whether the column is null or not.
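
(They can also give different answers: count(column) does not count rows
where the column is null, so for example

    select count(*), count(content_type) from history;

would show two different numbers if content_type were ever null.)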

            regards, tom lane