Performance of query - Mailing list pgsql-general

From Edmund Dengler
Subject Performance of query
Date
Msg-id Pine.BSO.4.44.0306082339020.8114-100000@cyclops4.esentire.com
Whole thread Raw
In response to Re: Nulls get converted to 0 problem  (Avi Schwartz <avi@CFFtechnologies.com>)
List pgsql-general
Ok, now I am really confused. I have a query that runs really slow, yet
when I code it up as a function it runs very fast! The optimizer seems to
not evaluate a good method for performing the query.

Background: This is a database based off of Snort. There is a base table
called <event> (with <sid>=sensor ID, <cid>=event ID, <timestamp>=when
event collected, <signature>=which event was recognized), and a number of
depending tables for various bits and pieces of network packet data
(<iphdr>=IP header info, <tcphdr>=TCP header info, <data>=packet payload)
and <signature> (what event).

Notes: Running 7.3.3
Database is around 20GB on disk.

=> select version();
                               version
---------------------------------------------------------------------
 PostgreSQL 7.3.3 on i386-unknown-openbsd3.0, compiled by GCC 2.95.3
(1 row)

=> \d event
               Table "public.event"
  Column   |           Type           | Modifiers
-----------+--------------------------+-----------
 sid       | integer                  | not null
 cid       | bigint                   | not null
 signature | integer                  | not null
 timestamp | timestamp with time zone | not null
Indexes: event_pkey primary key btree (sid, cid),
         event_pull_idx unique btree ("timestamp", sid, cid),
         signature_idx btree (signature),
         timestamp_idx btree ("timestamp")
Triggers: event_summary

=> \d signature
                             Table "public.signature"
    Column    |  Type   |                        Modifiers
--------------+---------+----------------------------------------------------------
 sig_id       | integer | not null default nextval('"signature_sig_id_seq"'::text)
<<...more columns...>>
Indexes: signature_pkey primary key btree (sig_id),
         sig_class_id btree (sig_class_id),
         sig_name_idx btree (sig_name)

Rest of the tables are primary keyed with (sid, cid). For example,

=> \d iphdr
      Table "public.iphdr"
  Column  |   Type   | Modifiers
----------+----------+-----------
 sid      | integer  | not null
 cid      | bigint   | not null
<<...more columns...>>
Indexes: iphdr_pkey primary key btree (sid, cid),
         ip_dst_idx btree (ip_dst),
         ip_src_idx btree (ip_src)
Triggers: iphdr_summary

If I run the query:

explain analyze
select
  event.sid, event.cid, event.timestamp, signature.sig_name, iphdr.ip_src, iphdr.ip_dst,
  icmphdr.icmp_type, icmphdr.icmp_code, icmphdr.icmp_csum, icmphdr.icmp_id, icmphdr.icmp_seq,
  udphdr.udp_sport, udphdr.udp_dport, udphdr.udp_len, udphdr.udp_csum,
  tcphdr.tcp_sport, tcphdr.tcp_dport, tcphdr.tcp_seq, tcphdr.tcp_ack, tcphdr.tcp_off,
  tcphdr.tcp_res, tcphdr.tcp_flags, tcphdr.tcp_win, tcphdr.tcp_csum, tcphdr.tcp_urp,
  sensor.hostname, sensor.interface, data.data_payload
from (
  select * from event
  where timestamp > (select now() - '2 hours'::interval)
    and exists (select 1 from hack_pull_sid where sid = event.sid)
) as event
  left join signature on signature.sig_id = event.signature
  left join iphdr on iphdr.sid = event.sid and iphdr.cid = event.cid
  left join icmphdr on icmphdr.sid = event.sid and icmphdr.cid = event.cid
  left join udphdr on udphdr.sid = event.sid and udphdr.cid = event.cid
  left join tcphdr on tcphdr.sid = event.sid and tcphdr.cid = event.cid
  left join sensor on sensor.sid = event.sid
  left join data on data.sid = event.sid and data.cid = event.cid
order by
  timestamp desc, event.sid desc, event.cid desc
;

(note: <hack_pull_sid> is a table of SIDs I am interested in so that I
avoid the issues with IN)

I get the following output:


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=1712181.78..1712776.52 rows=237893 width=853) (actual time=163484.81..163484.97 rows=129 loops=1)
   Sort Key: public.event."timestamp", public.event.sid, public.event.cid
   InitPlan
     ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.04..0.04 rows=1 loops=1)
   ->  Merge Join  (cost=1471148.73..1551631.57 rows=237893 width=853) (actual time=152693.86..163484.26 rows=129
loops=1)
         Merge Cond: (("outer".sid = "inner".sid) AND ("outer".cid = "inner".cid))
         ->  Merge Join  (cost=728456.82..802284.20 rows=237893 width=258) (actual time=33652.33..39050.11 rows=129
loops=1)
               Merge Cond: ("outer".sid = "inner".sid)
               ->  Merge Join  (cost=728454.78..798713.57 rows=237893 width=226) (actual time=33635.79..39031.06
rows=129loops=1) 
                     Merge Cond: (("outer".sid = "inner".sid) AND ("outer".cid = "inner".cid))
                     ->  Merge Join  (cost=728454.78..736750.59 rows=237893 width=172) (actual
time=22767.67..24071.84rows=129loops=1) 
                           Merge Cond: (("outer".sid = "inner".sid) AND ("outer".cid = "inner".cid))
                           ->  Merge Join  (cost=728454.78..731219.48 rows=237893 width=144) (actual
time=22164.44..22681.23rows=129 loops=1) 
                                 Merge Cond: (("outer".sid = "inner".sid) AND ("outer".cid = "inner".cid))
                                 ->  Sort  (cost=701134.84..701729.57 rows=237893 width=116) (actual
time=15343.87..15344.05rows=129 loops=1) 
                                       Sort Key: public.event.sid, public.event.cid
                                       ->  Merge Join  (cost=590896.58..671684.37 rows=237893 width=116) (actual
time=10857.65..15343.57rows=129 loops=1) 
                                             Merge Cond: (("outer".sid = "inner".sid) AND ("outer".cid = "inner".cid))
                                             ->  Index Scan using iphdr_pkey on iphdr  (cost=0.00..73422.89
rows=1432042width=28) (actual time=23.19..13810.58 rows=687918 loops=1) 
                                             ->  Sort  (cost=590896.58..591491.31 rows=237893 width=88) (actual
time=101.79..101.95rows=129 loops=1) 
                                                   Sort Key: public.event.sid, public.event.cid
                                                   ->  Hash Join  (cost=185.07..563108.55 rows=237893 width=88) (actual
time=97.42..101.39rows=129 loops=1) 
                                                         Hash Cond: ("outer".signature = "inner".sig_id)
                                                         ->  Index Scan using timestamp_idx on event
(cost=0.00..558165.62rows=237893 width=24) (actual time=0.18..3.05 rows=129 loops=1) 
                                                               Index Cond: ("timestamp" > $0)
                                                               Filter: (subplan)
                                                               SubPlan
                                                                 ->  Seq Scan on hack_pull_sid  (cost=0.00..1.15 rows=1
width=0)(actual time=0.01..0.01 rows=1 loops=171) 
                                                                       Filter: (sid = $1)
                                                         ->  Hash  (cost=167.26..167.26 rows=7126 width=64) (actual
time=97.12..97.12rows=0 loops=1) 
                                                               ->  Seq Scan on signature  (cost=0.00..167.26
rows=7126width=64)(actual time=0.01..80.86 rows=7126 loops=1) 
                                 ->  Sort  (cost=27319.94..27897.87 rows=231171 width=28) (actual time=6606.52..7073.38
rows=110717loops=1) 
                                       Sort Key: icmphdr.sid, icmphdr.cid
                                       ->  Seq Scan on icmphdr  (cost=0.00..3784.71 rows=231171 width=28) (actual
time=19.81..2082.01rows=231303 loops=1) 
                           ->  Index Scan using udphdr_pkey on udphdr  (cost=0.00..4411.44 rows=115228 width=28)
(actualtime=60.23..1269.61 rows=54669 loops=1) 
                     ->  Index Scan using tcphdr_pkey on tcphdr  (cost=0.00..56586.06 rows=1139141 width=54) (actual
time=88.72..13878.52rows=484419 loops=1) 
               ->  Sort  (cost=2.04..2.11 rows=30 width=32) (actual time=16.46..16.64 rows=136 loops=1)
                     Sort Key: sensor.sid
                     ->  Seq Scan on sensor  (cost=0.00..1.30 rows=30 width=32) (actual time=16.26..16.34 rows=30
loops=1)
         ->  Sort  (cost=742691.92..745059.80 rows=947154 width=595) (actual time=108643.60..123322.67 rows=417145
loops=1)
               Sort Key: data.sid, data.cid
               ->  Seq Scan on data  (cost=0.00..51821.54 rows=947154 width=595) (actual time=9.67..54494.61
rows=947324loops=1) 
 Total runtime: 164147.78 msec
(43 rows)

For some runs, I was getting over 800,000 msec!!

Now, this seems very strange for only 129 rows. The inner select returns
exactly the 129 rows I am interested in, and now we only need to do a few
left joins to get the corresponding rows (which may or may not exist,
which is why the LEFT JOIN's). Note that the (<sid>,<cid>) is the primary
key for most of these joins, and so the LEFT JOIN should be able to
recognize that there will only be one or zero rows matching. In addition,
(<sid>,<cid>)  is the primary key for the <event> table. Similar
discussion with <signature> as <sig_id> is it's primary key.

Notes:
- The table <event> has 1,427,411 rows
- <event>.<timestamp> ranges from 2002-01-19 12:28:29-05 to
  2003-06-08 23:40:01-04
- The last 2 hours is a very small range assuming an even distribution
- All tables are using 100 bins for statistics
- A VACUUM ANALYZE was performed on the entire database before this test

As can be seem, the estimator seems to think that 237,893 rows will be
returned from the INNER SELECT out of 1,427,411 rows. Can anyone explain
how this number could be calculated? Even assuming a straight linear
assumption, this should fall into 2hours/(24hours/day*400days) rough
estimate which is (very roughly) 0.02% of the table.

So, I wrote a function. (Please, no remarks on style, it was a quick hack
to compare performance. I am sure it could be made better and more
optimized). It simply performs the inner SELECT, and then for each row,
does the outer select for the match.

CREATE TYPE hack_pull_type AS (
  sid int, cid bigint, timestamp timestamp with time zone,
  sig_name text,
  ip_src bigint, ip_dst bigint,
  icmp_type smallint, icmp_code smallint, icmp_csum int, icmp_id int, icmp_seq int,
  udp_sport int, udp_dport int, udp_len int, udp_csum int,
  tcp_sport int, tcp_dport int, tcp_seq bigint, tcp_ack bigint, tcp_off smallint,
  tcp_res smallint, tcp_flags smallint, tcp_win int, tcp_csum int, tcp_urp int,
  hostname text, interface text, data_payload text
);

CREATE OR REPLACE FUNCTION hack_pull_func() RETURNS SETOF hack_pull_type
AS '
DECLARE
  event_rec event%ROWTYPE;
  -- pull_rec hack_pull_type;
  pull_rec RECORD;
BEGIN
  FOR event_rec IN
    SELECT *
    FROM event
    WHERE timestamp > (SELECT now() - ''2 hours''::interval)
      AND EXISTS (SELECT 1 FROM hack_pull_sid WHERE sid = event.sid)
    ORDER BY timestamp desc, sid desc, cid desc
  LOOP
    SELECT
      event.sid, event.cid, event.timestamp,
      sig_name, ip_src, ip_dst,
      icmp_type, icmp_code, icmp_csum, icmp_id, icmp_seq,
      udp_sport, udp_dport, udp_len, udp_csum,
      tcp_sport, tcp_dport, tcp_seq, tcp_ack, tcp_off, tcp_res,
      tcp_flags, tcp_win, tcp_csum, tcp_urp,
      sensor.hostname, sensor.interface, data.data_payload
    INTO pull_rec
    FROM event
      left join signature on signature.sig_id = event.signature
      left join iphdr on iphdr.sid = event.sid and iphdr.cid = event.cid
      left join icmphdr on icmphdr.sid = event.sid and icmphdr.cid = event.cid
      left join udphdr on udphdr.sid = event.sid and udphdr.cid = event.cid
      left join tcphdr on tcphdr.sid = event.sid and tcphdr.cid = event.cid
      left join sensor on sensor.sid = event.sid
      left join data on data.sid = event.sid and data.cid = event.cid
    WHERE
      event.cid = event_rec.cid and event.sid = event_rec.sid
    LIMIT 1;
    RETURN NEXT pull_rec;
  END LOOP;

  RETURN;
END;
' LANGUAGE 'plpgsql';

=> explain analyze select * from hack_pull_func();
                                                     QUERY PLAN

------------------------------------------------------------------------------------------------------------------------
 Function Scan on hack_pull_func  (cost=0.00..12.50 rows=1000 width=238) (actual time=5259.46..5259.62 rows=23 loops=1)
 Total runtime: 5259.70 msec
(2 rows)

=> explain analyze select * from hack_pull_func();
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Function Scan on hack_pull_func  (cost=0.00..12.50 rows=1000 width=238) (actual time=38.02..38.18 rows=23 loops=1)
 Total runtime: 38.26 msec
(2 rows)

(Sorry, the data has changed as time moves forward and I type this email.
However, the numbers are reflective of the performance difference I
observed during testing).

So, what is happening? How come my function (ugly as it may be), is orders
of magnitude faster than what should be a clean query in SQL? Why is the
estimator so far off the real issue?

Is there a way of writing the query to get the good performance? Note that
there are a number of other queries I run which exhibit similar
performance/estimation issues, and I wonder if I will need to rwrite all
of them into custom functions (thanks for the SETOF piece as this makes it
possible to write this stuff now).

Thanks!
Ed



pgsql-general by date:

Previous
From: The Hermit Hacker
Date:
Subject: Archives re-generating ...
Next
From: Murthy Kambhampaty
Date:
Subject: Re: Backups and restores.