Home > mailing lists

Re: Subselect query enhancement - Mailing list pgsql-performance

From	Richard Huxton
Subject	Re: Subselect query enhancement
Date	February 2, 2007 05:51:53
Msg-id	45C309AE.1030506@archonet.com Whole thread
In response to	Re: Subselect query enhancement ("Michael Artz" <mlartz@gmail.com>)
List	pgsql-performance

Tree view

Michael Artz wrote:
> Here are some numbers for 3 different queries using a very selective
> query (port = 2222).  I'm thinking that, since the row estimates are
> different from the actuals (2 vs 2000), that this particular port
> didn't make it into the statistics ... is that true?  Does this
> matter?  If this isn't enough data, I can type up some more.

SELECT * FROM pg_stats WHERE tablename='foo';
This will show you frequency-stats on each column (as generated by
analyse). You're interested in n_distinct, most_common_vals,
most_common_freqs.

> One thing that I forgot to mention is that the network_events and
> host_events tables are partitioned by first octet of the IP, so when I
> say "various scans of ..." that means that there is a scan of each of
> the partitions, the type determined by the planner and the statistics,
> I assume.

So you've got xxx_events tables partitioned by ip, but ip_info is one
table? Do you do a lot of scans across the bottom 3 bytes of the IP? If
not, I'm not clear what we're gaining from the partitioning.

> ** Dual subselects:
> SELECT * FROM ip_info
> WHERE ip IN (SELECT ip FROM network_events WHERE port = 2222)
>       OR ip IN (SELECT ip FROM host_events WHERE port = 2222);
>
> Seq scan on ip_info (cost=2776..354575 rows=9312338 width=72) (actual
> time=34..8238 rows=234 loops=1)
> Filter:  ... AND ((hashed_subplan) OR (hashed_subplan))

Well, the estimate here is rubbish - 9.3 million rows whereas we
actually get 234. Now we know you're likely to get a lot of overlap, and
the planner might not realise that. Still - that looks very bad. Of
course, because it's expecting so many rows a seq-scan of ip_info looks
like a good choice to it.

> ** Single subselect:
>
> SELECT * FROM ip_info
> WHERE ip IN (SELECT DISTINCT ip FROM network_events WHERE port = 2222
>             UNION
>             SELECT DISTINCT ip FROM host_events WHERE port = 2222);
>
> Nested Loop (cost=2841..2856 rows=2 width=72) (actual time=55..106
> rows=2349 loops=1)

This is clearly a lot better, Not sure whether the DISTINCT in each
subquery works or not.

> ** The join:
>
> SELECT distinct ip_info.*
>  FROM ip_info RIGHT JOIN network_events USING (ip)
>       RIGHT JOIN  host_events USING (ip)
>  WHERE (network_events.port=2222 OR host_events.port=2222)
>
> Unique (cost=9238..9367 rows=1965 width=72) (actual time=61..61 rows=52
> loops=1)
>  -> Sort (cost=9238..9288 rows=1965 width=72) (actual time=61..61
> rows=63 loops=1)

OK, so what do the plans look like for port=80 or something larger like
that?

Then try adding an index to the various host/network_events tables
CREATE INDEX ... ON ... (ip) WHERE port=80;

--
   Richard Huxton
   Archonet Ltd

pgsql-performance by date:

From: Ben
Date: 01 February 2007, 20:41:47
Subject: drive configuration for a new server

From: Kirk Wythers
Date: 02 February 2007, 10:35:25
Subject: trouble with a join on OS X

Re: Subselect query enhancement - Mailing list pgsql-performance

Previous

Next