Re: Subselect query enhancement - Mailing list pgsql-performance
From | Richard Huxton |
---|---|
Subject | Re: Subselect query enhancement |
Date | |
Msg-id | 45C309AE.1030506@archonet.com Whole thread Raw |
In response to | Re: Subselect query enhancement ("Michael Artz" <mlartz@gmail.com>) |
List | pgsql-performance |
Michael Artz wrote: > Here are some numbers for 3 different queries using a very selective > query (port = 2222). I'm thinking that, since the row estimates are > different from the actuals (2 vs 2000), that this particular port > didn't make it into the statistics ... is that true? Does this > matter? If this isn't enough data, I can type up some more. SELECT * FROM pg_stats WHERE tablename='foo'; This will show you frequency-stats on each column (as generated by analyse). You're interested in n_distinct, most_common_vals, most_common_freqs. > One thing that I forgot to mention is that the network_events and > host_events tables are partitioned by first octet of the IP, so when I > say "various scans of ..." that means that there is a scan of each of > the partitions, the type determined by the planner and the statistics, > I assume. So you've got xxx_events tables partitioned by ip, but ip_info is one table? Do you do a lot of scans across the bottom 3 bytes of the IP? If not, I'm not clear what we're gaining from the partitioning. > ** Dual subselects: > SELECT * FROM ip_info > WHERE ip IN (SELECT ip FROM network_events WHERE port = 2222) > OR ip IN (SELECT ip FROM host_events WHERE port = 2222); > > Seq scan on ip_info (cost=2776..354575 rows=9312338 width=72) (actual > time=34..8238 rows=234 loops=1) > Filter: ... AND ((hashed_subplan) OR (hashed_subplan)) Well, the estimate here is rubbish - 9.3 million rows whereas we actually get 234. Now we know you're likely to get a lot of overlap, and the planner might not realise that. Still - that looks very bad. Of course, because it's expecting so many rows a seq-scan of ip_info looks like a good choice to it. > ** Single subselect: > > SELECT * FROM ip_info > WHERE ip IN (SELECT DISTINCT ip FROM network_events WHERE port = 2222 > UNION > SELECT DISTINCT ip FROM host_events WHERE port = 2222); > > Nested Loop (cost=2841..2856 rows=2 width=72) (actual time=55..106 > rows=2349 loops=1) This is clearly a lot better, Not sure whether the DISTINCT in each subquery works or not. > ** The join: > > SELECT distinct ip_info.* > FROM ip_info RIGHT JOIN network_events USING (ip) > RIGHT JOIN host_events USING (ip) > WHERE (network_events.port=2222 OR host_events.port=2222) > > Unique (cost=9238..9367 rows=1965 width=72) (actual time=61..61 rows=52 > loops=1) > -> Sort (cost=9238..9288 rows=1965 width=72) (actual time=61..61 > rows=63 loops=1) OK, so what do the plans look like for port=80 or something larger like that? Then try adding an index to the various host/network_events tables CREATE INDEX ... ON ... (ip) WHERE port=80; -- Richard Huxton Archonet Ltd
pgsql-performance by date: