Thread: Subselect query enhancement
I'm needing help determining the best all-around query for the following situation. I have primary table that holds ip information and two other tables that hold event data for the specific IP in with a one-to-many mapping between them, ie: CREATE TABLE ip_info ( ip IP4, --other data ); CREATE TABLE network_events ( ip IP4 NOT NULL REFERENCES ip_info(ip), name VARCHAR, port INTEGER, --other data ); CREATE TABLE host_events ( ip IP4 NOT NULL REFERENCES ip_info(ip), name VARCHAR port INTEGER, --other data ); There is quite a bit of commonality between the network_events and host_events schemas, but they do not currently share an ancestor. ip_info has about 13 million rows, the network_events table has about 30 million rows, and the host_events table has about 7 million rows. There are indexes on all the rows. The query that I would like to execute is to select all the rows of ip_info that have either network or host events that meet some criteria, i.e. name='blah'. I have 3 different possibilities that I have thought of to execute this. First, 2 'ip IN (SELECT ...)' statements joined by an OR: SELECT * FROM ip_info WHERE ip IN (SELECT ip FROM network_events WHERE name='blah') OR ip IN (SELECT ip FROM host_events WHERE name='blah'); Next, 1 'ip IN (SELECT ... UNION SELECT ...) statement: SELECT * FROM ip_info WHERE ip IN (SELECT ip FROM network_events WHERE name='blah' UNION SELECT ip FROM host_events WHERE name='blah'); Or, finally, the UNION statment with DISTINCTs: SELECT * FROM ip_info WHERE ip IN (SELECT DISTINCT ip FROM network_events WHERE name='blah' UNION SELECT DISTINCT ip FROM host_events WHERE name='blah'); From what I have read, the UNION statement does an implicit DISTINCT, but I thought that doing it on each of the tables would result in slightly faster execution. Can you think of any other ways to implement the previous query? I have explained/analyzed all the queries but, unfortunately, they are on an isolated computer. The gist is that, for relatively low-incidence values of name, the UNION performs better, but for queries on a common name, the dual-subselect query performs better. The explains look something like: Dual-subselect: Seq scan on ip_info Filter: ... AND ((hashed_subplan) OR (hashed_subplan)) Subplan -> Result -> Append -> various scans on host_events -> Result -> Append -> various scans on network_events UNION SELECT DISTINCT: Nested Loop -> Unique -> Sort -> Append -> Unique -> Sort -> Result -> Append -> various scans on host_events -> Unique -> Sort -> Result -> Append -> various scans on network_events If it would help to have more information, I could retype some of numbers in the explain. Any ideas? Thanks, -Mike
Michael Artz wrote: > I'm needing help determining the best all-around query for the > following situation. Not sure whether such a beast exists, but... > I have primary table that holds ip information > and two other tables that hold event data for the specific IP in with > a one-to-many mapping between them, ie: [snip] > There is quite a bit of commonality between the network_events and > host_events schemas, but they do not currently share an ancestor. > ip_info has about 13 million rows, the network_events table has about > 30 million rows, and the host_events table has about 7 million rows. > There are indexes on all the rows. What indexes though. Do you have (name,ip) on the two event tables? How selective is "name" - are there many different values or just a few? If lots, it might be worth increasing the statistics gathered on that column (ALTER COLUMN ... SET STATISTICS). http://www.postgresql.org/docs/8.2/static/sql-altertable.html > The query that I would like to execute is to select all the rows of > ip_info that have either network or host events that meet some > criteria, i.e. name='blah'. I have 3 different possibilities that I > have thought of to execute this. > > First, 2 'ip IN (SELECT ...)' statements joined by an OR: > > SELECT * FROM ip_info > WHERE ip IN (SELECT ip FROM network_events WHERE name='blah') > OR ip IN (SELECT ip FROM host_events WHERE name='blah'); > > Next, 1 'ip IN (SELECT ... UNION SELECT ...) statement: > > SELECT * FROM ip_info > WHERE ip IN (SELECT ip FROM network_events WHERE name='blah' > UNION > SELECT ip FROM host_events WHERE name='blah'); > > Or, finally, the UNION statment with DISTINCTs: > > SELECT * FROM ip_info > WHERE ip IN (SELECT DISTINCT ip FROM network_events WHERE name='blah' > UNION > SELECT DISTINCT ip FROM host_events WHERE name='blah'); > > From what I have read, the UNION statement does an implicit DISTINCT, > but I thought that doing it on each of the tables would result in > slightly faster execution. Can you think of any other ways to > implement the previous query? You're right about removing duplicates. Not sure whether the DISTINCTs on the sub-selects are helping or hindering. It'll probably depend on your hardware, config, number of rows etc. The only other way I can think of for this query is to UNION two JOINs. Might interact well with the (name,ip) index I mentioned above. > I have explained/analyzed all the queries but, unfortunately, they are > on an isolated computer. The gist is that, for relatively > low-incidence values of name, the UNION performs better, but for > queries on a common name, the dual-subselect query performs better. Difficult to say much without seeing the full explain analyse. Did the row estimates look reasonable? -- Richard Huxton Archonet Ltd
I've found that doing joins seems to produce better results on the big tables queries I use. This is not always the case though. How about this option: SELECT distinct ip_info.* FROM ip_info RIGHT JOIN network_events USING (ip) RIGHT JOIN host_events USING (ip) WHERE (network_events.name='blah' OR host_events.name = 'blah') AND ip_info.ip IS NOT NULL; That gets rid of the sub-queries your using that look pretty costly. Michael Artz wrote: > I'm needing help determining the best all-around query for the > following situation. I have primary table that holds ip information > and two other tables that hold event data for the specific IP in with > a one-to-many mapping between them, ie: > > CREATE TABLE ip_info ( > ip IP4, > --other data > ); > > CREATE TABLE network_events ( > ip IP4 NOT NULL REFERENCES ip_info(ip), > name VARCHAR, > port INTEGER, > --other data > ); > > CREATE TABLE host_events ( > ip IP4 NOT NULL REFERENCES ip_info(ip), > name VARCHAR > port INTEGER, > --other data > ); > > There is quite a bit of commonality between the network_events and > host_events schemas, but they do not currently share an ancestor. > ip_info has about 13 million rows, the network_events table has about > 30 million rows, and the host_events table has about 7 million rows. > There are indexes on all the rows. > > The query that I would like to execute is to select all the rows of > ip_info that have either network or host events that meet some > criteria, i.e. name='blah'. I have 3 different possibilities that I > have thought of to execute this. > > First, 2 'ip IN (SELECT ...)' statements joined by an OR: > > SELECT * FROM ip_info > WHERE ip IN (SELECT ip FROM network_events WHERE name='blah') > OR ip IN (SELECT ip FROM host_events WHERE name='blah'); > > Next, 1 'ip IN (SELECT ... UNION SELECT ...) statement: > > SELECT * FROM ip_info > WHERE ip IN (SELECT ip FROM network_events WHERE name='blah' > UNION > SELECT ip FROM host_events WHERE name='blah'); > > Or, finally, the UNION statment with DISTINCTs: > > SELECT * FROM ip_info > WHERE ip IN (SELECT DISTINCT ip FROM network_events WHERE name='blah' > UNION > SELECT DISTINCT ip FROM host_events WHERE name='blah'); > >> From what I have read, the UNION statement does an implicit DISTINCT, > but I thought that doing it on each of the tables would result in > slightly faster execution. Can you think of any other ways to > implement the previous query? > > I have explained/analyzed all the queries but, unfortunately, they are > on an isolated computer. The gist is that, for relatively > low-incidence values of name, the UNION performs better, but for > queries on a common name, the dual-subselect query performs better. > > The explains look something like: > Dual-subselect: > Seq scan on ip_info > Filter: ... AND ((hashed_subplan) OR (hashed_subplan)) > Subplan > -> Result > -> Append > -> various scans on host_events > -> Result > -> Append > -> various scans on network_events > > UNION SELECT DISTINCT: > > Nested Loop > -> Unique > -> Sort > -> Append > -> Unique > -> Sort > -> Result > -> Append > -> various scans on host_events > -> Unique > -> Sort > -> Result > -> Append > -> various scans on network_events > > If it would help to have more information, I could retype some of > numbers in the explain. > > Any ideas? > > Thanks, > -Mike > > ---------------------------(end of broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings > -- *Edward Allen* Software Engineer Black Duck Software, Inc. tallen@blackducksoftware.com <mailto:tallen@blackducksoftware.com> T +1.781.891.5100 x133 F +1.781.891.5145 http://www.blackducksoftware.com
> How about this option: > > SELECT distinct ip_info.* FROM ip_info RIGHT JOIN network_events USING > (ip) RIGHT JOIN host_events USING (ip) WHERE > (network_events.name='blah' OR host_events.name = 'blah') AND > ip_info.ip IS NOT NULL; Nah, that seems to be much much worse. The other queries usually return in 1-2 minutes, this one has been running for 30 minutes and has still not returned -Mike
On Thu, 1 Feb 2007 11:42:03 -0500 "Michael Artz" <mlartz@gmail.com> wrote: > I'm needing help determining the best all-around query for the > following situation. I have primary table that holds ip information > and two other tables that hold event data for the specific IP in with > a one-to-many mapping between them, ie: > > CREATE TABLE ip_info ( > ip IP4, > --other data > ); > > CREATE TABLE network_events ( > ip IP4 NOT NULL REFERENCES ip_info(ip), > name VARCHAR, > port INTEGER, > --other data > ); > > CREATE TABLE host_events ( > ip IP4 NOT NULL REFERENCES ip_info(ip), > name VARCHAR > port INTEGER, > --other data > ); It would probably help to have an index on that column for all three tables, then I would wager using joins will be the speed winner. --------------------------------- Frank Wiles <frank@wiles.org> http://www.wiles.org ---------------------------------
> > I have primary table that holds ip information > > and two other tables that hold event data for the specific IP in with > > a one-to-many mapping between them, ie: > [snip] > > There is quite a bit of commonality between the network_events and > > host_events schemas, but they do not currently share an ancestor. > > ip_info has about 13 million rows, the network_events table has about > > 30 million rows, and the host_events table has about 7 million rows. > > There are indexes on all the rows. > > What indexes though. Do you have (name,ip) on the two event tables? All the columns are indexed individually. The tables are completely static, as I reload the whole DB with new data every day. > How selective is "name" - are there many different values or just a few? > If lots, it might be worth increasing the statistics gathered on that > column (ALTER COLUMN ... SET STATISTICS). > http://www.postgresql.org/docs/8.2/static/sql-altertable.html I guess that is the heart of my question. "name" is not very selective (there are only 20 or so choices) however other columns are fairly selective for certain cases, such as 'port'. When querying on and unusual port, the query is very fast, and the single UNIONed subselect returns quickly. When 'port' is not very selective (like port = '80', which is roughly 1/2 of the rows in the DB), the dual subselect query wins, hands-down. And I have altered the statistics via the config file: default_statistics_target = 100 Perhaps this should be even higher for certain columns? > > The query that I would like to execute is to select all the rows of > > ip_info that have either network or host events that meet some > > criteria, i.e. name='blah'. I have 3 different possibilities that I > > have thought of to execute this. > > > > First, 2 'ip IN (SELECT ...)' statements joined by an OR: > > > > SELECT * FROM ip_info > > WHERE ip IN (SELECT ip FROM network_events WHERE name='blah') > > OR ip IN (SELECT ip FROM host_events WHERE name='blah'); > > > > Next, 1 'ip IN (SELECT ... UNION SELECT ...) statement: > > > > SELECT * FROM ip_info > > WHERE ip IN (SELECT ip FROM network_events WHERE name='blah' > > UNION > > SELECT ip FROM host_events WHERE name='blah'); > > > > Or, finally, the UNION statment with DISTINCTs: > > > > SELECT * FROM ip_info > > WHERE ip IN (SELECT DISTINCT ip FROM network_events WHERE name='blah' > > UNION > > SELECT DISTINCT ip FROM host_events WHERE name='blah'); > > > > From what I have read, the UNION statement does an implicit DISTINCT, > > but I thought that doing it on each of the tables would result in > > slightly faster execution. Can you think of any other ways to > > implement the previous query? > > You're right about removing duplicates. Not sure whether the DISTINCTs > on the sub-selects are helping or hindering. It'll probably depend on > your hardware, config, number of rows etc. > > The only other way I can think of for this query is to UNION two JOINs. > Might interact well with the (name,ip) index I mentioned above. Nah, that did very poorly. > > I have explained/analyzed all the queries but, unfortunately, they are > > on an isolated computer. The gist is that, for relatively > > low-incidence values of name, the UNION performs better, but for > > queries on a common name, the dual-subselect query performs better. > > Difficult to say much without seeing the full explain analyse. Did the > row estimates look reasonable? hmm, I think so, but I'm not that good in reading the outputs. I'll see if I can retype some of the interesting bits of the explain analyze.
Michael Artz wrote: >> > I have primary table that holds ip information >> > and two other tables that hold event data for the specific IP in with >> > a one-to-many mapping between them, ie: >> [snip] >> > There is quite a bit of commonality between the network_events and >> > host_events schemas, but they do not currently share an ancestor. >> > ip_info has about 13 million rows, the network_events table has about >> > 30 million rows, and the host_events table has about 7 million rows. >> > There are indexes on all the rows. >> >> What indexes though. Do you have (name,ip) on the two event tables? > > All the columns are indexed individually. The tables are completely > static, as I reload the whole DB with new data every day. The point of a (name,ip) index would be to let you read off ip numbers in order easily. >> How selective is "name" - are there many different values or just a few? >> If lots, it might be worth increasing the statistics gathered on that >> column (ALTER COLUMN ... SET STATISTICS). >> http://www.postgresql.org/docs/8.2/static/sql-altertable.html > > I guess that is the heart of my question. "name" is not very > selective (there are only 20 or so choices) however other columns are > fairly selective for certain cases, such as 'port'. When querying on > and unusual port, the query is very fast, and the single UNIONed > subselect returns quickly. When 'port' is not very selective (like > port = '80', which is roughly 1/2 of the rows in the DB), the dual > subselect query wins, hands-down. > > And I have altered the statistics via the config file: > default_statistics_target = 100 > Perhaps this should be even higher for certain columns? You're probably better off leaving it at 10 and upping it for the vital columns. 25 for names should be a good choice. You could try partial indexes for those cases where you have particularly common values of name/port: CREATE INDEX idx1 ON host_events (ip) WHERE port=80; -- Richard Huxton Archonet Ltd
Here are some numbers for 3 different queries using a very selective query (port = 2222). I'm thinking that, since the row estimates are different from the actuals (2 vs 2000), that this particular port didn't make it into the statistics ... is that true? Does this matter? If this isn't enough data, I can type up some more. One thing that I forgot to mention is that the network_events and host_events tables are partitioned by first octet of the IP, so when I say "various scans of ..." that means that there is a scan of each of the partitions, the type determined by the planner and the statistics, I assume. ** Dual subselects: SELECT * FROM ip_info WHERE ip IN (SELECT ip FROM network_events WHERE port = 2222) OR ip IN (SELECT ip FROM host_events WHERE port = 2222); Seq scan on ip_info (cost=2776..354575 rows=9312338 width=72) (actual time=34..8238 rows=234 loops=1) Filter: ... AND ((hashed_subplan) OR (hashed_subplan)) Subplan -> Result (cost=0..849 rows=459 width=4) (actual time=0.176..2.310 rows=72 loops=1) -> Append (cost=0.00..849 rows=459 width=4) (actual time=0.173..2.095 rows=72 loops=1) -> various scans on host_events -> Result (cost=0..1923 rows=856 width=4) (actual time=0.072..24.614 rows=2393 loops=1) -> Append (cost=0..1923 rows=856 width=4) (actual time=0.069..27 rows=2393 loops=1) -> various scans on network_events ** Single subselect: SELECT * FROM ip_info WHERE ip IN (SELECT DISTINCT ip FROM network_events WHERE port = 2222 UNION SELECT DISTINCT ip FROM host_events WHERE port = 2222); Nested Loop (cost=2841..2856 rows=2 width=72) (actual time=55..106 rows=2349 loops=1) -> Unique (cost=2841..2856 rows=2 width=72) (actual time=55..66 rows=2349 loops=1) -> Sort (cost=2841..2841 rows=2 width=4) (actual time=55..58 rows=2401 loops=1) -> Append (cost=1956..2841 rows=2 width=4) (actual time=29..50 rows=2401 loops=1) -> Unique (cost=1956..1959 rows=2 width=4) (actual time=29..50 rows=2401 loops=1) -> Sort -> Result -> Append -> various scans on network_events -> Unique (cost=869..871 rows=1 width=4) (actual time=2.9..3.3 rows=70 loops=1) -> Sort -> Result -> Append -> various scans on host_events ** The join: SELECT distinct ip_info.* FROM ip_info RIGHT JOIN network_events USING (ip) RIGHT JOIN host_events USING (ip) WHERE (network_events.port=2222 OR host_events.port=2222) Unique (cost=9238..9367 rows=1965 width=72) (actual time=61..61 rows=52 loops=1) -> Sort (cost=9238..9288 rows=1965 width=72) (actual time=61..61 rows=63 loops=1) -> Hash Join (cost=850..9176 rows=1965 width=76) (actual time=0..54 rows=2393 loops=1) -> Nested Loop Left Join (cost=0..8205 rows=856 width=76) (actual time=0..54 rows=2393 loops=1) -> Append -> various scans of network_events -> Index Scan of ip_info (cost=0..7 rows=1 width=72) (actual time=0..0 rows=1 loops 2393) ->Hash (cost=849..849 rows=459 width=4) (actual time=0..2 rows=72 loops=1) -> Append ->various scans of host_events On 2/1/07, Michael Artz <mlartz@gmail.com> wrote: > > > I have primary table that holds ip information > > > and two other tables that hold event data for the specific IP in with > > > a one-to-many mapping between them, ie: > > [snip] > > > There is quite a bit of commonality between the network_events and > > > host_events schemas, but they do not currently share an ancestor. > > > ip_info has about 13 million rows, the network_events table has about > > > 30 million rows, and the host_events table has about 7 million rows. > > > There are indexes on all the rows. > > > > What indexes though. Do you have (name,ip) on the two event tables? > > All the columns are indexed individually. The tables are completely > static, as I reload the whole DB with new data every day. > > > How selective is "name" - are there many different values or just a few? > > If lots, it might be worth increasing the statistics gathered on that > > column (ALTER COLUMN ... SET STATISTICS). > > http://www.postgresql.org/docs/8.2/static/sql-altertable.html > > I guess that is the heart of my question. "name" is not very > selective (there are only 20 or so choices) however other columns are > fairly selective for certain cases, such as 'port'. When querying on > and unusual port, the query is very fast, and the single UNIONed > subselect returns quickly. When 'port' is not very selective (like > port = '80', which is roughly 1/2 of the rows in the DB), the dual > subselect query wins, hands-down. > > And I have altered the statistics via the config file: > default_statistics_target = 100 > Perhaps this should be even higher for certain columns? > > > > The query that I would like to execute is to select all the rows of > > > ip_info that have either network or host events that meet some > > > criteria, i.e. name='blah'. I have 3 different possibilities that I > > > have thought of to execute this. > > > > > > First, 2 'ip IN (SELECT ...)' statements joined by an OR: > > > > > > SELECT * FROM ip_info > > > WHERE ip IN (SELECT ip FROM network_events WHERE name='blah') > > > OR ip IN (SELECT ip FROM host_events WHERE name='blah'); > > > > > > Next, 1 'ip IN (SELECT ... UNION SELECT ...) statement: > > > > > > SELECT * FROM ip_info > > > WHERE ip IN (SELECT ip FROM network_events WHERE name='blah' > > > UNION > > > SELECT ip FROM host_events WHERE name='blah'); > > > > > > Or, finally, the UNION statment with DISTINCTs: > > > > > > SELECT * FROM ip_info > > > WHERE ip IN (SELECT DISTINCT ip FROM network_events WHERE name='blah' > > > UNION > > > SELECT DISTINCT ip FROM host_events WHERE name='blah'); > > > > > > From what I have read, the UNION statement does an implicit DISTINCT, > > > but I thought that doing it on each of the tables would result in > > > slightly faster execution. Can you think of any other ways to > > > implement the previous query? > > > > You're right about removing duplicates. Not sure whether the DISTINCTs > > on the sub-selects are helping or hindering. It'll probably depend on > > your hardware, config, number of rows etc. > > > > The only other way I can think of for this query is to UNION two JOINs. > > Might interact well with the (name,ip) index I mentioned above. > > Nah, that did very poorly. > > > > I have explained/analyzed all the queries but, unfortunately, they are > > > on an isolated computer. The gist is that, for relatively > > > low-incidence values of name, the UNION performs better, but for > > > queries on a common name, the dual-subselect query performs better. > > > > Difficult to say much without seeing the full explain analyse. Did the > > row estimates look reasonable? > > hmm, I think so, but I'm not that good in reading the outputs. I'll > see if I can retype some of the interesting bits of the explain > analyze. >
>> How about this option: >> >> SELECT distinct ip_info.* FROM ip_info RIGHT JOIN network_events USING >> (ip) RIGHT JOIN host_events USING (ip) WHERE >> (network_events.name='blah' OR host_events.name = 'blah') AND >> ip_info.ip IS NOT NULL; MA> Nah, that seems to be much much worse. The other queries usually MA> return in 1-2 minutes, this one has been running for 30 minutes and MA> has still not returned I find that an OR involving two different fields (in this case even different tables) is faster when replaced by the equivalent UNION. In this case--- SELECT distinct ip_info.* FROM ip_info RIGHT JOIN network_events USING (ip) WHERE network_events.name='blah' AND ip_info.ip IS NOT NULL UNION SELECT distinct ip_info.* FROM ip_info RIGHT JOIN host_events USING (ip) WHERE host_events.name = 'blah' AND ip_info.ip IS NOT NULL; Moreover, at least through 8.1, GROUP BY is faster than DISTINCT.
Michael Artz wrote: > Here are some numbers for 3 different queries using a very selective > query (port = 2222). I'm thinking that, since the row estimates are > different from the actuals (2 vs 2000), that this particular port > didn't make it into the statistics ... is that true? Does this > matter? If this isn't enough data, I can type up some more. SELECT * FROM pg_stats WHERE tablename='foo'; This will show you frequency-stats on each column (as generated by analyse). You're interested in n_distinct, most_common_vals, most_common_freqs. > One thing that I forgot to mention is that the network_events and > host_events tables are partitioned by first octet of the IP, so when I > say "various scans of ..." that means that there is a scan of each of > the partitions, the type determined by the planner and the statistics, > I assume. So you've got xxx_events tables partitioned by ip, but ip_info is one table? Do you do a lot of scans across the bottom 3 bytes of the IP? If not, I'm not clear what we're gaining from the partitioning. > ** Dual subselects: > SELECT * FROM ip_info > WHERE ip IN (SELECT ip FROM network_events WHERE port = 2222) > OR ip IN (SELECT ip FROM host_events WHERE port = 2222); > > Seq scan on ip_info (cost=2776..354575 rows=9312338 width=72) (actual > time=34..8238 rows=234 loops=1) > Filter: ... AND ((hashed_subplan) OR (hashed_subplan)) Well, the estimate here is rubbish - 9.3 million rows whereas we actually get 234. Now we know you're likely to get a lot of overlap, and the planner might not realise that. Still - that looks very bad. Of course, because it's expecting so many rows a seq-scan of ip_info looks like a good choice to it. > ** Single subselect: > > SELECT * FROM ip_info > WHERE ip IN (SELECT DISTINCT ip FROM network_events WHERE port = 2222 > UNION > SELECT DISTINCT ip FROM host_events WHERE port = 2222); > > Nested Loop (cost=2841..2856 rows=2 width=72) (actual time=55..106 > rows=2349 loops=1) This is clearly a lot better, Not sure whether the DISTINCT in each subquery works or not. > ** The join: > > SELECT distinct ip_info.* > FROM ip_info RIGHT JOIN network_events USING (ip) > RIGHT JOIN host_events USING (ip) > WHERE (network_events.port=2222 OR host_events.port=2222) > > Unique (cost=9238..9367 rows=1965 width=72) (actual time=61..61 rows=52 > loops=1) > -> Sort (cost=9238..9288 rows=1965 width=72) (actual time=61..61 > rows=63 loops=1) OK, so what do the plans look like for port=80 or something larger like that? Then try adding an index to the various host/network_events tables CREATE INDEX ... ON ... (ip) WHERE port=80; -- Richard Huxton Archonet Ltd