Re: Subselect query enhancement - Mailing list pgsql-performance
From | Ted Allen |
---|---|
Subject | Re: Subselect query enhancement |
Date | |
Msg-id | 45C22353.4020902@blackducksoftware.com Whole thread Raw |
In response to | Subselect query enhancement ("Michael Artz" <mlartz@gmail.com>) |
Responses |
Re: Subselect query enhancement
|
List | pgsql-performance |
I've found that doing joins seems to produce better results on the big tables queries I use. This is not always the case though. How about this option: SELECT distinct ip_info.* FROM ip_info RIGHT JOIN network_events USING (ip) RIGHT JOIN host_events USING (ip) WHERE (network_events.name='blah' OR host_events.name = 'blah') AND ip_info.ip IS NOT NULL; That gets rid of the sub-queries your using that look pretty costly. Michael Artz wrote: > I'm needing help determining the best all-around query for the > following situation. I have primary table that holds ip information > and two other tables that hold event data for the specific IP in with > a one-to-many mapping between them, ie: > > CREATE TABLE ip_info ( > ip IP4, > --other data > ); > > CREATE TABLE network_events ( > ip IP4 NOT NULL REFERENCES ip_info(ip), > name VARCHAR, > port INTEGER, > --other data > ); > > CREATE TABLE host_events ( > ip IP4 NOT NULL REFERENCES ip_info(ip), > name VARCHAR > port INTEGER, > --other data > ); > > There is quite a bit of commonality between the network_events and > host_events schemas, but they do not currently share an ancestor. > ip_info has about 13 million rows, the network_events table has about > 30 million rows, and the host_events table has about 7 million rows. > There are indexes on all the rows. > > The query that I would like to execute is to select all the rows of > ip_info that have either network or host events that meet some > criteria, i.e. name='blah'. I have 3 different possibilities that I > have thought of to execute this. > > First, 2 'ip IN (SELECT ...)' statements joined by an OR: > > SELECT * FROM ip_info > WHERE ip IN (SELECT ip FROM network_events WHERE name='blah') > OR ip IN (SELECT ip FROM host_events WHERE name='blah'); > > Next, 1 'ip IN (SELECT ... UNION SELECT ...) statement: > > SELECT * FROM ip_info > WHERE ip IN (SELECT ip FROM network_events WHERE name='blah' > UNION > SELECT ip FROM host_events WHERE name='blah'); > > Or, finally, the UNION statment with DISTINCTs: > > SELECT * FROM ip_info > WHERE ip IN (SELECT DISTINCT ip FROM network_events WHERE name='blah' > UNION > SELECT DISTINCT ip FROM host_events WHERE name='blah'); > >> From what I have read, the UNION statement does an implicit DISTINCT, > but I thought that doing it on each of the tables would result in > slightly faster execution. Can you think of any other ways to > implement the previous query? > > I have explained/analyzed all the queries but, unfortunately, they are > on an isolated computer. The gist is that, for relatively > low-incidence values of name, the UNION performs better, but for > queries on a common name, the dual-subselect query performs better. > > The explains look something like: > Dual-subselect: > Seq scan on ip_info > Filter: ... AND ((hashed_subplan) OR (hashed_subplan)) > Subplan > -> Result > -> Append > -> various scans on host_events > -> Result > -> Append > -> various scans on network_events > > UNION SELECT DISTINCT: > > Nested Loop > -> Unique > -> Sort > -> Append > -> Unique > -> Sort > -> Result > -> Append > -> various scans on host_events > -> Unique > -> Sort > -> Result > -> Append > -> various scans on network_events > > If it would help to have more information, I could retype some of > numbers in the explain. > > Any ideas? > > Thanks, > -Mike > > ---------------------------(end of broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings > -- *Edward Allen* Software Engineer Black Duck Software, Inc. tallen@blackducksoftware.com <mailto:tallen@blackducksoftware.com> T +1.781.891.5100 x133 F +1.781.891.5145 http://www.blackducksoftware.com
pgsql-performance by date: