Re: Subselect query enhancement - Mailing list pgsql-performance

From Ted Allen
Subject Re: Subselect query enhancement
Date
Msg-id 45C22353.4020902@blackducksoftware.com
Whole thread Raw
In response to Subselect query enhancement  ("Michael Artz" <mlartz@gmail.com>)
Responses Re: Subselect query enhancement
List pgsql-performance
I've found that doing joins seems to produce better results on the big
tables queries I use.  This is not always the case though.

How about this option:

SELECT distinct ip_info.* FROM ip_info RIGHT JOIN network_events USING
(ip) RIGHT JOIN  host_events USING (ip) WHERE
(network_events.name='blah' OR host_events.name = 'blah')  AND
ip_info.ip IS NOT NULL;

That gets rid of the sub-queries your using that look pretty costly.

Michael Artz wrote:
> I'm needing help determining the best all-around query for the
> following situation.  I have primary table that holds ip information
> and two other tables that hold event data for the specific IP in with
> a one-to-many mapping between them, ie:
>
> CREATE TABLE ip_info (
>    ip IP4,
>    --other data
> );
>
> CREATE TABLE network_events (
>    ip IP4 NOT NULL REFERENCES ip_info(ip),
>    name VARCHAR,
>    port INTEGER,
>    --other data
> );
>
> CREATE TABLE host_events (
>    ip IP4 NOT NULL REFERENCES ip_info(ip),
>    name VARCHAR
>    port INTEGER,
>    --other data
> );
>
> There is quite a bit of commonality between the network_events and
> host_events schemas, but they do not currently share an ancestor.
> ip_info has about 13 million rows, the network_events table has about
> 30 million rows, and the host_events table has about 7 million rows.
> There are indexes on all the rows.
>
> The query that I would like to execute is to select all the rows of
> ip_info that have either  network or host events that meet some
> criteria, i.e. name='blah'.  I have 3 different possibilities that I
> have thought of to execute this.
>
> First, 2 'ip IN (SELECT ...)' statements joined by an OR:
>
> SELECT * FROM ip_info
>  WHERE ip IN (SELECT ip FROM network_events WHERE name='blah')
>        OR ip IN (SELECT ip FROM host_events WHERE name='blah');
>
> Next, 1 'ip IN (SELECT ... UNION SELECT ...) statement:
>
> SELECT * FROM ip_info
>  WHERE ip IN (SELECT ip FROM network_events WHERE name='blah'
>              UNION
>              SELECT ip FROM host_events WHERE name='blah');
>
> Or, finally, the UNION statment with DISTINCTs:
>
> SELECT * FROM ip_info
>  WHERE ip IN (SELECT DISTINCT ip FROM network_events WHERE name='blah'
>              UNION
>              SELECT DISTINCT ip FROM host_events WHERE name='blah');
>
>> From what I have read, the UNION statement does an implicit DISTINCT,
> but I thought that doing it on each of the tables would result in
> slightly faster execution.  Can you think of any other ways to
> implement the previous query?
>
> I have explained/analyzed all the queries but, unfortunately, they are
> on an isolated computer.  The gist is that, for relatively
> low-incidence values of name, the UNION performs better, but for
> queries on a common name, the dual-subselect query performs better.
>
> The explains look something like:
> Dual-subselect:
> Seq scan on ip_info
>  Filter:  ... AND ((hashed_subplan) OR (hashed_subplan))
>  Subplan
>    -> Result
>      -> Append
>        -> various scans on host_events
>    -> Result
>      -> Append
>        -> various scans on network_events
>
> UNION SELECT DISTINCT:
>
> Nested Loop
>  -> Unique
>    -> Sort
>      -> Append
>        -> Unique
>          -> Sort
>            -> Result
>              -> Append
>                -> various scans on host_events
>        -> Unique
>          -> Sort
>            -> Result
>              -> Append
>                -> various scans on network_events
>
> If it would help to have more information, I could retype some of
> numbers in the explain.
>
> Any ideas?
>
> Thanks,
> -Mike
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings
>


--

*Edward Allen*
Software Engineer
Black Duck Software, Inc.

tallen@blackducksoftware.com <mailto:tallen@blackducksoftware.com>
T +1.781.891.5100 x133
F +1.781.891.5145
http://www.blackducksoftware.com


pgsql-performance by date:

Previous
From: Richard Huxton
Date:
Subject: Re: Subselect query enhancement
Next
From: "Michael Artz"
Date:
Subject: Re: Subselect query enhancement