Subselect query enhancement - Mailing list pgsql-performance

From Michael Artz
Subject Subselect query enhancement
Date
Msg-id e9c163070702010842ja639378v75e8c0eb757d9b72@mail.gmail.com
Whole thread Raw
Responses Re: Subselect query enhancement  (Richard Huxton <dev@archonet.com>)
Re: Subselect query enhancement  (Ted Allen <tallen@blackducksoftware.com>)
Re: Subselect query enhancement  (Frank Wiles <frank@wiles.org>)
List pgsql-performance
I'm needing help determining the best all-around query for the
following situation.  I have primary table that holds ip information
and two other tables that hold event data for the specific IP in with
a one-to-many mapping between them, ie:

CREATE TABLE ip_info (
    ip IP4,
    --other data
);

CREATE TABLE network_events (
    ip IP4 NOT NULL REFERENCES ip_info(ip),
    name VARCHAR,
    port INTEGER,
    --other data
);

CREATE TABLE host_events (
    ip IP4 NOT NULL REFERENCES ip_info(ip),
    name VARCHAR
    port INTEGER,
    --other data
);

There is quite a bit of commonality between the network_events and
host_events schemas, but they do not currently share an ancestor.
ip_info has about 13 million rows, the network_events table has about
30 million rows, and the host_events table has about 7 million rows.
There are indexes on all the rows.

The query that I would like to execute is to select all the rows of
ip_info that have either  network or host events that meet some
criteria, i.e. name='blah'.  I have 3 different possibilities that I
have thought of to execute this.

First, 2 'ip IN (SELECT ...)' statements joined by an OR:

SELECT * FROM ip_info
  WHERE ip IN (SELECT ip FROM network_events WHERE name='blah')
        OR ip IN (SELECT ip FROM host_events WHERE name='blah');

Next, 1 'ip IN (SELECT ... UNION SELECT ...) statement:

SELECT * FROM ip_info
  WHERE ip IN (SELECT ip FROM network_events WHERE name='blah'
              UNION
              SELECT ip FROM host_events WHERE name='blah');

Or, finally, the UNION statment with DISTINCTs:

SELECT * FROM ip_info
  WHERE ip IN (SELECT DISTINCT ip FROM network_events WHERE name='blah'
              UNION
              SELECT DISTINCT ip FROM host_events WHERE name='blah');

From what I have read, the UNION statement does an implicit DISTINCT,
but I thought that doing it on each of the tables would result in
slightly faster execution.  Can you think of any other ways to
implement the previous query?

I have explained/analyzed all the queries but, unfortunately, they are
on an isolated computer.  The gist is that, for relatively
low-incidence values of name, the UNION performs better, but for
queries on a common name, the dual-subselect query performs better.

The explains look something like:
Dual-subselect:
Seq scan on ip_info
  Filter:  ... AND ((hashed_subplan) OR (hashed_subplan))
  Subplan
    -> Result
      -> Append
        -> various scans on host_events
    -> Result
      -> Append
        -> various scans on network_events

UNION SELECT DISTINCT:

Nested Loop
  -> Unique
    -> Sort
      -> Append
        -> Unique
          -> Sort
            -> Result
              -> Append
                -> various scans on host_events
        -> Unique
          -> Sort
            -> Result
              -> Append
                -> various scans on network_events

If it would help to have more information, I could retype some of
numbers in the explain.

Any ideas?

Thanks,
-Mike

pgsql-performance by date:

Previous
From: Mark Stosberg
Date:
Subject: Using statement_timeout as a performance tool?
Next
From: Richard Huxton
Date:
Subject: Re: Subselect query enhancement