Re: Subselect query enhancement - Mailing list pgsql-performance

From Richard Huxton
Subject Re: Subselect query enhancement
Date
Msg-id 45C22223.1090909@archonet.com
Whole thread Raw
In response to Subselect query enhancement  ("Michael Artz" <mlartz@gmail.com>)
Responses Re: Subselect query enhancement  ("Michael Artz" <mlartz@gmail.com>)
List pgsql-performance
Michael Artz wrote:
> I'm needing help determining the best all-around query for the
> following situation.

Not sure whether such a beast exists, but...

 > I have primary table that holds ip information
> and two other tables that hold event data for the specific IP in with
> a one-to-many mapping between them, ie:
[snip]
> There is quite a bit of commonality between the network_events and
> host_events schemas, but they do not currently share an ancestor.
> ip_info has about 13 million rows, the network_events table has about
> 30 million rows, and the host_events table has about 7 million rows.
> There are indexes on all the rows.

What indexes though. Do you have (name,ip) on the two event tables?

How selective is "name" - are there many different values or just a few?
If lots, it might be worth increasing the statistics gathered on that
column (ALTER COLUMN ... SET STATISTICS).
http://www.postgresql.org/docs/8.2/static/sql-altertable.html

> The query that I would like to execute is to select all the rows of
> ip_info that have either  network or host events that meet some
> criteria, i.e. name='blah'.  I have 3 different possibilities that I
> have thought of to execute this.
>
> First, 2 'ip IN (SELECT ...)' statements joined by an OR:
>
> SELECT * FROM ip_info
>  WHERE ip IN (SELECT ip FROM network_events WHERE name='blah')
>        OR ip IN (SELECT ip FROM host_events WHERE name='blah');
>
> Next, 1 'ip IN (SELECT ... UNION SELECT ...) statement:
>
> SELECT * FROM ip_info
>  WHERE ip IN (SELECT ip FROM network_events WHERE name='blah'
>              UNION
>              SELECT ip FROM host_events WHERE name='blah');
>
> Or, finally, the UNION statment with DISTINCTs:
>
> SELECT * FROM ip_info
>  WHERE ip IN (SELECT DISTINCT ip FROM network_events WHERE name='blah'
>              UNION
>              SELECT DISTINCT ip FROM host_events WHERE name='blah');
>
>  From what I have read, the UNION statement does an implicit DISTINCT,
> but I thought that doing it on each of the tables would result in
> slightly faster execution.  Can you think of any other ways to
> implement the previous query?

You're right about removing duplicates. Not sure whether the DISTINCTs
on the sub-selects are helping or hindering. It'll probably depend on
your hardware, config, number of rows etc.

The only other way I can think of for this query is to UNION two JOINs.
Might interact well with the (name,ip) index I mentioned above.

> I have explained/analyzed all the queries but, unfortunately, they are
> on an isolated computer.  The gist is that, for relatively
> low-incidence values of name, the UNION performs better, but for
> queries on a common name, the dual-subselect query performs better.

Difficult to say much without seeing the full explain analyse. Did the
row estimates look reasonable?

--
   Richard Huxton
   Archonet Ltd

pgsql-performance by date:

Previous
From: "Michael Artz"
Date:
Subject: Subselect query enhancement
Next
From: Ted Allen
Date:
Subject: Re: Subselect query enhancement