Thread: Poor Performance with Distinct Subqueries with EXISTS and EXCEPT

Poor Performance with Distinct Subqueries with EXISTS and EXCEPT

From
Thomas F.O'Connell
Date:
I'm trying to do some research and reporting for an email application
by domain name. This has led to a confounding attempt to do any of the
legwork in SQL via postgres.

Here is my foundational query:

SELECT DISTINCT split_part( u.email, '@', 2 )
FROM user AS u, message AS m
WHERE u.id = m.user_id
AND NOT EXISTS (
    SELECT 1
    FROM message_action ma
    WHERE ma.message_id = m.id
)
EXCEPT
SELECT DISTINCT split_part( u.email, '@', 2 )
FROM user AS u, message AS m
WHERE u.id = m.user_id
AND EXISTS (
    SELECT 1
    FROM message_action ma
    WHERE ma.message_id = m.id
)

This is designed to give me unique domain names for all users who have
not committed an action on an email. The way I figured this needed to
work was to get all members joined to messages that didn't have an
action. That's the part of the query before the EXCEPT. The NOT EXISTS
subquery checks that there is no action on the message.

Now, since I'm actually interested in unique domain names rather than
unique users, I need to get all the unique domain names corresponding
to users who have acted on a message and subtract them. That's what the part
of the query after the EXCEPT does.

This query performs abysmally for even small numbers of users and
messages (each on the order of 1-10 thousand). Honestly, I have not
gotten it to finish for even these small cases. In one situation on a
development database, it filled up $PGDATA/base. This is another
instance of my surprising myself with my ability to write slow-running
queries: I don't fully understand the interaction between postgres and
what I think I'm asking of it in SQL.

I'd love to deepen my knowledge of joins in general, so if anyone has
any suggestions for improvements on the EXISTS checks, I'm eager
to learn alternatives. One thing I don't fully understand is why all
the scans are Seq Scans. These IDs are all integer primary/foreign keys
(with indexes).

Here's an example of a plan for a user with no messages:


                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 SetOp Except  (cost=479.01..479.02 rows=1 width=24) (actual time=176.258..176.258 rows=0 loops=1)
   ->  Sort  (cost=479.01..479.02 rows=2 width=24) (actual time=176.253..176.253 rows=0 loops=1)
         Sort Key: split_part
         ->  Append  (cost=239.48..479.00 rows=2 width=24) (actual time=176.230..176.230 rows=0 loops=1)
               ->  Subquery Scan "*SELECT* 1"  (cost=239.48..239.50 rows=1 width=24) (actual time=90.390..90.390 rows=0 loops=1)
                     ->  Unique  (cost=239.48..239.49 rows=1 width=24) (actual time=90.386..90.386 rows=0 loops=1)
                           ->  Sort  (cost=239.48..239.49 rows=1 width=24) (actual time=90.382..90.382 rows=0 loops=1)
                                 Sort Key: split_part((u.email)::text, '@'::text, 2)
                                 ->  Nested Loop  (cost=0.00..239.47 rows=1 width=24) (actual time=90.370..90.370 rows=0 loops=1)
                                       ->  Seq Scan on user u  (cost=0.00..239.46 rows=1 width=24) (actual time=0.096..36.555 rows=10117 loops=1)
                                             Filter: (id = user_id)
                                       ->  Seq Scan on message m  (cost=0.00..0.00 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=10117)
                                             Filter: (NOT (subplan))
                                             SubPlan
                                               ->  Seq Scan on message_action ma  (cost=0.00..0.00 rows=1 width=0) (never executed)
                                                     Filter: (message_id = $0)
               ->  Subquery Scan "*SELECT* 2"  (cost=239.48..239.50 rows=1 width=24) (actual time=85.835..85.835 rows=0 loops=1)
                     ->  Unique  (cost=239.48..239.49 rows=1 width=24) (actual time=85.830..85.830 rows=0 loops=1)
                           ->  Sort  (cost=239.48..239.49 rows=1 width=24) (actual time=85.828..85.828 rows=0 loops=1)
                                 Sort Key: split_part((u.email)::text, '@'::text, 2)
                                 ->  Nested Loop  (cost=0.00..239.47 rows=1 width=24) (actual time=85.792..85.792 rows=0 loops=1)
                                       ->  Seq Scan on user u  (cost=0.00..239.46 rows=1 width=24) (actual time=0.017..32.322 rows=10117 loops=1)
                                             Filter: (id = user_id)
                                       ->  Seq Scan on message m  (cost=0.00..0.00 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=10117)
                                             Filter: (subplan)
                                             SubPlan
                                               ->  Seq Scan on message_action ma  (cost=0.00..0.00 rows=1 width=0) (never executed)
                                                     Filter: (message_id = $0)
 Total runtime: 176.691 ms
(29 rows)

Here's version information: PostgreSQL 7.4.6 on i686-pc-linux-gnu,
compiled by GCC 2.95.4

Thanks.

-tfo


Re: Poor Performance with Distinct Subqueries with EXISTS and EXCEPT

From
Pierre-Frédéric Caillaud
Date:
> Now, since I'm actually interested in unique domain names rather than
> unique users, I need to get all the unique domain names corresponding to
> users who have acted on a message and subtract them. That's what the part
> of the query after the EXCEPT does.

    I don't understand this part at all. What does it mean?
    I may be mistaken, but you may be doing the same thing twice: you're
basically writing:

    SELECT DISTINCT X WHERE Y EXCEPT SELECT DISTINCT X WHERE NOT Y
    Is this not a way to get an empty result set?

    Let's retake your query from the start. At each step you should EXPLAIN
ANALYZE the query to check that it runs smoothly.

    1. You want the messages which have no actions. Rather than a subselect,
I'd use a LEFT JOIN:

    untested syntax:
    SELECT m.id FROM message m LEFT JOIN message_action ma ON
m.id = ma.message_id WHERE ma.message_id IS NULL;

    On my machine, I have a zones table with 3000 rows and a cities table
with 2 million rows, each place having a zone_id:

EXPLAIN ANALYZE SELECT z.zone_id FROM geo.zones z LEFT JOIN geo.cities c ON c.zone_id=z.zone_id WHERE c.id IS NULL;
  Merge Left Join  (cost=0.00..142063.06 rows=3663 width=4) (actual time=8726.203..8726.203 rows=0 loops=1)
    Merge Cond: ("outer".zone_id = "inner".zone_id)
    Filter: ("inner".id IS NULL)
    ->  Index Scan using zones_pkey on zones z  (cost=0.00..99.10 rows=3663 width=4) (actual time=15.027..43.987 rows=3663 loops=1)
    ->  Index Scan using cities_zones_idx on cities c  (cost=0.00..116030.55 rows=2073935 width=8) (actual time=25.164..5823.496 rows=2073935 loops=1)
  Total runtime: 8726.327 ms
(6 rows)

    Eight seconds; that gives you an idea with that many records.
    You should check that your indexes are used!

    Now that you have the messages which have no actions, you must get the
users' email domains:

    SELECT split_part( u.email, '@', 2 ) as domain
        FROM user u, message m
        LEFT JOIN message_action ma ON m.id = ma.message_id
        WHERE u.id = m.user_id
            AND ma.message_id IS NULL;

    Can you time this query? Are the indexes used?
    Now, let's remove the duplicates:

    SELECT split_part( u.email, '@', 2 ) as domain
        FROM user u, message m
        LEFT JOIN message_action ma ON m.id = ma.message_id
        WHERE u.id = m.user_id
            AND ma.message_id IS NULL
        GROUP BY domain;

    GROUP BY is faster than DISTINCT (in some cases).
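    For comparison, the equivalent DISTINCT form of the same query would be
(again untested):

    SELECT DISTINCT split_part( u.email, '@', 2 ) as domain
        FROM user u, message m
        LEFT JOIN message_action ma ON m.id = ma.message_id
        WHERE u.id = m.user_id
            AND ma.message_id IS NULL;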

    How does it go?


Re: Poor Performance with Distinct Subqueries with EXISTS and EXCEPT

From
Tom Lane
Date:
Pierre-Frédéric Caillaud <lists@boutiquenumerique.com> writes:
>     I may be mistaken, but you may be doing the same thing twice: you're
> basically writing:

>     SELECT DISTINCT X WHERE Y EXCEPT SELECT DISTINCT X WHERE NOT Y
>     Is this not a way to get an empty result set?

No, because some X values may appear in rows where Y, and also in rows
where NOT Y.

The DISTINCTs are wastes of time, though, because EXCEPT implies
elimination of duplicates.
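In other words, the same result can be had by simply dropping them (untested,
using the original table names):

SELECT split_part(u.email, '@', 2)
FROM user AS u, message AS m
WHERE u.id = m.user_id
AND NOT EXISTS (
    SELECT 1 FROM message_action ma WHERE ma.message_id = m.id
)
EXCEPT
SELECT split_part(u.email, '@', 2)
FROM user AS u, message AS m
WHERE u.id = m.user_id
AND EXISTS (
    SELECT 1 FROM message_action ma WHERE ma.message_id = m.id
)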

            regards, tom lane

Re: Poor Performance with Distinct Subqueries with EXISTS and EXCEPT

From
Thomas F.O'Connell
Date:
I wasn't sure whether EXCEPT would create a unique set from among the
results of both queries.

As in, if the first part of the query (before the EXCEPT clause),
without the DISTINCT, yielded

yahoo.com
yahoo.com

would the query reduce that to a single yahoo.com regardless of whether
it showed up in the part after the EXCEPT?
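For instance, would something as trivial as this (untested) come back as a
single yahoo.com row?

SELECT 'yahoo.com' UNION ALL SELECT 'yahoo.com'
EXCEPT
SELECT 'gmail.com';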

-tfo

--
Thomas F. O'Connell
Co-Founder, Information Architect
Sitening, LLC
http://www.sitening.com/
110 30th Avenue North, Suite 6
Nashville, TN 37203-6320
615-260-0005


Re: Poor Performance with Distinct Subqueries with EXISTS and EXCEPT

From
Thomas F.O'Connell
Date:
Pierre,

Your re-write makes a lot of sense. Thanks! It's not using indexes for
some reason, and discovering why will be my next challenge.

-tfo



Re: Poor Performance with Distinct Subqueries with EXISTS and EXCEPT

From
Thomas F.O'Connell
Date:
Interestingly, I tried the new version both with and without enable_seqscan,
and the seq-scan version performs better because, I think, the query returns
enough of the tables that index lookups wouldn't help.
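For reference, I just toggled the planner setting per session to compare
plans, along these lines:

-- disable sequential scans for the current session only
SET enable_seqscan TO off;
-- (re-run EXPLAIN ANALYZE on the query here, then switch back)
SET enable_seqscan TO on;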

Thanks again for your help. This is certainly an improvement over my
original version.

-tfo

