Thread: Slow query with sub-select

Slow query with sub-select

From
- -
Date:
The following query seems to take ages despite the EXPLAIN stating that an index is used.
Also, the condition (WHERE t.mid = q.mid) should be a one-to-one mapping, should it not? In this case the mapping is to 3641527 rows.

Table q has no indexes and not referenced by other tables.  
Table t has an index on column mid.

Does anyone know why the query is slow?


SELECT COUNT(*) FROM q
      WHERE NOT EXISTS (SELECT 1
                          FROM t AS t
                         WHERE t.mid = q.mid);


                                              QUERY PLAN                                                
< font class="Apple-style-span" face="Tahoma" size="2">---------------------------------------------------------------------------------------------------------
 Aggregate  (cost=10021304028.93..10021304028.94 rows=1 width=0)
   ->  Hash Anti Join  (cost=10007145480.54..10021028896.24 rows=110053073 width=0)
         Hash Cond: ((q.mid)::text = (t.mid)::text)
         ->  Seq Scan on q (cost=10000000000.00..10007993328.46 rows=220106146 width=38)
         ->  Hash  (cost=7083958.46..7083958.46 rows=364 1527 width=10)
               ->  Index Scan using t_pkey on t  (cost=0.00..7083958.46 rows=3641527 width=10)
(6 rows)

Re: Slow query with sub-select

From
David Johnston
Date:


On Jul 16, 2011, at 6:32, - - <loh.law@hotmail.com> wrote:

The following query seems to take ages despite the EXPLAIN stating that an index is used.
Also, the condition (WHERE t.mid = q.mid) should be a one-to-one mapping, should it not? In this case the mapping is to 3641527 rows.

Table q has no indexes and not referenced by other tables.  
Table t has an index on column mid.

Does anyone know why the query is slow?


SELECT COUNT(*) FROM q
      WHERE NOT EXISTS (SELECT 1
                          FROM t AS t
                         WHERE t.mid = q.mid);


                                              QUERY PLAN                                                
< font class="Apple-style-span" face="Tahoma" size="2">---------------------------------------------------------------------------------------------------------
 Aggregate  (cost=10021304028.93..10021304028.94 rows=1 width=0)
   ->  Hash Anti Join  (cost=10007145480.54..10021028896.24 rows=110053073 width=0)
         Hash Cond: ((q.mid)::text = (t.mid)::text)
         ->  Seq Scan on q (cost=10000000000.00..10007993328.46 rows=220106146 width=38)
         ->  Hash  (cost=7083958.46..7083958.46 rows=364 1527 width=10)
               ->  Index Scan using t_pkey on t  (cost=0.00..7083958.46 rows=3641527 width=10)
(6 rows)

1. Indexes are not magical; their usage does not guarantee a fast query
2. It is slow because you have no non-join where condition and around 225 MILLION rows that need to be evaluated.
3. Also, you are using a correlated sub-query instead of a LEFT OUTER JOIN
4. You haven't provided table definitions with indexes and so whether q.mid=t.mid is a 1-to-1 optional relationship is unknowable.  Hell, since the names are meaningless we cannot even guess what kind of relationship the tables should have.  The generic "mid" field name has the same problem.

David J.


Re: Slow query with sub-select

From
- -
Date:


On Jul 16, 2011, at 6:32, - - <loh.law@hotmail.com> wrote:

The following query seems to take ages despite the EXPLAIN stating that an index is used.
Also, the condition (WHERE t.mid = q.mid) should be a one-to-one mapping, should it not? In this case the mapping is to 3641527 rows.

Table q has no indexes and not referenced by other tables.  
Table t has an index on column mid.

Does anyone know why the query is slow?


SELECT COUNT(*) FROM q
      WHERE NOT EXISTS (SELECT 1
                          FROM t AS t
                         WHERE t.mid = q.mid);


                                              QUERY PLAN                                                
< font class="Apple-style-span" face="Tahoma" size="2">---------------------------------------------------------------------------------------------------------
 Aggregate  (cost=10021304028.93..10021304028.94 rows=1 width=0)
   ->  Hash Anti Join  (cost=10007145480.54..10021028896.24 rows=110053073 width=0)
         Hash Cond: ((q.mid)::text = (t.mid)::text)
         ->  Seq Scan on q (cost=10000000000.00..10007993328.46 rows=220106146 width=38)
         ->  Hash  (cost=7083958.46..7083958. 46 rows=364 1527 width=10)
               ->  Index Scan using t_pkey on t  (cost=0.00..7083958.46 rows=3641527 width=10)
(6 rows)

1. Indexes are not magical; their usage does not guarantee a fast query
2. It is slow because you have no non-join where condition and around 225 MILLION rows that need to be evaluated.
3. Also, you are using a correlated sub-query instead of a LEFT OUTER JOIN
4. You haven't provided table definitions with indexes and so whether q.mid=t.mid is a 1-to-1 optional relationship is unknowable.  Hell, since the names are meaningless we cannot even guess what kind of relationship the tables should have.  The generic "mid" field name has the same problem.

David J.



The weird thing is that before I updated my server the query was about 5 times faster.

I've googled and I think the problem lies with the under-estimation of the query planner about the number of rows in the nested table.
I will be trying the 'set enable_seqscan = false' solution to see if that'll improve.

Re: Slow query with sub-select

From
Tom Lane
Date:
- - <loh.law@hotmail.com> writes:
> The weird thing is that before I updated my server the query was about 5 times faster.
> I've googled and I think the problem lies with the under-estimation of the query planner about the number of rows in
thenested table.I will be trying the 'set enable_seqscan = false' solution to see if that'll improve.
        

You evidently already do have that turned off.  I'd suggest reverting
that change (ie, allow seqscan) and instead increase work_mem enough
so that the hash join can work without spilling to disk.  This query
is a perfect example of where indexes do not help, and trying to force
them to be used makes things slower not faster.

            regards, tom lane

Re: Slow query with sub-select

From
Michael Nolan
Date:


2011/7/16 - - <loh.law@hotmail.com>

The weird thing is that before I updated my server the query was about 5 times faster. 

Updated it from what to what, and how?  
--
Mike Nolan
nolan@tssi.com

Re: Slow query with sub-select

From
- -
Date:
> - - <loh.law@hotmail.com> writes:
> > The weird thing is that before I updated my server the query was about 5 times faster.
> > I've googled and I think the problem lies with the under-estimation of the query planner about the number of rows in the nested table.I will be trying the 'set enable_seqscan = false' solution to see if that'll improve.
>
> You evidently already do have that turned off. I'd suggest reverting
> that change (ie, allow seqscan) and instead increase work_mem enough
> so that the hash join can work without spilling to disk. This query
> is a perfect example of where indexes do not help, and trying to force
> them to be used makes things slower not faster.
>
> regards, tom lane


< div>
I have switched on seqscan and increased work_mem to 1GB ... but no luck so far.

The version I'm using is PostgreSQL 8.4.8 on i686-pc-linux-gnu, compiled by GCC gcc-4.5.real (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2, 32-bit
Before that I used an earlier minor version (8.4.x - I don't remember what x is but it was the one packaged in the version before Ubuntu Natty).


These are the relevant schemas.

< div>CREATE TABLE q (
  mid VARCHAR(10) NOT NULL
  ...
);

CREATE TABLE t (
  mid VARCHAR(10) NOT NULL PRIMARY KEY
  ...
);

I would like to count rows in q whose mid does not exist in t.

This is the query I used.

SELECT COUNT(*) FROM q
      WHERE NOT EXISTS (SELECT 1
                          FROM t
                         WHERE t.mid = q.mid);

Based on my understanding, I believe t he query will loop through each row in q (which has about 500m rows) and for each row it will check a one-to-one mapping against t (which has about 3m rows) by using an index scan on t (mid).

However, the EXPLAIN outputs for seqscan = on and seqscan = off, respectively, seem to indicate that it is not a one-to-one mapping of t.mid and q.mid.

I then switched the comparison operator in the where clause as follows:

SELECT COUNT(*) FROM q
      WHERE NOT EXISTS (SELECT 1
                          FROM t
                         WHERE q.mid = t.mid);

As there is no index on q (mid) this type of query should take a considerably longer time.  However, the EXPLAIN outputs seem to be the same.  
Here they are:

With seqscan = on
&nbs p;                                    QUERY PLAN                                      
-------------------------------------------------------------------------------------
 Aggregate  (cost=18566199.92..18566199.93 rows=1 width=0)
   ->  Hash Anti Join  (cost=747023.15..18566199.91 rows=1 width=0)
         Hash Cond: ((q.mid)::text = (t.mid)::text)
         ->  Seq Scan on q  (cost=0.00..11451989.24 rows=565972224 width=10)
         ->  Hash  (cost=701775.29..701775.29 rows=3619829 width=10)
               ->  Seq Scan on t  (cost=0.00..701775.29 rows=3619829 width=10)
(6 rows)


With seqscan = off
                                              QUERY PLAN                                               
-------------------------------------------------------------------------------------------------------
 Aggregate  (cost=10024599995.73..10024599995.74 rows=1 width=0)
   ->  Hash Anti Join  (cost=10006780818.96..10024599995.72 rows=1 width=0)
         Hash Cond: ((q.mid)::text = (t.mid)::text)
         ->  Seq Scan on q  (cost=1000000 0000.00..10011451989.24 rows=565972224 width=10)
         ->  Hash  (cost=6735571.10..6735571.10 rows=3619829 width=10)
               ->  Index Scan using t_pkey on t  (cost=0.00..6735571.10 rows=3619829 width=10)
(6 rows)


Any help is greatly appreciated as this problem has been depressing me for two weeks.

Re: Slow query with sub-select

From
Rick Genter
Date:

On Jul 16, 2011, at 4:14 PM, - - wrote:

I would like to count rows in q whose mid does not exist in t.

I would write such a query like this:

SELECT COUNT(*)
   FROM q
      LEFT OUTER JOIN t
         ON (t.mid = q.mid)
WHERE t.mid IS NULL;

And I would make sure there was an index on t.mid. (And for 9.2, as I understand it, q.mid as well, since I believe in 9.2 PostgreSQL will be able to compute the result strictly from the indexes without hitting the base tables.)