Thread: High cost of ... where ... not in (select ...)

High cost of ... where ... not in (select ...)

From
Aaron Turner
Date:
I'm trying to figure out how to optimize this query (yes, I ran vacuum/analyze):

musecurity=# explain DELETE FROM muapp.pcap_store WHERE pcap_storeid
NOT IN (SELECT pcap_storeid FROM muapp.pcap_store_log);
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Seq Scan on pcap_store  (cost=4008.22..348521303.54 rows=106532 width=6)
   Filter: (NOT (subplan))
   SubPlan
     ->  Materialize  (cost=4008.22..6765.98 rows=205475 width=4)
           ->  Seq Scan on pcap_store_log  (cost=0.00..3099.75
rows=205475 width=4)
(5 rows)

musecurity=# \d muapp.pcap_store
                                               Table "muapp.pcap_store"
      Column       |          Type          |
      Modifiers
-------------------+------------------------+-------------------------------------------------------------------------
 pcap_storeid      | integer                | not null default
nextval('muapp.pcap_store_pcap_storeid_seq'::regclass)
 filename          | character varying(255) |
 test_run_dutid    | integer                | default 0
 userid            | integer                | not null default 0
 analysis_recordid | bigint                 |
 io_xml            | character varying(255) |
Indexes:
    "pcap_store_pkey" PRIMARY KEY, btree (pcap_storeid)
Foreign-key constraints:
    "pcap_store_analysis_recordid_fkey" FOREIGN KEY
(analysis_recordid) REFERENCES muapp.analysis(recordid) ON DELETE
CASCADE
    "pcap_store_test_run_dutid_fkey" FOREIGN KEY (test_run_dutid)
REFERENCES muapp.test_run_dut(test_run_dutid) ON DELETE CASCADE
    "pcap_store_userid_fkey" FOREIGN KEY (userid) REFERENCES
mucore."user"(recordid) ON DELETE CASCADE

As you see, the sequence scan on pcap_store is killing me, even though
there appears to be a perfectly good index.  Is there a better way
construct this query?

Thanks,
Aaron

--
Aaron Turner
http://synfin.net/
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin

Re: High cost of ... where ... not in (select ...)

From
Alvaro Herrera
Date:
Aaron Turner escribió:
> I'm trying to figure out how to optimize this query (yes, I ran vacuum/analyze):
>
> musecurity=# explain DELETE FROM muapp.pcap_store WHERE pcap_storeid
> NOT IN (SELECT pcap_storeid FROM muapp.pcap_store_log);

What PG version is this?


--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: High cost of ... where ... not in (select ...)

From
Aaron Turner
Date:
On Tue, Jun 16, 2009 at 2:37 PM, Alvaro
Herrera<alvherre@commandprompt.com> wrote:
> Aaron Turner escribió:
>> I'm trying to figure out how to optimize this query (yes, I ran vacuum/analyze):
>>
>> musecurity=# explain DELETE FROM muapp.pcap_store WHERE pcap_storeid
>> NOT IN (SELECT pcap_storeid FROM muapp.pcap_store_log);
>
> What PG version is this?

Doh, just realized I didn't reply back to list.   It's version 8.3.3.

Also, pcap_storeid is unique in pcap_store_log


--
Aaron Turner
http://synfin.net/
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin

Re: High cost of ... where ... not in (select ...)

From
Robert Haas
Date:
On Tue, Jun 16, 2009 at 7:39 PM, Aaron Turner<synfinatic@gmail.com> wrote:
> On Tue, Jun 16, 2009 at 2:37 PM, Alvaro
> Herrera<alvherre@commandprompt.com> wrote:
>> Aaron Turner escribió:
>>> I'm trying to figure out how to optimize this query (yes, I ran vacuum/analyze):
>>>
>>> musecurity=# explain DELETE FROM muapp.pcap_store WHERE pcap_storeid
>>> NOT IN (SELECT pcap_storeid FROM muapp.pcap_store_log);
>>
>> What PG version is this?
>
> Doh, just realized I didn't reply back to list.   It's version 8.3.3.
>
> Also, pcap_storeid is unique in pcap_store_log

Speaking as one who has dealt with this frustration more than once,
you can typically get better performance with something like:

DELETE FROM muapp.pcap_store AS x
FROM muapp.pcap_store a
LEFT JOIN muapp.pcap_store_log b ON a.pcap_store_id = b.pcap_storeid
WHERE x.pcap_storeid = a.pcap_storeid AND b.pcap_storeid IS NULL

This is emphatically lame, but there you have it.  It's first of all
lame that we can't do a better job optimizing NOT-IN, at least when
the expression within the subselect is known to be not-null, and it's
secondly lame that the syntax of DELETE doesn't permit a LEFT JOIN
without a self-JOIN.

</rant>

...Robert

Re: High cost of ... where ... not in (select ...)

From
Aaron Turner
Date:
On Tue, Jun 16, 2009 at 5:30 PM, Robert Haas<robertmhaas@gmail.com> wrote:
> On Tue, Jun 16, 2009 at 7:39 PM, Aaron Turner<synfinatic@gmail.com> wrote:
>> On Tue, Jun 16, 2009 at 2:37 PM, Alvaro
>> Herrera<alvherre@commandprompt.com> wrote:
>>> Aaron Turner escribió:
>>>> I'm trying to figure out how to optimize this query (yes, I ran vacuum/analyze):
>>>>
>>>> musecurity=# explain DELETE FROM muapp.pcap_store WHERE pcap_storeid
>>>> NOT IN (SELECT pcap_storeid FROM muapp.pcap_store_log);
>>>
>>> What PG version is this?
>>
>> Doh, just realized I didn't reply back to list.   It's version 8.3.3.
>>
>> Also, pcap_storeid is unique in pcap_store_log
>
> Speaking as one who has dealt with this frustration more than once,
> you can typically get better performance with something like:
>
> DELETE FROM muapp.pcap_store AS x
> FROM muapp.pcap_store a
> LEFT JOIN muapp.pcap_store_log b ON a.pcap_store_id = b.pcap_storeid
> WHERE x.pcap_storeid = a.pcap_storeid AND b.pcap_storeid IS NULL

That's a syntax error on 8.3.3... I don't see anywhere in the docs
where the delete command allows for multiple FROM statements.  Perhaps
you meant:

 DELETE FROM muapp.pcap_store AS x
        USING muapp.pcap_store AS a
        LEFT JOIN muapp.pcap_store_log b ON a.pcap_storeid =
b.pcap_storeid WHERE x.pcap_storeid = a.pcap_storeid AND
b.pcap_storeid IS NULL;

Is that right?

> This is emphatically lame, but there you have it.  It's first of all
> lame that we can't do a better job optimizing NOT-IN, at least when
> the expression within the subselect is known to be not-null, and it's
> secondly lame that the syntax of DELETE doesn't permit a LEFT JOIN
> without a self-JOIN.

Wow, glad I asked... I never would of figured that out.

--
Aaron Turner
http://synfin.net/
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin

Re: High cost of ... where ... not in (select ...)

From
Robert Haas
Date:
On Tue, Jun 16, 2009 at 9:23 PM, Aaron Turner<synfinatic@gmail.com> wrote:
> On Tue, Jun 16, 2009 at 5:30 PM, Robert Haas<robertmhaas@gmail.com> wrote:
>> On Tue, Jun 16, 2009 at 7:39 PM, Aaron Turner<synfinatic@gmail.com> wrote:
>>> On Tue, Jun 16, 2009 at 2:37 PM, Alvaro
>>> Herrera<alvherre@commandprompt.com> wrote:
>>>> Aaron Turner escribió:
>>>>> I'm trying to figure out how to optimize this query (yes, I ran vacuum/analyze):
>>>>>
>>>>> musecurity=# explain DELETE FROM muapp.pcap_store WHERE pcap_storeid
>>>>> NOT IN (SELECT pcap_storeid FROM muapp.pcap_store_log);
>>>>
>>>> What PG version is this?
>>>
>>> Doh, just realized I didn't reply back to list.   It's version 8.3.3.
>>>
>>> Also, pcap_storeid is unique in pcap_store_log
>>
>> Speaking as one who has dealt with this frustration more than once,
>> you can typically get better performance with something like:
>>
>> DELETE FROM muapp.pcap_store AS x
>> FROM muapp.pcap_store a
>> LEFT JOIN muapp.pcap_store_log b ON a.pcap_store_id = b.pcap_storeid
>> WHERE x.pcap_storeid = a.pcap_storeid AND b.pcap_storeid IS NULL
>
> That's a syntax error on 8.3.3... I don't see anywhere in the docs
> where the delete command allows for multiple FROM statements.  Perhaps
> you meant:
>
>  DELETE FROM muapp.pcap_store AS x
>        USING muapp.pcap_store AS a
>        LEFT JOIN muapp.pcap_store_log b ON a.pcap_storeid =
> b.pcap_storeid WHERE x.pcap_storeid = a.pcap_storeid AND
> b.pcap_storeid IS NULL;
>
> Is that right?

Woops, yes, I think that's it.

(but I don't guarantee that it won't blow up your entire universe, so
test it carefully first)

...Robert

Re: High cost of ... where ... not in (select ...)

From
Aaron Turner
Date:
On Tue, Jun 16, 2009 at 6:36 PM, Robert Haas<robertmhaas@gmail.com> wrote:
> On Tue, Jun 16, 2009 at 9:23 PM, Aaron Turner<synfinatic@gmail.com> wrote:

>>  DELETE FROM muapp.pcap_store AS x
>>        USING muapp.pcap_store AS a
>>        LEFT JOIN muapp.pcap_store_log b ON a.pcap_storeid =
>> b.pcap_storeid WHERE x.pcap_storeid = a.pcap_storeid AND
>> b.pcap_storeid IS NULL;
>>
>> Is that right?
>
> Woops, yes, I think that's it.
>
> (but I don't guarantee that it won't blow up your entire universe, so
> test it carefully first)

Yeah, doing that now... taking a bit longer then I expected (took
~5min on rather slow hardware- everything is on a pair of 10K RAID1
drives), but the result seems correct.

                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Hash Join  (cost=19229.08..29478.99 rows=106492 width=6)
   Hash Cond: (x.pcap_storeid = a.pcap_storeid)
   ->  Seq Scan on pcap_store x  (cost=0.00..5617.84 rows=212984 width=10)
   ->  Hash  (cost=17533.93..17533.93 rows=106492 width=4)
         ->  Hash Left Join  (cost=6371.19..17533.93 rows=106492 width=4)
               Hash Cond: (a.pcap_storeid = b.pcap_storeid)
               Filter: (b.pcap_storeid IS NULL)
               ->  Seq Scan on pcap_store a  (cost=0.00..5617.84
rows=212984 width=4)
               ->  Hash  (cost=3099.75..3099.75 rows=205475 width=4)
                     ->  Seq Scan on pcap_store_log b
(cost=0.00..3099.75 rows=205475 width=4)

I know the costs are just relative, but I assumed
cost=19229.08..29478.99 isn't 5 minutes of effort even on crappy
hardware.  Honestly, not complaining, 5 minutes is acceptable for this
query (it's a one time thing) just surprised is all.

Thanks for the help!

--
Aaron Turner
http://synfin.net/
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin

Re: High cost of ... where ... not in (select ...)

From
Aaron Turner
Date:
On Tue, Jun 16, 2009 at 6:36 PM, Robert Haas<robertmhaas@gmail.com> wrote:
> On Tue, Jun 16, 2009 at 9:23 PM, Aaron Turner<synfinatic@gmail.com> wrote:

>>  DELETE FROM muapp.pcap_store AS x
>>        USING muapp.pcap_store AS a
>>        LEFT JOIN muapp.pcap_store_log b ON a.pcap_storeid =
>> b.pcap_storeid WHERE x.pcap_storeid = a.pcap_storeid AND
>> b.pcap_storeid IS NULL;
>>
>> Is that right?
>
> Woops, yes, I think that's it.
>
> (but I don't guarantee that it won't blow up your entire universe, so
> test it carefully first)

Yeah, doing that now... taking a bit longer then I expected (took
~5min on rather slow hardware- everything is on a pair of 10K RAID1
drives), but the result seems correct.

                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Hash Join  (cost=19229.08..29478.99 rows=106492 width=6)
   Hash Cond: (x.pcap_storeid = a.pcap_storeid)
   ->  Seq Scan on pcap_store x  (cost=0.00..5617.84 rows=212984 width=10)
   ->  Hash  (cost=17533.93..17533.93 rows=106492 width=4)
         ->  Hash Left Join  (cost=6371.19..17533.93 rows=106492 width=4)
               Hash Cond: (a.pcap_storeid = b.pcap_storeid)
               Filter: (b.pcap_storeid IS NULL)
               ->  Seq Scan on pcap_store a  (cost=0.00..5617.84
rows=212984 width=4)
               ->  Hash  (cost=3099.75..3099.75 rows=205475 width=4)
                     ->  Seq Scan on pcap_store_log b
(cost=0.00..3099.75 rows=205475 width=4)

I know the costs are just relative, but I assumed
cost=19229.08..29478.99 isn't 5 minutes of effort even on crappy
hardware.  Honestly, not complaining, 5 minutes is acceptable for this
query (it's a one time thing) just surprised is all.

Thanks for the help!

--
Aaron Turner
http://synfin.net/
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin

Re: High cost of ... where ... not in (select ...)

From
Tom Lane
Date:
Aaron Turner <synfinatic@gmail.com> writes:
> I know the costs are just relative, but I assumed
> cost=19229.08..29478.99 isn't 5 minutes of effort even on crappy
> hardware.

Very likely the bulk of the time is spent in the DELETE work proper,
not in the query to find the rows to be deleted.  In particular I wonder
if you have an unindexed foreign key referencing this table ...

            regards, tom lane