Thread: eliminating records not in (select id ... so SLOW?

eliminating records not in (select id ... so SLOW?

From
Ivan Sergio Borgonovo
Date:
I'm doing something like:

delete from table1 where id not in (select id from table2);

Both ids are indexed.

table1 contains ~1M records, table2 contains ~600K records, and id is
unique.

The box is a 2x dual-core Xeon (below 2GHz) with 4GB RAM.
Default Debian etch setup.

It has been running for over 2h now.

Is this normal?

--
Ivan Sergio Borgonovo
http://www.webthatworks.it


Re: eliminating records not in (select id ... so SLOW?

From
Lennin Caro
Date:


--- On Thu, 7/31/08, Ivan Sergio Borgonovo <mail@webthatworks.it> wrote:

> From: Ivan Sergio Borgonovo <mail@webthatworks.it>
> Subject: [GENERAL] eliminating records not in (select id ... so SLOW?
> To: "PostgreSQL" <pgsql-general@postgresql.org>
> Date: Thursday, July 31, 2008, 9:45 PM
> I'm doing something like:
>
> delete from table1 where id not in (select id from table2);
>
> Both ids are indexed.
>
> table1 contains ~1M records, table2 contains ~600K records,
> and id is unique.
>
> The box is a 2x dual-core Xeon (below 2GHz) with 4GB RAM.
> Default Debian etch setup.
>
> It has been running for over 2h now.
>
> Is this normal?
>
> --
> Ivan Sergio Borgonovo
> http://www.webthatworks.it
>
Have you run vacuum recently?

> --
> Sent via pgsql-general mailing list
> (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general





Re: eliminating records not in (select id ... so SLOW?

From
Ivan Sergio Borgonovo
Date:
On Thu, 31 Jul 2008 14:59:29 -0700 (PDT)
Lennin Caro <lennin.caro@yahoo.com> wrote:

> > The box is a 2x dual core Xeon (below 2GHz) with 4Gb ram.
> > Default debian etch setup.

> Have you run vacuum recently?

The tables are pretty stable. I think no more than 20 records were
modified (update/insert/delete) during the whole history of the 2
tables.

autovacuum is running regularly.

The actual query running is:

begin;
 create index catalog_categoryitem_ItemsID_index on
   catalog_categoryitem using btree (ItemID);
 delete from catalog_categoryitem
   where ItemID not in (select ItemID from catalog_items);
commit;

This is what came back:
Timing is on.
BEGIN
Time: 0.198 ms
CREATE INDEX
Time: 3987.991 ms

The query is still running...

As a reminder, catalog_categoryitem should contain fewer than 1M
records.
catalog_items should contain a bit more than 600K records, where
ItemID is unique (a pk, actually).
PostgreSQL comes from the default install on Debian etch (8.1.X).
Its configuration hasn't been modified.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it


Re: eliminating records not in (select id ... so SLOW?

From
Tom Lane
Date:
Ivan Sergio Borgonovo <mail@webthatworks.it> writes:
> I'm doing something like:
> delete from table1 where id not in (select id from table2).
> table1 contains ~1M record table2 contains ~ 600K record and id is
> unique.

That's going to pretty much suck unless you've got work_mem set high
enough to allow a "hashed subplan" plan --- which is likely to require
tens of MB for this case, I don't recall exactly what the per-row
overhead is.  Experiment until EXPLAIN tells you it'll use a hashed
subplan.

BTW, don't bother with creating the index, it doesn't help for this.
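
For example, a session for that experiment might look something like
this (the 64MB value is just a starting point; raise it until the plan
changes):

 set work_mem = '64MB';
 explain delete from catalog_categoryitem
   where ItemID not in (select ItemID from catalog_items);

When work_mem is high enough, the filter line in the plan should read
"NOT (hashed subplan)" instead of "NOT (subplan)".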

            regards, tom lane

Re: eliminating records not in (select id ... so SLOW?

From
Ivan Sergio Borgonovo
Date:
On Thu, 31 Jul 2008 21:37:39 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Ivan Sergio Borgonovo <mail@webthatworks.it> writes:
> > I'm doing something like:
> > delete from table1 where id not in (select id from table2).
> > table1 contains ~1M record table2 contains ~ 600K record and id
> > is unique.

> That's going to pretty much suck unless you've got work_mem set
> high enough to allow a "hashed subplan" plan --- which is likely
> to require tens of MB for this case, I don't recall exactly what

Thanks.

> the per-row overhead is.  Experiment until EXPLAIN tells you it'll
> use a hashed subplan.

explain delete from catalog_categoryitem where ItemID not in (select
ItemID from catalog_items);

Well, I went up to 3GB of work_mem and still got:

"Seq Scan on catalog_categoryitem  (cost=31747.84..4019284477.13
rows=475532 width=6)"
"  Filter: (NOT (subplan))"
"  SubPlan"
"    ->  Materialize  (cost=31747.84..38509.51 rows=676167 width=8)"
"          ->  Seq Scan on catalog_items  (cost=0.00..31071.67
rows=676167 width=8)"

I have this too:

alter table catalog_items cluster on catalog_items_pkey;

Should I drop it?

This is just a dev box. I loaded the two tables with two incoherent
sets of data just to play with, before adding all the pk/fk
constraints I need. I could just truncate the tables and reload them
from coherent sources.

But what if I *really* had to execute that query?
Any other magic I could use to speed it up?

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it


Re: eliminating records not in (select id ... so SLOW?

From
"Daniel Verite"
Date:
    Ivan Sergio Borgonovo wrote:

> But what if I *really* had to execute that query?
> Any other magic I could play to speed it up?

A trick that is sometimes spectacularly efficient is to rewrite the
query to use an outer join instead of NOT IN.

Try:

DELETE FROM table1 WHERE id IN
 (SELECT table1.id FROM table1 LEFT JOIN table2 ON table1.id=table2.id
  WHERE table2.id IS NULL);


 Best regards,
--
 Daniel
 PostgreSQL-powered mail user agent and storage:
http://www.manitou-mail.org

Re: eliminating records not in (select id ... so SLOW?

From
Lennin Caro
Date:
OK, try this:


delete from catalog_categoryitem where not exists
  (select 1 from catalog_items
   where catalog_items.ItemID = catalog_categoryitem.ItemID);


--- On Thu, 7/31/08, Ivan Sergio Borgonovo <mail@webthatworks.it> wrote:

> From: Ivan Sergio Borgonovo <mail@webthatworks.it>
> Subject: Re: [GENERAL] eliminating records not in (select id ... so SLOW?
> To:
> Cc: "PostgreSQL" <pgsql-general@postgresql.org>
> Date: Thursday, July 31, 2008, 11:01 PM
> On Thu, 31 Jul 2008 14:59:29 -0700 (PDT)
> Lennin Caro <lennin.caro@yahoo.com> wrote:
>
> > > The box is a 2x dual-core Xeon (below 2GHz) with 4GB RAM.
> > > Default Debian etch setup.
>
> > Have you run vacuum recently?
>
> The tables are pretty stable. I think no more than 20 records were
> modified (update/insert/delete) during the whole history of the 2
> tables.
>
> autovacuum is running regularly.
>
> The actual query running is:
>
> begin;
>  create index catalog_categoryitem_ItemsID_index on
>    catalog_categoryitem using btree (ItemID);
>  delete from catalog_categoryitem
>    where ItemID not in (select ItemID from catalog_items);
> commit;
>
> This is what came back:
> Timing is on.
> BEGIN
> Time: 0.198 ms
> CREATE INDEX
> Time: 3987.991 ms
>
> The query is still running...
>
> As a reminder, catalog_categoryitem should contain fewer than 1M
> records.
> catalog_items should contain a bit more than 600K records, where
> ItemID is unique (a pk, actually).
> PostgreSQL comes from the default install on Debian etch (8.1.X).
> Its configuration hasn't been modified.
>
> --
> Ivan Sergio Borgonovo
> http://www.webthatworks.it





Re: eliminating records not in (select id ... so SLOW?

From
Tom Lane
Date:
Ivan Sergio Borgonovo <mail@webthatworks.it> writes:
> Well I reached 3Gb of work_mem and still I got:

> "Seq Scan on catalog_categoryitem  (cost=31747.84..4019284477.13
> rows=475532 width=6)"
> "  Filter: (NOT (subplan))"
> "  SubPlan"
> "    ->  Materialize  (cost=31747.84..38509.51 rows=676167 width=8)"
> "          ->  Seq Scan on catalog_items  (cost=0.00..31071.67
> rows=676167 width=8)"

Huh.  The only way I can see for that to happen is if the datatypes
involved aren't hashable.  What's the datatypes of the two columns
being compared, anyway?
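
One quick way to check is psql's \d, or something like this (table
names as used earlier in this thread):

 select table_name, column_name, data_type
 from information_schema.columns
 where table_name in ('catalog_categoryitem', 'catalog_items')
   and column_name = 'itemid';

(Unquoted identifiers fold to lowercase, hence 'itemid'.)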

            regards, tom lane

Re: eliminating records not in (select id ... so SLOW?

From
Ivan Sergio Borgonovo
Date:
On Fri, 01 Aug 2008 10:33:59 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Ivan Sergio Borgonovo <mail@webthatworks.it> writes:
> > Well I reached 3Gb of work_mem and still I got:
>
> > "Seq Scan on catalog_categoryitem  (cost=31747.84..4019284477.13
> > rows=475532 width=6)"
> > "  Filter: (NOT (subplan))"
> > "  SubPlan"
> > "    ->  Materialize  (cost=31747.84..38509.51 rows=676167 width=8)"
> > "          ->  Seq Scan on catalog_items  (cost=0.00..31071.67
> > rows=676167 width=8)"
>
> Huh.  The only way I can see for that to happen is if the datatypes
> involved aren't hashable.  What's the datatypes of the two columns
> being compared, anyway?

That S in CS should mean sober!

Thanks to svn, I'd say you're right... one column was int, the other
bigint.
Among other things, I was just fixing that kind of mistake.

If that is the reason, I'll report whether things get better once I
finish normalising the DB.

BTW, does pg 8.3 save you from this kind of mistake by being stricter
with automatic casts?

Tom, sorry for sending this just to your personal email.


--
Ivan Sergio Borgonovo
http://www.webthatworks.it


Re: eliminating records not in (select id ... so SLOW?

From
Tom Lane
Date:
Ivan Sergio Borgonovo <mail@webthatworks.it> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Huh.  The only way I can see for that to happen is if the datatypes
>> involved aren't hashable.  What's the datatypes of the two columns
>> being compared, anyway?

> thanks to svn I'd say you're right... one column was int the other
> bigint.

Ah.  8.3 can hash certain cross-type comparisons (including that one)
but prior versions won't.
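
On 8.1, one workaround is to make the two columns the same type before
running the delete, e.g. (assuming the column should end up bigint):

 alter table catalog_categoryitem alter column ItemID type bigint;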

            regards, tom lane

Re: eliminating records not in (select id ... so SLOW?

From
Ivan Sergio Borgonovo
Date:
On Fri, 01 Aug 2008 10:33:59 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Ivan Sergio Borgonovo <mail@webthatworks.it> writes:
> > Well I reached 3Gb of work_mem and still I got:
>
> > "Seq Scan on catalog_categoryitem  (cost=31747.84..4019284477.13
> > rows=475532 width=6)"
> > "  Filter: (NOT (subplan))"
> > "  SubPlan"
> > "    ->  Materialize  (cost=31747.84..38509.51 rows=676167 width=8)"
> > "          ->  Seq Scan on catalog_items  (cost=0.00..31071.67
> > rows=676167 width=8)"
>
> Huh.  The only way I can see for that to happen is if the datatypes
> involved aren't hashable.  What's the datatypes of the two columns
> being compared, anyway?

I changed both columns to bigint.
I added two indexes on the ItemID column of both tables and increased
work_mem to 3Gb [sic].
The query executed in ~1300ms... but explain gave the same output as
the one above.

The problem is solved... but curious minds want to know.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it