Thread: eliminating records not in (select id ... so SLOW?
I'm doing something like:

delete from table1 where id not in (select id from table2);

Both ids are indexed.

table1 contains ~1M records, table2 contains ~600K records, and id is
unique.

The box is a 2x dual-core Xeon (below 2GHz) with 4GB of RAM. Default
Debian etch setup.

It has been running for over 2h now. Is that normal?

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
--- On Thu, 7/31/08, Ivan Sergio Borgonovo <mail@webthatworks.it> wrote:

> From: Ivan Sergio Borgonovo <mail@webthatworks.it>
> Subject: [GENERAL] eliminating records not in (select id ... so SLOW?
> To: "PostgreSQL" <pgsql-general@postgresql.org>
> Date: Thursday, July 31, 2008, 9:45 PM
>
> I'm doing something like:
>
> delete from table1 where id not in (select id from table2);
>
> Both ids are indexed.
>
> table1 contains ~1M records, table2 contains ~600K records, and id
> is unique.
>
> The box is a 2x dual-core Xeon (below 2GHz) with 4GB of RAM.
> Default Debian etch setup.
>
> It has been running for over 2h now. Is that normal?

Have you run vacuum recently?
On Thu, 31 Jul 2008 14:59:29 -0700 (PDT)
Lennin Caro <lennin.caro@yahoo.com> wrote:

> > The box is a 2x dual-core Xeon (below 2GHz) with 4GB of RAM.
> > Default Debian etch setup.

> Have you run vacuum recently?

The tables are pretty stable. I think no more than 20 records were
modified (update/insert/delete) during the whole history of the 2
tables.

autovacuum is running regularly.

The actual query running is:

begin;
create index catalog_categoryitem_ItemsID_index on
  catalog_categoryitem using btree (ItemID);
delete from catalog_categoryitem
  where ItemID not in (select ItemID from catalog_items);
commit;

That's what came back:

Timing is on.
BEGIN
Time: 0.198 ms
CREATE INDEX
Time: 3987.991 ms

The query is still running...

As a reminder, catalog_categoryitem should contain less than 1M
records; catalog_items should contain a bit more than 600K records,
where ItemID is unique (a pk actually).

PostgreSQL comes from the default install on Debian etch (8.1.x).
Its configuration hasn't been modified.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
Ivan Sergio Borgonovo <mail@webthatworks.it> writes:
> I'm doing something like:
> delete from table1 where id not in (select id from table2).
> table1 contains ~1M records, table2 contains ~600K records, and id
> is unique.

That's going to pretty much suck unless you've got work_mem set high
enough to allow a "hashed subplan" plan --- which is likely to require
tens of MB for this case, I don't recall exactly what the per-row
overhead is. Experiment until EXPLAIN tells you it'll use a hashed
subplan.

BTW, don't bother with creating the index, it doesn't help for this.

			regards, tom lane
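A minimal sketch of that experiment, using the generic table names
from the first post (the value is just a starting guess; on 8.1,
work_mem is set in kB, since unit suffixes like '64MB' only arrived
in 8.2):

set work_mem = 65536;  -- 64MB, for this session only
explain delete from table1 where id not in (select id from table2);
-- success looks like "Filter: (NOT (hashed subplan))";
-- a Materialize node under the SubPlan means the hash wasn't chosen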
On Thu, 31 Jul 2008 21:37:39 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Ivan Sergio Borgonovo <mail@webthatworks.it> writes:
> > I'm doing something like:
> > delete from table1 where id not in (select id from table2).
> > table1 contains ~1M records, table2 contains ~600K records, and
> > id is unique.

> That's going to pretty much suck unless you've got work_mem set
> high enough to allow a "hashed subplan" plan --- which is likely
> to require tens of MB for this case, I don't recall exactly what

Thanks.

> the per-row overhead is. Experiment until EXPLAIN tells you it'll
> use a hashed subplan.

explain delete from catalog_categoryitem
  where ItemID not in (select ItemID from catalog_items);

Well, I reached 3Gb of work_mem and still I got:

"Seq Scan on catalog_categoryitem (cost=31747.84..4019284477.13 rows=475532 width=6)"
"  Filter: (NOT (subplan))"
"  SubPlan"
"    ->  Materialize (cost=31747.84..38509.51 rows=676167 width=8)"
"          ->  Seq Scan on catalog_items (cost=0.00..31071.67 rows=676167 width=8)"

I have this too:

alter table catalog_items cluster on catalog_items_pkey;

Should I drop it?

This is just a dev box. I loaded the 2 tables with 2 incoherent sets
of data just to play with, before adding all the pk/fk I need. I
could just truncate the tables and reload them from coherent sources.

But what if I *really* had to execute that query?
Any other magic I could play to speed it up?

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
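(As an aside on the clustering question, which the thread leaves
hanging: alter table ... cluster on only records which index a later
CLUSTER command should use; it has no effect on this query either
way.)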
Ivan Sergio Borgonovo wrote:
> But what if I *really* had to execute that query?
> Any other magic I could play to speed it up?

A trick that is sometimes spectacularly efficient is to rewrite the
query to use an outer join instead of NOT IN. Try:

DELETE FROM table1 WHERE id IN
  (SELECT table1.id FROM table1
    LEFT JOIN table2 ON table1.id = table2.id
   WHERE table2.id IS NULL);

Best regards,
--
Daniel
PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org
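Applied to the tables from this thread, that rewrite would look
roughly like this (table and column names taken from Ivan's earlier
messages):

DELETE FROM catalog_categoryitem WHERE ItemID IN
  (SELECT cci.ItemID FROM catalog_categoryitem cci
    LEFT JOIN catalog_items ci ON cci.ItemID = ci.ItemID
   WHERE ci.ItemID IS NULL);

The LEFT JOIN keeps every catalog_categoryitem row and produces NULLs
where no matching catalog_items row exists, so the IS NULL filter
picks out exactly the orphans to delete.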
OK, try this:

delete from catalog_categoryitem
 where not exists
   (select 1 from catalog_items
     where catalog_items.ItemID = catalog_categoryitem.ItemID);

--- On Thu, 7/31/08, Ivan Sergio Borgonovo <mail@webthatworks.it> wrote:

> From: Ivan Sergio Borgonovo <mail@webthatworks.it>
> Subject: Re: [GENERAL] eliminating records not in (select id ... so SLOW?
> Date: Thursday, July 31, 2008, 11:01 PM
>
> The actual query running is:
>
> begin;
> create index catalog_categoryitem_ItemsID_index on
>   catalog_categoryitem using btree (ItemID);
> delete from catalog_categoryitem
>   where ItemID not in (select ItemID from catalog_items);
> commit;
>
> [...]
>
> PostgreSQL comes from the default install on Debian etch (8.1.x).
> Its configuration hasn't been modified.
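A note on this rewrite (not from the thread, but worth keeping in
mind): NOT EXISTS and the outer-join form are not always
interchangeable with NOT IN. If the subquery's column can contain
NULLs, NOT IN matches nothing at all, while NOT EXISTS behaves the
way one usually intends. Here ItemID is catalog_items' primary key,
so all three forms delete the same rows.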
Ivan Sergio Borgonovo <mail@webthatworks.it> writes:
> Well, I reached 3Gb of work_mem and still I got:

> "Seq Scan on catalog_categoryitem (cost=31747.84..4019284477.13 rows=475532 width=6)"
> "  Filter: (NOT (subplan))"
> "  SubPlan"
> "    ->  Materialize (cost=31747.84..38509.51 rows=676167 width=8)"
> "          ->  Seq Scan on catalog_items (cost=0.00..31071.67 rows=676167 width=8)"

Huh. The only way I can see for that to happen is if the datatypes
involved aren't hashable. What are the datatypes of the two columns
being compared, anyway?

			regards, tom lane
On Fri, 01 Aug 2008 10:33:59 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Ivan Sergio Borgonovo <mail@webthatworks.it> writes:
> > Well, I reached 3Gb of work_mem and still I got:
> >
> > "Seq Scan on catalog_categoryitem (cost=31747.84..4019284477.13 rows=475532 width=6)"
> > "  Filter: (NOT (subplan))"
> > "  SubPlan"
> > "    ->  Materialize (cost=31747.84..38509.51 rows=676167 width=8)"
> > "          ->  Seq Scan on catalog_items (cost=0.00..31071.67 rows=676167 width=8)"
>
> Huh. The only way I can see for that to happen is if the datatypes
> involved aren't hashable. What are the datatypes of the two columns
> being compared, anyway?

That S in CS should mean sober!
Thanks to svn, I'd say you're right... one column was int, the other
bigint. Among other things I was just fixing that kind of mistake.

If that's the reason, I'll report whether things get better once I
finish normalising the DB.

BTW, does pg 8.3 save you from this kind of mistake by being stricter
about automatic casts?

Tom, sorry for sending this just to your personal email.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
Ivan Sergio Borgonovo <mail@webthatworks.it> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Huh. The only way I can see for that to happen is if the datatypes
>> involved aren't hashable. What are the datatypes of the two columns
>> being compared, anyway?

> Thanks to svn, I'd say you're right... one column was int, the other
> bigint.

Ah. 8.3 can hash certain cross-type comparisons (including that one),
but prior versions won't.

			regards, tom lane
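On pre-8.3, the practical fix is to make both sides of the comparison
the same type, either permanently or just inside the query. A rough
sketch, assuming catalog_categoryitem.ItemID was the int column (the
thread doesn't say which one was which):

-- permanent fix: align the column types (rewrites the table)
alter table catalog_categoryitem alter column ItemID type bigint;

-- or cast in the query so the comparison is bigint = bigint,
-- which should be hashable given enough work_mem
delete from catalog_categoryitem
 where ItemID::bigint not in (select ItemID from catalog_items);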
On Fri, 01 Aug 2008 10:33:59 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Huh. The only way I can see for that to happen is if the datatypes
> involved aren't hashable. What are the datatypes of the two columns
> being compared, anyway?

I changed both columns to bigint, added 2 indexes on the ItemID
column of both tables, and increased work_mem to 3Gb [sic].

The query got executed in ~1300ms... but explain gave the same
output as the one above.

The problem is solved... but curious minds want to know.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it
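For the curious mind, one way to check what actually ran (a standard
trick, not something from the thread): EXPLAIN only shows the plan
the planner would pick, while EXPLAIN ANALYZE executes the statement
and reports the plan that really ran, with real timings. Since this
is a DELETE, wrap it in a transaction and roll it back:

begin;
explain analyze
delete from catalog_categoryitem
 where ItemID not in (select ItemID from catalog_items);
rollback;  -- undoes the delete; only the plan and timings remain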