Thread: correlated delete with "in" and "left outer join"
I'm using postgresl 7.3.2 and have a query that executes very slowly. There are 2 tables: Item and LogEvent. ItemID (an int4) is the primary key of Item, and is also a field in LogEvent. Some ItemIDs in LogEvent do not correspond to ItemIDs in Item, and periodically we need to purge the non-matching ItemIDs from LogEvent. The query is: delete from LogEvent where EventType != 'i' and ItemID in (select distinct e.ItemID from LogEvent e left outer join Item i on e.ItemID = i.ItemID where e.EventType != 'i' and i.ItemID is null); I understand that using "in" is not very efficient. Is there some other way to write this query without the "in"?
On Feb 27, 2004, at 11:26 AM, <mike@linkify.com> wrote: > I'm using postgresl 7.3.2 and have a query that executes very slowly. <snip /> > I understand that using "in" is not very efficient. > > Is there some other way to write this query without the "in"? NOT EXISTS ( ) is sometimes more efficient. If at all possible, upgrade to 7.4.1. One of the many things that have improved since 7.3.2 is the efficiency of queries using IN. Michael Glaesemann grzm myrealbox com
On Thu, 26 Feb 2004 mike@linkify.com wrote: > I'm using postgresl 7.3.2 and have a query that executes very slowly. > > There are 2 tables: Item and LogEvent. ItemID (an int4) is the primary key > of Item, and is also a field in LogEvent. Some ItemIDs in LogEvent do not > correspond to ItemIDs in Item, and periodically we need to purge the > non-matching ItemIDs from LogEvent. > > The query is: > > delete from LogEvent where EventType != 'i' and ItemID in > (select distinct e.ItemID from LogEvent e left outer join Item i > on e.ItemID = i.ItemID where e.EventType != 'i' and i.ItemID is null); > > I understand that using "in" is not very efficient. > > Is there some other way to write this query without the "in"? Perhaps delete from LogEvent where EventType != 'i' and not exists (select * from Item i where i.ItemID=LogEvent.ItemID);
Stephan Szabo wrote: > On Thu, 26 Feb 2004 mike@linkify.com wrote: > > >>I'm using postgresl 7.3.2 and have a query that executes very slowly. >> >>There are 2 tables: Item and LogEvent. ItemID (an int4) is the primary key >>of Item, and is also a field in LogEvent. Some ItemIDs in LogEvent do not >>correspond to ItemIDs in Item, and periodically we need to purge the >>non-matching ItemIDs from LogEvent. >> >>The query is: >> >>delete from LogEvent where EventType != 'i' and ItemID in >>(select distinct e.ItemID from LogEvent e left outer join Item i >>on e.ItemID = i.ItemID where e.EventType != 'i' and i.ItemID is null); >> >>I understand that using "in" is not very efficient. >> >>Is there some other way to write this query without the "in"? > > > Perhaps > delete from LogEvent where EventType != 'i' and not exists > (select * from Item i where i.ItemID=LogEvent.ItemID); Maybe I'm not reading his subquery correctly, but the left outer join will produce a row from LogEvent regardless of whether or not a matching row exists in Item, correct? So doesn't it reduce to: DELETE FROM LogEvent WHERE EventType <> 'i'; ??? Mike Mascari
Mike Mascari wrote: > Stephan Szabo wrote: > >> On Thu, 26 Feb 2004 mike@linkify.com wrote: >> >>> I'm using postgresl 7.3.2 and have a query that executes very slowly. >>> >>> There are 2 tables: Item and LogEvent. ItemID (an int4) is the >>> primary key >>> of Item, and is also a field in LogEvent. Some ItemIDs in LogEvent >>> do not >>> correspond to ItemIDs in Item, and periodically we need to purge the >>> non-matching ItemIDs from LogEvent. >> >> Perhaps >> delete from LogEvent where EventType != 'i' and not exists >> (select * from Item i where i.ItemID=LogEvent.ItemID); > > > Maybe I'm not reading his subquery correctly, but the left outer join > will produce a row from LogEvent regardless of whether or not a matching > row exists in Item, correct? So doesn't it reduce to: > > DELETE FROM LogEvent WHERE EventType <> 'i'; I failed to read what he was trying to accomplish and assumed the original query was precisely what he intended. My apologies... Mike Mascari
The subquery will always return a row from LogEvent, but that row's itemID will be null if theitemID doesn't match a row from Item. That's why the subquery has the "and i.ItemID is null". > Stephan Szabo wrote: > >> On Thu, 26 Feb 2004 mike@linkify.com wrote: >> >> >>>I'm using postgresl 7.3.2 and have a query that executes very slowly. >>> >>>There are 2 tables: Item and LogEvent. ItemID (an int4) is the >>>primary key of Item, and is also a field in LogEvent. Some ItemIDs in >>>LogEvent do not correspond to ItemIDs in Item, and periodically we >>>need to purge the non-matching ItemIDs from LogEvent. >>> >>>The query is: >>> >>>delete from LogEvent where EventType != 'i' and ItemID in >>>(select distinct e.ItemID from LogEvent e left outer join Item i on >>>e.ItemID = i.ItemID where e.EventType != 'i' and i.ItemID is null); >>> >>>I understand that using "in" is not very efficient. >>> >>>Is there some other way to write this query without the "in"? >> >> >> Perhaps >> delete from LogEvent where EventType != 'i' and not exists >> (select * from Item i where i.ItemID=LogEvent.ItemID); > > Maybe I'm not reading his subquery correctly, but the left outer > join will produce a row from LogEvent regardless of whether or not a > matching row exists in Item, correct? So doesn't it reduce to: > > DELETE FROM LogEvent WHERE EventType <> 'i'; > > ??? > > Mike Mascari
mike@linkify.com wrote: > The subquery will always return a row from LogEvent, but that row's itemID > will be null if the itemID doesn't match a row from Item. > That's why the subquery has the "and i.ItemID is null". You lost me. [test@lexus] \d foo Table "public.foo" Column | Type | Modifiers --------+---------+----------- key | integer | [test@lexus] \d bar Table "public.bar" Column | Type | Modifiers --------+---------+----------- key | integer | value | text | [test@lexus] select * from foo; key ----- 1 3 (2 rows) [test@lexus] select * from bar; key | value -----+------- 1 | Mike 2 | Joe (2 rows) [test@lexus] select f.key from foo f left outer join bar b on f.key = b.key and b.key is null; key ----- 1 3 (2 rows) To do what I think you believe to be happening w.r.t. outer joins, you'd have to have a subquery like: [test@lexus] select a.fookey test-# FROM test-# (SELECT foo.key AS fookey, bar.key as barkey FROM foo LEFT OUTER JOIN bar ON foo.key = bar.key) AS a test-# WHERE a.barkey IS NULL; fookey -------- 3 (1 row) Nevertheless, Stephan's solution matches your description of the problem and excutes the logical equivalent of the above much more rapidly... Mike Mascari
On Thu, Feb 26, 2004 at 06:26:19PM -0800, mike@linkify.com wrote: > I'm using postgresl 7.3.2 and have a query that executes very slowly. > > There are 2 tables: Item and LogEvent. ItemID (an int4) is the > primary key > of Item, and is also a field in LogEvent. Some ItemIDs in LogEvent do > not > correspond to ItemIDs in Item, and periodically we need to purge the > non-matching ItemIDs from LogEvent. delete from LogEvent where EventType!='i' and ItemID not in (select ItemID from Item); delete from LogEvent where EventType!='i' and not exists (select * from Item where Item.ItemID=LogEvent.ItemID); You might also use a foreign key, cascading delete, etc. As for the query style, I've had cases with the latest 7.4 where the "in" style wasn't optimized but the "exists" style was. It's the exact same query, and technically the optimizer should figure that out. Use "explain" to see if it's being optimized to use indexes or if it's just doing table scans. Michael -- Michael Darrin Chaney mdchaney@michaelchaney.com http://www.michaelchaney.com/
On Fri, 27 Feb 2004, Mike Mascari wrote: > To do what I think you believe to be happening w.r.t. outer joins, > you'd have to have a subquery like: > > [test@lexus] select a.fookey > test-# FROM > test-# (SELECT foo.key AS fookey, bar.key as barkey FROM foo LEFT > OUTER JOIN bar ON foo.key = bar.key) AS a > test-# WHERE a.barkey IS NULL; This AFAICS is pretty much what he did, except that he didn't alias the join which is okay I believe. He had one condition in on and two conditions in where. The original subquery looked like: select distinct e.ItemID from LogEvent e left outer join Item i on e.ItemID = i.ItemID where e.EventType != 'i' and i.ItemID is null
Stephan Szabo wrote: > On Fri, 27 Feb 2004, Mike Mascari wrote: > >>To do what I think you believe to be happening w.r.t. outer joins, >>you'd have to have a subquery like: >> >>[test@lexus] select a.fookey >>test-# FROM >>test-# (SELECT foo.key AS fookey, bar.key as barkey FROM foo LEFT >>OUTER JOIN bar ON foo.key = bar.key) AS a >>test-# WHERE a.barkey IS NULL; > > This AFAICS is pretty much what he did, except that he didn't alias the > join which is okay I believe. He had one condition in on and two > conditions in where. > > The original subquery looked like: > select distinct e.ItemID from LogEvent e left outer join Item i > on e.ItemID = i.ItemID where e.EventType != 'i' and i.ItemID is null That is indeed the original subquery. But the 'i.ItemID is null' condition doesn't change the IN list one iota. He was somehow expecting the subquery to yield records internally like: 1 NULL 2 NULL 3 3 and simultaneously have the condition 'i.ItemID is null' eliminate the third tuple. But that is not how the left outer join executes. The 'i.ItemID is null' condition is evaluated, probably always to false, which ensures that the left outer join will never find a matching row from the 'Item' relation and, if queried not as a subquery but stand-alone as: select distinct e.ItemID, i.ItemID from LogEvent e left outer join Item i on e.ItemID = i.ItemID where e.EventType != 'i' and i.ItemID is null would always yield a relation of the form: e.ItemID NULL for every e.ItemID whose e.EventType != 'i'. That ain't right. Another example: [test@lexus] select * from foo; key ----- 1 3 (2 rows) [test@lexus] select * from bar; key | value -----+------- 1 | Mike 2 | Joe (2 rows) [test@lexus] select foo.key, bar.key from foo left outer join bar on foo.key = bar.key and bar.key is null; key | key -----+----- 1 | 3 | (2 rows) [test@lexus] select foo.key, bar.key from foo left outer join bar on foo.key = bar.key; key | key -----+----- 1 | 1 3 | (2 rows) [test@lexus] select a.fookey, a.barkey from (select foo.key as fookey, bar.key as barkey from foo left outer join bar on foo.key = bar.key) as a where a.barkey is null; fookey | barkey --------+-------- 3 | (1 row) Mike Mascari
> >The original subquery looked like: > >select distinct e.ItemID from LogEvent e left outer join Item i > >on e.ItemID = i.ItemID where e.EventType != 'i' and i.ItemID is null Please, before continuing this thread, read my post below. What you're all getting around to, albeit painfully, is that this subquery is worthless as-is. This is the mysql way of finding rows in one table with no match in another without the convenience of the "in" or "exists" constructs. Because we're using Postgres and have those constructs, the original query can be rewritten simply with either: delete from LogEvent where EventType != 'i' and ItemID not in (select ItemID from Item) That's it. That's the whole query. It does what he wants. Michael -- Michael Darrin Chaney mdchaney@michaelchaney.com http://www.michaelchaney.com/
Michael Chaney wrote: > Please, before continuing this thread, read my post below. What you're > all getting around to, albeit painfully, is that this subquery is > worthless as-is. This is the mysql way of finding rows in one table > with no match in another without the convenience of the "in" or "exists" > constructs. > > Because we're using Postgres and have those constructs, the original > query can be rewritten simply with either: > > delete from LogEvent where EventType != 'i' and ItemID not in > (select ItemID from Item) > > That's it. That's the whole query. It does what he wants. One more minor point. :-) If you are using 7.3 or earlier, PostgreSQL will sequentially scan the IN subquery result, which executes quite slowly and therefore the EXISTS method Stephan stated should be used: DELETE FROM LogEvent WHERE EventType != 'i' AND NOT EXISTS ( SELECT 1 FROM Item WHERE Item.ItemID = LogEvent.ItemID ); If you are using >= 7.4, then your query above is optimal: http://www.postgresql.org/docs/7.4/static/release.html#RELEASE-7-4 Just something to consider, Mike Mascari > > Michael
On Fri, Feb 27, 2004 at 12:05:48PM -0500, Mike Mascari wrote: > Michael Chaney wrote: > > >Please, before continuing this thread, read my post below. What you're > >all getting around to, albeit painfully, is that this subquery is > >worthless as-is. This is the mysql way of finding rows in one table > >with no match in another without the convenience of the "in" or "exists" > >constructs. > > > >Because we're using Postgres and have those constructs, the original > >query can be rewritten simply with either: > > > >delete from LogEvent where EventType != 'i' and ItemID not in > > (select ItemID from Item) > > > >That's it. That's the whole query. It does what he wants. > > One more minor point. :-) > > If you are using 7.3 or earlier, PostgreSQL will sequentially scan > the IN subquery result, which executes quite slowly and therefore > the EXISTS method Stephan stated should be used: > > DELETE FROM LogEvent > WHERE EventType != 'i' AND NOT EXISTS ( > SELECT 1 > FROM Item > WHERE Item.ItemID = LogEvent.ItemID > ); > > If you are using >= 7.4, then your query above is optimal: Not necessarily. I had a query just last week that still wouldn't optimize with the "in" notation, but did optimize with "exists" notation. My other post about this showed both queries for that reason, but I still feel that, for academic purposes, the "in" clause is far more readable. Anyway, good point. Michael -- Michael Darrin Chaney mdchaney@michaelchaney.com http://www.michaelchaney.com/
On Fri, 27 Feb 2004, Mike Mascari wrote: > Stephan Szabo wrote: > > On Fri, 27 Feb 2004, Mike Mascari wrote: > > > >>To do what I think you believe to be happening w.r.t. outer joins, > >>you'd have to have a subquery like: > >> > >>[test@lexus] select a.fookey > >>test-# FROM > >>test-# (SELECT foo.key AS fookey, bar.key as barkey FROM foo LEFT > >>OUTER JOIN bar ON foo.key = bar.key) AS a > >>test-# WHERE a.barkey IS NULL; > > > > This AFAICS is pretty much what he did, except that he didn't alias the > > join which is okay I believe. He had one condition in on and two > > conditions in where. > > > > The original subquery looked like: > > select distinct e.ItemID from LogEvent e left outer join Item i > > on e.ItemID = i.ItemID where e.EventType != 'i' and i.ItemID is null > > That is indeed the original subquery. But the 'i.ItemID is null' > condition doesn't change the IN list one iota. He was somehow ... > Another example: > > [test@lexus] select * from foo; > key > ----- > 1 > 3 > (2 rows) > > [test@lexus] select * from bar; > key | value > -----+------- > 1 | Mike > 2 | Joe > (2 rows) > > [test@lexus] select foo.key, bar.key from foo left outer join bar on > foo.key = bar.key and bar.key is null; ON conditions and WHERE conditions are different. Try select foo.key, bar.key from foo left outer join bar on foo.key=bar.key where bar.key is null;
Stephan Szabo wrote: > ON conditions and WHERE conditions are different. > > Try > select foo.key, bar.key from foo left outer join bar on foo.key=bar.key > where bar.key is null; Yep. Sorry. Mike Mascari