Thread: race conditions, intersect in subqueries
I'm working with an application I wrote which does something along these lines: SELECT ID FROM ITEM WHERE URL='X' FOR UPDATE IF (ROW RETURNED) { $ID = ITEM.ID } ELSE { SELECT nextval ('item_id_seq') $ID = nextval INSERT INTO ITEM.... } DO OTHER STUFF WITH $ID So, I check if an item with a given url exists. If it does, I get its id and use it later. If it doesn't, I insert a new item and proceed with the new id. Everything happens inside a transaction. Now, there is a race condition where the first line is executed simultaneously by two processes, looking for the same url, and resulting in duplicate lines. So far, there are about 40 duplicates in a 80,000 row database, and short of manually correcting them, I don't know what to do to fix the race condition. Finally, it seems I cannot do INTERSECT on subqueries, sice the following fails with a parse error "at or near INTERSECT", and the subquery by itself works. SELECT * FROM item WHERE id IN (SELECT item_id FROM item_words, words WHERE words.id = words_id AND words.word='x' INTERSECT SELECT item_id FROM item_words, words WHERE words.id = words_id AND words.word='y') Basically, I'm using the above query to look for words in a reverse index, sometimes with as many as 10 different words, causing a lot of rows to be generated, to be later filtered by the intersects. Are there any better ways to do that, performance-wise? BTW, I'm not sure whether this is the appropriate mailing list to report this, but the query optimizer should read SELECT * FROM x WHERE id IN (SELECT x_id FROM Y) as SELECT * FROM x WHERE EXISTS (SELECT * FROM Y WHERE x_id = x.id) when the tables are "large", and the necessary indexes exist. Thanks for the help. Cristovao.
On Fri, 8 Sep 2000, [iso-8859-1] Crist�v�o Dalla Costa wrote: > I'm working with an application I wrote which does something along these > lines: > > SELECT ID FROM ITEM WHERE URL='X' FOR UPDATE > IF (ROW RETURNED) { > $ID = ITEM.ID > } ELSE { > SELECT nextval ('item_id_seq') > $ID = nextval > INSERT INTO ITEM.... > } > DO OTHER STUFF WITH $ID > > So, I check if an item with a given url exists. If it does, I get its id and > use it later. If it doesn't, I insert a new item and proceed with the new > id. Everything happens inside a transaction. Now, there is a race condition > where the first line is executed simultaneously by two processes, looking > for the same url, and resulting in duplicate lines. So far, there are about > 40 duplicates in a 80,000 row database, and short of manually correcting > them, I don't know what to do to fix the race condition. I'm assuming that this only happens when both are doing this and the row didn't already exist, both insert, because select for update isn't going to lock the non-existant row and they aren't going to block for each other since no rows match. Do the rows have the same sequence number or just different sequences and the same data? I'm not sure what else select for update could do in the latter case other than allow both through since there are no rows that conflict. If you want a quick hack (other than locking the table explicitly), you could effectively make a simple locking mechanism with another table, always have a row for each "lock" you want, and select for update that, basically locking out one until the other is done. > Finally, it seems I cannot do INTERSECT on subqueries, sice the following > fails with a parse error "at or near INTERSECT", and the subquery by itself > works. Yes, currently things like INTERSECT, EXCEPT, etc, are only allowable at the top level of the query. This is likely to go away in 7.2 (the dreaded querytree work :) ). > SELECT * FROM item WHERE id IN (SELECT item_id FROM item_words, words WHERE > words.id = words_id AND words.word='x' INTERSECT SELECT item_id FROM > item_words, words WHERE words.id = words_id AND words.word='y') > > Basically, I'm using the above query to look for words in a reverse index, > sometimes with as many as 10 different words, causing a lot of rows to be > generated, to be later filtered by the intersects. Are there any better ways > to do that, performance-wise? I'm not sure if it'll be fast, but is the subquery equivalent to (assuming item_id is on item_words): SELECT distinct a.item_id FROM item_words a, words b, item_words c, words d where b.id=a.words_id AND b.word='x' AND d.id=c.words_id AND d.word='y' AND a.item_id=c.item_id Well, then converting that IN into an EXISTS... and adding the id=a.item_id at the end. > BTW, I'm not sure whether this is the appropriate mailing list to report > this, but the query optimizer should read > SELECT * FROM x WHERE id IN (SELECT x_id FROM Y) > as > SELECT * FROM x WHERE EXISTS (SELECT * FROM Y WHERE x_id = x.id) > when the tables are "large", and the necessary indexes exist. This is a known problem, it's mentioned in the FAQ ("Why are my subqueries using IN so slow") with the note that we hope to fix this problem, but it hasn't been tackled yet.
Whenever I do inserts like yours, I do it in this kind of manner (I tried to use your pseudocode style): SELECT ID FROM ITEM WHERE URL='X' FOR UPDATE IF (ROW RETURNED) { $ID = ITEM.ID } ELSE { INSERT INTO ITEM ... GET THE OID [via PQoidStatus or your environment's equivalent] SELECT ID FROM ITEM WHERE oid = [THE OID WE GOT FROM ABOVE] $ID = ITEM.ID } Note that this depends upon ID in the ITEM table being defined with DEFAULT NEXTVAL('item_id_seq'). As for the intersect in subqueries, I see two common uses for your database. The first is to return every item that has any one or more words from your word list linked to it. That's easy. SELECT * FROM item WHERE id IN ( SELECT DISTINCT item_id FROM item_words, words WHERE words.id = words_id AND ( words.word = 'x' OR words.word = 'y' [etc for all your words] ) ); The second use, and the one I think you're describing, is to return only those items that have every word in your word list linked to it. (Like what you would expect from Altavista with a query like "+four +score +seven +years +ago"; the plus sign is supposed to require the word that follows it to be somewhere on the page.) That's tougher. Instead of doing all those INTERSECTs, it may be possible to get the appropriate list of item_ids with something like this: SELECT item_id FROM item_words, words WHERE words.id = words_id AND ( words.word = 'x' OR words.word = 'y' ) GROUP BY item_id HAVING count(item_id) = 2; Note that this would really only be valid if item_words does not allow duplicates of (item_id, word_id). This kind of thing could simplify all those INTERSECTs, but for some reason it cannot be used as a subselect (at least in the version of postgresql I am using, which I think is just 7.0). The only way I could find to use this kind of construct would be to run this query selecting into a temporary table and running your original query using ...IN (SELECT item_id FROM yourtemptable) as the subselect. That's all I can think of. I hope some of this was at least interesting, even if it doesn't help in your particular application. Doug At 07:15 PM 9/8/00 -0300, Cristóvão Dalla Costa wrote: >I'm working with an application I wrote which does something along these >lines: > >SELECT ID FROM ITEM WHERE URL='X' FOR UPDATE >IF (ROW RETURNED) { > $ID = ITEM.ID >} ELSE { > SELECT nextval ('item_id_seq') > $ID = nextval > INSERT INTO ITEM.... >} >DO OTHER STUFF WITH $ID > >So, I check if an item with a given url exists. If it does, I get its id and >use it later. If it doesn't, I insert a new item and proceed with the new >id. Everything happens inside a transaction. Now, there is a race condition >where the first line is executed simultaneously by two processes, looking >for the same url, and resulting in duplicate lines. So far, there are about >40 duplicates in a 80,000 row database, and short of manually correcting >them, I don't know what to do to fix the race condition. > >Finally, it seems I cannot do INTERSECT on subqueries, sice the following >fails with a parse error "at or near INTERSECT", and the subquery by itself >works. > >SELECT * FROM item WHERE id IN (SELECT item_id FROM item_words, words WHERE >words.id = words_id AND words.word='x' INTERSECT SELECT item_id FROM >item_words, words WHERE words.id = words_id AND words.word='y') > >Basically, I'm using the above query to look for words in a reverse index, >sometimes with as many as 10 different words, causing a lot of rows to be >generated, to be later filtered by the intersects. Are there any better ways >to do that, performance-wise? > >BTW, I'm not sure whether this is the appropriate mailing list to report >this, but the query optimizer should read >SELECT * FROM x WHERE id IN (SELECT x_id FROM Y) >as >SELECT * FROM x WHERE EXISTS (SELECT * FROM Y WHERE x_id = x.id) >when the tables are "large", and the necessary indexes exist. > >Thanks for the help. > >Cristovao.