Thread: race conditions, intersect in subqueries

race conditions, intersect in subqueries

From
Cristóvão Dalla Costa
Date:
I'm working with an application I wrote which does something along these
lines:

SELECT ID FROM ITEM WHERE URL='X' FOR UPDATE
IF (ROW RETURNED) {
    $ID = ITEM.ID
} ELSE {
    SELECT nextval ('item_id_seq')
    $ID = nextval
    INSERT INTO ITEM....
}
DO OTHER STUFF WITH $ID

So, I check if an item with a given url exists. If it does, I get its id and
use it later. If it doesn't, I insert a new item and proceed with the new
id. Everything happens inside a transaction. Now, there is a race condition
where the first line is executed simultaneously by two processes, looking
for the same url, and resulting in duplicate lines. So far, there are about
40 duplicates in a 80,000 row database, and short of manually correcting
them, I don't know what to do to fix the race condition.

Finally, it seems I cannot do INTERSECT on subqueries, sice the following
fails with a parse error "at or near INTERSECT", and the subquery by itself
works.

SELECT * FROM item WHERE id IN (SELECT item_id FROM item_words, words WHERE
words.id = words_id AND words.word='x' INTERSECT SELECT item_id FROM
item_words, words WHERE words.id = words_id AND words.word='y')

Basically, I'm using the above query to look for words in a reverse index,
sometimes with as many as 10 different words, causing a lot of rows to be
generated, to be later filtered by the intersects. Are there any better ways
to do that, performance-wise?

BTW, I'm not sure whether this is the appropriate mailing list to report
this, but the query optimizer should read
SELECT * FROM x WHERE id IN (SELECT x_id FROM Y)
as
SELECT * FROM x WHERE EXISTS (SELECT * FROM Y WHERE x_id = x.id)
when the tables are "large", and the necessary indexes exist.

Thanks for the help.

Cristovao.




Re: race conditions, intersect in subqueries

From
Stephan Szabo
Date:
On Fri, 8 Sep 2000, [iso-8859-1] Crist�v�o Dalla Costa wrote:

> I'm working with an application I wrote which does something along these
> lines:
>
> SELECT ID FROM ITEM WHERE URL='X' FOR UPDATE
> IF (ROW RETURNED) {
>     $ID = ITEM.ID
> } ELSE {
>     SELECT nextval ('item_id_seq')
>     $ID = nextval
>     INSERT INTO ITEM....
> }
> DO OTHER STUFF WITH $ID
>
> So, I check if an item with a given url exists. If it does, I get its id and
> use it later. If it doesn't, I insert a new item and proceed with the new
> id. Everything happens inside a transaction. Now, there is a race condition
> where the first line is executed simultaneously by two processes, looking
> for the same url, and resulting in duplicate lines. So far, there are about
> 40 duplicates in a 80,000 row database, and short of manually correcting
> them, I don't know what to do to fix the race condition.

I'm assuming that this only happens when both are doing this and the row
didn't already exist, both insert, because select for update isn't going
to lock the non-existant row and they aren't going to block for each
other since no rows match.  Do the rows have the same sequence number or
just different sequences and the same data?  I'm not sure what else
select for update could do in the latter case other than allow both
through since there are no rows that conflict.

If you want a quick hack (other than locking the table explicitly), you
could effectively make a simple locking mechanism with another table,
always have a row for each "lock" you want, and select for update that,
basically locking out one until the other is done.

> Finally, it seems I cannot do INTERSECT on subqueries, sice the following
> fails with a parse error "at or near INTERSECT", and the subquery by itself
> works.

Yes, currently things like INTERSECT, EXCEPT, etc, are only allowable at
the top level of the query.  This is likely to go away in 7.2 (the dreaded
querytree work :) ).

> SELECT * FROM item WHERE id IN (SELECT item_id FROM item_words, words WHERE
> words.id = words_id AND words.word='x' INTERSECT SELECT item_id FROM
> item_words, words WHERE words.id = words_id AND words.word='y')
>
> Basically, I'm using the above query to look for words in a reverse index,
> sometimes with as many as 10 different words, causing a lot of rows to be
> generated, to be later filtered by the intersects. Are there any better ways
> to do that, performance-wise?

I'm not sure if it'll be fast, but is the subquery equivalent to
(assuming item_id is on item_words):
SELECT distinct a.item_id FROM item_words a, words b, item_words c,
 words d where b.id=a.words_id AND b.word='x' AND d.id=c.words_id AND
 d.word='y' AND a.item_id=c.item_id

Well, then converting that IN into an EXISTS... and adding the
id=a.item_id at the end.


> BTW, I'm not sure whether this is the appropriate mailing list to report
> this, but the query optimizer should read
> SELECT * FROM x WHERE id IN (SELECT x_id FROM Y)
> as
> SELECT * FROM x WHERE EXISTS (SELECT * FROM Y WHERE x_id = x.id)
> when the tables are "large", and the necessary indexes exist.

This is a known problem, it's mentioned in the FAQ
("Why are my subqueries using IN so slow") with the note that
we hope to fix this problem, but it hasn't been tackled yet.



Re: race conditions, intersect in subqueries

From
Doug Semig
Date:
Whenever I do inserts like yours, I do it in this kind of manner (I tried
to use your pseudocode style):

SELECT ID FROM ITEM WHERE URL='X' FOR UPDATE
IF (ROW RETURNED) {
  $ID = ITEM.ID
} ELSE {
  INSERT INTO ITEM ...
  GET THE OID [via PQoidStatus or your environment's equivalent]
  SELECT ID FROM ITEM WHERE oid = [THE OID WE GOT FROM ABOVE]
  $ID = ITEM.ID
}

Note that this depends upon ID in the ITEM table being defined with DEFAULT
NEXTVAL('item_id_seq').

As for the intersect in subqueries, I see two common uses for your
database.  The first is to return every item that has any one or more words
from your word list linked to it.  That's easy.

SELECT *
  FROM item
 WHERE id IN (
     SELECT DISTINCT item_id
       FROM item_words, words
      WHERE words.id = words_id
        AND (
             words.word = 'x'
             OR words.word = 'y'
             [etc for all your words]
            )
     );

The second use, and the one I think you're describing, is to return only
those items that have every word in your word list linked to it.  (Like
what you would expect from Altavista with a query like "+four +score +seven
+years +ago"; the plus sign is supposed to require the word that follows it
to be somewhere on the page.)  That's tougher.

Instead of doing all those INTERSECTs, it may be possible to get the
appropriate list of item_ids with something like this:

SELECT item_id
  FROM item_words, words
 WHERE words.id = words_id
   AND (
        words.word = 'x'
        OR words.word = 'y'
       )
GROUP BY item_id
HAVING count(item_id) = 2;

Note that this would really only be valid if item_words does not allow
duplicates of (item_id, word_id).

This kind of thing could simplify all those INTERSECTs, but for some reason
it cannot be used as a subselect (at least in the version of postgresql I
am using, which I think is just 7.0).  The only way I could find to use
this kind of construct would be to run this query selecting into a
temporary table and running your original query using ...IN (SELECT item_id
FROM yourtemptable) as the subselect.

That's all I can think of.  I hope some of this was at least interesting,
even if it doesn't help in your particular application.

Doug

At 07:15 PM 9/8/00 -0300, Cristóvão Dalla Costa wrote:
>I'm working with an application I wrote which does something along these
>lines:
>
>SELECT ID FROM ITEM WHERE URL='X' FOR UPDATE
>IF (ROW RETURNED) {
>    $ID = ITEM.ID
>} ELSE {
>    SELECT nextval ('item_id_seq')
>    $ID = nextval
>    INSERT INTO ITEM....
>}
>DO OTHER STUFF WITH $ID
>
>So, I check if an item with a given url exists. If it does, I get its id and
>use it later. If it doesn't, I insert a new item and proceed with the new
>id. Everything happens inside a transaction. Now, there is a race condition
>where the first line is executed simultaneously by two processes, looking
>for the same url, and resulting in duplicate lines. So far, there are about
>40 duplicates in a 80,000 row database, and short of manually correcting
>them, I don't know what to do to fix the race condition.
>
>Finally, it seems I cannot do INTERSECT on subqueries, sice the following
>fails with a parse error "at or near INTERSECT", and the subquery by itself
>works.
>
>SELECT * FROM item WHERE id IN (SELECT item_id FROM item_words, words WHERE
>words.id = words_id AND words.word='x' INTERSECT SELECT item_id FROM
>item_words, words WHERE words.id = words_id AND words.word='y')
>
>Basically, I'm using the above query to look for words in a reverse index,
>sometimes with as many as 10 different words, causing a lot of rows to be
>generated, to be later filtered by the intersects. Are there any better ways
>to do that, performance-wise?
>
>BTW, I'm not sure whether this is the appropriate mailing list to report
>this, but the query optimizer should read
>SELECT * FROM x WHERE id IN (SELECT x_id FROM Y)
>as
>SELECT * FROM x WHERE EXISTS (SELECT * FROM Y WHERE x_id = x.id)
>when the tables are "large", and the necessary indexes exist.
>
>Thanks for the help.
>
>Cristovao.