Re: race conditions, intersect in subqueries - Mailing list pgsql-general

From Doug Semig
Subject Re: race conditions, intersect in subqueries
Date
Msg-id 3.0.6.32.20000909035501.007b6e70@sloth.c3net.net
Whole thread Raw
In response to race conditions, intersect in subqueries  (Cristóvão Dalla Costa <cdallacosta@bigfoot.com>)
List pgsql-general
Whenever I do inserts like yours, I do it in this kind of manner (I tried
to use your pseudocode style):

SELECT ID FROM ITEM WHERE URL='X' FOR UPDATE
IF (ROW RETURNED) {
  $ID = ITEM.ID
} ELSE {
  INSERT INTO ITEM ...
  GET THE OID [via PQoidStatus or your environment's equivalent]
  SELECT ID FROM ITEM WHERE oid = [THE OID WE GOT FROM ABOVE]
  $ID = ITEM.ID
}

Note that this depends upon ID in the ITEM table being defined with DEFAULT
NEXTVAL('item_id_seq').

As for the intersect in subqueries, I see two common uses for your
database.  The first is to return every item that has any one or more words
from your word list linked to it.  That's easy.

SELECT *
  FROM item
 WHERE id IN (
     SELECT DISTINCT item_id
       FROM item_words, words
      WHERE words.id = words_id
        AND (
             words.word = 'x'
             OR words.word = 'y'
             [etc for all your words]
            )
     );

The second use, and the one I think you're describing, is to return only
those items that have every word in your word list linked to it.  (Like
what you would expect from Altavista with a query like "+four +score +seven
+years +ago"; the plus sign is supposed to require the word that follows it
to be somewhere on the page.)  That's tougher.

Instead of doing all those INTERSECTs, it may be possible to get the
appropriate list of item_ids with something like this:

SELECT item_id
  FROM item_words, words
 WHERE words.id = words_id
   AND (
        words.word = 'x'
        OR words.word = 'y'
       )
GROUP BY item_id
HAVING count(item_id) = 2;

Note that this would really only be valid if item_words does not allow
duplicates of (item_id, word_id).

This kind of thing could simplify all those INTERSECTs, but for some reason
it cannot be used as a subselect (at least in the version of postgresql I
am using, which I think is just 7.0).  The only way I could find to use
this kind of construct would be to run this query selecting into a
temporary table and running your original query using ...IN (SELECT item_id
FROM yourtemptable) as the subselect.

That's all I can think of.  I hope some of this was at least interesting,
even if it doesn't help in your particular application.

Doug

At 07:15 PM 9/8/00 -0300, Cristóvão Dalla Costa wrote:
>I'm working with an application I wrote which does something along these
>lines:
>
>SELECT ID FROM ITEM WHERE URL='X' FOR UPDATE
>IF (ROW RETURNED) {
>    $ID = ITEM.ID
>} ELSE {
>    SELECT nextval ('item_id_seq')
>    $ID = nextval
>    INSERT INTO ITEM....
>}
>DO OTHER STUFF WITH $ID
>
>So, I check if an item with a given url exists. If it does, I get its id and
>use it later. If it doesn't, I insert a new item and proceed with the new
>id. Everything happens inside a transaction. Now, there is a race condition
>where the first line is executed simultaneously by two processes, looking
>for the same url, and resulting in duplicate lines. So far, there are about
>40 duplicates in a 80,000 row database, and short of manually correcting
>them, I don't know what to do to fix the race condition.
>
>Finally, it seems I cannot do INTERSECT on subqueries, sice the following
>fails with a parse error "at or near INTERSECT", and the subquery by itself
>works.
>
>SELECT * FROM item WHERE id IN (SELECT item_id FROM item_words, words WHERE
>words.id = words_id AND words.word='x' INTERSECT SELECT item_id FROM
>item_words, words WHERE words.id = words_id AND words.word='y')
>
>Basically, I'm using the above query to look for words in a reverse index,
>sometimes with as many as 10 different words, causing a lot of rows to be
>generated, to be later filtered by the intersects. Are there any better ways
>to do that, performance-wise?
>
>BTW, I'm not sure whether this is the appropriate mailing list to report
>this, but the query optimizer should read
>SELECT * FROM x WHERE id IN (SELECT x_id FROM Y)
>as
>SELECT * FROM x WHERE EXISTS (SELECT * FROM Y WHERE x_id = x.id)
>when the tables are "large", and the necessary indexes exist.
>
>Thanks for the help.
>
>Cristovao.


pgsql-general by date:

Previous
From: "Rommel B. Abaya"
Date:
Subject: port from MSSQLServer to PostgreSQL
Next
From: Zlatko Calusic
Date:
Subject: Porting from mysql to psql (UNIX_TIMESTAMP()?)