BUG #9557: Row not visible after receiving notification - Mailing list pgsql-bugs

From marko@joh.to
Subject BUG #9557: Row not visible after receiving notification
Date
Msg-id 20140313093922.335.80556@wrigleys.postgresql.org
Whole thread Raw
Responses Re: BUG #9557: Row not visible after receiving notification  (Marko Tiikkaja <marko@joh.to>)
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      9557
Logged by:          Marko Tiikkaja
Email address:      marko@joh.to
PostgreSQL version: 9.1.12
Operating system:   Linux
Description:

Hi,

We went live with an application a few days ago, and it's been having
problems with seeing rows after receiving a notification from the
transaction that committed it.  I can't say for sure how often it happens,
but I've seen evidence suggesting at least three separate occasions.

The general design is that we have two independent "RPC Workers" LISTENing
for notifications.  Upon receiving a notification, they call a function
Claim_Request() which does this:

  BEGIN
    SELECT
      Requests.RequestID INTO _RequestID
    FROM rpcworker.Requests
    WHERE
      state = 'UNCLAIMED' AND
      -- sanity check
      Requests.Claimed IS NULL
    ORDER BY Requests.RequestID
    LIMIT 1
    FOR UPDATE ;
    IF NOT FOUND THEN
      RETURN;
    END IF;

    -- otherwise UPDATE to set claimed=now()

I suspected it might be something wonky with FOR UPDATE, so I added an extra
query sent right before Claim_Request():

    SELECT Requests.RequestID
    FROM rpcworker.Requests
    WHERE state = 'UNCLAIMED' AND
                Requests.Claimed IS NULL
    ORDER BY Requests.RequestID
    LIMIT 1

Below I have attached the relevant portions of the log when this happened.
In the example,  the SELECT without the FOR UPDATE did not see the row, but
the subsequent Claim_Request() did (supposedly due to the small latency
added by the extra round-trip by the SELECT before it).

RPCWorker B appeared to be working "normally" in this case (though it was a
bit slower to react than A), but I've attached its logs for posterity as
well.  In previous cases, where both of them called Claim_Request()
directly, we saw a couple of instances where *neither* of the RPC workers
saw the new row, which suggests something's terribly wrong.


## RPCWORKER B (UTC): ##


received notification with payload 8839
2014-03-12 21:11:27.250 +0000 LOG:  received notification on channel "RPC
Worker"
2014-03-12 21:11:27.253 +0000 LOG:  saw requestid 8839 without FOR UPDATE


## RPCWORKER A (UTC): ##


received notification with payload 8839
2014-03-12 21:11:27.240 +0000 LOG:  received notification on channel "RPC
Worker"
2014-03-12 21:11:27.245 +0000 LOG:  plain SELECT: sql: no rows in result
set
2014-03-12 21:11:27.251 +0000 WARNING:  SNAKE: saw 0, claimed 8839
2014-03-12 21:11:27.251 +0000 LOG:  received requestid 8839
2014-03-12 21:11:27.253 +0000 LOG:  plain SELECT: sql: no rows in result
set
2014-03-12 21:11:27.531 +0000 LOG:  request 8839 (attempt 1) succeeded
2014-03-12 21:11:27.562 +0000 LOG:  finished request 8839 (attempt 1)


## DBWORKER (UTC + 1): ##


2014-03-12 22:11:27.237
CET,"dbworker","db",34682,"10.1.1.71:42307",531f39d8.877a,211904,"SELECT",2014-03-11
17:29:12 CET,583/727697,1268656677,WARNING,01000,"
# the above line is the last piece of information we see from the
transaction sending the notification

# committed somewhere around here

2014-03-12 22:11:27.246
CET,"dbworker","db",34682,"10.1.1.71:42307",531f39d8.877a,211905,"SELECT",2014-03-11
17:29:12 CET,583/0,0,LOG,00000,"duration: 42.826
ms",,,,,,,,"exec_simple_query, postgres.c:1090",""

# known committed at this point (but a query has already been executed!)



## RPCWORKER A IN DB (UTC + 1): ##


2014-03-12 22:11:21.052
CET,"rpcworker","db",33568,"10.1.1.72:48556",53207d07.8320,12480,"SELECT",2014-03-12
16:28:07 CET,34/0,0,LOG,00000,"duration: 0.741
ms",,,,,,,,"exec_simple_query, postgres.c:1090","rpcworker

# idle for a good while here

2014-03-12 22:11:27.245
CET,"rpcworker","db",33568,"10.1.1.72:48556",53207d07.8320,12481,"idle",2014-03-12
16:28:07 CET,34/2000571,0,LOG,00000,"statement: -- NO FOR UPDATE
2014-03-12 22:11:27.245
CET,"rpcworker","db",33568,"10.1.1.72:48556",53207d07.8320,12482,"SELECT",2014-03-12
16:28:07 CET,34/0,0,LOG,00000,"duration: 0.983
ms",,,,,,,,"exec_simple_query, postgres.c:1090","rpcworker"

# didn't see the row

2014-03-12 22:11:27.252
CET,"rpcworker","db",33568,"10.1.1.72:48556",53207d07.8320,12483,"idle",2014-03-12
16:28:07 CET,34/2000572,0,LOG,00000,"statement:  -- Claim_Request()

# sees the row!



## RPCWORKER B IN DB (UTC + 1): ##


2014-03-12 22:11:27.253
CET,"rpcworker","db",32865,"10.2.1.72:33236",53207cee.8061,14392,"idle",2014-03-12
16:27:42 CET,95/1580775,0,LOG,00000,"statement: -- NO FOR UPDATE

# is a bit sluggish, so sees the row!

2014-03-12 22:11:27.254
CET,"rpcworker","db",32865,"10.2.1.72:33236",53207cee.8061,14394,"idle",2014-03-12
16:27:42 CET,95/1580776,0,LOG,00000,"statement:  -- Claim_Request()

# would likely see the row, but already claimed by worker A

pgsql-bugs by date:

Previous
From: Kyotaro HORIGUCHI
Date:
Subject: Re: BUG #9553: why bitmap index scan startup_cost=0? it's a bug?
Next
From: Fujii Masao
Date:
Subject: Re: BUG #9118: WAL Sender does not disconnect replication clients during shutdown