Thread: PostgreSQL question re select for update (locking??)

PostgreSQL question re select for update (locking??)

From
Reid Thompson
Date:
Hello list,
We are using PostgreSQL 8.0.3.  Some background, and a couple of
questions..

We have a database table called "jobq" on the database machine,
and 2 networked server machines.
One of the network server machines has around 20 server processes
connecting over the network using ODBC.
These servers all attempt to "claim" jobs off the jobq to process
them.

Each server process claims a jobq record by selecting for update a
jobq record where the pid column is null, then rewrites the record with
the pid set in the pid column.

The "distilled" sql select statement is:

    * SELECT J.*, C.name, C.client_id, C.priority
    * FROM jobq J, campaign C
    * WHERE J.pid IS NULL
    * AND 'my_YYYYMMDDhhmmss'>=J.due_date
    * AND J.campaign_id=C.id
    * ORDER BY C.priority,J.due_date,J.item_id
    * LIMIT 1 FOR UPDATE;


What we are seeing is:

- We dump hundreds of records into the jobq table, expecting that a
  server process will each claim a record (and process it), but some of
  the processes call the claim function but get nothing returned by the
  select; they sleep for a short time (eg 10 seconds - this sleep
  differs on a per process basis) then attempt to claim again.
  Sometimes this claim returns a jobq record, other times it may take 4
  or 5 attempted claims before a record is actually returned by the
  select.

  Our expectation was that with hundreds of selectable records available,
  that each server process would 'get' the next available record --
  Is this a false expectation on our part?
  Is the currently 'active' select for update limit 1 blocking the others?
  If not, can someone give us some direction..??

- We do not see any SQL error in our server process diagnostic output
  (but I will revisit this to double check).

- We do not see any SQL error in the postgres logs.


Here is an example / snapshot of postgres processes on the database machine:
(ps -ef |grep "postgres.*192" |grep -v idle)
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
postgres 27866 6434 0 05:05 ? 00:03:21 postgres: obc obc 192.168.100.164(34499) SELECT
postgres 27871 6434 0 05:08 ? 00:05:57 postgres: obc obc 192.168.100.171(37066) SELECT waiting
postgres 27887 6434 0 05:09 ? 00:01:44 postgres: obc obc 192.168.100.171(37130) SELECT
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


Here is an example / snapshot of sessions:
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
 datid | procpid | usesysid |        current_query         | query_start
-------+---------+----------+------------------------------+-------------
 17231 |   23540 |      100 | <command string not enabled> |
 17231 |   23541 |      100 | <command string not enabled> |
 17231 |   23542 |      100 | <command string not enabled> |
 17231 |   23543 |      100 | <command string not enabled> |
 17231 |   23544 |      100 | <command string not enabled> |
 17231 |   23545 |      100 | <command string not enabled> |
 17231 |   23546 |      100 | <command string not enabled> |
 17231 |   23547 |      100 | <command string not enabled> |
 17231 |   23548 |      100 | <command string not enabled> |
 17231 |   23549 |      100 | <command string not enabled> |
 17231 |   23550 |      100 | <command string not enabled> |
 17231 |   23551 |      100 | <command string not enabled> |
 17231 |   23552 |      100 | <command string not enabled> |
 17231 |   23553 |      100 | <command string not enabled> |
 17231 |   23554 |      100 | <command string not enabled> |
 17231 |   23555 |      100 | <command string not enabled> |
 17231 |   23556 |      100 | <command string not enabled> |
 17231 |   23557 |      100 | <command string not enabled> |
 17231 |   23558 |      100 | <command string not enabled> |
 17231 |   23559 |      100 | <command string not enabled> |
 17231 |   23560 |      100 | <command string not enabled> |
 17231 |   23561 |      100 | <command string not enabled> |
 17231 |   23562 |      100 | <command string not enabled> |
 17231 |   23563 |      100 | <command string not enabled> |
 17231 |   23564 |      100 | <command string not enabled> |
 17231 |   23565 |      100 | <command string not enabled> |
 17231 |   23566 |      100 | <command string not enabled> |
 17231 |   23567 |      100 | <command string not enabled> |
 17231 |   23568 |      100 | <command string not enabled> |
 17231 |   23569 |      100 | <command string not enabled> |
 17231 |   23570 |      100 | <command string not enabled> |
 17231 |   23571 |      100 | <command string not enabled> |
 17231 |   23572 |      100 | <command string not enabled> |
 17231 |   23573 |      100 | <command string not enabled> |
 17231 |   23574 |      100 | <command string not enabled> |
 17231 |   23575 |      100 | <command string not enabled> |
(36 rows)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


Here is an example / snapshot of pg_locks:
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
 relation | database | transaction |  pid  |       mode       | granted
----------+----------+-------------+-------+------------------+---------
    18384 |    17231 |             | 23544 | AccessShareLock  | t
    18267 |    17231 |             | 23556 | AccessShareLock  | t
    18267 |    17231 |             | 23556 | RowShareLock     | t
          |          |    32900999 | 23564 | ShareLock        | f
          |          |    32900999 | 23556 | ExclusiveLock    | t
          |          |    32901005 | 23568 | ExclusiveLock    | t
    18418 |    17231 |             | 23556 | AccessShareLock  | t
    18384 |    17231 |             | 23556 | AccessShareLock  | t
    18357 |    17231 |             | 23555 | AccessShareLock  | t
    18256 |    17231 |             | 23544 | AccessShareLock  | t
    18352 |    17231 |             | 23554 | AccessShareLock  | t
    18352 |    17231 |             | 23554 | RowShareLock     | t
    18256 |    17231 |             | 23556 | AccessShareLock  | t
    18267 |    17231 |             | 23564 | AccessShareLock  | t
    18267 |    17231 |             | 23564 | RowShareLock     | t
    18352 |    17231 |             | 23544 | AccessShareLock  | t
    18352 |    17231 |             | 23544 | RowShareLock     | t
    18352 |    17231 |             | 23544 | RowExclusiveLock | t
          |          |    32901001 | 23564 | ExclusiveLock    | t
          |          |    32901000 | 23554 | ShareLock        | f
    18267 |    17231 |             | 23568 | AccessShareLock  | t
    18267 |    17231 |             | 23568 | RowShareLock     | t
    18267 |    17231 |             | 23554 | AccessShareLock  | t
    18267 |    17231 |             | 23554 | RowShareLock     | t
    18267 |    17231 |             | 23544 | AccessShareLock  | t
    18267 |    17231 |             | 23544 | RowShareLock     | t
          |          |    32901000 | 23544 | ExclusiveLock    | t
    18384 |    17231 |             | 23555 | AccessShareLock  | t
          |          |    32900979 | 23555 | ExclusiveLock    | t
          |          |    32901004 | 23554 | ExclusiveLock    | t
    18352 |    17231 |             | 23568 | AccessShareLock  | t
    18352 |    17231 |             | 23568 | RowShareLock     | t
          |          |    32901009 |  3909 | ExclusiveLock    | t
    18352 |    17231 |             | 23556 | AccessShareLock  | t
    18352 |    17231 |             | 23556 | RowShareLock     | t
    18352 |    17231 |             | 23556 | RowExclusiveLock | t
          |          |    32900999 | 23568 | ShareLock        | f
    16839 |        1 |             |  3909 | AccessShareLock  | t
    18390 |    17231 |             | 23555 | AccessShareLock  | t
    18352 |    17231 |             | 23564 | AccessShareLock  | t
    18352 |    17231 |             | 23564 | RowShareLock     | t
    18418 |    17231 |             | 23544 | AccessShareLock  | t
(42 rows)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Re: PostgreSQL question re select for update (locking??)

From
Richard Huxton
Date:
Reid Thompson wrote:
>
> Each server process claims a jobq record by selecting for update a
> jobq record where the pid column is null, then rewrites the record with
> the pid set in the pid column.
>
> The "distilled" sql select statement is:
>
>     * SELECT J.*, C.name, C.client_id, C.priority
>     * FROM jobq J, campaign C
>     * WHERE J.pid IS NULL
>     * AND 'my_YYYYMMDDhhmmss'>=J.due_date
>     * AND J.campaign_id=C.id
>     * ORDER BY C.priority,J.due_date,J.item_id
>     * LIMIT 1 FOR UPDATE;
>
>
> What we are seeing is:
>
> - We dump hundreds of records into the jobq table, expecting that a
>   server process will each claim a record (and process it), but some of
>   the processes call the claim function but get nothing returned by the
>   select; they sleep for a short time (eg 10 seconds - this sleep
>   differs on a per process basis) then attempt to claim again.
>   Sometimes this claim returns a jobq record, other times it may take 4
>   or 5 attempted claims before a record is actually returned by the
>   select.
>
>   Our expectation was that with hundreds of selectable records available,
>   that each server process would 'get' the next available record --
>   Is this a false expectation on our part?
>   Is the currently 'active' select for update limit 1 blocking the others?
>   If not, can someone give us some direction..??

The problem is that the order of events isn't quite what you expect.

You can think of the SELECT ... LIMIT 1 as identifying a set containing
one row. Since it's FOR UPDATE, the backend tries to get a lock, and if
necessary waits until it can. Of course, all the other processors are
doing the same thing.

When another process updates that row, the backend re-checks the
conditions and you end up with a set of zero rows, which it helpfully
gains a lock on and then returns to you!

So - what to do? Well, there are several options.

1. Keep the lock/update period short by just marking the row as
"claimed" and committing. You still need to loop trying to get a job to
process, but the process is smoother because the possible window for
conflicts is short.

2. Select "random" jobs from the list if you don't care about order.
Again, you're just making things smoother, not removing the requirement
to loop checking.

3. Have a "master" process that allocates jobs to processors. Or
simulate such a thing by having processors call an update_jobs()
function. That way you can pre-allocate jobs to processors (at perhaps a
cost in efficiency if the number of jobs in-queue falls below the number
of processors).

--
   Richard Huxton
   Archonet Ltd