Possible multiprocess lock/unlock-loop problem in Postgresql 9.2 - Mailing list pgsql-general

From Yngve N. Pettersen
Subject Possible multiprocess lock/unlock-loop problem in Postgresql 9.2
Date
Msg-id op.w85tisjc3dfyax@damia
Whole thread Raw
Responses Re: Possible multiprocess lock/unlock-loop problem in Postgresql 9.2
List pgsql-general
Hello all,

I am running a Postgresql 9.2 system (IIRC v9.2.4; I am about to upgrade
to 9.2.6) on a system with 32-cores, 256GB RAM, 64GB shared RAM for
postgresql. The applications I am running are written using Django
(currently v1.5)

For a while I have been observing what may be special kind of deadlock, or
perhaps more precisely a lock-unlock loop among a group of postgresql
processes performing similar, possibly overlapping queries.

I am wondering if this is a known issue? If so, are there any workarounds
aside from the ones I am currently using myself? I did not see something
like this listed in the lock section of the Bug Todo list.

I have so far observed this in two specific query scenarios, and the
problem seem to trigger at 10 or more processes performing similar queries
(with 8 processes I do not observe the problem). When this happens, 10 of
the processes (the rest will be in waiting state) will run at 100% CPU
(one full core) without apparently being able to complete the operation.
IIRC I have seen this go on for hours (even for queries that should
complete in fractions of a second) before I have killed the postgres
processes (they are so busy they do not detect that the connection has
been closed by killing the remote process).

In one of the cases, an UPDATE operation, the problem only appear to
happen during the first such updates, when all relevant entries are in the
same state, not later, so I have been able to avoid the problem by making
sure that initially there are no more than 8 active processes working on
the database. IIRC even just killing the processes and restarting the job
without any changes to process counts seem to avoid the problem, but that
does not avoid the initial startup problem. The queries are (supposed to)
start in a random staggered sequence several seconds apart just to avoid
too many operations at the same time.

The second case is a multi-table SELECT, and again running 10 or more
processes seem to trigger the problem, and I have had to reduce the number
of parallel queries to 8 to avoid the issues. In this case it might be
possible to randomize the sequence of queries, but that still could
accidentally trigger a similar 10+ parallel query case, and all of the
queries will always access at least one of the tables. I am investigating
other alternatives for this case.

At present I have not had time to create a testcase for this, but I can
give some details of the UPDATE scenario.

The relevant fields in the table (a job queue list) are the unique
"id"-field and a "state" field with 3 values (Idle, Started, Finished).

The operation have two steps, first a retrieval of a group id's for
currently idle entries, then an update of the records with those IDs to
state started, assuming they are still idle (to avoid multiple allocations
of the same job), returning a list of updated item ids.

The update query looks like this:

   UPDATE queue SET state = E'S' WHERE state = E'I' AND id IN (<list of
integers>) RETURNING id;

There is a BEGIN/COMMIT wrap around the operation, including the SELECT
query.

My guess is that the active processes get into a lock/unlock loop
regarding the "state" field because the list of ids overlap, and for some
reason, instead of completing the operation according to some kind of
priority order.

I have no real guess about what is causing the SELECT operations to loop
in that fashion, since there is no write operations that could cause the
kind of apparent loop I see in the UPDATE operation. A wild guess could be
that the SELECT processes are passing around the operation for creating a
common table join for linking records from two (or more tables), or just
retrieval of records from a single table, since the operation overlap
tables and sub selections from those tables. If this guess have some
bearing on what is going on, it may also be something similar that is
going on in the UPDATE case, e.g. an attempt to collect all the affected
records, and the job just gets passed on to the next process due to some
logic.

Any ideas about what is going on?

--
Using Opera's mail client: http://www.opera.com/mail/


pgsql-general by date:

Previous
From: David Johnston
Date:
Subject: Re: returning json data row from json query
Next
From: Adrian Klaver
Date:
Subject: Re: authentication failure