Possible multiprocess lock/unlock-loop problem in Postgresql 9.2 - Mailing list pgsql-general
From | Yngve N. Pettersen |
---|---|
Subject | Possible multiprocess lock/unlock-loop problem in Postgresql 9.2 |
Date | |
Msg-id | op.w85tisjc3dfyax@damia Whole thread Raw |
Responses |
Re: Possible multiprocess lock/unlock-loop problem in
Postgresql 9.2
|
List | pgsql-general |
Hello all, I am running a Postgresql 9.2 system (IIRC v9.2.4; I am about to upgrade to 9.2.6) on a system with 32-cores, 256GB RAM, 64GB shared RAM for postgresql. The applications I am running are written using Django (currently v1.5) For a while I have been observing what may be special kind of deadlock, or perhaps more precisely a lock-unlock loop among a group of postgresql processes performing similar, possibly overlapping queries. I am wondering if this is a known issue? If so, are there any workarounds aside from the ones I am currently using myself? I did not see something like this listed in the lock section of the Bug Todo list. I have so far observed this in two specific query scenarios, and the problem seem to trigger at 10 or more processes performing similar queries (with 8 processes I do not observe the problem). When this happens, 10 of the processes (the rest will be in waiting state) will run at 100% CPU (one full core) without apparently being able to complete the operation. IIRC I have seen this go on for hours (even for queries that should complete in fractions of a second) before I have killed the postgres processes (they are so busy they do not detect that the connection has been closed by killing the remote process). In one of the cases, an UPDATE operation, the problem only appear to happen during the first such updates, when all relevant entries are in the same state, not later, so I have been able to avoid the problem by making sure that initially there are no more than 8 active processes working on the database. IIRC even just killing the processes and restarting the job without any changes to process counts seem to avoid the problem, but that does not avoid the initial startup problem. The queries are (supposed to) start in a random staggered sequence several seconds apart just to avoid too many operations at the same time. The second case is a multi-table SELECT, and again running 10 or more processes seem to trigger the problem, and I have had to reduce the number of parallel queries to 8 to avoid the issues. In this case it might be possible to randomize the sequence of queries, but that still could accidentally trigger a similar 10+ parallel query case, and all of the queries will always access at least one of the tables. I am investigating other alternatives for this case. At present I have not had time to create a testcase for this, but I can give some details of the UPDATE scenario. The relevant fields in the table (a job queue list) are the unique "id"-field and a "state" field with 3 values (Idle, Started, Finished). The operation have two steps, first a retrieval of a group id's for currently idle entries, then an update of the records with those IDs to state started, assuming they are still idle (to avoid multiple allocations of the same job), returning a list of updated item ids. The update query looks like this: UPDATE queue SET state = E'S' WHERE state = E'I' AND id IN (<list of integers>) RETURNING id; There is a BEGIN/COMMIT wrap around the operation, including the SELECT query. My guess is that the active processes get into a lock/unlock loop regarding the "state" field because the list of ids overlap, and for some reason, instead of completing the operation according to some kind of priority order. I have no real guess about what is causing the SELECT operations to loop in that fashion, since there is no write operations that could cause the kind of apparent loop I see in the UPDATE operation. A wild guess could be that the SELECT processes are passing around the operation for creating a common table join for linking records from two (or more tables), or just retrieval of records from a single table, since the operation overlap tables and sub selections from those tables. If this guess have some bearing on what is going on, it may also be something similar that is going on in the UPDATE case, e.g. an attempt to collect all the affected records, and the job just gets passed on to the next process due to some logic. Any ideas about what is going on? -- Using Opera's mail client: http://www.opera.com/mail/
pgsql-general by date: