Re: Possible multiprocess lock/unlock-loop problem in Postgresql 9.2 - Mailing list pgsql-general
From | Yngve N. Pettersen |
---|---|
Subject | Re: Possible multiprocess lock/unlock-loop problem in Postgresql 9.2 |
Date | |
Msg-id | op.w86cl6ot3dfyax@damia Whole thread Raw |
In response to | Re: Possible multiprocess lock/unlock-loop problem in Postgresql 9.2 (Andrew Sullivan <ajs@crankycanuck.ca>) |
Responses |
Re: Possible multiprocess lock/unlock-loop problem in
Postgresql 9.2
|
List | pgsql-general |
Hi, On Sat, 04 Jan 2014 16:23:42 +0100, Andrew Sullivan <ajs@crankycanuck.ca> wrote: > On Sat, Jan 04, 2014 at 12:14:42PM +0100, Yngve N. Pettersen wrote: >> The update query looks like this: >> >> UPDATE queue SET state = E'S' WHERE state = E'I' AND id IN (<list >> of integers>) RETURNING id; >> >> There is a BEGIN/COMMIT wrap around the operation, including the >> SELECT query. > > Am I right that you're doing the SELECT, pulling that up into the > application, and then that becomes your <list of integers>? If so, my > first advice is, "Don't do that," because you're making the > transaction run longer (which is probably part of the problem). > Instead, you probably want something like > > UPDATE . . . AND id IN (SELECT . . .LIMIT. . .) RETURNING id; > > This won't solve your problem perfectly, though. I tried that before, but ran into some issues, IIRC a similar looping problem as this where queries never ended. I split it up in an attempt to solve that problem. >> My guess is that the active processes get into a lock/unlock loop >> regarding the "state" field because the list of ids overlap, and for >> some reason, instead of completing the operation according to some >> kind of priority order. > > My guess is that you have a "lock inversion", yes, but it's hard to > guess what. You want to look at the pg_locks view to figure what's > blocking what. It seems likely that your case is not strictly a > deadlock. A deadlock is a case where transaction A has a lock on > something that transaction B needs, and needs to wait for a lock on an > object that is locked by transaction B. Neither one could possibly > complete, and you get a deadlock detection. That is why I called it a (possible) multiprocess lock/unlock loop, not a deadlock. It looks to me like one process starts looking at a task, then realizes that this can be or are being handled by another process, and passes the baton, and that this happens so quickly that the processes can keep 100% CPU > There's an additional possibility that is suggested by your > description, and that is that it's not locks at all, but that you're > running into some kind of system issue. Did you adjust the setting of > sort_mem? It's a long shot, but it could be that if there are enough > sorts in the SELECT (which you don't show us), you're chewing through In the select/update case there is no sorting in the query; there is an offset/limit clause though, number of records retrieved are currently restricted to <10000 per query (out of 20 million in the active subset). SELECT id from queue where state = E'I' and job_id = <integer> offset <random 200..150000> limit <1-6000> The offset is to avoid having the processes thread on each others toes and avoid selecting the same records, in case the order is predictable There are indexes on on several fields including conditionals on some state values in combination with job_id As I recall, there is no sorting in the SELECT case either, although there is some filtering on some of the tables to restrict what is being returned. > a lot of sort_mem. Remember that sort_mem is allocated _per sort_, so > it could be that a lot of these allocations fit fine in real memory if > only 8 processes are doing it; but if 10 do, you pass your threshold > for physical memory and start swapping. I wouldn't expect high CPU > under that case, though, but high I/O. So I think it's doubtful. However, in the UPDATE case, the looping processes are all UPDATE queries, no SELECTs involved. There is no swap on the system (I don't want swapping, precisely because it then slows down the system), but there is lots of RAM, with 180+GB available. I don't have the config file handy at the moment, but as I recall, while I did adjust the sort_mem variable it was not high enough (IIRC <=512MB) to consume that much memory in 10 processes even with a lot of sorts. I have not seen any peaks in application memory during the loop events as I recall. Another thing that IMO count against this is that for the UPDATE case, the system seems to be quite able to handle 50-60 such queries at a time (inside 10-30 seconds), except the first time the queue is accessed after the job list has been initialized (and after ANALYZE have been run on the table, and the table is analyzed 6+ times an hour). -- Using Opera's mail client: http://www.opera.com/mail/
pgsql-general by date: