Re: Possible multiprocess lock/unlock-loop problem in Postgresql 9.2 - Mailing list pgsql-general
From:           Yngve N. Pettersen
Subject:        Re: Possible multiprocess lock/unlock-loop problem in Postgresql 9.2
Date:
Msg-id:         op.xayw5kw53dfyax@killashandra.invalid.invalid
In response to: Re: Possible multiprocess lock/unlock-loop problem in Postgresql 9.2 ("Yngve N. Pettersen" <yngve@spec-work.net>)
Responses:      Re: Possible multiprocess lock/unlock-loop problem in Postgresql 9.2
List:           pgsql-general
Hi again,

Sorry about the delay; an unexpected three-week trip, combined with not being able to fix the system's router (its configuration had become corrupted before I left), meant that I could not do any testing until this week, after the router had been repaired and reinstalled.

I just did a test with expanded logging of queries and durations, and this time saw the problem with just 4 (four) processes, not the 10 I have seen before. The relevant parts of the log, as well as an EXPLAIN and a table-info dump, are attached in a zipfile; the large lists of 6000 to-be-updated IDs in the UPDATE commands have been removed for clarity (the ID sequences did not overlap, each was within a range of 7500 IDs, and each was at least 80000 away from the others).

Background: Immediately before this operation the queue had been set up with the command

  INSERT INTO probedata2_probequeue (part_of_run_id, server_id, state)
  SELECT '334' AS part_of_run_id, server_id, E'I' AS state
  FROM probedata2_preparedqueueitem
  WHERE part_of_queue_id = '2'

followed by a COMMIT, an operation that took ~15 minutes to copy a 23-million-entry table into the queue. After this the target table contains 70 million entries in total, distributed across 3 runs. Some SELECT count and ANALYZE operations have probably also been performed after the initialization, before the processes were started.

Shortly afterward (within a couple of minutes), four processes initiated the SELECT and UPDATE sequence I outlined earlier; each operation started between 1 and 10 seconds after one of the others, about 20 seconds from the first command to the last.

  SELECT id FROM probedata2_probequeue
  WHERE state = E'I' AND part_of_run_id = 334
  LIMIT 6000 OFFSET 85103

  UPDATE probedata2_probequeue SET "state"=E'S'
  WHERE "state" = E'I' AND id IN ( .....)
  RETURNING id

The SELECT operations took 21-122 ms to complete, while the UPDATEs took 1093000-1120000 ms (~18 minutes) to complete. During this time the processes were running at 100% CPU. With more processes involved earlier, I recall seeing even longer execution times for the UPDATEs before I killed the processes (this is the first time I have seen this kind of situation resolve itself without killing the processes).

For reference, once these 4 commands had completed (the system is set up to wait until it sees task-completed messages from the first processes that were started before starting new ones), the next SELECT/UPDATE pair took 122 ms and 107 ms respectively, and the one after that took 50 ms and 108 ms.

Any suggestions for where to investigate further? I am considering setting up a small example to see if I can reproduce this, but have not had time to do so yet.
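For reference, the expanded logging mentioned above amounts to settings along these lines in postgresql.conf (a sketch only; my exact values may differ, but %p in log_line_prefix is what puts the backend PID in the log, as suggested below):

  log_min_duration_statement = 0    # log every statement together with its duration
  log_line_prefix = '%t [%p] '      # timestamp plus backend PID
  log_lock_waits = on               # log lock waits that exceed deadlock_timeout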
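Next time the UPDATEs hang like this I intend to capture the lock state while they are still running. A minimal sketch of the query I have in mind, assuming the 9.2 view columns (pg_stat_activity.pid, query, waiting):

  -- run in a separate session while the slow UPDATEs are in progress
  SELECT l.pid, l.locktype, l.relation::regclass AS relation,
         l.mode, l.granted, a.waiting, a.query
  FROM pg_locks l
  JOIN pg_stat_activity a ON a.pid = l.pid
  ORDER BY l.granted, l.pid;

Since the processes sit at 100% CPU rather than in an obvious lock wait, this may well show every lock as granted, but either way it should narrow things down.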
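If I do retry merging the SELECT back into the UPDATE, as discussed in the quoted thread below, I imagine it would look roughly like this (a sketch only; note that the OFFSET is still taken over an unsorted result):

  UPDATE probedata2_probequeue
  SET state = E'S'
  WHERE state = E'I'
    AND id IN (SELECT id FROM probedata2_probequeue
               WHERE state = E'I' AND part_of_run_id = 334
               LIMIT 6000 OFFSET 85103)
  RETURNING id;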
On Sat, 04 Jan 2014 20:06:19 +0100, Yngve N. Pettersen <yngve@spec-work.net> wrote:

> On Sat, 04 Jan 2014 19:40:31 +0100, Andrew Sullivan <ajs@crankycanuck.ca> wrote:
>
>> On Sat, Jan 04, 2014 at 07:07:08PM +0100, Yngve N. Pettersen wrote:
>>> I tried that before, but ran into some issues, IIRC a similar looping
>>> problem as this where queries never ended. I split it up in an attempt
>>> to solve that problem.
>>
>> Pulling the data out into the application and sending it back in won't
>> improve things. Exactly the same number of rows need to be visited,
>> but the way you have it now you have to marshall all the data and ship
>> it to the application too. So it's automatically slower. Indeed,
>> making it slower might have masked your problem.
>
> Could be.
>
>>> In the select/update case there is no sorting in the query; there is an
>>> offset/limit clause though, and the number of records retrieved is
>>> currently restricted to <10000 per query (out of 20 million in the
>>> active subset).
>>>
>>>   SELECT id from queue where state = E'I' and job_id = <integer>
>>>   offset <random 200..150000> limit <1-6000>
>>
>> This could be part of the problem. Are the different threads working on
>> different job_ids, or is that the same job_id? If you don't SORT that
>
> Same job_id, at least in the current system.
>
>> query before the OFFSET, then the rows will come back in whatever
>> order the system likes.
>
> I suspect that a sort operation on (currently) 20+ million rows for
> every query that fetches just 6000 entries (the previous version used
> 1500) would cause quite a bit more slowness than breaking the query up
> into two operations, or than the risk of collisions would, because each
> process would have to load all that information (even if it is cached).
>
>>> However, in the UPDATE case, the looping processes are all UPDATE
>>> queries, no SELECTs involved.
>>
>> But you said it's all in the same transaction scope. The lock is at the
>> transaction scope.
>
> But the statements it locks/loops on are only UPDATE statements, also in
> the processes that are waiting.
>
>> Anyway, what I'd do is try to cause the condition and post the
>> pg_locks information. When I've done this in the past, usually the
>> best thing to do is also to have query logs on for everything (don't
>> forget to log the pid!) so you can see what the other transaction
>> you're waiting on touched already. You can usually find the inversion
>> that way. Once you see it, it's always obvious what you've done, in
>> my experience (and completely mystifying before that, unfortunately).
>
> Will take a look in a few days, probably midweek.

--
Sincerely,
Yngve N. Pettersen
Using Opera's mail client: http://www.opera.com/mail/