Lock/deadlock issues with priority queue in Postgres - possible VACUUM conflicts - Mailing list pgsql-general
From | Chris Angelico
---|---
Subject | Lock/deadlock issues with priority queue in Postgres - possible VACUUM conflicts
Msg-id | CAPTjJmre8jFOj2Lx3O01qe+-qqdGsY4iR5eOqzxN20-HCXYFJw@mail.gmail.com
List | pgsql-general
This has got to be a solved problem, but I can't find a simple example of what to do, nor a description of what I'm doing wrong. In this project, we need a disk-based priority queue with a few snazzy features (eg limited retries and scheduling), but nothing particularly exotic. Requests are added to the queue by any number of different processes, and it's not a problem if a request "misses the check" and is delayed by a few seconds. Requests are processed by a pool of workers which are unaware of each other (and may be running on different computers); the sole arbiter is the database itself.

Everything's currently on a single Debian 6.0.3 laptop, with Postgres (the .deb package from openscg.org) set to max_connections = 5100, and the kernel reconfigured to allow 1GB of shared memory and "500 32000 32 1024" semaphores. 'select version()' gives:

    PostgreSQL 9.1.2 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46), 64-bit

The current (PHP) implementation involves a 'claimed' column (NULL for unclaimed, otherwise an integer worker id - this allows for the detection of crashed workers) and the following query:

    update interfaceout
       set claimed = $workerid, lastretry = now(), retrycount = retrycount + 1
     where claimed is null
       and id = (select id from interfaceout
                  where claimed is null and not done and scheduletime <= now()
                  order by priority, random()
                  limit 1)
    returning *

This query is performed (in autocommit mode), some (currently minimal) processing is done, and then another query "closes off" the call:

    update interfaceout
       set claimed = null, done = $1, response = $2, success = $3
     where id = $4

with parameters set according to whether or not the call was completed (if not, it gets retried), and the results of the processing. Another process is then signalled to handle the response, and that process will delete the row.

With one single worker process, this works fine; but with 16K calls and 50 workers (not ridiculously huge numbers by any means), lock contention becomes a major issue, and there are even occasional deadlocks. Performance plummets. I'm not entirely sure, but it seems the problems start when autovacuum kicks in. Monitoring the pg_locks table (joined with pg_stat_activity to see the query currently in progress) shows that the time is all being spent on the first query, attempting to set claimed; the other queries almost never show up in the list.

Researching the issue brought up a couple of good links:

http://stackoverflow.com/questions/389541/select-unlocked-row-in-postgresql
http://blog.endpoint.com/2008/12/parallel-inventory-access-using.html
http://wiki.postgresql.org/wiki/PGQ_Tutorial

PGQ looks promising, but I can't afford the risk of losing calls in the event that there are no workers to process them (the correct action is for them simply to languish in the database until one is started up). The StackOverflow thread lists a number of excellent ideas, most of which are already implemented in the above code, or are unnecessary (for instance, each worker needs only acquire one row - no need to grab the top N). There's some interesting stuff on advisory locks; is that the only way to do this safely? (A rough sketch of how that might look is below.)

This is surely something that has been done before. Can anyone point me to a good tutorial on the subject? Thanks!

Chris Angelico
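For illustration, here is a minimal, untested sketch of the advisory-lock approach from the End Point article, adapted to the schema above. It assumes interfaceout.id is a unique integer that can serve as the bigint key pg_try_advisory_lock() expects:

    update interfaceout
       set claimed = $workerid, lastretry = now(), retrycount = retrycount + 1
     where claimed is null
       and id = (select id from interfaceout
                  where claimed is null and not done and scheduletime <= now()
                    and pg_try_advisory_lock(id)  -- skip rows a competing worker holds
                  order by priority, random()
                  limit 1)
    returning *;

    -- Caveat: the planner may call pg_try_advisory_lock() on candidate rows
    -- it later discards, so the session can end up holding locks on rows it
    -- never claimed. Release everything once the claim is committed (or
    -- abandoned):
    select pg_advisory_unlock_all();

The idea is that pg_try_advisory_lock() returns false immediately rather than blocking, so concurrent workers skip past each other instead of queueing on the same candidate row; the trade-off is that these advisory locks are session-scoped and must be released explicitly, hence the pg_advisory_unlock_all() afterwards.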