Thread: dataset lock
Hello, I use a PG database on a HPC system (cluster). My processes get a dataset from the database and change the row, each processis independend. My table shows something like: id, status, data id = PK a unqiue number status a enum value which "open", "waiting", "working", "done" So each process calls a SQL statement select * from where status = "waiting", so the process should get the next waitingtask, after the process gets the task, the status should be changed to "working", so no other process shouldn't get the task. My processes are independed,so it can be, that 2 (or more) processes call the select statement at the same time and get in this case equal tasks, so I need somelocking. How can I do this with Postgres, that each row / task in my table is read / write by one process. On threads I would create a mutexeg: lock() row = select * from table where status = waiting update status = working from table where id = row.id unlock() do something with row Which is the best solution with postgres? should I create a procedure which takes the next job, change it and returns theid, so each process calls "select getNextJob()" ? Thanks Phil
On Apr 16, 2013, at 7:50 AM, Philipp Kraus <philipp.kraus@flashpixx.de> wrote: > Hello, > > I use a PG database on a HPC system (cluster). My processes get a dataset from the database and change the row, each processis independend. > My table shows something like: id, status, data > > id = PK a unqiue number > status a enum value which "open", "waiting", "working", "done" > > So each process calls a SQL statement select * from where status = "waiting", so the process should get the next waitingtask, after the process > gets the task, the status should be changed to "working", so no other process shouldn't get the task. My processes areindepended, so it can > be, that 2 (or more) processes call the select statement at the same time and get in this case equal tasks, so I need somelocking. How can > I do this with Postgres, that each row / task in my table is read / write by one process. On threads I would create a mutexeg: > > lock() > row = select * from table where status = waiting > update status = working from table where id = row.id > unlock() > > do something with row > > Which is the best solution with postgres? should I create a procedure which takes the next job, change it and returns theid, so each process > calls "select getNextJob()" ? "select for update" might be the answer to what you're asking for - it'll lock the rows matched until the end of the transaction,blocking any other select for update on the same rows. If performance is important then you might want to look at some of the off the shelf queuing systems instead - PgQ or queue_classic,for instance. Cheers, Steve
On 2013-04-16 19:11:20 +0200, Steve Atkins said: > > On Apr 16, 2013, at 7:50 AM, Philipp Kraus <philipp.kraus@flashpixx.de> wrote: > >> Hello, >> >> I use a PG database on a HPC system (cluster). My processes get a >> dataset from the database and change the row, each process is >> independend. >> My table shows something like: id, status, data >> >> id = PK a unqiue number >> status a enum value which "open", "waiting", "working", "done" >> >> So each process calls a SQL statement select * from where status = >> "waiting", so the process should get the next waiting task, after the >> process >> gets the task, the status should be changed to "working", so no other >> process shouldn't get the task. My processes are independed, so it can >> be, that 2 (or more) processes call the select statement at the same >> time and get in this case equal tasks, so I need some locking. How can >> I do this with Postgres, that each row / task in my table is read / >> write by one process. On threads I would create a mutex eg: >> >> lock() >> row = select * from table where status = waiting >> update status = working from table where id = row.id >> unlock() >> >> do something with row >> >> Which is the best solution with postgres? should I create a procedure >> which takes the next job, change it and returns the id, so each process >> calls "select getNextJob()" ? > > "select for update" might be the answer to what you're asking for - > it'll lock the rows matched until the end of the transaction, blocking > any other select for update on the same rows. Okay my explaination are a little bit bad, so I try it in another way: My PG database is connected to differend cluster nodes (MPI). Each programm / process on each node are independed and run the SQL select * from table where status = waiting after that I update the row with the update statement (set status = working) so in this case one process can run the select, than comes another process and runs also the select, but both processes get an equal row. But this does not allowed. The second process need not see the row, which is taken by the first process. So can I suppress, that a select call sees a row, which is locked by a transaction? So I would like to do this with a store procedure, that runs the select and the update and after that it returns the PK of the selected dataset. If two (or more) processes run the SP at the same time, but the update can create an error, so the stored procedure is stopped and must called again. I need a solution, that a row, which is taken by one process not shown by all other processes Thx Phil
Philipp Kraus wrote: > My PG database is connected to differend cluster nodes (MPI). Each > programm / process on each node are independed and run the SQL > select * from table where status = waiting > after that I update the row with the update statement (set status = working) > > so in this case one process can run the select, than comes another > process and runs also the select, but both processes get an equal row. > But this does not allowed. The second process need not see the row, > which is taken by the first process. So can I suppress, that a select > call > sees a row, which is locked by a transaction? So I would like to do > this with a store procedure, that runs the select and the update and > after that > it returns the PK of the selected dataset. If two (or more) processes > run the SP at the same time, but the update can create an error, so the > stored > procedure is stopped and must called again. > I need a solution, that a row, which is taken by one process not shown > by all other processes Do you want to implement something like a queue? I can think of two techniques: 1) Locking ---------- In a transaction, you get a few rows for processing by SELECT * FROM table WHERE status = waiting ORDER BY id LIMIT 5 FOR UPDATE; ("id" is the primary key here). Then you process and update the rows and commit. This will cause concurrent SELECT FOR UPDATE operations to block until the transaction is committed, effectively serializing the processing. 2) Set a marker --------------- You get a few rows by UPDATE table SET status = processing WHERE id IN (SELECT id FROM table WHERE status = waiting ORDER BY id LIMIT 5) RETURNING *; Then process and update the rows. This won't block concurrent processes for the whole time it takes to process the rows, so it's probably closer to what you want. Yours, Laurenz Albe
On 2013-04-17 09:18:13 +0200, Albe Laurenz said: > Philipp Kraus wrote: >> My PG database is connected to differend cluster nodes (MPI). Each >> programm / process on each node are independed and run the SQL >> select * from table where status = waiting >> after that I update the row with the update statement (set status = working) >> >> so in this case one process can run the select, than comes another >> process and runs also the select, but both processes get an equal row. >> But this does not allowed. The second process need not see the row, >> which is taken by the first process. So can I suppress, that a select >> call >> sees a row, which is locked by a transaction? So I would like to do >> this with a store procedure, that runs the select and the update and >> after that >> it returns the PK of the selected dataset. If two (or more) processes >> run the SP at the same time, but the update can create an error, so the >> stored >> procedure is stopped and must called again. >> I need a solution, that a row, which is taken by one process not shown >> by all other processes > > Do you want to implement something like a queue? Yes > > I can think of two techniques: > > 1) Locking > ---------- > In a transaction, you get a few rows for processing by > SELECT * FROM table WHERE status = waiting > ORDER BY id LIMIT 5 FOR UPDATE; > ("id" is the primary key here). > Then you process and update the rows and commit. > This will cause concurrent SELECT FOR UPDATE operations > to block until the transaction is committed, effectively > serializing the processing. > > 2) Set a marker > --------------- > You get a few rows by > UPDATE table SET status = processing WHERE id IN > (SELECT id FROM table WHERE status = waiting > ORDER BY id LIMIT 5) RETURNING *; > Then process and update the rows. > > This won't block concurrent processes for the whole > time it takes to process the rows, so it's probably > closer to what you want. The marker solution seems to be the correct idea, I must think about it. Would you create the call within a stored procedure (and call the SP from the client) or would you use the statement from a client direct? Thx Phil
Philipp Kraus wrote: >> Do you want to implement something like a queue? > > Yes >> You get a few rows by >> UPDATE table SET status = processing WHERE id IN >> (SELECT id FROM table WHERE status = waiting >> ORDER BY id LIMIT 5) RETURNING *; >> Then process and update the rows. >> >> This won't block concurrent processes for the whole >> time it takes to process the rows, so it's probably >> closer to what you want. > > The marker solution seems to be the correct idea, I must think about it. > Would you create the call within a stored procedure (and call the SP > from the client) or would you use the > statement from a client direct? I don't think it matters - use what is most constistent with your taste and architecture. Yours, Laurenz Albe