Thread: dataset lock

dataset lock

From

Philipp Kraus

Date:

16 April 2013, 14:50:35

Hello,

I use a PG database on a HPC system (cluster). My processes get a dataset from the database and change the row, each
processis independend. 
My table shows something like: id, status, data

id = PK a unqiue number
status a enum value which "open", "waiting", "working", "done"

So each process calls a SQL statement select * from where status = "waiting", so the process should get the next
waitingtask, after the process 
gets the task, the status should be changed to "working", so no other process shouldn't get the task. My processes are
independed,so it can 
be, that 2 (or more) processes call the select statement at the same time and get in this case equal tasks, so I need
somelocking. How can 
I do this with Postgres, that each row / task in my table is read / write by one process. On threads I would create a
mutexeg: 

lock()
row = select * from table where status = waiting
update status = working from table where id = row.id
unlock()

do something with row

Which is the best solution with postgres? should I create a procedure which takes the next job, change it and returns
theid, so each process 
calls "select getNextJob()" ?

Thanks

Phil

Re: dataset lock

From

Steve Atkins

Date:

16 April 2013, 17:11:31

On Apr 16, 2013, at 7:50 AM, Philipp Kraus <philipp.kraus@flashpixx.de> wrote:

> Hello,
>
> I use a PG database on a HPC system (cluster). My processes get a dataset from the database and change the row, each
processis independend. 
> My table shows something like: id, status, data
>
> id = PK a unqiue number
> status a enum value which "open", "waiting", "working", "done"
>
> So each process calls a SQL statement select * from where status = "waiting", so the process should get the next
waitingtask, after the process 
> gets the task, the status should be changed to "working", so no other process shouldn't get the task. My processes
areindepended, so it can 
> be, that 2 (or more) processes call the select statement at the same time and get in this case equal tasks, so I need
somelocking. How can 
> I do this with Postgres, that each row / task in my table is read / write by one process. On threads I would create a
mutexeg: 
>
> lock()
> row = select * from table where status = waiting
> update status = working from table where id = row.id
> unlock()
>
> do something with row
>
> Which is the best solution with postgres? should I create a procedure which takes the next job, change it and returns
theid, so each process 
> calls "select getNextJob()" ?

"select for update" might be the answer to what you're asking for - it'll lock the rows matched until the end of the
transaction,blocking any other select for update on the same rows. 

If performance is important then you might want to look at some of the off the shelf queuing systems instead - PgQ or
queue_classic,for instance. 

Cheers,
  Steve

Re: dataset lock

From

Philipp Kraus

Date:

16 April 2013, 18:16:16

On 2013-04-16 19:11:20 +0200, Steve Atkins said:

>
> On Apr 16, 2013, at 7:50 AM, Philipp Kraus <philipp.kraus@flashpixx.de> wrote:
>
>> Hello,
>>
>> I use a PG database on a HPC system (cluster). My processes get a
>> dataset from the database and change the row, each process is
>> independend.
>> My table shows something like: id, status, data
>>
>> id = PK a unqiue number
>> status a enum value which "open", "waiting", "working", "done"
>>
>> So each process calls a SQL statement select * from where status =
>> "waiting", so the process should get the next waiting task, after the
>> process
>> gets the task, the status should be changed to "working", so no other
>> process shouldn't get the task. My processes are independed, so it can
>> be, that 2 (or more) processes call the select statement at the same
>> time and get in this case equal tasks, so I need some locking. How can
>> I do this with Postgres, that each row / task in my table is read /
>> write by one process. On threads I would create a mutex eg:
>>
>> lock()
>> row = select * from table where status = waiting
>> update status = working from table where id = row.id
>> unlock()
>>
>> do something with row
>>
>> Which is the best solution with postgres? should I create a procedure
>> which takes the next job, change it and returns the id, so each process
>> calls "select getNextJob()" ?
>
> "select for update" might be the answer to what you're asking for -
> it'll lock the rows matched until the end of the transaction, blocking
> any other select for update on the same rows.

Okay my explaination are a little bit bad, so I try it in another way:

My PG database is connected to differend cluster nodes (MPI). Each
programm / process on each node are independed and run the SQL
select * from table where status = waiting
after that I update the row with the update statement (set status = working)

so in this case one process can run the select, than comes another
process and runs also the select, but both processes get an equal row.
But this does not allowed. The second process need not see the row,
which is taken by the first process. So can I suppress, that a select
call
sees a row, which is locked by a transaction? So I would like to do
this with a store procedure, that runs the select and the update and
after that
it returns the PK of the selected dataset. If two (or more) processes
run the SP at the same time, but the update can create an error, so the
stored
procedure is stopped and must called again.
I need a solution, that a row, which is taken by one process not shown
by all other processes

Thx

Phil

Re: dataset lock

From

Albe Laurenz

Date:

17 April 2013, 07:18:25

Philipp Kraus wrote:
> My PG database is connected to differend cluster nodes (MPI). Each
> programm / process on each node are independed and run the SQL
> select * from table where status = waiting
> after that I update the row with the update statement (set status = working)
>
> so in this case one process can run the select, than comes another
> process and runs also the select, but both processes get an equal row.
> But this does not allowed. The second process need not see the row,
> which is taken by the first process. So can I suppress, that a select
> call
> sees a row, which is locked by a transaction? So I would like to do
> this with a store procedure, that runs the select and the update and
> after that
> it returns the PK of the selected dataset. If two (or more) processes
> run the SP at the same time, but the update can create an error, so the
> stored
> procedure is stopped and must called again.
> I need a solution, that a row, which is taken by one process not shown
> by all other processes

Do you want to implement something like a queue?

I can think of two techniques:

1) Locking
----------
In a transaction, you get a few rows for processing by
SELECT * FROM table WHERE status = waiting
   ORDER BY id LIMIT 5 FOR UPDATE;
("id" is the primary key here).
Then you process and update the rows and commit.

This will cause concurrent SELECT FOR UPDATE operations
to block until the transaction is committed, effectively
serializing the processing.

2) Set a marker
---------------
You get a few rows by
UPDATE table SET status = processing WHERE id IN
   (SELECT id FROM table WHERE status = waiting
       ORDER BY id LIMIT 5) RETURNING *;
Then process and update the rows.

This won't block concurrent processes for the whole
time it takes to process the rows, so it's probably
closer to what you want.

Yours,
Laurenz Albe

Re: dataset lock

From

Philipp Kraus

Date:

17 April 2013, 14:44:39

On 2013-04-17 09:18:13 +0200, Albe Laurenz said:

> Philipp Kraus wrote:
>> My PG database is connected to differend cluster nodes (MPI). Each
>> programm / process on each node are independed and run the SQL
>> select * from table where status = waiting
>> after that I update the row with the update statement (set status = working)
>>
>> so in this case one process can run the select, than comes another
>> process and runs also the select, but both processes get an equal row.
>> But this does not allowed. The second process need not see the row,
>> which is taken by the first process. So can I suppress, that a select
>> call
>> sees a row, which is locked by a transaction? So I would like to do
>> this with a store procedure, that runs the select and the update and
>> after that
>> it returns the PK of the selected dataset. If two (or more) processes
>> run the SP at the same time, but the update can create an error, so the
>> stored
>> procedure is stopped and must called again.
>> I need a solution, that a row, which is taken by one process not shown
>> by all other processes
>
> Do you want to implement something like a queue?

Yes

>
> I can think of two techniques:
>
> 1) Locking
> ----------
> In a transaction, you get a few rows for processing by
> SELECT * FROM table WHERE status = waiting
>    ORDER BY id LIMIT 5 FOR UPDATE;
> ("id" is the primary key here).
> Then you process and update the rows and commit.
>   This will cause concurrent SELECT FOR UPDATE operations
> to block until the transaction is committed, effectively
> serializing the processing.
>
> 2) Set a marker
> ---------------
> You get a few rows by
> UPDATE table SET status = processing WHERE id IN
>    (SELECT id FROM table WHERE status = waiting
>        ORDER BY id LIMIT 5) RETURNING *;
> Then process and update the rows.
>
> This won't block concurrent processes for the whole
> time it takes to process the rows, so it's probably
> closer to what you want.

The marker solution seems to be the correct idea, I must think about it.
Would you create the call within a stored procedure (and call the SP
from the client) or would you use the
statement from a client direct?

Thx

Phil

Re: dataset lock

From

Albe Laurenz

Date:

19 April 2013, 12:41:46

Philipp Kraus wrote:
>> Do you want to implement something like a queue?
>
> Yes

>> You get a few rows by
>> UPDATE table SET status = processing WHERE id IN
>>    (SELECT id FROM table WHERE status = waiting
>>        ORDER BY id LIMIT 5) RETURNING *;
>> Then process and update the rows.
>>
>> This won't block concurrent processes for the whole
>> time it takes to process the rows, so it's probably
>> closer to what you want.
>
> The marker solution seems to be the correct idea, I must think about it.
> Would you create the call within a stored procedure (and call the SP
> from the client) or would you use the
> statement from a client direct?

I don't think it matters - use what is most constistent with
your taste and architecture.

Yours,
Laurenz Albe