Thread: Shared row locking
Hi,

I've been thinking about how to do shared row locking.  These are some very
preliminary ideas on the issue.  Please comment; particularly if any part of
it sounds unworkable or too incomplete.

There are several problems to be solved here: the grammar, the internal
SelectStmt representation, how to store and share the info between backends,
how to clean up at transaction end, and how to clean up after a backend
crash.

The Grammar
===========

The SQL spec does not say anything on this point (that I can find).  It only
talks of "FOR UPDATE" and "FOR READ ONLY".  However, because the FK code uses
SPI to do the locking, we definitely have to expose the functionality through
SQL.  So I think we need a new clause, which I propose to be "FOR SHARE".

The Parser and SelectStmt
=========================

The parser uses the for_update_clause and opt_for_update_clause
nonterminals.  I assume it's best to change them to a (new) locking_clause,
which can in turn be a for_update_clause or a (new) for_share_clause.

SelectStmt currently has a forUpdate field (a List of the to-be-locked
tables, or an empty list meaning all of them).  We could simply add another
list, say forShare, or use a common list plus a flag saying which one it is.
I prefer adding a new list.  (Same for the Query node.)

How to Store the Info
=====================

This is the really interesting part.  I have two ideas, one mine (a btree)
and the other Bruce's (a bitmap).

Using a B-tree
--------------

When a backend wants to lock a tuple, it sets a bit in the tuple's infomask.
Then it inserts into a btree in a special tablespace, using
RelationId-BlockNumber-OffsetNumber as the key and BackendId-TransactionId as
the value; actually, an array with a single element containing those two
values.

When a backend wants to lock a tuple that is already locked, it goes to the
btree and inserts itself into the array.  To do this, it inserts a new index
item (the enlarged array) and deletes the previous one.  No other backend can
try to insert simultaneously (thus causing an ugly race condition), because
we hold an exclusive lock on the tuple's heap page's buffer.

At transaction end, nothing special happens (tuples are not unlocked
explicitly).

When someone wants to know whether the tuple is locked (to mark it FOR
UPDATE, or to delete it), it checks the infomask.  If it says the tuple is
locked, it goes to check the btree.  If the array contains only
BackendId-TransactionId pairs that are no longer running, then the tuple is
not locked and can be deleted/marked (and the btree can be cleaned up).
Else, it will have to wait, using XactLockTableWait, for the first
transaction in the array that is still running.  We can be sure that no one
will try to share-lock the tuple while we check the btree, because we hold
an exclusive lock on the tuple's heap page's buffer.

Note that to check whether a transaction is running we need to lock
SInvalLock.  To minimize the time we hold it, we save the BackendId so we
don't have to scan the whole shmInvalBuffer->procState array, only the item
we need to look at.  Another possibility would be to use stock
TransactionIdIsInProgress and save the extra 4 bytes of storage.

At server restart, the btree is created empty (or just deleted).  There is
one btree per database.
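Very roughly, one entry might look like this (the struct and field names are
made up, purely to pin down the key/value layout described above):

#include "postgres.h"
#include "storage/backendid.h"
#include "storage/block.h"
#include "storage/off.h"

/* Key: identifies the locked tuple. */
typedef struct ShareLockTag
{
    Oid             relid;      /* RelationId of the table */
    BlockNumber     blkno;      /* heap page of the tuple */
    OffsetNumber    offnum;     /* line pointer within the page */
} ShareLockTag;

/* One element of the value array: a current share-locker. */
typedef struct ShareLocker
{
    BackendId       backendId;  /* lets us probe procState directly */
    TransactionId   xid;        /* or use TransactionIdIsInProgress and
                                 * drop backendId to save 4 bytes */
} ShareLocker;

/* Value: variable-length array of lockers; rewritten as a whole each time
 * a new locker is added (insert the enlarged copy, delete the old item). */
typedef struct ShareLockValue
{
    int             nlockers;
    ShareLocker     lockers[1]; /* actually nlockers entries */
} ShareLockValue;

With ShareLocker at 8 bytes, the limit of roughly 300 simultaneous lockers
mentioned below follows from the btree restriction of about a third of an
8K page per index item.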
Using a Bitmap
--------------

First we create a counter, call it the shared-lock row counter.  Then we
create a file like pg_clog, in which each counter slot has one bit per
backend.

When we want to share-lock a row, we increment the counter, put that counter
value on the row, and set our backend's bit in the new file.  We also store
that counter value in our backend-local memory.  On commit we go through
that local list and reset all our bits for those counters.  When a counter
slot is all zeros, it can be recycled the way we do with pg_clog.

Problems and random comments
============================

There is a possibility of starvation if somebody wants to lock a tuple
exclusively and shared lockers keep arriving all the time.  Not sure how to
solve this.

The wakeup mechanism is not discussed ... is there a special need for
something beyond what we can do with XactLockTable{Insert,Wait}?

Thanks to tablespaces, it's very easy to create special Relations that can
be dealt with by the standard buffer and storage managers, etc.

The btree idea:

- does not need crash recovery.  Maybe we could use a stripped-down version
  of nbtree.  This could cause a maintainability nightmare.

- can't hold more than 300 or so simultaneous lockers (because of the value
  length, limited to 1/3 of a page).  I doubt this is a real problem.

- could have problems (excessive storage requirements) in the long run
  because of empty or almost-empty pages.

The bitmap idea:

- seems to have lower overhead.

- can use the same lazy cleanup mechanism proposed for the btree idea (in
  which case we don't need the list in local memory).

- What can happen in the presence of large max_connections settings?  Is
  this a real problem?

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Oh, oh, the Galacian girls will do it for pearls,
and those of Arrakis for water!  But if you seek ladies
who burn like flames, try a daughter of Caladan!"  (Gurney Halleck)
Alvaro Herrera wrote:

> The btree idea:
> - does not need crash recovery.  Maybe we could use a stripped-down
>   version of nbtree.  This could cause a maintainability nightmare.

Are you saying the btree is an index with no heap?  If so, what about the
XIDs?  Are they just in the btree?  How does the btree get cleaned up over
time?

> The bitmap idea:
> - seems to have lower overhead.
>
> - can use the same lazy cleanup mechanism proposed for the btree idea (in
>   which case we don't need the list in local memory).

You mean all empty/zero rows can be removed?  Can we guarantee that on
commit we can clean up the bitmap?  If not, the idea doesn't work.

> - What can happen in the presence of large max_connections settings?  Is
>   this a real problem?

I thought about that.  50 backends is 7 bytes; 1000 backends is 128 bytes.
For a large number of backends you could just allow X concurrent locks and
use X*4 bytes of space.

I think the basic issue is that the btree can be of variable length, while
the bitmap has to be of a fixed length.

--
  Bruce Momjian                    |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us           |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.    |  Newtown Square, Pennsylvania 19073
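To make the arithmetic above concrete, a fixed-width per-counter bitmap
could be manipulated roughly as below.  This is purely an illustrative
sketch; no such file or functions exist, and the names are invented.
Note that (1000 + 7) / 8 is 125 bytes, so the 128 above presumably allows
for some padding.

#include <stdbool.h>

#define MAX_BACKENDS    1000                      /* hypothetical max_connections */
#define SLOT_BYTES      ((MAX_BACKENDS + 7) / 8)  /* 125 bytes; 50 backends -> 7 */

typedef unsigned char ShareLockSlot[SLOT_BYTES];

/* Set our bit when acquiring the shared lock. */
static void
slot_set(ShareLockSlot slot, int backendId)
{
    slot[backendId / 8] |= (unsigned char) (1 << (backendId % 8));
}

/* Clear our bit at commit, for each counter remembered in local memory. */
static void
slot_clear(ShareLockSlot slot, int backendId)
{
    slot[backendId / 8] &= (unsigned char) ~(1 << (backendId % 8));
}

/* An all-zero slot holds no lockers and can be truncated away, pg_clog style. */
static bool
slot_is_empty(const ShareLockSlot slot)
{
    int     i;

    for (i = 0; i < SLOT_BYTES; i++)
        if (slot[i] != 0)
            return false;
    return true;
}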
> The SQL spec does not say anything on this point (that I can find).  It
> only talks of "FOR UPDATE" and "FOR READ ONLY".  However, because the FK
> code uses SPI to do the locking, we definitely have to expose the
> functionality through SQL.  So I think we need a new clause, which I
> propose to be "FOR SHARE".

MySQL uses LOCK IN SHARE MODE:
http://dev.mysql.com/doc/mysql/en/InnoDB_locking_reads.html
Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> Using a B-tree

> At transaction end, nothing special happens (tuples are not unlocked
> explicitly).

I don't think that works, because there is no guarantee that an entry will
get cleaned out before the XID counter wraps around.  Worst case, you might
think that a tuple is locked when the XID is left over from the previous
cycle.  (Possibly this could be avoided by cleaning out old XIDs in this
table whenever we truncate pg_clog, but that seems a tad messy.)  I'm also a
bit concerned about how we avoid table bloat if there's no proactive cleanup
at transaction end.

I think I like the pg_clog-modeled structure a bit better.  However it could
be objected that that puts a hard limit of 4G share-locked tuples at any one
time.

In the clog-modeled idea, it wasn't real clear how you decide whether to
assign a new counter value to a previously locked row, or reuse its previous
counter.  You must *not* assign a new value when the existing entry still
has bits set, but you probably do want to be aggressive about assigning new
values when you can; else it gets tough to be sure that the log can be
truncated in a reasonable time.

ISTM that your description is conflating several orthogonal issues: how do
we identify entries in this data structure (by CTID, or by a shared counter
that increments each time a new lock is acquired); how do we index the data
structure (btree or linear array); and what is stored in each entry (array
of XIDs, or bitmap indexed by BackendId).  Not all of the eight combinations
work, but we do have more alternatives than the two offered, even without
coming up with any new ideas ;-)

> Note that to check whether a transaction is running we need to lock
> SInvalLock.  To minimize the time we hold it, we save the BackendId so
> we don't have to scan the whole shmInvalBuffer->procState array, only
> the item that we need to look at.  Another possibility would be to use
> stock TransactionIdIsInProgress and save the extra 4 bytes of storage.

I'm a bit worried about deadlocks and race conditions associated with the
conflict between locking a page of this data structure and locking
SInvalLock.

> At server restart, the btree is created empty (or just deleted).  There
> is one btree per database.

One per cluster you meant, right?  (Else we can't do locking of rows in
shared tables.)

			regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> You mean all empty/zero rows can be removed?  Can we guarantee that on
> commit we can clean up the bitmap?  If not the idea doesn't work.

For whatever data structure we use, we may reset the structure to empty
during backend-crash recovery.  So your objection boils down to "what if a
backend exits normally but forgets to clean up its locks?"  Assuming that
doesn't happen isn't any worse than assuming a backend will clean up its
shared memory state on non-crash exit, so I don't think it's a serious
concern.

That brings another thought: really what this is all about is working around
the fact that the standard lock manager can only cope with a finite number
of coexisting locks, because it's working in a fixed-size shared memory
arena.  Maybe we should instead think about ways to allow the existing lock
table to spill to disk when it gets too big.  That would eliminate
max_locks_per_transaction as a source of hard failures, which would be a
nice benefit.

			regards, tom lane
Tom Lane wrote:
> Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> > Using a B-tree
>
> > At transaction end, nothing special happens (tuples are not unlocked
> > explicitly).
>
> I don't think that works, because there is no guarantee that an entry
> will get cleaned out before the XID counter wraps around.  Worst case,
> you might think that a tuple is locked when the XID is left over from
> the previous cycle.  (Possibly this could be avoided by cleaning out old
> XIDs in this table whenever we truncate pg_clog, but that seems a tad
> messy.)  I'm also a bit concerned about how we avoid table bloat if
> there's no proactive cleanup at transaction end.
>
> I think I like the pg_clog-modeled structure a bit better.  However it
> could be objected that that puts a hard limit of 4G share-locked tuples
> at any one time.
>
> In the clog-modeled idea, it wasn't real clear how you decide whether to
> assign a new counter value to a previously locked row, or reuse its
> previous counter.  You must *not* assign a new value when the existing
> entry still has bits set, but you probably do want to be aggressive
> about assigning new values when you can; else it gets tough to be sure
> that the log can be truncated in a reasonable time.

I assume you check, and if all the bits are zero you don't reuse it and get
a new counter.  In fact you shouldn't reuse it in case the log is being
truncated while you are looking.  :-)

> ISTM that your description is conflating several orthogonal issues:
> how do we identify entries in this data structure (by CTID, or by a
> shared counter that increments each time a new lock is acquired); how do
> we index the data structure (btree or linear array); and what is stored
> in each entry (array of XIDs, or bitmap indexed by BackendId).  Not all
> of the eight combinations work, but we do have more alternatives than
> the two offered, even without coming up with any new ideas ;-)

True.  The only advantage of a bitmap vs. just a count of locking backends
is that you can clean out your own backend's bits from the table without
having to record them in your own memory.  However, because recording your
own counters in local memory doesn't require fixed shared memory, we might
be better off just recording the shared lock indexes in local backend memory
and using an int4 counter in the pg_clog-like file that we can decrement on
backend commit.  However, I am unclear whether we can guarantee an exiting
backend will do that.  Certainly it is cleared on server start.

> > Note that to check whether a transaction is running we need to lock
> > SInvalLock.  To minimize the time we hold it, we save the BackendId so
> > we don't have to scan the whole shmInvalBuffer->procState array, only
> > the item that we need to look at.  Another possibility would be to use
> > stock TransactionIdIsInProgress and save the extra 4 bytes of storage.
>
> I'm a bit worried about deadlocks and race conditions associated with
> the conflict between locking a page of this data structure and locking
> SInvalLock.
>
> > At server restart, the btree is created empty (or just deleted).  There
> > is one btree per database.
>
> One per cluster you meant, right?  (Else we can't do locking of rows in
> shared tables.)

He meant one per database, I think.  I suppose we would need another one for
global tables, or disallow shared locking of them.

--
  Bruce Momjian                    |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us           |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.    |  Newtown Square, Pennsylvania 19073
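The reuse rule being discussed (keep a row's existing counter only while its
slot still has bits set, otherwise take a fresh value) might be pinned down
like this; every name here is hypothetical:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers, stand-ins for whatever the real mechanism would be. */
extern bool     slot_has_bits_set(uint32_t counter);   /* any locker bit still set? */
extern uint32_t assign_new_share_counter(void);        /* bump the global counter */

/*
 * Decide which counter value to stamp on a row being share-locked: reuse
 * its existing counter only while that slot still has lockers; once it has
 * gone to all-zeroes it may already have been truncated, so assign a
 * brand-new value (which also keeps the log truncatable).
 */
static uint32_t
counter_for_row(bool row_has_counter, uint32_t existing_counter)
{
    if (row_has_counter && slot_has_bits_set(existing_counter))
        return existing_counter;
    return assign_new_share_counter();
}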
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > You mean all empty/zero rows can be removed?  Can we guarantee that on
> > commit we can clean up the bitmap?  If not the idea doesn't work.
>
> For whatever data structure we use, we may reset the structure to empty
> during backend-crash recovery.  So your objection boils down to "what if
> a backend exits normally but forgets to clean up its locks?"  Assuming
> that doesn't happen isn't any worse than assuming a backend will clean
> up its shared memory state on non-crash exit, so I don't think it's a
> serious concern.
>
> That brings another thought: really what this is all about is working
> around the fact that the standard lock manager can only cope with a
> finite number of coexisting locks, because it's working in a fixed-size
> shared memory arena.  Maybe we should instead think about ways to allow
> the existing lock table to spill to disk when it gets too big.  That
> would eliminate max_locks_per_transaction as a source of hard failures,
> which would be a nice benefit.

Agreed.  One concern I have about allowing the lock table to spill to disk
is that a large number of FOR UPDATE locks could push out lock entries used
by other backends, causing very poor performance.

--
  Bruce Momjian                    |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us           |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.    |  Newtown Square, Pennsylvania 19073
On Sun, 2004-12-19 at 04:04, Bruce Momjian wrote:
> Tom Lane wrote:
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > You mean all empty/zero rows can be removed?  Can we guarantee that on
> > > commit we can clean up the bitmap?  If not the idea doesn't work.
> >
> > For whatever data structure we use, we may reset the structure to empty
> > during backend-crash recovery.  So your objection boils down to "what if
> > a backend exits normally but forgets to clean up its locks?"  Assuming
> > that doesn't happen isn't any worse than assuming a backend will clean
> > up its shared memory state on non-crash exit, so I don't think it's a
> > serious concern.
> >
> > That brings another thought: really what this is all about is working
> > around the fact that the standard lock manager can only cope with a
> > finite number of coexisting locks, because it's working in a fixed-size
> > shared memory arena.  Maybe we should instead think about ways to allow
> > the existing lock table to spill to disk when it gets too big.  That
> > would eliminate max_locks_per_transaction as a source of hard failures,
> > which would be a nice benefit.
>
> Agreed.  One concern I have about allowing the lock table to spill to
> disk is that a large number of FOR UPDATE locks could push out lock
> entries used by other backends, causing very poor performance.

In similar circumstances, DB2 uses these techniques:

- when the lock table is X% full, escalate locks to full table locks; both
  the lock table memory and the threshold percentage are instance parameters

- use a lock mode called Cursor Stability that locks only those rows
  currently being examined by a cursor, thus maintaining the lock usage of a
  cursor at a constant level as the cursor moves.  The lock mode of
  Repeatable Read *does* lock all rows read

(These are not actually mutually exclusive.)

The first one is a real pain, but the idea might be of use somewhere.  The
second is a usable, practical alternative that should be considered, and
might avoid the need to write the spill-to-disk code only to discover it
performs very badly.

--
Best Regards, Simon Riggs
On Sun, Dec 19, 2004 at 09:52:01AM +0000, Simon Riggs wrote:

Simon,

> In similar circumstances, DB2 uses these techniques:
>
> - when the lock table is X% full, escalate locks to full table locks; both
>   the lock table memory and the threshold percentage are instance
>   parameters

This is not useful at all, because the objective of this exercise is to
downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to
shared row locking.  Keep in mind that this is for foreign key locking,
which is one area where deadlocks are frequently encountered because we use
too strong a lock.

> - use a lock mode called Cursor Stability that locks only those rows
>   currently being examined by a cursor, thus maintaining the lock usage of
>   a cursor at a constant level as the cursor moves.  The lock mode of
>   Repeatable Read *does* lock all rows read

We can't release the locks until the end of the transaction.

> The second is a usable, practical alternative that should be considered,
> and might avoid the need to write the spill-to-disk code only to discover
> it performs very badly.

I don't think either of them serves the objective I'm trying to accomplish
:-(  Thanks for your ideas anyway.  And keep having them!

--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"One can defend oneself against attacks; against praise one is defenseless."
Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> On Sun, Dec 19, 2004 at 09:52:01AM +0000, Simon Riggs wrote:
>> - use a lock mode called Cursor Stability that locks only those rows
>>   currently being examined by a cursor, thus maintaining the lock usage
>>   of a cursor at a constant level as the cursor moves.  The lock mode of
>>   Repeatable Read *does* lock all rows read

> We can't release the locks until the end of the transaction.

Well, what Simon is essentially proposing is that we work around a potential
performance issue by exposing a less-clean semantic model to users.  I'd
prefer to adopt such a thing as a last resort, not a first one.

			regards, tom lane
On Sun, 19 Dec 2004, Alvaro Herrera wrote:
> On Sun, Dec 19, 2004 at 09:52:01AM +0000, Simon Riggs wrote:
>
> Simon,
>
>> In similar circumstances, DB2 uses these techniques:
>>
>> - when the lock table is X% full, escalate locks to full table locks;
>>   both the lock table memory and the threshold percentage are instance
>>   parameters
>
> This is not useful at all, because the objective of this exercise is to
> downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to
> shared row locking.  Keep in mind that this is for foreign key locking,
> which is one area where deadlocks are frequently encountered because we
> use too strong a lock.

Actually it might help in some scenarios.  Remember, we're not talking about
upgrading shared locks to exclusive locks.  We're only talking about locking
more rows than necessary (all rows).

I believe DB2 does the escalation also for performance.  Getting a full
table lock is cheaper than individually locking every row.

- Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On Sun, 19 Dec 2004, Alvaro Herrera wrote:
>> This is not useful at all, because the objective of this exercise is to
>> downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to
>> shared row locking.

> Actually it might help in some scenarios.  Remember, we're not talking
> about upgrading shared locks to exclusive locks.  We're only talking about
> locking more rows than necessary (all rows).

Nonetheless, it would mean that locks would be taken depending on
implementation-dependent, not-visible-to-the-user considerations.  Shared
locks can still cause deadlocks, and so you would have an unreliable
application, which would only be unreliable under load.

As I said in connection with the other proposal, weird user-visible
semantics should be the last resort not the first.

			regards, tom lane
On Sun, 19 Dec 2004, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> On Sun, 19 Dec 2004, Alvaro Herrera wrote:
>>> This is not useful at all, because the objective of this exercise is to
>>> downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to
>>> shared row locking.
>
>> Actually it might help in some scenarios.  Remember, we're not talking
>> about upgrading shared locks to exclusive locks.  We're only talking
>> about locking more rows than necessary (all rows).
>
> Nonetheless, it would mean that locks would be taken depending on
> implementation-dependent, not-visible-to-the-user considerations.
> Shared locks can still cause deadlocks, and so you would have an
> unreliable application, which would only be unreliable under load.
>
> As I said in connection with the other proposal, weird user-visible
> semantics should be the last resort not the first.

I agree that lock escalation is not a good solution; we run into problems
with DB2 lock escalation at work all the time.

- Heikki
On Sun, 2004-12-19 at 18:31, Alvaro Herrera wrote:
> Thanks for your ideas anyway.  And keep having them!

No problem.  Just giving some info on what works and doesn't work in other
implementations.

--
Best Regards, Simon Riggs
Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
> > On Sun, 19 Dec 2004, Alvaro Herrera wrote:
> >> This is not useful at all, because the objective of this exercise is to
> >> downgrade locks, from exclusive row locking (SELECT ... FOR UPDATE) to
> >> shared row locking.
>
> > Actually it might help in some scenarios.  Remember, we're not talking
> > about upgrading shared locks to exclusive locks.  We're only talking
> > about locking more rows than necessary (all rows).
>
> Nonetheless, it would mean that locks would be taken depending on
> implementation-dependent, not-visible-to-the-user considerations.
> Shared locks can still cause deadlocks, and so you would have an
> unreliable application, which would only be unreliable under load.
>
> As I said in connection with the other proposal, weird user-visible
> semantics should be the last resort not the first.

One idea is to stamp the same shared lock counter on multiple rows, and just
prevent another application from also sharing that lock if too many rows are
locked; it would wait for the lock to be released.  This prevents the
problem of having to store thousands of shared lock counters.  There would
have to be some flag to say which shared counters are only for a single
backend.

--
  Bruce Momjian                    |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us           |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.    |  Newtown Square, Pennsylvania 19073
On Sun, Dec 19, 2004 at 11:35:02PM +0200, Heikki Linnakangas wrote:
> On Sun, 19 Dec 2004, Tom Lane wrote:
>
> > Heikki Linnakangas <hlinnaka@iki.fi> writes:
> > > On Sun, 19 Dec 2004, Alvaro Herrera wrote:
> > > > This is not useful at all, because the objective of this exercise is
> > > > to downgrade locks, from exclusive row locking (SELECT ... FOR
> > > > UPDATE) to shared row locking.
> >
> > > Actually it might help in some scenarios.  Remember, we're not talking
> > > about upgrading shared locks to exclusive locks.  We're only talking
> > > about locking more rows than necessary (all rows).
> >
> > Nonetheless, it would mean that locks would be taken depending on
> > implementation-dependent, not-visible-to-the-user considerations.
> > Shared locks can still cause deadlocks, and so you would have an
> > unreliable application, which would only be unreliable under load.
> >
> > As I said in connection with the other proposal, weird user-visible
> > semantics should be the last resort not the first.
>
> I agree that lock escalation is not a good solution; we run into problems
> with DB2 lock escalation at work all the time.

Does anyone know how Oracle deals with this?  They use MVCC like PostgreSQL,
so they'd be a better source for inspiration.

--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy!  www.distributed.net  Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
On Sat, 18 Dec 2004, Bruce Momjian wrote:
> Tom Lane wrote:
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > You mean all empty/zero rows can be removed?  Can we guarantee that on
> > > commit we can clean up the bitmap?  If not the idea doesn't work.
> >
> > For whatever data structure we use, we may reset the structure to empty
> > during backend-crash recovery.  So your objection boils down to "what if
> > a backend exits normally but forgets to clean up its locks?"  Assuming
> > that doesn't happen isn't any worse than assuming a backend will clean
> > up its shared memory state on non-crash exit, so I don't think it's a
> > serious concern.
> >
> > That brings another thought: really what this is all about is working
> > around the fact that the standard lock manager can only cope with a
> > finite number of coexisting locks, because it's working in a fixed-size
> > shared memory arena.  Maybe we should instead think about ways to allow
> > the existing lock table to spill to disk when it gets too big.  That
> > would eliminate max_locks_per_transaction as a source of hard failures,
> > which would be a nice benefit.
>
> Agreed.  One concern I have about allowing the lock table to spill to
> disk is that a large number of FOR UPDATE locks could push out lock
> entries used by other backends, causing very poor performance.

I think if we allow the lock manager to spill to disk (and I think we do
need to allow it) then we should also be able to control the amount of
shared memory allocated.  There's little point spilling the lock table to
disk if we have huge amounts of memory.

Gavin
On Mon, 2004-12-20 at 06:34, Jim C. Nasby wrote:
> On Sun, Dec 19, 2004 at 11:35:02PM +0200, Heikki Linnakangas wrote:
> > On Sun, 19 Dec 2004, Tom Lane wrote:
> >
> > > Heikki Linnakangas <hlinnaka@iki.fi> writes:
> > > > On Sun, 19 Dec 2004, Alvaro Herrera wrote:
> > > > > This is not useful at all, because the objective of this exercise
> > > > > is to downgrade locks, from exclusive row locking (SELECT ... FOR
> > > > > UPDATE) to shared row locking.
> > >
> > > > Actually it might help in some scenarios.  Remember, we're not
> > > > talking about upgrading shared locks to exclusive locks.  We're only
> > > > talking about locking more rows than necessary (all rows).
> > >
> > > Nonetheless, it would mean that locks would be taken depending on
> > > implementation-dependent, not-visible-to-the-user considerations.
> > > Shared locks can still cause deadlocks, and so you would have an
> > > unreliable application, which would only be unreliable under load.
> > >
> > > As I said in connection with the other proposal, weird user-visible
> > > semantics should be the last resort not the first.
> >
> > I agree that lock escalation is not a good solution; we run into
> > problems with DB2 lock escalation at work all the time.
>
> Does anyone know how Oracle deals with this?  They use MVCC like
> PostgreSQL, so they'd be a better source for inspiration.

Oracle only uses MVCC in its widest sense - versioning info is stored in
UNDO tablespaces (rollback segments).  That implementation is covered by
aggressive patent attorneys.

Oracle implements locking at row level within each data block.  The block
header expands dynamically to accommodate a list of transactions that can
access it, with minimum and maximum sizes settable by the DBA.  This works
reasonably well.  Each SELECT FOR UPDATE is actually a block write, whether
or not the rows are modified, which requires some additional code to recover
from without crashing/redo.  Later transactions end up cleaning up the lock
header info (which later became a problem in Parallel Server).

https://cwisdb.cc.kuleuven.ac.be/ora10doc/server.101/b10743/consist.htm

--
Best Regards, Simon Riggs
On Mon, Dec 20, 2004 at 06:23:24PM +1100, Gavin Sherry wrote:
> On Sat, 18 Dec 2004, Bruce Momjian wrote:
> > Agreed.  One concern I have about allowing the lock table to spill to
> > disk is that a large number of FOR UPDATE locks could push out lock
> > entries used by other backends, causing very poor performance.
>
> I think if we allow the lock manager to spill to disk (and I think we do
> need to allow it) then we should also be able to control the amount of
> shared memory allocated.  There's little point spilling the lock table to
> disk if we have huge amounts of memory.

This is an interesting idea.  Gavin also mentioned to me we should also
control the amount of memory the shared inval queue uses.  Causing all
backends to refill their caches is (I assume) a big performance hit.

Maybe we should expose this via new knobs in postgresql.conf, to ease
implementation, or maybe not, to spare users from having to configure it.
As a start, we could emit WARNINGs when the lock table is spilled and when
an SInvalReset occurs, so the user can know whether he should increase the
memory allotted to those structures.

--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"Thou shalt not follow the NULL pointer, for chaos and madness await thee at
its end."  (2nd Commandment for C programmers)
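Such a warning could use the existing ereport machinery; a sketch only,
since neither the spill code nor this message exists, and the message text
is invented:

/* Hypothetical: emitted from wherever the lock table first spills to disk. */
ereport(WARNING,
        (errmsg("lock table has spilled to disk"),
         errhint("Consider increasing max_locks_per_transaction.")));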
Gavin Sherry <swm@linuxworld.com.au> writes:
> I think if we allow the lock manager to spill to disk (and I think we do
> need to allow it) then we should also be able to control the amount of
> shared memory allocated.

You mean like max_locks_per_transaction?

			regards, tom lane
Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> Gavin also mentioned to me we should also control the amount of memory
> the shared inval queue uses.

Perhaps, but I've really seen no evidence that there's a need to worry about
that.  Without demonstrated problems I'd sooner keep that code a bit simpler
and faster.

			regards, tom lane
Tom Lane wrote:
> Gavin Sherry <swm@linuxworld.com.au> writes:
> > I think if we allow the lock manager to spill to disk (and I think we do
> > need to allow it) then we should also be able to control the amount of
> > shared memory allocated.
>
> You mean like max_locks_per_transaction?

IMO, max_locks_per_transaction could use a better name and a little more
documentation.  I've mentioned this a couple of times before, but there is
at least one type of lock that does not expire when the transaction ends
(user locks).

I may be over my head here, but I think lock spillover is dangerous.  In the
extreme situations where this would happen, it would be a real performance
buster.  Personally, I would rather see locks escalate when the lock table
gets full, or at least allow this as a configuration parameter.

Merlin
"Merlin Moncure" <merlin.moncure@rcsonline.com> writes: > I may be over my head here, but I think lock spillover is dangerous. In > the extreme situations where this would happen, it would be a real > performance buster. Personally, I would rather see locks escalate when > the table gets full, or at least allow this as a configuration > parameter. To me, "performance buster" is better than "random, unrepeatable deadlock failures". In any case, if we find we *can't* implement this in a non-performance-busting way, then it would be time enough to look at alternatives that force the user to manage the problem for us. regards, tom lane
On Mon, Dec 20, 2004 at 11:47:41AM -0500, Tom Lane wrote:
> To me, "performance buster" is better than "random, unrepeatable
> deadlock failures".  In any case, if we find we *can't* implement this
> in a non-performance-busting way, then it would be time enough to look
> at alternatives that force the user to manage the problem for us.

I am confused by this discussion.  To solve the problem I want to solve, we
have three orthogonal possibilities:

1. implement shared row locking using the ideas outlined in the mail
   starting this thread (pg_clog-like seems to be the winner, details TBD).

2. implement a shared lock table spill-to-disk mechanism.

3. implement lock escalation.

Some people seem to think 3 is better than 2.  What do they think of 1?

Some facts:

- DB2 implements 3 and some people have problems with deadlocks.

- 2 could have a performance impact, and we don't even know how to start.
  For example, what would be an algorithm to decide what locks to send to
  disk?

- I am interested in implementing 1, maybe 2.  Don't know about 3.

--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"One man's impedance mismatch is another man's layer of abstraction."
(Lincoln Yeoh)
Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> To solve the problem I want to solve, we have three orthogonal
> possibilities:
> 1. implement shared row locking using the ideas outlined in the mail
>    starting this thread (pg_clog-like seems to be the winner, details TBD).
> 2. implement a shared lock table spill-to-disk mechanism.
> 3. implement lock escalation.

Check.

> - 2 could have a performance impact, and we don't even know how to start.
>   For example, what would be an algorithm to decide what locks to send to
>   disk?

LRU, perhaps?  That's all open for investigation still.

#1 could have a pretty serious performance impact, too.  For small numbers
of FOR UPDATE locks (too few to force spill to disk) I would expect #2 to
substantially beat #1.  #1 essentially imposes the worst case performance at
all times, whereas #2 degrades (at a currently unknown rate) when there are
lots and lots of FOR UPDATE locks.  Most of the applications I've seen don't
take out that many FOR UPDATE locks at once, so I'm unclear on the rationale
for choosing a fixed-but-poor performance curve over one that is fast for
few locks and degrades for many locks.  Especially when the value of "many"
is user-configurable.

Furthermore, we have also seen issues with too many locks on ordinary
objects, which #2 would solve simultaneously.  So I feel that #2 is clearly
the approach to try first.  If we find that we can't do spill-to-disk
without serious performance degradation, then I'd be inclined to try #1
next.  I really don't care for the user-visible semantics changes implied
by #3 ...

			regards, tom lane
> In general, I agree with Tom: I haven't seen many programs that use
> extended SELECT FOR UPDATE logic.  However, the ones I have seen have
> been batch style programs written using a whole-table cursor - these
> latter ones have been designed for the cursor stability approach.

I think if we add shared locks we should by default behave like the cursor
stability isolation level, which only holds one shared lock for the current
cursor row.  The semantics are well defined in SQL.  If you want repeatable
read you need to change the isolation level.

I know FK checks will need to keep the locks, but I would special-case that.

Andreas
On Mon, Dec 20, 2004 at 03:09:24PM -0300, Alvaro Herrera wrote:
> To solve the problem I want to solve, we have three orthogonal
> possibilities:
>
> 1. implement shared row locking using the ideas outlined in the mail
>    starting this thread (pg_clog-like seems to be the winner, details TBD).
>
> 2. implement a shared lock table spill-to-disk mechanism.
>
> 3. implement lock escalation.
>
> Some people seem to think 3 is better than 2.  What do they think of 1?
>
> Some facts:
>
> - DB2 implements 3 and some people have problems with deadlocks.

FWIW, I have seen DB2 deadlock on a single-row update statement in a
database with no one else connected.  It was an issue with how they were
implementing repeatable read concurrency.  What this indicates to me is
that they've got some 'issues' with their locking mechanisms, and that #3
shouldn't be thrown out just because DB2 doesn't do it right.  AFAIK Sybase
and SQL Server also use lock escalation and I've not heard of issues with
it.

--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy!  www.distributed.net  Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
On Thu, 16 Dec 2004 21:54:14 -0300, Alvaro Herrera
<alvherre@dcc.uchile.cl> wrote:
> Else, it will have to wait, using XactLockTableWait, for the first
> transaction in the array that is still running.  We can be sure that no
> one will try to share-lock the tuple while we check the btree, because we
> hold an exclusive lock on the tuple's heap page's buffer.

Do you really want to XactLockTableWait while holding an exclusive lock on a
heap page?  Or are you going to release the lock before entering the wait?

> Thanks to tablespaces, it's very easy to create special Relations that
> can be dealt with by the standard buffer and storage managers, etc.

I don't get how tablespaces would help.

> The btree idea:
> - does not need crash recovery.  Maybe we could use a stripped-down
>   version of nbtree.  This could cause a maintainability nightmare.

It could be a nightmare if you simply duplicate and then modify the code.  A
better way would be to refactor nbtree to be able to handle btrees with
different properties:

. shared/private
. logged/non-logged
. flexible value data type

> - can't hold more than 300 or so simultaneous lockers (because of the
>   value length, limited to 1/3 of a page).  I doubt this is a real
>   problem.

Or you store each lock as a separate index tuple.  If this is expected to be
a problem because of many repeating key values, nbtree should be enhanced to
support duplicate key compression, which might be a good thing per se.

Servus
 Manfred
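For reference, heap_update/heap_delete already handle a similar situation by
dropping the buffer content lock before sleeping on another transaction and
rechecking the tuple afterwards; presumably the share-lock path would do the
same.  Roughly (a fragment only, mirroring the existing heapam convention):

/* Never sleep on another transaction while holding a buffer content lock. */
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);     /* release the content lock ... */
XactLockTableWait(xwait);                   /* ... sleep until xwait ends ... */
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);  /* ... then re-lock and recheck,  */
                                            /* since the entry may have changed */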