Thread: Notes on lock table spilling

Notes on lock table spilling

From
Alvaro Herrera
Date:
Hackers,

I'm seeing what can I do about spilling the lock table to disk
I welcome comments on the ideas outlined here.  If anyone sees a
showstopper please let me know.


Notes on Spilling the Lock Table to Disk
----------------------------------------
The problem being solved here is the inherent limitation of the lock table
to grow in size beyond whatever fits into shared memory.  This has not being
a problem until now, but will be as soon as we implement locking tuples
using it rather than marking the tuples on disk.  At the same time solving
this problem will remove the max_locks_per_transaction limit.

On this new design there are two hash tables in shared memory.  The first
one is the heavy, complete table holding all information about locks (the
table we have now).  In this table, the key is LOCKTAG (4 bytes) and the
value is LOCK (132 bytes total on my system).

The second table is a "placeholder table", with LOCKTAG as key and a pointer
to where the entry can be found on disk as value (8 bytes total on my system).
This place on disk would be based on the slru.c code that is already used in
pg_clog and pg_subtrans.

A would-be locker which needs to check whether the lock has been taken,
first searches in the main hash table.  If the lock is found, the second
table needs not be scanned.  On the other hand, if the lock is not found, it
checks a boolean that indicates whether there is something on the second
table, and searches that if needed.  If it finds the lock there, it must be
extracted, deleted from the secondary table and put into the main table.

One problem is that the current shmem allocator is unable to return memory
to the shared memory pool.  So if we fill the shared memory with the primary
table, there won't be space for the secondary table when we need it.  A way
around this is preallocating space for both tables (or maybe only the second).
Another would be creating a freelist for shmem, but it hasn't been worth the
trouble all these years ...


Memory Requirements
-------------------
The current default settings allow for 6400 locks (64 locks per transaction,
100 max connections).  This is about 850 kB on my system, or around 103
BLCKSZ pages.  With the new design we would use 8 BLCKSZ pages for the SLRU
subsystem, 64 kB for the secondary table (which allows for ~ 8192
placeholders) and 640 kB for the main table (which allows for ~ 5000 locks).
So it would use the same amount of memory as now, and it could hold around
13000 locks.  Or if we could use 128 kB for the secondary table and 576 kB
for the main table, and it will hold 20000 locks.  This is without
increasing the current memory requirements; on a normal system we could use
three or four times that amount without much problem.  Maybe it can be made
configurable.


SLRU Handling
-------------
On handling the SLRU system, we can keep a bitmap at the start of each page
indicating which entries are used.  Thus we don't have to scan the whole
page to look for an empty place to put a new entry.

We can keep track of pages with free space with a single page number in
memory and another page number at the start of each page (which records what
the page number in memory was when an item was deleted from this page).  So
the page number in memory always has space, and when it fills, its pointer
goes to the page that was previously being filled.

We still need a way to compact this, so the log can be truncated.  It's not
clear to me how to do that however, because we have to move the entry on
disk and at the same time change the pointer in the secondary hash table.

Strategy
--------
One further question is what locks to move out of the main table.  It seems
the easiest way is using LRU.  I think we need to use two LRU queues, one
for tuple locks and other for the rest (and moving tuple locks first).  Thus
heavy deletes don't block the rest of the operations.


Further Problems
----------------
We have a problem as soon as somebody tries to delete a lot of rows from
a big table.  We cannot possibly extend the memory requirements forever,
so we need to spill to disk without having an in-shared-memory index.
This is bad, because performance will suck.  One idea would be to store
the remaining tuple locks on a file on disk, one file per relation.
This way, a transaction doing a heavy DELETE does not need to make other
concurrent transactions come to a halt (but it will cause trouble to
other transactions working on the same table).  I'm open to better ideas
on this.

The locking strategy for SLRU and the secondary hash table handling
should be explained in more detail.

-- 
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
FOO MANE PADME HUM


Re: Notes on lock table spilling

From
Tom Lane
Date:
Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> We have a problem as soon as somebody tries to delete a lot of rows from
> a big table.  We cannot possibly extend the memory requirements forever,
> so we need to spill to disk without having an in-shared-memory index.

Yes.  I'm not sure that I see the point of the in-memory index at all...
there is some intermediate regime where it would improve performance,
but it surely does not solve the basic problem that shared memory is
finite.

Maybe something involving lossy storage would work?  Compare recent
discussions about lossy bitmaps generated from index scans.
        regards, tom lane


Re: Notes on lock table spilling

From
Alvaro Herrera
Date:
On Thu, Mar 31, 2005 at 12:19:08AM -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> > We have a problem as soon as somebody tries to delete a lot of rows from
> > a big table.  We cannot possibly extend the memory requirements forever,
> > so we need to spill to disk without having an in-shared-memory index.
> 
> Yes.  I'm not sure that I see the point of the in-memory index at all...
> there is some intermediate regime where it would improve performance,
> but it surely does not solve the basic problem that shared memory is
> finite.

My idea was that it would help eliminate I/O.  But probably you are
right that it may be the wrong idea; probably it's better to have an
on-disk index for the on-disk storage of LOCK (we don't want sequential
scanning of an on-disk lock array, do we?).  If it's cached in memory,
then no I/O would be needed until we started recording a lot of locks,
thus achieving the same effect with simpler code and better degradation.

I'm thinking in sketching some sort of simple btree on top of slru
pages.  Nothing concrete yet.


> Maybe something involving lossy storage would work?  Compare recent
> discussions about lossy bitmaps generated from index scans.

Hmm.  Some problems come to mind:

- How to decide when to use lossy storage.  Perhaps if the executor (or rather, the planner) thinks it is about to
acquirelots of tuple locks, then it could hint the lock manager.
 

- How to unlock?  (or: how to know when the lock is released for sure?) I think this can be solved by counting
lockers/unlockers.

-- 
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"Nadie esta tan esclavizado como el que se cree libre no siendolo" (Goethe)


Re: Notes on lock table spilling

From
Simon Riggs
Date:
On Wed, 2005-03-30 at 18:09 -0400, Alvaro Herrera wrote:
> I'm seeing what can I do about spilling the lock table to disk
> I welcome comments on the ideas outlined here.  If anyone sees a
> showstopper please let me know.

Perhaps a little delayed, but yes, I have major reservations about the
whole concept of spilling the lock table to disk. If you implement this,
I would very much like a switch to be able to turn it off, somehow.

All else you've done about locking sounds great and I don't want my
comments here to detract from my general appreciation of your work and
the project as a whole.

IIRC there is not another major system that spills locks to disk and
there's a big reason: performance is very poor. Other systems accept
some limitations in order to avoid that. Oracle takes locks and holds
them within the block itself, which then need to have block cleanup
performed on them next time they're read. DB2 has a lock table which
doesn't grow in size or spill to disk, ever. [It does do lock
escalation, which isn't too great either - but this will only be granted
if no other users conflict, so its not too bad a solution.]

DB2 even goes to great lengths to avoid this by offering additional
locking modes of Cursor Stability (CS) - which only locks the rows
currently being viewed or on which a cursor is currently placed. DB2
would call locking everything Repeatable Read mode (RR) which performs
so badly in most situations people don't use it - and thats even without
spilling to disk. Oracle limits the number of blocks that will have
copies of them held in the Undo tablespace, so there is an effective
limit on number of locks that can be taken there also.

I would be more inclined to implement a rolling window lock similar to
CS, so you never run out of lock table space. That can give you issues,
sure, but not performance ones. That idea may have flaws also: I do not
wish at this stage to back an alternative, just to say that the spilling
to disk idea sounds like it has big problems, IMHO.

If you did go down the route of spilling to disk, I'd would like to make
very sure that only the person that asked for hundreds of locks would be
effected by all those locks. i.e. only their locks were spilled to disk,
so the lock cache is effectively "scan resistant". Other users with
modest locking requests would be uneffected. Not sure how you'd do that
with the SLRU code as it is.

There are no easy answers to this conundrum, but I'm not sure spilling
to disk is the answer. 

Best Regards, Simon Riggs



Re: Notes on lock table spilling

From
"Merlin Moncure"
Date:
> On Wed, 2005-03-30 at 18:09 -0400, Alvaro Herrera wrote:
> > I'm seeing what can I do about spilling the lock table to disk
> > I welcome comments on the ideas outlined here.  If anyone sees a
> > showstopper please let me know.
>
> Perhaps a little delayed, but yes, I have major reservations about the
> whole concept of spilling the lock table to disk. If you implement
this,
> I would very much like a switch to be able to turn it off, somehow.

me too.  I very much like the current behavior, so that a transaction is
aborted when lock table space is exhausted.  This is pretty flexible
because you can always manipulate the lock table size.  However,
Alvaro's stuff would still be nice when doing massive database changes
that are not performance sensitive.

By the way Alvaro, are you still planning to reorganize the lock union?

Merlin


Re: Notes on lock table spilling

From
Greg Stark
Date:
Simon Riggs <simon@2ndquadrant.com> writes:

> DB2 even goes to great lengths to avoid this by offering additional
> locking modes of Cursor Stability (CS) - which only locks the rows
> currently being viewed or on which a cursor is currently placed. DB2
> would call locking everything Repeatable Read mode (RR) which performs
> so badly in most situations people don't use it - and thats even without
> spilling to disk. Oracle limits the number of blocks that will have
> copies of them held in the Undo tablespace, so there is an effective
> limit on number of locks that can be taken there also.

I think you're mixing up different things. DB2's Cursor Stability vs
Repeatable Read modes are more akin to READ COMMITTED vs SERIALIZABLE.

Postgres doesn't need to lock records at all for simply ensuring consistent
snapshots. The locks he's working on are the ones used to prevent updates to
parent records while an in-progress transaction is inserting or updating a
child record in a table with a foreign key dependence.

That said, it would be interesting to have an option to set on foreign keys to
indicate that no locks are necessary. Perhaps in the form of an option on the
parent table that would prevent any primary keys from ever being deleted.

-- 
greg



Re: Notes on lock table spilling

From
Alvaro Herrera
Date:
On Mon, Apr 04, 2005 at 07:08:11PM +0100, Simon Riggs wrote:

> IIRC there is not another major system that spills locks to disk and
> there's a big reason: performance is very poor. Other systems accept
> some limitations in order to avoid that. Oracle takes locks and holds
> them within the block itself, which then need to have block cleanup
> performed on them next time they're read. DB2 has a lock table which
> doesn't grow in size or spill to disk, ever.

We can have the in-memory hash table grow to be "enough for most
applications"; say, have it store 10000 locks without resorting to
spilling.  As a secondary storage there would be the in-memory buffers
for the spill area; comfortably another 5000 locks.  Because there is no
need to have the lock table persist across system crashes (except in
case we implement 2PC), there's no need to fsync() or XLog this
information.  So you would not touch disk until you have acquired 15000
locks or so.  This doesn't really address your concern, because for
spilling logic to work we may need to evict any transaction's lock, not
just our own.  Thus we can't guarantee that only the heavy locker's
locks are spilled (spilt?)


Now, I talked to Bruce via IM the other day and he doesn't like the idea
of using the regular lock manager to lock tuples.  It's too heavy, he
says.  So he proposed using "phantom Xids" instead.  (Some of you may
remember the phantom CommandIds back in the subtransactions day; these
are quite different, so they have none of the problems the others had.)
This idea is explained below.

In that chat, we also commented on a third possible solution: using a
lock manager process, instead of using shared memory for locks.  I'm not
sure if this is really workable; there needs to be some means of
communication beyond simple signals for this to be effective, I think.
Bruce didn't seems very fond of this idea; at least he seemed way more
attracted to the phantoms.


So, we now have four ideas for solving the shared-row locking
problem:

A. Using the lock manager to lock tuples.  This requires some sort of spill-to-disk logic.

B. Not using the lock manager to lock tuples.  This can be done by using Bruce's Phantom Xids idea.

C. Using a hybrid approach.  This can be done using the unused space in heap pages, as proposed by  Paul Tillotson, and
fallingback to the lock manager.  Thus we still  need some spilling logic.  I'm still not clear how would this work
giventhe current page layout and XLog requirements.
 

D. Using a lock manager process.  With a separate process, we don't need  any of this, because it can use the amount of
memoryit sees fit.  We  may have some communicational overhead which may make the idea  unfeasible.  But of course, we
wouldretain a shared memory area for  "caching" (so we wouldn't need to touch the process until we are holding  lots of
locks.)Does anyone thinks this is an idea worth pursuing?  Or  rather, does anyone see huge showstoppers here?
 


Using Phantom Xids
==================
The idea here is to use an approach similar to what we use now: mark the
tuples with an Xid when it is locked.  A phantom Xid is a sort-of Xid,
with multiple real Xids associated to it.  So we mark the tuple with the
regular Xid the first time the share lock is acquired; if a second
transaction wants to lock the tuple, it creates a new phantom Xid which
"contains" the original Xid in the tuple and its own Xid, insert it into
the phantom Xid table, and mark the tuple with that as Xmax.

Thus, the operations we need to implement this are:
1. Create new phantom Xid with two real Xids
2. Find the phantom Xid which has some combination of real Xids
3. Add a real Xid to a phantom Xid
4. Cleanup the whole thing

Bruce proposes using two hash tables for this: one with the phantom Xid
as the key, and other with XOR(real xids) as key.  Thus we could implement
1, 2 and 3 easily.  Cleanup would happen when somebody else visits the
phantom Xid and finds it populated with non-running transactions (it can
clean the Xmax marking as well, if the phantom Xid ends up empty), and
by VACUUM so that we don't have problems with Xid wraparound.


Comments?

-- 
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"La vida es para el que se aventura"


Re: Notes on lock table spilling

From
Alvaro Herrera
Date:
On Mon, Apr 04, 2005 at 02:50:30PM -0400, Merlin Moncure wrote:

> > Perhaps a little delayed, but yes, I have major reservations about
> > the whole concept of spilling the lock table to disk. If you
> > implement this, I would very much like a switch to be able to turn
> > it off, somehow.

> me too.  I very much like the current behavior, so that a transaction is
> aborted when lock table space is exhausted.  This is pretty flexible
> because you can always manipulate the lock table size.

Hmm.  I'll keep that in consideration (if we come to implement lock
table spilling.)

> By the way Alvaro, are you still planning to reorganize the lock union?

Not right now.

-- 
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
Este mail se entrega garantizadamente 100% libre de sarcasmo.


Re: Notes on lock table spilling

From
Simon Riggs
Date:
On Mon, 2005-04-04 at 16:27 -0400, Alvaro Herrera wrote:
> On Mon, Apr 04, 2005 at 07:08:11PM +0100, Simon Riggs wrote:
> 
> > IIRC there is not another major system that spills locks to disk and
> > there's a big reason: performance is very poor. Other systems accept
> > some limitations in order to avoid that. Oracle takes locks and holds
> > them within the block itself, which then need to have block cleanup
> > performed on them next time they're read. DB2 has a lock table which
> > doesn't grow in size or spill to disk, ever.
> 
> We can have the in-memory hash table grow to be "enough for most
> applications"; say, have it store 10000 locks without resorting to
> spilling.  As a secondary storage there would be the in-memory buffers
> for the spill area; comfortably another 5000 locks.  Because there is no
> need to have the lock table persist across system crashes (except in
> case we implement 2PC), there's no need to fsync() or XLog this
> information.  So you would not touch disk until you have acquired 15000
> locks or so.  This doesn't really address your concern, because for
> spilling logic to work we may need to evict any transaction's lock, not
> just our own.  Thus we can't guarantee that only the heavy locker's
> locks are spilled (spilt?)

Well, thanks but still not happy. I'd like to find a way to penalise
only the heavy locker (and anybody unlucky enough to request rows that
have been locked by them). One idea: if we had an in-memory hash, then
an on-disk b-tree we'd be able to limit the number of locks any backend
could have in memory but allow locks to spill to disk for them. 

> So, we now have four ideas for solving the shared-row locking
> problem:

Five

> 
> A. Using the lock manager to lock tuples.
>    This requires some sort of spill-to-disk logic.

I'm worried about extra LWlock contention which we really don't need.

> B. Not using the lock manager to lock tuples.
>    This can be done by using Bruce's Phantom Xids idea.
> 
> C. Using a hybrid approach.
>    This can be done using the unused space in heap pages, as proposed by
>    Paul Tillotson, and falling back to the lock manager.  Thus we still
>    need some spilling logic.  I'm still not clear how would this work
>    given the current page layout and XLog requirements.

Well, that might work. Each lock is associated with a transaction, so we
can test the lock applicability the same way we read row visibility.
Setting a lock would not alter the dirty flag on the block. If a block
does get written, we can "vacuum" the old locks next time the lock table
is used. If the on-block lock table is full, we know we need to check
the on-disk lock structure.

> D. Using a lock manager process.  With a separate process, we don't need
>    any of this, because it can use the amount of memory it sees fit.  We
>    may have some communicational overhead which may make the idea
>    unfeasible.  But of course, we would retain a shared memory area for
>    "caching" (so we wouldn't need to touch the process until we are holding
>    lots of locks.) Does anyone thinks this is an idea worth pursuing?  Or
>    rather, does anyone see huge showstoppers here?

(Another process: hmmm, getting crowded isn't it?)

E. Implement a lock mode that doesn't lock all the rows that have been
touched, just the current rolling window: less locks, less need to spill
to disk, so less need to find really good solution.

Perhaps its time to revisit what the benefits are of doing *any* of the
above are again. Maybe there's a different way to do just 80% of those
requirements with much less pain all round.

Best Regards, Simon Riggs



Re: Notes on lock table spilling

From
Greg Stark
Date:
Alvaro Herrera <alvherre@dcc.uchile.cl> writes:

> Using Phantom Xids
> ==================
> The idea here is to use an approach similar to what we use now: mark the
> tuples with an Xid when it is locked.  A phantom Xid is a sort-of Xid,
> with multiple real Xids associated to it.  So we mark the tuple with the
> regular Xid the first time the share lock is acquired; if a second
> transaction wants to lock the tuple, it creates a new phantom Xid which
> "contains" the original Xid in the tuple and its own Xid, insert it into
> the phantom Xid table, and mark the tuple with that as Xmax.

That sounds like a fancy way to describe "make a linked list of lockers".

-- 
greg



Re: Notes on lock table spilling

From
Alvaro Herrera
Date:
On Mon, Apr 04, 2005 at 11:32:47PM -0400, Greg Stark wrote:
> 
> Alvaro Herrera <alvherre@dcc.uchile.cl> writes:
> 
> > Using Phantom Xids
> > ==================
> > The idea here is to use an approach similar to what we use now: mark the
> > tuples with an Xid when it is locked.  A phantom Xid is a sort-of Xid,
> > with multiple real Xids associated to it.  So we mark the tuple with the
> > regular Xid the first time the share lock is acquired; if a second
> > transaction wants to lock the tuple, it creates a new phantom Xid which
> > "contains" the original Xid in the tuple and its own Xid, insert it into
> > the phantom Xid table, and mark the tuple with that as Xmax.
> 
> That sounds like a fancy way to describe "make a linked list of lockers".

Yeah, that's the idea :-)  However I was trying to describe how it would
be stored.  We have room for only one Xid in the tuple, so in order to
store multiple lockers, we invent the concept of an Xid that's tied to
multiple transactions.

-- 
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
Maybe there's lots of data loss but the records of data loss are also lost.
(Lincoln Yeoh)