Notes on lock table spilling - Mailing list pgsql-hackers

From: Alvaro Herrera
Subject: Notes on lock table spilling
Date:
Msg-id: 20050330220915.GA7115@dcc.uchile.cl
Responses: Re: Notes on lock table spilling
           Re: Notes on lock table spilling
List: pgsql-hackers
Hackers,

I'm looking at what I can do about spilling the lock table to disk.
I welcome comments on the ideas outlined here.  If anyone sees a
showstopper, please let me know.


Notes on Spilling the Lock Table to Disk
----------------------------------------
The problem being solved here is the inherent inability of the lock table
to grow beyond whatever fits into shared memory.  This has not been a
problem until now, but it will be as soon as we implement tuple locking
through the lock table rather than by marking the tuples on disk.  Solving
this problem will also remove the max_locks_per_transaction limit.

In this new design there are two hash tables in shared memory.  The first
one is the heavy, complete table holding all information about locks (the
table we have now).  In this table, the key is LOCKTAG (4 bytes) and the
value is LOCK (132 bytes total on my system).

The second table is a "placeholder table", whose key is again LOCKTAG and
whose value is a pointer to where the entry can be found on disk (8 bytes
total on my system).
This place on disk would be based on the slru.c code that is already used in
pg_clog and pg_subtrans.
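
To make the layout concrete, here is a rough sketch of the two entry types.
The struct and type names are mine, purely for illustration (LOCKTAG and
LOCK are the existing lock.h types):

/* Illustration only -- not proposed code. */
typedef uint32 SpillLocation;        /* SLRU page number and slot, packed */

typedef struct LockPlaceholder
{
    LOCKTAG         tag;             /* hash key, same as in the main table */
    SpillLocation   diskLocation;    /* where the full LOCK entry lives on disk */
} LockPlaceholder;

/* The main table entry stays what it is today: key LOCKTAG, value LOCK. */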

A would-be locker that needs to check whether a lock has been taken first
searches the main hash table.  If the lock is found there, the second
table need not be consulted.  If the lock is not found, the locker checks
a boolean that indicates whether there is anything in the second table,
and searches it only if needed.  If it finds the lock there, the entry
must be pulled back in: read from disk, deleted from the secondary table,
and put into the main table.
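
In pseudo-C the lookup path would look roughly like this; hash_search() is
the existing dynahash interface, while MainLockTable, PlaceholderTable,
LockTableHasSpilledEntries and ReadSpilledLock are names invented for the
sketch:

static LOCK *
LookupLock(const LOCKTAG *tag)
{
    LOCK            *lock;
    LockPlaceholder *ph;

    /* 1. Check the main (complete) hash table first. */
    lock = (LOCK *) hash_search(MainLockTable, (void *) tag, HASH_FIND, NULL);
    if (lock != NULL)
        return lock;

    /* 2. Look at the placeholder table only if something was ever spilled. */
    if (!LockTableHasSpilledEntries)
        return NULL;

    ph = (LockPlaceholder *) hash_search(PlaceholderTable, (void *) tag,
                                         HASH_FIND, NULL);
    if (ph == NULL)
        return NULL;

    /* 3. Pull the entry back: read it from the SLRU area, insert it into
     *    the main table, and drop the placeholder. */
    lock = (LOCK *) hash_search(MainLockTable, (void *) tag, HASH_ENTER, NULL);
    ReadSpilledLock(ph->diskLocation, lock);
    hash_search(PlaceholderTable, (void *) tag, HASH_REMOVE, NULL);

    return lock;
}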

One problem is that the current shmem allocator is unable to return memory
to the shared memory pool.  So if we fill the shared memory with the primary
table, there won't be space for the secondary table when we need it.  One
way around this is to preallocate space for both tables (or maybe only the
second).  Another would be to create a freelist for shmem, but that hasn't
been worth the trouble all these years ...
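
For the preallocation route, something like the following at shared-memory
initialization time should do; the sizing constants and table names are
made up, and I'm assuming that passing init_size equal to max_size makes
ShmemInitHash() carve out all the entries at startup:

HASHCTL     info;

MemSet(&info, 0, sizeof(info));
info.keysize = sizeof(LOCKTAG);
info.entrysize = sizeof(LOCK);
info.hash = tag_hash;

/* init_size == max_size: assumed to allocate everything up front */
MainLockTable = ShmemInitHash("Lock Table",
                              NUM_LOCK_ENTRIES, NUM_LOCK_ENTRIES,
                              &info, HASH_ELEM | HASH_FUNCTION);

info.entrysize = sizeof(LockPlaceholder);
PlaceholderTable = ShmemInitHash("Lock Placeholder Table",
                                 NUM_PLACEHOLDER_ENTRIES,
                                 NUM_PLACEHOLDER_ENTRIES,
                                 &info, HASH_ELEM | HASH_FUNCTION);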


Memory Requirements
-------------------
The current default settings allow for 6400 locks (64 locks per transaction,
100 max connections).  This is about 850 kB on my system, or around 103
BLCKSZ pages.  With the new design we would use 8 BLCKSZ pages for the SLRU
subsystem, 64 kB for the secondary table (which allows for ~ 8192
placeholders) and 640 kB for the main table (which allows for ~ 5000 locks).
So it would use the same amount of memory as now, and it could hold around
13000 locks.  Or we could use 128 kB for the secondary table and 576 kB
for the main table, which would hold 20000 locks.  This is without
increasing the current memory requirements; on a normal system we could use
three or four times that amount without much problem.  Maybe it can be made
configurable.
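
For reference, the arithmetic behind those figures (entry sizes as quoted
above; all numbers approximate):

/*
 * current defaults: 64 locks/xact * 100 connections = 6400 entries
 *   6400 * 132 bytes ~= 845 kB, i.e. ~103 pages of BLCKSZ (8 kB)
 *
 * proposed split, roughly the same memory:
 *   main table:        640 kB / 132 bytes ~= 5000 locks
 *   placeholder table:  64 kB /   8 bytes  = 8192 placeholders
 *   SLRU buffers:         8 pages * 8 kB   =   64 kB
 *   lockable objects   ~= 13000
 *
 * alternative split:
 *   main table:        576 kB / 132 bytes ~= 4500 locks
 *   placeholder table: 128 kB /   8 bytes  = 16384 placeholders
 *   lockable objects   ~= 20000
 */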


SLRU Handling
-------------
For the SLRU system, we can keep a bitmap at the start of each page
indicating which entries are used.  Thus we don't have to scan the whole
page to look for an empty place to put a new entry.

We can keep track of pages with free space using a single page number in
memory, plus a page number at the start of each page (which records what
the in-memory page number was when an item was deleted from that page).
The in-memory page number thus always points to a page with free space;
when that page fills up, the pointer falls back to the page that was
previously being filled.
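
Just to illustrate, a possible page layout and free-space chain (all names
and sizes invented; BLCKSZ = 8 kB assumed):

#define SPILL_BITMAP_BYTES  8       /* room for up to 64 slots per page */

typedef struct SpillPageHeader
{
    /* the in-memory "current free page" at the time an item was last
     * deleted from this page; these headers chain together all pages
     * known to have free slots */
    int     prevFreePage;

    /* one bit per slot: set means the slot holds a spilled LOCK entry */
    uint8   usedBitmap[SPILL_BITMAP_BYTES];
} SpillPageHeader;

#define SPILL_ENTRIES_PER_PAGE \
    ((BLCKSZ - sizeof(SpillPageHeader)) / sizeof(LOCK))

/* Only this lives in shared memory: the head of the free-space chain. */
static int  CurrentSpillFreePage;

New entries always go into CurrentSpillFreePage; when it fills up, the
pointer is replaced with that page's prevFreePage.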

We still need a way to compact this storage, so the log can be truncated.
It's not clear to me how to do that, however, because we have to move the
entry on disk and at the same time change the pointer in the secondary
hash table.

Strategy
--------
One further question is which locks to move out of the main table.  It
seems the easiest approach is LRU.  I think we need two LRU queues, one
for tuple locks and another for the rest (moving tuple locks out first).
Thus heavy deletes don't block the rest of the operations.
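
The eviction choice itself is then trivial; function and queue names below
are invented for the sketch (the queues could be SHM_QUEUEs or similar):

static LOCK *
ChooseLockToSpill(void)
{
    /* Prefer the least-recently-used tuple lock, so that a transaction
     * acquiring huge numbers of tuple locks mostly pushes out its own
     * entries ... */
    if (!LRUQueueIsEmpty(&TupleLockLRU))
        return LRUQueueRemoveOldest(&TupleLockLRU);

    /* ... and touch table-level and other locks only as a last resort. */
    return LRUQueueRemoveOldest(&OtherLockLRU);
}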


Further Problems
----------------
We have a problem as soon as somebody tries to delete a lot of rows from
a big table.  We cannot possibly extend the memory requirements forever,
so we need to spill to disk without having an in-shared-memory index.
This is bad, because performance will suck.  One idea would be to store
the remaining tuple locks in a file on disk, one file per relation.
This way, a transaction doing a heavy DELETE does not need to make other
concurrent transactions come to a halt (but it will cause trouble for
other transactions working on the same table).  I'm open to better ideas
on this.

The locking strategy for SLRU and the secondary hash table handling
should be explained in more detail.

-- 
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
FOO MANE PADME HUM

