Notes on lock table spilling - Mailing list pgsql-hackers
From | Alvaro Herrera
Subject | Notes on lock table spilling
Date |
Msg-id | 20050330220915.GA7115@dcc.uchile.cl
List | pgsql-hackers
Hackers,

I'm seeing what I can do about spilling the lock table to disk.  I
welcome comments on the ideas outlined here.  If anyone sees a
showstopper, please let me know.

Notes on Spilling the Lock Table to Disk
----------------------------------------

The problem being solved here is the inherent limitation of the lock
table: it cannot grow beyond whatever fits in shared memory.  This has
not been a problem until now, but it will be as soon as we implement
locking tuples through the lock table rather than by marking the tuples
on disk.  Solving this problem will also remove the
max_locks_per_transaction limit.

In this new design there are two hash tables in shared memory.  The
first one is the heavy, complete table holding all information about
locks (the table we have now).  In this table the key is LOCKTAG (4
bytes) and the value is LOCK (132 bytes total on my system).  The
second table is a "placeholder table", with LOCKTAG as key and, as
value, a pointer to where the entry can be found on disk (8 bytes total
on my system).  This place on disk would be based on the slru.c code
that is already used by pg_clog and pg_subtrans.

A would-be locker that needs to check whether a lock has been taken
first searches the main hash table.  If the lock is found there, the
second table need not be scanned.  If the lock is not found, the locker
checks a boolean indicating whether there is anything in the second
table, and searches it if needed.  If it finds the lock there, the
entry must be extracted: deleted from the secondary table and put into
the main table.

One problem is that the current shmem allocator is unable to return
memory to the shared memory pool.  So if we fill shared memory with the
primary table, there won't be space for the secondary table when we
need it.  One way around this is preallocating space for both tables
(or maybe only the second).  Another would be creating a freelist for
shmem, but that hasn't been worth the trouble all these years ...
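As a rough illustration of the two-level lookup order described above,
here is a minimal C sketch.  The names (MainEntry, SpillEntry,
lock_lookup, fetch_from_disk) and the linear-scan "hash tables" are
invented stand-ins, not PostgreSQL's actual shmem hash implementation;
only the lookup/promotion order follows the post:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical simplified stand-ins for the two shared-memory tables. */
typedef uint32_t LockTag;                            /* the 4-byte LOCKTAG key */

typedef struct { LockTag tag; int holders; bool used; } MainEntry;     /* heavy entry */
typedef struct { LockTag tag; uint32_t disk_slot; bool used; } SpillEntry; /* placeholder */

#define MAIN_SIZE  4
#define SPILL_SIZE 4

static MainEntry  main_tab[MAIN_SIZE];
static SpillEntry spill_tab[SPILL_SIZE];
static bool       spill_nonempty = false;   /* the "anything spilled?" boolean */

/* Pretend to read the full LOCK entry back from the SLRU area. */
static MainEntry fetch_from_disk(LockTag tag, uint32_t slot)
{
    MainEntry e = { tag, 1, true };
    (void) slot;
    return e;
}

/* Lookup in the post's order: main table first; only if the spill flag
 * is set do we consult the placeholder table, and a hit there is
 * deleted from it and promoted back into the main table. */
static MainEntry *lock_lookup(LockTag tag)
{
    for (int i = 0; i < MAIN_SIZE; i++)
        if (main_tab[i].used && main_tab[i].tag == tag)
            return &main_tab[i];

    if (!spill_nonempty)
        return NULL;                        /* fast path: nothing on disk */

    for (int i = 0; i < SPILL_SIZE; i++)
        if (spill_tab[i].used && spill_tab[i].tag == tag)
        {
            for (int j = 0; j < MAIN_SIZE; j++)
                if (!main_tab[j].used)
                {
                    /* promote: fetch entry, delete placeholder */
                    main_tab[j] = fetch_from_disk(tag, spill_tab[i].disk_slot);
                    spill_tab[i].used = false;
                    return &main_tab[j];
                }
            return NULL;                    /* main table full: caller must evict */
        }
    return NULL;
}
```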
Memory Requirements
-------------------

The current default settings allow for 6400 locks (64 locks per
transaction, 100 max connections).  This is about 850 kB on my system,
or around 103 BLCKSZ pages.

With the new design we would use 8 BLCKSZ pages for the SLRU subsystem,
64 kB for the secondary table (which allows for ~8192 placeholders),
and 640 kB for the main table (which allows for ~5000 locks).  So it
would use the same amount of memory as now, yet hold around 13000
locks.  Alternatively we could use 128 kB for the secondary table and
576 kB for the main table, which would hold 20000 locks.  This is
without increasing the current memory requirements; on a normal system
we could use three or four times that amount without much problem.
Maybe it can be made configurable.

SLRU Handling
-------------

In the SLRU system we can keep a bitmap at the start of each page
indicating which entries are in use, so we don't have to scan the whole
page looking for an empty place to put a new entry.  We can keep track
of pages with free space using a single page number in memory, plus
another page number at the start of each page (recording what the
in-memory page number was when an item was deleted from that page).
Thus the page whose number is in memory always has free space, and when
it fills up, its pointer moves to the page that was previously being
filled.

We still need a way to compact this storage so the log can be
truncated.  It's not clear to me how to do that, however, because we
would have to move the entry on disk and at the same time change the
pointer in the secondary hash table.

Strategy
--------

A further question is which locks to move out of the main table.  The
easiest way seems to be LRU.  I think we need two LRU queues, one for
tuple locks and one for everything else (moving tuple locks out first).
That way heavy deletes don't block the rest of the operations.
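The per-page usage bitmap described under "SLRU Handling" could look
something like the following C sketch.  The page layout, the entry
count, and all names here are invented for illustration and do not
match slru.c's actual page format:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical layout of one SLRU page: a usage bitmap at the start so
 * a free slot can be found without scanning every entry on the page. */
#define SLRU_ENTRIES_PER_PAGE 512           /* illustrative, not BLCKSZ-derived */

typedef struct
{
    uint32_t prev_fill_page;                        /* page being filled before this one */
    uint8_t  used_bitmap[SLRU_ENTRIES_PER_PAGE / 8]; /* one bit per entry slot */
    /* ... fixed-size entries follow ... */
} SlruPageHeader;

/* Find a free slot by scanning the bitmap a byte at a time; only bytes
 * that are not 0xFF need their bits examined. */
static int slru_find_free_slot(const SlruPageHeader *page)
{
    for (int byte = 0; byte < SLRU_ENTRIES_PER_PAGE / 8; byte++)
        if (page->used_bitmap[byte] != 0xFF)        /* at least one free bit here */
            for (int bit = 0; bit < 8; bit++)
                if (!(page->used_bitmap[byte] & (1u << bit)))
                    return byte * 8 + bit;
    return -1;                                      /* page is full */
}

static void slru_mark_used(SlruPageHeader *page, int slot)
{
    page->used_bitmap[slot / 8] |= (uint8_t) (1u << (slot % 8));
}
```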
Further Problems
----------------

We have a problem as soon as somebody tries to delete a lot of rows
from a big table.  We cannot possibly extend the memory requirements
forever, so at some point we need to spill to disk without keeping an
in-shared-memory index.  This is bad, because performance will suck.
One idea would be to store the remaining tuple locks in a file on disk,
one file per relation.  This way, a transaction doing a heavy DELETE
does not force other concurrent transactions to a halt (though it will
cause trouble for other transactions working on the same table).  I'm
open to better ideas on this.

The locking strategy for SLRU and the secondary hash table handling
should be explained in more detail.

-- 
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
FOO MANE PADME HUM
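For concreteness, the per-relation spill file floated under "Further
Problems" could be sketched as below.  The file naming scheme, the
record layout, and the function name are entirely made up for
illustration; a real implementation would live under the data directory
and go through PostgreSQL's file APIs:

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical fixed-size record for one spilled tuple lock. */
typedef struct
{
    uint32_t block;     /* heap block number of the locked tuple */
    uint16_t offset;    /* line pointer within that block */
    uint32_t xid;       /* transaction holding the lock */
} TupleLockRec;

/* Append one tuple lock to the relation's flat spill file, so a heavy
 * DELETE can overflow the shared lock table without an in-memory index.
 * Returns 0 on success, -1 on failure. */
static int spill_tuple_lock(uint32_t rel_oid, TupleLockRec rec)
{
    char path[64];
    snprintf(path, sizeof(path), "pg_locks_spill.%u", rel_oid);

    FILE *f = fopen(path, "ab");        /* one file per relation, append-only */
    if (f == NULL)
        return -1;
    size_t n = fwrite(&rec, sizeof(rec), 1, f);
    fclose(f);
    return n == 1 ? 0 : -1;
}
```

Because the file is append-only and scanned without any shared-memory
index, lookups against it would be slow, which is exactly the
performance concern raised above.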