Buffer access rules, and a probable bug - Mailing list pgsql-hackers

From:      Tom Lane
Subject:   Buffer access rules, and a probable bug
Date:
Msg-id:    17307.994124425@sss.pgh.pa.us
Responses: Re: Buffer access rules, and a probable bug
List:      pgsql-hackers
I have been making some notes about the rules for accessing shared disk buffers, since they aren't spelled out anywhere now AFAIK. In the process I found what seems to be a nasty bug in the code that tries to build btree indexes that include already-dead tuples. (If memory serves, Hiroshi added that code awhile back to help suppress the "heap tuples != index tuples" complaint from VACUUM.) Would people look this over and see if they agree with my deductions?

			regards, tom lane


Notes about shared buffer access rules
--------------------------------------

There are two separate access control mechanisms for shared disk buffers: reference counts (a/k/a pin counts) and buffer locks. (Actually, there's a third level of access control: one must hold the appropriate kind of lock on a relation before one can legally access any page belonging to the relation. Relation-level locks are not discussed here.)

Pins: one must "hold a pin on" a buffer (increment its reference count) before being allowed to do anything at all with it. An unpinned buffer is subject to being reclaimed and reused for a different page at any instant, so touching it is unsafe. Typically a pin is acquired by ReadBuffer and released by WriteBuffer (if one modified the page) or ReleaseBuffer (if not). It is OK and indeed common for a single backend to pin a page more than once concurrently; the buffer manager handles this efficiently. It is considered OK to hold a pin for long intervals --- for example, sequential scans hold a pin on the current page until done processing all the tuples on the page, which could be quite a while if the scan is the outer scan of a join. Similarly, btree index scans hold a pin on the current index page. This is OK because there is actually no operation that waits for a page's pin count to drop to zero. (Anything that might need to do such a wait is instead handled by waiting to obtain the relation-level lock, which is why you'd better hold one first.) Pins may not be held across transaction boundaries, however.

Buffer locks: there are two kinds of buffer locks, shared and exclusive, which act just as you'd expect: multiple backends can hold shared locks on the same buffer, but an exclusive lock prevents anyone else from holding either shared or exclusive lock. (These can alternatively be called READ and WRITE locks.) These locks are relatively short term: they should not be held for long. They are implemented as per-buffer spinlocks, so another backend trying to acquire a competing lock will spin as long as you hold yours! Buffer locks are acquired and released by LockBuffer(). It will *not* work for a single backend to try to acquire multiple locks on the same buffer. One must pin a buffer before trying to lock it.
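To make the calling pattern concrete, here is a minimal sketch of the pin-then-lock discipline described above. It is only an illustration, not code from any particular caller: it assumes a Relation "reln" and block number "blkno", and omits error handling and the actual examination of each tuple.

    Buffer      buf;
    Page        page;
    OffsetNumber offnum,
                maxoff;

    buf = ReadBuffer(reln, blkno);          /* acquire a pin on the page */
    LockBuffer(buf, BUFFER_LOCK_SHARE);     /* shared lock before scanning it */

    page = BufferGetPage(buf);
    maxoff = PageGetMaxOffsetNumber(page);
    for (offnum = FirstOffsetNumber;
         offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        ItemId      lp = PageGetItemId(page, offnum);

        if (!ItemIdIsUsed(lp))
            continue;
        /* examine the tuple at PageGetItem(page, lp) here */
    }

    LockBuffer(buf, BUFFER_LOCK_UNLOCK);    /* drop the lock; the pin may be kept */
    ReleaseBuffer(buf);                     /* drop the pin when done with the page */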
Buffer access rules:

1. To scan a page for tuples, one must hold a pin and either shared or exclusive lock. To examine the commit status (XIDs and status bits) of a tuple in a shared buffer, one must likewise hold a pin and either shared or exclusive lock.

2. Once one has determined that a tuple is interesting (visible to the current transaction) one may drop the buffer lock, yet continue to access the tuple's data for as long as one holds the buffer pin. This is what is typically done by heap scans, since the tuple returned by heap_fetch contains a pointer to tuple data in the shared buffer. Therefore the tuple cannot go away while the pin is held (see rule #5). Its state could change, but that is assumed not to matter after the initial determination of visibility is made.

3. To add a tuple or change the xmin/xmax fields of an existing tuple, one must hold a pin and an exclusive lock on the containing buffer. This ensures that no one else might see a partially-updated state of the tuple.

4. It is considered OK to update tuple commit status bits (ie, OR the values HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID, HEAP_XMAX_COMMITTED, or HEAP_XMAX_INVALID into t_infomask) while holding only a shared lock and pin on a buffer. This is OK because another backend looking at the tuple at about the same time would OR the same bits into the field, so there is little or no risk of conflicting update; what's more, if there did manage to be a conflict it would merely mean that one bit-update would be lost and need to be done again later.

5. To physically remove a tuple or compact free space on a page, one must hold a pin and an exclusive lock, *and* observe while holding the exclusive lock that the buffer's shared reference count is one (ie, no other backend holds a pin). If these conditions are met then no other backend can perform a page scan until the exclusive lock is dropped, and no other backend can be holding a reference to an existing tuple that it might expect to examine again. Note that another backend might pin the buffer (increment the refcount) while one is performing the cleanup, but it won't be able to actually examine the page until it acquires shared or exclusive lock.

Currently, the only operation that removes tuples or compacts free space is VACUUM. It does not have to implement rule #5 directly, because it instead acquires exclusive lock at the relation level, which ensures indirectly that no one else is accessing tuples of the relation at all. To implement concurrent VACUUM we will need to make it obey rule #5.

I believe that nbtree.c's btbuild() code is currently in violation of these rules, because it calls HeapTupleSatisfiesNow() while holding a pin but no lock on the containing buffer. Not only does this risk returning a wrong answer if it sees an intermediate state of the tuple xmin/xmax fields, but HeapTupleSatisfiesNow() may try to update the tuple's infomask. This could produce a wrong state of the infomask if there is a concurrent change of the infomask by an exclusive-lock holder. (Even if that catastrophe does not happen, SetBufferCommitInfoNeedsSave is not called as it should be when the status bits change.)

I am also disturbed that several places in heapam.c call HeapTupleSatisfiesUpdate without checking for infomask change and calling SetBufferCommitInfoNeedsSave. In most paths through the code, it seems the buffer will get marked dirty anyway, but wouldn't it be better to make this check?
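For what it's worth, a rules-conformant version of that visibility test would presumably look something like the sketch below. This is only an illustration, not a patch: the variable names ("buf" for the buffer already pinned by the heap scan, "tuple" for the HeapTuple being examined) and the exact HeapTupleSatisfiesNow() calling convention are assumed here just to show the locking and dirty-marking pattern.

    uint16      saved_infomask;
    bool        alive;

    LockBuffer(buf, BUFFER_LOCK_SHARE);            /* rule 1: lock before checking status */
    saved_infomask = tuple->t_data->t_infomask;
    alive = HeapTupleSatisfiesNow(tuple->t_data);  /* may OR commit-status bits into t_infomask */
    if (tuple->t_data->t_infomask != saved_infomask)
        SetBufferCommitInfoNeedsSave(buf);         /* per rule 4, record the hint-bit change */
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);           /* pin is still held by the scan */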