concurrent snapshots - Mailing list pgsql-hackers
From: Ants Aasma
Subject: concurrent snapshots
Msg-id: CA+CSw_uDfg2SBMicGNu13bpr2upbnVL_edoTbzvacR1FrNrZ1g@mail.gmail.com
Responses: Re: concurrent snapshots
List: pgsql-hackers
Hi,

I have been thinking about how to handle long running transactions with Robert's commit sequence number (CSN) idea. http://archives.postgresql.org/message-id/CA%2BTgmoaAjiq%3Dd%3DkYt3qNj%2BUvi%2BMB-aRovCwr75Ca9egx-Ks9Ag%40mail.gmail.com I have only just started to go through the transaction management code, so I would appreciate it if anyone could point out any gaping holes in my ideas.

There are two reasons why XID-to-CSN mappings need to be kept around. One is that long running transactions need a place to publish their CSN for CSN based snapshots. The other is that CSN based snapshots need to know which transactions between xmin and xmax are visible and which are invisible.

My idea is to use hybrid storage for the XID-to-CSN mapping. For recent transactions, shared memory contains a dense, XID-indexed ring buffer of CSN values. The head of the buffer moves with nextXid. The tail is also moved on XID assignment, but in larger chunks to reduce lock contention. When moving the tail up, any XID that might still be invisible to some snapshot is moved to a sparse buffer. To keep the size of the sparse buffer within limits, snapshots that are too old are converted into list-of-running-xids snapshots. This allows us to move up the horizon visible to all CSN based snapshots (csnmin) and release XIDs below that horizon from the sparse buffer. CSNs themselves can be uint32s; this means that no transaction can be older than 2 billion writing transactions, which is already enforced by XIDs being 32 bit values.

Taking a snapshot under this scheme would consist of reading an Xmin, an Xmax and a CSN. The snapshot Xmin guarantees that no XID below Xmin can have a CSN greater than the snapshot's. One way to get Xmin would be to maintain a global value when doing the tail update of the dense array. The snapshot Xmax guarantees that no XID above Xmax can have a CSN less than the snapshot's; Xmax can easily be obtained from nextXid. The CSN is simply the current value of the CSN counter.
After the snapshot is taken, the proc's csnmin needs to be updated if necessary, and the snapshot added to the list of acquired snapshots. All of the guarantees of taking a snapshot can be maintained without global locking if we have atomic reads/writes of CSN and XID values. We need to obtain them in the order Xmin -> CSN -> Xmax, inserting read barriers between the loads. Write side ordering is guaranteed by the CSN lock. Updating the proc entry can be done under the per-proc lock.

To check the visibility of an XID between the snapshot's xmin and xmax, we first check whether it is above the tail of the dense XID array. If it is in the dense array, we just read the CSN value and compare it to the snapshot's. If the XID falls under the ring buffer tail, we go through the sparse buffer. If the XID is found in the sparse buffer, we again read the CSN value and compare. If it is not found, it is guaranteed to have completed before the snapshot, so we just read the clog status for the transaction. This is all done without locking. When looking something up in the dense array, we have to issue a read barrier and recheck the tail location to ensure that it hasn't moved out from under us. If it has, we retry.

To commit a transaction, we take the CSN lock, stamp all our XIDs with the current CSN value, update clog, issue a write barrier to ensure that lock-free readers see the changes, increment the CSN and release the lock. For each of our XIDs we need to check whether it is above or below the dense ring buffer tail and update the respective structure.

Moving the dense ring buffer tail should be done when assigning XIDs; this allows us to enforce a hard limit on the size of the ring buffer. To reduce contention on the CSN lock, this should be done in large increments; 1/8 of the ring buffer at a time seems like a good compromise between frequency and buffer space efficiency. To move the tail, we first find the global minimum of the snapshot CSNs (csnmin).
If every proc maintains its own csnmin under the per-proc lock, we can just read the current CSN and go through all the procs, taking each per-proc lock and fetching its csnmin, to get the global value. Then we go through the XIDs between the old tail and the new tail and find all that are still running or not visible to csnmin. We insert those XIDs into the sparse buffer, issue a write barrier so they are visible to everyone, and then update the tail location. Because we are trawling through the procarray here anyway, it is also a good time to update the global xmin.

Because we want to avoid unbounded growth of the sparse buffer, we need to get rid of old CSN based snapshots. Tying this into XID assignment is a good way to get some provable limits, even if subtransactions can overflow somehow. For instance, we can keep track of the CSN during the last few tail updates and convert any snapshots that are too old; this caps the sparse buffer at O(num_backends*non-overflowed-sub-txs + n_steps*tail-update-step) entries. When we try to find the new csnmin and discover that a backend has a csnmin that is too old, we go through the snapshots of that backend and convert every snapshot under the desired csnmin to a traditional snapshot.

To convert a snapshot to a traditional one, we must find the list of XIDs that were running when its CSN was assigned. For this, we go through the sparse buffer and find all XIDs between xmin and xmax that have a CSN above the snapshot's or are still running. While doing that, we might as well update xmin and xmax to tighter bounds, and if xmax - xmin is small enough, perhaps store the set as a bitmap. When done, we assign the array to the snapshot, issue a write barrier, and update a flag on the snapshot that tells it to use the embedded running-xids array when checking visibility. As with updating the tail of the dense array, the status of this flag needs to be rechecked after any CSN based visibility check to ensure correctness.
After we are sure that no snapshot exists above our new csnmin, we can go through the sparse buffer and evict the XIDs that are visible to all. Moving the tail needn't be done under XidGenLock; a separate denseHorizonUpdate lock would do. Commits would still block, though, unless we are willing to either make snapshots block or hand out stale snapshots.

The sparse buffer needs to be a single-updater, bounded-size, lock-free data structure. When inserting into it, we need to guarantee visibility before the dense array tail update. When removing from it, correct traversal needs to be guaranteed, and that no one can read an entry after it is reused.

The dense buffer needs to be sized so that most writing transactions complete before falling off the ring buffer and most snapshots are released before being converted. Something like num_backends*16 might be a good start; with 4-byte CSNs, this would use 4KB of memory for every 64 connections.

Under this scheme, taking a snapshot would be O(1), committing would be O(subtransactions), and getting a new XID would be O(1 + num_backends/dense buffer step), which is O(1) amortized if the buffer step is scaled with num_backends. The worst case seems to be num_backends - 1 backends with long read/write transactions and one backend with a very high write transaction rate.

I haven't yet thought through how subtransactions and replication would work with this scheme, but I feel that they don't change any of the basic characteristics.

Your thoughts?

--
Ants Aasma