Thread: Bulk Inserts
I've done a little experiment with bulk inserts.

=> heap_bulk_insert()

Behaves like heap_insert except it takes an array of tuples (HeapTuple *tups, int ntups).

- Grabs a page (same as heap_insert)
- While holding exclusive lock, inserts as many tuples as it can on the page.
  - Either the page gets full
  - Or we run out of tuples.
- Generate xlog : choice between
  - Full Xlog mode :
    - if we inserted more than 10 tuples (totally bogus heuristic), log the entire page
    - Else, log individual tuples as heap_insert does
  - Light log mode :
    - if page was empty, only xlog a "new empty page" record, not page contents
    - else, log fully
    - heap_sync() at the end
- Release the page
- If we still have tuples to insert, repeat.

Am I right in assuming that :

1)
- If the page was empty,
- and log archiving isn't used,
- and the table is heap_sync()'d at the end,
=> only a "new empty page" record needs to be created, then the page can be completely filled ?

2)
- If the page isn't empty
- or log archiving is used,
=> logging either the inserted tuples or the entire page is OK to guarantee persistence ?

(I used kill -9 to test it ; recovery seems to work.)

Test on a concurrent COPY, 4 threads, on a table with 8 INT columns :

* 8.5 HEAD : Total Time 44 s
* Bulk inserts, Full XLog : Total Time 24 s
* Bulk inserts, Light XLog : Total Time 10 s

Quite a bit faster... I presume with more CPUs it would scale.

I'm not posting the patch because it's quite ugly (especially the part to store tuples in copy.c and bulk-insert them ; I should probably have used a tuplestore...). I think the tuples need to be stored and then bulk-inserted because the exclusive lock on the buffer can't be held for a long time.
Lock stats (from the patch I just posted) :

* 8.5 HEAD : Total Time 44 s

-------- Lock stats for PID 28043
PID    Lock  ShAcq ShWait ShWaitT ShHeldT   ExAcq  ExWait  ExWaitT           ExHeldT          Name
28043     7      0      0    0.00    0.00 2500002  804378  23.59 ( 53.11 %)  7.38 ( 16.61 %) WALInsert
28043     8      0      0    0.00    0.00   25775      32   2.91 (  6.54 %)  0.90 (  2.02 %) WALWrite

-------- Lock stats for PID 28044
PID    Lock  ShAcq ShWait ShWaitT ShHeldT   ExAcq  ExWait  ExWaitT           ExHeldT          Name
28044     7      0      0    0.00    0.00 2500002  802515  22.26 ( 50.11 %)  8.70 ( 19.59 %) WALInsert
28044     8      0      0    0.00    0.00   25620      42   4.00 (  9.01 %)  1.12 (  2.52 %) WALWrite

-------- Lock stats for PID 28045
PID    Lock  ShAcq ShWait ShWaitT ShHeldT   ExAcq  ExWait  ExWaitT           ExHeldT          Name
28045     7      0      0    0.00    0.00 2500002  799145  22.47 ( 50.32 %)  8.72 ( 19.52 %) WALInsert
28045     8      0      0    0.00    0.00   25725      38   4.08 (  9.14 %)  1.05 (  2.35 %) WALWrite

-------- Lock stats for PID 28042
PID    Lock  ShAcq ShWait ShWaitT ShHeldT   ExAcq  ExWait  ExWaitT           ExHeldT          Name
28042     7      0      0    0.00    0.00 2500002  809477  23.49 ( 52.44 %)  7.89 ( 17.62 %) WALInsert
28042     8      0      0    0.00    0.00   25601      37   3.27 (  7.31 %)  1.05 (  2.34 %) WALWrite

* Bulk inserts, Full XLog : Total Time 24 s

-------- Lock stats for PID 32486
PID    Lock  ShAcq ShWait ShWaitT ShHeldT   ExAcq  ExWait  ExWaitT           ExHeldT          Name
32486     7      0      0    0.00    0.00   23765    1128   9.22 ( 38.98 %)  4.05 ( 17.14 %) WALInsert
32486     8      0      0    0.00    0.00   21120      19   2.64 ( 11.17 %)  1.32 (  5.59 %) WALWrite

-------- Lock stats for PID 32484
PID    Lock  ShAcq ShWait ShWaitT ShHeldT   ExAcq  ExWait  ExWaitT           ExHeldT          Name
32484     7      0      0    0.00    0.00   23865    1083   9.87 ( 41.68 %)  2.87 ( 12.11 %) WALInsert
32484     8      0      0    0.00    0.00   21105      11   1.68 (  7.11 %)  1.09 (  4.62 %) WALWrite
32484  8508      0      0    0.00    0.00       1       1   0.19 (  0.81 %)  0.00 (  0.00 %)
32484 18846      0      0    0.00    0.00       1       1   0.25 (  1.05 %)  0.00 (  0.00 %)

-------- Lock stats for PID 32485
PID    Lock  ShAcq ShWait ShWaitT ShHeldT   ExAcq  ExWait  ExWaitT           ExHeldT          Name
32485     7      0      0    0.00    0.00   23816    1107   8.94 ( 37.75 %)  4.05 ( 17.09 %) WALInsert
32485     8      0      0    0.00    0.00   21109      21   2.59 ( 10.93 %)  1.36 (  5.77 %) WALWrite
32485 16618      0      0    0.00    0.00       1       2   0.23 (  0.98 %)  0.00 (  0.00 %)

-------- Lock stats for PID 32482
PID    Lock  ShAcq ShWait ShWaitT ShHeldT   ExAcq  ExWait  ExWaitT           ExHeldT          Name
32482     7      0      0    0.00    0.00   23813    1053   9.70 ( 40.75 %)  3.41 ( 14.32 %) WALInsert
32482     8      0      0    0.00    0.00   21119      15   2.24 (  9.43 %)  1.06 (  4.44 %) WALWrite
32482  6770      0      0    0.00    0.00       3       1   0.17 (  0.70 %)  0.00 (  0.00 %)

* Bulk inserts, Light XLog : Total Time 10 s

No lock stats to show (wait time is < 0.01 s)...
Replying to myself...

Jeff suggested to build pages in local memory and insert them later in the table. This is what's used in CLUSTER, for instance, I believe.

It has some drawbacks though :

- To insert the tuples in indexes, the tuples need tids, but if you build the page in local memory, you don't know which page they will land on until after allocating the page, which will probably be done after the page is built, so it's a bit of a chicken-and-egg problem.

- It only works on new pages. Pages which are not empty, but have free space, cannot be written in this way.

The little experiment I made yesterday does not have these drawbacks, since it allocates pages in the standard way ; it simply inserts many tuples in one operation instead of just one. If the page happened to be empty, it's even better, but it's not necessary. If your table has lots of free space, it will be used.
2009/9/14 Pierre Frédéric Caillaud <lists@peufeu.com>
Does that heuristic change the timings much? If not, it seems like it would be better to keep it simple and always do the same thing, like log the tuples (if it is done under one WALInsertLock, which I am assuming it is..)
Do you even need the new empty page record? I think a zero page will be handled correctly next time it is read into shared buffers, won't it? But I guess it is needed to avoid problems with partial page writes that would leave the page in a state that is neither all zeros nor consistent.
If the entire page is logged, would it have to be marked as not removable by the log compression tool? Or can the tool recreate the needed delta?
Jeff
2009/9/14 Pierre Frédéric Caillaud <lists@peufeu.com>
Yes, I did not consider that to be a problem because I did not think it would be used on indexed tables. I figured that the gain from doing bulk inserts into the table would be so diluted by the still-bottle-necked index maintenance that it was OK not to use this optimization for indexed tables.
My original thought was based on the idea of still using heap_insert, but with a modified form of bistate which would hold the exclusive lock and not just a pin. If heap_insert is being driven by the unmodified COPY code, then it can't guarantee that COPY won't stall on a pipe read or something, and so probably shouldn't hold an exclusive lock while filling the block. That is why I decided a local buffer would be better, as the exclusive lock is really a no-op and wouldn't block anyone. But if you are creating a new heap_bulk_insert and modifying the COPY to go with it, then you can guarantee it won't stall from the driving end, instead.
Whether any of these approaches will be maintainable enough to be integrated into the code base is another matter. It seems like there is already a lot of discussion going on around various permutations of copy options.
Cheers,
Jeff
> Yes, I did not consider that to be a problem because I did not think it
> would be used on indexed tables. I figured that the gain from doing bulk
> inserts into the table would be so diluted by the still-bottle-necked
> index maintenance that it was OK not to use this optimization for
> indexed tables.

I've tested with indexes, and the index update time is much larger than the insert time. Bulk inserts still provide a little bonus though, and having a solution that works in all cases is better IMHO.

> My original thought was based on the idea of still using heap_insert, but
> with a modified form of bistate which would hold the exclusive lock and not
> just a pin. If heap_insert is being driven by the unmodified COPY code,
> then it can't guarantee that COPY won't stall on a pipe read or something,
> and so probably shouldn't hold an exclusive lock while filling the block.

Exactly, that's what I was thinking too, and I reached the same conclusion.

> That is why I decided a local buffer would be better, as the exclusive lock
> is really a no-op and wouldn't block anyone. But if you are creating a new
> heap_bulk_insert and modifying the COPY to go with it, then you can
> guarantee it won't stall from the driving end, instead.

I think it's better, but you have to buffer tuples : at least a full page's worth, or better, several pages' worth of tuples, in case inline compression kicks in and shrinks them, since the purpose is to be able to fill a complete page in one go.

> Whether any of these approaches will be maintainable enough to be
> integrated into the code base is another matter. It seems like there is
> already a lot of discussion going on around various permutations of copy
> options.

It's not really a COPY mod, since it would also be good for a big INSERT INTO ... SELECT FROM, which is wal-bound too (even more so than COPY, since there is no parsing to do).
> Does that heuristic change the timings much? If not, it seems like it would
> be better to keep it simple and always do the same thing, like log the tuples
> (if it is done under one WALInsertLock, which I am assuming it is..)

It is the logging of whole pages that makes it faster.

If you fill a page with tuples in one operation (while holding exclusive lock) and then insert WAL records for each tuple, there is no speed gain.

Inserting a full page WAL record (since you just filled the page completely) :
- only takes WALInsertLock once instead of once per tuple
- reduces wal traffic
- is about 2x faster in my benchmark

And inserting a "clear new page" record (if the page was previously new/empty and the relation is fsync'd at the end) :
- only takes WALInsertLock once instead of once per tuple
- reduces wal traffic a lot
- is about 4x faster in my benchmark

> Do you even need the new empty page record? I think a zero page will be
> handled correctly next time it is read into shared buffers, won't it?

I have no idea ;)

> But I guess it is needed to avoid problems with partial page writes that
> would leave the page in a state that is neither all zeros nor consistent.

Plus, empty page records make for very small WAL traffic, and I didn't see any performance difference with or without them.

> If the entire page is logged, would it have to be marked as not removable by
> the log compression tool? Or can the tool recreate the needed delta?

No, the tool cannot recreate the data, since the idea is precisely to replace a lot of "tuple insert" messages with one "entire page" message, which takes both less space and less time. The warm-standby replicators that get this WAL need to know the page contents to replicate it... (Also, it will probably be faster for them to redo a page write than to redo all the tuple inserts.)

Here is what I'm thinking about now :

* have some kind of BulkInsertState which contains
  - info about the relation, indexes, triggers, etc
  - a tuple queue.
The tuple queue may be a tuplestore, or simply tuple copies in a local memory context. You'd have functions to :
- Set up the BulkInsertState
- Add a tuple to the BulkInsertState
- Finish the operation and clear the BulkInsertState

When adding a tuple, it is stored in the queue. When the queue is full, a bulk insert operation takes place ; hopefully we can fill an entire page, and return. Post-insert triggers and index updates are also handled at this point.

When finished, the function that clears the state also inserts all remaining tuples in the queue.

With this you could also do something *really* interesting : bulk index updates... Bulk index updates are probably mutually exclusive with after-row triggers though.

Another angle of attack would be to make wal-writing more efficient...
2009/9/15 Pierre Frédéric Caillaud <lists@peufeu.com>
OK, that makes sense. I thought you had hacked either XLogInsert or the heap WAL replay code so that you could just accumulate tuples in the rdata chain and then submit them all under the cover of a single WALInsertLock. If you haven't done that, then of course doing the bulk insert doesn't help much if you still do tuple-by-tuple XLogInsert. So in the case where it is under the limit, you first run through the tuples putting them into the block, then run through the tuples again doing the XLogInserts?
Do you have an IO bottleneck even in the absence of fsyncs? My experience on multi-core machines with decent IO systems has been that the amount of WAL traffic (by volume) matters rather little, as opposed to the number of WALInsertLocks taken, which matters quite a bit. Of course this depends quite a bit on your OS and hardware.
...
If you mean to do this without changing the xlog interfaces, I'm not optimistic.
If you have to call XLogInsert once per row that is copied (or insert...select), then my experiments show that simply taking the WALInsertLock and immediately releasing it, doing absolutely no real work while it is held, is already a substantial multi-core scalability bottleneck. Once we accept that this must be done, the next existing bottleneck is the memcpy of the first byte from the rdata chain into the shared wal_buffers, presumably because this copy involves fighting the cache line away from other cores. Once you've copied the first byte, the rest of them seem to be almost free. (Again, this is probably hardware and situation dependent.)
I've seen some suggestions that the wal_buffer block initialization work be moved from being done by AdvanceXLInsert to instead be done by XLogWrite. However, I've not seen any indication that AdvanceXLInsert is a meaningful bottleneck in the first place. Except when wal_buffers is too small: then AdvanceXLInsert is a bottleneck, but only because XLogWrite is getting called from within it, in which case moving work from one to the other is probably not going to make things better.
Jeff
> OK, that makes sense. I thought you had hacked either XLogInsert or the
> heap WAL replay code so that you could just accumulate tuples in the rdata
> chain and then submit them all under the cover of a single WALInsertLock.

Ah, no, I did not do that.

This would be difficult to do : the rdata chain contains buffer pointers, and when we are in XLogInsert, we have an exclusive lock on the buffer. If XLogInsert accumulated xlog records as you say, then later, when it's time to write them to the xlog, it would no longer hold the exclusive lock on the buffer, so its contents could have changed ; and if XLogInsert decides to do a full page write, the contents will be wrong.

Besides, the LSN needs to be set in the page at every call to heap_insert (or else WAL will not be "Ahead"), so if XLogInsert accumulated stuff before writing it, it would need a mechanism to assign an LSN without having written the xlog data yet.

> If you haven't done that, then of course doing the bulk insert doesn't
> help much if you still do tuple-by-tuple XLogInsert.

Exactly.

> So in the case where it is under the limit, you first run through the
> tuples putting them into the block, then run through the tuples again
> doing the XLogInserts?

Yes, exactly. This isn't really optimal...

I wonder if I could build one rdata chain containing all updates to the tuples and submit it in one go. Would it work ?...

> Do you have an IO bottleneck even in the absence of fsyncs?

Sometimes, yes :

- When xlog is on a (rather slow) software RAID1, I've found that the fsyncs which happen when switching to a new xlog segment are significant, and the amount of log data written is dangerously close to the max throughput, too.

- When xlog is on a much faster RAID5, there is no such problem. fsync on RAID5 is horrendously slow if you have many pending small writes, but if all you need to sync is a 16MB file which nicely aligns on raid stripes, it's not a problem.
So in order to benchmark the right thing, I have :
- all the tables in a big RAMDISK
- xlog on RAID5
- fsync=fdatasync

I've also found that setting wal_buffers to a large value like 128MB gives a significant speed boost when doing COPY or INSERT INTO SELECT, probably because it allows the backends to always find space in the buffers even if the walwriter is a bit busy.

> My experience on multi-core machines with decent IO systems has been that
> the amount of WAL traffic (by volume) matters rather little, as opposed to
> the number of WALInsertLocks taken, which matters quite a bit. Of course
> this depends quite a bit on your OS and hardware.

Exactly : with the configuration above, iowait is extremely small (< 1 %), yet all processes still spend 50 % of their time waiting on WALInsertLock.

The lock is held for about 15 % of the total time ; with 4 cores and 4 threads, if a core spends X time holding the lock, it's logical that the others spend 3*X time waiting on it. But consider this :
- 4 processes spend 50 % of their time waiting on the lock
- therefore at any time there are on average 2 processes waiting
- therefore every time the lock is released, a process is woken up
=> more than 200,000 context switches per second seen in vmstat

Actually it does 1 context switch every two inserted rows, which is enormous...

I've done a bit of profiling and found that 70 % of the time spent holding this lock is... computing the header's CRC. Just for a quick check, I removed all CRC calculations. There was absolutely no performance gain.

>> Another angle of attack would be to make wal-writing more efficient...

> If you mean to do this without changing the xlog interfaces, I'm not
> optimistic.
Agree : that's why I didn't even try ;)

> If you have to call XLogInsert once per row that is copied (or
> insert...select), then my experiments show that simply taking the
> WALInsertLock and immediately releasing it, doing absolutely no real work
> while it is held, is already a substantial multi-core scalability
> bottleneck.

Confirmed by the context switch issue above... Having all cores block on the same lock for every row can be OK if it's a spinlock protecting just a few lines of code... which is not the present case...

> Once we accept that this must be done, the next existing bottleneck is the
> memcpy of the first byte from the rdata chain into the shared wal_buffers,
> presumably because this copy involves fighting the cache line away from
> other cores. Once you've copied the first byte, the rest of them seem to
> be almost free. (Again, this is probably hardware and situation dependent.)

Since I have one 4-core CPU I probably don't see this effect (a multi-CPU case would).

> I've seen some suggestions that the wal_buffer block initialization work be
> moved from being done by AdvanceXLInsert to instead be done by XLogWrite.
> However, I've not seen any indication that AdvanceXLInsert is a meaningful
> bottleneck in the first place. Except when wal_buffers is too small: then
> AdvanceXLInsert is a bottleneck, but only because XLogWrite is getting
> called from within it, in which case moving work from one to the other is
> probably not going to make things better.

That correlates with the benefits I saw by increasing wal_buffers to a very large value... I'm going to try other things and report back.
2009/9/16 Pierre Frédéric Caillaud <lists@peufeu.com>
Yes, I didn't mean to make XLogInsert unilaterally accumulate rdata. (Actually I have done that, purely as a proof of concept. The resulting database is completely unrecoverable, but as long as you don't bring it down, it runs fine and lets me test the speed of different concepts without going to the trouble of implementing them correctly.)
heap_bulk_insert would do the accumulation. The hack to XLogInsert would involve making it insert an extra dummy xlog record header for every tuple in the rdata chain. Alternatively, the hack to heap replay would involve making it accept multiple tuples reported under a single WAL record. I don't know which one would be easier.
Right, XLogInsert would only be able to accumulate as long as it knew it was going to get called again before the buffer exclusive lock was released. That is why the accumulation is better done in the heap_bulk_insert, otherwise it would require an unfortunate amount of communication between the two.
I'm sure it could be made to work. I haven't checked the replay code, but I doubt it would work on this massed record right out of the box. Or we could insert dummy headers between each tuple's WAL data.
...
That seems very large, even for the high throughput set up you describe. Is the WAL background writer set at the default interval? (On the other hand, if 128MB is just a rounding error in your machine's total RAM size, why not be generous? On the other other hand, I've seen performance start to drop as wal_buffers gets too large, though I never bothered to chase down the cause.)
At one point, I think that XLogInsert would synchronously write out the buffer when it noticed it was over half full, but that got ripped out (if I'm going to block, I might as well wait until it actually is full to do so). Now that there is a background writer, maybe it would be a good idea to have XLogInserters wake it up if the buffer is half full.
...
Maybe there could be some hybrid approach. You take the spinlock, check that you could get the lwlock if you wanted to. If you could get the lwlock and know you have almost no work to do, then just do it while holding the spinlock. If you discover you have more work to do (xlog file switch, nontrivial AdvanceXLInsert, etc.) then actually take the lwlock and drop the spinlock. This *really* breaks the encapsulation of lwlock, so it is probably not much more than idle speculation.
Jeff
Ah, no, I did not do that.OK, that makes sense. I thought you had hacked either XLogInsert or the
heap WAL replay code so that you could just accumulate tuples in the rdata
chain and then submit them all under the cover of a single WALInsertLock.
This would be difficult to do : rdata chain contains buffer pointers, and when we are in XLogInsert, we have an exclusive lock on the buffer. If XLogInsert accumulated xlog records as you say, then later, when it's time to write them to the xlog, it would no longer hold exclusive lock on the buffer, so its contents could have changed, and if XLogInsert decides to do a full page write, the contents will be wrong.
Yes, I didn't mean to make XLogInsert unilaterally accumulate rdata (Actually I have done that, purely as a proof of concept. The resulting database is completely unrecoverable, but as long you don't bring it down, it runs fine and lets me test the speed of different concepts without going to the trouble of implementing them correctly).
heap_bulk_insert would do the accumulation. The hack to XLogInsert would involve making it insert an extra dummy xlog record header for every tuple in the rdata chain. Alternatively, the hack to heap replay would involve making it accept multiple tuples reported under a single WAL record. I don't know which one would be easier.
Besides, the LSN needs to be set in the page at every call to heap_insert (or else WAL will not be "Ahead"), so if XLogInsert accumulated stuff before writing it, it would need a mechanism to assign a LSN without having written the xlog data yet.
Right, XLogInsert would only be able to accumulate as long as it knew it was going to get called again before the buffer exclusive lock was released. That is why the accumulation is better done in the heap_bulk_insert, otherwise it would require an unfortunate amount of communication between the two.
If you haven't done that, then of course doing the bulk insert doesn't help much if you still do tuple-by-tuple XLogInsert.
Exactly.Yes, exactly. This isn't really optimal...So in the case that it is
under the limit, you first run through the tuples putting them into the
block, then run through the tuples again doing the XLogInserts?
I wonder if I could build one rdata chain containing all updates to the tuples and submit it in one go. Would it work ?...
I'm sure it could be made to work. I haven't checked the replay code, but I doubt it would work on this massed record right out of the box. Or we could insert dummy headers between the each tuple's WAL data.
...
So in order to benchmark the right thing, I have :
- all the tables in a big RAMDISK
- xlog on RAID5
- fsync=fdatasync
I've also found that setting wal_buffers to a large value like 128MB gives a significant speed boost when doing COPY or INSERT INTO SELECT, probably because it allows the backends to always find space in the buffers even if the walwriter is a bit busy.
That seems very large, even for the high throughput set up you describe. Is the WAL background writer set at the default interval? (On the other hand, if 128MB is just a rounding error in your machine's total RAM size, why not be generous? On the other other hand, I've seen perfomance start to drop as wal_buffers gets too large, though I never bothered to chased down the cause.)
At one point, I think that XLogInsert would synchronously write out the buffer when it noticed it was over half full, but that got ripped out (if I'm going to block, I might as well wait until it actually is full to do). Now that there is a background writer, maybe it would be a good idea to have XLogInserters wake it up if the buffer is half full.
...
> Agreed: that's why I didn't even try ;)
> Another angle of attack would be to make WAL writing more efficient...

If you mean to do this without changing the xlog interfaces, I'm not
optimistic.

> Confirmed by the context switch issue above...

If you have to call XLogInsert once per row that is copied (or
INSERT...SELECT), then my experiments show that simply taking the
WALInsertLock and immediately releasing it, doing absolutely no real work
while it is held, is already a substantial multi-core scalability
bottleneck.
Having all cores block on the same lock for every row can be OK if it's a spinlock protecting just a few lines of code... which is not the present case...
Maybe there could be some hybrid approach. You take the spinlock, then check that you could get the lwlock if you wanted to. If you could get the lwlock and know you have almost no work to do, then just do the work while holding the spinlock. If you discover you have more work to do (xlog file switch, nontrivial AdvanceXLInsert, etc.) then actually take the lwlock and drop the spinlock. This *really* breaks the encapsulation of lwlock, so it is probably not much more than idle speculation.
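For what it's worth, the hybrid scheme could be sketched roughly like this. All names are mine, and the locks are toy C11 atomic_flag locks standing in for PostgreSQL's slock_t and LWLock, so this only illustrates the control flow, not real lock semantics:

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_flag spin;     /* stand-in for a slock_t */
    atomic_flag lwlock;   /* stand-in for WALInsertLock */
    long        inserted; /* toy shared state */
} HybridLock;

static bool try_acquire(atomic_flag *f) { return !atomic_flag_test_and_set(f); }
static void acquire(atomic_flag *f)     { while (atomic_flag_test_and_set(f)) ; }
static void release(atomic_flag *f)     { atomic_flag_clear(f); }

/* trivial=true means the caller knows no segment switch, no nontrivial
 * AdvanceXLInsert, etc. is needed. */
static void
hybrid_insert(HybridLock *h, bool trivial)
{
    acquire(&h->spin);
    if (trivial && try_acquire(&h->lwlock))
    {
        /* Fast path: the lwlock was free and the work is tiny, so do
         * it right here, still under the spinlock. */
        h->inserted++;
        release(&h->lwlock);
        release(&h->spin);
        return;
    }
    /* Slow path: drop the spinlock and queue on the lwlock for real. */
    release(&h->spin);
    acquire(&h->lwlock);
    h->inserted++;
    release(&h->lwlock);
}
```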
Jeff
> heap_bulk_insert would do the accumulation. The hack to XLogInsert would
> involve making it insert an extra dummy xlog record header for every
> tuple

A WAL record header is something like 40 bytes, so if you make lots of
inserts in a page, you'd be better off WALing the whole page: it takes
less space, and it is much faster. And if you're coming right after a
checkpoint, you're going to log the full page anyway on the first
inserted row, so it would be even better to do it on the last inserted
row... A WAL insert record is 55 bytes + tuple data.

However, you can't hold an exclusive lock on a buffer while going out
into the executor to fetch the next tuple, since that can take an
undetermined amount of time: imagine the record comes from a dblink()
and there is a connection loss... So you'd need a TupleBuffer, something
like a tuplestore that doesn't spill to disk and holds only about 32 kB
of tuples. Then:

- fill the buffer with tuples coming from the executor (or COPY),
- pass it to heap_bulk_insert(), which
- toasts the tuples (maybe also bulk-inserting into the TOAST table),
- takes the exclusive lock on the buffer,
- inserts the tuples quickly,
- logs the whole page when it's full, or logs the individual inserts (in
  one operation),
- and releases the lock.

> in the rdata chain. Alternatively, the hack to heap replay would involve
> making it accept multiple tuples reported under a single WAL record. I
> don't know which one would be easier.

This one is probably easier, since all your WAL records refer to the
same buffer.

> I'm sure it could be made to work. I haven't checked the replay code,
> but I doubt it would work on this massed record right out of the box.
> Or we could insert dummy headers between each tuple's WAL data.

For small rows, the header is as large as the row... might as well get
rid of it!
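The TupleBuffer idea above could look something like this. This is a hypothetical sketch, not code from the patch: the type, names, and limits are mine, and real tuples would be HeapTuples rather than raw byte ranges:

```c
#include <stdbool.h>
#include <string.h>

/* A fixed-size in-memory accumulator, never spilling to disk, capped
 * at ~32 kB, that collects rows from the executor/COPY before a single
 * heap_bulk_insert()-style call.  tupbuf_add() returns false when the
 * tuple doesn't fit, signalling "bulk-insert now, then reset". */
#define TUPBUF_CAPACITY   (32 * 1024)
#define TUPBUF_MAX_TUPLES 256

typedef struct {
    char   data[TUPBUF_CAPACITY];
    size_t used;
    int    ntups;
    size_t offsets[TUPBUF_MAX_TUPLES]; /* start of each stored tuple */
    size_t lengths[TUPBUF_MAX_TUPLES];
} TupleBuffer;

static void
tupbuf_reset(TupleBuffer *b)
{
    b->used = 0;
    b->ntups = 0;
}

static bool
tupbuf_add(TupleBuffer *b, const void *tup, size_t len)
{
    if (b->ntups >= TUPBUF_MAX_TUPLES || b->used + len > TUPBUF_CAPACITY)
        return false;               /* caller should flush and reset */
    memcpy(b->data + b->used, tup, len);
    b->offsets[b->ntups] = b->used;
    b->lengths[b->ntups] = len;
    b->used += len;
    b->ntups++;
    return true;
}
```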
Inserting dummy headers would not work. Here's why:

- The critical part (lock-wise) of XLogInsert is generating an LSN.
- The LSN calculation needs to verify that the header isn't split
  between pages.
- Therefore, if you want to insert N records in one operation, this
  critical part is going to take quite some time.
- And the lock should be held for as little time as possible...

> That seems very large, even for the high-throughput setup you
> describe. Is the WAL background writer set at the default interval?
> (On the other hand, if 128MB is just a rounding error in your
> machine's total RAM size, why not be generous? On the other other
> hand, I've seen performance start to drop as wal_buffers gets too
> large, though I never bothered to chase down the cause.)
>
> At one point, I think that XLogInsert would synchronously write out
> the buffer when it noticed it was over half full, but that got ripped
> out (if I'm going to block, I might as well wait until it actually is
> full to do so). Now that there is a background writer, maybe it would
> be a good idea to have XLogInserters wake it up if the buffer is half
> full.

Here's what I think: currently, using big WAL buffers decreases
performance (while you'd expect them to increase it). Here's why.

Case 1) Consider these concurrent transactions:

- Transaction 1: a big operation (VACUUM, COPY, etc.)
- Transaction 2: BEGIN; do a few little UPDATEs... COMMIT

Suppose you have a large wal_buffers setting, like 32MB. By the time
transaction 2 wants to commit and sync, transaction 1 has inserted lots
of WAL in the buffer, so transaction 2 finds it needs to write and
fsync() something like 2x 16 MB WAL segments. Needless to say,
transaction 2's commit is going to be pretty slow. With small buffers,
all that data would have been in the OS cache, if not already on the
disks.

Case 2) You just do a big COPY, or a big VACUUM that writes lots of
xlog. Suppose you're generating something like 50 MB/s of WAL.
You've set your WAL writer delay to something very low, like 20 ms, so
the buffers are nicely emptied. However, at this throughput you're going
through one xlog segment every ~300 ms, so when XLogWrite() tries to
fsync() the segment to advance to the next one, the data to write is in
the OS cache, but the OS probably hasn't decided to write it to disk
yet. So your fsync() call is very expensive, and it is blocking; this
means that if your buffers are smaller than a WAL segment, the buffers
are full while you wait for fsync(), and everything stands still.

Here's my current test setup:

- WAL on a dedicated disk (this is really good for performance, unless
  you've got a battery-backed-up controller)
- fsync = o_dsync
- a very aggressive WAL writer delay, like 5 ms

Basically, all the WAL disk does is write WAL, so the head hardly moves
at all. The disk takes about 8 ms per rotation (this is less than
wal_writer_delay). Whenever there is WAL to write, the walwriter writes
it, and blocks because of O_DSYNC mode. So, basically, at each disk
rotation, the pending WAL is written. Then, fsync() is a no-op.

This gives excellent, very smooth performance, whereas fsync() gives
quite random, and sometimes pretty high, wait times, since the length of
the fsync() wait depends on how much unflushed data sits in the OS
buffers.

However, there is a problem with O_DSYNC:

- Latency is excellent, but throughput is miserable, because you can
  only write one chunk per disk rotation.
- If all you need to write is a few pages, and you want to sync them,
  it's perfect.
- If you want to write at a high throughput, though, you'd better write
  data in LARGE chunks, ideally one WAL segment at a time.

So, O_DSYNC is much better with large WAL buffers (>16 MB) and a small
walwriter delay, ideally delay < disk rotation.
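To put rough numbers on the two cases and the O_DSYNC rotation limit, here is a back-of-the-envelope sketch (the 16 MB segment size, 50 MB/s rate, and 8 ms rotation come from the text; the helper names are mine):

```c
/* 16 MB is the standard WAL segment size. */
#define WAL_SEGMENT_MB 16

/* Case 1: a small commit may have to flush everything sitting in
 * wal_buffers, expressed here in whole WAL segments. */
static int segments_to_flush(int wal_buffers_mb) {
    return wal_buffers_mb / WAL_SEGMENT_MB;
}

/* Case 2: milliseconds to fill one 16 MB segment at a given WAL rate. */
static int ms_per_segment(int wal_mb_per_sec) {
    return WAL_SEGMENT_MB * 1000 / wal_mb_per_sec;
}

/* O_DSYNC: with one synchronous write per disk rotation, throughput is
 * bounded by chunk size over rotation time. */
static double odsync_throughput_mb_s(double chunk_kb, double rotation_ms) {
    return (chunk_kb / 1024.0) / (rotation_ms / 1000.0);
}
```

With 32 MB of wal_buffers that's up to two full segments to flush at commit; at 50 MB/s a segment fills every ~320 ms; and an 8 kB page per 8 ms rotation caps O_DSYNC at under 1 MB/s unless you write big chunks.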
> Maybe there could be some hybrid approach. You take the spinlock, check
> that you could get the lwlock if you wanted to. If you could get the
> lwlock and know you have almost no work to do, then just do it while
> holding the spinlock. If you discover you have more work to do (xlog
> file switch, nontrivial AdvanceXLInsert, etc.) then actually take the
> lwlock and drop the spinlock. This *really* breaks the encapsulation
> of lwlock, so it is probably not much more than idle speculation.

That's a description of something that could look like a futex ;)

Anyway, I've made a little patch (attached). It applies to git revision
1384847ef8731718a79a32cd354e31c31c5294a0 of current postgres (from last
week). It probably applies to current HEAD, but I have to learn how to
do a merge in git before I can tell you that ;)

What it does, briefly (I've put extremely detailed comments in xlog.c):

Since the critical part of XLogInsert is generating an LSN, not writing
the data, I have created an LSN generator, which is extremely fast, so
it is protected by a spinlock. Using anything other than a spinlock
creates a nice point of serialization and kills all performance gains.
Concurrent XLogInserts get an LSN from the LSN generator, and then they
insert their data into the buffers, concurrently, under a LW_SHARED
WALInsertLock.

The buffer queue logic was entirely redone too. XLogWrite marks buffers
as free as soon as possible, so if you use 32 MB of wal_buffers, when
XLogWrite writes data, the buffer space is reused immediately, while the
previous data is being fsync()ed.

To avoid writing partial WAL records to disk, we must be sure that all
records in the buffer are completely written. This is done by taking an
exclusive lock on WALInsertLock (which waits for all in-progress memory
writes to finish), taking a peek at the last written record, and
releasing the lock immediately.

I've also added an adaptive walwriter delay (depending on load). I've
added a lot of comments in xlog.c; check them out.
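As I understand the description, the LSN reservation reduces to bumping a byte counter by the record's length under a spinlock, while the copy into the WAL buffers then proceeds concurrently under the shared lock. A toy sketch of just the reservation step (all names are mine, and a C11 atomic_flag stands in for PostgreSQL's slock_t):

```c
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    atomic_flag spin;       /* stand-in for a slock_t */
    uint64_t    next_lsn;   /* byte position of the next record */
} LsnGenerator;

/* Reserve rec_len bytes of WAL address space and return the record's
 * starting LSN.  Only this tiny critical section is serialized; the
 * actual data copy into the buffers can then run concurrently. */
static uint64_t
lsn_reserve(LsnGenerator *g, uint32_t rec_len)
{
    uint64_t lsn;

    while (atomic_flag_test_and_set_explicit(&g->spin, memory_order_acquire))
        ;                           /* spin */
    lsn = g->next_lsn;              /* record starts here */
    g->next_lsn += rec_len;         /* ...and ends just before here */
    atomic_flag_clear_explicit(&g->spin, memory_order_release);
    return lsn;
}
```

(With C11 atomics this could even be a single fetch-add, but the spinlock form mirrors the description, since the real generator also has to handle page-boundary checks.)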
Here is the configuration I use for testing:

fsync = on
synchronous_commit = on or off, depending on the test
wal_sync_method = open_sync
full_page_writes = on
wal_buffers = 32MB
wal_writer_delay = 100ms (the adaptive walwriter delay lowers it if needed)

Benchmarks are parallel COPY and parallel INSERT INTO ... SELECT.
However, be careful with parallel INSERT INTO ... SELECT if all your
threads insert into the same table: you'll be benchmarking the FSM more
than the WAL itself...

On my setup, 8.5 is bound by the WALInsertLock, but with this patch my
disks are often maxed out, so I can't really tell; I had to put the
tables on a ramdisk. I get an almost x2 speedup on a parallel COPY of a
table with 9 INT fields; after that, the CPU is maxed out parsing
integers... If you have more than 4 cores and faster disks, I'd be very
interested in your results!

The patch will print some stuff to the console. If you see lots of
">>W>W>>>W", it means your WAL writing throughput is maxed out. "W"
means it had to flush buffers in XLogInsert (bad) instead of somewhere
non-critical like the walwriter; "<" and ">" mark the start and end of
an exclusive lock on WALInsertLock, taken because it had to flush.

Attached is a little Python benchmark script. Change the definitions for
where to put the files if you want to use it.

Regards,
Pierre