Thread: Bulk Inserts

Bulk Inserts

From
Pierre Frédéric Caillaud
Date:
I've done a little experiment with bulk inserts.

=> heap_bulk_insert()

Behaves like heap_insert except it takes an array of tuples (HeapTuple  
*tups, int ntups).

- Grabs a page (same as heap_insert)

- While holding exclusive lock, inserts as many tuples as it can on the page.
       - Either the page gets full
       - Or we run out of tuples.

- Generate xlog : choice between
       - Full Xlog mode :
               - if we inserted more than 10 tuples (totally bogus heuristic), log the entire page
               - Else, log individual tuples as heap_insert does
       - Light log mode :
               - if page was empty, only xlog a "new empty page" record, not page contents
               - else, log fully
               - heap_sync() at the end

- Release the page
- If we still have tuples to insert, repeat.
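The loop above can be sketched in plain C. This is a simplified model, not PostgreSQL code: `Page`, `page_add_tuple` and the sizes are stand-ins for the real buffer manager and tuple machinery, and the locking and WAL steps are only marked by comments.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_FREE_SPACE 8192    /* stand-in for usable bytes per page */

/* Hypothetical stand-ins for the real page/tuple machinery. */
typedef struct
{
    size_t space_left;
    int    ntuples;
} Page;

static void page_init(Page *p)
{
    p->space_left = PAGE_FREE_SPACE;
    p->ntuples = 0;
}

/* Returns 1 if the tuple fit on the page, 0 if the page is full. */
static int page_add_tuple(Page *p, size_t tuplen)
{
    if (tuplen > p->space_left)
        return 0;
    p->space_left -= tuplen;
    p->ntuples++;
    return 1;
}

/*
 * Shape of the heap_bulk_insert() loop described above: grab a page,
 * pack tuples under the exclusive lock until the page is full or the
 * array is exhausted, emit WAL once for the page, release, repeat.
 * Returns the number of pages touched.
 */
static int bulk_insert(const size_t *tuplens, int ntups)
{
    int i = 0, pages_used = 0;

    while (i < ntups)
    {
        Page page;

        /* grab a page and take the exclusive lock (elided) */
        page_init(&page);
        pages_used++;

        /* oversized tuples would go through TOAST; not modeled here */
        assert(tuplens[i] <= PAGE_FREE_SPACE);

        while (i < ntups && page_add_tuple(&page, tuplens[i]))
            i++;

        /* one WAL action for the whole page, then release the lock */
    }
    return pages_used;
}
```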

Am I right in assuming that :

1)
- If the page was empty,
- and log archiving isn't used,
- and the table is heap_sync()'d at the end,
=> only a "new empty page" record needs to be created, then the page can  
be completely filled ?

2)
- If the page isn't empty
- or log archiving is used,
=> logging either the inserted tuples or the entire page is OK to  
guarantee persistence ?

(I used kill -9 to test it, recovery seems to work).

Test on a concurrent COPY, 4 threads, on a table with 8 INT columns.

* 8.5 HEAD : Total Time 44 s
* Bulk inserts, Full XLog : Total Time 24 s
* Bulk inserts, Light XLog : Total Time 10 s

Quite a bit faster... I presume with more CPUs it would scale.

I'm not posting the patch because it's quite ugly (especially the part to  
store tuples in copy.c and bulk-insert them, I should probably have used a  
tuplestore...)
I think the tuples need to be stored and then bulk-inserted because the  
exclusive lock on the buffer can't be held for a long time.


Lock stats (from the patch I just posted) :

* 8.5 HEAD : Total Time 44 s

-------- Lock stats for PID 28043
  PID  Lock  ShAcq  ShWait  ShWaitT  ShHeldT    ExAcq  ExWait           ExWaitT           ExHeldT  Name
28043     7      0       0     0.00     0.00  2500002  804378  23.59 ( 53.11 %)  7.38 ( 16.61 %)  WALInsert
28043     8      0       0     0.00     0.00    25775      32   2.91 (  6.54 %)  0.90 (  2.02 %)  WALWrite

-------- Lock stats for PID 28044
  PID  Lock  ShAcq  ShWait  ShWaitT  ShHeldT    ExAcq  ExWait           ExWaitT           ExHeldT  Name
28044     7      0       0     0.00     0.00  2500002  802515  22.26 ( 50.11 %)  8.70 ( 19.59 %)  WALInsert
28044     8      0       0     0.00     0.00    25620      42   4.00 (  9.01 %)  1.12 (  2.52 %)  WALWrite

-------- Lock stats for PID 28045
  PID  Lock  ShAcq  ShWait  ShWaitT  ShHeldT    ExAcq  ExWait           ExWaitT           ExHeldT  Name
28045     7      0       0     0.00     0.00  2500002  799145  22.47 ( 50.32 %)  8.72 ( 19.52 %)  WALInsert
28045     8      0       0     0.00     0.00    25725      38   4.08 (  9.14 %)  1.05 (  2.35 %)  WALWrite

-------- Lock stats for PID 28042
  PID  Lock  ShAcq  ShWait  ShWaitT  ShHeldT    ExAcq  ExWait           ExWaitT           ExHeldT  Name
28042     7      0       0     0.00     0.00  2500002  809477  23.49 ( 52.44 %)  7.89 ( 17.62 %)  WALInsert
28042     8      0       0     0.00     0.00    25601      37   3.27 (  7.31 %)  1.05 (  2.34 %)  WALWrite


* Bulk inserts, Full XLog : Total Time 24 s

-------- Lock stats for PID 32486
  PID   Lock  ShAcq  ShWait  ShWaitT  ShHeldT  ExAcq  ExWait           ExWaitT           ExHeldT  Name
32486      7      0       0     0.00     0.00  23765    1128   9.22 ( 38.98 %)  4.05 ( 17.14 %)  WALInsert
32486      8      0       0     0.00     0.00  21120      19   2.64 ( 11.17 %)  1.32 (  5.59 %)  WALWrite

-------- Lock stats for PID 32484
  PID   Lock  ShAcq  ShWait  ShWaitT  ShHeldT  ExAcq  ExWait           ExWaitT           ExHeldT  Name
32484      7      0       0     0.00     0.00  23865    1083   9.87 ( 41.68 %)  2.87 ( 12.11 %)  WALInsert
32484      8      0       0     0.00     0.00  21105      11   1.68 (  7.11 %)  1.09 (  4.62 %)  WALWrite
32484   8508      0       0     0.00     0.00      1       1   0.19 (  0.81 %)  0.00 (  0.00 %)
32484  18846      0       0     0.00     0.00      1       1   0.25 (  1.05 %)  0.00 (  0.00 %)

-------- Lock stats for PID 32485
  PID   Lock  ShAcq  ShWait  ShWaitT  ShHeldT  ExAcq  ExWait           ExWaitT           ExHeldT  Name
32485      7      0       0     0.00     0.00  23816    1107   8.94 ( 37.75 %)  4.05 ( 17.09 %)  WALInsert
32485      8      0       0     0.00     0.00  21109      21   2.59 ( 10.93 %)  1.36 (  5.77 %)  WALWrite
32485  16618      0       0     0.00     0.00      1       2   0.23 (  0.98 %)  0.00 (  0.00 %)

-------- Lock stats for PID 32482
  PID   Lock  ShAcq  ShWait  ShWaitT  ShHeldT  ExAcq  ExWait           ExWaitT           ExHeldT  Name
32482      7      0       0     0.00     0.00  23813    1053   9.70 ( 40.75 %)  3.41 ( 14.32 %)  WALInsert
32482      8      0       0     0.00     0.00  21119      15   2.24 (  9.43 %)  1.06 (  4.44 %)  WALWrite
32482   6770      0       0     0.00     0.00      3       1   0.17 (  0.70 %)  0.00 (  0.00 %)

* Bulk inserts, Light XLog : Total Time 10 s

No Lock stats to show (wait time is < 0.01 s)...



Re: Bulk Inserts

From
Pierre Frédéric Caillaud
Date:
Replying to myself...

Jeff suggested to build pages in local memory and insert them later in the  
table. This is what's used in CLUSTER for instance, I believe.

It has some drawbacks though :

- To insert the tuples in indexes, the tuples need tids, but if you build  
the page in local memory, you don't know on which page they will be until  
after allocating the page, which will probably be done after the page is  
built, so it's a bit of a chicken and egg problem.

- It only works on new pages. Pages which are not empty, but have free  
space, cannot be written in this way.

The little experiment I made yesterday does not have these drawbacks,  
since it allocates pages in the standard way, simply it inserts many  
tuples in one operation instead of just inserting one. If the page  
happened to be empty, it's even better, but it's not necessary. If your  
table has lots of free space, it will be used.


Re: Bulk Inserts

From
Jeff Janes
Date:
2009/9/14 Pierre Frédéric Caillaud <lists@peufeu.com>

I've done a little experiment with bulk inserts.

=> heap_bulk_insert()

Behaves like heap_insert except it takes an array of tuples (HeapTuple *tups, int ntups).

- Grabs a page (same as heap_insert)

- While holding exclusive lock, inserts as many tuples as it can on the page.
       - Either the page gets full
       - Or we run out of tuples.

- Generate xlog : choice between
       - Full Xlog mode :
                - if we inserted more than 10 tuples (totally bogus heuristic), log the entire page
               - Else, log individual tuples as heap_insert does

Does that heuristic change the timings much?  If not, it seems like it would be better to keep it simple and always do the same thing, like log the tuples (if it is done under one WALInsertLock, which I am assuming it is..)
 
       - Light log mode :
               - if page was empty, only xlog a "new empty page" record, not page contents
               - else, log fully
               - heap_sync() at the end

- Release the page
- If we still have tuples to insert, repeat.

Am I right in assuming that :

1)
- If the page was empty,
- and log archiving isn't used,
- and the table is heap_sync()'d at the end,
=> only a "new empty page" record needs to be created, then the page can be completely filled ?

Do you even need the new empty page record?  I think a zero page will be handled correctly next time it is read into shared buffers, won't it?  But I guess it is needed to avoid problems with partial page writes that would leave the page in a state that is neither all zeros nor consistent.



2)
- If the page isn't empty
- or log archiving is used,
=> logging either the inserted tuples or the entire page is OK to guarantee persistence ?

If the entire page is logged, would it have to be marked as not removable by the log compression tool?  Or can the tool recreate the needed delta?
 
Jeff

Re: Bulk Inserts

From
Jeff Janes
Date:


2009/9/14 Pierre Frédéric Caillaud <lists@peufeu.com>

Replying to myself...

Jeff suggested to build pages in local memory and insert them later in the table. This is what's used in CLUSTER for instance, I believe.

It has some drawbacks though :

- To insert the tuples in indexes, the tuples need tids, but if you build the page in local memory, you don't know on which page they will be until after allocating the page, which will probably be done after the page is built, so it's a bit of a chicken and egg problem.

Yes, I did not consider that to be a problem because I did not think it would be used on indexed tables.  I figured that the gain from doing bulk inserts into the table would be so diluted by the still-bottlenecked index maintenance that it was OK not to use this optimization for indexed tables.
 

- It only works on new pages. Pages which are not empty, but have free space, cannot be written in this way.

My original thought was based on the idea of still using heap_insert, but with a modified form of bistate which would hold the exclusive lock and not just a pin.  If heap_insert is being driven by the unmodified COPY code, then it can't guarantee that COPY won't stall on a pipe read or something, and so probably shouldn't hold an exclusive lock while filling the block.  That is why I decided a local buffer would be better, as the exclusive lock is really a no-op and wouldn't block anyone.  But if you are creating a new heap_bulk_insert and modifying the COPY to go with it, then you can guarantee it won't stall from the driving end, instead.

 Whether any of these approaches will be maintainable enough to be integrated into the code base is another matter.  It seems like there is already a lot of discussion going on around various permutations of copy options.

Cheers,

Jeff

Re: Bulk Inserts

From
Pierre Frédéric Caillaud
Date:
> Yes, I did not consider that to be a problem because I did not think it
> would be used on indexed tables.  I figured that the gain from doing bulk
> inserts into the table would be so diluted by the still-bottlenecked
> index maintenance that it was OK not to use this optimization for  
> indexed tables.

I've tested with indexes, and the index update time is much larger than  
the inserts time. Bulk inserts still provide a little bonus though, and  
having a solution that works in all cases is better IMHO.

> My original thought was based on the idea of still using heap_insert, but
> with a modified form of bistate which would hold the exclusive lock and  
> not
> just a pin.  If heap_insert is being driven by the unmodified COPY code,
> then it can't guarantee that COPY won't stall on a pipe read or  
> something,
> and so probably shouldn't hold an exclusive lock while filling the block.

Exactly, that's what I was thinking too, and reached the same conclusion.

> That is why I decided a local buffer would be better, as the exclusive  
> lock
> is really a no-op and wouldn't block anyone.  But if you are creating a  
> new
> heap_bulk_insert and modifying the COPY to go with it, then you can
> guarantee it won't stall from the driving end, instead.

I think it's better, but you have to buffer tuples : at least a full  
page's worth, or better, several pages' worth of tuples, in case inline  
compression kicks in and shrinks them, since the purpose is to be able to  
fill a complete page in one go.

>  Whether any of these approaches will be maintainable enough to be
> integrated into the code base is another matter.  It seems like there is
> already a lot of discussion going on around various permutations of copy
> options.

It's not really a COPY mod, since it would also be good for big INSERT  
INTO SELECT FROM which is wal-bound too (even more so than COPY, since  
there is no parsing to do).


Re: Bulk Inserts

From
Pierre Frédéric Caillaud
Date:
> Does that heuristic change the timings much?  If not, it seems like it  
> would be
> better to keep it simple and always do the same thing, like log the  
> tuples
> (if it is done under one WALInsertLock, which I am assuming it is..)

It is the logging of whole pages that makes it faster.
If you fill a page with tuples in one operation (while holding exclusive  
lock) and then insert WAL records for each tuple, there is no speed gain.

Inserting a full page WAL record (since you just filled the page  
completely) :

- only takes WalInsertLock once instead of once per tuple
- reduces wal traffic
- is about 2x faster in my benchmark

And inserting a "clear new page" record (if the page was previously  
new/empty and relation is fsync'd at the end) :

- only takes WalInsertLock once instead of once per tuple
- reduces wal traffic a lot
- is about 4x faster in my benchmark

> Do you even need the new empty page record?  I think a zero page will be
> handled correctly next time it is read into shared buffers, won't it?

I have no idea ;)

> But I
> guess it is need to avoid  problems with partial page writes that would
> leave in a state that is neither all zeros nor consistent.

Plus, empty page records make for very small WAL traffic and I didn't see  
any performance difference with or without them.

> If the entire page is logged, would it have to be marked as not removable by
> the log compression tool?  Or can the tool recreate the needed delta?

No, the tool cannot recreate the data, since the idea is precisely to  
replace a lot of "tuple insert" messages with one "entire page" message,  
which takes both less space and less time. The warm-standby replicators  
that get this WAL need to know the page contents to replicate it... (also,  
it will probably be faster for them to redo a page write than redo all the  
tuple inserts).

Here is what I'm thinking about now :

* have some kind of BulkInsertState which contains
- info about the relation, indexes, triggers, etc
- a tuple queue.

The tuple queue may be a tuple store, or simply tuple copies in a local  
memory context.

You'd have functions to :

- Setup the BulkInsertState
- Add a tuple to the BulkInsertState
- Finish the operation and clear the BulkInsertState

When adding a tuple, it is stored in the queue.
When the queue is full, a bulk insert operation takes place, hopefully we  
can fill an entire page, and return.
Post insert triggers and index updates are also handled at this point.

When finished, the function that clears the state also inserts all  
remaining tuples in the queue.

With this you could also do something *really* interesting : bulk index  
updates...

Bulk index updates are probably mutually exclusive with after-row triggers  
though.
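A rough sketch of that queue-and-flush design, under the assumption of a simple fixed-capacity queue; all names here (the `BulkInsertState` fields, `bulk_add_tuple`, etc.) are illustrative placeholders, not actual PostgreSQL APIs.

```c
#include <assert.h>

#define QUEUE_CAPACITY 4   /* real code would size this to fill a page or more */

/* Hypothetical sketch of the proposed API; not PostgreSQL code. */
typedef struct
{
    int queued;        /* tuples waiting in the queue */
    int flushes;       /* bulk-insert operations performed */
    int inserted;      /* tuples actually written out */
} BulkInsertState;

static void bulk_setup(BulkInsertState *s)
{
    s->queued = s->flushes = s->inserted = 0;
}

/* One flush = one heap_bulk_insert() call plus index/trigger work. */
static void bulk_flush(BulkInsertState *s)
{
    if (s->queued == 0)
        return;
    s->inserted += s->queued;
    s->queued = 0;
    s->flushes++;
}

static void bulk_add_tuple(BulkInsertState *s /*, HeapTuple tup */)
{
    s->queued++;
    if (s->queued == QUEUE_CAPACITY)
        bulk_flush(s);          /* queue full: do one bulk operation */
}

static void bulk_finish(BulkInsertState *s)
{
    bulk_flush(s);              /* insert whatever remains, then clean up */
}
```

With a capacity of 4, adding 10 tuples and finishing performs three bulk operations (4 + 4 + 2 tuples).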

Another angle of attack would be to make wal-writing more efficient...







Re: Bulk Inserts

From
Jeff Janes
Date:
2009/9/15 Pierre Frédéric Caillaud <lists@peufeu.com>
Does that heuristic change the timings much?  If not, it seems like it would be
better to keep it simple and always do the same thing, like log the tuples
(if it is done under one WALInsertLock, which I am assuming it is..)

It is the logging of whole pages that makes it faster.
If you fill a page with tuples in one operation (while holding exclusive lock) and then insert WAL records for each tuple, there is no speed gain.

Inserting a full page WAL record (since you just filled the page completely) :

- only takes WalInsertLock once instead of once per tuple


OK, that makes sense.  I thought you had hacked either XLogInsert or the heap WAL replay code so that you could just accumulate tuples in the rdata chain and then submit them all under the cover of a single WALInsertLock.  If you haven't done that, then of course doing the bulk insert doesn't help much if you still do tuple-by-tuple XLogInsert.  So in the case that it is under the limit, you first run through the tuples putting them into the block, then run through the tuples again doing the XLogInserts?
 
- reduces wal traffic
- is about 2x faster in my benchmark

And inserting a "clear new page" record (if the page was previously new/empty and relation is fsync'd at the end) :

- only takes WalInsertLock once instead of once per tuple
- reduces wal traffic a lot
- is about 4x faster in my benchmark

Do you have an IO bottleneck even in the absence of fsyncs?  My experience on multi-core machines with decent IO systems has been that the amount of WAL traffic (by volume) matters rather little, as opposed to the number of WALInsertLocks taken, which matters quite a bit.  Of course this depends quite a bit on your OS and hardware.
 

...

Another angle of attack would be to make wal-writing more efficient...

If you mean to do this without changing the xlog interfaces, I'm not optimistic. 

If you have to call XLogInsert once per row that is copied (or insert...select), then my experiments show that simply taking the WALInsertLock and immediately releasing it, doing absolutely no real work while it is held, is already a substantial multi-core scalability bottleneck.  Once we accept that this must be done, the next existing bottleneck is the memcpy of the first byte from the rdata chain into the shared wal_buffers, presumably because this copy involves fighting the cache line away from other cores.  Once you've copied the first byte, the rest of them seem to be almost free.  (Again, this is probably hardware and situation dependent). 

I've seen some suggestions that the wal_buffer block initiation work be moved from being done by AdvanceXLInsert to instead be done by XLogWrite.  However, I've not seen any indication that AdvanceXLInsert is a meaningful bottleneck in the first place.  Except when wal_buffers is too small: then AdvanceXLInsert is a bottleneck, but only because XLogWrite is getting called from within it, in which case moving work from one to the other is probably not going to make things better.

Jeff

Re: Bulk Inserts

From
Pierre Frédéric Caillaud
Date:
> OK, that makes sense.  I thought you had hacked either XLogInsert or the
> heap WAL replay code so that you could just accumulate tuples in the  
> rdata
> chain and then submit them all under the cover of a single WALInsertLock.

Ah, no, I did not do that.

This would be difficult to do : rdata chain contains buffer pointers, and  
when we are in XLogInsert, we have an exclusive lock on the buffer. If  
XLogInsert accumulated xlog records as you say, then later, when it's time  
to write them to the xlog, it would no longer hold exclusive lock on the  
buffer, so its contents could have changed, and if XLogInsert decides to  
do a full page write, the contents will be wrong.

Besides, the LSN needs to be set in the page at every call to heap_insert  
(or else WAL will not be "Ahead"), so if XLogInsert accumulated stuff  
before writing it, it would need a mechanism to assign a LSN without  
having written the xlog data yet.

> If you haven't done that, then of course doing the bulk insert doesn't  
> help much if you still do tuple-by-tuple XLogInsert.

Exactly.

> So in the case that it is
> under the limit, you first run through the tuples putting them into the
> block, then run through the tuples again doing the XLogInserts?

Yes, exactly. This isn't really optimal...
I wonder if I could build one rdata chain containing all updates to the  
tuples and submit it in one go. Would it work ?...

> Do you have an IO bottleneck even in the absence of fsyncs?

Sometimes, yes :

- When xlog is on a (rather slow) software RAID1, I've found that the  
fsyncs which happen when switching to a new xlog segment are significant,  
and the amount of log data written is dangerously close to the max  
throughput, too.
- When xlog is on a much faster RAID5, there is no such problem. fsync on  
raid5 is horrendously slow if you have many pending small writes, but if  
all you need to sync is a 16MB file which nicely aligns on raid stripes,  
it's not a problem.

So in order to benchmark the right thing, I have :
- all the tables in a big RAMDISK
- xlog on RAID5
- fsync=fdatasync

I've also found that setting wal_buffers to a large value like 128MB gives  
a significant speed boost when doing COPY or INSERT INTO SELECT, probably  
because it allows the backends to always find space in the buffers even if  
the walwriter is a bit busy.

> My experience
> on multi-core machines with decent IO systems has been that the amount of
> WAL traffic (by volume) matters rather little, as opposed to the number
> of WALInsertLocks taken, which matters quite a bit.  Of course this depends
> quite a bit on your OS and hardware.

Exactly : with the configuration above, iowait is extremely small (<1%) ,  
yet all processes still spend 50% of their time waiting on WALInsertLock.

The lock is held for about 15% of the total time ; with 4 cores and 4  
threads, if a core spends X time holding the lock, it's logical that the  
others spend 3*X time waiting on it.

But consider this :

- 4 processes spend 50% of their time waiting on the lock
- therefore at any time there are average 2 processes waiting
- therefore every time the lock is Released, a process is woken up

=> more than 200,000 context switches per second seen in vmstat

Actually it does 1 context switch every two inserted rows, which is  
enormous...

I've done a bit of profiling and found that 70% of the time spent holding  
this lock is... computing the header's CRC.
Just for a quick check, I removed all CRC calculations. There was  
absolutely no performance gain.

>> Another angle of attack would be to make wal-writing more efficient...
> If you mean to do this without changing the xlog interfaces, I'm not
> optimistic.

Agree : that's why I didn't even try ;)

> If you have to call XLogInsert once per row that is copied (or
> insert...select), then my experiments show that simply taking the
> WALInsertLock and immediately releasing it, doing absolutely no real work
> while it is held, is already a substantial multi-core scalability
> bottleneck.

Confirmed by context switch issue above...

Having all cores block on the same lock for every row can be OK if it's a  
spinlock protecting just a few lines of code... which is not the present  
case...

> Once we accept that this must be done, the next existing
> bottleneck is the memcpy of the first byte from the rdata chain into the
> shared wal_buffers, presumably because this copy involves fighting the  
> cache
> line away from other cores.  Once you've copied the first byte, the rest  
> of
> them seem to be almost free.  (Again, this is probably hardware and
> situation dependent).

Since I have one 4 core CPU I probably don't see this effect (a multi-cpu  
case would).

> I've seen some suggestions that the wal_buffer block initiation work be  
> moved
> from being done by AdvanceXLInsert to instead be done by XLogWrite.
> However, I've not seen any indication that AdvanceXLInsert is a  
> meaningful
> bottleneck in the first place.  Except when wal_buffers is too small: then
> AdvanceXLInsert is a bottleneck, but only because XLogWrite is getting
> called from within it, in which case moving work from one to the other is
> probably not going to make things better.

That correlates with the benefits I saw by increasing wal_buffers to a  
very large value...

I'm going to try other things and report back.




Re: Bulk Inserts

From
Jeff Janes
Date:


2009/9/16 Pierre Frédéric Caillaud <lists@peufeu.com>

OK, that makes sense.  I thought you had hacked either XLogInsert or the
heap WAL replay code so that you could just accumulate tuples in the rdata
chain and then submit them all under the cover of a single WALInsertLock.

Ah, no, I did not do that.

This would be difficult to do : rdata chain contains buffer pointers, and when we are in XLogInsert, we have an exclusive lock on the buffer. If XLogInsert accumulated xlog records as you say, then later, when it's time to write them to the xlog, it would no longer hold exclusive lock on the buffer, so its contents could have changed, and if XLogInsert decides to do a full page write, the contents will be wrong.

Yes, I didn't mean to make XLogInsert unilaterally accumulate rdata  (Actually I have done that, purely as a proof of concept.  The resulting database is completely unrecoverable, but as long you don't bring it down, it runs fine and lets me test the speed of different concepts without going to the trouble of implementing them correctly). 

heap_bulk_insert would do the accumulation.  The hack to XLogInsert would involve making it insert an extra dummy xlog record header for every tuple in the rdata chain.  Alternatively, the hack to heap replay would involve making it accept multiple tuples reported under a single WAL record.  I don't know which one would be easier.
 

Besides, the LSN needs to be set in the page at every call to heap_insert (or else WAL will not be "Ahead"), so if XLogInsert accumulated stuff before writing it, it would need a mechanism to assign a LSN without having written the xlog data yet.

Right, XLogInsert would only be able to accumulate as long as it knew it was going to get called again before the buffer exclusive lock was released.  That is why the accumulation is better done in the heap_bulk_insert, otherwise it would require an unfortunate amount of communication between the two.
 


If you haven't done that, then of course doing the bulk insert doesn't help much if you still do tuple-by-tuple XLogInsert.

Exactly.


So in the case that it is
under the limit, you first run through the tuples putting them into the
block, then run through the tuples again doing the XLogInserts?

Yes, exactly. This isn't really optimal...
I wonder if I could build one rdata chain containing all updates to the tuples and submit it in one go. Would it work ?...

I'm sure it could be made to work.  I haven't checked the replay code, but I doubt it would work on this massed record right out of the box.  Or we could insert dummy headers between each tuple's WAL data. 

...


So in order to benchmark the right thing, I have :
- all the tables in a big RAMDISK
- xlog on RAID5
- fsync=fdatasync

I've also found that setting wal_buffers to a large value like 128MB gives a significant speed boost when doing COPY or INSERT INTO SELECT, probably because it allows the backends to always find space in the buffers even if the walwriter is a bit busy.

That seems very large, even for the high throughput set up you describe.  Is the WAL background writer set at the default interval?  (On the other hand, if 128MB is just a rounding error in your machine's total RAM size, why not be generous?  On the other other hand, I've seen performance start to drop as wal_buffers gets too large, though I never bothered to chase down the cause.)

At one point, I think that XLogInsert would synchronously write out the buffer when it noticed it was over half full, but that got ripped out (if I'm going to block, I might as well wait until it actually is full to do so).  Now that there is a background writer, maybe it would be a good idea to have XLogInserters wake it up if the buffer is half full.
...


Another angle of attack would be to make wal-writing more efficient...
If you mean to do this without changing the xlog interfaces, I'm not
optimistic.

Agree : that's why I didn't even try ;)


If you have to call XLogInsert once per row that is copied (or
insert...select), then my experiments show that simply taking the
WALInsertLock and immediately releasing it, doing absolutely no real work
while it is held, is already a substantial multi-core scalability
bottleneck.

Confirmed by context switch issue above...

Having all cores block on the same lock for every row can be OK if it's a spinlock protecting just a few lines of code... which is not the present case...


Maybe there could be some hybrid approach.  You take the spinlock, check that you could get the lwlock if you wanted to.  If you could get the lwlock and know you have almost no work to do, then just do it while holding the spinlock.  If you discover you have more work to do (xlog file switch, nontrivial AdvanceXLInsert, etc.) then actually take the lwlock and drop the spinlock.  This *really* breaks the encapsulation of lwlock, so it is probably not much more than idle speculation.
 
Jeff

Re: Bulk Inserts and WAL Inserts

From
Pierre Frédéric Caillaud
Date:

> heap_bulk_insert would do the accumulation.  The hack to XLogInsert would
> involve making it insert an extra dummy xlog record header for every
> tuple

WAL record header is something like 40 bytes, so if you make lots of
inserts in a page, you'd be better off WALing the whole page, it takes
less space, it is much faster, and if you're coming right after a
checkpoint, you're going to do it anyway on the first inserted row, so it
would be even better to do it on the last inserted row...

WAL Insert record is 55 bytes + tuple data
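Taking the 55-byte per-record overhead above at face value, a quick back-of-the-envelope comparison of the two logging strategies (the page and tuple sizes are illustrative assumptions, not measured figures):

```c
#include <assert.h>

/* Figures from the discussion above; tuple size is illustrative. */
#define RECORD_OVERHEAD 55      /* bytes of WAL overhead per insert record */
#define PAGE_SIZE     8192      /* one full-page WAL image */
#define TUPLE_SIZE      50      /* a small row, comparable to the overhead */

/* WAL bytes when logging each of n inserted tuples individually. */
static long wal_per_tuple(int n)
{
    return (long) n * (RECORD_OVERHEAD + TUPLE_SIZE);
}

/* WAL bytes when logging the filled page once (one record). */
static long wal_whole_page(void)
{
    return RECORD_OVERHEAD + PAGE_SIZE;
}
```

For these numbers the crossover is around 78 tuples: a page packed with ~160 small rows costs roughly twice as much WAL logged tuple-by-tuple as logged whole, while a handful of inserts is still cheaper individually.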

However, you can't hold an exclusive lock on a buffer while going out in
the executor to fetch the next tuple, since that can take a undetermined
amount of time : imagine the record comes from a dblink() and there is a
connection loss... so you'd need a TupleBuffer, something like a
tuplestore that doesn't spill to disk and holds only about 32 kB of
tuples, and :
- fill buffer with tuples coming from the executor (or COPY),
- pass this to heap_bulk_insert(),
- toast tuples (maybe also bulk-insert in the TOAST table)
- take the exclusive lock on the buffer,
- insert tuples quickly,
- log the whole page when it's full, or log individual inserts (in 1
operation)
- release the lock

> in the rdata chain.  Alternatively, the hack to heap replay would involve
> making it accept multiple tuples reported under a single WAL record.  I
> don't know which one would be easier.

This one is probably easier, since all your WAL records refer to the same
buffer.

> I'm sure it could be made to work.  I haven't checked the replay code,
> but I
> doubt it would work on this massed record right out of the box.  Or we
> could
> insert dummy headers between each tuple's WAL data.

For small rows, the header is as large as the row... might as well get rid
of it !

Inserting dummy headers would not work, here's why :

- The critical part (lock-wise) of XLogInsert is generating a LSN
- The LSN calculation needs to verify that the header isn't split between
pages
- Therefore, if you want to insert N records in one operation, then this
critical part is going to take quite some time
- And the lock should be held for as little time as possible...

> That seems very large, even for the high throughput set up you
> describe.  Is
> the WAL background writer set at the default interval?  (On the other
> hand,
> if 128MB is just a rounding error in your machine's total RAM size, why
> not
> be generous?  On the other other hand, I've seen performance start to
> drop as
> wal_buffers gets too large, though I never bothered to chase down the
> cause.)
>
> At one point, I think that XLogInsert would synchronously write out the
> buffer when it noticed it was over half full, but that got ripped out (if
> I'm going to block, I might as well wait until it actually is full to
> do).
> Now that there is a background writer, maybe it would be a good idea to
> have
> XLogInserters wake it up if the buffer is half full.
> ...

Here's what I think :

Currently, using big WAL buffers decreases performance, while you'd expect
it to increase. Here's why.

Case 1) consider these concurrent transactions:
- Transaction 1: do a big operation (VACUUM, COPY, etc.)
- Transaction 2: BEGIN; do a few little UPDATEs... COMMIT

Suppose you have large wal_buffers like 32MB. By the time transaction 2
wants to commit and sync, transaction 1 has inserted lots of WAL into the
buffer, so transaction 2 finds it needs to write and fsync() something
like 2x 16 MB WAL segments. Needless to say, transaction 2's commit is
going to be pretty slow. With small buffers, all that data would already
have been in the OS cache, if not already on the disks.

Case 2) you just do a big COPY, or a big VACUUM that writes lots of xlog.
Suppose you're generating something like 50 MB/s of WAL.
You've set your WAL writer delay to something very low like 20 ms, so the
buffers are nicely emptied...
However, at this throughput you go through one xlog segment every 300 ms.
When XLogWrite() tries to fsync() the segment in order to advance to the
next one, the data is in the OS cache, but the OS probably hasn't decided
to write it to disk yet, so the fsync call is very expensive, and it
blocks. This means that if your buffers are smaller than a WAL segment,
the buffers fill up while you wait for the fsync, and everything stands
still.

Here's my current test setup:

- WAL on a dedicated disk (this is really good for performance unless you
have a battery-backed controller)
- wal_sync_method using O_DSYNC
- a very aggressive WAL writer delay, like 5 ms

Basically, all the WAL disk does is write WAL, so the head hardly moves at
all. The disk takes about 8 ms per rotation (so wal_writer_delay is less
than one rotation). Whenever there is WAL to write, walwriter writes it,
and blocks because of the O_DSYNC mode. So, basically, the pending WAL is
written once per disk rotation, and fsync() is then a no-op. This gives
excellent, very smooth performance, whereas fsync() gives quite random,
and sometimes pretty high, wait times, since the length of an fsync() wait
depends on how much unflushed data sits in the OS buffers.

However, there is a problem with O_DSYNC:

- Latency is excellent, but throughput is miserable, because you can only
write one chunk per disk rotation.
- If all you need to write is a few pages, and you want them synced, it's
perfect.
- If you want to write at high throughput, though, you'd better write the
data in LARGE chunks, ideally one WAL segment at a time.

So, O_DSYNC works much better with large WAL buffers (>16 MB) and a small
walwriter delay, ideally a delay shorter than one disk rotation.


> Maybe there could be some hybrid approach.  You take the spinlock, check
> that you could get the lwlock if you wanted to.  If you could get the
> lwlock
> and know you have almost no work to do, then just do it while holding the
> spinlock.  If you discover you have more work todo (xlog file switch,
> nontrivial AdvanceXLInsert, etc.) then actually take the lwlock and drop
> the
> spinlock.  This *really* breaks the encapsulation of lwlock, so it is
> probably not much more than idle speculation.

That's a description of something that could look like a futex ;)


Anyway, I've made a little patch (attached).
It applies to git revision 1384847ef8731718a79a32cd354e31c31c5294a0 of
current postgres (from last week), it probably applies to current HEAD,
but I have to learn how to do a merge in git before I can tell you that ;)

What it does :

I've put extremely detailed comments in xlog.c, this is a brief summary :

Since the critical part of XLogInsert is generating an LSN, not writing
the data, I have created an LSN generator, which is extremely fast, and
protected it with a spinlock. Using anything other than a spinlock creates
a nice point of serialization and kills all the performance gains.

Concurrent XLogInserts get an LSN from the LSN generator, and then insert
their data into the buffers, concurrently, under a LW_SHARED
WALInsertLock.
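The core idea can be sketched in a few lines. This is an illustration of
the scheme, not the patch itself: a C11 atomic stands in for the
spinlock-protected counter, the buffer size is arbitrary, and the real
code must of course also track which buffer space has been written out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define WAL_BUF_SIZE (1 << 20)

static _Atomic unsigned long next_lsn;       /* the "LSN generator" */
static char wal_buffer[WAL_BUF_SIZE];

/* The only serialized step: a single fetch-and-add reserving `len`
   bytes of WAL space and handing back the record's start LSN. */
static unsigned long lsn_reserve(size_t len)
{
    return atomic_fetch_add(&next_lsn, (unsigned long) len);
}

/* Each backend then copies its record into its reserved slice; many
   backends can do this at once (under a shared lock in the patch). */
static unsigned long wal_insert(const void *rec, size_t len)
{
    unsigned long lsn = lsn_reserve(len);
    size_t off  = lsn % WAL_BUF_SIZE;
    size_t tail = WAL_BUF_SIZE - off;

    if (len <= tail)
        memcpy(wal_buffer + off, rec, len);
    else {                      /* record wraps the circular buffer */
        memcpy(wal_buffer + off, rec, tail);
        memcpy(wal_buffer, (const char *) rec + tail, len - tail);
    }
    return lsn;
}
```

Because reservation is a single counter bump, the serial section shrinks
to a few instructions, and the (long) memcpy work runs in parallel.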

The buffer queue logic was entirely redone too.

XLogWrite marks buffers as free as soon as possible, so if you use 32 MB
of wal_buffers, when XLogWrite writes data the buffer space is reused
immediately, while the previous data is being fsync()ed.

To avoid writing partial WAL records to disk, we must be sure that all
records in the buffer have been completely written. This is done by taking
an exclusive lock on WALInsertLock (which waits for all in-progress memory
writes to finish), taking a peek at the last written record, and releasing
the lock immediately.

I've also added an adaptive walwriter delay (depending on load).
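One plausible shape for such an adaptive delay (this is a guess at the
policy, not what the patch actually implements; the function name and the
halving/creeping constants are made up): back off sharply when the WAL
buffers were busy, and drift back toward the configured maximum when they
were quiet.

```c
#include <assert.h>

/* Return the next walwriter sleep in ms, given the current sleep, the
   configured maximum, and how full the WAL buffers were last cycle. */
static int adapt_delay_ms(int cur_ms, int max_ms, double buffer_fill)
{
    if (buffer_fill > 0.5)
        cur_ms /= 2;             /* busy: wake up more often */
    else if (cur_ms < max_ms)
        cur_ms += 1;             /* quiet: creep back toward the max */
    return cur_ms < 1 ? 1 : cur_ms;
}
```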

I've added a lot of comments in xlog.c, check them out.

Here is the configuration I use for testing :

fsync = on
synchronous_commit = on or off depending on test
wal_sync_method = open_sync
full_page_writes = on
wal_buffers = 32MB
wal_writer_delay = 100ms (adaptive walwriter delay lowers it if needed)

Benchmarks are parallel COPY, and parallel INSERT INTO ... SELECT.
However, be careful with parallel INSERT INTO ... SELECT if all your
threads insert into the same table: you'll be benchmarking the FSM more
than the WAL itself...

On my setup, 8.5 is bound by the WALInsertLock, but with this patch, my
disks are often maxed out, so I can't really tell. I had to put the tables
on a ramdisk.
I get an almost 2x speedup on parallel COPY of a table with 9 INT fields;
then the CPU is maxed out parsing integers...
If you have more than 4 cores and faster disks, I'd be very interested in
your results!

It will print some stuff to the console; if you see lots of ">>W>W>>>W",
it means your WAL writing throughput is maxed out.
"W" means XLogInsert had to flush buffers itself (bad) instead of leaving
that to somewhere non-critical like the wal writer.
"<" and ">" mark the start and end of an exclusive lock on WALInsertLock,
taken because it had to flush.

Attached is a little Python benchmark script. Change the definitions for
where to put the files if you want to use it.

Regards,

Pierre