Re: Bulk Inserts - Mailing list pgsql-hackers

From Jeff Janes
Subject Re: Bulk Inserts
Date
Msg-id f67928030909172152m9e54850vba8e38a61e96f93f@mail.gmail.com
In response to Re: Bulk Inserts  (Pierre Frédéric Caillaud<lists@peufeu.com>)
Responses Re: Bulk Inserts and WAL Inserts
List pgsql-hackers


2009/9/16 Pierre Frédéric Caillaud <lists@peufeu.com>

OK, that makes sense.  I thought you had hacked either XLogInsert or the
heap WAL replay code so that you could just accumulate tuples in the rdata
chain and then submit them all under the cover of a single WALInsertLock.

Ah, no, I did not do that.

This would be difficult to do: the rdata chain contains buffer pointers, and while we are in XLogInsert we hold an exclusive lock on the buffer. If XLogInsert accumulated xlog records as you say, then later, when it came time to write them to the xlog, it would no longer hold the exclusive lock on the buffer, so its contents could have changed; and if XLogInsert then decided to do a full-page write, the contents would be wrong.

Yes, I didn't mean to make XLogInsert unilaterally accumulate rdata.  (Actually, I have done that, purely as a proof of concept.  The resulting database is completely unrecoverable, but as long as you don't bring it down, it runs fine and lets me test the speed of different concepts without going to the trouble of implementing them correctly.)

heap_bulk_insert would do the accumulation.  The hack to XLogInsert would involve making it insert an extra dummy xlog record header for every tuple in the rdata chain.  Alternatively, the hack to heap replay would involve making it accept multiple tuples reported under a single WAL record.  I don't know which one would be easier.
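To make that first option concrete, here is a toy sketch. Nothing below is PostgreSQL's real API: the `RData` struct only mimics the shape of `XLogRecData`, and `DummyTupleHeader` is a hypothetical stand-in for the extra dummy xlog record header that would be inserted before each tuple's payload in the chain.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of an rdata chain: each link points at one piece of the
 * eventual WAL record.  Illustrative only; not PostgreSQL's layout. */
typedef struct RData {
    const void   *data;
    size_t        len;
    struct RData *next;
} RData;

/* Hypothetical per-tuple dummy header, so replay could find the
 * boundaries between tuples in one massed record. */
typedef struct {
    size_t tuple_len;
} DummyTupleHeader;

/* A replay-side routine could recover the tuple count by walking the
 * alternating header/payload links of the massed chain. */
static int count_tuples(const RData *chain)
{
    int n = 0;
    while (chain) {
        n++;                     /* this link is a dummy header */
        chain = chain->next;     /* its payload must follow */
        assert(chain != NULL);
        chain = chain->next;     /* next header, or end of chain */
    }
    return n;
}
```

The point of the dummy headers is exactly the alternation above: replay can walk the record tuple by tuple without any change to the per-tuple redo logic.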
 

Besides, the LSN needs to be set in the page at every call to heap_insert (or else the WAL will not be "Ahead"), so if XLogInsert accumulated stuff before writing it, it would need a mechanism to assign an LSN without having written the xlog data yet.

Right, XLogInsert would only be able to accumulate as long as it knew it was going to get called again before the buffer exclusive lock was released.  That is why the accumulation is better done in the heap_bulk_insert, otherwise it would require an unfortunate amount of communication between the two.
 


If you haven't done that, then of course doing the bulk insert doesn't help much if you still do tuple-by-tuple XLogInsert.

Exactly.


So in the case that it is
under the limit, you first run through the tuples putting them into the
block, then run through the tuples again doing the XLogInserts?

Yes, exactly. This isn't really optimal...
I wonder if I could build one rdata chain containing all the updates to the tuples and submit it in one go. Would it work?...

I'm sure it could be made to work.  I haven't checked the replay code, but I doubt it would handle such a massed record right out of the box.  Or we could insert dummy headers between each tuple's WAL data.

...


So in order to benchmark the right thing, I have:
- all the tables in a big RAMDISK
- xlog on RAID5
- fsync=fdatasync

I've also found that setting wal_buffers to a large value like 128MB gives a significant speed boost when doing COPY or INSERT INTO SELECT, probably because it allows the backends to always find space in the buffers even if the walwriter is a bit busy.
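For reference, the setup described above corresponds to something like the following postgresql.conf fragment. (The GUC that actually takes fdatasync is wal_sync_method; the values are the ones mentioned in this thread, not recommendations.)

```
# fragment matching the benchmark setup described above
wal_buffers = 128MB            # default is far smaller
wal_sync_method = fdatasync    # "fsync=fdatasync" above presumably means this
```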

That seems very large, even for the high-throughput setup you describe.  Is the WAL background writer set at the default interval?  (On the other hand, if 128MB is just a rounding error in your machine's total RAM size, why not be generous?  On the other other hand, I've seen performance start to drop as wal_buffers gets too large, though I never bothered to chase down the cause.)

At one point, I think XLogInsert would synchronously write out the buffer when it noticed it was over half full, but that got ripped out (if I'm going to block, I might as well wait until it actually is full).  Now that there is a background writer, maybe it would be a good idea to have XLogInserters wake it up if the buffer is half full.
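The wakeup condition itself is trivial to state. A minimal sketch, where the buffer size and `wake_walwriter()` are hypothetical stand-ins for the real wal_buffers bookkeeping and walwriter signalling:

```c
#include <stdbool.h>

/* Hypothetical wal_buffers size; stand-in for the real GUC. */
#define WAL_BUFFER_BYTES (16 * 1024 * 1024)

static bool walwriter_woken = false;

/* Stand-in for signalling the background WAL writer process. */
static void wake_walwriter(void)
{
    walwriter_woken = true;
}

/* An inserter would call this after copying its record into the WAL
 * buffers; 'used' is the number of dirty, unflushed bytes. */
void maybe_wake_walwriter(long used)
{
    if (used >= WAL_BUFFER_BYTES / 2)
        wake_walwriter();   /* flush early, before inserters must block */
}
```

The idea is that the walwriter starts flushing while inserters still have free buffer space, instead of inserters blocking once the buffer is actually full.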
...


Another angle of attack would be to make wal-writing more efficient...
If you mean to do this without changing the xlog interfaces, I'm not
optimistic.

Agreed: that's why I didn't even try ;)


If you have to call XLogInsert once per row that is copied (or
insert...select), then my experiments show that simply taking the
WALInsertLock and immediately releasing it, doing absolutely no real work
while it is held, is already a substantial multi-core scalability
bottleneck.

Confirmed by context switch issue above...

Having all cores block on the same lock for every row can be OK if it's a spinlock protecting just a few lines of code... which is not the present case...


Maybe there could be some hybrid approach.  You take the spinlock and check that you could get the lwlock if you wanted to.  If you could get the lwlock and know you have almost no work to do, then just do the work while holding the spinlock.  If you discover you have more work to do (an xlog file switch, a nontrivial AdvanceXLInsert, etc.), then actually take the lwlock and drop the spinlock.  This *really* breaks the encapsulation of lwlock, so it is probably not much more than idle speculation.
 
Jeff
