Re: Bulk Inserts and WAL Inserts - Mailing list pgsql-hackers
From: Pierre Frédéric Caillaud
Subject: Re: Bulk Inserts and WAL Inserts
Msg-id: op.u0tfrys1cke6l8@soyouz
In response to: Re: Bulk Inserts (Jeff Janes <jeff.janes@gmail.com>)
List: pgsql-hackers
> heap_bulk_insert would do the accumulation. The hack to XLogInsert would
> involve making it insert an extra dummy xlog record header for every
> tuple

The WAL record header is something like 40 bytes, so if you make lots of
inserts in a page, you're better off WALing the whole page: it takes less
space and is much faster. And if you're coming right after a checkpoint,
you're going to log the whole page anyway on the first inserted row, so it
would be even better to do it on the last inserted row... A WAL insert
record is 55 bytes + tuple data.

However, you can't hold an exclusive lock on a buffer while going out into
the executor to fetch the next tuple, since that can take an undetermined
amount of time: imagine the record comes from a dblink() and there is a
connection loss... So you'd need a TupleBuffer, something like a tuplestore
that doesn't spill to disk and holds only about 32 kB of tuples, and then:

- fill the buffer with tuples coming from the executor (or COPY),
- pass it to heap_bulk_insert(),
- toast the tuples (maybe also bulk-insert into the TOAST table),
- take the exclusive lock on the buffer,
- insert the tuples quickly,
- log the whole page when it's full, or log the individual inserts (in one
  operation),
- release the lock.

> in the rdata chain. Alternatively, the hack to heap replay would involve
> making it accept multiple tuples reported under a single WAL record. I
> don't know which one would be easier.

This one is probably easier, since all your WAL records refer to the same
buffer.

> I'm sure it could be made to work. I haven't checked the replay code,
> but I doubt it would work on this massed record right out of the box.
> Or we could insert dummy headers between each tuple's WAL data.

For small rows, the header is as large as the row... might as well get rid
of it!
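To put rough numbers on the claim that per-tuple WAL headers dominate for small rows, here is a back-of-the-envelope sketch using only the figures from this thread (~55 bytes of record overhead per insert, 8192-byte heap pages); the helper functions are just illustration, not PostgreSQL code.

```python
# Rough arithmetic: per-tuple WAL logging vs. logging the whole page once.
# Figures from the discussion above: ~55 bytes of WAL insert record
# overhead per tuple, 8192-byte heap pages. Illustrative only.

RECORD_OVERHEAD = 55   # bytes of WAL record overhead per individual insert
PAGE_SIZE = 8192       # bytes written if we WAL the whole page instead

def per_tuple_wal(n_tuples, tuple_size):
    """WAL volume if every insert is logged individually."""
    return n_tuples * (RECORD_OVERHEAD + tuple_size)

def full_page_wal():
    """WAL volume if we log the filled page once."""
    return PAGE_SIZE

# Small rows: 50-byte tuples, so ~163 of them fit on a page.
n = PAGE_SIZE // 50
print(per_tuple_wal(n, 50))   # more than double full_page_wal(): the
print(full_page_wal())        # 55-byte header exceeds the 50-byte row
```

For 50-byte rows the header more than doubles the WAL volume, which is the point made above: logging the full page is smaller as well as faster.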
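The buffered-insert flow sketched above (minus locking and TOAST handling) might look like this; `TupleBuffer` and the callback are hypothetical names, and the 32 kB threshold is the one suggested in the text.

```python
# Sketch of the proposed TupleBuffer: accumulate tuples coming from the
# executor (or COPY) up to ~32 kB, then hand the whole batch to a
# bulk-insert callback (standing in for heap_bulk_insert). Names and
# structure are hypothetical; locking and toasting are omitted.

class TupleBuffer:
    LIMIT = 32 * 1024  # ~32 kB of buffered tuples, as suggested above

    def __init__(self, bulk_insert):
        self.bulk_insert = bulk_insert  # e.g. a heap_bulk_insert stand-in
        self.tuples = []
        self.size = 0

    def add(self, tup):
        """Buffer one tuple; flush when the batch is big enough."""
        self.tuples.append(tup)
        self.size += len(tup)
        if self.size >= self.LIMIT:
            self.flush()

    def flush(self):
        """Insert the whole batch in one operation, then reset."""
        if self.tuples:
            self.bulk_insert(self.tuples)
            self.tuples, self.size = [], 0

batches = []
buf = TupleBuffer(batches.append)
for _ in range(5000):
    buf.add(b"x" * 100)   # 5000 x 100-byte tuples
buf.flush()               # don't forget the final partial batch
print(len(batches))       # a handful of ~32 kB batches, not 5000 calls
```

The point of the design is that the exclusive buffer lock (and the single WAL operation) happens once per ~32 kB batch instead of once per tuple, and the executor is never called while the lock is held.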
Inserting dummy headers would not work. Here's why:

- The critical part (lock-wise) of XLogInsert is generating an LSN.
- The LSN calculation needs to verify that the header isn't split between
  pages.
- Therefore, if you want to insert N records in one operation, this
  critical part is going to take quite some time.
- And the lock should be held for as little time as possible...

> That seems very large, even for the high throughput set up you describe.
> Is the WAL background writer set at the default interval? (On the other
> hand, if 128MB is just a rounding error in your machine's total RAM size,
> why not be generous? On the other other hand, I've seen performance start
> to drop as wal_buffers gets too large, though I never bothered to chase
> down the cause.)
>
> At one point, I think that XLogInsert would synchronously write out the
> buffer when it noticed it was over half full, but that got ripped out (if
> I'm going to block, I might as well wait until it actually is full to do
> so). Now that there is a background writer, maybe it would be a good idea
> to have XLogInserters wake it up if the buffer is half full.
> ...

Here's what I think: currently, using big WAL buffers decreases performance
(while you'd expect it to increase it). Here's why.

Case 1) Consider these concurrent transactions:

- Transaction 1: do a big operation (VACUUM, COPY, etc.)
- Transaction 2: BEGIN; do a few little UPDATEs... COMMIT

Suppose you have large wal_buffers, like 32MB. By the time transaction 2
wants to commit and sync, transaction 1 has inserted lots of WAL into the
buffer, so transaction 2 finds it needs to write and fsync() something like
2x 16 MB of WAL segments. Needless to say, transaction 2's commit is going
to be pretty slow. With small buffers, all that stuff would already have
been in the OS cache, if not already on the disks.

Case 2) You just do a big COPY, or a big VACUUM that writes lots of xlog.
Suppose you're generating something like 50 MB/s of WAL.
You've set your wal_writer_delay to something very low, like 20 ms, so the
buffers are nicely emptied. However, at this throughput you're going
through one xlog segment every ~300 ms, so when XLogWrite() tries to
fsync() the segment in order to advance to the next one, the data to write
is in the OS cache, but the OS probably hasn't decided to write it to disk
yet. So your fsync() call is very expensive, and it is blocking. This means
that if your buffers are smaller than a WAL segment, the buffers are full
while you wait for the fsync(), and everything stands still.

Here's my current test setup:

- WAL on a dedicated disk (this is really good for performance unless
  you've got a battery-backed controller),
- wal_sync_method = open_sync (O_DSYNC),
- a very aggressive wal_writer_delay, like 5 ms.

Basically, all the WAL disk does is write WAL, so the head hardly moves at
all. The disk takes about 8 ms per rotation (so wal_writer_delay is shorter
than one rotation). Whenever there is WAL to write, the walwriter writes it
and blocks because of O_DSYNC mode. So, basically, at each disk rotation
the pending WAL is written, and fsync() is then a no-op. This gives
excellent, very smooth performance, whereas fsync() gives quite random, and
sometimes pretty high, wait times, since the length of an fsync() wait
depends on how much unflushed data sits in the OS buffers.

However, there is a problem with O_DSYNC:

- Latency is excellent, but throughput is miserable, because you can only
  write one chunk per disk rotation.
- If all you need is to write a few pages and sync them, it's perfect.
- If you want to write at high throughput, though, you'd better write data
  in LARGE chunks, ideally one WAL segment at a time.

So O_DSYNC is much better with large WAL buffers (>16 MB) and a small
wal_writer_delay, ideally delay < disk rotation time.

> Maybe there could be some hybrid approach. You take the spinlock, check
> that you could get the lwlock if you wanted to.
> If you could get the lwlock and know you have almost no work to do, then
> just do it while holding the spinlock. If you discover you have more work
> to do (xlog file switch, nontrivial AdvanceXLInsert, etc.) then actually
> take the lwlock and drop the spinlock. This *really* breaks the
> encapsulation of lwlock, so it is probably not much more than idle
> speculation.

That's a description of something that could look like a futex ;)

Anyway, I've made a little patch (attached). It applies to git revision
1384847ef8731718a79a32cd354e31c31c5294a0 of current postgres (from last
week). It probably applies to current HEAD, but I have to learn how to do
a merge in git before I can tell you that ;)

What it does (I've put extremely detailed comments in xlog.c; this is a
brief summary):

Since the critical part of XLogInsert is generating an LSN, not writing
the data, I have created an LSN generator, which is extremely fast and is
therefore protected by a spinlock. Using anything other than a spinlock
creates a nasty point of serialization and kills all the performance
gains.

Concurrent XLogInserts get an LSN from the LSN generator, and then insert
their data into the buffers concurrently, under a LW_SHARED WALInsertLock.

The buffer queue logic was entirely redone too. XLogWrite marks buffers as
free as soon as possible, so if you use 32 MB of wal_buffers, when
XLogWrite writes data the buffer space is reused immediately, while the
previous data is being fsync()ed.

To avoid writing partial WAL records to disk, we must be sure that all
records in the buffer are completely written. This is done by taking an
exclusive lock on WALInsertLock (which waits for all in-progress memory
writes to finish), taking a peek at the last written record, and releasing
the lock immediately.

I've also added an adaptive walwriter delay (it depends on the load). I've
added a lot of comments in xlog.c; check them out.
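The central idea of the patch, as summarized above, is that the only serialized step is reserving LSN space; the (comparatively slow) copy into the WAL buffers happens outside the lock, concurrently. Here is a toy model of that split, with a `threading.Lock` standing in for the spinlock; all names are illustrative, not PostgreSQL's.

```python
import threading

# Toy model of the patch's LSN generator: the only serialized step is
# reserving a byte range in the WAL stream; the data copy would happen
# outside the lock, concurrently. threading.Lock stands in for the
# spinlock; names are hypothetical, not PostgreSQL's.

class LSNGenerator:
    def __init__(self):
        self.lock = threading.Lock()  # the "spinlock"
        self.next_lsn = 0

    def reserve(self, nbytes):
        """Critical section: just bump the insert position."""
        with self.lock:
            start = self.next_lsn
            self.next_lsn += nbytes
            return start

gen = LSNGenerator()
reserved = []
rlock = threading.Lock()

def worker(record_sizes):
    for n in record_sizes:
        start = gen.reserve(n)    # short critical section
        # ... copy n bytes into the WAL buffer at 'start', unlocked,
        # under a shared WALInsertLock in the real patch ...
        with rlock:
            reserved.append((start, n))

threads = [threading.Thread(target=worker, args=([55 + 100] * 200,))
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()

# The reserved ranges must tile the log with no gaps or overlaps.
reserved.sort()
assert all(a + n == b for (a, n), (b, _) in zip(reserved, reserved[1:]))
print(gen.next_lsn)   # 4 * 200 * 155 bytes reserved in total
```

The final assertion is the invariant the real LSN generator must preserve: reservations are contiguous, so writers can fill their slots in any order without coordinating with each other.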
Here is the configuration I use for testing:

fsync = on
synchronous_commit = on or off, depending on the test
wal_sync_method = open_sync
full_page_writes = on
wal_buffers = 32MB
wal_writer_delay = 100ms (the adaptive walwriter delay lowers it if needed)

Benchmarks are parallel COPY and parallel INSERT INTO ... SELECT. However,
be careful with parallel INSERT INTO ... SELECT if all your threads insert
into the same table: you'll be benchmarking the FSM more than the WAL
itself...

On my setup, 8.5 is bound by the WALInsertLock, but with this patch my
disks are often maxed out, so I can't really tell. I had to put the tables
on a ramdisk. I get an almost 2x speedup on a parallel COPY of a table
with 9 INT fields; after that, the CPU is maxed out parsing integers... If
you have more than 4 cores and faster disks, I'd be very interested in
your results!

The patch will print some stuff to the console. If you see lots of
">>W>W>>>W", it means your WAL writing throughput is maxed out. "W" means
it had to flush buffers in XLogInsert (bad) instead of somewhere
non-critical like the walwriter; "<" and ">" mean the start and end of an
exclusive lock on WALInsertLock, taken because it had to flush.

Attached is a little Python benchmark script. Change the definitions for
where to put the files if you want to use it.

Regards,
Pierre