Thread: double writes using "double-write buffer" approach [WIP]
I've been prototyping the double-write buffer idea that Heikki and Simon had proposed (as an alternative to a previous patch that only batched up writes by the checkpointer). I think it is a good idea, and can help double-writes perform better in the case of lots of backend evictions. It also centralizes most of the code change in smgr.c. However, it is trickier to reason about.

The idea is that all page writes are generally copied to a double-write buffer, rather than being immediately written. Note that a full copy of the page is required, but it can be folded in with a checksum calculation. Periodically (e.g. every time a certain-size batch of writes has been added), some writes are pushed out using double writes -- the pages are first written and fsynced to a double-write file, then written to the data files, which are then fsynced. Double writes then allow for fixing torn pages, so full_page_writes can be turned off (thus greatly reducing the size of the WAL log).

The key changes are conceptually simple:

1. In smgrwrite(), copy the page to the double-write buffer. If a big enough batch has accumulated, then flush the batch using double writes. [I don't think I need to intercept calls to smgrextend(), but I am not totally sure.]

2. In smgrread(), always look first in the double-write buffer for a particular page, before going to disk.

3. At the end of a checkpoint and on shutdown, always make sure that the current contents of the double-write buffer are flushed.

4. Pass flags around in some cases to indicate whether a page buffer needs a double write or not. (I think eventually this would be an attribute of the buffer, set when the page is WAL-logged, rather than a flag passed around.)

5. Deal with duplicates in the double-write buffer appropriately (this very rarely happens).

To get good performance, I needed to have two double-write buffers, one for the checkpointer and one for all other processes. The double-write buffers are circular buffers.
The checkpointer double-write buffer is just a single batch of 64 pages; the non-checkpointer double-write buffer is 128 pages, 2 batches of 64 pages each. Each batch goes to a different double-write file, so that they can be issued independently as soon as each batch is completed. Also, I need to sort the buffers being checkpointed by file/offset (see ioseq.c), so that the checkpointer batches will most likely only have to write and fsync one data file.

Interestingly, I find that the plot of tpm for DBT2 is much smoother (though it still has wiggles) with double writes enabled, since there are no unpredictable long fsyncs at the end of (or during) a checkpoint.

Here are performance numbers for the double-write buffer (same configs as previous numbers), for 2-processor, 60-minute, 50-warehouse DBT2. On the right is the size of shared_buffers and the size of the RAM in the virtual machine. FPW stands for full_page_writes, DW for double_writes. 'two disk' means the WAL log is on a separate ext3 filesystem from the data files.

             FPW off   FPW on   DW on, FPW off
one disk:      15488    13146    11713          [5G buffers, 8G VM]
two disk:      18833    16703    18013

one disk:      12908    11159     9758          [3G buffers, 6G VM]
two disk:      14258    12694    11229

one disk:      10829     9865     5806          [1G buffers, 8G VM]
two disk:      13605    12694     5682

one disk:       6752     6129     4878          [1G buffers, 2G VM]
two disk:       7253     6677     5239

The performance of DW in the small cache cases (1G shared_buffers) is now much better, though still not as good as FPW on. In the medium cache case (3G buffers), where there are significant backend dirty evictions, the performance of DW is close to that of FPW on. In the large cache (5G buffers), where the checkpointer can do all the work and there are minimal dirty evictions, DW is much better than FPW in the two disk case. In the one disk case, it is somewhat worse than FPW.
However, interestingly, if you just move the double-write files to a separate ext3 filesystem on the same disk as the data files, the performance goes to 13107 -- now on par with FPW on. We are obviously getting hit by the ext3 fsync slowness issues. (I believe that an fsync on a filesystem can stall on other unrelated writes to the same filesystem.)

Let me know if you have any thoughts/comments, etc. The patch is enclosed, and the README.doublewrites is updated a fair bit.

Thanks,

Dan
On Fri, Jan 27, 2012 at 5:31 PM, Dan Scales <scales@vmware.com> wrote:
> I've been prototyping the double-write buffer idea that Heikki and Simon
> had proposed (as an alternative to a previous patch that only batched up
> writes by the checkpointer). I think it is a good idea, and can help
> double-writes perform better in the case of lots of backend evictions.
> It also centralizes most of the code change in smgr.c. However, it is
> trickier to reason about.

This doesn't compile on MacOS X, because there's no writev().

I don't understand how you can possibly get away with such small buffers. AIUI, you must retain every page in the double-write buffer until it's been written and fsync'd to disk. That means the most dirty data you'll ever be able to have in the operating system cache with this implementation is (128 + 64) * 8kB = 1.5MB. Granted, we currently have occasional problems with the OS caching too *much* dirty data, but that seems like it's going way, way too far in the opposite direction. That's barely enough for the system to do any write reordering at all. I am particularly worried about what happens when a ring buffer is in use.

I tried running "pgbench -i -s 10" with this patch applied, full_page_writes=off, double_writes=on. It took 41.2 seconds to complete. The same test with the stock code takes 14.3 seconds; and the actual situation is worse for double-writes than those numbers might imply, because the index build time doesn't seem to be much affected, while the COPY takes a small eternity with the patch compared to the usual way of doing things. I think the slowdown on COPY once the double-write buffer fills is on the order of 10x.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi Robert,

Thanks for the feedback! I think you make a good point about the small size of dirty data in the OS cache. I think what you can say about this double-write patch is that it will not work well for configurations that have a small Postgres cache and a large OS cache, since every write from the Postgres cache requires double-writes and an fsync. However, it should work much better for configurations with a much larger Postgres cache and relatively smaller OS cache (including the configurations that I've given performance results for). In that case, there is a lot more capacity for dirty pages in the Postgres cache, and you won't have nearly as many dirty evictions. The checkpointer is doing a good number of the writes, and this patch sorts the checkpointer's buffers so its IO is efficient.

Of course, I can also increase the size of the non-checkpointer ring buffer to be much larger, though I wouldn't want to make it too large, since it is consuming memory. If I increase the size of the ring buffers significantly, I will probably need to add some data structures so that the ring buffer lookups in smgrread() and smgrwrite() are more efficient.

Can you let me know what the shared_buffers and RAM sizes were for your pgbench run? I can try running the same workload. If the size of shared_buffers is especially small compared to RAM, then we should increase the size of shared_buffers when using double_writes.

Thanks,

Dan

----- Original Message -----
From: "Robert Haas" <robertmhaas@gmail.com>
To: "Dan Scales" <scales@vmware.com>
Cc: "PG Hackers" <pgsql-hackers@postgresql.org>
Sent: Thursday, February 2, 2012 7:19:47 AM
Subject: Re: [HACKERS] double writes using "double-write buffer" approach [WIP]
On Fri, Feb 3, 2012 at 3:14 PM, Dan Scales <scales@vmware.com> wrote:
> Thanks for the feedback! I think you make a good point about the small
> size of dirty data in the OS cache. I think what you can say about this
> double-write patch is that it will not work well for configurations that
> have a small Postgres cache and a large OS cache, since every write from
> the Postgres cache requires double-writes and an fsync.

The general guidance for setting shared_buffers these days is 25% of RAM up to a maximum of 8GB, so the configuration that you're describing as not optimal for this patch is the one normally used when running PostgreSQL. I've run across several cases where larger values of shared_buffers are a huge win, because the entire working set can then be accommodated in shared_buffers. But it's certainly not the case that all working sets fit.

And in this case, I think that's beside the point anyway. I had shared_buffers set to 8GB on a machine with much more memory than that, but the database created by pgbench -i -s 10 is about 156 MB, so the problem isn't that there is too little PostgreSQL cache available. The entire database fits in shared_buffers, with most of it left over. However, because of the BufferAccessStrategy stuff, pages start to get forced out to the OS pretty quickly. Of course, we could disable the BufferAccessStrategy stuff when double_writes is in use, but bear in mind that the reason we have it in the first place is to prevent cache thrashing effects. It would be imprudent of us to throw that out the window without replacing it with something else that would provide similar protection. And even if we did, that would just delay the day of reckoning. You'd be able to blast through and dirty the entirety of shared_buffers at top speed, but then as soon as you started replacing pages performance would slow to an utter crawl, just as it did here, only you'd need a bigger scale factor to trigger the problem.
The more general point here is that there are MANY aspects of PostgreSQL's design that assume that shared_buffers accounts for a relatively small percentage of system memory. Here's another one: we assume that backends that need temporary memory for sorts and hashes (i.e. work_mem) can just allocate it from the OS. If we were to start recommending setting shared_buffers to large percentages of the available memory, we'd probably have to rethink that. Most likely, we'd need some kind of in-core mechanism for allocating temporary memory from the shared memory segment.

And here's yet another one: we assume that it is better to recycle old WAL files and overwrite the contents rather than create new, empty ones, because we assume that the pages from the old files may still be present in the OS cache. We also rely on the fact that an evicted CLOG page can be pulled back in quickly without (in most cases) a disk access. We also rely on shared_buffers not being too large to avoid walloping the I/O controller too hard at checkpoint time - which is forcing some people to set shared_buffers much smaller than would otherwise be ideal.

In other words, even if setting shared_buffers to most of the available system memory would fix the problem I mentioned, it would create a whole bunch of new ones, many of them non-trivial. It may be a good idea to think about what we'd need to do to work efficiently in that sort of configuration, but there is going to be a very large amount of thinking, testing, and engineering that has to be done to make it a reality.

There's another issue here, too. The idea that we're going to write data to the double-write buffer only when we decide to evict the pages strikes me as a bad one.
We ought to proactively start dumping pages to the double-write area as soon as they're dirtied, and fsync them after every N pages, so that by the time we need to evict some page that requires a double-write, it's already durably on disk in the double-write buffer, and we can do the real write without having to wait.

It's likely that, to make this perform acceptably for bulk loads, you'll need the writes to the double-write buffer and the fsyncs of that buffer to be done by separate processes, so that one backend (the background writer, perhaps) can continue spooling additional pages to the double-write files while some other process (a new auxiliary process?) fsyncs the ones that are already full. Along with that, the page replacement algorithm probably needs to be adjusted to avoid evicting pages that need an as-yet-unfinished double-write like the plague, even to the extent of allowing the BufferAccessStrategy rings to grow if the double-writes can't be finished before the ring wraps around.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Thanks for the detailed followup. I do see how Postgres is tuned for having a bunch of memory available that is not in shared_buffers, both for the OS buffer cache and other memory allocations. However, Postgres seems to run fine in many "large shared_memory" configurations that I gave performance numbers for, including 5G shared_buffers for an 8G machine, 3G shared_buffers for a 6G machine, etc. There just has to be sufficient extra memory beyond the shared_buffers cache.

I think the pgbench run is pointing out a problem that this double_writes implementation has with BULK_WRITEs. As you point out, the BufferAccessStrategy for BULK_WRITEs will cause lots of dirty evictions. I'm not sure if there is a great solution that always works for that issue. However, I do notice that BULK_WRITE data isn't WAL-logged unless archiving/replication is happening. As I understand it, if the BULK_WRITE data isn't being WAL-logged, then it doesn't have to be double-written either. The BULK_WRITE data is not officially synced and committed until it is all written, so there doesn't have to be any torn-page protection for that data, which is why the WAL logging can be omitted.

The double-write implementation can be improved by marking each buffer if it doesn't need torn-page protection. These buffers would be those new pages that are explicitly not WAL-logged, even when full_page_writes is enabled. When such a buffer is eventually synced (perhaps because of an eviction), it would not be double-written. This would often avoid double-writes for BULK_WRITE, etc., especially since the administrator is often not archiving or doing replication when doing bulk loads.

However, overall, I think the idea is that double writes are an optional optimization. The user would only turn it on in existing configurations where it helps or only slightly hurts performance, and where greatly reducing the size of the WAL logs is beneficial.
It might also be especially beneficial when there is a small amount of FLASH or other kind of fast storage that the double-write files can be stored on.

Thanks,

Dan

----- Original Message -----
From: "Robert Haas" <robertmhaas@gmail.com>
To: "Dan Scales" <scales@vmware.com>
Cc: "PG Hackers" <pgsql-hackers@postgresql.org>
Sent: Friday, February 3, 2012 1:48:54 PM
Subject: Re: [HACKERS] double writes using "double-write buffer" approach [WIP]
On Sun, Feb 5, 2012 at 4:17 PM, Dan Scales <scales@vmware.com> wrote:
> Thanks for the detailed followup. I do see how Postgres is tuned for
> having a bunch of memory available that is not in shared_buffers, both
> for the OS buffer cache and other memory allocations. However, Postgres
> seems to run fine in many "large shared_memory" configurations that I
> gave performance numbers for, including 5G shared_buffers for an 8G
> machine, 3G shared_buffers for a 6G machine, etc. There just has to be
> sufficient extra memory beyond the shared_buffers cache.

I agree that you could probably set shared_buffers to 3GB on a 6GB machine and get decent performance - but would it be the optimal performance, and for what workload? To really figure out whether this patch is a win, you need to get the system optimally tuned for the unpatched sources (which we can't tell whether you've done, since you haven't posted the configuration settings or any comparative figures for different settings, or any details on which commit you tested against) and then get the system optimally tuned for the patched sources with double_writes=on, and then see whether there's a gain.

> I think the pgbench run is pointing out a problem that this double_writes
> implementation has with BULK_WRITEs. As you point out, the
> BufferAccessStrategy for BULK_WRITEs will cause lots of dirty evictions.

Bulk reads will have the same problem. Consider loading a bunch of data into a new table with COPY, and then scanning the table. The table scan will be a "bulk read" and every page will be dirtied setting hint bits. Another thing to worry about is vacuum, which also uses a BufferAccessStrategy. Greg Smith has done some previous benchmarking showing that when the kernel is too aggressive about flushing dirty data to disk, vacuum becomes painfully slow. I suspect this patch is going to have that problem in spades (but it would be good to test that).
Checkpoints might be a problem, too, since they flush a lot of dirty data, and that's going to require a lot of extra fsyncing with this implementation. It certainly seems that unless you have pg_xlog and the data separated and a battery-backed write cache for each, checkpoints might be really slow. I'm not entirely convinced they'll be fast even if you have all that (but it would be good to test that, too).

> I'm not sure if there is a great solution that always works for that
> issue. However, I do notice that BULK_WRITE data isn't WAL-logged unless
> archiving/replication is happening. As I understand it, if the
> BULK_WRITE data isn't being WAL-logged, then it doesn't have to be
> double-written either. The BULK_WRITE data is not officially synced and
> committed until it is all written, so there doesn't have to be any
> torn-page protection for that data, which is why the WAL logging can be
> omitted. The double-write implementation can be improved by marking each
> buffer if it doesn't need torn-page protection. These buffers would be
> those new pages that are explicitly not WAL-logged, even when
> full_page_writes is enabled. When such a buffer is eventually synced
> (perhaps because of an eviction), it would not be double-written. This
> would often avoid double-writes for BULK_WRITE, etc., especially since
> the administrator is often not archiving or doing replication when doing
> bulk loads.

I agree - this optimization seems like a must. I'm not sure that it's sufficient, but it certainly seems necessary. It's not going to help with VACUUM, though, so I think that case needs some careful looking at to determine how bad the regression is and what can be done to mitigate it. In particular, I note that I suggested an idea that might help in the final paragraph of my last email.

My general feeling about this patch is that it needs a lot more work before we should consider committing it.
Your tests so far overlook quite a few important problem cases (bulk loads, SELECT on large unhinted tables, vacuum speed, checkpoint duration, and others) and still mostly show it losing to full_page_writes, sometimes by large margins. Even in the one case where you got an 8% speedup, it's not really clear that the same speedup (or an even bigger one) couldn't have been gotten by some other kind of tuning. I think you really need to spend some more time thinking about how to blunt the negative impact on the cases where it hurts, and increase the benefit in the cases where it helps. The approach seems to have potential, but it seems way too immature to think about shipping it at this point. (You may have been thinking along similar lines, since I note that the patch is marked "WIP".)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Jan 28, 2012 at 7:31 AM, Dan Scales <scales@vmware.com> wrote:
> Let me know if you have any thoughts/comments, etc. The patch is
> enclosed, and the README.doublewrites is updated a fair bit.

ISTM that the double-write can prevent torn pages in neither the double-write file nor the data file during *base backup*, because both the double-write file and the data file can be backed up while being written. Is this right? To avoid the torn-page problem, we should write FPI to WAL during online backup even if the double-write has been committed?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
I don't know a lot about base backup, but it sounds like full_page_writes must be turned on for base backup, in order to deal with the inconsistent reads of pages (which you might call torn pages) that can happen when you back up the data files while the database is running. The relevant parts of the WAL log are then copied separately (and consistently) once the backup of the data files is done, and used to "recover" the database into a consistent state later.

So, yes, good point -- double writes cannot replace the functionality of full_page_writes for base backup. If double writes were in use, they might be automatically switched over to full page writes for the duration of the base backup. And the double-write file should not be part of the base backup.

Dan

----- Original Message -----
From: "Fujii Masao" <masao.fujii@gmail.com>
To: "Dan Scales" <scales@vmware.com>
Cc: "PG Hackers" <pgsql-hackers@postgresql.org>
Sent: Monday, February 6, 2012 3:08:15 AM
Subject: Re: [HACKERS] double writes using "double-write buffer" approach [WIP]
>> I think it is a good idea, and can help double-writes perform better in the case of lots of backend evictions. I don'tunderstand this point, because from the data in your mail, it appears that when shared buffers are less means when moreevictions can happen, the performance is less. ISTM that the performance is less incase shared buffers size is less because I/O might happen by the backend process which can degrade performance. Is there any problem if the double-write happens only by bgwriter or checkpoint. Something like whenever backend process has to evict the buffer, it will do same as you have described that write in a double-writebuffer, but bgwriter will check this double-buffer and flush from it. Also whenever any backend will see that the double buffer is more than 2/3rd or some threshhold value full it will tell bgwriterto flush from double-write buffer. This can ensure very less I/O by any backend. -----Original Message----- From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Dan Scales Sent: Saturday, January 28, 2012 4:02 AM To: PG Hackers Subject: [HACKERS] double writes using "double-write buffer" approach [WIP] I've been prototyping the double-write buffer idea that Heikki and Simon had proposed (as an alternative to a previous patchthat only batched up writes by the checkpointer). I think it is a good idea, and can help double-writes perform betterin the case of lots of backend evictions. It also centralizes most of the code change in smgr.c. However, it is trickier to reason about. The idea is that all page writes generally are copied to a double-write buffer, rather than being immediately written. Notethat a full copy of the page is required, but can folded in with a checksum calculation. Periodically (e.g. 
every time a certain-size batch of writes have been added), some writes are pushed out using double writes-- the pages are first written and fsynced to a double-write file, then written to the data files, which are then fsynced. Then double writes allow for fixing torn pages, so full_page_writes can be turned off (thus greatly reducing thesize of the WAL log). The key changes are conceptually simple: 1. In smgrwrite(), copy the page to the double-write buffer. If a big enough batch has accumulated, then flush the batchusing double writes. [I don't think I need to intercept calls to smgrextend(), but I am not totally sure.] 2. In smgrread(), always look first in the double-write buffer for a particular page, before going to disk. 3. At the end of a checkpoint and on shutdown, always make sure that the current contents of the double-write buffer areflushed. 4. Pass flags around in some cases to indicate whether a page buffer needs a double write or not. (I think eventuallythis would be an attribute of the buffer, set when the page is WAL-logged, rather than a flag passed around.) 5. Deal with duplicates in the double-write buffer appropriately (very rarely happens). To get good performance, I needed to have two double-write buffers, one for the checkpointer and one for all other processes. The double-write buffers are circular buffers. The checkpointer double-write buffer is just a single batch of64 pages; the non-checkpointer double-write buffer is 128 pages, 2 batches of 64 pages each. Each batch goes to a differentdouble-write file, so that they can be issued independently as soon as each batch is completed. Also, I need tosort the buffers being checkpointed by file/offset (see ioseq.c), so that the checkpointer batches will most likely onlyhave to write and fsync one data file. 
Interestingly, I find that the plot of tpm for DBT2 is much smoother (though it still has wiggles) with double writes enabled, since there are no unpredictable long fsyncs at the end of (or during) a checkpoint.

Here are performance numbers for the double-write buffer (same configs as previous numbers), for a 2-processor, 60-minute, 50-warehouse DBT2 run. On the right are the size of shared_buffers and the size of the RAM in the virtual machine. FPW stands for full_page_writes, DW for double_writes. 'two disk' means the WAL log is on a separate ext3 filesystem from the data files.

             FPW off   FPW on   DW on, FPW off
one disk:     15488    13146       11713         [5G buffers, 8G VM]
two disk:     18833    16703       18013

one disk:     12908    11159        9758         [3G buffers, 6G VM]
two disk:     14258    12694       11229

one disk:     10829     9865        5806         [1G buffers, 8G VM]
two disk:     13605    12694        5682

one disk:      6752     6129        4878         [1G buffers, 2G VM]
two disk:      7253     6677        5239

The performance of DW in the small-cache cases (1G shared_buffers) is now much better, though still not as good as with FPW on. In the medium-cache case (3G buffers), where there are significant backend dirty evictions, the performance of DW is close to that of FPW on. In the large-cache case (5G buffers), where the checkpointer can do all the work and there are minimal dirty evictions, DW is much better than FPW in the two-disk case.

In the one-disk case, it is somewhat worse than FPW. However, interestingly, if you just move the double-write files to a separate ext3 filesystem on the same disk as the data files, the performance goes to 13107 -- now on par with FPW on. We are obviously getting hit by the ext3 fsync slowness issues. (I believe that an fsync on a filesystem can stall on other unrelated writes to the same filesystem.)

Let me know if you have any thoughts/comments, etc. The patch is enclosed, and README.doublewrites is updated a fair bit.

Thanks,

Dan
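The read-side interception (step 2 in Dan's list) can be pictured as a small lookup that smgrread() tries before going to disk: scan the in-memory double-write buffer for a newer copy of the requested page. The structure and names below are hypothetical; with only 128 slots, a linear scan is cheap:

```c
/* Hypothetical sketch of checking the double-write buffer before a disk read. */
#include <assert.h>
#include <string.h>

#define DW_SLOTS 128
#define BLCKSZ 8192

typedef struct DwSlot
{
    int  valid;                /* slot holds a page not yet on disk */
    int  relfile;
    long offset;
    char page[BLCKSZ];
} DwSlot;

static DwSlot dw_ring[DW_SLOTS];   /* the circular double-write buffer */

/* Return 1 and copy the buffered page into dest if present; return 0 so the
 * caller falls through to a normal disk read otherwise. */
int
dw_lookup(int relfile, long offset, char *dest)
{
    for (int i = 0; i < DW_SLOTS; i++)
    {
        if (dw_ring[i].valid &&
            dw_ring[i].relfile == relfile &&
            dw_ring[i].offset == offset)
        {
            memcpy(dest, dw_ring[i].page, BLCKSZ);
            return 1;
        }
    }
    return 0;
}
```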
On 02/07/2012 12:09 AM, Dan Scales wrote:
> So, yes, good point -- double writes cannot replace the functionality of full_page_writes for base backup. If double writes were in use, they might be automatically switched over to full page writes for the duration of the base backup. And the double-write file should not be part of the base backup.

There is already a check for this sort of problem during the base backup. It forces full_page_writes on for the backup, even if the running configuration has it off. So long as double writes can be smoothly turned off and back on again, that same section of code can easily be made to handle this, too.

As for not making the double-write file part of the base backup, I was assuming that it would go into a subdirectory under pg_xlog by default. I would think that people who relocate pg_xlog using one of the methods for doing that would want the double-write buffer to move as well. And if it's inside pg_xlog, existing base backup scripts won't need to be changed -- the correct ones already exclude pg_xlog files.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
> Is there any problem if the double-write happens only in the bgwriter or checkpointer?
> Something like: whenever a backend process has to evict a buffer, it writes into the double-write buffer as you have described, but
> the bgwriter checks this buffer and flushes from it.
> Also, whenever any backend sees that the double-write buffer is more than 2/3rds full (or some threshold value), it tells the bgwriter to flush
> the double-write buffer.
> This can ensure very little I/O by any backend.

Yes, I think this is a good idea. I could make changes so that the backends hand off the responsibility of flushing batches of the double-write buffer to the bgwriter whenever possible. This would avoid some long I/O waits in the backends, though the backends may of course eventually wait anyway for the bgwriter if I/O is not fast enough. I did write the code so that any process can write a completed batch if that batch is not currently being flushed (so as to deal with crashes by backends). Having the backends flush the batches as they fill them up was just simpler for a first prototype.

Dan

----- Original Message -----
From: "Amit Kapila" <amit.kapila@huawei.com>
To: "Dan Scales" <scales@vmware.com>, "PG Hackers" <pgsql-hackers@postgresql.org>
Sent: Tuesday, February 7, 2012 1:08:49 AM
Subject: Re: [HACKERS] double writes using "double-write buffer" approach [WIP]
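The handoff policy discussed above -- a backend appends its evicted page to the shared double-write buffer and, once the buffer passes a fill threshold (2/3rds, per Amit's suggestion), asks the bgwriter to flush instead of doing the I/O itself -- reduces to a simple check. The names, counters, and in-process "signal" below are stand-ins for what would really live in shared memory with a latch or signal:

```c
/* Sketch of threshold-triggered bgwriter wakeup; all names hypothetical. */
#include <assert.h>

#define DW_SLOTS 128
#define DW_WAKE_THRESHOLD ((DW_SLOTS * 2) / 3)   /* 85 of 128 slots */

static int dw_fill;             /* occupied slots (would be in shared memory) */
static int bgwriter_requests;   /* stand-in for waking the bgwriter */

/* Called by a backend after it copies an evicted page into the buffer.
 * Returns 1 if the bgwriter was asked to flush a batch. */
int
dw_append_and_maybe_wake(void)
{
    dw_fill++;
    if (dw_fill >= DW_WAKE_THRESHOLD)
    {
        bgwriter_requests++;    /* real code would signal/SetLatch here */
        dw_fill = 0;            /* pretend the bgwriter drained the batch */
        return 1;
    }
    return 0;
}
```

Dan's point about backend crashes still applies: any process must be able to flush a completed batch if the bgwriter has not gotten to it.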
Dan,

I believe your double-write buffer approach is right, as it has the potential to avoid the latency that backends incur from full-page writes after a checkpoint. Although there is a chance that overall I/O will be higher in this case, if we can make sure that in most scenarios a backend never has to do I/O, it can show a performance improvement compared to full-page writes as well.

-----Original Message-----
From: Dan Scales [mailto:scales@vmware.com]
Sent: Thursday, February 09, 2012 5:30 AM
To: Amit Kapila
Cc: PG Hackers
Subject: Re: [HACKERS] double writes using "double-write buffer" approach [WIP]

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers