Re: double writes using "double-write buffer" approach [WIP] - Mailing list pgsql-hackers
From | Dan Scales |
---|---|
Subject | Re: double writes using "double-write buffer" approach [WIP] |
Date | |
Msg-id | 1871024608.1144384.1328476635051.JavaMail.root@zimbra-prod-mbox-4.vmware.com |
In response to | Re: double writes using "double-write buffer" approach [WIP] (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: double writes using "double-write buffer" approach [WIP]
|
List | pgsql-hackers |
Thanks for the detailed followup. I do see how Postgres is tuned for having a bunch of memory available that is not in shared_buffers, both for the OS buffer cache and other memory allocations. However, Postgres seems to run fine in many "large shared_memory" configurations that I gave performance numbers for, including 5G shared_buffers for an 8G machine, 3G shared_buffers for a 6G machine, etc. There just has to be sufficient extra memory beyond the shared_buffers cache.

I think the pgbench run is pointing out a problem that this double_writes implementation has with BULK_WRITEs. As you point out, the BufferAccessStrategy for BULK_WRITEs will cause lots of dirty evictions. I'm not sure if there is a great solution that always works for that issue. However, I do notice that BULK_WRITE data isn't WAL-logged unless archiving/replication is happening. As I understand it, if the BULK_WRITE data isn't being WAL-logged, then it doesn't have to be double-written either. The BULK_WRITE data is not officially synced and committed until it is all written, so there doesn't have to be any torn-page protection for that data, which is why the WAL logging can be omitted.

The double-write implementation can be improved by marking each buffer if it doesn't need torn-page protection. These buffers would be those new pages that are explicitly not WAL-logged, even when full_page_writes is enabled. When such a buffer is eventually synced (perhaps because of an eviction), it would not be double-written. This would often avoid double-writes for BULK_WRITE, etc., especially since the administrator is often not archiving or doing replication when doing bulk loads.

However, overall, I think the idea is that double writes are an optional optimization. The user would only turn it on in existing configurations where it helps or only slightly hurts performance, and where greatly reducing the size of the WAL logs is beneficial.
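The buffer-marking idea above can be illustrated with a toy model. This is only a sketch in Python, not code from the patch: `Buffer`, `Storage`, `flush_buffer`, and the `needs_torn_page_protection` flag are hypothetical names invented here, and the lists and dicts merely stand in for the double-write file and the data file.

```python
from dataclasses import dataclass, field


@dataclass
class Buffer:
    page_id: int
    data: bytes
    dirty: bool = True
    # Hypothetical flag: False for new pages that were deliberately
    # not WAL-logged (for example, bulk-loaded pages without archiving).
    needs_torn_page_protection: bool = True


@dataclass
class Storage:
    # Stand-ins for the double-write file and the real data file.
    double_write_file: list = field(default_factory=list)
    data_file: dict = field(default_factory=dict)
    fsyncs: int = 0


def flush_buffer(buf: Buffer, storage: Storage) -> None:
    """Write one dirty buffer, double-writing only when it is marked as needing it."""
    if not buf.dirty:
        return
    if buf.needs_torn_page_protection:
        # The double-write copy has to be durable before the real write.
        storage.double_write_file.append((buf.page_id, buf.data))
        storage.fsyncs += 1
    storage.data_file[buf.page_id] = buf.data
    buf.dirty = False
```

In this model, a page flagged as not needing protection goes straight to the data file and costs no extra write or fsync, which is the saving the marking is meant to deliver for bulk loads.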
It might also be especially beneficial when there is a small amount of FLASH or other kind of fast storage that the double-write files can be stored on.

Thanks,

Dan

----- Original Message -----
From: "Robert Haas" <robertmhaas@gmail.com>
To: "Dan Scales" <scales@vmware.com>
Cc: "PG Hackers" <pgsql-hackers@postgresql.org>
Sent: Friday, February 3, 2012 1:48:54 PM
Subject: Re: [HACKERS] double writes using "double-write buffer" approach [WIP]

On Fri, Feb 3, 2012 at 3:14 PM, Dan Scales <scales@vmware.com> wrote:
> Thanks for the feedback! I think you make a good point about the small size of dirty data in the OS cache. I think what you can say about this double-write patch is that it will not work well for configurations that have a small Postgres cache and a large OS cache, since every write from the Postgres cache requires double-writes and an fsync.

The general guidance for setting shared_buffers these days is 25% of RAM up to a maximum of 8GB, so the configuration that you're describing as not optimal for this patch is the one normally used when running PostgreSQL. I've run across several cases where larger values of shared_buffers are a huge win, because the entire working set can then be accommodated in shared_buffers. But it's certainly not the case that all working sets fit.

And in this case, I think that's beside the point anyway. I had shared_buffers set to 8GB on a machine with much more memory than that, but the database created by pgbench -i -s 10 is about 156 MB, so the problem isn't that there is too little PostgreSQL cache available. The entire database fits in shared_buffers, with most of it left over. However, because of the BufferAccessStrategy stuff, pages start to get forced out to the OS pretty quickly.

Of course, we could disable the BufferAccessStrategy stuff when double_writes is in use, but bear in mind that the reason we have it in the first place is to prevent cache trashing effects.
It would be imprudent of us to throw that out the window without replacing it with something else that would provide similar protection. And even if we did, that would just delay the day of reckoning. You'd be able to blast through and dirty the entirety of shared_buffers at top speed, but then as soon as you started replacing pages performance would slow to an utter crawl, just as it did here, only you'd need a bigger scale factor to trigger the problem.

The more general point here is that there are MANY aspects of PostgreSQL's design that assume that shared_buffers accounts for a relatively small percentage of system memory. Here's another one: we assume that backends that need temporary memory for sorts and hashes (i.e. work_mem) can just allocate it from the OS. If we were to start recommending setting shared_buffers to large percentages of the available memory, we'd probably have to rethink that. Most likely, we'd need some kind of in-core mechanism for allocating temporary memory from the shared memory segment.

And here's yet another one: we assume that it is better to recycle old WAL files and overwrite the contents rather than create new, empty ones, because we assume that the pages from the old files may still be present in the OS cache. We also rely on the fact that an evicted CLOG page can be pulled back in quickly without (in most cases) a disk access. We also rely on shared_buffers not being too large to avoid walloping the I/O controller too hard at checkpoint time - which is forcing some people to set shared_buffers much smaller than would otherwise be ideal.

In other words, even if setting shared_buffers to most of the available system memory would fix the problem I mentioned, it would create a whole bunch of new ones, many of them non-trivial.
It may be a good idea to think about what we'd need to do to work efficiently in that sort of configuration, but there is going to be a very large amount of thinking, testing, and engineering that has to be done to make it a reality.

There's another issue here, too. The idea that we're going to write data to the double-write buffer only when we decide to evict the pages strikes me as a bad one. We ought to proactively start dumping pages to the double-write area as soon as they're dirtied, and fsync them after every N pages, so that by the time we need to evict some page that requires a double-write, it's already durably on disk in the double-write buffer, and we can do the real write without having to wait.

It's likely that, to make this perform acceptably for bulk loads, you'll need the writes to the double-write buffer and the fsyncs of that buffer to be done by separate processes, so that one backend (the background writer, perhaps) can continue spooling additional pages to the double-write files while some other process (a new auxiliary process?) fsyncs the ones that are already full. Along with that, the page replacement algorithm probably needs to be adjusted to avoid evicting pages that need an as-yet-unfinished double-write like the plague, even to the extent of allowing the BufferAccessStrategy rings to grow if the double-writes can't be finished before the ring wraps around.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
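The proactive scheme proposed in the quoted message (spool pages to the double-write area as they are dirtied, fsync after every N pages, and prefer eviction victims whose double-write is already durable) can be sketched as a single-process toy model. Everything here is an assumption made for illustration: `DoubleWriteSpool`, `pick_victim`, and the `fsync_every` parameter are hypothetical names, and the separate writer and fsync processes suggested in the message are collapsed into ordinary method calls.

```python
class DoubleWriteSpool:
    """Toy model: tracks which pages have a durable double-write copy."""

    def __init__(self, fsync_every: int):
        self.fsync_every = fsync_every  # the "N" in "fsync after every N pages"
        self.pending = []               # spooled to the double-write file, not yet fsynced
        self.durable = set()            # pages whose double-write copy is on disk
        self.fsync_count = 0

    def page_dirtied(self, page_id: int) -> None:
        # New contents invalidate any older durable copy of this page.
        self.durable.discard(page_id)
        if page_id not in self.pending:
            self.pending.append(page_id)
        if len(self.pending) >= self.fsync_every:
            self.fsync()

    def fsync(self) -> None:
        if not self.pending:
            return
        self.durable.update(self.pending)
        self.pending.clear()
        self.fsync_count += 1

    def can_evict_without_waiting(self, page_id: int) -> bool:
        return page_id in self.durable


def pick_victim(candidates, spool):
    """Return the first candidate whose double-write is already durable.

    Returning None models the case where the caller would rather grow
    its ring than evict a page with an unfinished double-write.
    """
    for page_id in candidates:
        if spool.can_evict_without_waiting(page_id):
            return page_id
    return None
```

For example, with `fsync_every=2`, dirtying pages 1, 2, and 3 triggers one fsync covering pages 1 and 2, so `pick_victim([3, 1, 2], spool)` skips page 3 and returns page 1.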