Thread: Load distributed checkpoint V3
Folks, Here is the latest version of the Load distributed checkpoint patch. I've fixed some bugs, including missing-file errors and overlapping asynchronous checkpoint requests. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews and approves it.

---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:
> Folks,
>
> Here is the latest version of Load distributed checkpoint patch.
>
> I've fixed some bugs, including in cases of missing file errors
> and overlapping of asynchronous checkpoint requests.
>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center

-- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Fri, 23 Mar 2007, ITAGAKI Takahiro wrote:
> Here is the latest version of Load distributed checkpoint patch.

Couple of questions for you:

-Is it still possible to get the original behavior by adjusting your tunables? It would be nice to do a before/after without having to recompile, and I know I'd be concerned about something so different becoming the new default behavior.

-Can you suggest a current test case to demonstrate the performance improvement here? I've tried several variations on stretching out checkpoints like you're doing here and they all made slow checkpoint issues even worse on my Linux system. I'm trying to evaluate this fairly.

-This code operates on the assumption you have a good value for the checkpoint timeout. Have you tested its behavior when checkpoints are being triggered by checkpoint_segments being reached instead?

-- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith <gsmith@gregsmith.com> wrote:

> > Here is the latest version of Load distributed checkpoint patch.
>
> Couple of questions for you:
>
> -Is it still possible to get the original behavior by adjusting your
> tunables? It would be nice to do a before/after without having to
> recompile, and I know I'd be concerned about something so different
> becoming the new default behavior.

Yes, if you want the original behavior, please set all of checkpoint_[write|nap|sync]_percent to zero. They can be changed at SIGHUP timing (pg_ctl reload). The new default configuration is write/nap/sync = 50%/10%/20%. There might be room for discussion in the choice of those values.

> -Can you suggest a current test case to demonstrate the performance
> improvement here? I've tried several variations on stretching out
> checkpoints like you're doing here and they all made slow checkpoint
> issues even worse on my Linux system. I'm trying to evaluate this fairly.

You might need to increase checkpoint_segments and checkpoint_timeout. Here are the results on my machine: http://archives.postgresql.org/pgsql-hackers/2007-02/msg01613.php I set the values to 32 segments and 15 minutes to take advantage of it for the pgbench -s100 case there.

> -This code operates on the assumption you have a good value for the
> checkpoint timeout. Have you tested its behavior when checkpoints are
> being triggered by checkpoint_segments being reached instead?

This patch does not work fully when checkpoints are triggered by segments. The write phase still works because it refers to the consumption of segments, but the nap and fsync phases only check the amount of elapsed time. I'm assuming checkpoints are triggered by timeout in normal use -- and that's my recommended configuration whether the patch is installed or not.

Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
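For readers who want to try the before/after comparison Greg asks about, here is an illustrative postgresql.conf excerpt. The checkpoint_*_percent names come from the patch itself; the exact value syntax and the comments are assumptions, not taken from its documentation:

    # Load distributed checkpoint settings (reloadable via "pg_ctl reload" / SIGHUP)
    checkpoint_write_percent = 50   # spread buffer writes over 50% of the interval
    checkpoint_nap_percent   = 10   # pause between the write and fsync phases
    checkpoint_sync_percent  = 20   # spread fsync() calls over 20% of the interval

    # Setting all three to zero restores the original, unpatched behavior:
    #checkpoint_write_percent = 0
    #checkpoint_nap_percent   = 0
    #checkpoint_sync_percent  = 0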
On Mon, 26 Mar 2007, ITAGAKI Takahiro wrote: > I'm assuming checkpoints are triggered by timeout in normal use -- and > it's my recommended configuration whether the patch is installed or not. I'm curious what other people running fairly serious hardware do in this area for write-heavy loads, whether it's timeout or segment limits that normally trigger their checkpoints. I'm testing on a slightly different class of machine than your sample results, something that is in the 1500 TPS range running the pgbench test you describe. Running that test, I always hit the checkpoint_segments wall well before any reasonable timeout. With 64 segments, I get a checkpoint every two minutes or so. There's something I'm working on this week that may help out other people trying to test your patch out. I've put together some simple scripts that graph (patched) pgbench results, which make it very easy to see what changes when you alter the checkpoint behavior. Edges are still rough but the scripts work for me, will be polishing and testing over the next few days: http://www.westnet.com/~gsmith/content/postgresql/pgbench.htm (Note that the example graphs there aren't from the production system I mentioned above, they're from my server at home, which is similar to the system your results came from). -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
ITAGAKI Takahiro wrote: > Here is the latest version of Load distributed checkpoint patch. Unfortunately because of the recent instrumentation and CheckpointStartLock patches this patch doesn't apply cleanly to CVS HEAD anymore. Could you fix the bitrot and send an updated patch, please? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
ITAGAKI Takahiro wrote:
> Here is the latest version of Load distributed checkpoint patch.

Bgwriter has two goals:
1. keep enough buffers clean that normal backends never need to do a write
2. smooth checkpoints by writing buffers ahead of time

Load distributed checkpoints will do 2. in a much better way than the bgwriter_all_* guc options. I think we should remove that aspect of bgwriter in favor of this patch.

The scheduling of bgwriter gets quite complicated with the patch. If I'm reading it correctly, bgwriter will keep periodically writing buffers to achieve 1. while the "write" phase of the checkpoint is in progress. That makes sense; now that checkpoints take longer, we would miss goal 1. otherwise. But we don't do that in the "sleep-between-write-and-fsync" and "fsync" phases. We should, shouldn't we?

I'd suggest rearranging the code so that BgBufferSync and mdsync would basically stay like they are without the patch; the signature wouldn't change. To do the naps during a checkpoint, inject calls to new functions like CheckpointWriteNap() and CheckpointFsyncNap() inside BgBufferSync and mdsync. Those nap functions would check if enough progress has been made since the last call and sleep if so. The piece of code that implements 1. would be refactored into a new function, let's say BgWriteLRUBuffers(). The nap functions would call BgWriteLRUBuffers if more than bgwriter_delay milliseconds have passed since the last call to it. This way the changes to CreateCheckpoint, BgBufferSync and mdsync would be minimal, and bgwriter would keep cleaning buffers for normal backends during the whole checkpoint.

Another thought is to have a separate checkpointer process so that the bgwriter process can keep cleaning dirty buffers while the checkpoint is running in a separate process. One problem with that is that we currently collect all the fsync requests in bgwriter. If we had a separate checkpointer process, we'd need to do that in the checkpointer instead, and bgwriter would need to send a message to the checkpointer every time it flushes a buffer, which would be a lot of chatter. Alternatively, bgwriter could somehow pass the pendingOpsTable to the checkpointer process at the beginning of the checkpoint, but that's not exactly trivial either.

PS. Great that you're working on this. It's a serious problem under heavy load.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
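To make the suggestion concrete, here is a rough sketch of what such a nap function might look like. Only the names CheckpointWriteNap() and BgWriteLRUBuffers() come from the mail above; IsCheckpointOnSchedule() and everything else in the body is an assumption, not code from the patch:

    #include "postgres.h"
    #include "miscadmin.h"          /* pg_usleep() */
    #include "utils/timestamp.h"    /* TimestampTz helpers */

    /* Hypothetical helpers -- not part of the actual patch. */
    extern bool IsCheckpointOnSchedule(void);   /* progress vs. timeout/segments */
    extern void BgWriteLRUBuffers(void);        /* refactored LRU cleaning pass */
    extern int  BgWriterDelay;                  /* bgwriter_delay, in milliseconds */

    /*
     * Called periodically from inside BufferSync()/mdsync() while a checkpoint
     * is in progress: keep the LRU cleaning alive (goal 1) and sleep only when
     * the checkpoint is ahead of schedule.
     */
    static void
    CheckpointWriteNap(void)
    {
        static TimestampTz last_lru_pass = 0;
        TimestampTz now = GetCurrentTimestamp();

        /* Run the normal LRU cleaning pass if bgwriter_delay has elapsed. */
        if (TimestampDifferenceExceeds(last_lru_pass, now, BgWriterDelay))
        {
            BgWriteLRUBuffers();
            last_lru_pass = now;
        }

        /* Nap only if enough progress has been made since the last call. */
        if (IsCheckpointOnSchedule())
            pg_usleep(BgWriterDelay * 1000L);
    }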
On Thu, 5 Apr 2007, Heikki Linnakangas wrote: > Unfortunately because of the recent instrumentation and CheckpointStartLock > patches this patch doesn't apply cleanly to CVS HEAD anymore. Could you fix > the bitrot and send an updated patch, please? The "Logging checkpoints and other slowdown causes" patch I submitted touches some of the same code as well, that's another possible merge coming depending on what order this all gets committed in. Running into what I dubbed perpetual checkpoints was one of the reasons I started logging timing information for the various portions of the checkpoint, to tell when it was bogged down with slow writes versus being held up in sync for various (possibly fixed with your CheckpointStartLock) issues. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Thu, 5 Apr 2007, Heikki Linnakangas wrote:
> Bgwriter has two goals:
> 1. keep enough buffers clean that normal backends never need to do a write
> 2. smooth checkpoints by writing buffers ahead of time
> Load distributed checkpoints will do 2. in a much better way than the
> bgwriter_all_* guc options. I think we should remove that aspect of bgwriter
> in favor of this patch.

My first question about the LDC patch was whether I could turn it off and return to the existing mechanism. I would like to see a large pile of data proving this new approach is better before the old one goes away. I think everyone needs to do some more research and measurement here before assuming the problem can be knocked out so easily. The reason I've been busy working on patches to gather statistics on this area of code is that I've tried most simple answers to getting the background writer to work better and made little progress, and I'd like to see everyone else doing the same at least collecting the right data.

Let me suggest a different way of looking at this problem. At any moment, some percentage of your buffer pool is dirty. Whether it's 0% or 100% dramatically changes what the background writer should be doing. Whether most of the data is usage_count>0 or not also makes a difference. None of the current code has any idea what type of buffer pool they're working with, and therefore they don't have enough information to make a well-informed prediction about what is going to happen in the near future.

I'll tell you what I did to the all-scan. I ran a few hundred hours worth of background writer tests to collect data on what it does wrong, then wrote a prototype automatic background writer that resets the all-scan parameters based on what I found. It keeps a running estimate of how dirty the pool at large is using a weighted average of the most recent scan with the past history. From there, I have a simple model that predicts how much of the buffer we can scan in any interval, and intends to enforce a maximum bound on the amount of physical I/O you're willing to stream out. The beta code is sitting at http://www.westnet.com/~gsmith/content/postgresql/bufmgr.c if you want to see what I've done so far. The parts that are done work fine--as long as you give it a reasonable % to scan by default, it will correct all_max_pages and the interval in real time to meet the scan rate you requested given how much is currently dirty; the I/O rate is computed but doesn't limit properly yet.

Why haven't I brought this all up yet? Two reasons. The first is because it doesn't work on my system; checkpoints and overall throughput get worse when you try to shorten them by running the background writer at optimal aggressiveness. Under really heavy load, the writes slow down as all the disk caches fill, the background writer fights with reads on the data that isn't in the mostly dirty cache (introducing massive seek delays), it stops cleaning effectively, and it's better for it to not even try. My next generation of code was going to start with the LRU flush and then only move onto the all-scan if there's time leftover.

The second is that I just started to get useful results here in the last few weeks, and I assumed it's too big of a topic to start suggesting major redesigns to the background writer mechanism at that point (from me at least!). I was waiting for 8.3 to freeze before even trying. If you want to push through a redesign there, maybe you can get away with it at this late moment. But I ask that you please don't remove anything from the current design until you have significant test results to back up that change.

-- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
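As a rough illustration of the kind of auto-tuning Greg describes (this is not the bufmgr.c prototype linked above, just a sketch with invented names and weights): keep a smoothed estimate of how dirty the pool is, then derive how many pages the all-scan may examine this round while staying under an I/O budget.

    /* Sketch only -- names and weights are assumptions, not taken from the
     * prototype linked above. */
    typedef struct AllScanTuning
    {
        double  dirty_fraction;     /* smoothed estimate: dirty buffers / NBuffers */
        int     io_budget_pages;    /* max pages we are willing to write per round */
    } AllScanTuning;

    /* Fold the result of the latest scan into the running estimate. */
    static void
    update_dirty_estimate(AllScanTuning *t, int scanned, int found_dirty)
    {
        double  latest = (scanned > 0) ? (double) found_dirty / scanned : 0.0;

        /* Weighted average: the most recent scan counts for 25%, history for 75%. */
        t->dirty_fraction = 0.75 * t->dirty_fraction + 0.25 * latest;
    }

    /* How many buffers may the all-scan examine without exceeding the I/O budget? */
    static int
    all_scan_pages_this_round(const AllScanTuning *t)
    {
        if (t->dirty_fraction <= 0.0)
            return t->io_budget_pages;      /* pool looks clean: scanning is cheap */

        /* Expect roughly dirty_fraction writes per page scanned. */
        return (int) (t->io_budget_pages / t->dirty_fraction);
    }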
Greg Smith wrote:
> On Thu, 5 Apr 2007, Heikki Linnakangas wrote:
>
>> Bgwriter has two goals:
>> 1. keep enough buffers clean that normal backends never need to do a
>> write
>> 2. smooth checkpoints by writing buffers ahead of time
>> Load distributed checkpoints will do 2. in a much better way than the
>> bgwriter_all_* guc options. I think we should remove that aspect of
>> bgwriter in favor of this patch.
>
> ...
>
> Let me suggest a different way of looking at this problem. At any
> moment, some percentage of your buffer pool is dirty. Whether it's 0%
> or 100% dramatically changes what the background writer should be
> doing. Whether most of the data is usage_count>0 or not also makes a
> difference. None of the current code has any idea what type of buffer
> pool they're working with, and therefore they don't have enough
> information to make a well-informed prediction about what is going to
> happen in the near future.

The purpose of the bgwriter_all_* settings is to shorten the duration of the eventual checkpoint. The reason to shorten the checkpoint duration is to limit the damage to other I/O activity it causes. My thinking is that assuming the LDC patch is effective (agreed, needs more testing) at smoothening the checkpoint, the duration doesn't matter anymore. Do you want to argue there's other reasons to shorten the checkpoint duration?

> I'll tell you what I did to the all-scan. I ran a few hundred hours
> worth of background writer tests to collect data on what it does wrong,
> then wrote a prototype automatic background writer that resets the
> all-scan parameters based on what I found. It keeps a running estimate
> of how dirty the pool at large is using a weighted average of the most
> recent scan with the past history. From there, I have a simple model
> that predicts how much of the buffer we can scan in any interval, and
> intends to enforce a maximum bound on the amount of physical I/O you're
> willing to stream out. The beta code is sitting at
> http://www.westnet.com/~gsmith/content/postgresql/bufmgr.c if you want
> to see what I've done so far. The parts that are done work fine--as
> long as you give it a reasonable % to scan by default, it will correct
> all_max_pages and the interval in real time to meet the scan rate you
> requested given how much is currently dirty; the I/O rate is computed
> but doesn't limit properly yet.

Nice. Enforcing a max bound on the I/O seems reasonable, if we accept that shortening the checkpoint is a goal.

> Why haven't I brought this all up yet? Two reasons. The first is
> because it doesn't work on my system; checkpoints and overall throughput
> get worse when you try to shorten them by running the background writer
> at optimal aggressiveness. Under really heavy load, the writes slow
> down as all the disk caches fill, the background writer fights with
> reads on the data that isn't in the mostly dirty cache (introducing
> massive seek delays), it stops cleaning effectively, and it's better for
> it to not even try. My next generation of code was going to start with
> the LRU flush and then only move onto the all-scan if there's time
> leftover.
>
> The second is that I just started to get useful results here in the last
> few weeks, and I assumed it's too big of a topic to start suggesting
> major redesigns to the background writer mechanism at that point (from
> me at least!). I was waiting for 8.3 to freeze before even trying.
> If you want to push through a redesign there, maybe you can get away with
> it at this late moment. But I ask that you please don't remove anything
> from the current design until you have significant test results to back
> up that change.

Point taken. I need to start testing the LDC patch.

Since we're discussing this, let me tell you what I've been thinking about the LRU cleaning behavior of bgwriter. ISTM that that's more straightforward to tune automatically. Bgwriter basically needs to ensure that the next X buffers with usage_count=0 in the clock sweep are clean. X is the predicted number of buffers backends will evict until the next bgwriter round. The number of buffers evicted by normal backends in a bgwriter_delay period is simple to keep track of, just increase a counter in StrategyGetBuffer and reset it when bgwriter wakes up. We can use that as an estimate of X with some safety margin.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
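A minimal sketch of that counter approach, under assumed names (this is an illustration only; a patch implementing the real thing is mentioned later in the thread):

    /* Illustration only -- names and locking are assumptions. */
    #include "postgres.h"

    /* Buffers claimed by backends since the last bgwriter round; assumed to be
     * protected by the lock that already serializes StrategyGetBuffer(). */
    static int StrategyNumBackendAllocs = 0;

    /* Called from StrategyGetBuffer() each time a backend claims a buffer. */
    void
    CountBackendBufferAlloc(void)
    {
        StrategyNumBackendAllocs++;
    }

    /*
     * Called once per bgwriter_delay from the bgwriter loop: read and reset the
     * counter, then clean at least that many buffers ahead of the clock sweep,
     * plus a safety margin.
     */
    int
    EstimateBuffersToClean(void)
    {
        int     recent = StrategyNumBackendAllocs;

        StrategyNumBackendAllocs = 0;

        /* 10% headroom as a safety margin (arbitrary for this sketch). */
        return recent + recent / 10 + 1;
    }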
Heikki Linnakangas <heikki@enterprisedb.com> writes: > The number of buffers evicted by normal backends in a bgwriter_delay > period is simple to keep track of, just increase a counter in > StrategyGetBuffer and reset it when bgwriter wakes up. We can use that > as an estimate of X with some safety margin. You'd want some kind of moving-average smoothing in there, probably with a lot shorter ramp-up than ramp-down time constant, but this seems reasonable enough to try. regards, tom lane
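Tom's smoothing suggestion could look something like the following (purely illustrative; the coefficients are invented): react quickly when the eviction rate rises, decay slowly when it falls.

    /* Sketch of asymmetric moving-average smoothing: fast ramp-up on increases,
     * slow ramp-down on decreases.  The smoothing constants are examples only. */
    static double
    smooth_eviction_estimate(double prev_estimate, int evictions_this_round)
    {
        double  alpha = (evictions_this_round > prev_estimate) ? 0.5 : 0.05;

        return prev_estimate + alpha * (evictions_this_round - prev_estimate);
    }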
Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: >> The number of buffers evicted by normal backends in a bgwriter_delay >> period is simple to keep track of, just increase a counter in >> StrategyGetBuffer and reset it when bgwriter wakes up. We can use that >> as an estimate of X with some safety margin. > > You'd want some kind of moving-average smoothing in there, probably with > a lot shorter ramp-up than ramp-down time constant, but this seems > reasonable enough to try. Ironically, I just noticed that we already have a patch in the patch queue that implements exactly that, again by Itagaki. I need to start paying more attention :-). Keep up the good work! -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Hello, long time no see. I'm sorry to interrupt your discussion.

I'm afraid the code is getting more complicated in order to keep using fsync(). Though I don't intend to say the current approach is wrong, could anyone evaluate again the O_SYNC approach that commercial databases use, and tell me if and why PostgreSQL's fsync() approach is better than theirs? This January, I got a good result with O_SYNC, which I haven't reported here. I'll show it briefly. Please forgive me for my abrupt email, because I don't have enough time. # Personally, I want to work in the community, if I'm allowed.

And sorry again. I reported that O_SYNC resulted in very bad performance last year. But that was wrong. The PC server I borrowed was configured so that all the disks formed one RAID5 device. So, the disks for data and WAL (/dev/sdd and /dev/sde) came from the same RAID5 device, resulting in I/O conflict.

What I modified is md.c only. I just added O_SYNC to the open flags in mdopen() and _mdfd_openseg() if am_bgwriter is true. I didn't want backends to use O_SYNC because mdextend() does not have to transfer data to disk.

My evaluation environment was:
CPU: Intel Xeon 3.2GHz * 2 (HT on)
Memory: 4GB
Disk: Ultra320 SCSI (perhaps configured as write back)
OS: RHEL3.0 Update 6
Kernel: 2.4.21-37.ELsmp
PostgreSQL: 8.2.1

The relevant settings of PostgreSQL are:
shared_buffers = 2GB
wal_buffers = 1MB
wal_sync_method = open_sync
checkpoint_* and bgwriter_* parameters are left at their defaults.

I used pgbench, with the data of scaling factor 50.

[without O_SYNC, original behavior]
- pgbench -c1 -t16000: best response 1ms, worst response 6314ms, 10th worst response 427ms, tps 318
- pgbench -c32 -t500: best response 1ms, worst response 8690ms, 10th worst response 8668ms, tps 330

[with O_SYNC]
- pgbench -c1 -t16000: best response 1ms, worst response 350ms, 10th worst response 91ms, tps 427
- pgbench -c32 -t500: best response 1ms, worst response 496ms, 10th worst response 435ms, tps 1117

If the write-back cache were disabled, the difference would be smaller. The Windows version showed similar improvements.

However, this approach has two big problems.

(1) It slows down bulk updates

Updates of large amounts of data get much slower because bgwriter seeks and writes dirty buffers synchronously, page by page. For example:

- COPY of accounts (5m records) and CHECKPOINT command after COPY: without O_SYNC 100sec, with O_SYNC 1046sec
- UPDATE of all records of accounts: without O_SYNC 139sec, with O_SYNC 639sec
- CHECKPOINT command for flushing 1.6GB of dirty buffers: without O_SYNC 24sec, with O_SYNC 126sec

To mitigate this problem, I sorted dirty buffers by their relfilenode and block numbers and wrote multiple pages that are adjacent both in memory and on disk. The result was:

- COPY of accounts (5m records) and CHECKPOINT command after COPY: 227sec
- UPDATE of all records of accounts: 569sec
- CHECKPOINT command for flushing 1.6GB of dirty buffers: 71sec

Still bad...

(2) It can't utilize tablespaces

Though I didn't evaluate it, update activity would be much less efficient with O_SYNC than with fsync() when using multiple tablespaces, because there is only one bgwriter.

Can anyone solve these problems? One of my ideas is to use scattered I/O. I hear that readv()/writev() have been able to do real scattered I/O since kernel 2.6 (RHEL4.0); with kernels before 2.6, readv()/writev() just performed I/Os sequentially. Windows has provided reliable scattered I/O for years. Another idea is to use async I/O, possibly combined with a multiple-bgwriter approach on platforms where async I/O is not available. How about the chance Josh-san has brought?
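For readers who don't have md.c in front of them, this is roughly the shape of the change being described -- an illustration of the idea, not the actual diff (am_bgwriter is the flag name used in the mail; the helper below is invented):

    /* Have only the bgwriter open data files with O_SYNC, so its writes reach
     * disk synchronously while backends keep using buffered writes. */
    static int
    data_file_open_flags(void)
    {
        int     flags = O_RDWR | PG_BINARY;

        if (am_bgwriter)            /* set at bgwriter startup */
            flags |= O_SYNC;

        return flags;
    }

    /* mdopen() and _mdfd_openseg() would then pass data_file_open_flags() to
     * PathNameOpenFile() instead of the hard-coded O_RDWR | PG_BINARY. */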
On Thu, 5 Apr 2007, Heikki Linnakangas wrote:
> The purpose of the bgwriter_all_* settings is to shorten the duration of
> the eventual checkpoint. The reason to shorten the checkpoint duration
> is to limit the damage to other I/O activity it causes. My thinking is
> that assuming the LDC patch is effective (agreed, needs more testing) at
> smoothening the checkpoint, the duration doesn't matter anymore. Do you
> want to argue there's other reasons to shorten the checkpoint duration?

My testing results suggest that LDC doesn't smooth the checkpoint usefully when under a high (>30 client here) load, because (on Linux at least) the way the OS caches writes clashes badly with how buffers end up being evicted if the buffer pool fills back up before the checkpoint is done. In that context, anything that stretches out the checkpoint duration is going to make the problem worse rather than better, because it makes it more likely that the tail end of the checkpoint will have to fight with the clients for write bandwidth, at which point they both suffer. If you just get the checkpoint done fast, the clients can't fill the pool as fast as the BufferSync is writing it out, and things are as happy as they can be without a major rewrite to all this code.

I can get a tiny improvement in some respects by delaying 2-5 seconds between finishing the writes and calling fsync, because that gives Linux a moment to usefully spool some of the data to the disk controller's cache; beyond that any additional delay is a problem. Since it's only the high load cases I'm having trouble dealing with, this basically makes it a non-starter for me. The focus on checkpoint_timeout and ignoring checkpoint_segments in the patch is also a big issue for me. At the same time, I recognize that the approach taken in LDC probably is a big improvement for many systems; it's just a step backwards for my highest-throughput one. I'd really enjoy hearing some results from someone else.

> The number of buffers evicted by normal backends in a bgwriter_delay period
> is simple to keep track of, just increase a counter in StrategyGetBuffer and
> reset it when bgwriter wakes up.

I see you've already found the other helpful Itagaki patch in this area. I know I would like to see his code for tracking evictions committed; then I'd like that to be added as another counter in pg_stat_bgwriter (I mentioned that to Magnus in passing when he was setting the stats up but didn't press it because of the patch dependency). Ideally, and this idea was also in Itagaki's patch with the writtenByBgWriter/ByBackEnds debug hook, I think it's important that you know how every buffer written to disk got there--was it a background writer, a checkpoint, or an eviction that wrote it out? Track all those and you can really learn something about your write performance, data that's impossible to collect right now.

However, as Itagaki himself points out, doing something useful with bgwriter_lru_maxpages is only one piece of automatically tuning the background writer. I hate to join in on chopping his patches up, but without some additional work I don't think the exact auto-tuning logic he then applies will work in all cases, which could make it more of a problem than the current crude yet predictable method. The whole way bgwriter_lru_maxpages and num_to_clean play off each other in his code currently has a number of failure modes I'm concerned about. I'm not sure if a rewrite using a moving-average approach (as I did in my auto-tuning writer prototype and as Tom just suggested here) will be sufficient to fix all of them. That was already on my to-do list to investigate further.

-- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
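A sketch of the per-source accounting being asked for (field and function names here are invented; pg_stat_bgwriter did not carry these counters at the time):

    /* Illustrative counters distinguishing who performed each buffer write. */
    typedef enum
    {
        WRITE_BY_BGWRITER,      /* LRU/all-scan writes by the background writer */
        WRITE_BY_CHECKPOINT,    /* writes done during BufferSync() */
        WRITE_BY_BACKEND        /* evictions where a backend had to do the write */
    } BufferWriteSource;

    static long buffer_writes[3];

    /* Call sites would be the bgwriter scans, BufferSync(), and the backend
     * eviction path, respectively. */
    static void
    count_buffer_write(BufferWriteSource src)
    {
        buffer_writes[src]++;
    }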
On Fri, 6 Apr 2007, Takayuki Tsunakawa wrote:
> could anyone evaluate again the O_SYNC approach that commercial databases
> use, and tell me if and why PostgreSQL's fsync() approach is better than
> theirs?

I noticed a big improvement switching the WAL to use O_SYNC (+O_DIRECT) instead of fsync on my big and my little servers with battery-backed cache, so I know sync writes perform reasonably well on my hardware. Since I've had problems with the fsync at checkpoint time, I did a similar test to yours recently, adding O_SYNC to the open calls and pulling the fsyncs out to get a rough idea how things would work. Performance was reasonable most of the time, but when I hit a checkpoint with a lot of the buffer cache dirty it was incredibly bad. It took minutes to write everything out, compared with a few seconds for the current case, and the background writer was too sluggish to help as well. This appears to match your data.

If you compare how Oracle handles their writes and checkpoints to the Postgres code, it's obvious they have a different architecture that enables them to support sync writing usefully. I'd recommend the Database Writer Process section of http://www.lc.leidenuniv.nl/awcourse/oracle/server.920/a96524/c09procs.htm as an introduction for those not familiar with that; it's interesting reading for anyone tinkering with background writer code.

It would be great to compare performance of the current PostgreSQL code with a fancy multiple background writer version using the latest sync methods or AIO; there have actually been multiple updates to improve O_SYNC writes within Linux during the 2.6 kernel series that make this more practical than ever on that platform. But as you've already seen, the performance hurdle to overcome is significant, and it would have to be optional as a result. When you add all this up--have to keep the current non-sync writes around as well, need to redesign the whole background writer/checkpoint approach around the idea of sync writes, and the OS-specific parts that would come from things like AIO--it gets real messy. Good luck drumming up support for all that when the initial benchmarks suggest it's going to be a big step back.

-- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
From: "Greg Smith" <gsmith@gregsmith.com> > If you compare how Oracle handles their writes and checkpoints to the > Postgres code, it's obvious they have a different architecture that > enables them to support sync writing usefully. I'd recommend the Database > Writer Process section of > http://www.lc.leidenuniv.nl/awcourse/oracle/server.920/a96524/c09procs.htm > as an introduction for those not familiar with that; it's interesting > reading for anyone tinking with background writer code. Hmm... what makes you think that sync writes is useful for Oracle and not for PostgreSQL? The process architecture is similar; bgwriter performs most of writes in PostgreSQL, while DBWn performs all writes in Oracle. The difference is that Oracle can assure crash recovery time by writing dirby buffers periodically in the order of their LSN. > It would be great to compare performance of the current PostgreSQL code > with a fancy multiple background writer version using the latest sync > methods or AIO; there have actually been multiple updates to improve > O_SYNC writes within Linux during the 2.6 kernel series that make this > more practical than ever on that platform. But as you've already seen, > the performance hurdle to overcome is significant, and it would have to be > optional as a result. When you add all this up--have to keep the current > non-sync writes around as well, need to redesign the whole background > writer/checkpoint approach around the idea of sync writes, and the > OS-specific parts that would come from things like AIO--it gets real > messy. Good luck drumming up support for all that when the initial > benchmarks suggest it's going to be a big step back. I agree with you in that write method has to be optional until there's enough data from the field that help determine which is better. ... It's a pity not to utilize async I/O and Josh-san's offer. I hope it will be used some day. I think OS developers have evolved async I/O for databases.
On Fri, 2007-04-06 at 02:53 -0400, Greg Smith wrote: > If you compare how Oracle handles their writes and checkpoints to the > Postgres code, it's obvious they have a different architecture that > enables them to support sync writing usefully. I'd recommend the > Database > Writer Process section of > http://www.lc.leidenuniv.nl/awcourse/oracle/server.920/a96524/c09procs.htm > as an introduction for those not familiar with that; it's interesting > reading for anyone tinking with background writer code. Oracle does have a different checkpointing technique and we know it is patented, so we need to go carefully there, especially when directly referencing documentation. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
On Fri, 6 Apr 2007, Takayuki Tsunakawa wrote:
> Hmm... what makes you think that sync writes are useful for Oracle and
> not for PostgreSQL?

They do more to push checkpoint-time work in advance, batch writes up more efficiently, and never let clients do the writing, all of which makes for a different type of checkpoint. Like Simon points out, even if it were conceivable to mimic their design it might not even be legally feasible.

The point I was trying to make is this: you've been saying that Oracle's writing technology has better performance in this area, which is probably true, and suggesting the cause of that was their use of O_SYNC writes. I wanted to believe that and even tested out a prototype. The reality here appears to be that their checkpoints go smoother *despite* using the slower sync writes, because they've built their design around the limitations of that write method. I suspect it would take a similar scale of redesign to move Postgres in that direction; the issues you identified (the same ones I ran into) are not so easy to resolve. You're certainly not going to move anybody in that direction by throwing a random comment into a discussion on the patches list about a feature useful *right now* in this area.

-- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Here is an updated version of the LDC patch (V4).

- Refactored the code to minimize the impact of the changes.
- Progress of the checkpoint is controlled not only based on checkpoint_timeout but also on checkpoint_segments. -- Now it works better with a large checkpoint_timeout and a small checkpoint_segments.

We can control the delay of checkpoints using three parameters: checkpoint_write_percent, checkpoint_nap_percent and checkpoint_sync_percent. If we set all of the values to zero, checkpoint behaves as it did before.

Heikki Linnakangas <heikki@enterprisedb.com> wrote:
> I'd suggest rearranging the code so that BgBufferSync and mdsync would
> basically stay like they are without the patch; the signature wouldn't
> change. To do the naps during a checkpoint, inject calls to new
> functions like CheckpointWriteNap() and CheckpointFsyncNap() inside
> BgBufferSync and mdsync. Those nap functions would check if enough
> progress has been made since last call and sleep if so.

Yeah, it makes LDC less intrusive. Now the code flow in checkpoints stays as it was, and the nap functions are called periodically in BufferSync() and smgrsync(). But the signatures of some functions needed small changes; the argument 'immediate' was added.

> The nap-functions would call
> BgWriteLRUBuffers if more than bgwriter_delay milliseconds have passed
> since last call to it.

Only LRU buffers are written in the nap and sync phases in the new patch. The ALL activity of bgwriter was primarily designed to write dirty buffers ahead of checkpoints, so those writes are not needed *in* checkpoints.

Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
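The dual trigger can be illustrated with a small progress calculation (illustrative only; GetCheckpointProgress in the patch may compute this differently):

    /*
     * Estimate how far we are through the current checkpoint interval, taking
     * whichever trigger (timeout or segments) is closer to firing.
     * Returns a fraction in the range 0.0 - 1.0.
     */
    static double
    checkpoint_target_progress(double elapsed_secs, double timeout_secs,
                               double segs_consumed, double max_segments)
    {
        double  by_time = elapsed_secs / timeout_secs;
        double  by_xlog = segs_consumed / max_segments;

        return (by_time > by_xlog) ? by_time : by_xlog;
    }

    /* The write phase can then sleep whenever the fraction of buffers already
     * written is ahead of this target scaled by checkpoint_write_percent. */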
ITAGAKI Takahiro wrote:
> Here is an updated version of LDC patch (V4).

Thanks! I'll start testing.

> - Progress of checkpoint is controlled not only based on checkpoint_timeout
> but also checkpoint_segments. -- Now it works better with large
> checkpoint_timeout and small checkpoint_segments.

Great, much better now. I like the concept of "progress" used in the calculations. We might want to call GetCheckpointProgress something else, though. It doesn't return the amount of progress made, but rather the amount of progress we should've made up to that point or we're in danger of not completing the checkpoint in time.

> We can control the delay of checkpoints using three parameters:
> checkpoint_write_percent, checkpoint_nap_percent and checkpoint_sync_percent.
> If we set all of the values to zero, checkpoint behaves as it was.

The nap and sync phases are pretty straightforward. The write phase, however, behaves a bit differently.

In the nap phase, we just sleep until enough time/segments have passed, where enough is defined by checkpoint_nap_percent. However, if we're already past checkpoint_write_percent at the beginning of the nap, I think we should clamp the nap time so that we don't run out of time until the next checkpoint because of sleeping.

In the sync phase, we sleep between each fsync until enough time/segments have passed, assuming that the time to fsync is proportional to the file length. I'm not sure that's a very good assumption. We might have one huge file with only very little changed data, for example a logging table that is only occasionally appended to. If we begin by fsyncing that, it'll take a very short time to finish, and we'll then sleep for a long time. If we then have another large file to fsync, but that one has all pages dirty, we risk running out of time because of the unnecessarily long sleep. The segmentation of relations limits the risk of that, though, by limiting the max file size, and I don't really have any better suggestions.

In the write phase, bgwriter_all_maxpages is also factored into the sleeps. On each iteration, we write bgwriter_all_maxpages pages and then we sleep for bgwriter_delay msecs. checkpoint_write_percent only controls the maximum amount of time we try to spend in the write phase; we skip the sleeps if we're exceeding checkpoint_write_percent, but it can very well finish earlier. IOW, bgwriter_all_maxpages is the *minimum* number of pages to write between sleeps. If it's not set, we use WRITES_PER_ABSORB, which is hardcoded to 1000.

The approach of writing a minimum of N pages per iteration seems sound to me. By setting N we can control the maximum impact of a checkpoint under normal circumstances. If there's very little work to do, it doesn't make sense to stretch the write of say 10 buffers across a 15 min period; it's indeed better to finish the checkpoint earlier. It's similar to vacuum_cost_limit in that sense. But using bgwriter_all_maxpages for it doesn't feel right; we should at least name it differently. The default of 1000 is a bit high as well: with the default bgwriter_delay that adds up to 39MB/s. That's OK for a decent I/O subsystem, but the default really should be something that will still leave room for other I/O on a small single-disk server.

Should we try doing something similar for the sync phase? If there's only 2 small files to fsync, there's no point sleeping for 5 minutes between them just to use up the checkpoint_sync_percent budget.

Should we give a warning if you set the *_percent settings so that they exceed 100%?
-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
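To illustrate the clamping idea for the nap phase, here is a sketch under assumed names (the *_percent values are treated as fractions and checkpoint_target_progress_now() is an invented helper; none of this is code from the patch):

    #include "postgres.h"
    #include "miscadmin.h"      /* pg_usleep() */

    /* Fraction of the checkpoint interval already consumed, 0.0 - 1.0;
     * invented helper along the lines of the progress sketch earlier. */
    extern double checkpoint_target_progress_now(void);

    /*
     * Sketch of the suggested clamping.  If the write phase already ran past
     * its budget, the loop body never executes and the nap is skipped
     * entirely instead of eating into the time left for the sync phase.
     */
    static void
    checkpoint_nap(double progress, double write_fraction, double nap_fraction,
                   int bgwriter_delay_ms)
    {
        double  nap_deadline = write_fraction + nap_fraction;

        while (progress < nap_deadline)
        {
            pg_usleep(bgwriter_delay_ms * 1000L);
            progress = checkpoint_target_progress_now();
        }
    }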
On Thu, 19 Apr 2007, Heikki Linnakangas wrote: > In the sync phase, we sleep between each fsync until enough time/segments > have passed, assuming that the time to fsync is proportional to the file > length. I'm not sure that's a very good assumption. I've been making scatter plots of fsync time vs. amount written to the database for a couple of months now, and while there's a trend there it's not a linear one based on data written. Under Linux, to make a useful prediction about how long a fsync will take you first need to consider how much dirty data is already in the OS cache (the "Dirty:" figure in /proc/meminfo) before the write begins, relative to the kernel parameters that control write behavior. Combine that with some knowledge of the caching behavior of the controller/disk combination you're using, and it's just barely possible to make a reasonable estimate. Any less information than all that and you really have very little basis on which to guess how long it's going to take. Other operating systems are going to give completely different behavior here, which of course makes the problem even worse. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
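For anyone who wants to watch the figure Greg refers to, here is a small self-contained helper that reads the "Dirty:" line from /proc/meminfo (Linux-specific; the parsing is an illustration, not part of any patch):

    #include <stdio.h>

    /* Return the current amount of dirty page-cache data in kB, or -1 on
     * error.  Parses the "Dirty:" line of /proc/meminfo (Linux only). */
    static long
    linux_dirty_kb(void)
    {
        FILE   *f = fopen("/proc/meminfo", "r");
        char    line[128];
        long    dirty_kb = -1;

        if (f == NULL)
            return -1;

        while (fgets(line, sizeof(line), f) != NULL)
        {
            if (sscanf(line, "Dirty: %ld kB", &dirty_kb) == 1)
                break;
        }
        fclose(f);
        return dirty_kb;
    }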