Thread: WAL Optimisation - configuration and usage
Hi,

There's a lot here, so skip to the middle for my WAL settings if you like. I'm currently investigating the performance of a large database which consumes email designated as SPAM for the perusal of customers wishing to check it. This incorporates a number of subprocesses - several delivery daemons, an expiry daemon and a UI which performs large selects. A considerable number of UPDATEs, SELECTs and DELETEs are performed continually.

Starting with a stock pg config, I've well understood the importance of increased shared memory, effective cache size and a low random_page_cost, as detailed in http://www.varlena.com/varlena/GeneralBits/Tidbits/perf.html.

After some system analysis with vmstat and sar, we've been able to determine that the main problem is that the system is IO bound, and IMO this is due to lots of updates causing high drive contention - the array is a RAID 0 mirror and the dataset was originally 79GB. A lot of SPAM is being sent from our mail scanners, and coupled with the UI this is creating an increasingly lagging system.

Typically all our db servers have these sorts of enhancements - 1GB RAM, SMP boxen with SCSI 160 disks:

effective_cache_size = 95694
random_page_cost = 0.5
sort_mem = 65536
max_connections = 128
shared_buffers = 15732

My focus today has been on WAL - I've not looked at WAL before. By increasing the settings thus:

wal_buffers = 64         # need to determine WAL usage
wal_files = 64           # range 0-64
wal_sync_method = fsync  # the default varies across platforms:
wal_debug = 0            # range 0-16

# hopefully this should see fewer LogFlushes per LogInsert - use more WAL though.
commit_delay = 10000     # range 0-100000
commit_siblings = 2      # range 1-1000
checkpoint_segments = 16 # in logfile segments (16MB each), min 1
checkpoint_timeout = 300 # in seconds, range 30-3600
fsync = true

great improvements have been seen. A vacuumdb -f -a -z went from processing 1 table in 10 minutes to 10 tables in 1 minute. :) Before the change, I actually stopped it after 80 tables (48 hours runtime) because the projected end time would have been next week. Once I restarted the postmaster with the above WAL settings, vacuumdb -f -a -z completed all 650 tables by the following day.

My thinking is therefore to reduce disk contention as far as possible within the current hardware limitations. I'm aiming to keep the checkpoint subprocess satisfied that other backends are about to commit - hence keeping commit_siblings low at 2 - and to create a sufficient gap between internal commits so that many commits can be done in a single sync. From the above config, I believe I've gone some way towards achieving this, and the performance I'm now seeing suggests as much.

But I think we can get more out of this: the above settings were picked from thin air, and my concern here is being able to determine WAL file usage, and whether the system is caught out at the other extreme, where we're not committing fast enough. I've read that there shouldn't be more than 2*checkpoint_segments+1 WAL files, yet my pg_xlog directory contains 74 files. This suggests I'm using more logfiles than I should. Also I'm not sure what wal_buffers really should be set to.

Can I get any feedback on this? How to look into pg's WAL usage is what I'm looking for. BTW this is an old install I'm afraid - 7.2.2 - it's been impossible to upgrade until now because it's been too slow. I have moved pg_xlog onto the root SCSI disk - it doesn't appear to have made a huge difference, but it could be on the same cable.
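[One minimal way to watch the file-count side of this from the shell - a sketch only, which assumes PGDATA points at the cluster's data directory and checkpoint_segments = 16, giving a nominal ceiling of 2*16+1 = 33 files:

# Sample the pg_xlog segment count once a minute and compare it
# against the nominal ceiling of 2 * checkpoint_segments + 1 = 33.
while true; do
    echo "$(date '+%H:%M:%S')  pg_xlog segments: $(ls $PGDATA/pg_xlog | wc -l)"
    sleep 60
done
]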
Additional information as a bit of background - I can supply sar output if required.

I'm currently running our expiry daemon, which scans all mail for each domain (ie each table), and this seems to take a few hours to run on a 26GB archive. It's a lot faster than it ever was. Load gets to about 8 as backends are all busy doing selects, updates and deletes. This process has recently already been run, so it shouldn't be doing too much deleting. It still seems IO bound, and I don't think I'm going to solve that without a better disk arrangement, but this is essentially what I'm doing now - exhausting other possibilities.

$ sar -B -s 16:00:00
16:35:55  pgpgin/s pgpgout/s activepg inadtypg inaclnpg inatarpg
16:36:00   3601.60    754.40   143492    87791    10230    48302
16:36:05   5766.40    552.80   143947    88039    10170    48431
16:36:10   3663.20    715.20   144578    88354     9075    48401
16:36:15   3634.40    412.00   144335    88405     9427    48433
16:36:20   5578.40    447.20   143626    88545     9817    48397
16:36:25   4154.40    469.60   143640    88654    10388    48536
16:36:30   3504.00    635.20   143538    88763     9992    48458
16:36:35   3540.80    456.00   142515    88949    10444    48381
16:36:40   3334.40   1067.20   143268    89244     9832    48468

$ vmstat 5
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo    in    cs  us  sy  id
 0  7  1  29588  10592  15700 809060   1   0    97    75     0   103  13   9  79
 3  8  0  29588  11680  15736 807620   0   0  3313   438  1838  3559  19  13  68
 2 13  1  29588  12808  15404 800328   0   0  4470   445  1515  1752   7   7  86
 0  9  1  29588  10992  15728 806476   0   0  2933   781  1246  2686  14  10  76
 2  5  1  29588  11336  15956 807884   0   0  3354   662  1773  5211  27  17  57
 4  5  0  29696  13072  16020 813872   0  24  4282   306  2632  7862  45  25  31
 4  6  1  29696  10400  16116 815084   0   0  5086   314  2668  7893  47  26  27
 9  2  1  29696  13060  16308 814232  27   0  3927   748  2586  7836  48  29  23
 3  8  1  29696  10444  16232 812816   3   0  4015   433  2443  7180  47  28  25
 8  4  0  29696  10904  16432 812488   0   0  4537   500  2616  8418  46  30  24
 4  6  2  29696  11048  16320 810276   0   0  6076   569  1893  3919  20  14  66
 0  5  0  29696  10480  16600 813788   0   0  4595   435  2400  6215  33  21  46
 3  6  0  29696  10536  16376 812248   0   0  3802   504  2417  7921  43  25  32
 1  6  1  29696  11236  16500 809636   0   0  3691   357  2171  5199  24  15  61
 0 14  1  29696  10228  16036 801368   0   0  4038   561  1566  3288  16  12  72

Sorry it's so long, but I thought some brief info would be better than not. Thanks for reading,

--
Rob Fielding
Development Designer Servers Ltd
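[For a per-device cross-check of the IO-bound diagnosis, a sketch using iostat, which ships in the same sysstat package as sar; the sample counts are arbitrary:

# Extended per-device statistics, 5-second samples for one minute.
# High await/%util on the array alongside only moderate CPU use
# confirms the disks, not the processors, are the bottleneck.
iostat -x 5 12
]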
Rob, Sir - I have to congratulate you on having the most coherently summarised, and yet complex, list query I have ever seen. I fear that I will be learning from this problem rather than helping, but one thing did puzzle me - you've set your random_page_cost to 0.5? I'm not sure this is sensible - you may be compensating for some other parameter that's out of range.

--
Richard Huxton
> random_page_cost = 0.5

Not likely. The lowest this value should ever be is 1, and that's if you're using something like a RAM drive. If your drives are doing a ton of extra random IO due to the above (rather than sequential reads), it would lower throughput quite a bit. Try a value of 2 for a while.
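[To see the effect in practice, one can watch the planner flip between index and sequential scans as the parameter changes. A sketch, where "maildb" and the "messages" table are hypothetical names:

# Compare plans at the old and suggested settings; 0.5 makes random
# IO look cheaper than sequential IO, which biases the planner
# towards index scans even when they thrash the disk.
psql -d maildb -c "SET random_page_cost = 0.5; EXPLAIN SELECT * FROM messages WHERE received > '2004-02-01';"
psql -d maildb -c "SET random_page_cost = 2;   EXPLAIN SELECT * FROM messages WHERE received > '2004-02-01';"
]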
Rod Taylor wrote:
>> random_page_cost = 0.5
>
> Try a value of 2 for a while.

OK, thanks Richard and Rod. I've upped this to 2. I think I left it over from a previous play with settings on my IDE RAID 0 workstation. It seemed to have a good effect being set as a low float, so it stuck.

From another post off-list, I've also bumped up

max_fsm_relations = 1000 # min 10, fsm is free space map
max_fsm_pages = 20000    # min 1000, fsm is free space map
vacuum_mem = 32768       # min 1024

as they did seem a little low. I'm hesitant to set them too high at this stage, as I'd prefer to keep as much RAM available for runtime as possible. I'm still hoping that perhaps the uber-pgadmin Mr Lane might reply about my WAL issue :) but I'm getting the feeling now that the server is running with a much higher level of performance than it has been. Won't know until tomorrow though.

Cheers,

--
Rob Fielding
Development Designer Servers Ltd
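[As a rough way to gauge whether those FSM numbers are in the right ballpark, the system catalogues can be consulted. A sketch; "maildb" is a placeholder:

# max_fsm_relations should comfortably exceed the number of tables
# and indexes; max_fsm_pages relates to the number of data pages
# with reclaimable space across the cluster.
psql -d maildb -t -c "SELECT count(*) FROM pg_class WHERE relkind IN ('r', 'i');"
psql -d maildb -t -c "SELECT sum(relpages) FROM pg_class WHERE relkind = 'r';"
]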
Further update to my WAL experimentation. pg_xlog files have increased to 81 and, checking today, up to 84. Currently nothing much is going on with the server, save a background process running a select every 30 seconds with almost no impact (according to IO from vmstat). This in itself is a good sign - an improvement on last week - but I'd still like to get clarification on WAL file usage if possible.

Tailing the log shows nothing more interesting than a whole set of "recycled transaction log file" entries:

2004-03-01 16:01:55 DEBUG: recycled transaction log file 0000007100000017
2004-03-01 16:07:01 DEBUG: recycled transaction log file 0000007100000018
2004-03-01 16:17:14 DEBUG: recycled transaction log file 0000007100000019
2004-03-01 16:22:20 DEBUG: recycled transaction log file 000000710000001A
2004-03-01 16:32:31 DEBUG: recycled transaction log file 000000710000001B
2004-03-01 16:37:36 DEBUG: recycled transaction log file 000000710000001C
2004-03-01 16:47:48 DEBUG: recycled transaction log file 000000710000001D
2004-03-01 16:52:54 DEBUG: recycled transaction log file 000000710000001E
2004-03-01 17:03:05 DEBUG: recycled transaction log file 000000710000001F

Looks kinda automated, but the times aren't quite even, at around 6-10 minutes apart.

cheers,

--
Rob Fielding
rob@dsvr.net www.dsvr.co.uk
Development Designer Servers Ltd
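[The gaps are easy to compute from the log itself. A sketch - the log file path is an assumption:

# Print each recycle time together with the seconds elapsed since
# the previous one, to see how uneven the spacing really is.
grep 'recycled transaction log file' /var/log/postgresql.log |
awk '{ split($2, t, ":"); s = t[1]*3600 + t[2]*60 + t[3]
       if (prev) print $2, "(+" (s - prev) "s)"
       prev = s }'
]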
Rob Fielding wrote:
> My focus today has been on WAL - I've not looked at WAL before. By
> increasing the settings thus:
>
> wal_buffers = 64         # need to determine WAL usage
> wal_files = 64           # range 0-64
> wal_sync_method = fsync  # the default varies across platforms:
> wal_debug = 0            # range 0-16
>
> # hopefully this should see fewer LogFlushes per LogInsert - use more WAL though.
> commit_delay = 10000     # range 0-100000
> commit_siblings = 2      # range 1-1000
> checkpoint_segments = 16 # in logfile segments (16MB each), min 1
> checkpoint_timeout = 300 # in seconds, range 30-3600
> fsync = true
>
> But I think we can get more out of this: the above settings were picked
> from thin air, and my concern here is being able to determine WAL file
> usage, and whether the system is caught out at the other extreme, where
> we're not committing fast enough. I've read that there shouldn't be more
> than 2*checkpoint_segments+1 WAL files, yet my pg_xlog directory contains
> 74 files. This suggests I'm using more logfiles than I should. Also I'm
> not sure what wal_buffers really should be set to.

As Richard Huxton says, we're all learning... I'm looking at WAL logic for other reasons right now. This is based upon my reading of the code; I think the manual contains at least one confusion that has not assisted your understanding (or mine):

The WAL files limit of 2*checkpoint_segments+1 refers to the number of files allocated in advance of the current log, not the total number of files in use.

pg uses a cycle of logs, reusing older ones when all the transactions in those log files have been checkpointed. The limit is set to allow a checkpoint to release segments and have them all be reused at once; pg stores them up for use again later when the workload hots up. If it cannot recycle a file because there is a still-current txn at the end of the cycle, then it will allocate a new file and use that instead, while still keeping everything in a cycle. Thus if transactions are particularly long-running, the number of files in the cycle will grow.

So overall, normal behaviour so far. I don't think there's anything to worry about in having that many files in your xlog cycle. That behaviour is usually seen with occasional long-running txns. When a long-running transaction is over, pg will try to reduce the number of files in the cycle until it's back to target.

You seem to be reusing one file in the cycle every 10 mins - this is happening as the result of a checkpoint timeout - "kinda automated" as you say. [A checkpoint is the only time you can get the messages you're getting.] At one file per checkpoint, it will take 2*16+1 = 33 checkpoints at 10 mins each - about five and a half hours - before it hits the advance-allocation file limit and then starts to reduce the number of files. That's why they appear to stay constant...

If you want to check whether this is correct, manually issue a number of CHECKPOINT statements. The messages should change from "recycled" to "removing" transaction log file once you've got to 33 checkpoints, and the number of WAL log files should start to go down also. If so, then there's nothing too strange going on, just pg being a little slow in reducing the number of WAL log files.

So it seems that you are running occasional very long transactions, during which you run up to 60-80 WAL files. That's just on the edge of your wal_files limit, which means you start to write WAL quicker than you'd like past that point.
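[A sketch of that test from the shell - CHECKPOINT requires superuser, and "maildb"/PGDATA are placeholders:

# Issue 40 manual checkpoints and watch the segment count; past ~33
# the "recycled" log messages should turn into "removing" and the
# count should drop back towards 2*checkpoint_segments+1.
i=0
while [ $i -lt 40 ]; do
    psql -d maildb -c "CHECKPOINT;"
    count=$(ls "$PGDATA/pg_xlog" | wc -l)
    echo "checkpoint $i: $count segment files"
    i=$((i + 1))
    sleep 5
done
]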
Your checkpoint_timeout is 300 seconds, but a checkpoint will also be triggered every checkpoint_segments, i.e. currently every 16 WAL files. Since you go as high as 60-80, you are checkpointing 4-5 times during the heavy transaction period - assuming it's all one block of work. In the end, each checkpoint is causing a huge I/O storm, during which not much work happens.

I would suggest that you reduce the effect of checkpointing by either:
- re-writing the app to do scan deletes in smaller chunks in quieter periods (see the sketch after this message), or
- increasing checkpoint_segments to 128, though this may affect your recoverability

You can of course only do so much with the memory available to you. If you increase one allocation of memory, you may have to reduce another parameter, and that may be counterproductive.

[An alternative view is that you should go for more frequent, not less frequent, checkpoints in this situation, smoothing out their effect rather than trying to avoid them altogether. On the other hand, that approach also increases total WAL log volume, which means you'd make poor use of I/O and memory buffering. I'd stay high.]

However, I'm not sure:
- why a checkpoint interval of 300 secs causes them to happen every 10 mins in quieter periods; is that an occasional update occurring?
- why a checkpoint only releases a single WAL file each time - but that may be me just reading the code incorrectly.

Please set WAL_DEBUG to 1 so we can see a bit more info: thanks.

> Can I get any feedback on this? How to look into pg's WAL usage is what
> I'm looking for. BTW this is an old install I'm afraid - 7.2.2 - it's
> been impossible to upgrade until now because it's been too slow. I have
> moved pg_xlog onto the root SCSI disk - it doesn't appear to have made a
> huge difference, but it could be on the same cable.

My advice is don't touch wal_sync_method... I **think** the WAL behaviour is still the same in 7.4.1, so no rush to upgrade on that account - unless you're using temporary tables...

Best Regards, Simon Riggs
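[A minimal sketch of the chunked-delete idea, sweeping the key space in slices so that no single transaction spans the whole scan; "spam_archive" and its "id" and "expires" columns are hypothetical names:

# Each psql invocation is its own transaction, so WAL segments can be
# checkpointed and recycled between slices instead of piling up
# behind one huge delete.
step=10000
lo=0
max=$(psql -d maildb -t -c "SELECT max(id) FROM spam_archive;" | tr -d ' ')
while [ "$lo" -le "$max" ]; do
    psql -d maildb -c "DELETE FROM spam_archive
                       WHERE id >= $lo AND id < $((lo + step))
                         AND expires < now();"
    lo=$((lo + step))
done
]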
"Simon Riggs" <simon@2ndquadrant.com> writes: > - why checkpoint interval of 300 secs causes them to happen every 10 > mins in quieter periods; is that an occaisional update occurring? There is code in there to suppress a checkpoint if no WAL-loggable activity has happened since the last checkpoint. Not sure if that's relevant to the issue or not though. regards, tom lane
Simon,

> Please set WAL_DEBUG to 1 so we can see a bit more info: thanks.

I'm pretty sure that WAL_DEBUG requires a compile-time option.

--
-Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus wrote:
> Simon Riggs wrote:
>> Please set WAL_DEBUG to 1 so we can see a bit more info: thanks.
>
> I'm pretty sure that WAL_DEBUG requires a compile-time option.

In my naiveté, I just set it and use it. I discovered it in the code, then set it to take advantage. I'm surprised, but you are right: the manual does SAY this requires a compile-time option; it is unfortunately not correct. Maybe it was once, as you say? If Rob is using 7.2.2, maybe it still applies to him? I don't know.

Setting wal_debug > 0 in postgresql.conf sets a variable XLOG_DEBUG, which, although it is all capitals, is not in fact a compiler directive, as it looks. This variable is used within xlog.c to output **too much** information to the log. However, it is the only option at present.

This prompts me to consider the idea of having various levels of WAL debug output, or using some kind of log_checkpoint mechanism to better understand what is going on.

Best Regards, Simon Riggs
Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>> - why a checkpoint interval of 300 secs causes them to happen every 10
>>   mins in quieter periods; is that an occasional update occurring?
>
> There is code in there to suppress a checkpoint if no WAL-loggable
> activity has happened since the last checkpoint. Not sure if that's
> relevant to the issue or not, though.

Thanks Tom; at least that clears up why the checkpoints are off. I must admit I'm taken aback, though: I'd prefer it if it DIDN'T do that. If the system is quiet, the odd checkpoint doesn't matter that much; however, taking a long time to return the xlog files to the *desired* state of having many pre-allocated log files is not a great thing.

What do you think about continuing to checkpoint normally until the number of xlog files has returned to 2*checkpoint_segments+1, then allowing a slow-down of checkpoints when quiet? It would be easy enough to set a variable true while rearranging files towards the limit, then set it false once the limit has been hit, and use that to activate the slow-down code (not that I know where that is, mind...). However, that would require some backend-to-postmaster IPC, which may be a problem. Or perhaps the real problem is only recycling one file at a time - if we're running this as the checkpoint process, it wouldn't be a problem to recycle more than one at the same time, would it?

The reason for my interest: when archiving logs for PITR, there may be occasional long pauses while waiting for tape mounts (typically 30 minutes from notification to change). These pauses could (and therefore eventually will, for some people) cause severe log file build-up, and I'm interested in making sure this build-up doesn't take too long to clear. Forgetting the archival API stuff for a second, this is roughly the same situation as Rob is experiencing (or at least the one causing him to pause and think).

Best Regards, Simon Riggs
Simon Riggs wrote:
> Josh Berkus wrote:
>> Simon Riggs wrote:
>>> Please set WAL_DEBUG to 1 so we can see a bit more info: thanks.
>>
>> I'm pretty sure that WAL_DEBUG requires a compile-time option.
>
> I'm surprised, but you are right: the manual does SAY this requires a
> compile-time option; it is unfortunately not correct.

Actually, the manual is correct: in 7.4 and earlier releases, enabling wal_debug can be done without also setting a compile-time #ifdef. As of current CVS HEAD, the WAL_DEBUG #ifdef must be defined before this variable is available.

-Neil
Neil,

> Actually, the manual is correct: in 7.4 and earlier releases, enabling
> wal_debug can be done without also setting a compile-time #ifdef. As
> of current CVS HEAD, the WAL_DEBUG #ifdef must be defined before this
> variable is available.

Hmmm. I was told that it was this way for 7.4 as well; that's why it's in the docs that way.

--
-Josh Berkus
Aglio Database Solutions
San Francisco
Neil Conway wrote:
> Simon Riggs wrote:
>> Josh Berkus wrote:
>>> Simon Riggs wrote:
>>>> Please set WAL_DEBUG to 1 so we can see a bit more info: thanks.
>>>
>>> I'm pretty sure that WAL_DEBUG requires a compile-time option.
>>
>> I'm surprised, but you are right: the manual does SAY this requires a
>> compile-time option; it is unfortunately not correct.
>
> Actually, the manual is correct: in 7.4 and earlier releases, enabling
> wal_debug can be done without also setting a compile-time #ifdef. As
> of current CVS HEAD, the WAL_DEBUG #ifdef must be defined before this
> variable is available.

Touché! I stand corrected; thank you both. My suggestion does work for Rob, then. [This also implies I have a screwed version on my machine, so thank you also for flushing that lurking issue out for me. I'd had a suspicion for a few weeks. Lucky I'm still just prototyping.]

On the other hand, I was just about to change the wal_debug behaviour to allow better debugging of PITR features as they're added. I think it is very important to be able to put the system fairly easily into debug mode; a recompile is easy enough, but it would be even better to avoid it completely. This would mean reversing the change you describe. Here's the design.

The behaviour I wish to add is:
- Keep wal_debug as a value between 0 and 16.
- If = 0 then no debug output (default).
- Use the following bitmasks against the value (a small decoding sketch follows after this message):
  Mask 1 = XLOG checkpoints get logged
  Mask 2 = Archive API calls get logged
  Mask 4 = Transaction commits get logged
  Mask 8 = Flushes & INSERTs get logged

That way it should be fairly straightforward to control the amount and type of information available to administrators. The existing design produces too much info to be easily usable, mostly requiring a perl program to filter out the info overload and do record counts. This suggested design allows you to control the volume of messages, since the bitmasks are arranged in volume/frequency order, and brings the wal_debug option back into something useful for problem diagnosis on live systems, not just hacking the code.

Anybody object to these mods, or have better/different ideas? Getting the diagnostics right is fairly important, IMHO, to making PITR become real.

Best regards, Simon Riggs
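[A tiny shell sketch of how a DBA would decode a wal_debug value under this scheme - the scheme is only a proposal in this thread, not existing behaviour:

# Which message classes a given wal_debug value would enable
# under the proposed bitmasks (e.g. 11 = 8 + 2 + 1).
v=11
[ $((v & 1)) -ne 0 ] && echo "XLOG checkpoints logged"
[ $((v & 2)) -ne 0 ] && echo "archive API calls logged"
[ $((v & 4)) -ne 0 ] && echo "transaction commits logged"
[ $((v & 8)) -ne 0 ] && echo "flushes & inserts logged"
]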
Josh Berkus wrote:
> Hmmm. I was told that it was this way for 7.4 as well; that's why it's
> in the docs that way.

No such statement is made in the docs AFAIK: they merely say "If nonzero, turn on WAL-related debugging output." I invented a new #ifdef symbol when making this change in CVS HEAD, so I think you are misremembering.

-Neil
"Simon Riggs" <simon@2ndquadrant.com> writes: > The behaviour I wish to add is: > Keep wal_debug as a value between 0 and 16. > If =0 then no debug output (default). > Use following bitmasks against the value > Mask 1 = XLOG Checkpoints get logged > Mask 2 = Archive API calls get logged > Mask 4 = Transaction - commits get logged > Mask 8 = Flush & INSERTs get logged I see no value in reverting Neil's change. The above looks way too much like old-line assembler-programmer thinking to me, anyway. Why not invent a separate, appropriately named boolean variable for each thing you want to control? Even C programmers manage to avoid doing the sort of mental arithmetic that the above would force onto DBAs. As for whether it should be #ifdef'd or not, I'd have no objection to turning WAL_DEBUG on by default in pg_config_manual.h for the duration of PITR development. One should not however confuse short-term debugging needs with features that the average user is going to need indefinitely. (It was not too long ago that there was still debugging code for btree index building in there, for crissakes.) regards, tom lane
Simon Riggs wrote:
> On the other hand, I was just about to change the wal_debug behaviour
> to allow better debugging of PITR features as they're added.

That's a development activity. Enabling the WAL_DEBUG #ifdef by default during the 7.5 development cycle would be uncontroversial, I think.

> I think it is very important to be able to put the system fairly
> easily into debug mode

It is? Why would this be useful for non-development activities?

(It may well be the case that we ought to report more or better information about the status of the WAL subsystem; but WAL_DEBUG is surely not the right mechanism for emitting information intended for an administrator.)

-Neil
Neil Conway wrote:
> Simon Riggs wrote:
>> On the other hand, I was just about to change the wal_debug behaviour
>> to allow better debugging of PITR features as they're added.
>
> That's a development activity. Enabling the WAL_DEBUG #ifdef by
> default during the 7.5 development cycle would be uncontroversial, I
> think.

Yes, that's the best proposal. Can I leave that with you?

>> I think it is very important to be able to put the system fairly
>> easily into debug mode
>
> It is? Why would this be useful for non-development activities?
>
> (It may well be the case that we ought to report more or better
> information about the status of the WAL subsystem; but WAL_DEBUG is
> surely not the right mechanism for emitting information intended for
> an administrator.)

Right again. I guess my proposal amounted to quick-and-dirty logging. I'll think some more.

Best Regards, Simon Riggs