Thread: WAL Optimisation - configuration and usage

WAL Optimisation - configuration and usage

From: Rob Fielding
Hi,

There's a lot here, so skip to the middle for my WAL settings if you like.

I'm currently investigating the performance of a large database which
consumes email designated as SPAM, kept for the perusal of customers
who wish to check it. This involves a number of subprocesses - several
delivery daemons, an expiry daemon and a UI which performs large
selects. A considerable number of UPDATEs, SELECTs and DELETEs are
performed continually.

Starting with a stock pg config, I well understand the importance of
increased shared memory, effective_cache_size and a low
random_page_cost, as detailed in
http://www.varlena.com/varlena/GeneralBits/Tidbits/perf.html. After
some system analysis with vmstat and sar we've determined that the main
problem is that we're IO bound, and IMO this is due to lots of updates
causing high drive contention - the array is a RAID0 mirror and the
dataset was originally 79GB. A lot of SPAM is being sent from our mail
scanners, and coupled with the UI load this is creating an increasingly
lagging system.

Typically all our db servers have this sort of enhancement - 1GB RAM,
SMP boxen with SCSI-160 disks:

effective_cache_size = 95694
random_page_cost = 0.5
sort_mem = 65536
max_connections = 128
shared_buffers = 15732

My focus today has been on WAL - I've not looked at WAL before. I
increased the settings thus:

wal_buffers = 64          # need to determine WAL usage
wal_files = 64            # range 0-64
wal_sync_method = fsync   # the default varies across platforms
wal_debug = 0             # range 0-16

# hopefully this should see fewer LogFlushes per LogInsert - it will
# use more WAL though.
commit_delay = 10000      # range 0-100000
commit_siblings = 2       # range 1-1000
checkpoint_segments = 16  # in logfile segments (16MB each), min 1
checkpoint_timeout = 300  # in seconds, range 30-3600
fsync = true
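(Note that at least wal_buffers and wal_files are allocated at
postmaster start, so these needed a full restart rather than a reload -
roughly, assuming $PGDATA points at the cluster:

$ pg_ctl restart -D $PGDATA

- which is why the improvement below only appeared after a restart.)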

Great improvements have been seen: a vacuumdb -f -a -z went from
processing one table in ten minutes to ten tables in one minute. :)
Under the old settings I had actually stopped it after 80 tables (48
hours of runtime) because the projected end time would have been next
week. Once I restarted the postmaster with the above WAL settings,
vacuumdb -f -a -z completed all 650 tables by the following day.

My thinking is therefore to reduce disk seek contention as far as
possible within the current hardware limitations. I'm aiming to keep
the checkpoint subprocess satisfied that other backends are about to
commit - hence keeping commit_siblings low at 2 - and to create a
sufficient gap between internal commits so that many commits can be
done in a single sync. From the above config I believe I've gone some
way towards achieving this, and the performance I'm now seeing suggests
as much.

But I think we can get more out of this, as the above settings were
picked from thin air, and my concern is being able to determine WAL
file usage - and whether the system is caught out at the other extreme,
i.e. that we're not committing fast enough. I've read that the number
of WAL files shouldn't exceed 2*checkpoint_segments+1, yet my pg_xlog
directory contains 74 files. This suggests I'm using more logfiles than
I should. Also, I'm not sure what wal_buffers really should be set to.

Can I get any feedback on this? How to look into pg's WAL usage is
what I'm after. BTW this is an old install I'm afraid - 7.2.2 - it's
been impossible to upgrade until now because it's been too slow. I have
moved pg_xlog onto the root SCSI disk - it doesn't appear to have made
a huge difference, but it could be on the same cable.

Additional information as a bit of background: I can supply sar output
if required. I'm currently running our expiry daemon, which scans all
mail for each domain (i.e. each table), and this seems to take a few
hours to run on a 26GB archive. It's a lot faster than it ever was.
Load gets to about 8 as the backends are all busy doing selects,
updates and deletes. The expiry has recently been run already, so it
shouldn't be doing too much deleting. It still seems IO bound, and I
don't think I'm going to solve that without a better disk arrangement,
but this is essentially what I'm doing now - exhausting the other
possibilities.

$ sar -B -s 16:00:00

16:35:55     pgpgin/s pgpgout/s  activepg  inadtypg  inaclnpg  inatarpg
16:36:00      3601.60    754.40    143492     87791     10230     48302
16:36:05      5766.40    552.80    143947     88039     10170     48431
16:36:10      3663.20    715.20    144578     88354      9075     48401
16:36:15      3634.40    412.00    144335     88405      9427     48433
16:36:20      5578.40    447.20    143626     88545      9817     48397
16:36:25      4154.40    469.60    143640     88654     10388     48536
16:36:30      3504.00    635.20    143538     88763      9992     48458
16:36:35      3540.80    456.00    142515     88949     10444     48381
16:36:40      3334.40   1067.20    143268     89244      9832     48468

$ vmstat 5
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  7  1  29588  10592  15700 809060   1   0    97    75    0   103  13   9  79
 3  8  0  29588  11680  15736 807620   0   0  3313   438 1838  3559  19  13  68
 2 13  1  29588  12808  15404 800328   0   0  4470   445 1515  1752   7   7  86
 0  9  1  29588  10992  15728 806476   0   0  2933   781 1246  2686  14  10  76
 2  5  1  29588  11336  15956 807884   0   0  3354   662 1773  5211  27  17  57
 4  5  0  29696  13072  16020 813872   0  24  4282   306 2632  7862  45  25  31
 4  6  1  29696  10400  16116 815084   0   0  5086   314 2668  7893  47  26  27
 9  2  1  29696  13060  16308 814232  27   0  3927   748 2586  7836  48  29  23
 3  8  1  29696  10444  16232 812816   3   0  4015   433 2443  7180  47  28  25
 8  4  0  29696  10904  16432 812488   0   0  4537   500 2616  8418  46  30  24
 4  6  2  29696  11048  16320 810276   0   0  6076   569 1893  3919  20  14  66
 0  5  0  29696  10480  16600 813788   0   0  4595   435 2400  6215  33  21  46
 3  6  0  29696  10536  16376 812248   0   0  3802   504 2417  7921  43  25  32
 1  6  1  29696  11236  16500 809636   0   0  3691   357 2171  5199  24  15  61
 0 14  1  29696  10228  16036 801368   0   0  4038   561 1566  3288  16  12  72

Sorry it's so long, but I thought some background info would be better
than none. Thanks for reading,

--

Rob Fielding
Development
Designer Servers Ltd


Re: WAL Optimisation - configuration and usage

From: Richard Huxton
Rob

Sir - I have to congratulate you on having the most coherently summarised and
yet complex list query I have ever seen.

I fear that I will be learning from this problem rather than helping,
but one thing did puzzle me - you've set your random_page_cost to 0.5?
I'm not sure this is sensible - you may be compensating for some other
out-of-range parameter.


--
  Richard Huxton

Re: WAL Optimisation - configuration and usage

From: Rod Taylor
> random_page_cost = 0.5

Not likely. The lowest this value should ever be is 1, and that's if
you're using something like a RAM drive.

If your drives are doing a ton of extra random IO because of the above
(rather than sequential reads), it will lower throughput quite a bit.

Try a value of 2 for a while.



Re: WAL Optimisation - configuration and usage

From: Rob Fielding
Rod Taylor wrote:

>> random_page_cost = 0.5
>
> Try a value of 2 for a while.

OK, thanks Richard and Rod - I've upped this to 2. I think I'd left it
over from a previous play with settings on my IDE RAID0 workstation.
It seemed to have a good effect when set to a low float value, so it
stuck.
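For completeness, the change was just the one line in postgresql.conf
followed by a reload - a sketch, with "mydb" as a stand-in database
name:

# postgresql.conf
random_page_cost = 2

$ pg_ctl reload -D $PGDATA
$ psql -d mydb -c "SHOW random_page_cost;"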

From another post, off-list, I've also bumped up:

max_fsm_relations = 1000    # min 10, fsm is free space map
max_fsm_pages = 20000       # min 1000, fsm is free space map
vacuum_mem = 32768          # min 1024

as they did seem a little low. I'm hesitant to set them too high at
this stage, as I'd prefer to keep as much RAM as possible available for
runtime.

I'm still hoping that perhaps the uber-pgadmin Mr Lane might reply
about my WAL issue :) but I'm getting the feeling the server is now
running with a much higher level of performance than it has been.
Won't know until tomorrow though.

Cheers,

--
Rob Fielding
Development
Designer Servers Ltd


Re: WAL Optimisation - configuration and usage

From: Rob Fielding
A further update on my WAL experimentation: pg_xlog files have
increased to 81, and checking today, up to 84. Currently there's not
much going on with the server, save a background process running a
select every 30 seconds with almost no impact (according to the IO
figures from vmstat).

This in itself is a good sign - an improvement on last week's running -
but I'd still like to get clarification on WAL file usage if possible.

Tailing the log file shows nothing more interesting than a whole series
of "recycled transaction log file" entries:

2004-03-01 16:01:55 DEBUG:  recycled transaction log file 0000007100000017
2004-03-01 16:07:01 DEBUG:  recycled transaction log file 0000007100000018
2004-03-01 16:17:14 DEBUG:  recycled transaction log file 0000007100000019
2004-03-01 16:22:20 DEBUG:  recycled transaction log file 000000710000001A
2004-03-01 16:32:31 DEBUG:  recycled transaction log file 000000710000001B
2004-03-01 16:37:36 DEBUG:  recycled transaction log file 000000710000001C
2004-03-01 16:47:48 DEBUG:  recycled transaction log file 000000710000001D
2004-03-01 16:52:54 DEBUG:  recycled transaction log file 000000710000001E
2004-03-01 17:03:05 DEBUG:  recycled transaction log file 000000710000001F

Looks kinda automated, but the times aren't quite even, at around five
to ten minutes apart.

cheers,
--

Rob Fielding
rob@dsvr.net

www.dsvr.co.uk              Development             Designer Servers Ltd

Re: WAL Optimisation - configuration and usage

From: "Simon Riggs"
>Rob Fielding wrote:
> My focus today has been on WAL - I've not looked at WAL before. I
> increased the settings thus:
>
> wal_buffers = 64          # need to determine WAL usage
> wal_files = 64            # range 0-64
> wal_sync_method = fsync   # the default varies across platforms
> wal_debug = 0             # range 0-16
>
> # hopefully this should see fewer LogFlushes per LogInsert - it will
> # use more WAL though.
> commit_delay = 10000      # range 0-100000
> commit_siblings = 2       # range 1-1000
> checkpoint_segments = 16  # in logfile segments (16MB each), min 1
> checkpoint_timeout = 300  # in seconds, range 30-3600
> fsync = true

> But I think we can get more out of this, as the above settings were
> picked from thin air, and my concern is being able to determine WAL
> file usage - and whether the system is caught out at the other
> extreme, i.e. that we're not committing fast enough. I've read that
> the number of WAL files shouldn't exceed 2*checkpoint_segments+1, yet
> my pg_xlog directory contains 74 files. This suggests I'm using more
> logfiles than I should. Also, I'm not sure what wal_buffers really
> should be set to.

As Richard Huxton says, we're all learning...I'm looking at WAL logic
for other reasons right now...

This is based upon my reading of the code; I think the manual contains
at least one confusion that has not assisted your understanding (or
mine):

The WAL files limit of 2*checkpoint_segments+1 refers to the number of
files allocated in advance of the current log, not the total number of
files in use. Pg uses a cycle of logs, reusing older ones once all the
transactions in those log files have been checkpointed. The limit is
set to allow a checkpoint to release segments and have them all be
reused at once; pg stores them up for use again later when the workload
hots up.

If it cannot recycle a file because there is a still-current txn at the
end of the cycle, then it will allocate a new file and use that
instead, still keeping everything in a cycle. Thus if transactions are
particularly long-running, the number of files in the cycle will grow.
So overall, normal behaviour so far - I don't think there's anything to
worry about in having that many files in your xlog cycle.

That behaviour is usually seen with occasional long-running txns. When
a long-running transaction is over, pg will try to reduce the number of
files in the cycle until it's back to the target.
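If you want to see the cycle on disk, sorting the segments by
modification time shows it - a sketch, assuming shell access to the
data directory:

$ ls -lt $PGDATA/pg_xlog | head

The segment currently being written sorts first; segments that have
been recycled (renamed to later positions in the numbering) keep their
old timestamps and sort below it.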

You seem to be reusing one file in the cycle every 10 minutes or so -
this is happening as the result of a checkpoint timeout, so "kinda
automated" as you say. [A checkpoint is the only time you can get the
messages you're getting.] At one file per checkpoint, it will take
2*16+1 = 33 checkpoints * 10 mins = about 5.5 hours before it hits the
advance-allocation file limit and then starts to reduce the number of
files. That's why they appear to stay constant...

If you want to check whether this is correct, manually issue a number
of CHECKPOINT statements. Once you've got to 33 checkpoints, the
messages should change from "recycled" to "removing" transaction log
file, and the number of WAL log files should start to go down as well.
If so, then there's nothing too strange going on - pg is just being a
little slow in reducing the number of WAL log files.
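Something along these lines would do it - just a sketch, with "yourdb"
and the server log path as stand-ins, and CHECKPOINT needing superuser
rights:

$ for i in `seq 1 40`; do psql -d yourdb -c "CHECKPOINT;"; done
$ grep -E '(recycled|removing) transaction log file' /path/to/serverlog | tail

If the "removing" messages appear after roughly the 33rd checkpoint,
the cycle is behaving as described above.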

So, it seems that you are running occasional very long transactions,
during which you run up to 60-80 WAL files. That's just at the edge of
your wal_files limit of 64, which means that past that point you start
having to create new WAL segments rather than recycling them - slower
than you'd like. Your checkpoint_timeout is 300 seconds, but a
checkpoint will also be triggered every checkpoint_segments, currently
every 16 WAL files. Since you go as high as 60-80, you are
checkpointing 4-5 times during the heavy transaction period - assuming
it's all one block of work. In the end, each checkpoint causes a huge
I/O storm, during which not much other work happens.

I would suggest that you reduce the effect of checkpointing by either:
- re-writing the app to do the scan deletes in smaller chunks in
quieter periods (see the sketch below), or
- increasing checkpoint_segments to 128, though this may affect your
recoverability
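As a sketch of the first option - table and column names here are
purely hypothetical, the point being one modest range per transaction
rather than a single long scan delete:

$ psql -d yourdb -c "DELETE FROM spam_mail \
      WHERE domain_id = 42 AND received < now() - interval '30 days';"
# ...then the next domain_id, so each batch commits separately and its
# WAL segments can be recycled before the next one starts

Each batch then stays well inside checkpoint_segments, instead of one
transaction spanning 4-5 checkpoints.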

You can of course only do so much with the memory available to you. If
you increase one memory allocation, you may have to reduce another
parameter, which may be counterproductive.

[An alternative view is that you should go for more frequent, not less
frequent, checkpoints in this situation, smoothing out their effect
rather than trying to avoid them altogether. On the other hand, that
approach also increases total WAL volume, which means poorer use of I/O
and memory buffering. I'd stay high.]

However, I'm not sure:
- why a checkpoint interval of 300 secs causes them to happen every 10
mins in quieter periods - is that an occasional update occurring?
- why a checkpoint only releases a single WAL file each time - but that
may just be me misreading the code.

Please set WAL_DEBUG to 1 so we can see a bit more info: thanks.

> Can I get any feedback on this? How to look into pg's WAL usage is
> what I'm after. BTW this is an old install I'm afraid - 7.2.2 - it's
> been impossible to upgrade until now because it's been too slow. I
> have moved pg_xlog onto the root SCSI disk - it doesn't appear to
> have made a huge difference, but it could be on the same cable.

My advice is don't touch WAL_SYNC_METHOD...

I **think** the WAL behaviour is still the same in 7.4.1, so no rush to
upgrade on that account - unless you're using temporary tables....

Best Regards, Simon Riggs


Re: WAL Optimisation - configuration and usage

From: Tom Lane
"Simon Riggs" <simon@2ndquadrant.com> writes:
> - why a checkpoint interval of 300 secs causes them to happen every 10
> mins in quieter periods - is that an occasional update occurring?

There is code in there to suppress a checkpoint if no WAL-loggable
activity has happened since the last checkpoint.  Not sure if that's
relevant to the issue or not though.

            regards, tom lane

Re: WAL Optimisation - configuration and usage

From: Josh Berkus
Simon,

> Please set WAL_DEBUG to 1 so we can see a bit more info: thanks.

I'm pretty sure that WAL_DEBUG requires a compile-time option.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: WAL Optimisation - configuration and usage

From: "Simon Riggs"
>Josh Berkus wrote
> >Simon Riggs wrote
> > Please set WAL_DEBUG to 1 so we can see a bit more info: thanks.
>
> I'm pretty sure that WAL_DEBUG requires a compile-time option.

In my naiveté, I just set it and use it - I discovered it in the code
and set it to take advantage.

I'm surprised, but you are right: the manual does SAY this requires a
compile-time option; it is unfortunately not correct. Maybe it was
once, as you say? If Rob is using 7.2.2, maybe this still applies to
him? I don't know.

Setting wal_debug > 0 in postgresql.conf sets a variable XLOGDEBUG
which, although it is all capitals, is not in fact a compiler directive
as it appears to be. This variable is used within xlog.c to output
**too much** information to the log. However, it is the only option at
present.

This prompts me, however, to consider the idea of having various levels
of WAL debug output, or some kind of log_checkpoint mechanism, to
better understand what is going on.

Best Regards, Simon Riggs


Re: WAL Optimisation - configuration and usage

From: "Simon Riggs"
>Tom Lane
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > - why a checkpoint interval of 300 secs causes them to happen every
> > 10 mins in quieter periods - is that an occasional update occurring?
>
> There is code in there to suppress a checkpoint if no WAL-loggable
> activity has happened since the last checkpoint.  Not sure if that's
> relevant to the issue or not though.

Thanks Tom - at least that clears up why the checkpoint timings are off.

I must admit, I'm taken aback though:

I'd prefer it if it DIDN'T do that. If the system is quiet, the odd
checkpoint doesn't matter that much - whereas taking a long time to
return the xlog files to the *desired* state of having many
pre-allocated log files is not a great thing.

What do you think about continuing to checkpoint normally until the
number of xlog files has returned to 2*checkpoint_segments+1, and then
allowing checkpoints to slow down when quiet? It would be easy enough
to set a variable true while rearranging files towards the limit, set
it false once the limit has been hit, and use that to activate the
slow-down code (not that I know where that is, mind...). However, that
would require some backend-to-postmaster IPC, which may be a problem.

Or perhaps the real problem is only recycling one file at a time - if
we're running this as the checkpoint process, it wouldn't be a problem
to recycle more than one at the same time, would it?

The reason for my interest: when archiving logs for PITR, there may be
occasional long pauses while waiting for tape mounts (typically 30
minutes from notification to change). These pauses could (and
therefore, for some people, eventually will) cause a severe log file
build-up, and I'm interested in making sure this build-up doesn't take
too long to clear. Forgetting the archival API stuff for a second, this
is roughly the same situation as Rob is experiencing (or at least the
one causing him to pause and think).

Best Regards, Simon Riggs


Re: WAL Optimisation - configuration and usage

From: Neil Conway
Simon Riggs wrote:
>>Josh Berkus wrote
>>
>>>Simon Riggs wrote
>>>Please set WAL_DEBUG to 1 so we can see a bit more info: thanks.
>>
>>I'm pretty sure that WAL_DEBUG requires a compile-time option.
>
> I'm surprised, but you are right, the manual does SAY this requires a
> compile-time option; it is unfortunately not correct.

Actually, the manual is correct: in 7.4 and earlier releases, enabling
wal_debug can be done without also setting a compile-time #ifdef. As
of current CVS HEAD, the WAL_DEBUG #ifdef must be defined before this
variable is available.

-Neil

Re: WAL Optimisation - configuration and usage

From: Josh Berkus
Neil,

> Actually, the manual is correct: in 7.4 and earlier releases, enabling
> wal_debug can be done without also setting a compile-time #ifdef. As
> of current CVS HEAD, the WAL_DEBUG #ifdef must be defined before this
> variable is available.

Hmmm.   I was told that it was this way for 7.4 as well; that's why it's in
the docs that way.


--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: WAL Optimisation - configuration and usage

From: "Simon Riggs"
>Neil Conway
> Simon Riggs wrote:
> >>Josh Berkus wrote
> >>
> >>>Simon Riggs wrote
> >>>Please set WAL_DEBUG to 1 so we can see a bit more info: thanks.
> >>
> >>I'm pretty sure that WAL_DEBUG requires a compile-time option.
> >
> > I'm surprised, but you are right, the manual does SAY this requires
> > a compile-time option; it is unfortunately not correct.
>
> Actually, the manual is correct: in 7.4 and earlier releases, enabling
> wal_debug can be done without also setting a compile-time #ifdef. As
> of current CVS HEAD, the WAL_DEBUG #ifdef must be defined before this
> variable is available.

Touché! I stand corrected - thank you both. My suggestion does work for
Rob, then.

[This also implies I have a screwed version on my machine, so thank you
also for flushing out that lurking issue for me. I'd had a suspicion
for a few weeks. Lucky I'm still just prototyping.]

On the other hand, I was just about to change the wal_debug behaviour
to allow better debugging of PITR features as they're added. I think it
is very important to be able to put the system into debug mode fairly
easily; a recompile is easy enough, but it would be even better to
avoid it completely. This would mean reversing the change you describe.
Here's the design:

The behaviour I wish to add is:
Keep wal_debug as a value between 0 and 16.
If = 0 then no debug output (the default).
Apply the following bitmasks against the value:
Mask 1 = XLOG checkpoints get logged
Mask 2 = Archive API calls get logged
Mask 4 = Transaction commits get logged
Mask 8 = Flushes & INSERTs get logged

That way it should be fairly straightforward to control the amount and
type of information available to administrators. The existing design
produces too much info to be easily usable, mostly requiring a perl
program to filter out the overload and do record counts. This suggested
design allows you to control the volume of messages, since the bitmasks
are arranged in volume/frequency order, and it brings the wal_debug
option back to being useful for problem diagnosis on live systems, not
just for hacking the code.
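For example, under this scheme an administrator who wanted checkpoints
and commits logged, but not the flush/insert firehose, would set:

wal_debug = 5    # mask 1 (checkpoints) + mask 4 (commits)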

Anybody object to these mods, or have better/different ideas? Getting
the diagnostics right is fairly important, IMHO, to making PITR real.

Best regards, Simon Riggs


Re: WAL Optimisation - configuration and usage

From: Neil Conway
Josh Berkus wrote:
> Hmmm.   I was told that it was this way for 7.4 as well; that's why it's in
> the docs that way.

No such statement is made in the docs AFAIK: they merely say "If
nonzero, turn on WAL-related debugging output."

I invented a new #ifdef symbol when making this change in CVS HEAD, so
I think you are misremembering.

-Neil

Re: [HACKERS] WAL Optimisation - configuration and usage

From: Tom Lane
"Simon Riggs" <simon@2ndquadrant.com> writes:
> The behaviour I wish to add is:
> Keep wal_debug as a value between 0 and 16.
> If = 0 then no debug output (the default).
> Apply the following bitmasks against the value:
> Mask 1 = XLOG checkpoints get logged
> Mask 2 = Archive API calls get logged
> Mask 4 = Transaction commits get logged
> Mask 8 = Flushes & INSERTs get logged

I see no value in reverting Neil's change.  The above looks way too much
like old-line assembler-programmer thinking to me, anyway.  Why not
invent a separate, appropriately named boolean variable for each thing
you want to control?  Even C programmers manage to avoid doing the sort
of mental arithmetic that the above would force onto DBAs.

As for whether it should be #ifdef'd or not, I'd have no objection to
turning WAL_DEBUG on by default in pg_config_manual.h for the duration
of PITR development.  One should not however confuse short-term
debugging needs with features that the average user is going to need
indefinitely.  (It was not too long ago that there was still debugging
code for btree index building in there, for crissakes.)

            regards, tom lane

Re: WAL Optimisation - configuration and usage

From: Neil Conway
Simon Riggs wrote:
> On the other hand, I was just about to change the wal_debug behaviour to
> allow better debugging of PITR features as they're added.

That's a development activity. Enabling the WAL_DEBUG #ifdef by
default during the 7.5 development cycle would be uncontroversial, I
think.

> I think it is very important to be able to put the system into
> debug mode fairly easily

It is? Why would this be useful for non-development activities?

(It may well be the case that we ought to report more or better
information about the status of the WAL subsystem; but WAL_DEBUG is
surely not the right mechanism for emitting information intended for
an administrator.)

-Neil

Re: [HACKERS] WAL Optimisation - configuration and usage

From: "Simon Riggs"
>Neil Conway
> Simon Riggs wrote:
> > On the other hand, I was just about to change the wal_debug
> > behaviour to allow better debugging of PITR features as they're
> > added.
>
> That's a development activity. Enabling the WAL_DEBUG #ifdef by
> default during the 7.5 development cycle would be uncontroversial, I
> think.

Yes that's the best proposal. Can I leave that with you?

> > I think it is very important to be able to put the system into
> > debug mode fairly easily
>
> It is? Why would this be useful for non-development activities?
>
> (It may well be the case that we ought to report more or better
> information about the status of the WAL subsystem; but WAL_DEBUG is
> surely not the right mechanism for emitting information intended for
> an administrator.)

Right again. I guess my proposal amounted to quick-and-dirty logging.

I'll think some more.

Best Regards, Simon Riggs