Re: Redesigning checkpoint_segments - Mailing list pgsql-hackers

From: Josh Berkus
Subject: Re: Redesigning checkpoint_segments
Msg-id: 51B0C5BD.80200@agliodbs.com
In response to: Redesigning checkpoint_segments (Heikki Linnakangas <hlinnakangas@vmware.com>)
List: pgsql-hackers
>> Then I suggest we not use exactly that name.  I feel quite sure we
>> would get complaints from people if something labeled as "max" was
>> exceeded -- especially if they set that to the actual size of a
>> filesystem dedicated to WAL files.
> 
> You're probably right. Any suggestions for a better name?
> wal_size_soft_limit?

"checkpoint_size_limit", or something similar.  That is, what you're
defining is:

"this is the size at which we trigger a checkpoint even if
checkpoint_timeout has not been exceeded".
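To make the proposed semantics concrete, here is a rough sketch of the trigger rule being described; this is illustrative Python, not PostgreSQL code, and the parameter names simply mirror the settings discussed in this thread:

```python
def should_checkpoint(wal_bytes_since_last, seconds_since_last,
                      checkpoint_size_limit, checkpoint_timeout):
    """A checkpoint fires when EITHER trigger is hit: the WAL written
    since the last checkpoint reaches the size limit, or the timeout
    elapses -- whichever comes first."""
    return (wal_bytes_since_last >= checkpoint_size_limit
            or seconds_since_last >= checkpoint_timeout)
```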

However, I think it's worth considering: if we're doing this "sizing
checkpoints based on prior cycles" thing, do we really need a size_limit
*at all* for most users?   I can see how a hard limit is useful, but not
how a soft limit is.

Most of our users most of the time don't care how large WAL is as long
as it doesn't exceed disk space.  And on most databases, hitting
checkpoint_timeout is more frequent than hitting checkpoint_segments --
at least in my substantial performance-tuning experience.  So I think
most users would prefer a setting which essentially says "make WAL as
big as it has to be in order to maximize throughput", and wouldn't worry
about the disk space.

>
> Yeah, something like that :-). I was thinking of letting the estimate
> decrease like a moving average, but react to any increases immediately.
> Same thing we do in bgwriter to track buffer allocations:

Seems reasonable.  Given the behavior of xlog, I'd want to adjust the
algo so that peak usage on a 24-hour basis would affect current
preallocation.  That is, if a site regularly has a peak from 2-3pm where
they're using 180 segments/cycle, then they should still be somewhat
higher at 2am than a database which doesn't have that peak.  I'm pretty
sure that the bgwriter's moving average operates on much shorter time
scales than that.
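The asymmetric smoothing being discussed (react to increases immediately, decay slowly) can be sketched like this; it is a simplified illustration of the technique, not the actual bgwriter code, and the smoothing factor is an assumed value:

```python
def update_estimate(smoothed, recent, smoothing=16):
    """Track WAL usage per checkpoint cycle: jump straight to any
    new peak, but let the estimate decay gradually (a moving average)
    when usage falls."""
    if recent > smoothed:
        return recent                     # react to increases immediately
    return smoothed + (recent - smoothed) / smoothing  # slow decay
```

A 24-hour peak memory, as suggested above, could be layered on top by taking the max of this estimate and the highest per-cycle usage seen in the last day.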

>> Well, the ideal unit from the user's point of view is *time*, not space.
>>   That is, the user wants the master to keep, say, "8 hours of
>> transaction logs", not any amount of MB.  I don't want to complicate
>> this proposal by trying to deliver that, though.
>
> OTOH, if you specify it in terms of time, then you don't have any limit
> on the amount of disk space required.

Well, the best setup from my perspective as a remote DBA for a lot of
clients would be two-factor:

wal_keep_time: ##hr
wal_keep_size_limit: ##GB

That is, we would try to keep ##hr of WAL around for the standbys,
unless that amount exceeded ##GB (at which point we'd write a warning to
the logs).  If max_wal_size was a hard limit, we wouldn't need
wal_keep_size_limit, of course.
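As a sketch of that two-factor retention policy (the setting names here are the hypothetical ones proposed above, not real GUCs):

```python
import time

def segments_to_keep(segments, wal_keep_time, wal_keep_size_limit, now=None):
    """segments: list of (created_unix_ts, size_bytes), oldest first.
    Keep segments younger than wal_keep_time seconds, unless their
    total size would exceed wal_keep_size_limit bytes -- in which case
    older segments are dropped and a warning would be logged."""
    now = time.time() if now is None else now
    kept, total = [], 0
    # walk newest-first so the size cap drops the oldest segments
    for ts, size in reversed(segments):
        if now - ts > wal_keep_time:
            break                         # older than the retention window
        if total + size > wal_keep_size_limit:
            print("WARNING: wal_keep_size_limit reached; "
                  "dropping WAL still inside wal_keep_time")
            break
        kept.append((ts, size))
        total += size
    kept.reverse()
    return kept
```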

However, to some degree Andres' work will render all this
wal_keep_segments stuff obsolete by letting the master track what
segment was last consumed by each replica, so I don't think it's worth
pursuing this line of thinking a lot further.

In any case, I'm just pointing out that we need to think of
wal_keep_segments as part of the total WAL size, and not as something
separate, because that's confusing our users.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


