Thread: Configuring BLCKSZ and XLOGSEGSZ (in 8.3)
It seems possible to vary both BLCKSZ and XLOGSEGSZ rather than have
them set within pg_config_manual.h. There are a number of use-cases
where varying these values will offer increased performance (and for
many cases, no difference at all, I accept). Most of the PostgreSQL
user base don't recompile their own versions, let alone know how to
edit the source to change these parameters. Those people should have
access to the benefits currently known to and used by a small few.

- BLCKSZ could be set at initdb, via an additional option -Z allowing
  the value to be set to 4, 8, 16 or 32 KB at that point (default 8 KB).

- XLOGSEGSZ could also be set at initdb, though it could additionally
  be set using pg_resetxlog following a clean shutdown. (This would,
  for example, require a Warm Standby server to be reconfigured.)

Both of these changes would require updates to the control file, to
allow those aspects to be set prior to the initial write of the control
file. Values would be re-read from the control file on startup.

Some refactoring would be required to make BLCKSZ usable this way,
touching a number of parts of the code in minor ways. There aren't many
cases where BLCKSZ is used repeatedly at run time; mostly it is used
during startup of the server or in some parts of executor start. So,
AFAICS, there is relatively low overhead in supporting a variable
BLCKSZ.

The infrastructure to support a variable XLOGSEGSZ is already there, so
few changes are required to make this variable.

Comments?

(Sorry for raising so many threads at once; the 8.3 cycle is fairly
short, so I want to get going, now that 8.2 seems almost there)

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
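A minimal sketch, with invented names, of what "values would be re-read
from the control file on startup" could look like. The blcksz and
xlog_seg_size fields mirror the cross-check fields that already exist
in the real ControlFileData; AdoptControlFileSettings(), BlockSize and
XLogSegSize are hypothetical stand-ins, not the proposed patch:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct ControlFileData
    {
        uint32_t    blcksz;         /* data block size chosen at initdb */
        uint32_t    xlog_seg_size;  /* WAL segment size chosen at initdb */
        /* ... many other fields elided ... */
    } ControlFileData;

    static uint32_t BlockSize;      /* would replace compile-time BLCKSZ */
    static uint32_t XLogSegSize;    /* would replace XLOG_SEG_SIZE */

    static void
    AdoptControlFileSettings(const ControlFileData *cf)
    {
        /* accept only the proposed range: 4, 8, 16 or 32 KB, power of 2 */
        if (cf->blcksz < 4096 || cf->blcksz > 32768 ||
            (cf->blcksz & (cf->blcksz - 1)) != 0)
        {
            fprintf(stderr, "invalid block size %u in control file\n",
                    cf->blcksz);
            exit(1);
        }
        BlockSize = cf->blcksz;
        XLogSegSize = cf->xlog_seg_size;
    }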
On Monday, 27 November 2006 12:30, Simon Riggs wrote:
> It seems possible to vary both BLCKSZ and XLOGSEGSZ rather than have
> them set within pg_config_manual. There are a number of use-cases
> where varying these values will offer increased performance

Such as?

--
Peter Eisentraut
http://developer.postgresql.org/~petere/
On 11/27/06, Peter Eisentraut <peter_e@gmx.net> wrote:
> On Monday, 27 November 2006 12:30, Simon Riggs wrote:
> > It seems possible to vary both BLCKSZ and XLOGSEGSZ rather than have
> > them set within pg_config_manual. There are a number of use-cases
> > where varying these values will offer increased performance
>
> Such as?

Reading 32k at a time from my SAN, instead of 8k, gave me a ~15%
increase in overall I/O throughput. Now, I'm not certain that I didn't
do something else stupid that the larger BLCKSZ partially counteracts,
but dd bears out my results (and 64k is even faster for dd).

--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org
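For anyone wanting to reproduce this sort of comparison outside dd, a
rough sequential-read micro-benchmark in C. The default file path is a
placeholder, and results will be skewed by the OS page cache unless the
file is much larger than RAM (or caches are dropped between runs):

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
        const char *path = (argc > 1) ? argv[1] : "/tmp/testfile";
        size_t      bufsz = (argc > 2) ? (size_t) atoi(argv[2]) : 32768;
        char       *buf = malloc(bufsz);
        int         fd = open(path, O_RDONLY);
        ssize_t     n;
        long long   total = 0;
        struct timeval t0, t1;
        double      secs;

        if (fd < 0 || buf == NULL)
        {
            perror("setup");
            return 1;
        }
        gettimeofday(&t0, NULL);
        while ((n = read(fd, buf, bufsz)) > 0)
            total += n;                 /* count bytes actually read */
        gettimeofday(&t1, NULL);
        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
        if (secs <= 0)
            secs = 1e-9;                /* avoid divide-by-zero on tiny files */
        printf("%lld bytes in %.2f s = %.1f MB/s at %zu-byte reads\n",
               total, secs, total / secs / 1e6, bufsz);
        close(fd);
        free(buf);
        return 0;
    }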
On Mon, 2006-11-27 at 14:01 +0100, Peter Eisentraut wrote:
> On Monday, 27 November 2006 12:30, Simon Riggs wrote:
> > It seems possible to vary both BLCKSZ and XLOGSEGSZ rather than have
> > them set within pg_config_manual. There are a number of use-cases
> > where varying these values will offer increased performance
>
> Such as?

Increasing XLOGSEGSZ improves performance with write-intensive
workloads, where WAL is sufficiently active that switching WAL files
and fsyncing causes all commits to freeze momentarily.
http://blogs.sun.com/jkshah/category/Databases?page=1
Sun seems to think so as well, but that does appear to be rare
knowledge, AFAICS.

Increasing BLCKSZ has been claimed to help by
http://archives.postgresql.org/pgsql-performance/2006-05/msg00444.php
http://archives.postgresql.org/pgsql-performance/2005-12/msg00139.php
http://archives.postgresql.org/pgsql-performance/2004-12/msg00271.php

Discussion on that does seem somewhat inconclusive, but that may be
just because test results are rather thin on the ground, for lack of
the ability to test this without recompilation. One commentator says
the gain isn't worth the pain of having to recompile to get it, even
though there is measured benefit. Personally, I've not measured any
benefit for OLTP workloads, but there are many other workloads to try
out.

Increasing BLCKSZ would also allow increasing the size of GiST index
entries (IIRC?). It would certainly allow larger TOAST_TARGETs, letting
more data be held in a single longer tuple than is currently possible,
which would allow many text-based applications to avoid various
overheads.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > It seems possible to vary both BLCKSZ and XLOGSEGSZ rather than have > them set within pg_config_manual. The work required for this is much larger than you make it out to be, and zero evidence has been offered for any benefit. I have not heard of anyone bothering to use a custom BLCKSZ since we added TOAST to get rid of the row-length limitation ... regards, tom lane
Simon Riggs wrote:
> Increasing XLOGSEGSZ improves performance with write-intensive
> workloads, where WAL is sufficiently active that switching WAL files
> and fsyncing causes all commits to freeze momentarily.
> http://blogs.sun.com/jkshah/category/Databases?page=1

He increased the WAL segment size from 16 MB to 256 MB. Without any
further information about the system configuration, that seems to be
mostly equivalent to increasing the number of checkpoint segments.

> Increasing BLCKSZ has been claimed to help by [snip]
> Discussion on that does seem somewhat inconclusive, but that may be
> just because test results are rather thin on the ground, for lack of
> the ability to test this without recompilation.

I don't doubt that there may be a positive effect from increasing the
block size. But we haven't seen any analysis of why that might be. If
it's just to use the disk system bandwidth better, maybe we should
combine page writes instead, or something.

> Increasing BLCKSZ would also allow increasing the size of GiST index
> entries (IIRC?). It would certainly allow larger TOAST_TARGETs [snip]

Have there ever been demands for reconfiguring the TOASTing behavior?
The TOAST system seems to think that values larger than about 2 kB will
rarely or never be used in computations, only for retrieval. What was
the reason for choosing this particular limit? It seems to me that the
maximum size of useful lookup keys is mostly influenced by human
intelligence, not by the available computing hardware, so 2 kB seems to
be just fine.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/
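A simplified illustration, not the actual 8.2 macros, of why the ~2 kB
figure is what it is: the threshold is derived from the page size so
that roughly four tuples still fit per page, which means it scales with
BLCKSZ rather than being an independent knob. The real definitions also
subtract page and tuple header overhead, so the true number is a little
under BLCKSZ/4:

    #include <stdio.h>

    /* Simplified: header overhead is rounded away here, so this is an
     * approximation of the real TOAST threshold derivation. */
    #define BLCKSZ                  8192
    #define TOAST_TUPLES_PER_PAGE   4
    #define TOAST_TUPLE_THRESHOLD   (BLCKSZ / TOAST_TUPLES_PER_PAGE)

    int
    main(void)
    {
        printf("approx TOAST threshold at BLCKSZ=%d: %d bytes\n",
               BLCKSZ, TOAST_TUPLE_THRESHOLD);
        return 0;
    }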
Peter Eisentraut <peter_e@gmx.net> writes:
> I don't doubt that there may be a positive effect from increasing the
> block size. But we haven't seen any analysis of why that might be.

It seems at least as likely that increased block size would *decrease*
performance by requiring even small writes to do more physical I/O.
This applies to both data files and xlog.

But the real issue here is whether there are grounds for supporting
run-time changes in the block size. AFAICS the evidence for supporting
even compile-time changes is pretty weak; why should we take the likely
complexity and performance costs of making it run-time changeable?

			regards, tom lane
On Mon, 2006-11-27 at 22:08 +0100, Peter Eisentraut wrote:
> Simon Riggs wrote:
> > Increasing XLOGSEGSZ improves performance with write-intensive
> > workloads, where WAL is sufficiently active that switching WAL files
> > and fsyncing causes all commits to freeze momentarily.
> > http://blogs.sun.com/jkshah/category/Databases?page=1
>
> He increased the WAL segment size from 16 MB to 256 MB. Without any
> further information about the system configuration, that seems to be
> mostly equivalent to increasing the number of checkpoint segments.

On a busy system you can switch WAL segments every few seconds at 16MB.
Fsync can freeze commits for more than a second, so raising the segment
size reduces the fsync overhead considerably. This doesn't drop away
fully with any of the various wal_sync_method settings.

256MB is good, 1GB is better. Obviously this changes the on-disk
footprint considerably, so some flexibility is needed to accommodate
both small PC configs and large performance servers.

It does also have the same effect as changing checkpoint segments, but
we already have variability in that dimension.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
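Back-of-envelope numbers behind "every few seconds", assuming an
illustrative sustained WAL write rate of 5 MB/s:

    segment switch interval = XLOGSEGSZ / WAL write rate

    16 MB  / (5 MB/s) ~   3 s between switches (and forced fsyncs)
    256 MB / (5 MB/s) ~  51 s
    1 GB   / (5 MB/s) ~ 205 s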
"Simon Riggs" <simon@2ndquadrant.com> writes: > On Mon, 2006-11-27 at 22:08 +0100, Peter Eisentraut wrote: >> He increased the WAL segment size from 16 MB to 256 MB. Without any >> further information about the system configuration, that seems to be >> mostly equivalent to increasing the number of checkpoint segments. > On a busy system you can switch WAL segments every few seconds at 16MB. > Fsync can freeze commits for more than a second, so raising the segment > size reduces the fsync overhead considerably. Sorry, but that's just handwaving. The amount of data to be written for any specific commit isn't going to change in the least if you change XLOGSEGSZ --- it's still going to be whatever has been written since the last commit. I agree with Peter that the quoted Sun test appears to have failed to control the frequency of checkpoints, and that that was what really accounted for the performance change. So he'd have gotten the same result from increasing checkpoint_segments without bothering with a change in XLOGSEGSZ. I do note that XLogWrite() does this in the foreground path of control: * If we just wrote the whole last page of a logfile segment, * fsync the segment immediately. Thisavoids having to go back * and re-open prior segments when an fsync request comes along * later.Doing it here ensures that one and only one backend will * perform this fsync. This coding predates the existence of the bgwriter; now that we have that, it'd perhaps be interesting to try to put the burden on the bgwriter instead. (However, if a backend is trying to fsync a commit record just after the segment switch, it'd have to wait for the previous segment to be fsync'd anyway. The complexity and likely performance costs of arranging for that synchronization might outweigh any gains.) In any case, the existence of this code isn't an argument for raising XLOGSEGSZ, more the reverse --- the bigger the segment the more painful the fsync is likely to be. [ studies code a bit more... ] I'm also wondering whether the forced pg_control update at each xlog seg switch is worth its keep. Offhand it seems like the checkpoint pointer is enough; why are we maintaining logId/logSeg in pg_control? regards, tom lane
On Mon, 2006-11-27 at 18:26 -0500, Tom Lane wrote:
> Sorry, but that's just handwaving.

That's fine. I was responding to private comments that I was trying to
test things that had not been subject to community design. My response
was that the community would react badly if conjectures were discussed
without presenting firm performance evidence. Chicken and egg...

> [ studies code a bit more... ] I'm also wondering whether the forced
> pg_control update at each xlog seg switch is worth its keep. Offhand it
> seems like the checkpoint pointer is enough; why are we maintaining
> logId/logSeg in pg_control?

	ControlFile->logId = openLogId;
	ControlFile->logSeg = openLogSeg + 1;
	ControlFile->time = time(NULL);
	UpdateControlFile();

I've looked through the code paths related to the above code, run just
at xlog switch. There doesn't seem to be any useful effect of storing
these values in the control file. The logId and logSeg are never read,
only written.

There is a slight impact in that when the server crashes it will say
the database crashed at ControlFile->time, so if we remove the update,
the crash information will be slightly more out of date than it is now
in many cases. With a long checkpoint_timeout that could be as much as
an hour, but then that's no worse than it potentially is now, on a
system in a slack period when little WAL is written. Perhaps we could
say that if it's within a minute of the last switch time we update the
control file, and otherwise don't. That seems like coding for the sake
of it, though, and if we wanted that then we'd get the bgwriter to do
it, not a random backend.

Anyway, we can skip updating the control file and its fsync. IMHO
touching the control file less is likely to make us more robust. I'll
code up a patch for that and test to see if that improves things.

Not sure if this is RC material? No, OK, don't shout.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
> I don't doubt that there may be a positive effect from increasing the
> block size. But we haven't seen any analysis of why that might be. If
> it's just to use the disk system bandwidth better, maybe we should
> combine page writes instead or something.

It is usually the reads that need aid. Writes are fast on modern disk
systems, since they are cached. I think the main effect is reduced OS
overhead for the readahead prediction logic and reduced system call
overhead.

Andreas
On Mon, Nov 27, 2006 at 04:47:57PM -0500, Tom Lane wrote:
> It seems at least as likely that increased block size would *decrease*
> performance by requiring even small writes to do more physical I/O.
> This applies to both data files and xlog.

FWIW, a test we performed on just this some time ago was inconclusive,
and I chalked up the inconclusiveness to exactly the increase in
physical I/O for small writes. I couldn't release the results, just
because I wasn't in a position to release the test data, but we had a
fairly eclectic mixture of big and small rows. On certain workloads, it
was in fact slower than the stock size (IIRC we tried both 16k and
32k), which is what led me to that speculation. But I never chased any
of it down, because the preliminary results were so unpromising.

A

--
Andrew Sullivan | ajs@crankycanuck.ca
I remember when computers were frustrating because they *did* exactly
what you told them to. That actually seems sort of quaint now.
		--J.D. Baldwin
Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>> It seems possible to vary both BLCKSZ and XLOGSEGSZ rather than have
>> them set within pg_config_manual.
>
> The work required for this is much larger than you make it out to be,
> and zero evidence has been offered for any benefit. I have not heard of
> anyone bothering to use a custom BLCKSZ since we added TOAST to get rid
> of the row-length limitation ...

I think configurable BLCKSZ could be useful for regression testing; see
the hash index problem. Some kind of stress test could then check
behavior on different BLCKSZ values without recompilation.

Zdenek
On Mon, 2006-11-27 at 18:26 -0500, Tom Lane wrote:
> [ studies code a bit more... ] I'm also wondering whether the forced
> pg_control update at each xlog seg switch is worth its keep. Offhand
> it seems like the checkpoint pointer is enough; why are we maintaining
> logId/logSeg in pg_control?

We maintain the values in shared memory to allow us to determine
whether or not it's time to checkpoint, and also to ensure that there
is one and only one call to checkpoint. So we need to keep track of
this somewhere, and it may as well be where it already is.

However, that doesn't mean we need to update the file on disk each time
we switch xlog files, so I've removed the UpdateControlFile() at that
point.

That fsync was done while holding WALWriteLock, so removing it should
be good for a few extra points of speed - at least we know there were
some problems in that area.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
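In outline (paraphrased, not the literal patch), the change described
amounts to keeping the shared-memory copy current while dropping the
on-disk write:

    if (finishing_seg)
    {
        issue_xlog_fsync();

        /* shared-memory values still maintained, for the checkpoint test */
        ControlFile->logId = openLogId;
        ControlFile->logSeg = openLogSeg + 1;

        /*
         * ... but no UpdateControlFile() any more: one fewer fsync
         * performed while WALWriteLock is held.
         */
    }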
"Simon Riggs" <simon@2ndquadrant.com> writes: > On Mon, 2006-11-27 at 18:26 -0500, Tom Lane wrote: >> [ studies code a bit more... ] I'm also wondering whether the forced >> pg_control update at each xlog seg switch is worth its keep. Offhand >> it seems like the checkpoint pointer is enough; why are we maintaining >> logId/logSeg in pg_control? > We maintain the values in shared memory to allow us to determine whether > or not its time to checkpoint, and also to ensure that there is one and > only one call to checkpoint. So we need to keep track of this somewhere > and that may as well be where it already is. Say again? AFAICT those fields are write-only; the only place we consult them is to decide whether they need to be updated. My thought was to remove 'em altogether. regards, tom lane
On Tue, 2006-12-05 at 15:14 -0500, Tom Lane wrote:
> Say again? AFAICT those fields are write-only; the only place we
> consult them is to decide whether they need to be updated. My thought
> was to remove 'em altogether.

That's what I thought originally.

However, they guard the entrance to RequestCheckpoint(), and after they
have been set nobody else will call it - look at the test immediately
prior to the rows changed by the patch. That comparison is why we still
need them and why they aren't just write-only.

So they need to be there, but we just don't need to write them to
pg_control.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > On Tue, 2006-12-05 at 15:14 -0500, Tom Lane wrote: >> Say again? AFAICT those fields are write-only; the only place we >> consult them is to decide whether they need to be updated. My thought >> was to remove 'em altogether. > Thats what I thought originally. > However, they guard the entrance to RequestCheckpoint() and after they > have been set nobody else will call it - look at the test immediately > prior to the rows changed by the patch. Sure, what would happen is that every backend passing through this code would execute the several lines of computation needed to decide whether to call RequestCheckpoint. That's still way cheaper than an xlog switch as a whole, so it doesn't bother me. I think the first test is probably effectively redundant anyway, since the whole thing is executed with WALWriteLock held and so there can be only one backend doing it at a time --- it's not apparent to me that it's possible for someone else to have updated pg_control before the backend executing XLogWrite does. But in any case, the point here is that it doesn't matter whether the RequestCheckpoint code is inside the update-pg_control test or not. It was only put there on the thought that we could save some small number of cycles by not doing it if the update-pg_control test failed. regards, tom lane
On Tue, 2006-12-05 at 16:24 -0500, Tom Lane wrote:
> Sure, what would happen is that every backend passing through this code
> would execute the several lines of computation needed to decide whether
> to call RequestCheckpoint. That's still way cheaper than an xlog switch
> as a whole, so it doesn't bother me. I think the first test is probably
> effectively redundant anyway, since the whole thing is executed with
> WALWriteLock held and so there can be only one backend doing it at a
> time --- it's not apparent to me that it's possible for someone else to
> have updated pg_control before the backend executing XLogWrite does.

Right, but the calculation uses RedoRecPtr, which may not be completely
up to date. So presumably you want to re-read the shared memory value
again, to make sure we are exactly accurate and allow only one person
to call checkpoint? Either way we have to take a lock. The insert lock
causes deadlock, so we would need to use the info lock.

Yes, one backend at a time executes this code, but we need a way to
tell whether the backend is the first to come through it. I just left
it with the lock it was already requesting. If you really think it
should use the info lock then I'll code it that way instead.

> But in any case, the point here is that it doesn't matter whether the
> RequestCheckpoint code is inside the update-pg_control test or not.
> It was only put there on the thought that we could save some small
> number of cycles by not doing it if the update-pg_control test failed.

Understood; that wasn't why I left it that way.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > On Tue, 2006-12-05 at 16:24 -0500, Tom Lane wrote: >> Sure, what would happen is that every backend passing through this code >> would execute the several lines of computation needed to decide whether >> to call RequestCheckpoint. > Right, but the calculation uses RedoRecPtr, which may not be completely > up to date. So presumably you want to re-read the shared memory value > again to make sure we are exactly accurate and allow only one person to > call checkpoint? Either way we have to take a lock. Insert lock causes > deadlock, so we would need to use infolock. Not at all. It's highly unlikely that RedoRecPtr would be so out of date as to result in a false request for a checkpoint, and if it does, so what? Worst case is we perform an extra checkpoint. Also, given the current structure of the routine, this is probably not the best place for that code at all --- it'd make more sense for it to be in the just-finished-a-segment code stretch, which would ensure that it's only done by one backend once per segment. regards, tom lane
On Tue, 2006-12-05 at 17:26 -0500, Tom Lane wrote:
> Not at all. It's highly unlikely that RedoRecPtr would be so out of
> date as to result in a false request for a checkpoint, and if it does,
> so what? Worst case is we perform an extra checkpoint.

On its own, I wouldn't normally agree...

> Also, given the current structure of the routine, this is probably not
> the best place for that code at all --- it'd make more sense for it to
> be in the just-finished-a-segment code stretch, which would ensure that
> it's only done by one backend once per segment.

But that's a much better plan, since it requires no locking.

There are a lot more changes there for such a simple fix, though, and
lots more potential bugs, but I've coded it as you suggest and removed
the fields from pg_control.

The patch passes make check and applies cleanly on HEAD. pg_resetxlog
and pg_controldata tested.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
On 11/27/06, Simon Riggs <simon@2ndquadrant.com> wrote:
> On a busy system you can switch WAL segments every few seconds at 16MB.
> Fsync can freeze commits for more than a second, so raising the segment
> size reduces the fsync overhead considerably. This doesn't drop away
> fully with any of the various wal_sync_method settings.
>
> 256MB is good, 1GB is better. Obviously this changes the on-disk
> footprint considerably, so some flexibility is needed to accommodate
> both small PC configs and large performance servers.

Also, 16MB WALs are quite a burden for backup systems (that's a lot of
files that just keep coming and coming). [1]

Regards,
Dawid

[1]: It really does make a difference, especially if you have a
centralized backup. And as for recovery, we have
pg_xlogfile_name_offset(), so the size of the WAL file should not be a
problem in HA setups.
"Simon Riggs" <simon@2ndquadrant.com> writes: > [ patch to remove logId/logSeg from pg_control ] Looking this over, I realize that there's an unresolved problem. Although it's true that xlog.c itself doesn't use the logId/logSeg fields for anything interesting, pg_resetxlog relies on them to determine how far the old WAL extends, so that it can determine a safely higher start address for the new WAL. This puts a damper both on my thought of removing the fields altogether, and on Simon's earlier proposal to update them in shared memory but not immediately write pg_control during a segment switch. The proposed patch uses pg_control's last checkpoint location to drive the end-of-WAL computation, but that is obviously not good enough, as WAL might have gone many segments beyond that. Now, underestimating the WAL end address is not fatal; AFAIK the only consequence would be some complaints about "xlog flush request is not satisfied" until we had managed to advance the end of WAL past the largest page LSN present in the data files. But it's still annoying. What I'm considering is having pg_resetxlog scan the pg_xlog directory and assume that any segment files present might have been used. Thoughts? regards, tom lane
"Simon Riggs" <simon@2ndquadrant.com> writes: > [ patch to remove logId/logSeg from pg_control ] Applied with revisions. regards, tom lane
On Fri, 2006-12-08 at 13:18 -0500, Tom Lane wrote:
> What I'm considering is having pg_resetxlog scan the pg_xlog directory
> and assume that any segment files present might have been used.

[Reading committed code...] That's a very neat shortcut - just using
the file names, rather than trying to read the files themselves as the
abortive patch a few months back tried to do.

Question: what happens when we run out of LogIds? I know we don't wrap
onto the next timeline, but do we start from LogId=1 again? Looks to me
like we just fall on the floor right now. We're probably not pressed
for an answer... :-)

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > Question: What happens when we run out of LogIds? We die. That will occur after generating 2^64 bytes of WAL, which for an installation generating 100MB/second would be something over 5000 years if I'm counting correctly. regards, tom lane
The original discussion of this patch was here:

	http://archives.postgresql.org/pgsql-hackers/2006-11/msg00876.php

Your patch has been added to the PostgreSQL unapplied patches list at:

	http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------

Simon Riggs wrote:
> On Tue, 2006-12-05 at 17:26 -0500, Tom Lane wrote:
> > Not at all. It's highly unlikely that RedoRecPtr would be so out of
> > date as to result in a false request for a checkpoint, and if it does,
> > so what? Worst case is we perform an extra checkpoint.
>
> On its own, I wouldn't normally agree...
>
> > Also, given the current structure of the routine, this is probably not
> > the best place for that code at all --- it'd make more sense for it to
> > be in the just-finished-a-segment code stretch, which would ensure that
> > it's only done by one backend once per segment.
>
> But that's a much better plan, since it requires no locking.
>
> There are a lot more changes there for such a simple fix, though, and
> lots more potential bugs, but I've coded it as you suggest and removed
> the fields from pg_control.
>
> The patch passes make check and applies cleanly on HEAD. pg_resetxlog
> and pg_controldata tested.

--
Bruce Momjian
bruce@momjian.us
EnterpriseDB   http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
On Sat, 2007-02-03 at 20:37 -0500, Bruce Momjian wrote:
> Your patch has been added to the PostgreSQL unapplied patches list at:
>
> 	http://momjian.postgresql.org/cgi-bin/pgpatches
>
> It will be applied as soon as one of the PostgreSQL committers reviews
> and approves it.

Tom applied the patch a few months ago.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
Patch already applied by Tom. Removed from queue.

--
Bruce Momjian
bruce@momjian.us
EnterpriseDB   http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +