Thread: Cost of creating an empty WAL segment

Cost of creating an empty WAL segment

From: Devrim GÜNDÜZ
Hi,

Has anyone measured the cost of creating empty WAL segments while the
database is running?

Actually, when is the new file created? Just after one segment is filled
up, or some time before then? What happens during WAL segment creation?
If there are pending transactions to be committed, do we see a delay?

I was looking at how Oracle manages this, and I was told that you can
create empty segments during installation, so I'm wondering whether it
might be a good addition to initdb or not:

initdb -S 30 -- create 30 empty segments during initdb

Regards,
--
Devrim GÜNDÜZ
Principal Systems Engineer @ EnterpriseDB: http://www.enterprisedb.com
PostgreSQL Danışmanı/Consultant, Red Hat Certified Engineer
Community: devrim~PostgreSQL.org, devrim.gunduz~linux.org.tr
http://www.gunduz.org  Twitter: http://twitter.com/devrimgunduz


Re: Cost of creating an empty WAL segment

From: Greg Smith
On 06/24/2011 10:43 AM, Devrim GÜNDÜZ wrote:
> Has anyone measured the cost of creating empty WAL segments while the
> database is running?
>
> Actually, when is the new file created? Just after one segment is filled
> up, or some time before then? What happens during WAL segment creation?
> If there are pending transactions to be committed, do we see a delay?
>

Excellent set of questions.  Yes, and it can be disturbingly high on a
fresh system that's immediately hit with lots of transactions, or on one
where checkpoint_segments was just raised a lot and then nailed with
activity.  The problem also pops up in an even worse way when a server
has overrun its normal checkpoint finish time, usually due to a slow
sync phase.  That then also results in creation of new
segments--necessary because all of the segments recycled by the previous
checkpoint have already been used up--and that work now competes against
the already slow checkpoint writes, with backends running their own
fsyncs too, just to make the mix extra exciting.

I know the server code does try to stay a bit ahead of this problem, by
creating one segment in advance under conditions I forget the details
of, to reduce the odds a client will actually hit a delay there.  It
helps the average case.  But since it doesn't do much for the worst-case
problems that make my life miserable, working on that doesn't make my
life easier; therefore I don't.
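
To answer the "what happens during WAL segment creation" part of the
question: a new segment is a 16MB file that has to be zero-filled and
fsync'd before it can be used, which is where any visible pause comes
from.  Here's a rough sketch of the work involved; it is not the actual
xlog.c code, and the function name and error handling are made up for
illustration:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define SEGMENT_SIZE (16 * 1024 * 1024)   /* default WAL segment size */
#define BLOCK_SIZE   8192

/* Roughly what has to happen when no pre-created or recycled segment is
 * available: write 16MB of zeros, then fsync the result to disk. */
static int
create_empty_segment(const char *path)
{
    char    zeros[BLOCK_SIZE];
    int     fd, i;

    memset(zeros, 0, sizeof(zeros));
    fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    for (i = 0; i < SEGMENT_SIZE / BLOCK_SIZE; i++)
    {
        if (write(fd, zeros, BLOCK_SIZE) != BLOCK_SIZE)
        {
            close(fd);
            unlink(path);
            return -1;
        }
    }
    if (fsync(fd) != 0)               /* the part clients actually notice */
    {
        close(fd);
        unlink(path);
        return -1;
    }
    return close(fd);
}

Recycling avoids nearly all of that, because at the end of a checkpoint
an old segment is simply renamed into place instead of being rewritten
from scratch.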

The main argument in favor of pre-populating WAL segments early after
the database starts is that it would probably be a small cheat on some
benchmarks, moving a bit of work that happens during the test to before
the test starts, when it isn't counted.  But I've never been excited
about the idea of creating empty segments near initdb time, for 3 reasons:

-The default postgresql.conf written at initdb time has the pathetically
small checkpoint_segments=3, so pre-creating segments won't actually
work unless we increase that first.
-Most servers go through a data loading stage before they hit
production, which takes care of this anyway.
-The worst problems I've ever seen in this area, by far (as in:  at
least 100X worse than what you're asking about), are when new segments
are created because heavy write activity has exhausted the list of
segments available to be recycled during a checkpoint; see the sketch
after this list.
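
Very roughly, the decision at each segment switch looks like the sketch
below.  This is a simplified, hypothetical model rather than the real
xlog.c logic, and it reuses the invented create_empty_segment() helper
from the earlier sketch: when a checkpoint has already recycled an old
segment under the next file name, switching is nearly free; when the
recycle list has been exhausted, the full zero-fill cost lands in the
middle of whatever I/O storm is already underway.

#include <fcntl.h>

/* Hypothetical helper from the earlier sketch: zero-fill and fsync 16MB. */
extern int create_empty_segment(const char *path);

/* Simplified model of what happens when the current WAL segment fills up. */
static int
open_next_segment(const char *next_path)
{
    int     fd = open(next_path, O_RDWR, 0600);

    if (fd >= 0)
        return fd;              /* a recycled segment was waiting: cheap */

    /*
     * Nothing left to recycle: pay for a brand new zero-filled segment
     * while transactions wait, competing with the checkpoint's own writes.
     */
    if (create_empty_segment(next_path) != 0)
        return -1;
    return open(next_path, O_RDWR, 0600);
}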

To give you an example of *that*, here is the new record-setting slow
checkpoint I found in my inbox just this morning, from a system that has
been instrumented because of its increasing checkpoint issues over the
last month:

checkpoint complete: wrote 63297 buffers (6.0%); 0 transaction log
file(s) added, 938 removed, 129 recycled; write=250.384 s,
sync=14525.296 s, total=14786.868 s

Here checkpoint_segments=64 and shared_buffers=8GB.  The fact that this
checkpoint hit the "create a new empty WAL segment" code 938 times
during its 4 hour checkpoint sync phase is much more troubling than the
pauses people run into when making new segments on a fresh server.  I
would argue that if your database is new enough not to have populated a
full set of checkpoint_segments yet, it's also new enough that it can't
possibly have enough data in it for that to really be a problem.
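
A back-of-the-envelope check puts that in perspective.  Assuming
checkpoint_completion_target is at its 0.5 default on this system (the
log doesn't say), the documented rule of thumb is that pg_xlog normally
levels off around (2 + checkpoint_completion_target) *
checkpoint_segments + 1 files:

#include <stdio.h>

int
main(void)
{
    int     checkpoint_segments = 64;
    double  completion_target = 0.5;    /* assumed default; not in the log */
    double  segment_mb = 16.0;          /* default WAL segment size */

    double  steady_state = (2 + completion_target) * checkpoint_segments + 1;

    printf("steady-state pg_xlog: ~%.0f files (~%.1f GB)\n",
           steady_state, steady_state * segment_mb / 1024);
    printf("this one checkpoint:  938 removed + 129 recycled (~%.1f GB)\n",
           (938 + 129) * segment_mb / 1024);
    return 0;
}

That works out to roughly 161 segments, about 2.5GB, in steady state,
while this single checkpoint churned through more than 16GB of WAL.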

(Note for hackers following along this far:  I strongly suspect today's
latest pathological example is the sort improved by the "compact fsync
request" feature added to 9.1.  I'm now seeing so many of these in the
field that I'm collecting data to support the idea of backporting that
to 8.3-9.0 as a bug fix, because this turns into a "system is too
slow to be considered operational" performance problem when it flares
up.  This was a rare thing I'd never seen before back in September when
I started working on this area again; now I see it once a month on a new
system somewhere.)

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


Re: Cost of creating an empty WAL segment

From: Shaun Thomas
On 06/24/2011 11:18 AM, Greg Smith wrote:

> sync=14525.296 s, total=14786.868 s

Whaaaaaaaaat!? 6% of 8GB is just shy of 500MB. That's not a small
amount, exactly, but it took 14525 seconds to call syncs for those
writes? What kind of ridiculous IO would cause something like that?
That's even way beyond an OS-level dirty buffer flush on a massive
system. Wow!

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@peak6.com


Re: Cost of creating an empty WAL segment

From: Greg Smith
On 06/24/2011 02:55 PM, Shaun Thomas wrote:
> On 06/24/2011 11:18 AM, Greg Smith wrote:
>
>> sync=14525.296 s, total=14786.868 s
>
> Whaaaaaaaaat!? 6% of 8GB is just shy of 500MB. That's not a small
> amount, exactly, but it took 14525 seconds to call syncs for those
> writes? What kind of ridiculous IO would cause something like that?
> That's even way beyond an OS-level dirty buffer flush on a massive
> system. Wow!

It is in fact smaller than the write cache on the disk array involved.
The mystery is explained at
http://projects.2ndquadrant.it/sites/default/files/WriteStuff-PGCon2011.pdf

The most relevant part:

-Background writer stops working normally while running sync
-Never pauses to fully consume the fsync queue that backends fill
-Once filled, all backend writes do their own fsync
-Serious competition for the checkpoint writes

When the background writer's fsync queue fills, and you have 100 clients
all doing their own writes and making an fsync call after each one of
them (the case on this server), the background writer ends up only
getting around 1/100 of the I/O capabilities of the server available in
its time slice.  And that's how a sync phase that might normally take
two minutes on a really busy server ends up running for hours instead.
The improvement in 9.1 gets the individual backends involved in trying
to compact the fsync queue when they discover it is full, which seems to
make the worst case behavior here much better.
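
To make that concrete, here is a much-simplified model of the
backend-side logic being described.  The names, the queue structure, and
the use of plain file descriptors are all invented for illustration (the
real code in md.c/bgwriter.c tracks relation segments and takes locks,
none of which is shown here): before 9.1, a backend that finds the
shared queue full just fsyncs the file itself; with the 9.1 change it
first tries to squeeze duplicate requests out of the queue, and only
falls back to its own fsync if that doesn't free up space.

#include <stdbool.h>
#include <unistd.h>

#define FSYNC_QUEUE_SIZE 1024           /* invented size, for illustration */

typedef struct
{
    int     files[FSYNC_QUEUE_SIZE];    /* files waiting for an fsync */
    int     count;
} FsyncQueue;

/* Drop duplicate entries so the queue has room again: the essence of the
 * 9.1 "compact the fsync request queue" change. */
static bool
compact_queue(FsyncQueue *q)
{
    int     out = 0;

    for (int in = 0; in < q->count; in++)
    {
        bool    seen = false;

        for (int j = 0; j < out; j++)
        {
            if (q->files[j] == q->files[in])
            {
                seen = true;
                break;
            }
        }
        if (!seen)
            q->files[out++] = q->files[in];
    }

    bool    made_room = (out < q->count);

    q->count = out;
    return made_room;
}

/* What a backend does after dirtying a file it wants fsync'd later. */
static void
forward_fsync_request(FsyncQueue *q, int fd, bool have_91_compaction)
{
    if (q->count >= FSYNC_QUEUE_SIZE &&
        (!have_91_compaction || !compact_queue(q)))
    {
        fsync(fd);                      /* queue full: the backend pays itself */
        return;
    }
    q->files[q->count++] = fd;
}

The compaction path matters because most entries in a full queue tend to
be duplicate requests for the same few hot files, so deduplicating
usually frees a lot of room at once.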

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books