Thread: BUG #7902: lazy cleanup of extraneous WAL files can cause out of disk issues

BUG #7902: lazy cleanup of extraneous WAL files can cause out of disk issues

From
jeff@pgexperts.com
Date:
The following bug has been logged on the website:

Bug reference:      7902
Logged by:          Jeff Frost
Email address:      jeff@pgexperts.com
PostgreSQL version: 9.2.3
Operating system:   Ubuntu 12.04
Description:


While doing acceptance testing on a new Ubuntu 12.04 PostgreSQL server
running 9.2.3, we set checkpoint_segments = 128,
checkpoint_completion_target = 0.9 and placed pg_xlog on a separate 20G
partition. Also, archive_mode = off on this system.

According to the docs, you would expect the system to attempt to keep the
WAL files down close to 3 * checkpoint_segments + 1.  Unfortunately, this
does not appear to be the case because a pgbench run would run the pg_xlog
partition out of space.
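
(For reference, if I'm doing the arithmetic right: 3 * 128 + 1 = 385 segments
at 16MB apiece is only about 6G, which should fit comfortably on a 20G
partition.)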

The pgbench run script looks like this:

#!/bin/bash

# Recreate the test database and load it at scale factor 1000
# (100 million pgbench_accounts rows).
dropdb bench
createdb bench
pgbench -i -s 1000 bench
# Analyze everything and force a checkpoint so the timed run starts clean.
vacuumdb -a --analyze-only
psql -c "checkpoint"
# 10-minute run: 64 clients, 16 worker threads, per-statement latency report.
pgbench -c 64 -j 16 -r -T 600 bench

While the pgbench does cause lots of xlog based checkpoints, they never seem
to remove more than a few files and often pg_xlog grows to more than 20G and
the postgresql service falls over.

After moving pg_xlog to a larger partition, it seems it peaks at about 22G
in size.


A manual checkpoint after the run always brings it back down to ~ 4G in
size.

Interestingly, I was unable to reproduce this with 9.2.3 on our inhouse test
system; however, the inhouse system has much less RAM and CPU resources, so
this may only be an issue on larger systems. The system that exhibits the
issue has 128G of RAM and 16 cores (32 with hyperthreading).


I also tested 9.2.2 on the affected system and it acted the same.

Hope to test 9.1.8 in the next few days.

Re: BUG #7902: lazy cleanup of extraneous WAL files can cause out of disk issues

From
Rafael Martinez Guerrero
Date:
On 02/22/2013 11:55 PM, jeff@pgexperts.com wrote:
>
> While doing acceptance testing on a new Ubuntu 12.04 PostgreSQL server
> running 9.2.3, we set checkpoint_segments = 128,
> checkpoint_completion_target = 0.9 and placed pg_xlog on a separate 20G
> partition. Also, archive_mode = off on this system.
>
> According to the docs, you would expect the system to attempt to keep the
> WAL files down close to 3 * checkpoint_segments + 1.  Unfortunately, this
> does not appear to be the case because a pgbench run would run the pg_xlog
> partition out of space.
>
[................]

Hello Jeff

We reported this back in 2011, but we did not get to any conclusion:
http://www.postgresql.org/message-id/4DE89072.7070305@usit.uio.no

In our case, we had this problem when creating a GIN index.

I think the problem has something to do with checkpoints: the number of
WAL files will grow beyond the defined limit because the system cannot
finish its checkpoints in time. A good start toward explaining what is
happening would be to identify or describe the situations where
checkpoints can take very long to complete or fail altogether.

We are interested in this subject because we have had this problem a few
times, but we have not been able to identify the reason that triggers it.

regards,
--
Rafael Martinez Guerrero
Center for Information Technology
University of Oslo, Norway

PGP Public Key: http://folk.uio.no/rafael/
On Feb 24, 2013, at 7:16 AM, Rafael Martinez Guerrero <r.m.guerrero@usit.uio.no> wrote:

> We reported this back in 2011, but we did not get to any conclusion:
> http://www.postgresql.org/message-id/4DE89072.7070305@usit.uio.no
>
> In our case, we had this problem when creating a GIN index.
>
> I think the problem has something to do with checkpoints: the number of
> WAL files will grow beyond the defined limit because the system cannot
> finish its checkpoints in time. A good start toward explaining what is
> happening would be to identify or describe the situations where
> checkpoints can take very long to complete or fail altogether.
>
> We are interested in this subject because we have had this problem a few
> times, but we have not been able to identify the reason that triggers it.

Rafael, did you do a followup post on pgsql-performance, or did the
thread die at the end of the one you posted above?


---
Jeff Frost <jeff@pgexperts.com>
CTO, PostgreSQL Experts, Inc.
Phone: 1-888-PG-EXPRT x506
FAX: 415-762-5122
http://www.pgexperts.com/
On Feb 24, 2013, at 1:05 PM, Jeff Frost <jeff@pgexperts.com> wrote:

>
> On Feb 24, 2013, at 7:16 AM, Rafael Martinez Guerrero <r.m.guerrero@usit.uio.no> wrote:
>
>> We reported this back in 2011, but we did not get to any conclusion:
>> http://www.postgresql.org/message-id/4DE89072.7070305@usit.uio.no
>>
>> In our case, we had this problem when creating a GIN index.
>>
>> I think the problem has something to do with checkpoints: the number of
>> WAL files will grow beyond the defined limit because the system cannot
>> finish its checkpoints in time. A good start toward explaining what is
>> happening would be to identify or describe the situations where
>> checkpoints can take very long to complete or fail altogether.
>>
>> We are interested in this subject because we have had this problem a few
>> times, but we have not been able to identify the reason that triggers it.
>
> Rafael, did you do a followup post on pgsql-performance, or did the
> thread die at the end of the one you posted above?

This is how the log_checkpoint output looks during the run:

2013-02-24 21:06:31.156 UTC,,,1624,,51282598.658,114,,2013-02-23 02:12:40 UTC,,0,LOG,00000,"checkpoint starting: immediate force wait",,,,,,,,,""
2013-02-24 21:06:31.216 UTC,,,1624,,51282598.658,115,,2013-02-23 02:12:40 UTC,,0,LOG,00000,"checkpoint complete: wrote 108 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 4 recycled; write=0.054 s, sync=0.002 s, total=0.059 s; sync files=10, longest=0.000 s, average=0.000 s",,,,,,,,,""

So, this ^^^^^ is the manual checkpoint prior to the pgbench run and everything else is part of the pgbench run.

2013-02-24 21:06:45.227 UTC,,,1624,,51282598.658,116,,2013-02-23 02:12:40 UTC,,0,LOG,00000,"checkpoints are occurring too frequently (14 seconds apart)",,"Consider increasing the configuration parameter ""checkpoint_segments"".",,,,,,,""
2013-02-24 21:06:45.227 UTC,,,1624,,51282598.658,117,,2013-02-23 02:12:40 UTC,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
2013-02-24 21:08:35.239 UTC,,,1624,,51282598.658,118,,2013-02-23 02:12:40 UTC,,0,LOG,00000,"checkpoint complete: wrote 266935 buffers (12.7%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=10.889 s, sync=99.079 s, total=110.011 s; sync files=29, longest=12.888 s, average=3.416 s",,,,,,,,,""
2013-02-24 21:08:35.241 UTC,,,1624,,51282598.658,119,,2013-02-23 02:12:40 UTC,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
2013-02-24 21:11:19.506 UTC,,,1624,,51282598.658,120,,2013-02-23 02:12:40 UTC,,0,LOG,00000,"checkpoint complete: wrote 801517 buffers (38.2%); 0 transaction log file(s) added, 0 removed, 128 recycled; write=15.484 s, sync=148.739 s, total=164.266 s; sync files=32, longest=11.773 s, average=4.648 s",,,,,,,,,""
2013-02-24 21:11:19.506 UTC,,,1624,,51282598.658,121,,2013-02-23 02:12:40 UTC,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
2013-02-24 21:14:20.486 UTC,,,1624,,51282598.658,122,,2013-02-23 02:12:40 UTC,,0,LOG,00000,"checkpoint complete: wrote 1009466 buffers (48.1%); 0 transaction log file(s) added, 140 removed, 257 recycled; write=13.690 s, sync=167.245 s, total=180.980 s; sync files=33, longest=14.442 s, average=5.067 s",,,,,,,,,""
2013-02-24 21:14:20.487 UTC,,,1624,,51282598.658,123,,2013-02-23 02:12:40 UTC,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
2013-02-24 21:17:31.507 UTC,,,1624,,51282598.658,124,,2013-02-23 02:12:40 UTC,,0,LOG,00000,"checkpoint complete: wrote 1055059 buffers (50.3%); 0 transaction log file(s) added, 259 removed, 257 recycled; write=22.731 s, sync=168.239 s, total=191.020 s; sync files=32, longest=13.650 s, average=5.257 s",,,,,,,,,""
2013-02-24 21:17:31.507 UTC,,,1624,,51282598.658,125,,2013-02-23 02:12:40 UTC,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
2013-02-24 21:23:36.201 UTC,,,1624,,51282598.658,126,,2013-02-23 02:12:40 UTC,,0,LOG,00000,"checkpoint complete: wrote 814767 buffers (38.9%); 0 transaction log file(s) added, 546 removed, 0 recycled; write=269.553 s, sync=95.042 s, total=364.693 s; sync files=31, longest=11.211 s, average=3.065 s",,,,,,,,,""

They're all number-of-xlog-based and the first one doesn't remove any;
that's not surprising, but I'm surprised that the second one doesn't
remove any.  It does recycle quite a few.  I guess the issue is that it
can generate WAL files way faster than it can checkpoint and cleanup.  I
think I need to write a script to output the number of WAL files each
minute so we can better correlate what's going on.
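
Something along these lines is what I have in mind -- just a rough sketch,
assuming pg_xlog lives under /var/lib/postgresql/9.2/main (adjust the path
for your layout):

#!/bin/bash
# Log a timestamp, the WAL segment count, and the pg_xlog size once a minute.
XLOGDIR=/var/lib/postgresql/9.2/main/pg_xlog
while true; do
    count=$(ls "$XLOGDIR" | grep -cE '^[0-9A-F]{24}$')
    size=$(du -sh "$XLOGDIR" | cut -f1)
    echo "$(date '+%Y-%m-%d %H:%M:%S') segments=$count size=$size"
    sleep 60
done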

Re: BUG #7902: lazy cleanup of extraneous WAL files can cause out of disk issues

From
Rafael Martinez Guerrero
Date:
On 02/24/2013 10:05 PM, Jeff Frost wrote:
>
> On Feb 24, 2013, at 7:16 AM, Rafael Martinez Guerrero
> <r.m.guerrero@usit.uio.no <mailto:r.m.guerrero@usit.uio.no>> wrote:
>
>> We reported this back in 2011, but we did not get to any conclusion:
>> http://www.postgresql.org/message-id/4DE89072.7070305@usit.uio.no
>>
>> In our case, we had this problem when creating a GIN index.
>>
>> I think the problem has something to do with checkpoints: the number of
>> WAL files will grow beyond the defined limit because the system cannot
>> finish its checkpoints in time. A good start toward explaining what is
>> happening would be to identify or describe the situations where
>> checkpoints can take very long to complete or fail altogether.
>>
>> We are interested in this subject because we have had this problem a few
>> times, but we have not been able to identify the reason that triggers it.
>
> Rafael, did you do a followup post on pgsql-performance, or did the
> thread die at the end of the one you posted above?
>

Jeff, we didn't have the time to continue investigating this problem
that time. Our workaround was to increase the size of our pg_xlog
partition so we could finish the export/import job we had to do.

regards,
--
Rafael Martinez Guerrero
Center for Information Technology
University of Oslo, Norway

PGP Public Key: http://folk.uio.no/rafael/
On Friday, February 22, 2013, wrote:

> The following bug has been logged on the website:
>
> Bug reference:      7902
> Logged by:          Jeff Frost
> Email address:      jeff@pgexperts.com
> PostgreSQL version: 9.2.3
> Operating system:   Ubuntu 12.04
> Description:
>
> While doing acceptance testing on a new Ubuntu 12.04 PostgreSQL server
> running 9.2.3, we set checkpoint_segments = 128,
> checkpoint_completion_target = 0.9 and placed pg_xlog on a separate 20G
> partition. Also, archive_mode = off on this system.
>
> According to the docs, you would expect the system to attempt to keep the
> WAL files down close to 3 * checkpoint_segments + 1.


Assuming checkpoints are driven by checkpoint_segments and not time or
manual checkpoints, this actually seems to be an error in the docs.  What
it brings it back down to is something more like
(3 + checkpoint_completion_target) * checkpoint_segments + 1.
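
With your settings that works out to roughly (3 + 0.9) * 128 + 1 = 500
segments, or about 7.8G at 16MB apiece -- still comfortably under 20G, so
that steady-state figure by itself doesn't explain running out of space.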

At the point in the code where it does the removal, it has all of the logs
since the start of the just-finished checkpoint
(completion_target*checkpoint_segments) that can't be removed, all of the
logs since the start of the checkpoint before that one
(=checkpoint_segments) that can't be removed, plus it will recycle enough
to have checkpoint_segments*2+1 future logs before it removes the rest.

But this describes what happens once things have gotten out of control and
then calmed down again.  In your case it never does calm down.

Cheers,

Jeff
On Sunday, February 24, 2013, Jeff Frost wrote:

>
> This is how the log_checkpoint output looks during the run:
>
> 2013-02-24 21:06:31.156 UTC,,,1624,,51282598.658,114,,2013-02-23 02:12:40
> UTC,,0,LOG,00000,"checkpoint starting: immediate force wait",,,,,,,,,""
> 2013-02-24 21:06:31.216 UTC,,,1624,,51282598.658,115,,2013-02-23 02:12:40
> UTC,,0,LOG,00000,"checkpoint complete: wrote 108 buffers (0.0%); 0
> transaction log file(s) added, 0 removed, 4 recycled; write=0.054 s,
> sync=0.002 s, total=0.059 s; sync files=10, longest=0.000 s, average=0.000
> s",,,,,,,,,""
>
> So, this ^^^^^ is the manual checkpoint prior to the pgbench run and
> everything else is part of the pgbench run.
>
> 2013-02-24 21:06:45.227 UTC,,,1624,,51282598.658,116,,2013-02-23 02:12:40
> UTC,,0,LOG,00000,"checkpoints are occurring too frequently (14 seconds
> apart)",,"Consider increasing the configuration parameter
> ""checkpoint_segments"".",,,,,,,""
>

It took 14 seconds to fill up 128 files.
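
Back of the envelope, that is 128 * 16MB / 14 s, or roughly 145 MB/s of WAL
being generated.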


> 2013-02-24 21:06:45.227 UTC,,,1624,,51282598.658,117,,2013-02-23 02:12:40
> UTC,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
> 2013-02-24 21:08:35.239 UTC,,,1624,,51282598.658,118,,2013-02-23 02:12:40
> UTC,,0,LOG,00000,"checkpoint complete: wrote 266935 buffers (12.7%); 0
> transaction log file(s) added, 0 removed, 1 recycled; write=10.889 s,
> sync=99.079 s, total=110.011 s; sync files=29, longest=12.888 s,
> average=3.416 s",,,,,,,,,""
>

It spent 10.9 seconds writing buffers, either because it knew it was behind
schedule and was writing as fast as it could, or because it figured it needed
to be done with everything in 10.9 / 0.9 = 12.1 seconds, as that was when
the next checkpoint was due to start (i.e. when the next 128 were going to be
filled).  But it took 99 seconds to accomplish the part of the checkpoint
that is supposed to fit in 1.2 seconds or less.  Assuming it was still
generating log files at the same rate of 128 every 12 to 14 seconds, there
would be an extra 1000 files by the time it is done.  That is almost enough,
right there, to fill up your hard drive.  Now it probably won't manage to do
that, as I imagine the long fsync times would also lead to stalls in
throughput.
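
(The 1000-file figure: the checkpoint took about 110 seconds, and at 128
segments every ~13 seconds that is on the order of 1,000 extra segments,
roughly 17G of WAL.)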


> 2013-02-24 21:08:35.241 UTC,,,1624,,51282598.658,119,,2013-02-23 02:12:40
> UTC,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
> 2013-02-24 21:11:19.506 UTC,,,1624,,51282598.658,120,,2013-02-23 02:12:40
> UTC,,0,LOG,00000,"checkpoint complete: wrote 801517 buffers (38.2%); 0
> transaction log file(s) added, 0 removed, 128 recycled; write=15.484 s,
> sync=148.739 s, total=164.266 s; sync files=32, longest=11.773 s,
> average=4.648 s",,,,,,,,,""
>

Here again, either the writes are being metered so they finish after 15.484
s, when PG estimates that 0.9 * 128 log files will have been filled, or they
were going as fast as possible because it thought they would be filled even
earlier.  It then took nearly ten times that long to finish the checkpoint
(which, according to completion_target, is supposed to take 1/10 the time
of the write, not 10 times longer).  So of course the actual number of log
files is going to greatly exceed the planned 128.


> 2013-02-24 21:11:19.506 UTC,,,1624,,51282598.658,121,,2013-02-23 02:12:40
> UTC,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
> 2013-02-24 21:14:20.486 UTC,,,1624,,51282598.658,122,,2013-02-23 02:12:40
> UTC,,0,LOG,00000,"checkpoint complete: wrote 1009466 buffers (48.1%); 0
> transaction log file(s) added, 140 removed, 257 recycled; write=13.690 s,
> sync=167.245 s, total=180.980 s; sync files=33, longest=14.442 s,
> average=5.067 s",,,,,,,,,""
> 2013-02-24 21:14:20.487 UTC,,,1624,,51282598.658,123,,2013-02-23 02:12:40
> UTC,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
> 2013-02-24 21:17:31.507 UTC,,,1624,,51282598.658,124,,2013-02-23 02:12:40
> UTC,,0,LOG,00000,"checkpoint complete: wrote 1055059 buffers (50.3%); 0
> transaction log file(s) added, 259 removed, 257 recycled; write=22.731 s,
> sync=168.239 s, total=191.020 s; sync files=32, longest=13.650 s,
> average=5.257 s",,,,,,,,,""
> 2013-02-24 21:17:31.507 UTC,,,1624,,51282598.658,125,,2013-02-23 02:12:40
> UTC,,0,LOG,00000,"checkpoint starting: xlog",,,,,,,,,""
> 2013-02-24 21:23:36.201 UTC,,,1624,,51282598.658,126,,2013-02-23 02:12:40
> UTC,,0,LOG,00000,"checkpoint complete: wrote 814767 buffers (38.9%); 0
> transaction log file(s) added, 546 removed, 0 recycled; write=269.553 s,
> sync=95.042 s, total=364.693 s; sync files=31, longest=11.211 s,
> average=3.065 s",,,,,,,,,""
>
> They're all number-of-xlog-based and the first one doesn't remove any;
> that's not surprising,
>

What was the state of the pg_xlog directory before you started the test?
If it was not full up to checkpoint_segments * 2 (which it likely was not,
since populating pgbench_accounts avoids WAL when not archiving), then I
agree it is not surprising.

> but I'm surprised that the second one doesn't remove any.  It does
> recycle quite a few.

PG keeps two checkpoints' worth of WAL files (that doesn't really seem to be
documented, and I'm not sure why it is done, other than an abundance of
caution).  So after the 2nd checkpoint is done, it keeps the ~500 or so log
files that were filled since the 2nd one started, and the ~400 or so filled
since the first one started.  The 128 that were filled to trigger the first
checkpoint can now be recycled, though.  Then after the next checkpoint,
those 400 are freed up: 257 are recycled and 140 are removed.  And so on.

>  I guess the issue is that it can generate WAL files way faster than it
> can checkpoint and cleanup.
>

Yep, I don't have any doubt about that.


>  I think I need to write a script to output the number of WAL files each
> minute so we can better correlate what's going on.
>

I think we can already be confident about what is going on.  But perhaps
the log_checkpoints output should record the number of log files retained, as
well as the numbers recycled and removed.  It would also be nice to know the
difference between time spent actually writing buffers and time spent
napping during the buffer-write phase.

Cheers,

Jeff
On 02/26/13 09:08, Jeff Janes wrote:
>
>     They're all number-of-xlog-based and the first one doesn't remove any;
>     that's not surprising,
>
>
> What was the state of the pg_xlog directory before you started the test?  If
> it was not full up to checkpoint_segments * 2 (which it likely was not,
> since populating pgbench_accounts avoids WAL when not archiving), then I
> agree it is not surprising.

It always starts at around 258 files.

The total WAL volume during the 10min pgbench run, as evidenced by enabling an
archive command that just touches the filename, is 1,972 WAL files archived, or
about 31G.  So it is cleaning up or recycling about 9G as it goes;
unfortunately, that still peaks at roughly 22G, about 2G more than a 20G
filesystem can hold.
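
For reference, the measurement setup was something like this (the marker
directory is just an example path):

# In postgresql.conf for the measurement run:
#   archive_mode = on
#   archive_command = 'touch /var/tmp/wal_seen/%f'
# Create the directory (writable by postgres) before enabling archiving:
mkdir -p /var/tmp/wal_seen
# After the run, each zero-byte marker stands for one 16MB WAL segment:
count=$(ls /var/tmp/wal_seen | wc -l)
echo "$count segments, roughly $(( count * 16 / 1024 ))G of WAL"

1,972 segments * 16MB comes out right around 31G.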



--
Jeff Frost <jeff@pgexperts.com>
CTO, PostgreSQL Experts, Inc.
Phone: 1-888-PG-EXPRT x506
FAX: 415-762-5122
http://www.pgexperts.com/