Thread: wal-size limited to 16MB - Performance issue for subsequent backup

wal-size limited to 16MB - Performance issue for subsequent backup

From
jesper@krogh.cc
Date:
Hi.

One of our "production issues" is that the system generates lots of
wal-files, lots is like 151952 files over the last 24h, which is about
2.4TB worth of WAL files. I wouldn't say that isn't an issue by itself,
but the system does indeed work fine. We do subsequently gzip the files to
limit actual disk-usage, this makes the files roughly 30-50% in size.

That being said, along comes the backup, scheduled once a day, and tries to
read off these wal-files, which to the backup looks like "an awful lot of
small files". Our backup utilizes a single thread to read those files
and levels off at reading 30-40MB/s from a 21-drive RAID 50 of
rotating drives, which is quite bad. That causes a daily incremental run
to take on the order of 24h. Differentials, which pick up larger deltas,
and fulls are even worse.

One could of course question a lot of the other things here, but
currently the only limiting factor is actually the backup being
able to keep up with transporting the WAL away from the system, due
to the small WAL segment size.

A short test like:
find . -type f -ctime -1 | tail -n 50 | xargs cat | pipebench > /dev/null
confirms the backup speed to be roughly the same as seen by the backup
software.
Another test from the same volume doing:
find . -type f -ctime -1 | tail -n 50 | xargs cat  > largefile
And then wait for the fs to not cache the file any more and
cat largefile | pipebench > /dev/null
confirms that the disk subsystem can do way better (150-200MB/s) on larger
files.

Thread here around the same topic:
http://postgresql.1045698.n5.nabble.com/How-do-you-change-the-size-of-the-WAL-files-td3425516.html
But it doesn't offer a very welcoming workaround.

Suggestions are welcome. An archive_command/restore_command that could
combine/split wal-segments might be the easiest workaround, but what about
crash-safety?





Re: wal-size limited to 16MB - Performance issue for subsequent backup

From
Tom Lane
Date:
jesper@krogh.cc writes:
> Thread here around the same topic:
> http://postgresql.1045698.n5.nabble.com/How-do-you-change-the-size-of-the-WAL-files-td3425516.html
> But it doesn't offer a very welcoming workaround.

configure --with-wal-segsize=something ?
        regards, tom lane



Re: wal-size limited to 16MB - Performance issue for subsequent backup

From
jesper@krogh.cc
Date:
> jesper@krogh.cc writes:
>> Thread here around the same topic:
>> http://postgresql.1045698.n5.nabble.com/How-do-you-change-the-size-of-the-WAL-files-td3425516.html
>> But it doesn't offer a very welcoming workaround.
>
> configure --with-wal-segsize=something ?

Yes, but there are good reasons not to go down that route:

1) It looks like I'm going to be the only one; beware of the dragons.
2) It apparently requires a re-initdb, and thus a dump/restore of 4.5TB of data.

The latter can absolutely be done by scheduling downtime, but the former
would compromise any level of good practice around our DB operations.

Jesper





Re: wal-size limited to 16MB - Performance issue for subsequent backup

From
"ktm@rice.edu"
Date:
On Mon, Oct 20, 2014 at 09:03:59PM +0200, jesper@krogh.cc wrote:
> Hi.
> 
> One of our "production issues" is that the system generates lots of
> wal-files, lots is like 151952 files over the last 24h, which is about
> 2.4TB worth of WAL files. I wouldn't say that isn't an issue by itself,
> but the system does indeed work fine. We do subsequently gzip the files to
> limit actual disk-usage, this makes the files roughly 30-50% in size.
> 
> ...
> 
> Suggestions are welcome. An archive_command/restore_command that could
> combine/split wal-segments might be the easiest workaround, but what about
> crash-safety?
> 

Hi,

Have you considered using something like tar/star in the archive command
to pack them into much larger tar archives?
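
For illustration, a minimal sketch of such a wrapper (untested; /backup/wal,
the per-day grouping and the script name are placeholders of mine). It would be
installed as e.g. archive_command = '/usr/local/bin/pack_wal_tar.sh %p %f'. A
segment retried after a failure gets appended twice, but on sequential
extraction the later copy simply overwrites the earlier one:

#!/bin/sh
# Append each finished WAL segment to a per-day tar archive.
# $1 = %p (segment path relative to the data directory, the archiver's cwd)
# $2 = %f (segment file name only)
set -e
path="$1"
fname="$2"
archive="/backup/wal/wal-$(date +%Y%m%d).tar"

if [ -f "$archive" ]; then
    tar -rf "$archive" -C "$(dirname "$path")" "$fname"   # append to existing archive
else
    tar -cf "$archive" -C "$(dirname "$path")" "$fname"   # create it on first segment
fi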

Regards,
Ken



Re: wal-size limited to 16MB - Performance issue for subsequent backup

From
Andres Freund
Date:
Hi,

On 2014-10-20 21:03:59 +0200, jesper@krogh.cc wrote:
> One of our "production issues" is that the system generates lots of
> wal-files, lots is like 151952 files over the last 24h, which is about
> 2.4TB worth of WAL files. I wouldn't say that isn't an issue by itself,
> but the system does indeed work fine. We do subsequently gzip the files to
> limit actual disk-usage, this makes the files roughly 30-50% in size.

Have you analyzed what the source of that volume is? Which version of
postgres are you using? What's your checkpoint_timeout/segments
settings?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: wal-size limited to 16MB - Performance issue for subsequent backup

From
Tom Lane
Date:
jesper@krogh.cc writes:
>> configure --with-wal-segsize=something ?

> Yes, but there are good reasons not to go down that route:

> 1) It looks like I'm going to be the only one; beware of the dragons.
> 2) It apparently requires a re-initdb, and thus a dump/restore of 4.5TB of data.

I think a clean shutdown and pg_resetxlog would be sufficient.  I agree
you'd need to do some testing though ...
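
Roughly, as a sketch only (paths and the 64MB value are placeholders; rehearse
this on a copy of the cluster first):

# build and install binaries compiled for the larger segment size
./configure --with-wal-segsize=64 ...
make && make install

pg_ctl -D /db/data stop -m fast    # clean shutdown, nothing left to replay
pg_resetxlog /db/data              # clears pg_xlog and rewrites pg_control for the new size
pg_ctl -D /db/data start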
        regards, tom lane



Re: wal-size limited to 16MB - Performance issue for subsequent backup

From
jesper@krogh.cc
Date:
> On 2014-10-20 21:03:59 +0200, jesper@krogh.cc wrote:
>> One of our "production issues" is that the system generates lots of
>> wal-files, lots is like 151952 files over the last 24h, which is about
>> 2.4TB worth of WAL files. I wouldn't say that isn't an issue by itself,
>> but the system does indeed work fine. We do subsequently gzip the files
>> to
>> limit actual disk-usage, this makes the files roughly 30-50% in size.
>
> Have you analyzed what the source of that volume is? Which version of
> postgres are you using? What's your checkpoint_timeout/segments
> settings?

Suggestions are surely welcome. I do suspect the majority is from 30
concurrent processes updating a 506GB GIN index, but it would be nice to
confirm that. There is also a message-queue in the DB with a fairly high
turnaround.

Currently PG 9.2 moving to 9.3 hopefully before end-of-year,
checkpoint_timeout = 30min, checkpoint_segments = 4096.

According to logs checkpoints are roughly 15 minutes apart.

Jesper





Re: wal-size limited to 16MB - Performance issue for subsequent backup

From
Andres Freund
Date:
On 2014-10-20 21:41:26 +0200, jesper@krogh.cc wrote:
> 
> > On 2014-10-20 21:03:59 +0200, jesper@krogh.cc wrote:
> >> One of our "production issues" is that the system generates lots of
> >> wal-files, lots is like 151952 files over the last 24h, which is about
> >> 2.4TB worth of WAL files. I wouldn't say that isn't an issue by itself,
> >> but the system does indeed work fine. We do subsequently gzip the files
> >> to limit actual disk-usage, this makes the files roughly 30-50% in size.

I'm a bit doubtful that 16MB vs., say, 64MB files really changes
anything substantial for you. If it indeed is a problem, it's simple
enough to join the files temporarily?

> > Have you analyzed what the source of that volume is? Which version of
> > postgres are you using? What's your checkpoint_timeout/segments
> > settings?
> 
> Suggestions are surely welcome.

Once you're on 9.3 I'd suggest using pg_xlogdump --stats on it. There's
a backport of the facility for 9.3 (looking somewhat different than what
is now in 9.5) at
http://archives.postgresql.org/message-id/CABRT9RAzGowqLFcEE8aF6VdPoFEy%2BP9gmu7ktGRzw0dgRwVr9Q%40mail.gmail.com

That'd tell you a fair bit more. It's noticeably harder to backport to <
9.3.
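
Usage would be roughly along these lines (the archive path and segment names
are placeholders, and the backport may spell the options slightly differently):

pg_xlogdump --stats=record -p /wal/archive \
    00000001000000AB00000000 00000001000000AB000000FF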

> I do suspect the majority is from 30 concurrent processes updating a
> 506GB GIN index, but it would be nice to confirm that. There is also a
> message-queue in the DB with a fairly high turnaround.

A 506GB GIN index? Uh, interesting :). What's it used for? Trigrams?

I'd suspect that the message queue isn't the primary culprit, but it's
hard to say for sure.

> Currently PG 9.2 moving to 9.3 hopefully before end-of-year,
> checkpoint_timeout = 30min, checkpoint_segments = 4096.

Generally a high checkpoint_timeout can significantly reduce the WAL
volume because of fewer full-page writes. I've seen cases where spacing
checkpoints further apart by a factor of two reduced the overall WAL
volume by more than a factor of two.
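
Purely as an illustration, something along these lines in postgresql.conf (the
actual values have to be weighed against your recovery-time and disk-space
budget, so treat these numbers as placeholders):

checkpoint_timeout = 60min              # wider spacing -> fewer full-page writes
checkpoint_segments = 8192              # keep the segment limit from triggering checkpoints sooner
checkpoint_completion_target = 0.9      # spread the checkpoint I/O over most of the interval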

> According to logs checkpoints are roughly 15 minutes apart.

Can you show log_checkpoints output?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: wal-size limited to 16MB - Performance issue for subsequent backup

From
jesper@krogh.cc
Date:
> On 2014-10-20 21:41:26 +0200, jesper@krogh.cc wrote:
>>
>> > On 2014-10-20 21:03:59 +0200, jesper@krogh.cc wrote:
>> >> One of our "production issues" is that the system generates lots of
>> >> wal-files; lots is like 151952 files over the last 24h, which is about
>> >> 2.4TB worth of WAL files. I'm not saying that isn't an issue in itself,
>> >> but the system does indeed work fine. We subsequently gzip the files to
>> >> limit actual disk usage, which reduces them to roughly 30-50% of their size.
>
> I'm a bit doubtful that 16MB vs., say, 64MB files really changes
> anything substantial for you. If it indeed is a problem, it's simple
> enough to join the files temporarily?

I am trying to get my head around a good way to do that. 64MB probably
isn't a silver bullet, but it would definitely benefit the backup in terms
of single-threaded access to data on rotating drives.

>> > Have you analyzed what the source of that volume is? Which version of
>> > postgres are you using? What's your checkpoint_timeout/segments
>> > settings?
>>
>> Suggestions are surely welcome.
>
> Once you're on 9.3 I'd suggest using pg_xlogdump --stats on it. There's
> a backport of the facility for 9.3 (looking somewhat different than what
> is now in 9.5) at
> http://archives.postgresql.org/message-id/CABRT9RAzGowqLFcEE8aF6VdPoFEy%2BP9gmu7ktGRzw0dgRwVr9Q%40mail.gmail.com
>
> That'd tell you a fair bit more. It's noticeably harder to backport to <
> 9.3.

I'll bookmark that one.

>> I do suspect the majority is from 30 concurrent processes updating a
>> 506GB GIN index, but it would be nice to confirm that. There is also a
>> message-queue in the DB with a fairly high turnaround.
>
> A 506GB GIN index? Uh, interesting :). What's it used for? Trigrams?

It is for full-text search, but it is being updated quite regularly,
~100M records. A dump/restore cycle typically reduces the size to 30-40%
of the current size.


>> Currently PG 9.2 moving to 9.3 hopefully before end-of-year,
>> checkpoint_timeout = 30min, checkpoint_segments = 4096.
>
> Generally a high checkpoint_timeout can significantly reduce the WAL
> volume because of fewer full-page writes. I've seen cases where spacing
> checkpoints further apart by a factor of two reduced the overall WAL
> volume by more than a factor of two.

I'll work with that; I was just uncomfortable bumping checkpoint_segments
up much higher. Any field experience in that corner?

>> According to logs checkpoints are roughly 15 minutes apart.
>
> Can you show log_checkpoints output?

2014-10-20 18:10:22 CEST LOG:  checkpoint starting: time
2014-10-20 18:15:44 CEST LOG:  checkpoint complete: wrote 76851 buffers
(7.3%); 0 transaction log file(s) added, 0 removed, 3238 recycled;
write=295.834 s, sync=23.903 s, total=322.011 s; sync files=2115,
longest=0.278 s, average=0.011 s
2014-10-20 18:40:22 CEST LOG:  checkpoint starting: time
2014-10-20 18:44:30 CEST LOG:  checkpoint complete: wrote 60550 buffers
(5.8%); 0 transaction log file(s) added, 0 removed, 3460 recycled;
write=224.678 s, sync=21.795 s, total=248.340 s; sync files=2090,
longest=0.963 s, average=0.010 s
2014-10-20 19:10:22 CEST LOG:  checkpoint starting: time
2014-10-20 19:14:11 CEST LOG:  checkpoint complete: wrote 42720 buffers
(4.1%); 0 transaction log file(s) added, 0 removed, 3598 recycled;
write=206.259 s, sync=21.185 s, total=229.254 s; sync files=2065,
longest=0.945 s, average=0.010 s
2014-10-20 19:40:22 CEST LOG:  checkpoint starting: time
2014-10-20 19:43:31 CEST LOG:  checkpoint complete: wrote 32897 buffers
(3.1%); 0 transaction log file(s) added, 0 removed, 3626 recycled;
write=161.801 s, sync=26.936 s, total=189.635 s; sync files=2115,
longest=0.458 s, average=0.012 s
2014-10-20 20:10:22 CEST LOG:  checkpoint starting: time
2014-10-20 20:14:04 CEST LOG:  checkpoint complete: wrote 37557 buffers
(3.6%); 0 transaction log file(s) added, 0 removed, 3285 recycled;
write=205.011 s, sync=16.550 s, total=222.442 s; sync files=2113,
longest=0.935 s, average=0.007 s
2014-10-20 20:40:22 CEST LOG:  checkpoint starting: time
2014-10-20 20:45:18 CEST LOG:  checkpoint complete: wrote 58012 buffers
(5.5%); 0 transaction log file(s) added, 0 removed, 3678 recycled;
write=252.750 s, sync=39.178 s, total=295.107 s; sync files=2075,
longest=0.990 s, average=0.018 s
2014-10-20 21:10:22 CEST LOG:  checkpoint starting: time
2014-10-20 21:13:31 CEST LOG:  checkpoint complete: wrote 40530 buffers
(3.9%); 0 transaction log file(s) added, 0 removed, 3652 recycled;
write=167.925 s, sync=19.719 s, total=189.057 s; sync files=2077,
longest=0.470 s, average=0.009 s
2014-10-20 21:40:22 CEST LOG:  checkpoint starting: time
2014-10-20 21:44:20 CEST LOG:  checkpoint complete: wrote 45158 buffers
(4.3%); 0 transaction log file(s) added, 0 removed, 3449 recycled;
write=202.986 s, sync=32.564 s, total=237.441 s; sync files=2100,
longest=0.445 s, average=0.015 s






Re: wal-size limited to 16MB - Performance issue for subsequent backup

From
Jeff Janes
Date:
On Mon, Oct 20, 2014 at 12:03 PM, <jesper@krogh.cc> wrote:
> Hi.
>
> One of our "production issues" is that the system generates lots of
> wal-files; lots is like 151952 files over the last 24h, which is about
> 2.4TB worth of WAL files. I'm not saying that isn't an issue in itself,
> but the system does indeed work fine. We subsequently gzip the files to
> limit actual disk usage, which reduces them to roughly 30-50% of their size.
>
> That being said, along comes the backup, scheduled once a day, and tries to
> read off these wal-files, which to the backup looks like "an awful lot of
> small files". Our backup utilizes a single thread to read those files
> and levels off at reading 30-40MB/s from a 21-drive RAID 50 of
> rotating drives, which is quite bad. That causes a daily incremental run
> to take on the order of 24h. Differentials, which pick up larger deltas,
> and fulls are even worse.

Why not have archive_command (which gets the files while they are still cached) put the files directly into their final destination on the backup server?
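
For instance, something as simple as this (backupsrv and the target path are
placeholders; rsync returning non-zero on failure is what makes the archiver
retry the segment):

archive_command = 'rsync -a %p backupsrv:/backup/wal/%f'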


> Suggestions are welcome. An archive_command/restore_command that could
> combine/split wal-segments might be the easiest workaround, but what about
> crash-safety?

I think you would just have to combine them by looking at the file name and seeking to a specific spot in the large file (rather than just appending to it), so that if the archive_command fails and gets rerun, it will still end up in the correct place.  I don't see what other crash-safety issues you would have, other than the ones you already have.  You would want to do the compression after combining, not before, so that all segments are of predictable size.

It should be pretty easy as long as you want your combined files to consist of either 16 or 256 (or 255 in older versions) WAL files.

You would have to pass through directly any files not matching the filename pattern of ordinary WAL files.
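
A rough sketch of that approach, packing 16 of the default 16MB segments into
one 256MB file keyed on the first 23 characters of the name (untested;
/backup/wal and the script name are placeholders, and it assumes the default
segment size):

#!/bin/bash
# Write each segment at a fixed offset derived from its name, so a retried
# copy lands in exactly the same place.  Installed as e.g.:
#   archive_command = '/usr/local/bin/pack_wal.sh %p %f'
set -e
path="$1"   # %p: segment path relative to the data directory (the archiver's cwd)
fname="$2"  # %f: file name, e.g. 00000001000000AB000000C7

# Pass through anything that is not a plain 24-hex-digit segment name
# (timeline history files, .backup files).
if ! [[ "$fname" =~ ^[0-9A-F]{24}$ ]]; then
    cp "$path" "/backup/wal/$fname"
    exit 0
fi

slot=$(( 16#${fname:23:1} ))   # last hex digit: position within the group of 16
group="${fname:0:23}x"         # 16 segments share the same 23-character prefix

dd if="$path" of="/backup/wal/$group" bs=1M \
   seek=$(( slot * 16 )) conv=notrunc 2>/dev/null

Compression would then run over the completed 256MB files rather than per
segment, as suggested above.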

Cheers,

Jeff

Re: wal-size limited to 16MB - Performance issue for subsequent backup

From
Craig Ringer
Date:
On 10/21/2014 03:03 AM, jesper@krogh.cc wrote:

> That being said, along comes the backup, scheduled once a day, and tries to
> read off these wal-files, which to the backup looks like "an awful lot of
> small files". Our backup utilizes a single thread to read those files
> and levels off at reading 30-40MB/s from a 21-drive RAID 50 of
> rotating drives, which is quite bad. That causes a daily incremental run
> to take on the order of 24h. Differentials, which pick up larger deltas,
> and fulls are even worse.

What's the backup system?

151952 files should be a trivial matter for any backup system. I'm very
surprised you're seeing those kinds of run times for 2TB of WAL, and
think it's worth investigating just why the backup system is behaving
this way.

What does 'filefrag' say about the WAL segments? Are they generally a
single extent each? If not, how many extents?
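
E.g. (paths are placeholders):

filefrag -v /db/data/pg_xlog/00000001000000AB000000CD   # -v lists each extent
filefrag /wal/archive/00000001000000AB000000*           # quick per-file extent counts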

It'd be useful to know the kernel version, file system, RAID controller,
whether you use LVM, and other relevant details. What's your RAID
array's stripe size?

> A short test like:
> find . -type f -ctime -1 | tail -n 50 | xargs cat | pipebench > /dev/null
> confirms the backup speed to be roughly the same as seen by the backup
> software.
> Another test from the same volume doing:
> find . -type f -ctime -1 | tail -n 50 | xargs cat  > largefile
> And then wait for the fs to not cache the file any more and
> cat largefile | pipebench > /dev/null
> confirms that the disk subsystem can do way better (150-200MB/s) on larger
> files.

OK, so a larger contiguously allocated file looks like it's probably
read faster. That doesn't mean there's any guarantee that a big WAL
segment would be allocated contiguously if there are lots of other
writes interspersed, but the FS will try.

(What does 'filefrag' say about your 'largefile'?)

I'm wondering if you're having issues related to a RAID stripe size that
is close to, or bigger than, your WAL segment size. So each segment is
only being read from one disk or a couple of disks. If that's the case
you're probably not getting ideal write performance either.

That said, I don't see any particular reason why readahead wouldn't
result in you getting similar results from multiple smaller WAL segments
that are allocated contiguously, and they usually would be if they're
created one after the other. What are your readahead settings? (There
are often several at different levels; what exists depends on how your
storage is configured, use of LVM, use of SW RAID, etc).
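
For reference, the places I'd look first (device names are placeholders;
blockdev reports 512-byte sectors, the sysfs knob is in kB):

blockdev --getra /dev/md0                  # readahead of the array device
blockdev --getra /dev/sda                  # ... and of an underlying member disk
cat /sys/block/md0/queue/read_ahead_kb
# to experiment with a larger value, e.g.:
# blockdev --setra 16384 /dev/md0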

In my opinion RAID 50, or RAID 5, are generally pretty poor options for
a database file system in performance terms anyway. Especially for
transaction logs. RAID 50 is also not wonderfully durable for arrays of
larger numbers of bigger disks given modern disks' sizes, even with the
low block error rates and relatively low disk failure rates. I
personally tend to consider two parity disks the minimum acceptable for
arrays of more than four or five disks. I'd certainly want continuous
archiving or streaming replication in place if I was running RAID 50 on
a big array.

--
Craig Ringer                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: wal-size limited to 16MB - Performance issue for subsequent backup

From
Heikki Linnakangas
Date:
On 10/20/2014 11:02 PM, jesper@krogh.cc wrote:
>>> I do suspect the majority is from 30 concurrent processes updating a
>>> 506GB GIN index, but it would be nice to confirm that. There is also a
>>> message-queue in the DB with a fairly high turnaround.
>>
>> A 506GB GIN index? Uh, interesting :). What's it used for? Trigrams?
>
> It is for full-text search, but it is being updated quite regularly,
> ~100M records. A dump/restore cycle typically reduces the size to 30-40%
> of the current size.

Try 9.4 beta. The on-disk format of GIN indexes was rewritten in 9.4, 
making them a lot smaller. That might help with WAL volume too. Or not, 
but I'd love to hear what the impact is, in a real-life database :-).

- Heikki