Thread: wal-size limited to 16MB - Performance issue for subsequent backup
Hi.

One of our "production issues" is that the system generates lots of WAL
files — lots being 151952 files over the last 24h, which is about 2.4TB
worth of WAL. I wouldn't say that isn't an issue in itself, but the system
does indeed work fine. We subsequently gzip the files to limit actual disk
usage; this shrinks them to roughly 30-50% of their original size.

That being said, along comes the backup, scheduled once a day, and tries to
read off these WAL files, which to the backup looks like "an awful lot of
small files". Our backup utilizes a single thread to read those files and
levels off at reading through 30-40MB/s from a 21-drive RAID50 of rotating
drives, which is quite bad. That causes a daily incremental run to take on
the order of 24h. Differentials picking up larger deltas, and fulls, are
even worse.

One could of course question a lot of the other things here, but currently
the only limiting factor is the backup being able to keep up transporting
the WAL away from the system, due to the small WAL segment size.

A short test like:

  find . -type f -ctime -1 | tail -n 50 | xargs cat | pipebench > /dev/null

confirms the backup speed to be roughly the same as seen by the backup
software. Another test from the same volume doing:

  find . -type f -ctime -1 | tail -n 50 | xargs cat > largefile

and then, after waiting for the fs to no longer cache the file,

  cat largefile | pipebench > /dev/null

confirms that the disk subsystem can do way better (150-200MB/s) on larger
files.

Thread here around the same topic:
http://postgresql.1045698.n5.nabble.com/How-do-you-change-the-size-of-the-WAL-files-td3425516.html
but without a warmly welcomed workaround.

Suggestions are welcome. An archive_command/restore_command that could
combine/split WAL segments might be the easiest workaround, but how about
crash-safety?
jesper@krogh.cc writes:
> Thread here around the same topic:
> http://postgresql.1045698.n5.nabble.com/How-do-you-change-the-size-of-the-WAL-files-td3425516.html
> But not a warm welcoming workaround.

configure --with-wal-segsize=something ?

			regards, tom lane
> jesper@krogh.cc writes:
>> Thread here around the same topic:
>> http://postgresql.1045698.n5.nabble.com/How-do-you-change-the-size-of-the-WAL-files-td3425516.html
>> But not a warm welcoming workaround.
>
> configure --with-wal-segsize=something ?

Yes, but there are good reasons not to go down that route:

1) It looks like I'm going to be the only one; beware of the dragons.
2) It apparently requires a re-initdb, thus a dump/restore of 4.5TB of data.

The latter can absolutely be done by scheduling downtime, but the former
would compromise any level of good practice around our DB operations.

Jesper
On Mon, Oct 20, 2014 at 09:03:59PM +0200, jesper@krogh.cc wrote:
> Hi.
>
> One of our "production issues" is that the system generates lots of
> wal-files, lots is like 151952 files over the last 24h, which is about
> 2.4TB worth of WAL files. I wouldn't say that isn't an issue by itself,
> but the system does indeed work fine. We do subsequently gzip the files to
> limit actual disk-usage, this makes the files roughly 30-50% in size.
>
> ...
>
> Suggestions are welcome. An archive-command/restore command that could
> combine/split wal-segments might be the easiest workaround, but how about
> crash-safeness?

Hi,

Have you considered using something like tar/star in the archive command
to pack them into much larger tar archives?

Regards,
Ken
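That idea could look something like this minimal sketch (the batch size,
function name, and directory layout are my assumptions, not anything from
the thread): once a full batch of completed segments has accumulated, pack
them into a single large tar — which the backup then reads as one big,
contiguously written file — and delete the originals.

```shell
# Hypothetical sketch: combine archived WAL segments into larger tar files.
# pack_wal_batch <archive_dir> <segments_per_batch>
pack_wal_batch() (
    dir=$1; batch=$2
    cd "$dir" || exit 1
    # WAL segment names are 24 hex digits and sort lexically in WAL order,
    # so plain ls | sort picks the oldest segments first.
    set -- $(ls | grep -E '^[0-9A-F]{24}$' | sort | head -n "$batch")
    [ "$#" -eq "$batch" ] || exit 0   # not a full batch yet; try again later
    # name the archive after its first segment, then drop the originals
    tar -cf "wal-batch-$1.tar" "$@" && rm -f -- "$@"
)
```

A cron job (or the tail end of archive_command) could call this
periodically, and gzip could run on the finished tars instead of on
individual segments; restore is the reverse, extracting the right batch
before restore_command hands segments back.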
Hi,

On 2014-10-20 21:03:59 +0200, jesper@krogh.cc wrote:
> One of our "production issues" is that the system generates lots of
> wal-files, lots is like 151952 files over the last 24h, which is about
> 2.4TB worth of WAL files. I wouldn't say that isn't an issue by itself,
> but the system does indeed work fine. We do subsequently gzip the files to
> limit actual disk-usage, this makes the files roughly 30-50% in size.

Have you analyzed what the source of that volume is? Which version of
postgres are you using? What are your checkpoint_timeout/segments
settings?

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
jesper@krogh.cc writes:
>> configure --with-wal-segsize=something ?

> Yes, but there are good reasons not to go down that route. One is that:
> 1) It looks like I'm going to be the only one, beware of the dragons.
> 2) It apparently requires a re-initdb, thus dump/restore of 4.5TB of data

I think a clean shutdown and pg_resetxlog would be sufficient. I agree
you'd need to do some testing though ...

			regards, tom lane
> On 2014-10-20 21:03:59 +0200, jesper@krogh.cc wrote:
>> One of our "production issues" is that the system generates lots of
>> wal-files, lots is like 151952 files over the last 24h, which is about
>> 2.4TB worth of WAL files. I wouldn't say that isn't an issue by itself,
>> but the system does indeed work fine. We do subsequently gzip the files
>> to limit actual disk-usage, this makes the files roughly 30-50% in size.
>
> Have you analyzed what the source of that volume is? Which version of
> postgres are you using? What are your checkpoint_timeout/segments
> settings?

Suggestions are surely welcome. I do suspect the majority is from 30
concurrent processes updating a 506GB GIN index, but it would be nice to
confirm that. There is also a message-queue in the DB with a fairly high
turnaround.

Currently PG 9.2, moving to 9.3 hopefully before end-of-year;
checkpoint_timeout = 30min, checkpoint_segments = 4096. According to
logs, checkpoints are roughly 15 minutes apart.

Jesper
On 2014-10-20 21:41:26 +0200, jesper@krogh.cc wrote:
>> On 2014-10-20 21:03:59 +0200, jesper@krogh.cc wrote:
>>> One of our "production issues" is that the system generates lots of
>>> wal-files, lots is like 151952 files over the last 24h, which is about
>>> 2.4TB worth of WAL files. I wouldn't say that isn't an issue by itself,
>>> but the system does indeed work fine. We do subsequently gzip the files
>>> to limit actual disk-usage, this makes the files roughly 30-50% in size.

I'm a bit doubtful that 16MB vs., say, 64MB files really changes
anything substantial for you. If it indeed is a problem, it's simple
enough to join the files temporarily?

>> Have you analyzed what the source of that volume is? Which version of
>> postgres are you using? What are your checkpoint_timeout/segments
>> settings?
>
> Suggestions are surely welcome.

Once you're on 9.3 I'd suggest using pg_xlogdump --stats on it. There's
a backport of the facility for 9.3 (looking somewhat different than what
is now in 9.5) at
http://archives.postgresql.org/message-id/CABRT9RAzGowqLFcEE8aF6VdPoFEy%2BP9gmu7ktGRzw0dgRwVr9Q%40mail.gmail.com

That'd tell you a fair bit more. It's noticeably harder to backport to
< 9.3.

> I do suspect the majority is from 30 concurrent processes updating a
> 506GB GIN index, but it would be nice to confirm that. There is also a
> message-queue in the DB with a fairly high turnaround.

A 506GB GIN index? Uh, interesting :). What's it used for? Trigrams?

I'd suspect that the message queue isn't the primary culprit, but it's
hard to say for sure.

> Currently PG 9.2 moving to 9.3 hopefully before end-of-year,
> checkpoint_timeout = 30min, checkpoint_segments = 4096.

Generally a high checkpoint_timeout can significantly reduce the WAL
volume because of fewer full page writes. I've seen cases where spacing
checkpoints further apart by a factor of two reduced the overall WAL
volume by more than two.

> According to logs checkpoints are roughly 15 minutes apart.

Can you show log_checkpoints output?

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
> On 2014-10-20 21:41:26 +0200, jesper@krogh.cc wrote:
>
> I'm a bit doubtful that 16MB vs., say, 64MB files really changes
> anything substantial for you. If it indeed is a problem, it's simple
> enough to join the files temporarily?

I am trying to get my head around a good way to do that. 64MB probably
isn't a silver bullet, but it would definitely benefit the backup in
terms of single-thread access to data on rotating drives.

> Once you're on 9.3 I'd suggest using pg_xlogdump --stats on it. There's
> a backport of the facility for 9.3 (looking somewhat different than what
> is now in 9.5) at
> http://archives.postgresql.org/message-id/CABRT9RAzGowqLFcEE8aF6VdPoFEy%2BP9gmu7ktGRzw0dgRwVr9Q%40mail.gmail.com
>
> That'd tell you a fair bit more. It's noticeably harder to backport to
> < 9.3.

I'll bookmark that one.

>> I do suspect the majority is from 30 concurrent processes updating a
>> 506GB GIN index, but it would be nice to confirm that. There is also a
>> message-queue in the DB with a fairly high turnaround.
>
> A 506GB GIN index? Uh, interesting :). What's it used for? Trigrams?

It is for full-text search, but it is updated quite regularly, ~100M
records. A dump/restore cycle typically reduces the size to 30-40% of
current size.
>> Currently PG 9.2 moving to 9.3 hopefully before end-of-year,
>> checkpoint_timeout = 30min, checkpoint_segments = 4096.
>
> Generally a high checkpoint_timeout can significantly reduce the WAL
> volume because of fewer full page writes. I've seen cases where spacing
> checkpoints further apart by a factor of two reduced the overall WAL
> volume by more than two.

I'll work with that. I was just uncomfortable bumping
checkpoint_segments up much higher; any field experience in that corner?

>> According to logs checkpoints are roughly 15 minutes apart.
>
> Can you show log_checkpoints output?

2014-10-20 18:10:22 CEST LOG: checkpoint starting: time
2014-10-20 18:15:44 CEST LOG: checkpoint complete: wrote 76851 buffers (7.3%); 0 transaction log file(s) added, 0 removed, 3238 recycled; write=295.834 s, sync=23.903 s, total=322.011 s; sync files=2115, longest=0.278 s, average=0.011 s
2014-10-20 18:40:22 CEST LOG: checkpoint starting: time
2014-10-20 18:44:30 CEST LOG: checkpoint complete: wrote 60550 buffers (5.8%); 0 transaction log file(s) added, 0 removed, 3460 recycled; write=224.678 s, sync=21.795 s, total=248.340 s; sync files=2090, longest=0.963 s, average=0.010 s
2014-10-20 19:10:22 CEST LOG: checkpoint starting: time
2014-10-20 19:14:11 CEST LOG: checkpoint complete: wrote 42720 buffers (4.1%); 0 transaction log file(s) added, 0 removed, 3598 recycled; write=206.259 s, sync=21.185 s, total=229.254 s; sync files=2065, longest=0.945 s, average=0.010 s
2014-10-20 19:40:22 CEST LOG: checkpoint starting: time
2014-10-20 19:43:31 CEST LOG: checkpoint complete: wrote 32897 buffers (3.1%); 0 transaction log file(s) added, 0 removed, 3626 recycled; write=161.801 s, sync=26.936 s, total=189.635 s; sync files=2115, longest=0.458 s, average=0.012 s
2014-10-20 20:10:22 CEST LOG: checkpoint starting: time
2014-10-20 20:14:04 CEST LOG: checkpoint complete: wrote 37557 buffers (3.6%); 0 transaction log file(s) added, 0 removed, 3285 recycled; write=205.011 s, sync=16.550 s, total=222.442 s; sync files=2113, longest=0.935 s, average=0.007 s
2014-10-20 20:40:22 CEST LOG: checkpoint starting: time
2014-10-20 20:45:18 CEST LOG: checkpoint complete: wrote 58012 buffers (5.5%); 0 transaction log file(s) added, 0 removed, 3678 recycled; write=252.750 s, sync=39.178 s, total=295.107 s; sync files=2075, longest=0.990 s, average=0.018 s
2014-10-20 21:10:22 CEST LOG: checkpoint starting: time
2014-10-20 21:13:31 CEST LOG: checkpoint complete: wrote 40530 buffers (3.9%); 0 transaction log file(s) added, 0 removed, 3652 recycled; write=167.925 s, sync=19.719 s, total=189.057 s; sync files=2077, longest=0.470 s, average=0.009 s
2014-10-20 21:40:22 CEST LOG: checkpoint starting: time
2014-10-20 21:44:20 CEST LOG: checkpoint complete: wrote 45158 buffers (4.3%); 0 transaction log file(s) added, 0 removed, 3449 recycled; write=202.986 s, sync=32.564 s, total=237.441 s; sync files=2100, longest=0.445 s, average=0.015 s
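As a sanity check on these numbers (my own illustration, not from the
thread): every added or recycled segment is 16MB, so each "checkpoint
complete" line translates directly into WAL volume per checkpoint
interval. A rough parser:

```shell
# Estimate WAL written per checkpoint interval, in MB, from a
# log_checkpoints 'checkpoint complete' line: (added + recycled) * 16MB.
wal_mb_per_checkpoint() {
    line=$1
    added=$(printf '%s\n' "$line" \
        | sed -n 's/.*[^0-9]\([0-9]\{1,\}\) transaction log file(s) added.*/\1/p')
    recycled=$(printf '%s\n' "$line" \
        | sed -n 's/.*[^0-9]\([0-9]\{1,\}\) recycled.*/\1/p')
    echo $(( (added + recycled) * 16 ))
}
```

For the first entry above, 3238 recycled segments give 51808MB, i.e.
roughly 50GB per 30-minute interval — which at 48 intervals per day is
consistent with the ~2.4TB/day reported at the top of the thread.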
On Mon, Oct 20, 2014 at 12:03 PM, <jesper@krogh.cc> wrote:
> Hi.
>
> One of our "production issues" is that the system generates lots of
> wal-files, lots is like 151952 files over the last 24h, which is about
> 2.4TB worth of WAL files. I wouldn't say that isn't an issue by itself,
> but the system does indeed work fine. We do subsequently gzip the files to
> limit actual disk-usage, this makes the files roughly 30-50% in size.
>
> That being said, along comes the backup, scheduled once a day, and tries to
> read off these wal-files, which to the backup looks like "an awful lot of
> small files". Our backup utilizes a single thread to read those files
> and levels off at reading through 30-40MB/s from a 21-drive RAID50 of
> rotating drives, which is quite bad. That causes a daily incremental run
> to take on the order of 24h. Differentials picking up larger deltas and
> fulls are even worse.
Why not have archive_command (which gets the files while they are still cached) put the files directly into their final destination on the backup server?
> Suggestions are welcome. An archive-command/restore command that could
> combine/split wal-segments might be the easiest workaround, but how about
> crash-safeness?
I think you would just have to combine them by looking at the file name and seeking to a specific spot in the large file (rather than just appending to it), so that if the archive_command fails and gets rerun, it will still end up in the correct place. I don't see what other crash-safety issues you would have beyond the ones you already have. You would want to do the compression after combining, not before, so that all segments are of predictable size.
It should be pretty easy as long as you want your combined files to consist of either 16 or 256 (or 255 in older versions) WAL files.
You would have to pass through directly any files not matching the filename pattern of ordinary WAL files.
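Jeff's seek-to-a-fixed-offset idea could be sketched like this (a
hypothetical helper, not anything PostgreSQL ships: the function name and
bundle naming are mine; it assumes default 16MB segments and 9.3+
numbering with 256 segments per 4GB logical xlog file — 255 on older
versions, as Jeff notes):

```shell
# Map a 24-hex-digit WAL segment name to "<bundle_number> <byte_offset>".
# Writing each segment at its fixed offset in its bundle file makes a
# rerun of archive_command idempotent: it just overwrites the same region.
# bundle_slot <wal_segment_name> [segments_per_bundle]
bundle_slot() {
    name=$1; per_bundle=${2:-16}            # 16 x 16MB = 256MB bundles
    log=$((0x$(printf '%s' "$name" | cut -c9-16)))   # logical xlog file
    seg=$((0x$(printf '%s' "$name" | cut -c17-24)))  # segment within it
    abs=$((log * 256 + seg))                # absolute segment number
    echo "$((abs / per_bundle)) $(( (abs % per_bundle) * 16 * 1024 * 1024 ))"
}
```

An archive_command wrapper could then do something along the lines of
`dd if=$segment of=bundle.$bundle bs=16M seek=$((abs % per_bundle))
conv=notrunc` to land each segment at its slot, and — per Jeff's point —
compress a bundle only once it is complete, so offsets stay predictable.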
Cheers,
Jeff
On 10/21/2014 03:03 AM, jesper@krogh.cc wrote:
> That being said, along comes the backup, scheduled once a day, and tries to
> read off these wal-files, which to the backup looks like "an awful lot of
> small files". Our backup utilizes a single thread to read those files
> and levels off at reading through 30-40MB/s from a 21-drive RAID50 of
> rotating drives, which is quite bad. That causes a daily incremental run
> to take on the order of 24h. Differentials picking up larger deltas and
> fulls are even worse.

What's the backup system? 151952 files should be a trivial matter for
any backup system. I'm very surprised you're seeing those kinds of run
times for 2TB of WAL, and think it's worth investigating just why the
backup system is behaving this way.

What does 'filefrag' say about the WAL segments? Are they generally a
single extent each? If not, how many extents?

It'd be useful to know the kernel version, file system, RAID controller,
whether you use LVM, and other relevant details. What's your RAID
array's stripe size?

> A short test like:
> find . -type f -ctime -1 | tail -n 50 | xargs cat | pipebench > /dev/null
> confirms the backup speed to be roughly the same as seen by the backup
> software.
> Another test from the same volume doing:
> find . -type f -ctime -1 | tail -n 50 | xargs cat > largefile
> And then wait for the fs to not cache the file any more and
> cat largefile | pipebench > /dev/null
> confirms that the disk-subsystem can do way (150-200MB/s) better on larger
> files.

OK, so a larger contiguously allocated file looks like it's probably
read faster. That doesn't mean there's any guarantee that a big WAL
segment would be allocated contiguously if there are lots of other
writes interspersed, but the FS will try. (What does 'filefrag' say
about your 'largefile'?)

I'm wondering if you're having issues related to a RAID stripe size that
is close to, or bigger than, your WAL segment size.
So each segment is only being read from one disk or a couple of disks.
If that's the case you're probably not getting ideal write performance
either.

That said, I don't see any particular reason why readahead wouldn't give
you similar results from multiple smaller WAL segments that are
allocated contiguously, and they usually would be if they're created one
after the other. What are your readahead settings? (There are often
several at different levels; what exists depends on how your storage is
configured, use of LVM, use of SW RAID, etc.)

In my opinion RAID 50, or RAID 5, is generally a pretty poor option for
a database file system in performance terms anyway, especially for
transaction logs. RAID 50 is also not wonderfully durable for arrays of
many large disks at modern disk sizes, even with low block error rates
and relatively low disk failure rates. I personally consider two parity
disks the minimum acceptable for arrays of more than four or five disks.
I'd certainly want continuous archiving or streaming replication in
place if I was running RAID 50 on a big array.

--
 Craig Ringer                      http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From: Heikki Linnakangas
On 10/20/2014 11:02 PM, jesper@krogh.cc wrote:
>>> I do suspect the majority is from 30 concurrent processes updating a
>>> 506GB GIN index, but it would be nice to confirm that. There is also a
>>> message-queue in the DB with a fairly high turnaround.
>>
>> A 506GB GIN index? Uh, interesting :). What's it used for? Trigrams?
>
> It is for full-text search, but it is updated quite regularly, ~100M
> records. A dump/restore cycle typically reduces the size to 30-40% of
> current size.

Try 9.4 beta. The on-disk format of GIN indexes was rewritten in 9.4,
making them a lot smaller. That might help with WAL volume too. Or not,
but I'd love to hear what the impact is, in a real-life database :-).

- Heikki