Thread: pg_xlog size growing until it fills the partition

pg_xlog size growing until it fills the partition

From
Michal TOMA
Date:
Hello,

I have a problem on my pg 9.2.4 setup (OpenSuse 12.2, kernel 3.2.13).
My pg_xlog directory is growing uncontrollably until it fills the partition. The database is under heavy write load and
is spread over two tablespaces, one on an SSD software RAID1 partition and a second one on an HDD software RAID1 partition.
I have no WAL archiving enabled nor any replication.

I have tried different checkpoint-related parameters without any noticeable improvement.
Now I have:
    checkpoint_completion_target = 0.9
    wal_buffers = 8MB
    checkpoint_segments = 16
    checkpoint_timeout = 20min
    shared_buffers = 2GB
    log_checkpoints = on

This is what I can see in the log:
2013-10-03 13:58:56 CEST   LOG:  checkpoint starting: xlog
2013-10-03 13:59:56 CEST   LOG:  checkpoint complete: wrote 448 buffers (0.2%); 0 transaction log file(s) added, 9 removed, 18 recycled; write=39.144 s, sync=21.136 s, total=60.286 s; sync files=380, longest=14.517 s, average=0.055 s
2013-10-03 14:04:07 CEST   LOG:  checkpoint starting: xlog
2013-10-03 15:27:01 CEST   LOG:  checkpoint complete: wrote 693 buffers (0.3%); 0 transaction log file(s) added, 0 removed, 16 recycled; write=90.775 s, sync=4883.295 s, total=4974.074 s; sync files=531, longest=152.855 s, average=9.196 s
2013-10-03 15:27:01 CEST   LOG:  checkpoint starting: xlog time
2013-10-03 19:06:30 CEST   LOG:  checkpoint complete: wrote 3467 buffers (1.3%); 0 transaction log file(s) added, 0 removed, 16 recycled; write=122.555 s, sync=13046.077 s, total=13168.637 s; sync files=650, longest=234.697 s, average=20.069 s
2013-10-03 19:06:30 CEST   LOG:  checkpoint starting: xlog time
2013-10-03 22:30:25 CEST   LOG:  checkpoint complete: wrote 10198 buffers (3.9%); 0 transaction log file(s) added, 216 removed, 33 recycled; write=132.229 s, sync=12102.311 s, total=12234.608 s; sync files=667, longest=181.374 s, average=18.144 s
2013-10-03 22:30:25 CEST   LOG:  checkpoint starting: xlog time

I ran pg_test_fsync on the hard disk while PostgreSQL was down, with the following results:
-------------------------------------------------------------------------------------------------------------
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                      59.639 ops/sec
        fdatasync                          43.959 ops/sec
        fsync                              34.497 ops/sec
        fsync_writethrough                            n/a
        open_sync                          35.612 ops/sec

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                      24.199 ops/sec
        fdatasync                          38.680 ops/sec
        fsync                              35.412 ops/sec
        fsync_writethrough                            n/a
        open_sync                          13.337 ops/sec

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
         1 * 16kB open_sync write           5.499 ops/sec
         2 *  8kB open_sync writes         21.412 ops/sec
         4 *  4kB open_sync writes         13.065 ops/sec
         8 *  2kB open_sync writes          6.720 ops/sec
        16 *  1kB open_sync writes          3.320 ops/sec

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
        write, fsync, close                36.353 ops/sec
        write, close, fsync                37.347 ops/sec

Non-Sync'ed 8kB writes:
        write                           500365.249 ops/sec
-------------------------------------------------------------------------------------------------------------

When the server is up and running under the usual load I get the following results:
-------------------------------------------------------------------------------------------------------------
2 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                       0.369 ops/sec
        fdatasync                           0.575 ops/sec
        fsync                               0.125 ops/sec
        fsync_writethrough                            n/a
        open_sync                           0.222 ops/sec

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
        open_datasync                       0.383 ops/sec
        fdatasync                           2.171 ops/sec
        fsync                               1.318 ops/sec
        fsync_writethrough                            n/a
        open_sync                           0.929 ops/sec

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
         1 * 16kB open_sync write           0.079 ops/sec
         2 *  8kB open_sync writes          0.041 ops/sec
         4 *  4kB open_sync writes          0.194 ops/sec
         8 *  2kB open_sync writes          0.013 ops/sec
        16 *  1kB open_sync writes          0.005 ops/sec

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
        write, fsync, close                 0.098 ops/sec
        write, close, fsync                 0.067 ops/sec

Non-Sync'ed 8kB writes:
        write                               0.102 ops/sec
-------------------------------------------------------------------------------------------------------------

I need to tell the server to limit the number of WAL files in pg_xlog somehow, whatever the effect on performance could be.

For now I have to monitor the disk size and manually restart the server. It takes a few seconds to process all the WAL files and resume.
I have to do this a few times a day, as it takes about 10 hours to accumulate 20GB of WAL files in the pg_xlog directory (which fills my partition on the SSD).

Any help with this issue would be greatly appreciated as I don't see what else I can try.

Michal


Re: pg_xlog size growing until it fills the partition

From
Ken Brush
Date:
Try setting the following in your postgresql.conf:

wal_keep_segments = 0



Re: pg_xlog size growing until it fills the partition

From
Michal TOMA
Date:
I have it set to 0 already.

On Friday 04 October 2013 00:16:28 Ken Brush wrote:
> Try setting the following in your postgresql.conf:
>
> wal_keep_segments = 0




Re: pg_xlog size growing until it fills the partition

From
Alban Hertroys
Date:
On Oct 3, 2013, at 23:56, Michal TOMA <mt@sicoop.com> wrote:

> I have a problem on my pg 9.2.4 setup (OpenSuse 12.2, kernel 3.2.13).
> My pg_xlog directory is growing uncontrollably until it fills the partition. The database is under heavy write load
> and is spread over two tablespaces, one on an SSD software RAID1 partition and a second one on an HDD software RAID1 partition.
> I have no wal archiving enabled nor any replication.
>
> I have tried different checkpoint related parameters without any noticable improvement.
> Now I have:
>     checkpoint_completion_target = 0.9
>     wal_buffers = 8MB
>     checkpoint_segments = 16
>     checkpoint_timeout = 20min
>     shared_buffers = 2GB
>     log_checkpoints = on
>
> This is what I can see in the log:
> 2013-10-03 13:58:56 CEST   LOG:  checkpoint starting: xlog
> 2013-10-03 13:59:56 CEST   LOG:  checkpoint complete: wrote 448 buffers (0.2%); 0 transaction log file(s) added, 9 removed, 18 recycled; write=39.144 s, sync=21.136 s, total=60.286 s; sync files=380, longest=14.517 s, average=0.055 s
> 2013-10-03 14:04:07 CEST   LOG:  checkpoint starting: xlog
> 2013-10-03 15:27:01 CEST   LOG:  checkpoint complete: wrote 693 buffers (0.3%); 0 transaction log file(s) added, 0 removed, 16 recycled; write=90.775 s, sync=4883.295 s, total=4974.074 s; sync files=531, longest=152.855 s, average=9.196 s
> 2013-10-03 15:27:01 CEST   LOG:  checkpoint starting: xlog time
> 2013-10-03 19:06:30 CEST   LOG:  checkpoint complete: wrote 3467 buffers (1.3%); 0 transaction log file(s) added, 0 removed, 16 recycled; write=122.555 s, sync=13046.077 s, total=13168.637 s; sync files=650, longest=234.697 s, average=20.069 s
> 2013-10-03 19:06:30 CEST   LOG:  checkpoint starting: xlog time
> 2013-10-03 22:30:25 CEST   LOG:  checkpoint complete: wrote 10198 buffers (3.9%); 0 transaction log file(s) added, 216 removed, 33 recycled; write=132.229 s, sync=12102.311 s, total=12234.608 s; sync files=667, longest=181.374 s, average=18.144 s
> 2013-10-03 22:30:25 CEST   LOG:  checkpoint starting: xlog time

I'm not too familiar with checkpoint logging output, but from the looks of it you're literally spending hours on syncing checkpoints.

Are those disks on a RAID controller with a failed cache battery or something?
You aren't using RAID-5, are you?

> -------------------------------------------------------------------------------------------------------------
>
> When the server is up and running under the usual load I get the following results:
> -------------------------------------------------------------------------------------------------------------
> 2 seconds per test
> O_DIRECT supported on this platform for open_datasync and open_sync.
>
> Compare file sync methods using one 8kB write:
> (in wal_sync_method preference order, except fdatasync
> is Linux's default)
>        open_datasync                       0.369 ops/sec
>        fdatasync                           0.575 ops/sec
>        fsync                               0.125 ops/sec
>        fsync_writethrough                            n/a
>        open_sync                           0.222 ops/sec
> [...]
> Non-Sync'ed 8kB writes:
>        write                               0.102 ops/sec
> -------------------------------------------------------------------------------------------------------------

Those numbers look bad.

Are these the SSD's or the software RAID?

It's almost as if you're saturating your disk I/O bandwidth. What hardware is involved here? Or is it a kernel limitation, perhaps?
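(A quick way to check that, assuming the sysstat package is installed, is to watch the md devices for a while:

        iostat -x 5

and see whether %util sits near 100% and await keeps climbing on the HDD array while the SSD array stays mostly idle.)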

> I need to tell the server to limit the number of WAL files in pg_xlog somehow, whatever the effect on performance could be.


I think more's at play here. Unfortunately, if it's not directly related to the things I mentioned I can't help much.
I'm a bit out of my league here though - I already made lots of assumptions about how to interpret this data.

Cheers,

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.



Re: pg_xlog size growing until it fills the partition

From
Michal TOMA
Date:
Hello Alban,

Yes, I think I'm saturating the HDD IO, but the problem is that the only thing
that runs on that server is Postgres itself.
I have two tablespaces, one on an SSD software RAID 1, another one on an HDD
software RAID 1.
The disk which is saturating is the HDD, and the pg_xlog directory is on the
SSD. So the problem seems to be that the HDD is saturating while the SSD is
not. The server is able to write WAL files to the SSD really fast, while
writing data from the WAL files to the HDD is much slower.

The question is how do I solve this issue. There must be a way to tell Postgres
to regulate the application making the INSERT requests, even if there is no IO
problem on the pg_xlog/SSD partition (but at the same time there is a huge
one on the data/HDD partition).
Something like: "if total pg_xlog size is more than 20 GB", let all INSERT
queries wait till the data is committed on the HDD.
For now the only solution I see is to manually restart the server every time the
size of pg_xlog is close to filling the partition. This is definitely not an
option.

Michal





Re: pg_xlog size growing until it fills the partition

From
Marcin Mańk
Date:



On Thu, Oct 3, 2013 at 11:56 PM, Michal TOMA <mt@sicoop.com> wrote:
> Now I have:
>         checkpoint_completion_target = 0.9
>         wal_buffers = 8MB
>         checkpoint_segments = 16
>         checkpoint_timeout = 20min
>         shared_buffers = 2GB
>         log_checkpoints = on
>
> This is what I can see in the log:
> 2013-10-03 13:58:56 CEST   LOG:  checkpoint starting: xlog
> 2013-10-03 13:59:56 CEST   LOG:  checkpoint complete: wrote 448 buffers (0.2%); 0 transaction log file(s) added, 9 removed, 18 recycled; write=39.144 s, sync=21.136 s, total=60.286 s; sync files=380, longest=14.517 s, average=0.055 s
> [...]
> 2013-10-03 22:30:25 CEST   LOG:  checkpoint complete: wrote 10198 buffers (3.9%); 0 transaction log file(s) added, 216 removed, 33 recycled; write=132.229 s, sync=12102.311 s, total=12234.608 s; sync files=667, longest=181.374 s, average=18.144 s
> 2013-10-03 22:30:25 CEST   LOG:  checkpoint starting: xlog time

From your logs, it seems that the writes are spread all over the (fairly large) database. Is that correct? What is the database size? What is the size of the working data set (i.e. the set of rows that are in use)?

I heard of people having good results with setting a low value for shared_buffers (like 128MB) in high write-activity scenarios. Setting it that low would mean that checkpoints would have 16 times less to do.
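A minimal sketch of that change in postgresql.conf (128MB is just the ballpark figure mentioned above, not a tuned value):

        shared_buffers = 128MB          # smaller pool of dirty buffers for each checkpoint to write and sync
        log_checkpoints = on            # keep this on to see whether the sync times actually improve

Note that shared_buffers only takes effect after a server restart, not a reload.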

Re: pg_xlog size growing until it fills the partition

From
Michal TOMA
Date:
The database is about 1TB large.
There are mainly 4 tables: 2 with row sizes of about 1000 bytes, and 2 with rows
of about 200 bytes that also contain several usually large (the size of
an average web page) text fields.
My initial setting was 16GB of shared_buffers; I tried 2GB and did not
notice any clear difference.
I'll try with 128MB.





Re: pg_xlog size growing until it fills the partition

From
Jeff Janes
Date:
On Mon, Oct 7, 2013 at 6:23 AM, Marcin Mańk <marcin.mank@gmail.com> wrote:
> From your logs, it seems that the writes are spread all over the (fairly large) database. Is that correct? What is the database size? What is the size of the working data set (i.e. the set of rows that are in use)?
>
> I heard of people having good results with setting a low value for shared_buffers (like 128MB) in high write-activity scenarios. Setting it that low would mean that checkpoints would have 16 times less to do.

It looks like most of the actual writing is being done by either the background writer or the backends themselves, not the checkpoint.  And the checkpointer still has to sync all the files, so lowering it further is unlikely to help.
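(One way to confirm that split, assuming the statistics collector is running with its defaults, is to look at pg_stat_bgwriter:

        psql -c "SELECT checkpoints_timed, checkpoints_req,
                        buffers_checkpoint, buffers_clean, buffers_backend
                 FROM pg_stat_bgwriter;"

If buffers_backend and buffers_clean dwarf buffers_checkpoint, the checkpointer isn't doing the bulk of the writing.)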

I don't think he ever gave us the specs of the RAID he is using.   My guess is that it is way too small for the workload.

Cheers,

Jeff

Re: pg_xlog size growing until it fills the partition

From
Michal TOMA
Date:
I gave it in my first post. It is a software RAID 1 of average 7200 rpm disks
(Hitachi HDS723020BLE640) for the main tablespace, and a software RAID 1 of
SSDs for another tablespace and also the partition holding the pg_xlog
directory.
The problem is not the workload, as the application is a web crawler, so the
workload can be infinite. What I would expect Postgres to do is to regulate
the workload somehow instead of just crashing twice a day with a "partition
full" followed by automatic recovery.
If the workload is too high to handle, the query response time should degrade.
This would be perfectly acceptable for my application, and this is in fact the
setting I'm trying to find. What I have now is a very (too) good response
time for 10 hours followed by a crash.





Re: pg_xlog size growing until it fills the partition

From
Jeff Janes
Date:

On Mon, Oct 7, 2013 at 11:44 AM, Michal TOMA <mt@sicoop.com> wrote:
> I gave it in my first post. It is a software RAID 1 of average 7200 rpm disks
> (Hitachi HDS723020BLE640) for the main tablespace, and a software RAID 1 of
> SSDs for another tablespace and also the partition holding the pg_xlog
> directory.

So that is exactly 2 drives on the HDD side?  Yeah, that isn't going to go very far.

 
> The problem is not the workload, as the application is a web crawler, so the
> workload can be infinite. What I would expect Postgres to do is to regulate
> the workload somehow instead of just crashing twice a day with a "partition
> full" followed by automatic recovery.

There has been some discussion about mechanisms to throttle throughput based on the log file partition filling up, but it was more in the context of archiving going down rather than checkpointing being way too slow.  No real conclusion was reached though.  

And I'm not very hopeful about it, especially not as something that would be on by default.  I'd be pretty ticked if the system started automatically throttling a bulk load because it extrapolated and decided that some problem might occur at some point in the future--even though I know that the bulk load will be finished before that point is reached.

It seems like the best place to implement the throttling would be in your application, as that is where the sleeping can be done with the least amount of locks/resources being held.  Maybe you could check `fgrep Dirty /proc/meminfo` and throttle based on that value.
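A minimal sketch of that idea, as a shell loop the crawler could run before each batch of INSERTs (the 2GB threshold is only an illustrative value, not a recommendation):

        # Back off while the kernel is sitting on a large pile of dirty pages.
        # /proc/meminfo reports Dirty in kB; 2097152 kB = 2GB, an arbitrary example threshold.
        while [ "$(awk '/^Dirty:/ {print $2}' /proc/meminfo)" -gt 2097152 ]; do
            sleep 10
        done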

Also, the nasty slug of dirty pages is accumulating in the OS, not in PostgreSQL itself, so you could turn down dirty_ratio and friends in the kernel to limit the problem.
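For example (illustrative values only; how low you can go depends on RAM size and how bursty the load is):

        sysctl -w vm.dirty_background_ratio=1
        sysctl -w vm.dirty_ratio=5

That makes the kernel start background writeback much earlier, so the checkpointer's fsync calls have less accumulated dirty data to force out in one go.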

Cheers,

Jeff

Re: pg_xlog size growing until it fills the partition

From
Michal TOMA
Date:
On Monday 07 October 2013 21:33:14 Jeff Janes wrote:
> On Mon, Oct 7, 2013 at 11:44 AM, Michal TOMA <mt@sicoop.com> wrote:
> > I gave it in my first post. It is a software raid 1 of average 7200 rpm
> > disks
> > (Hitachi HDS723020BLE640) for the main tablespace and a software raid 1
> > of SSDs for another tablespace and also the partition holding the pg_xlog
> > directory.
>
> So that is exactly 2 drives on the HDD side?  Yeah, that isn't going to go
> very far.
>
> > The problem is not the workload, as the application is a web crawler, so
> > the workload can be infinite. What I would expect Postgres to do is to
> > regulate the workload somehow instead of just crashing twice a day with a
> > "partition full" followed by automatic recovery.
>
> There has been some discussion about mechanisms to throttle throughput
> based on the log file partition filling up, but it was more in the context
> of archiving going down rather than checkpointing being way too slow.  No
> real conclusion was reached though.
>
> And I'm not very hopeful about it, especially not as something that would
> be on by default.  I'd be pretty ticked if the system started automatically
> throttling a bulk load because it extrapolated and decided that some
> problem might occur at some point in the future--even though I know that
> the bulk load will be finished before that point is reached.
I don't really see it as an extrapolation about some problem that may occur in
the future. My problem is admittedly very specific, but it is not a decision about a
problem that may occur sometime in the future; in my case it is a decision to
throttle the workload now or crash with a partition full immediately.
I would really appreciate some kind of mechanism that would enable me to tell
Postgres: you have 20GB of xlog max, otherwise you'll fill the partition
and crash.

>
> It seems like the best place to implement the throttling would be in your
> application, as that is where the sleeping can be done with the least
> amount of locks/resources being held.  Maybe you could check `fgrep Dirty
> /proc/meminfo` and throttle based on that value.
>
> Also, the nasty slug of dirty pages is accumulating in the OS, not in
> PostgreSQL itself, so you could turn down dirty_ratio and friends in the
> kernel to limit the problem.

I can of course implement this on the application side quite easily in
different ways, but the problem to me is how to guess what the real
database limits are.
In my case the server was running fine for a year with my initial settings.
Now it is crashing twice a day, very likely because one particular table or
index has become too big to fit in memory and suddenly requires much more
disk activity.
This seems to me like a pretty hard thing to guess on the application side.

Quite surprisingly, as advised in one of the replies, I have set
shared_buffers to a very low value (128MB instead of my initial 16GB) and
this seems to allow me to control the growth of the xlog.

I don't really understand why. Any explanation on this phenomenon would be
greatly appreciated.

Michal

>
> Cheers,
>
> Jeff




Re: pg_xlog size growing until it fills the partition

From
Michal TOMA
Date:
On Monday 07 October 2013 15:23:33 Marcin Mańk wrote:
> From your logs, it seems that the writes are spread all over the (fairly
> large) database. Is that correct? What is the database size? What is the
> size of the working data set (i.e. the set of rows that are in use)?
>
> I heard of people having good results with setting a low value for
> shared_buffers (like 128MB) in high write-activity scenarios. Setting it
> that low would mean that checkpoints would have 16 times less to do.

Quite surprisingly, this seems to work in my case too. Thanks for the advice.

Michal