Thread: Replaying 48 WAL files takes 80 minutes

Replaying 48 WAL files takes 80 minutes

From: "Albe Laurenz"
I am configuring streaming replication with hot standby
with PostgreSQL 9.1.3 on RHEL 6 (kernel 2.6.32-220.el6.x86_64).
PostgreSQL was compiled from source.

It works fine, except that starting the standby took for ever:
it took the system more than 80 minutes to replay 48 WAL files
and connect to the primary.

Can anybody think of an explanation why it takes that long?

This is decent hardware: 24 cores of AMD Opteron 6174, 128 GB RAM,
NetApp SAN attached with 8 GBit Fibrechannel (ext4 file system).
An identical system performed fine in performance tests.

Here is the log; I have edited it for readability:

2012-10-29 09:22:22.945  database system was interrupted; last known up at 2012-10-26 01:11:59 CEST
2012-10-29 09:22:22.945  creating missing WAL directory "pg_xlog/archive_status"
2012-10-29 09:22:22.947  entering standby mode
2012-10-29 09:22:23.434  restored log file "00000001000001D1000000C4" from archive
2012-10-29 09:22:23.453  redo starts at 1D1/C4000020
2012-10-29 09:22:25.847  restored log file "00000001000001D1000000C5" from archive
2012-10-29 09:22:27.457  restored log file "00000001000001D1000000C6" from archive
2012-10-29 09:22:28.946  restored log file "00000001000001D1000000C7" from archive
2012-10-29 09:22:30.421  restored log file "00000001000001D1000000C8" from archive
2012-10-29 09:22:31.243  restored log file "00000001000001D1000000C9" from archive
2012-10-29 09:22:32.194  restored log file "00000001000001D1000000CA" from archive
2012-10-29 09:22:33.169  restored log file "00000001000001D1000000CB" from archive
2012-10-29 09:22:33.565  restored log file "00000001000001D1000000CC" from archive
2012-10-29 09:23:35.451  restored log file "00000001000001D1000000CD" from archive

Everything looks fine up to here.
Then replaying a single WAL file suddenly takes about 1.5 minutes
instead of the mere seconds it took before.
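To see where the time goes, one can print the gap between successive
"restored log file" lines. A quick sketch, assuming GNU awk, with
"standby.log" standing in for the actual log file:

    grep 'restored log file' standby.log |
    awk '{
        ts = $1 " " $2
        gsub(/[-:]/, " ", ts)              # -> "2012 10 29 09 22 23.434"
        t = mktime(substr(ts, 1, 19))      # whole seconds are enough
        if (prev) printf "%4d s  %s %s\n", t - prev, $1, $2
        prev = t
    }'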

2012-10-29 09:24:54.761  restored log file "00000001000001D1000000CE" from archive
2012-10-29 09:27:23.013  restartpoint starting: time
2012-10-29 09:28:12.200  restartpoint complete: wrote 242 buffers (0.0%);
                         0 transaction log file(s) added, 0 removed, 0 recycled;
                         write=48.987 s, sync=0.185 s, total=49.184 s;
                         sync files=1096, longest=0.016 s, average=0.000 s
2012-10-29 09:28:12.206  recovery restart point at 1D1/CC618278
2012-10-29 09:28:31.226  restored log file "00000001000001D1000000CF" from archive

Again there is a difference of 2.5 minutes
between these WAL files, only 50 seconds of
which were spent in the restartpoint.

From here on it continues in quite the same vein.
Some WAL files are restored in seconds, but some take
more than 4 minutes.

I'll skip to the end of the log:

2012-10-29 10:37:53.809  restored log file "00000001000001D1000000EF" from archive
2012-10-29 10:38:53.194  restartpoint starting: time
2012-10-29 10:39:25.929  restartpoint complete: wrote 161 buffers (0.0%);
                         0 transaction log file(s) added, 0 removed, 0 recycled;
                         write=32.661 s, sync=0.066 s, total=32.734 s;
                         sync files=251, longest=0.003 s, average=0.000 s
2012-10-29 10:39:25.929  recovery restart point at 1D1/ED95C728
2012-10-29 10:42:56.153  restored log file "00000001000001D1000000F0" from archive
2012-10-29 10:43:53.062  restartpoint starting: time
2012-10-29 10:45:36.871  restored log file "00000001000001D1000000F1" from archive
2012-10-29 10:45:39.832  restartpoint complete: wrote 594 buffers (0.0%);
                         0 transaction log file(s) added, 0 removed, 0 recycled;
                         write=106.666 s, sync=0.093 s, total=106.769 s;
                         sync files=729, longest=0.004 s, average=0.000 s
2012-10-29 10:45:39.832  recovery restart point at 1D1/EF5D4340
2012-10-29 10:46:13.602  restored log file "00000001000001D1000000F2" from archive
2012-10-29 10:47:38.396  restored log file "00000001000001D1000000F3" from archive
2012-10-29 10:47:38.962  streaming replication successfully connected to primary

I'd be happy if somebody could shed light on this.

Yours,
Laurenz Albe

PS: Here is the configuration:

             name             | current_setting
------------------------------+---------------------------
 version                      | PostgreSQL 9.1.3 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.6 20110731 (Red Hat 4.4.6-3), 64-bit
 archive_command              | gzip -1 <"%p" | tee /POSTGRES/data/exchange/"%f".gz >/POSTGRES/data/backups/ELAK/"%f".gz
 archive_mode                 | on
 checkpoint_completion_target | 0.9
 checkpoint_segments          | 30
 client_encoding              | UTF8
 constraint_exclusion         | off
 cursor_tuple_fraction        | 1
 custom_variable_classes      | pg_stat_statements
 default_statistics_target    | 1000
 effective_cache_size         | 64GB
 hot_standby                  | on
 lc_collate                   | de_DE.UTF8
 lc_ctype                     | de_DE.UTF8
 listen_addresses             | *
 log_checkpoints              | on
 log_connections              | on
 log_destination              | csvlog
 log_directory                | /POSTGRES/data/logs/ELAK
 log_disconnections           | on
 log_filename                 | ELAK-%Y-%m-%d.log
 log_lock_waits               | on
 log_min_duration_statement   | 3s
 log_min_error_statement      | log
 log_min_messages             | log
 log_rotation_size            | 0
 log_statement                | ddl
 log_temp_files               | 0
 logging_collector            | on
 maintenance_work_mem         | 1GB
 max_connections              | 800
 max_prepared_transactions    | 800
 max_stack_depth              | 9MB
 max_standby_archive_delay    | 0
 max_standby_streaming_delay  | 0
 max_wal_senders              | 2
 pg_stat_statements.max       | 5000
 pg_stat_statements.track     | all
 port                         | 55503
 server_encoding              | UTF8
 shared_buffers               | 16GB
 shared_preload_libraries     | pg_stat_statements,passwordcheck
 ssl                          | on
 tcp_keepalives_count         | 0
 tcp_keepalives_idle          | 0
 TimeZone                     | Europe/Vienna
 wal_buffers                  | 16MB
 wal_level                    | hot_standby
 work_mem                     | 8MB
(49 rows)
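
The standby's recovery.conf is not shown; to match the archive_command
above, its restore_command would presumably look something like this
(a sketch only; the exchange directory and the connection string are
assumptions):

    standby_mode = 'on'
    primary_conninfo = 'host=primary port=55503'    # host is hypothetical
    restore_command = 'gunzip < /POSTGRES/data/exchange/"%f".gz > "%p"'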


Re: Replaying 48 WAL files takes 80 minutes

From: "ktm@rice.edu"
On Mon, Oct 29, 2012 at 02:05:24PM +0100, Albe Laurenz wrote:
> I am configuring streaming replication with hot standby
> with PostgreSQL 9.1.3 on RHEL 6 (kernel 2.6.32-220.el6.x86_64).
> PostgreSQL was compiled from source.
>
> It works fine, except that starting the standby took for ever:
> it took the system more than 80 minutes to replay 48 WAL files
> and connect to the primary.
>
> Can anybody think of an explanation why it takes that long?
> [...]

Hi Albe,

My first guess would be that there was something using I/O resources on your
NetApp. That is the behavior you would expect once the I/O cache on the NetApp
has been filled and you actually have to perform writes to the underlying
disks. Is this a dedicated box? Can you get I/O stats from the box during the
recovery?
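Something like this, left running on the standby during replay, would
show whether the devices are saturated (a sketch; adjust the interval
and output file to taste):

    iostat -x 5 > recovery_iostat.log &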

Regards,
Ken


Re: Replaying 48 WAL files takes 80 minutes

From: Alvaro Herrera
Albe Laurenz wrote:
> I am configuring streaming replication with hot standby
> with PostgreSQL 9.1.3 on RHEL 6 (kernel 2.6.32-220.el6.x86_64).
> PostgreSQL was compiled from source.
>
> It works fine, except that starting the standby took for ever:
> it took the system more than 80 minutes to replay 48 WAL files
> and connect to the primary.
>
> Can anybody think of an explanation why it takes that long?

Can you do a quick xlogdump of those files?  Maybe there is something
unusual (say particular types of GIN/GiST index updates) on the files
that take longer.
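
For example, on one of the slow segments (xlogdump is a separate tool,
not part of the core distribution; -S prints only summary statistics):

    xlogdump -S 00000001000001D1000000EF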

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: Replaying 48 WAL files takes 80 minutes

From: "Albe Laurenz"
Alvaro Herrera wrote:
>> I am configuring streaming replication with hot standby
>> with PostgreSQL 9.1.3 on RHEL 6 (kernel 2.6.32-220.el6.x86_64).
>> PostgreSQL was compiled from source.
>>
>> It works fine, except that starting the standby took for ever:
>> it took the system more than 80 minutes to replay 48 WAL files
>> and connect to the primary.
>>
>> Can anybody think of an explanation why it takes that long?
>
> Can you do a quick xlogdump of those files?  Maybe there is something
> unusual (say particular types of GIN/GiST index updates) on the files
> that take longer.

There are no GIN and GiST indexes in this cluster.

Here's the output of "xlogdump -S" on one of the WAL files
that took over 4 minutes:

00000001000001D1000000EF:

Unable to read continuation page?
 ** maybe continues to next segment **
---------------------------------------------------------------
TimeLineId: 1, LogId: 465, LogSegment: 239

Resource manager stats:
  [0]XLOG      : 2 records, 112 bytes (avg 56.0 bytes)
                 checkpoint: 2, switch: 0, backup end: 0
  [1]Transaction: 427 records, 96512 bytes (avg 226.0 bytes)
                 commit: 427, abort: 0
  [2]Storage   : 0 record, 0 byte (avg 0.0 byte)
  [3]CLOG      : 0 record, 0 byte (avg 0.0 byte)
  [4]Database  : 0 record, 0 byte (avg 0.0 byte)
  [5]Tablespace: 0 record, 0 byte (avg 0.0 byte)
  [6]MultiXact : 0 record, 0 byte (avg 0.0 byte)
  [7]RelMap    : 0 record, 0 byte (avg 0.0 byte)
  [8]Standby   : 84 records, 1352 bytes (avg 16.1 bytes)
  [9]Heap2     : 325 records, 9340 bytes (avg 28.7 bytes)
  [10]Heap      : 7611 records, 4118483 bytes (avg 541.1 bytes)
                 ins: 2498, upd/hot_upd: 409/2178, del: 2494
  [11]Btree     : 3648 records, 120814 bytes (avg 33.1 bytes)
  [12]Hash      : 0 record, 0 byte (avg 0.0 byte)
  [13]Gin       : 0 record, 0 byte (avg 0.0 byte)
  [14]Gist      : 0 record, 0 byte (avg 0.0 byte)
  [15]Sequence  : 0 record, 0 byte (avg 0.0 byte)

Backup block stats: 2600 blocks, 11885880 bytes (avg 4571.5 bytes)

Yours,
Laurenz Albe


Re: Replaying 48 WAL files takes 80 minutes

From: Jeff Janes
On Mon, Oct 29, 2012 at 6:05 AM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:
> I am configuring streaming replication with hot standby
> with PostgreSQL 9.1.3 on RHEL 6 (kernel 2.6.32-220.el6.x86_64).
> PostgreSQL was compiled from source.
>
> It works fine, except that starting the standby took for ever:
> it took the system more than 80 minutes to replay 48 WAL files
> and connect to the primary.
>
> Can anybody think of an explanation why it takes that long?

Could the slow log files be replaying into randomly scattered pages
which are not yet in RAM?

Do you have sar or vmstat reports?

Cheers,

Jeff


Re: Replaying 48 WAL files takes 80 minutes

From: "Albe Laurenz"
>> On Mon, Oct 29, 2012 at 6:05 AM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:
>>> I am configuring streaming replication with hot standby
>>> with PostgreSQL 9.1.3 on RHEL 6 (kernel 2.6.32-220.el6.x86_64).
>>> PostgreSQL was compiled from source.
>>>
>>> It works fine, except that starting the standby took for ever:
>>> it took the system more than 80 minutes to replay 48 WAL files
>>> and connect to the primary.
>>>
>>> Can anybody think of an explanation why it takes that long?

Jeff Janes wrote:
>> Could the slow log files be replaying into randomly scattered pages
>> which are not yet in RAM?
>>
>> Do you have sar or vmstat reports?

The sar reports from the time in question tell me that I read
about 350 MB/s and wrote less than 0.2 MB/s.  The disks were
fairly busy (around 90%).
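
Figures like these can be pulled from the saved sysstat data with
something like the following (the file name depends on the day of
the month):

    sar -d -f /var/log/sa/sa29 -s 09:20:00 -e 10:50:00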

Jeff Trout wrote:
> If you do not have good random I/O performance, log replay is nearly
> unbearable.
>
> Also, what I/O scheduler are you using? If it is cfq, change it to
> deadline or noop; that can make a huge difference.

We use the noop scheduler.
As I said, an identical system performed well in load tests.

The sar reports support Jeff Janes' theory.
Why does WAL replay read much more than it writes?
I thought that pretty much every block read during WAL
replay would also get dirtied and hence written out.

I wonder why the performance is good in the first few seconds.
Why should exactly the pages that I need in the beginning
happen to be in cache?

And finally: are the numbers I observe (replay 48 files in 80
minutes) ok or is this terribly slow as it seems to me?

Yours,
Laurenz Albe


Re: Replaying 48 WAL files takes 80 minutes

From: Heikki Linnakangas
On 30.10.2012 10:50, Albe Laurenz wrote:
> Why does WAL replay read much more than it writes?
> I thought that pretty much every block read during WAL
> replay would also get dirtied and hence written out.

Not necessarily. If a block is modified and written out of the buffer
cache before the next checkpoint, the latest version of the block is
already on disk. On replay, the redo routine reads the block, sees that
the change was applied, and does nothing.

> I wonder why the performance is good in the first few seconds.
> Why should exactly the pages that I need in the beginning
> happen to be in cache?

This is probably because of full_page_writes=on. When replay has a full
page image of a block, it doesn't need to read the old contents from
disk. It can just blindly write the image to disk. Writing a block to
disk also puts that block in the OS cache, so this also efficiently
warms the cache from the WAL. Hence in the beginning of replay, you just
write a lot of full page images to the OS cache, which is fast, and you
only start reading from disk after you've filled up the OS cache. If
this theory is true, you should see a pattern in the I/O stats, where in
the first seconds there is no I/O, but the CPU is 100% busy while it
reads from WAL and writes out the pages to the OS cache. After the OS
cache fills up with the dirty pages (up to dirty_ratio, on Linux), you
will start to see a lot of writes. As the replay progresses, you will
see more and more reads, as you start to get cache misses.
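
On Linux, the writeback thresholds involved can be checked with, for
example:

    sysctl vm.dirty_ratio vm.dirty_background_ratio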

- Heikki


Re: Replaying 48 WAL files takes 80 minutes

From: "ktm@rice.edu"
On Tue, Oct 30, 2012 at 09:50:44AM +0100, Albe Laurenz wrote:
> [...]
> We use the noop scheduler.
> As I said, an identical system performed well in load tests.
> [...]

Hi,

The load tests probably had the "important" data already cached. Processing
a WAL file would involve bringing all the data back into memory using a
random I/O pattern. Perhaps priming the file cache using some sequential
reads would allow the random I/O to hit memory instead of disk. I may be
misremembering, but wasn't there an associated project/program that would
parse the WAL files and generate cache priming reads?
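
A cruder, whole-directory version of that idea, as a sketch (the path
is a placeholder; this just reads the data files sequentially so that
the later random replay I/O finds the pages already in the OS cache):

    find /POSTGRES/data/base -type f |
        xargs -P4 -I{} dd if={} of=/dev/null bs=8M 2>/dev/null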

Regards,
Ken


Re: Replaying 48 WAL files takes 80 minutes

From: "Albe Laurenz"
Heikki Linnakangas wrote:
>> Why does WAL replay read much more than it writes?
>> I thought that pretty much every block read during WAL
>> replay would also get dirtied and hence written out.
>
> Not necessarily. If a block is modified and written out of the buffer
> cache before the next checkpoint, the latest version of the block is
> already on disk. On replay, the redo routine reads the block, sees that
> the change was applied, and does nothing.

True.  Could that account for 1000 times more reads than writes?

>> I wonder why the performance is good in the first few seconds.
>> Why should exactly the pages that I need in the beginning
>> happen to be in cache?
>
> This is probably because of full_page_writes=on. When replay has a full
> page image of a block, it doesn't need to read the old contents from
> disk. [...] As the replay progresses, you will see more and more reads,
> as you start to get cache misses.

That makes sense to me.
Unfortunately I don't have statistics at the required resolution
to verify that.

Thanks for the explanations.

Yours,
Laurenz Albe


Re: Replaying 48 WAL files takes 80 minutes

From: "Albe Laurenz"
ktm@rice.edu wrote:
>>> If you do not have good random I/O performance, log replay is nearly
>>> unbearable.
>>>
>>> Also, what I/O scheduler are you using? If it is cfq, change it to
>>> deadline or noop; that can make a huge difference.
>>
>> We use the noop scheduler.
>> As I said, an identical system performed well in load tests.
>
> The load tests probably had the "important" data already cached.
> Processing a WAL file would involve bringing all the data back into
> memory using a random I/O pattern.

The database is way too big (1 TB) to fit into cache.

What are "all the data" that have to be brought back?
Surely only the database blocks that are modified by the WAL,
right?

Yours,
Laurenz Albe


Re: Replaying 48 WAL files takes 80 minutes

From: "ktm@rice.edu"
On Tue, Oct 30, 2012 at 02:16:57PM +0100, Albe Laurenz wrote:
> [...]
> The database is way too big (1 TB) to fit into cache.
>
> What are "all the data" that have to be brought back?
> Surely only the database blocks that are modified by the WAL,
> right?

Right, it would only read the blocks that are modified.

Regards,
Ken



Re: Replaying 48 WAL files takes 80 minutes

From: Jeff Trout
On Oct 29, 2012, at 12:42 PM, Jeff Janes wrote:

> [...]
> Could the slow log files be replaying into randomly scattered pages
> which are not yet in RAM?
>
> Do you have sar or vmstat reports?


If you do not have good random I/O performance, log replay is nearly
unbearable. (I've run into this many times.)

Also, what I/O scheduler are you using? If it is cfq, change it to
deadline or noop; that can make a huge difference.
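
For example (sdb stands in for whatever device backs the data
directory):

    cat /sys/block/sdb/queue/scheduler    # e.g. "noop deadline [cfq]"
    echo deadline > /sys/block/sdb/queue/scheduler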

--
Jeff Trout <jeff@jefftrout.com>