Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size - Mailing list pgsql-bugs

From Valentine Gogichashvili
Subject Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size
Date
Msg-id CAP93muX=fbhOfjdYDvG8iJ0sPj4ptoHOV_mAqKE+9qvZmbUEiA@mail.gmail.com
Whole thread Raw
In response to Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size  (Valentine Gogichashvili <valgog@gmail.com>)
Responses Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-bugs
Hello again,

now I have restarted the slave with shared_buffers set to 32GB, and now the
recovery process is at 100% CPU:

$ strace -c -f -p 27076
Process 27076 attached - interrupt to quit
 Process 27076 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 67.65    0.002127           0     18127           lseek
 28.98    0.000911           0     16496           getppid
  2.58    0.000081           0       758           read
  0.80    0.000025           4         6         1 open
  0.00    0.000000           0         4           close
  0.00    0.000000           0         1           stat
  0.00    0.000000           0       493       493 unlink
------ ----------- ----------- --------- --------- ----------------
100.00    0.003144                 35885       494 total


$ strace -c -f -p 27076
Process 27076 attached - interrupt to quit
Process 27076 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 44.44    0.000240           0      5976           lseek
 38.15    0.000206           0      5522           getppid
 17.41    0.000094           1       172           read
  0.00    0.000000           0         4         1 open
  0.00    0.000000           0         2           close
  0.00    0.000000           0         1           stat
  0.00    0.000000           0       159       159 unlink
------ ----------- ----------- --------- --------- ----------------
100.00    0.000540                 11836       160 total

Then restarted again with shared_buffers set back to 2GB:

now the replay process is relatively fast again and here is the strace:

$ strace -c -f -p 36450
Process 36450 attached - interrupt to quit
Process 39336 attached
Process 39337 attached
Process 39337 detached
Process 39338 attached
Process 39338 detached
Process 39339 attached
Process 39339 detached
Process 39336 detached
Process 39359 attached
Process 39360 attached
Process 39360 detached
Process 39361 attached
Process 39361 detached
Process 39362 attached
Process 39362 detached
Process 39359 detached
Process 40168 attached
Process 40169 attached
Process 40169 detached
Process 40170 attached
Process 40170 detached
Process 40171 attached
Process 40171 detached
Process 40168 detached
Process 36450 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 88.21    0.668159       31817        21         9 wait4
  4.69    0.035553           6      5970           read
  1.92    0.014572           0    120565           lseek
  1.91    0.014474           5      2957      2954 unlink
  1.55    0.011766           8      1545           write
  1.27    0.009582           0    109274           getppid
  0.31    0.002340           3       762           fcntl
  0.03    0.000245          20        12           clone
  0.02    0.000163           1       219           rt_sigprocmask
  0.02    0.000162           1       154        10 open
  0.01    0.000092           0       216           mmap
  0.01    0.000084           1       133        42 stat
  0.01    0.000066           6        12           getrlimit
  0.01    0.000046           0       178           close
  0.00    0.000037           1        27           munmap
  0.00    0.000035           6         6           pipe
  0.00    0.000035          12         3           set_tid_address
  0.00    0.000030           1        42           getegid
  0.00    0.000021           0       183           rt_sigaction
  0.00    0.000000           0       108           fstat
  0.00    0.000000           0       102           mprotect
  0.00    0.000000           0        36           brk
  0.00    0.000000           0         9           rt_sigreturn
  0.00    0.000000           0         9         9 ioctl
  0.00    0.000000           0        48        12 access
  0.00    0.000000           0         6           dup2
  0.00    0.000000           0         6           getpid
  0.00    0.000000           0        12           execve
  0.00    0.000000           0         9           uname
  0.00    0.000000           0        42           getuid
  0.00    0.000000           0        42           getgid
  0.00    0.000000           0        45           geteuid
  0.00    0.000000           0         6           getpgrp
  0.00    0.000000           0         6           statfs
  0.00    0.000000           0        12           arch_prctl
  0.00    0.000000           0         6         3 futex
  0.00    0.000000           0         3           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00    0.757462                242786      3039 total

With best regards,

-- Valentine Gogichashvili


On Thu, Aug 16, 2012 at 4:53 PM, Valentine Gogichashvili
<valgog@gmail.com>wrote:

> Hello Andreas,
>
> here is the process, that now actually is not using CPU at all and the
> shared_buffers are set to 2GB:
>
> 50978 postgres  20   0 2288m 2.0g 2.0g S  0.0  1.6   4225:34 postgres:
> startup process   recovering 000000050000262E000000FD
>
> It is hanging on that file for several minutes now.
>
> and here is the strace:
>
> $ strace -c -f -p 50978
> Process 50978 attached - interrupt to quit
>  Process 50978 detached
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  94.82    0.007999          37       215           select
>   2.73    0.000230           1       215           getppid
>   2.45    0.000207           1       215       215 stat
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.008436                   645       215 total
>
> What kind of additional profiling information would you like to see?
>
> Regards,
>
> -- Valentin
>
>
>
>

pgsql-bugs by date:

Previous
From: Valentine Gogichashvili
Date:
Subject: Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size
Next
From: Valentine Gogichashvili
Date:
Subject: Re: BUG #7494: WAL replay speed depends heavily on the shared_buffers size