Re: Postgres 10, slave not catching up with master - Mailing list pgsql-general

From Andy Colson
Subject Re: Postgres 10, slave not catching up with master
Date
Msg-id fdf46989-bb76-b84f-606e-0884eb345da1@squeakycode.net
Whole thread Raw
In response to Postgres 10, slave not catching up with master  (Boris Sagadin <boris@infosplet.com>)
List pgsql-general
On 10/21/18 2:06 AM, Boris Sagadin wrote:
> Hello,
> 
> I have a database running on an i3.8xlarge (256GB RAM, 32 CPU cores, 4x 1.9TB NVMe drives) AWS instance with about 5TB of disk space occupied, ext4, Ubuntu 16.04.
> 
> Multi-tenant DB with about 40000 tables, insert heavy.
> 
> I started a new slave with identical HW specs, SR. DB started syncing from master, which took about 4 hours, then it started applying the WALs. However, it seems it can't catch up. Delay is still around 3 hours (measured with now() - pg_last_xact_replay_timestamp()), even a day later. It goes a few 100s up and down, but it seems to float around the 3h mark.
> 
> Disk IO is low at about 10%, measured with iostat, no connected clients, recovery process is at around 90% CPU single core usage.
> 
> Tried tuning various parameters, but to no avail. The only thing I found suspicious is that stracing the recovery process constantly produces many errors such as:
> 
> lseek(428, 0, SEEK_END)                 = 780124160
> lseek(30, 0, SEEK_END)                  = 212992
> read(9, 0x7ffe4001f557, 1)              = -1 EAGAIN (Resource temporarily unavailable)
> lseek(680, 0, SEEK_END)                 = 493117440
> read(9, 0x7ffe4001f557, 1)              = -1 EAGAIN (Resource temporarily unavailable)
> lseek(774, 0, SEEK_END)                 = 583368704
> 
> ...[snip]...
> 
> read(9, 0x7ffe4001f557, 1)              = -1 EAGAIN (Resource temporarily unavailable)
> lseek(774, 0, SEEK_END)                 = 583368704
> read(9, 0x7ffe4001f557, 1)              = -1 EAGAIN (Resource temporarily unavailable)
> lseek(277, 0, SEEK_END)                 = 502882304
> lseek(6, 516096, SEEK_SET)              = 516096
> read(6, "\227\320\5\0\1\0\0\0\0\340\7\246\26\274\0\0\315\0\0\0\0\0\0\0}\0178\5&/\260\r"..., 8192) = 8192
> read(9, 0x7ffe4001f557, 1)              = -1 EAGAIN (Resource temporarily unavailable)
> lseek(735, 0, SEEK_END)                 = 272809984
> read(9, 0x7ffe4001f557, 1)              = -1 EAGAIN (Resource temporarily unavailable)
> lseek(277, 0, SEEK_END)                 = 502882304
> 
> ls -l fd/9
> lr-x------ 1 postgres postgres 64 Oct 21 06:21 fd/9 -> pipe:[46358]
> 
> 
> Perf top on recovery produces:
> 
>   27.76%  postgres            [.] pglz_decompress
>    9.90%  [kernel]            [k] entry_SYSCALL_64_after_swapgs
>    7.09%  postgres            [.] hash_search_with_hash_value
>    4.26%  libpthread-2.23.so  [.] llseek
>    3.64%  libpthread-2.23.so  [.] __read_nocancel
>    2.80%  [kernel]            [k] __fget_light
>    2.67%  postgres            [.] 0x000000000034d3ba
>    1.85%  [kernel]            [k] ext4_llseek
>    1.84%  postgres            [.] pg_comp_crc32c_sse42
>    1.44%  postgres            [.] hash_any
>    1.35%  postgres            [.] 0x000000000036afad
>    1.29%  postgres            [.] MarkBufferDirty
>    1.21%  postgres            [.] XLogReadRecord
> [...]
> 
> Tried changing the process limits with prlimit to unlimited, but no change.
> 
> I can turn off the WAL compression but I doubt this is the main culprit. Any ideas appreciated.
> 
> Regards,
> Boris
> 

I don't have an answer, but at a guess, decompress is taking most of the time, and the process spewing "Resource temporarily unavailable" might just be sitting around waiting for the decompress to finish.

I'd say yea, definitely try disabling WAL compression.  I'm really curious what that would do.
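If you want to try it: wal_compression is set on the master (WAL records are compressed as they are generated there), and as far as I know it only needs a config reload in Postgres 10, not a restart. A sketch:

```sql
-- On the master. wal_compression is reload-time changeable in PG 10.
ALTER SYSTEM SET wal_compression = off;
SELECT pg_reload_conf();

-- Confirm the new setting took effect:
SHOW wal_compression;
```

Note that WAL already written with compression still has to be decompressed on replay, so any effect on the standby would only show up gradually as it works through the backlog.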

I assume you are using streaming replication?
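One quick way to check, assuming Postgres 10's standard statistics views:

```sql
-- On the standby: a row here means a WAL receiver is streaming from the master.
SELECT status, received_lsn, last_msg_receipt_time
FROM pg_stat_wal_receiver;

-- On the master: one row per connected standby, with how far it has replayed.
SELECT client_addr, state, sent_lsn, replay_lsn
FROM pg_stat_replication;
```

If pg_stat_wal_receiver is empty, the standby is replaying from archived/restored WAL rather than streaming, which would change where to look for the bottleneck.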

-Andy

