Home > mailing lists

Re: Postgres 10, slave not catching up with master - Mailing list pgsql-general

From	Hellmuth Vargas
Subject	Re: Postgres 10, slave not catching up with master
Date	October 23, 2018 13:10:10
Msg-id	CAN3Qy4oWBijSyKyfaOpju=prGnCnsHNr5ddMCnMX4Kh4CHU5nA@mail.gmail.com Whole thread
In response to	Re: Postgres 10, slave not catching up with master (Boris Sagadin <boris@infosplet.com>)
Responses	Re: Postgres 10, slave not catching up with master
List	pgsql-general

Tree view

can share recovery.conf file settings??

El mar., 23 de oct. de 2018 a la(s) 00:28, Boris Sagadin (boris@infosplet.com) escribió:

Yes, turning wal_compression off improves things. Slave that was mentioned unfortunately lagged too much before this setting was applied and was turned off. However the remaining slave lags less now, although still occasionally up to a few minutes. I think single threadedness of recovery is a big slowdown for write heavy databases. Maybe an option to increase wal_size beyond 16MB in v11 will help.

In the meantime we'll solve this by splitting the DB to 2 or 3 clusters or maybe trying out some sharding solution like Citus.

Boris

On Sun, Oct 21, 2018 at 9:06 AM, Boris Sagadin <boris@infosplet.com> wrote:
Hello,

I have a database running on i3.8xlarge (256GB RAM, 32 CPU cores, 4x 1.9TB NVMe drive) AWS instance with about 5TB of disk space occupied, ext4, Ubuntu 16.04.

Multi-tenant DB with about 40000 tables, insert heavy.

I started a new slave with identical HW specs, SR. DB started syncing from master, which took about 4 hours, then it started applying the WALs. However, it seems it can't catch up. Delay is still around 3 hours (measured with now() - pg_last_xact_replay_timestamp()), even a day later. It goes a few 100s up and down, but it seems to float around 3h mark.

Disk IO is low at about 10%, measured with iostat, no connected clients, recovery process is at around 90% CPU single core usage.

Tried tuning the various parameters, but with no avail. Only thing I found suspicious is stracing the recovery process constantly produces many errors such as:

lseek(428, 0, SEEK_END)                 = 780124160
lseek(30, 0, SEEK_END)                  = 212992
read(9, 0x7ffe4001f557, 1)              = -1 EAGAIN (Resource temporarily unavailable)
lseek(680, 0, SEEK_END)                 = 493117440
read(9, 0x7ffe4001f557, 1)              = -1 EAGAIN (Resource temporarily unavailable)
lseek(774, 0, SEEK_END)                 = 583368704

...[snip]...

read(9, 0x7ffe4001f557, 1)              = -1 EAGAIN (Resource temporarily unavailable)
lseek(774, 0, SEEK_END)                 = 583368704
read(9, 0x7ffe4001f557, 1)              = -1 EAGAIN (Resource temporarily unavailable)
lseek(277, 0, SEEK_END)                 = 502882304
lseek(6, 516096, SEEK_SET)              = 516096
read(6, "\227\320\5\0\1\0\0\0\0\340\7\246\26\274\0\0\315\0\0\0\0\0\0\0}\0178\5&/\260\r"..., 8192) = 8192
read(9, 0x7ffe4001f557, 1)              = -1 EAGAIN (Resource temporarily unavailable)
lseek(735, 0, SEEK_END)                 = 272809984
read(9, 0x7ffe4001f557, 1)              = -1 EAGAIN (Resource temporarily unavailable)
lseek(277, 0, SEEK_END)                 = 502882304

ls -l fd/9
lr-x------ 1 postgres postgres 64 Oct 21 06:21 fd/9 -> pipe:[46358]

Perf top on recovery produces:

27.76% postgres            [.] pglz_decompress
   9.90% [kernel]            [k] entry_SYSCALL_64_after_swapgs
   7.09% postgres            [.] hash_search_with_hash_value
   4.26% libpthread-2.23.so [.] llseek
   3.64% libpthread-2.23.so [.] __read_nocancel
   2.80% [kernel]            [k] __fget_light
   2.67% postgres            [.] 0x000000000034d3ba
   1.85% [kernel]            [k] ext4_llseek
   1.84% postgres            [.] pg_comp_crc32c_sse42
   1.44% postgres            [.] hash_any
   1.35% postgres            [.] 0x000000000036afad
   1.29% postgres            [.] MarkBufferDirty
   1.21% postgres            [.] XLogReadRecord
[...]

Tried changing the process limits with prlimit to unlimited, but no change.

I can turn off the WAL compression but I doubt this is the main culprit. Any ideas appreciated.

Regards,
Boris

Cordialmente,

Ing. Hellmuth I. Vargas S.
Esp. Telemática y Negocios por Internet

Oracle Database 10g Administrator Certified Associate

EnterpriseDB Certified PostgreSQL 9.3 Associate

pgsql-general by date:

From: Jeff Janes
Date: 23 October 2018, 13:06:06
Subject: Re: Replication question

From: "Daniel Fink (PDF)"
Date: 23 October 2018, 13:24:28
Subject: Enabling bdr in multiple databases on the same postgresql instance/cluster

Re: Postgres 10, slave not catching up with master - Mailing list pgsql-general

Previous

Next