Home > mailing lists

[GENERAL] Logical decoding CPU-bound w/ large number of tables - Mailing list pgsql-general

From	Mathieu Fenniak
Subject	[GENERAL] Logical decoding CPU-bound w/ large number of tables
Date	May 5, 2017 23:24:07
Msg-id	CAHoiPjzea6N0zuCi=+f9v_j94nfsy6y8SU7-=bp4=7qw6_i=Rg@mail.gmail.com Whole thread Raw
Responses	Re: [GENERAL] Logical decoding CPU-bound w/ large number of tables Re: [GENERAL] Logical decoding CPU-bound w/ large number of tables
List	pgsql-general

Tree view

Heyo,

I'm attempting to use logical decoding with the streaming replication protocol to perform change-data-capture on PostgreSQL 9.5.4. I'm seeing the replication stream "stall" for long periods of time where the walsender process will be pinned at 100% CPU utilization, but no data is being sent to my client.

The stalls occur unpredictably on my production system, but generally seem to be correlated with schema operations. My source database has about 100,000 tables; it's a one-schema-per-tenant multi-tenant SaaS system.

I've reproduced the same symptoms with two different approaches on my local machine. With both, I have a replication client connected via the streaming protocol.

In reproduction approach 1, I've created a thread that inserts small sets of data, and a thread that creates a schema w/ 500 tables and then drops it. This approach has pinned CPU usage, but data does come out of it -- just excruciatingly slow when compared to the same test without the schema create & drop.

In reproduction approach 2, I've created a database w/ 100,000 tables on it and performed a "vacuum ful". The walsender goes to 100% CPU and no data comes out of the replication stream for hours.

I've performed a CPU sampling with the OSX `sample` tool based upon reproduction approach #1: https://gist.github.com/mfenniak/366d7ed19b2d804f41180572dc1600d8 It appears that most of the time is spent in the RelfilenodeMapInvalidateCallback and CatalogCacheIdInvalidate cache invalidation callbacks, both of which appear to be invalidating caches based upon the cache value.

Has anyone else run into this kind of performance problem? Any thoughts on how it might be resolved? I don't mind putting in the work if someone could describe what is happening here, and have a discussion with me about what kind of changes might be necessary to improve the performance.

Thanks all,

Mathieu Fenniak | Senior Software Architect | Phone 1-587-315-1185

We are hiring! | search jobs

pgsql-general by date:

From: Bill Moran
Date: 05 May 2017, 23:21:24
Subject: Re: [GENERAL] Can PG replace redis, amqp, s3 in the future?

From: Justin Pryzby
Date: 06 May 2017, 03:22:53
Subject: [GENERAL] PG96 pg_restore connecting to PG95 causes ERROR: unrecognizedconfiguration parameter "idle_in_transaction_session_timeout"

[GENERAL] Logical decoding CPU-bound w/ large number of tables - Mailing list pgsql-general

Previous

Next