[GENERAL] Logical decoding CPU-bound w/ large number of tables - Mailing list pgsql-general

From Mathieu Fenniak
Subject [GENERAL] Logical decoding CPU-bound w/ large number of tables
Date
Msg-id CAHoiPjzea6N0zuCi=+f9v_j94nfsy6y8SU7-=bp4=7qw6_i=Rg@mail.gmail.com
Whole thread Raw
Responses Re: [GENERAL] Logical decoding CPU-bound w/ large number of tables  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: [GENERAL] Logical decoding CPU-bound w/ large number of tables  (Andres Freund <andres@anarazel.de>)
List pgsql-general
Heyo,

I'm attempting to use logical decoding with the streaming replication protocol to perform change-data-capture on PostgreSQL 9.5.4.  I'm seeing the replication stream "stall" for long periods of time where the walsender process will be pinned at 100% CPU utilization, but no data is being sent to my client.

The stalls occur unpredictably on my production system, but generally seem to be correlated with schema operations.  My source database has about 100,000 tables; it's a one-schema-per-tenant multi-tenant SaaS system.

I've reproduced the same symptoms with two different approaches on my local machine.  With both, I have a replication client connected via the streaming protocol.

In reproduction approach 1, I've created a thread that inserts small sets of data, and a thread that creates a schema w/ 500 tables and then drops it.  This approach has pinned CPU usage, but data does come out of it -- just excruciatingly slow when compared to the same test without the schema create & drop.

In reproduction approach 2, I've created a database w/ 100,000 tables on it and performed a "vacuum ful".  The walsender goes to 100% CPU and no data comes out of the replication stream for hours.

I've performed a CPU sampling with the OSX `sample` tool based upon reproduction approach #1: https://gist.github.com/mfenniak/366d7ed19b2d804f41180572dc1600d8  It appears that most of the time is spent in the RelfilenodeMapInvalidateCallback and CatalogCacheIdInvalidate cache invalidation callbacks, both of which appear to be invalidating caches based upon the cache value.

Has anyone else run into this kind of performance problem?  Any thoughts on how it might be resolved?  I don't mind putting in the work if someone could describe what is happening here, and have a discussion with me about what kind of changes might be necessary to improve the performance.

Thanks all,


Mathieu Fenniak | Senior Software Architect | Phone 1-587-315-1185

Replicon | The leader in cloud time tracking applications - 7,800+ Customers - 70+ Countries - 1.5 Million Users
www.replicon.com | facebook | linkedin | twitter | blog | contact us

We are hiring! | search jobs



pgsql-general by date:

Previous
From: Bill Moran
Date:
Subject: Re: [GENERAL] Can PG replace redis, amqp, s3 in the future?
Next
From: Justin Pryzby
Date:
Subject: [GENERAL] PG96 pg_restore connecting to PG95 causes ERROR: unrecognizedconfiguration parameter "idle_in_transaction_session_timeout"