Re: Optimizing Database High CPU - Mailing list pgsql-general

From Scottix
Subject Re: Optimizing Database High CPU
Date
Msg-id CANKFHZ-KVt08zRkbZnLJHprMnLLgVdLejHybECvLWCDormO1Xg@mail.gmail.com
Whole thread Raw
In response to Re: Optimizing Database High CPU  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-general
Hey,
So I finally found the culprit. Turns out to be the THP fighting with itself.

After running on Ubuntu
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

It instantly went from a loadavg of 30 to 3

Also make sure you re-enable on reboot.

Anyway just wanted to give a followup on the issue incase anyone else is having the same problem.

On Mon, Mar 4, 2019 at 12:03 PM Jeff Janes <jeff.janes@gmail.com> wrote:
On Wed, Feb 27, 2019 at 5:01 PM Michael Lewis <mlewis@entrata.com> wrote:
If those 50-100 connections are all active at once, yes, that is high.  They can easily spend more time fighting each other over LWLocks, spinlocks, or cachelines rather than doing useful work.  This can be exacerbated when you have multiple sockets rather than all cores in a single socket.  And these problems are likely to present as high Sys times.

Perhaps you can put up a connection pooler which will allow 100 connections to all think they are connected at once, but forces only 12 or so to actually be active at one time, making the others transparently queue.

Can you expound on this or refer me to someplace to read up on this?

Just based on my own experimentation.  This is not a blanket recommendation,  but specific to the situation that we already suspect there is contention, and the server is too old to have pg_stat_actvity.wait_event column.
   
Context, I don't want to thread jack though: I think I am seeing similar behavior in our environment at times with queries that normally take seconds taking 5+ minutes at times of high load. I see many queries showing buffer_mapping as the LwLock type in snapshots but don't know if that may be expected.

It sounds like your processes are fighting to reserve buffers in shared_buffers in which to read data pages.  But those data pages are probably already in the OS page cache, otherwise reading it from disk would be slow enough that you would be seeing some type of IO wait, or buffer_io, rather than buffer_mapping as the dominant wait type.  So I think that means you have most of your data in RAM, but not enough of it in shared_buffers.  You might be in a rare situation where setting shared_buffers to a high fraction of RAM, rather than the usual low fraction, is called for.  Increasing NUM_BUFFER_PARTITIONS might also be useful, but that requires a recompilation of the server.  But do these spikes correlate with anything known at the application level?  A change in the mix of queries, or a long report or maintenance operation?  Maybe the query plans briefly toggle over to using seq scans rather than index scans or vice versa, which drastically changes the block access patterns?
 
In our environment PgBouncer will accept several hundred connections and allow up to 100 at a time to be active on the database which are VMs with ~16 CPUs allocated (some more, some less, multi-tenant and manually sharded). It sounds like you are advocating for connection max very close to the number of cores. I'd like to better understand the pros/cons of that decision.

There are good reasons to allow more than that.  For example, your application holds some transactions open briefly while it does some cogitation on the application-side, rather than immediately committing and so returning the connection to the connection pool.  Or your server has a very high IO capacity and benefits from lots of read requests in the queue at the same time, so it can keep every spindle busy and every rotation productive.  But, if you have no reason to believe that any of those situations apply to you, but do have evidence that you have lock contention between processes, then I think that limiting the number active processes to the number of cores is a good starting point.
 
Cheers,

Jeff


--
T: @Thaumion
IG: Thaumion
Scottix@Gmail.com

pgsql-general by date:

Previous
From: Erik Jones
Date:
Subject: Re: Hot Standby Conflict on pg_attribute
Next
From: Susan Hurst
Date:
Subject: TAbles/Columns missing in information schema