We're looking for spikes in 'blk', which counts lwlock acquisitions that had to block. If you're not seeing any, that suggests a buffer pin related issue -- which is also supported by the fact that raising shared buffers didn't help. If you're not seeing 'blk's, go ahead and disable the stats macro.
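(For reference, a quick-and-dirty way to tally the blk counts out of the stderr log might be something like the following; the exact LWLOCK_STATS line layout varies by version, and "postmaster.log" just stands in for wherever your server's stderr goes, so treat this as a sketch.)

  # sketch: pull every "blk N" pair out of the LWLOCK_STATS output and tally the values
  grep 'lwlock' postmaster.log |
    awk '{ for (i = 1; i < NF; i++) if ($i == "blk") print $(i+1) }' |
    sort -n | uniq -c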
most blk values are 0, some are 1, and a few hit 100. I can't say the split between blk = 0 and blk > 0 looks much different during the stalls.
So, what we need to know now is: *) What happens when you drastically *lower* shared buffers? Say, to 64MB? Note, you may experience higher load for unrelated reasons and have to scuttle the test. Also, if you have to crank it higher to accommodate internal server structures, do that. This is a hail mary, but maybe something interesting will shake out.
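(To be clear, the change I have in mind is just the one line below, followed by a full restart; a sketch, adjust for however you manage your config.)

  # postgresql.conf -- shared_buffers only takes effect after a restart
  shared_buffers = 64MB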
lowering shared_buffers didn't help.
*) How many distinct query plans does it take to trigger the condition? Hopefully it's not too many; if so, let's start gathering the plans. If you have a lot of plans to sift through, one thing we can try in order to cut the noise is to tweak log_min_duration_statement so that, during stall times only, it logs the offending queries that are unexpectedly blocking.
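(Something along these lines could be flipped on only while a stall is in progress and backed out afterwards, no restart needed; the 500ms threshold is just a guess, pick whatever separates "stalled" from "normal" on your box.)

  # postgresql.conf -- value is in milliseconds when no unit is given
  log_min_duration_statement = 500

  -- then, from a superuser session, pick the change up without restarting:
  SELECT pg_reload_conf();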
unfortunately, there are quite a few query plans... also, I don't think setting log_min_duration_statement will help us, because when the server is hitting a high load average it reacts slowly even to a key press, so even non-offending queries take a long time to execute. I see all sorts of queries running long during a stall, ranging from simple
LOG: duration: 1131.041 ms statement: SELECT 'DBD::Pg ping test'
to complex ones, joining multiple tables.
We are still looking into all the logged queries in an attempt to find the ones that are causing the problem; I'll report back if we find any clues.
*) Approximately how big is your 'working set' -- the data your queries are routinely hitting?
I *think* it's in the few-hundred-MB range.
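(If you want to put a harder number on that, one rough way -- assuming the pg_buffercache contrib module is installed -- is to count buffers that have been hit repeatedly; just a sketch, and the usagecount cutoff of 3 is arbitrary.)

  -- rough working-set estimate: shared buffers with a non-trivial usage count
  SELECT pg_size_pretty(count(*) * current_setting('block_size')::bigint)
    FROM pg_buffercache
   WHERE usagecount >= 3;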
*) Is the distribution of the *types* of queries uniform? Or do you have special processes that occur on intervals?
it's pretty uniform.
Thanks for your patience.
oh no, thank you for trying to help me to resolve this issue.