On Tue, Nov 24, 2015 at 9:57 PM, 657985552@qq.com <657985552@qq.com> wrote:
> oh .thanks i understand . but i still have a question .
> [root@pg1 pgdata]# uname -a
> Linux pg1 3.10.0-123.9.3.el7.x86_64 #1 SMP Thu Nov 6 15:06:03 UTC 2014
> x86_64 x86_64 x86_64 GNU/Linux
> [root@pg1 pgdata]# cat /etc/redhat-release
> CentOS Linux release 7.0.1406 (Core)
>
> my os is centos7 . is there THP problem in it ?
Yes. The settings posted above (*/transparent_hugepage/*) are the smoking gun.
I've had the exact same problem as you; suddenly the database slows
down to zero and just as suddenly goes back to normal. What is
happening here is that the operating system put in some
"optimizations" to help systems manage large amounts of memory
(typical server memory configurations have gone up by several orders
of magnitude since the 4kB page size was chosen). These optimizations
do not play well with postgres memory access patterns; the operating
system is forced to defragment memory at random intervals, which slows
down memory access, causing spinlock problems. Basically postgres and
the kernel get into a very high speed argument over memory access.
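
To see which THP mode a box is actually running, the kernel marks the
active choice in square brackets in the files under
/sys/kernel/mm/transparent_hugepage/. Below is a minimal sketch that
reads them. The helper names (active_thp_value, read_thp_settings) are
made up for illustration, and the directory is the mainline kernel
path; some older distribution kernels may expose it under a different
name, so treat the path as an assumption to verify on your system.

```python
import re
from pathlib import Path

# Assumed location of the THP controls (mainline kernel layout).
THP_DIR = Path("/sys/kernel/mm/transparent_hugepage")


def active_thp_value(text):
    """Return the bracketed (active) choice from a line such as
    '[always] madvise never', or None if nothing is bracketed."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


def read_thp_settings(base=THP_DIR):
    """Collect the active 'enabled' and 'defrag' modes, skipping
    files that do not exist on this system."""
    settings = {}
    for name in ("enabled", "defrag"):
        path = Path(base) / name
        if path.is_file():
            settings[name] = active_thp_value(path.read_text())
    return settings


if __name__ == "__main__":
    for name, value in read_thp_settings().items():
        print(f"{name}: {value}")
```

If either file reports "always", THP is in play. Writing "never" to
both files as root turns it off until the next reboot.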

Lowering shared_buffers to around 2GB also provides relief. This
suggests that clock sweep is a contributor to the problem; in
particular, its maintenance of usage_count (which IIRC is changing
soon to a pure atomic update) would be a place to start sniffing
around if we wanted to Really Fix It. So far though, no one has been
able to reproduce this issue on a non-production system.
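
For anyone unfamiliar with clock sweep, here is a toy model of the
idea. It is a simplified illustration, not the postgres code: the
class names, the cap of 5, and the single-threaded structure are all
assumptions made for the sketch. The point is only that every step of
the sweep reads and writes a per-buffer usage_count that lives in
shared memory.

```python
MAX_USAGE_COUNT = 5  # small cap on the counter; value chosen for illustration


class Buffer:
    def __init__(self):
        self.usage_count = 0   # bumped on access, decayed by the sweep
        self.pinned = False    # pinned buffers are never evicted


class ClockSweep:
    def __init__(self, n_buffers):
        self.buffers = [Buffer() for _ in range(n_buffers)]
        self.hand = 0          # position of the clock hand

    def touch(self, idx):
        """Record an access: raise usage_count, saturating at the cap."""
        buf = self.buffers[idx]
        buf.usage_count = min(buf.usage_count + 1, MAX_USAGE_COUNT)

    def find_victim(self):
        """Advance the hand until an unpinned buffer with usage_count 0
        is found, decrementing the counters passed along the way."""
        if all(b.pinned for b in self.buffers):
            raise RuntimeError("no unpinned buffers available")
        while True:
            idx = self.hand
            self.hand = (self.hand + 1) % len(self.buffers)
            buf = self.buffers[idx]
            if buf.pinned:
                continue
            if buf.usage_count == 0:
                return idx
            buf.usage_count -= 1


pool = ClockSweep(3)
pool.touch(0)
pool.touch(0)
pool.touch(1)
first = pool.find_victim()
second = pool.find_victim()
```

With a larger pool the hand has more counters to walk and modify
before it finds a victim, which fits the observation that a smaller
shared_buffers setting eases the symptoms.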

I guess if we were using (non-portable) futexes instead of hand-written
spinlocks we'd probably have fewer problems in this area.
Nevertheless, given the huge performance risks, I really wonder what
Red Hat was thinking when they enabled it by default.
merlin