BUG #15187: When use huge page,there may be a lot of hanged connections with status startup orauthentication - Mailing list pgsql-bugs

From PG Bug reporting form
Subject BUG #15187: When use huge page,there may be a lot of hanged connections with status startup orauthentication
Date
Msg-id 152545168641.9460.10767825715065371428@wrigleys.postgresql.org
Whole thread Raw
Responses Re: BUG #15187: When use huge page, there may be a lot of hanged connections with status startup or authentication
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      15187
Logged by:          Chen huajun
Email address:      chjischj@163.com
PostgreSQL version: 10.2
Operating system:   CentOS release 6.5
Description:

◆phenomenon
When use  huge page, there may be a lot of hanged connections with status
startup or authentication state when multiple concurrent query operations
are performed.

◆Env
hardware:Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz  16C/128G/300GB SSD
OS:CentOS release 6.5 x64
PostgreSQL:10.2

◆How to reproduct?
1) set huge_pages to on in postgresql.conf. 
2) run the following script

count=0
while [ $count -le 2000 ]; do
    psql -c 'select count(*) from test20180502;' &
    count=$((count + 1))
done

table define
----------------------------
CREATE TABLE test20180502  (
    c1 int not null,
    c2 character(10000) DEFAULT 'AAAAA',
    PRIMARY KEY(c1)
);


◆The investigation of the cause 
1、there will be a lot of improper connections with status startup or
authentication or SELECT state:
# ps -eaf | grep postgres |grep SELECT |wc
      2      24     204
# ps -eaf | grep postgres |grep authen |wc
    410    4920   45097
# ps -eaf | grep postgres |grep startup |wc
     78     936    8034
2、According to the stack information (*), it can be seen that the function
PGSemaphoreLock has been in a waiting state because of the incorrect SEMA
intelligence when performing the following operation. 
errStatus = sem_wait(PG_SEM_REF(sema));

3、By looking at the system table pg_stat_activity, I've found that the
wait_event_type and wait_event of some query operations are in the following
state. 
postgres=# select pid,wait_event_type,wait_event,state,backend_xid,query
from pg_stat_activity;
 pid  | wait_event_type |     wait_event      | state  | backend_xid |
                                 query
  

------+-----------------+---------------------+--------+-------------+--------------------------------------------------------------------------------------
 6333 | Activity        | AutoVacuumMain      |        |             | 
 6337 | Activity        | LogicalLauncherMain |        |             | 
 6408 | LWLock          | buffer_mapping      | active |             |
select count(*) from test20180502;
 6414 | LWLock          | buffer_mapping      | active |             |
select count(*) from test20180502;
 7509 |                 |                     | active |             |
select pid,wait_event_type,wait_event,state,backend_xid,query from
pg_stat_activity;
 6331 | Activity        | BgWriterMain        |        |             | 
 6330 | Activity        | CheckpointerMain    |        |             | 
 6332 | Activity        | WalWriterMain       |        |             | 
 
postgres=# select locktype,relation::regclass as
table,a.pid,mode,granted,substring(query,1,50) from pg_locks a,
pg_stat_activity b where a.pid = b.pid;
  locktype  |           table           | pid  |      mode       | granted |
                    substring                      

------------+---------------------------+------+-----------------+---------+----------------------------------------------------
 relation   | test20180502_pkey         | 6408 | AccessShareLock | t       |
select count(*) from test20180502;
 relation   | test20180502              | 6408 | AccessShareLock | t       |
select count(*) from test20180502;
 virtualxid |                           | 6408 | ExclusiveLock   | t       |
select count(*) from test20180502;
 relation   | test20180502_pkey         | 6414 | AccessShareLock | t       |
select count(*) from test20180502;
 relation   | test20180502              | 6414 | AccessShareLock | t       |
select count(*) from test20180502;
 virtualxid |                           | 6414 | ExclusiveLock   | t       |
select count(*) from test20180502;
 
*When there is a large number of improper connections in the startup or
authentication or SELECT state, the stack information is as follows: 
◇authentication state
#0  0x000000321dc0d930 in sem_wait () from /lib64/libpthread.so.0
#1  0x00000000006a5240 in PGSemaphoreLock ()
#2  0x00000000007152e5 in LWLockAcquire ()
#3  0x000000000070fc8f in LockRelease ()
#4  0x000000000070c323 in UnlockRelationId ()
#5  0x00000000004ad2d2 in relation_close ()
#6  0x000000000080ed21 in CatalogCacheInitializeCache ()
#7  0x0000000000810ee0 in SearchCatCache ()
#8  0x000000000081dc08 in GetSysCacheOid ()
#9  0x00000000007391a7 in get_role_oid ()
#10 0x000000000062d102 in hba_getauthmethod ()
#11 0x000000000062b729 in ClientAuthentication ()
#12 0x0000000000833fb3 in InitPostgres ()
#13 0x0000000000724b52 in PostgresMain ()
#14 0x00000000006b8c8a in PostmasterMain ()
#15 0x00000000006395e0 in main ()
◇startup state
#0  0x000000321dc0d930 in sem_wait () from /lib64/libpthread.so.0
#1  0x00000000006a5240 in PGSemaphoreLock ()
#2  0x00000000007152e5 in LWLockAcquire ()
#3  0x0000000000710508 in LockAcquireExtended ()
#4  0x000000000070cce8 in LockRelationOid ()
#5  0x00000000004ad405 in relation_open ()
#6  0x00000000004ad436 in heap_open ()
#7  0x000000000080ecce in CatalogCacheInitializeCache ()
#8  0x0000000000810ee0 in SearchCatCache ()
#9  0x0000000000845efd in superuser_arg ()
#10 0x0000000000833b17 in InitPostgres ()
#11 0x0000000000724b52 in PostgresMain ()
#12 0x00000000006b8c8a in PostmasterMain ()
#13 0x00000000006395e0 in main ()
◇SELECT state
#0  0x000000321dc0d930 in sem_wait () from /lib64/libpthread.so.0
#1  0x00000000006a5240 in PGSemaphoreLock ()
#2  0x00000000007152e5 in LWLockAcquire ()
#3  0x0000000000710508 in LockAcquireExtended ()
#4  0x000000000070cce8 in LockRelationOid ()
#5  0x00000000004ad405 in relation_open ()
#6  0x00000000004ad436 in heap_open ()
#7  0x000000000080ecce in CatalogCacheInitializeCache ()
#8  0x0000000000810ee0 in SearchCatCache ()
#9  0x000000000081d4e8 in get_tablespace ()
#10 0x000000000081d5c1 in get_tablespace_page_costs ()
#11 0x000000000066ba15 in cost_seqscan ()
#12 0x000000000069cc23 in create_seqscan_path ()
#13 0x0000000000665423 in set_rel_pathlist ()
#14 0x0000000000666716 in make_one_rel ()
#15 0x0000000000682e91 in query_planner ()
#16 0x0000000000687055 in grouping_planner ()
#17 0x00000000006887c9 in subquery_planner ()
#18 0x00000000006895bd in standard_planner ()
#19 0x00000000007238ca in pg_plan_query ()
#20 0x000000000072399e in pg_plan_queries ()
#21 0x000000000072406a in exec_simple_query ()
#22 0x0000000000724f09 in PostgresMain ()
#23 0x00000000006b8c8a in PostmasterMain ()
#24 0x00000000006395e0 in main ()

◆The additional occurrence condition
a)Depend on the number of table records. 
The probleam do not happen when table contains few records
 
tablename      record       result
test20180502    10000       not happen
                50000       not happen
                57500       happen
                65000       happen
                80000       happen
               100000       happen
              1000000       happen
test20180503    10000       not happen
                50000       not happen
                57500       not happen
                65000       not happen
                80000       not happen
               100000       happen
              1000000       happen

table define
----------------------------
CREATE TABLE test20180502  (
    c1 int not null,
    c2 character(10000) DEFAULT 'AAAAA',
    PRIMARY KEY(c1)
);

CREATE TABLE test20180503  (
    c1 int not null,
    PRIMARY KEY(c1)
);


b) Depend on OS
CentOS release 6.5 (Final) 2.6.32-431.el6.x86_64, happen
CentOS Linux release 7.3.1611 (Core) 3.10.0-514.el7.x86_64, not happen

c) Depend on huge page
huge_page=on, happen(no matter transparent_hugepage is [always] or
[never])
huge_page=off, not happen

d) Depend on PostgreSQL Version
PostgreSQL 10.2 happen
PostgreSQL 9.6 not happen


pgsql-bugs by date:

Previous
From: Alexey Ermakov
Date:
Subject: Re: BUG #15184: Planner overestimates number of rows in empty table
Next
From: Tom Lane
Date:
Subject: Re: BUG #15187: When use huge page, there may be a lot of hanged connections with status startup or authentication