Thread: Very slow planning performance on partition table
Hi, I have a table partitioned into about 60 child tables. I found that the planning time of a simple query on the partition key is very slow:

# explain analyze select count(*) as cnt from article where pid=88 and hash_code='2ca3ff8b17b163f0212c2ba01b80a064';
                                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16.55..16.56 rows=1 width=0) (actual time=0.259..0.259 rows=1 loops=1)
   ->  Append  (cost=0.00..16.55 rows=2 width=0) (actual time=0.248..0.250 rows=1 loops=1)
         ->  Seq Scan on article  (cost=0.00..0.00 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1)
               Filter: ((pid = 88) AND (hash_code = '2ca3ff8b17b163f0212c2ba01b80a064'::bpchar))
         ->  Index Scan using article_88_hash_idx on article_88 article  (cost=0.00..16.55 rows=1 width=0) (actual time=0.246..0.248 rows=1 loops=1)
               Index Cond: (hash_code = '2ca3ff8b17b163f0212c2ba01b80a064'::bpchar)
               Filter: (pid = 88)
 Total runtime: 3.816 ms
(8 rows)

Time: 30999.986 ms

You can see from the timing output that the whole 'explain analyze' took 30 seconds, while the select itself takes only 3 ms. My partition key is article.pid and the check constraint on each child is simple, like: CONSTRAINT article_88_pid_check CHECK (pid = 88). What's wrong and how can I improve the planning performance?
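For reference, a minimal sketch of the layout described above (table, constraint, and index names come from the post; the column types and index definition are assumptions):

    -- Parent table; the real table has more columns.
    CREATE TABLE article (
        pid       int NOT NULL,
        hash_code char(32)
    );

    -- One of the ~60 child partitions, selected by a simple CHECK constraint.
    CREATE TABLE article_88 (
        CONSTRAINT article_88_pid_check CHECK (pid = 88)
    ) INHERITS (article);

    -- Index used by the plan above (name from the plan; definition assumed).
    CREATE INDEX article_88_hash_idx ON article_88 (hash_code);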
On Wed, Jul 23, 2014 at 6:21 AM, Rural Hunter <ruralhunter@gmail.com> wrote:
What's wrong and how can I improve the planning performance?
What is constraint exclusion set to?
Douglas J Hunley (doug.hunley@gmail.com)
<div class="moz-cite-prefix">It's the default value(partition): <br /> # grep exclusion postgresql.conf<br /> #constraint_exclusion= partition # on, off, or partition<br /><br /> btw, I'm on postgresql 9.2.4<br /><br /> 在 2014/7/241:35, Douglas J Hunley 写道:<br /></div><blockquote cite="mid:CALxYTP66Mi9-GZkLdP4RBf1goUn2WGmGp0ra_yE055zYB_PGpQ@mail.gmail.com"type="cite"><div dir="ltr"><div class="gmail_extra"><br/><div class="gmail_quote">On Wed, Jul 23, 2014 at 6:21 AM, Rural Hunter <span dir="ltr"><<a href="mailto:ruralhunter@gmail.com"moz-do-not-send="true" target="_blank">ruralhunter@gmail.com</a>></span> wrote:<br/><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">What'swrong and how can I improve the planning performance?</blockquote></div><br /> What is constraintexclusion set to?<br /><br clear="all" /><div><br /></div> -- <br /><div dir="ltr">Douglas J Hunley (<a href="mailto:doug.hunley@gmail.com"moz-do-not-send="true" target="_blank">doug.hunley@gmail.com</a>)</div></div></div></blockquote><br/>
Rural Hunter <ruralhunter@gmail.com> writes: > I have a table partitioned with about 60 children tables. Now I found > the planning time of simple query with partition key are very slow. > ... > You can see the timing output that the actual run time of the 'explain > analyze' is 30 seconds while the select sql itself takes only 3 ms. My > partition key is on article.pid and the constraint is simple like this: > CONSTRAINT article_88_pid_check CHECK (pid = 88). What's wrong and how > can I improve the planning performance? [ shrug... ] Insufficient data. When I try a simple test case based on what you've told us, I get planning times of a couple of milliseconds. I can think of contributing factors that would increase that, but not by four orders of magnitude. So there's something very significant that you've left out. Can you construct a self-contained test case that's this slow? regards, tom lane
On 2014/7/25 9:53, Tom Lane wrote:
> [ shrug... ]  Insufficient data.  When I try a simple test case based on
> what you've told us, I get planning times of a couple of milliseconds.
> I can think of contributing factors that would increase that, but not by
> four orders of magnitude.  So there's something very significant that
> you've left out.  Can you construct a self-contained test case that's
> this slow?
>
> 			regards, tom lane

No, I can't. I exported the db schema (without data) to another server and there is no problem there. Is the planning time related to data volume? Is there anything else I can check? I already checked the default statistics target and it's at the default value. I did change the statistics target on one column of the table, but that column is not involved in the slow-planning query.
On 2014/7/25 9:53, Tom Lane wrote:
> [ shrug... ]  Insufficient data.  When I try a simple test case based on
> what you've told us, I get planning times of a couple of milliseconds.
> I can think of contributing factors that would increase that, but not
> by four orders of magnitude.  So there's something very significant
> that you've left out.  Can you construct a self-contained test case
> that's this slow?
>
> 			regards, tom lane

I ran gdb on the backend process and got this:

(gdb) bt
#0  0x00007fc4a1b6cdb7 in semop () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00000000005f8703 in PGSemaphoreLock ()
#2  0x0000000000636703 in LWLockAcquire ()
#3  0x0000000000632eb3 in LockAcquireExtended ()
#4  0x000000000062fdfb in LockRelationOid ()
#5  0x0000000000474e55 in relation_open ()
#6  0x000000000047b39b in index_open ()
#7  0x00000000005f3c22 in get_relation_info ()
#8  0x00000000005f6590 in build_simple_rel ()
#9  0x00000000005f65db in build_simple_rel ()
#10 0x00000000005de8c0 in add_base_rels_to_query ()
#11 0x00000000005df352 in query_planner ()
#12 0x00000000005e0d51 in grouping_planner ()
#13 0x00000000005e2bbe in subquery_planner ()
#14 0x00000000005e2ef9 in standard_planner ()
#15 0x00000000006426e1 in pg_plan_query ()
#16 0x000000000064279e in pg_plan_queries ()
#17 0x00000000006f4b7a in BuildCachedPlan ()
#18 0x00000000006f4e1e in GetCachedPlan ()
#19 0x0000000000642259 in exec_bind_message ()
#20 0x0000000000643561 in PostgresMain ()
#21 0x000000000060347f in ServerLoop ()
#22 0x0000000000604121 in PostmasterMain ()
#23 0x00000000005a5ade in main ()

Does that indicate something? It seems the backend is waiting for some lock.
Anyone? With htop I can see many pg processes in BIND status; some of them hang like that for 30 minutes or more. I tried gdb on the same process many times and the trace is the same as in my previous post. This started happening after I partitioned my main tables into 60 child tables.

I'm also experiencing a CPU peak lasting around 30-60 minutes every 1-2 days. During the peak, all my CPUs (32 cores) are fully utilized while there is no special load, and memory and I/O are fine. Sometimes I had to kill the db process and restart the db to escape the situation. I tried upgrading to the latest 9.2.9 but it didn't help.

On 2014/7/25 22:23, Rural Hunter wrote:
> I ran gdb on the backend process and got this:
> (gdb) bt
> #0  0x00007fc4a1b6cdb7 in semop () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x00000000005f8703 in PGSemaphoreLock ()
> #2  0x0000000000636703 in LWLockAcquire ()
> #3  0x0000000000632eb3 in LockAcquireExtended ()
> #4  0x000000000062fdfb in LockRelationOid ()
> #5  0x0000000000474e55 in relation_open ()
> #6  0x000000000047b39b in index_open ()
> #7  0x00000000005f3c22 in get_relation_info ()
> #8  0x00000000005f6590 in build_simple_rel ()
> #9  0x00000000005f65db in build_simple_rel ()
> #10 0x00000000005de8c0 in add_base_rels_to_query ()
> #11 0x00000000005df352 in query_planner ()
> #12 0x00000000005e0d51 in grouping_planner ()
> #13 0x00000000005e2bbe in subquery_planner ()
> #14 0x00000000005e2ef9 in standard_planner ()
> #15 0x00000000006426e1 in pg_plan_query ()
> #16 0x000000000064279e in pg_plan_queries ()
> #17 0x00000000006f4b7a in BuildCachedPlan ()
> #18 0x00000000006f4e1e in GetCachedPlan ()
> #19 0x0000000000642259 in exec_bind_message ()
> #20 0x0000000000643561 in PostgresMain ()
> #21 0x000000000060347f in ServerLoop ()
> #22 0x0000000000604121 in PostmasterMain ()
> #23 0x00000000005a5ade in main ()
>
> Does that indicate something? It seems it's waiting for some lock.
Rural Hunter <ruralhunter@gmail.com> writes: >> Does that indicate something? seems it's waiting for some lock. Yeah, that's what the stack trace suggests. Have you looked into pg_locks and pg_stat_activity to see which lock it wants and what's holding said lock? regards, tom lane
Yes, I checked. The connection I inspected is the longest-running one. There were no other connections blocking it, and I also see that all its locks are granted. Does the planning phase require some internal locks?

On 2014/7/28 0:28, Tom Lane wrote:
> Yeah, that's what the stack trace suggests.  Have you looked into pg_locks
> and pg_stat_activity to see which lock it wants and what's holding said
> lock?
>
> 			regards, tom lane
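For completeness, this is the sort of per-backend check being described; the pid is just a placeholder for the hanging backend's process id:

    -- List every lock held or awaited by one backend; an ungranted row
    -- would point at the lock the planner is stuck on.
    SELECT locktype, relation::regclass AS relation, mode, granted
    FROM   pg_locks
    WHERE  pid = 12345   -- placeholder pid
    ORDER  BY granted, relation;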
This is the vmstat output when the high-load peak happens:

# vmstat 3
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
  r  b   swpd    free    buff     cache   si   so    bi    bo    in     cs us sy id wa
 54  1 756868 1047128 264572 341573472    0    0   243   272     1      2 51  7 39  2
 53  1 756888  924452 264508 341566080    0    7 31379  3623 53110 184808 29 55 15  1
 70  1 756892  992416 264408 341530880    0    3 14483  9455 53010 183758 23 61 15  1
 93  1 756900  954704 264160 341514208    0    3 20280  3391 66607 304526 23 59 17  1
 65  2 756916  998524 263696 341427520    0    5 23295  2084 53748 213259 26 60 12  1
 46  0 756924  969036 263636 341421088    0    3 23508  1447 51134 200739 22 59 19  1
123  1 756932  977336 263568 341426016    0    3 21444  2747 48044 174390 27 59 13  1
 71  2 756932  975932 263580 341483520    0    0 19328 89629 54321 234718 25 59 14  2
 47  5 756932  967004 263676 341502240    0    0 19509 52652 56792 236648 21 60 15  4
 70  0 756944 1038464 263660 341468800    0    4 21349  3584 51937 179806 25 59 15  1
 70  0 756940  923800 263532 341475712    0    0 15135  1524 58201 236794 21 59 19  1
 40  1 756940 1022420 263560 341506560    0    0  9163  4889 34702 130106 19 61 19  1
 59  0 756944  939380 263500 341518144    0    1 22809  4024 46398 224644 21 60 19  1
 56  1 756956  954656 263464 341469440    0    4 22927  4477 53705 175386 28 57 14  1
 39  0 756976  968204 263372 341376576    0    7 24612  2556 61900 262784 30 51 18  1
109  1 756984 1015260 263332 341323776    0    3 16636  4039 29271  83699  7 85  7  1
 76  6 756992  980044 263312 341308128    0    3  6949  1848 27496 130478  6 90  2  2
103  0 756992  963540 263308 341352064    0    0 22125  2493 20526  61133  4 88  6  2

It seems most of the CPU time is being spent in the sys column.
I am now seeing another phenomenon: hanging connections that show 'UPDATE' status in the process list.

(gdb) bt
#0  0x00007f783f79d4f7 in semop () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00000000005f97d3 in PGSemaphoreLock ()
#2  0x0000000000638153 in LWLockAcquire ()
#3  0x00000000004a9239 in ginStepRight ()
#4  0x00000000004a9c61 in ginFindLeafPage ()
#5  0x00000000004a8377 in ginInsertItemPointers ()
#6  0x00000000004a4548 in ginEntryInsert ()
#7  0x00000000004ae687 in ginInsertCleanup ()
#8  0x00000000004af3d6 in ginHeapTupleFastInsert ()
#9  0x00000000004a4ab1 in gininsert ()
#10 0x0000000000709b15 in FunctionCall6Coll ()
#11 0x000000000047b6b7 in index_insert ()
#12 0x000000000057f475 in ExecInsertIndexTuples ()
#13 0x000000000058bf07 in ExecModifyTable ()
#14 0x00000000005766e3 in ExecProcNode ()
#15 0x0000000000575ad4 in standard_ExecutorRun ()
#16 0x000000000064718f in ProcessQuery ()
#17 0x00000000006473b7 in PortalRunMulti ()
#18 0x0000000000647e8a in PortalRun ()
#19 0x0000000000645160 in PostgresMain ()
#20 0x000000000060459e in ServerLoop ()
#21 0x00000000006053bc in PostmasterMain ()
#22 0x00000000005a686b in main ()
(gdb) q

These connections cannot be killed by pg_cancel_backend or pg_terminate_backend. They just hang there and do not respond to a normal kill command; I had to kill -9 the process and terminate the whole postgresql instance. What happened there, and how can I kill these connections safely?
On Sun, Jul 27, 2014 at 9:28 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Rural Hunter <ruralhunter@gmail.com> writes:Yeah, that's what the stack trace suggests. Have you looked into pg_locks
>> Does that indicate something? seems it's waiting for some lock.
and pg_stat_activity to see which lock it wants and what's holding said
lock?
If it were waiting on a pg_locks lock, the semop should be coming from ProcSleep, not from LWLockAcquire, shouldn't it?
I'm guessing he has a lot of connections, and each connection is locking each partition in shared mode in rapid fire, generating spin-lock or cache-line contention.
Cheers,
Jeff
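One way to probe that guess (not from the thread, just a sketch) is to count how many relation locks each backend holds at a given moment; with ~60 partitions opened per query, the per-backend counts climb quickly:

    -- How many relation locks does each backend currently hold?
    SELECT l.pid,
           a.state,
           count(*) AS relation_locks
    FROM   pg_locks l
    JOIN   pg_stat_activity a ON a.pid = l.pid
    WHERE  l.locktype = 'relation'
    AND    l.granted
    GROUP  BY l.pid, a.state
    ORDER  BY relation_locks DESC;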
<div class="moz-cite-prefix">在 2014/7/29 1:29, Jeff Janes 写道:<br /></div><blockquote cite="mid:CAMkU=1ydmpt9XRMxt0sPNnQsXEoF_c7bgp2kHxtDbPNGg5Vj5w@mail.gmail.com"type="cite"><div dir="ltr"><div class="gmail_extra"><br/><div class="gmail_quote"><div>If it were waiting on a pg_locks lock, the semop should be comingfrom ProcSleep, not from <span style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px">LWLockAcquire,shouldn't it?</span></div><div><span style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:13px"><br/></span></div><div><font color="#000000" face="arial,sans-serif">I'm guessing he has a lot of connections, and each connection is locking each partition in sharedmode in rapid fire, generating spin-lock or cache-line contention.</font></div><div><font color="#000000" face="arial,sans-serif"><br /></font></div><div>Cheers,</div><div><br /></div><div>Jeff</div></div></div></div></blockquote>Yes. I have a lot of connections and they maybe coming together anddoing the same update statement without partition key on the partition table.<br />
More information found. After the hanging connections appeared, I noticed there were several hundred connections from the same user. Since I use pgbouncer and set the pool size to 50 for each user, this is very strange. I checked the pgbouncer side: 'show pools' showed the active server connection count was less than 50 (only 35, actually). I also checked the client ports shown in the pg process list; they were not in use on the pgbouncer side when I did the check. So I stopped pgbouncer, and the connection count from that user dropped slowly until all those connections disappeared. After that I restarted pgbouncer and it looks good again.

With this workaround I at least don't have to kill pg when the problem happens. But does anyone have a clue why this happens? What should I check to find the root cause? One thing I forgot to check is the network status of those orphan connections on the pg side; I will check it next time and see if they are in an abnormal state.
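A sketch of the kind of server-side check that makes this visible (not from the original post): count backends per role and client address and compare against the pgbouncer pool size.

    -- Backends per user and client address; anything far above the
    -- pgbouncer pool_size (50 here) deserves a closer look.
    SELECT usename, client_addr, state, count(*) AS connections
    FROM   pg_stat_activity
    GROUP  BY usename, client_addr, state
    ORDER  BY connections DESC;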
On Tue, Jul 29, 2014 at 1:21 AM, Rural Hunter <ruralhunter@gmail.com> wrote:
More information found. After the hang connection appears, I noticed there were several hundreds of connections of the same user. Since I use pgbouncer and I only set the pool size to 50 for each user, this is very strange. I checked the pgbouncer side, 'show pools' showed the active server connection count is less than 50(only 35 actually). I also checked the client port which is shown in pg process list. It is not used at pgbouncer side when I did the check. So I stopped pgbouncer then the connection count from the user drops slowly. Finally all those connections disappeared. After that I restarted pgbouncer and it looks good again.
With this solution, I at least don't have to kill pg when the problem happens. But anyone has a clue why this happens?
It sounds like someone is bypassing your pgbouncer and connecting directly to your database. Maybe they tried to create their own parallelization and have a master connection going through pgbouncer and create many auxiliary connections that go directly to the database (probably because pgbouncer wouldn't let them create as many connections as they wanted through it). That would explain why the connections slowly drain away once pgbouncer is shut down.
Can you change your pg_hba.conf file so that it only allows connections from pgbouncer's IP address? This should flush out the culprit pretty quickly.
Cheers,
Jeff
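A complementary server-side check along the same lines (a sketch; the pgbouncer host address is a placeholder) is to list any backend whose client address is not the pgbouncer machine:

    -- Any backend not connected from the pgbouncer host is bypassing the pooler.
    -- 10.0.0.5 is a placeholder for the real pgbouncer host address.
    SELECT pid, usename, client_addr, client_port, state, backend_start
    FROM   pg_stat_activity
    WHERE  client_addr IS DISTINCT FROM inet '10.0.0.5';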
On 2014/7/30 1:27, Jeff Janes wrote:
> It sounds like someone is bypassing your pgbouncer and connecting
> directly to your database. Maybe they tried to create their own
> parallelization and have a master connection going through pgbouncer
> and create many auxiliary connections that go directly to the database
> (probably because pgbouncer wouldn't let them create as many
> connections as they wanted through it). That would explain why the
> connections slowly drain away once pgbouncer is shut down.
>
> Can you change your pg_hba.conf file so that it only allows
> connections from pgbouncer's IP address? This should flush out the
> culprit pretty quickly.
>
> Cheers,
>
> Jeff

I suspected that first, but after checking a few things I am quite sure this is not someone bypassing pgbouncer:
1. The connections were all from the pgbouncer host.
2. The id is an application id and no human has access to it. There were no other suspect applications running on the pgbouncer host when the problem happened.
3. When I found the problem and checked the connections on the pgbouncer host, those network connections didn't actually exist on the client side while they were still hanging on the pg server side.
This happened again. This time I captured the connection status (between the pgbouncer host and the pgsql host) on the postgresql side. When the problem happens, the connection status is:
ESTABLISHED: 188
CLOSE_WAIT: 116

The number of connections in CLOSE_WAIT is abnormal; in the normal situation there is usually no CLOSE_WAIT connection at all. A normal sample looks like this:
ESTABLISHED: 117
CLOSE_WAIT: 0

I have 4 users configured in pgbouncer and the pool_size is 50, so the max number of connections from pgbouncer should be less than 200. The connection spike happens very quickly. I created a script that checks the connections from pgbouncer every 5 minutes. This is the log:
10:55:01 CST pgbouncer is healthy. connection count: 73
11:00:02 CST pgbouncer is healthy. connection count: 77
11:05:01 CST pgbouncer is healthy. connection count: 118
11:10:01 CST pgbouncer is healthy. connection count: 115
11:15:01 CST pgbouncer is healthy. connection count: 75
11:20:01 CST pgbouncer is healthy. connection count: 73
11:25:02 CST pgbouncer is healthy. connection count: 75
11:30:01 CST pgbouncer is healthy. connection count: 77
11:35:01 CST pgbouncer is healthy. connection count: 84
11:40:10 CST Problematic connection count: 292, will restart pgbouncer...

Now I suspect there is some network problem between the pgbouncer and pgsql hosts. Will check more.
There was no error in the pgbouncer log, but there is a sudden drop in the request count when the problem happened:
2014-07-30 11:36:51.919 25369 LOG Stats: 2394 req/s, in 339478 b/s, out 1422425 b/s, query 3792 us
2014-07-30 11:37:51.919 25369 LOG Stats: 2207 req/s, in 314570 b/s, out 2291440 b/s, query 5344 us
2014-07-30 11:38:51.919 25369 LOG Stats: 2151 req/s, in 288565 b/s, out 1945795 b/s, query 10016 us
[=========problem happens=========]
2014-07-30 11:39:51.919 25369 LOG Stats: 1061 req/s, in 140077 b/s, out 2652730 b/s, query 515753 us
[=========pgbouncer restart=========]
2014-07-30 11:40:52.780 10640 LOG File descriptor limit: 65535 (H:65535), max_client_conn: 5500, max fds possible: 6560
2014-07-30 11:40:52.781 10640 LOG Stale pidfile, removing
2014-07-30 11:40:52.782 10642 LOG listening on 0.0.0.0:xxxx
2014-07-30 11:40:52.782 10642 WARNING Cannot listen on ::/xxxx: bind(): Address already in use
2014-07-30 11:40:52.782 10642 LOG listening on unix:/tmp/.s.PGSQL.xxxx
2014-07-30 11:40:52.782 10642 LOG process up: pgbouncer 1.5.4, libevent 1.4.13-stable (epoll), adns: libc-2.11
2014-07-30 11:41:52.781 10642 LOG Stats: 2309 req/s, in 331097 b/s, out 3806033 b/s, query 4671 us
2014-07-30 11:42:52.782 10642 LOG Stats: 2044 req/s, in 285153 b/s, out 2932543 b/s, query 4789 us
2014-07-30 11:43:52.782 10642 LOG Stats: 1969 req/s, in 282697 b/s, out 560439 b/s, query 4607 us
2014-07-30 11:44:52.782 10642 LOG Stats: 2551 req/s, in 351589 b/s, out 3223438 b/s, query 4364 us
I think I understand what happened now. I have another monitor script that runs periodically and calls pg_cancel_backend and pg_terminate_backend for those hanging update sqls. However, for some unknown reason the cancel and terminate commands don't take effect on the pgsql side for those update sqls, while pgbouncer apparently is notified by the cancel or terminate: it drops the old connections and creates new ones while the old ones keep hanging on the pgsql side. That's why the connection status shows CLOSE_WAIT and there are more processes on the pgsql side than pgbouncer is configured for.

So the root cause is still on the pgsql side: it shouldn't hang there. What the hanging processes are doing is in my previous posts. There are many identical concurrent sqls which update a partitioned table without the partition key specified in the conditions. The gdb trace shows this:

(gdb) bt
#0  0x00007f8cea310db7 in semop () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00000000005f97d3 in PGSemaphoreLock ()
#2  0x0000000000638153 in LWLockAcquire ()
#3  0x00000000004a90d0 in ginTraverseLock ()
#4  0x00000000004a9d0b in ginFindLeafPage ()
#5  0x00000000004a8377 in ginInsertItemPointers ()
#6  0x00000000004a4548 in ginEntryInsert ()
#7  0x00000000004ae687 in ginInsertCleanup ()
#8  0x00000000004af3d6 in ginHeapTupleFastInsert ()
#9  0x00000000004a4ab1 in gininsert ()
#10 0x0000000000709b15 in FunctionCall6Coll ()
#11 0x000000000047b6b7 in index_insert ()
#12 0x000000000057f475 in ExecInsertIndexTuples ()
#13 0x000000000058bf07 in ExecModifyTable ()
#14 0x00000000005766e3 in ExecProcNode ()
#15 0x0000000000575ad4 in standard_ExecutorRun ()
#16 0x000000000064718f in ProcessQuery ()
#17 0x00000000006473b7 in PortalRunMulti ()
#18 0x0000000000647e8a in PortalRun ()
#19 0x0000000000645160 in PostgresMain ()
#20 0x000000000060459e in ServerLoop ()
#21 0x00000000006053bc in PostmasterMain ()
#22 0x00000000005a686b in main ()
(gdb) q

It just hangs there forever and eventually blocks all other update sqls unless I stop pgbouncer. When this happens, all the CPUs are occupied by those hanging processes and the server load is very, very high: it stays at several hundred, compared with about 20 normally, which causes performance problems for every task on the server.
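For context, the terminate step of such a monitor script might look roughly like the query below. This is only a sketch: the 30-minute threshold and the state filter are illustrative placeholders, not the poster's actual script.

    -- Terminate backends whose current statement has been running too long.
    -- Threshold and filters are illustrative placeholders.
    SELECT pid, now() - query_start AS runtime, pg_terminate_backend(pid)
    FROM   pg_stat_activity
    WHERE  state = 'active'
    AND    pid <> pg_backend_pid()
    AND    now() - query_start > interval '30 minutes';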
Hi Tom, could my problem be a victim of the issue discussed here?
http://postgresql.1045698.n5.nabble.com/Planner-performance-extremely-affected-by-an-hanging-transaction-20-30-times-td5771686.html
Is the patch mentioned in that thread applied in 9.2.9?
On 2014/7/30 18:03, Rural Hunter wrote:
> I think I understand what happened now. I have another monitor script
> that runs periodically and calls pg_cancel_backend and pg_terminate_backend
> for those hanging update sqls. However, for some unknown reason the
> cancel and terminate commands don't take effect on the pgsql side for
> those update sqls.

With the log of the monitor & kill script, I can confirm that the CLOSE_WAIT connections are not caused by it. I logged netstat before actually doing the kill and found the CLOSE_WAIT connections were already there, so it must be something else that caused them.