Thread: Problems with pg_locks explosion

Problems with pg_locks explosion

From
Armand du Plessis
Date:
[Apologies, I first sent this to the incorrect list, postgres-admin, in the event you receive it twice]

Hi there,

I'm hoping someone on the list can shed some light on an issue I'm having with our Postgresql cluster. I'm literally tearing out my hair and don't have a deep enough understanding of Postgres to find the problem. 

What's happening is I had severe disk/io issues on our original Postgres cluster (9.0.8) and switched to a new instance with a RAID-0 volume array. The machine's CPU usage would hover around 30% and our database would run lightning fast with pg_locks hovering between 100-200. 

Within a few seconds something would trigger a massive increase in pg_locks so that it suddenly shoots up to 4000-8000. At this point everything dies. Queries that usually take a few milliseconds takes minutes and everything is unresponsive until I restart postgres. 

The instance still idles at this point. The only clue I could find was that it usually starts a few minutes after the checkpoint entries appear in my logs. 

Any suggestions would really be appreciated. It's killing our business at the moment. I can supply more info if required but pasted what I thought would be useful below. Not sure what else to change in the settings. 

Kind regards,

Armand



It's on Amazon EC2 - 
* cc2.8xlarge instance type 
* 6 volumes in RAID-0 configuration. (1000 PIOPS) 

60.5 GiB of memory
88 EC2 Compute Units (2 x Intel Xeon E5-2670, eight-core)
3370 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)
EBS-Optimized Available: No**
API name: cc2.8xlarge


postgresql.conf
fsync = off
full_page_writes = off
default_statistics_target = 100
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
effective_cache_size = 48GB
work_mem = 64MB
wal_buffers = -1
checkpoint_segments = 128
shared_buffers = 32GB
max_connections = 800
effective_io_concurrency = 3 # Down from 6

# - Background Writer -

#bgwriter_delay = 200ms                 # 10-10000ms between rounds
#bgwriter_lru_maxpages = 100            # 0-1000 max buffers written/round
#bgwriter_lru_multiplier = 2.0          # 0-10.0 multipler on buffers scanned/round


$ free
             total       used       free     shared    buffers     cached
Mem:      61368192   60988180     380012          0        784   44167172
-/+ buffers/cache:   16820224   44547968
Swap:            0          0          0

$ top -c 
top - 21:55:51 up 12 days, 12:41,  4 users,  load average: 6.03, 16.10, 24.15
top - 21:55:54 up 12 days, 12:41,  4 users,  load average: 6.03, 15.94, 24.06
Tasks: 837 total,   6 running, 831 sleeping,   0 stopped,   0 zombie
Cpu(s): 15.7%us,  1.7%sy,  0.0%ni, 81.6%id,  0.3%wa,  0.0%hi,  0.6%si,  0.0%st
Mem:  61368192k total, 54820988k used,  6547204k free,     9032k buffer

[ec2-user@ip-10-155-231-112 ~]$ sudo iostat
Linux 3.2.39-6.88.amzn1.x86_64 ()  04/01/2013  _x86_64_ (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          21.00    0.00    1.10    0.26    0.00   77.63

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
xvda              0.21         5.00         2.22    5411830    2401368
xvdk             98.32      1774.67       969.86 1919359965 1048932113
xvdj             98.28      1773.68       969.14 1918288697 1048156776
xvdi             98.29      1773.69       969.61 1918300250 1048662470
xvdh             98.24      1773.92       967.54 1918544618 1046419936
xvdg             98.27      1774.15       968.85 1918790636 1047842846
xvdf             98.32      1775.56       968.69 1920316435 1047668172
md127           733.85     10645.68      5813.70 11513598393 6287682313

What bugs me on this is the throughput percentage on the volumes in Cloudwatch is 100% on all volumes. 

The problems seem to overlap with checkpoints. 

2013-04-01 21:31:35.592 UTC,,,26877,,5159fa5f.68fd,1,,2013-04-01 21:21:35 UTC,,0,LOG,00000,"checkpoint starting: time",,,,,,,,,""
2013-04-01 21:40:35.033 UTC,,,26877,,5159fa5f.68fd,2,,2013-04-01 21:21:35 UTC,,0,LOG,00000,"checkpoint complete: wrote 100635 buffers (2.4%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=539.439 s, sync=0.000 s, total=539.441 s; sync files=0, longest=0.000 s, average=0.000 s",,,,,,,,,""
2013-04-01 21:41:35.093 UTC,,,26877,,5159fa5f.68fd,3,,2013-04-01 21:21:35 UTC,,0,LOG,00000,"checkpoint starting: time",,,,,,,,,""




Fwd: Problems with pg_locks explosion

From
Vasilis Ventirozos
Date:

Hello, i think that your system during the checkpoint pauses all clients in order to flush all data from controller's cache to the disks if i were you i'd try to tune my checkpoint parameters better, if that doesn't work, show us some vmstat output please

Vasilis Ventirozos

---------- Forwarded message ----------
From: "Armand du Plessis" <adp@bank.io>
Date: Apr 2, 2013 1:37 AM
Subject: [PERFORM] Problems with pg_locks explosion
To: "pgsql-performance" <pgsql-performance@postgresql.org>
Cc:

[Apologies, I first sent this to the incorrect list, postgres-admin, in the event you receive it twice]

Hi there,

I'm hoping someone on the list can shed some light on an issue I'm having with our Postgresql cluster. I'm literally tearing out my hair and don't have a deep enough understanding of Postgres to find the problem. 

What's happening is I had severe disk/io issues on our original Postgres cluster (9.0.8) and switched to a new instance with a RAID-0 volume array. The machine's CPU usage would hover around 30% and our database would run lightning fast with pg_locks hovering between 100-200. 

Within a few seconds something would trigger a massive increase in pg_locks so that it suddenly shoots up to 4000-8000. At this point everything dies. Queries that usually take a few milliseconds takes minutes and everything is unresponsive until I restart postgres. 

The instance still idles at this point. The only clue I could find was that it usually starts a few minutes after the checkpoint entries appear in my logs. 

Any suggestions would really be appreciated. It's killing our business at the moment. I can supply more info if required but pasted what I thought would be useful below. Not sure what else to change in the settings. 

Kind regards,

Armand



It's on Amazon EC2 - 
* cc2.8xlarge instance type 
* 6 volumes in RAID-0 configuration. (1000 PIOPS) 

60.5 GiB of memory
88 EC2 Compute Units (2 x Intel Xeon E5-2670, eight-core)
3370 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)
EBS-Optimized Available: No**
API name: cc2.8xlarge


postgresql.conf
fsync = off
full_page_writes = off
default_statistics_target = 100
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
effective_cache_size = 48GB
work_mem = 64MB
wal_buffers = -1
checkpoint_segments = 128
shared_buffers = 32GB
max_connections = 800
effective_io_concurrency = 3 # Down from 6

# - Background Writer -

#bgwriter_delay = 200ms                 # 10-10000ms between rounds
#bgwriter_lru_maxpages = 100            # 0-1000 max buffers written/round
#bgwriter_lru_multiplier = 2.0          # 0-10.0 multipler on buffers scanned/round


$ free
             total       used       free     shared    buffers     cached
Mem:      61368192   60988180     380012          0        784   44167172
-/+ buffers/cache:   16820224   44547968
Swap:            0          0          0

$ top -c 
top - 21:55:51 up 12 days, 12:41,  4 users,  load average: 6.03, 16.10, 24.15
top - 21:55:54 up 12 days, 12:41,  4 users,  load average: 6.03, 15.94, 24.06
Tasks: 837 total,   6 running, 831 sleeping,   0 stopped,   0 zombie
Cpu(s): 15.7%us,  1.7%sy,  0.0%ni, 81.6%id,  0.3%wa,  0.0%hi,  0.6%si,  0.0%st
Mem:  61368192k total, 54820988k used,  6547204k free,     9032k buffer

[ec2-user@ip-10-155-231-112 ~]$ sudo iostat
Linux 3.2.39-6.88.amzn1.x86_64 ()  04/01/2013  _x86_64_ (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          21.00    0.00    1.10    0.26    0.00   77.63

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
xvda              0.21         5.00         2.22    5411830    2401368
xvdk             98.32      1774.67       969.86 1919359965 1048932113
xvdj             98.28      1773.68       969.14 1918288697 1048156776
xvdi             98.29      1773.69       969.61 1918300250 1048662470
xvdh             98.24      1773.92       967.54 1918544618 1046419936
xvdg             98.27      1774.15       968.85 1918790636 1047842846
xvdf             98.32      1775.56       968.69 1920316435 1047668172
md127           733.85     10645.68      5813.70 11513598393 6287682313

What bugs me on this is the throughput percentage on the volumes in Cloudwatch is 100% on all volumes. 

The problems seem to overlap with checkpoints. 

2013-04-01 21:31:35.592 UTC,,,26877,,5159fa5f.68fd,1,,2013-04-01 21:21:35 UTC,,0,LOG,00000,"checkpoint starting: time",,,,,,,,,""
2013-04-01 21:40:35.033 UTC,,,26877,,5159fa5f.68fd,2,,2013-04-01 21:21:35 UTC,,0,LOG,00000,"checkpoint complete: wrote 100635 buffers (2.4%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=539.439 s, sync=0.000 s, total=539.441 s; sync files=0, longest=0.000 s, average=0.000 s",,,,,,,,,""
2013-04-01 21:41:35.093 UTC,,,26877,,5159fa5f.68fd,3,,2013-04-01 21:21:35 UTC,,0,LOG,00000,"checkpoint starting: time",,,,,,,,,""




Re: Problems with pg_locks explosion

From
Armand du Plessis
Date:
Thanks for the reply. 

I've now updated the background writer settings to:

# - Background Writer -

bgwriter_delay = 200ms                  # 10-10000ms between rounds
bgwriter_lru_maxpages = 400             # 0-1000 max buffers written/round
bgwriter_lru_multiplier = 2.0           # 0-10.0 multipler on buffers scanned/round

checkpoint_segments = 128
checkpoint_timeout = 25min  

It's still happening at the moment, this time without any checkpoint entries in the log :(

Below the output from vmstat. I'm not sure what to look for in there?

Thanks again,

Armand


$ sudo vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 485800   4224 44781700    0    0   167    91    1    0 21  1 78  0  0
 7  0      0 353920   4224 44836176    0    0  6320    54 21371 12921 11  2 87  0  0
32  0      0 352220   4232 44749544    0    0  1110     8 19414 9620  6 42 52  0  0
 3  0      0 363044   4232 44615772    0    0    59  1943 11185 3774  0 81 18  0  0
48  0      0 360076   4240 44550744    0    0     0    34 9563 5210  0 74 26  0  0
33  0      0 413708   4240 44438248    0    0    92   962 11250 8169  0 61 39  0  0
109  0      0 418080   4240 44344596    0    0   605  3490 10098 6216  1 49 50  0  0
58  0      0 425388   4240 44286528    0    0     5    10 10794 2470  1 91  8  0  0
53  0      0 435864   4240 44243000    0    0    11     0 9755 2428  0 92  8  0  0
12  0      0 440792   4248 44213164    0    0   134     5 7883 3038  0 51 49  0  0
 3  0      0 440360   4256 44158684    0    0   548   146 8450 3930  2 27 70  0  0
 2  0      0 929236   4256 44248608    0    0 10466   845 22575 14196 20  5 74  0  0
 4  0      0 859160   4256 44311828    0    0  7120    61 20890 12835 12  1 86  0  0
 4  0      0 685308   4256 44369404    0    0  6110    24 20645 12545 13  1 85  0  0
 4  0      0 695440   4256 44396304    0    0  5351  1208 19529 11781 11  1 88  0  0
 4  0      0 628276   4256 44468116    0    0  9202     0 19875 12172  9  1 89  0  0
 6  0      0 579716   4256 44503848    0    0  3799    22 19223 11772 10  1 88  0  0
 3  1      0 502948   4256 44539784    0    0  3721  6700 20620 11939 13  1 85  0  0
 4  0      0 414120   4256 44583456    0    0  3860   856 19801 12092 10  1 89  0  0
 6  0      0 349240   4256 44642880    0    0  6122    48 19834 11933 11  2 87  0  0
 3  0      0 400536   4256 44535872    0    0  6287     5 18945 11461 10  1 89  0  0
 3  0      0 364256   4256 44592412    0    0  5487  2018 20145 12344 11  1 87  0  0
 7  0      0 343732   4256 44598784    0    0  4209    24 19099 11482 10  1 88  0  0
 6  0      0 339608   4236 44576768    0    0  6805   151 18821 11333  9  2 89  0  0
 9  1      0 339364   4236 44556884    0    0  2597  4339 19205 11918 11  3 85  0  0
24  0      0 341596   4236 44480368    0    0  6165  5309 19353 11562 11  4 84  1  0
30  0      0 359044   4236 44416452    0    0  1364     6 12638 6138  5 28 67  0  0
 4  0      0 436468   4224 44326500    0    0  3704  1264 11346 7545  4 27 68  0  0
 3  1      0 459736   4224 44384788    0    0  6541     8 20159 12097 11  1 88  0  0
 8  1      0 347812   4224 44462100    0    0 12292  2860 20851 12377  9  1 89  1  0
 1  0      0 379752   4224 44402396    0    0  5849   147 20171 12253 11  1 88  0  0
 4  0      0 453692   4216 44243480    0    0  6546   269 20689 13028 12  2 86  0  0
 8  0      0 390160   4216 44259768    0    0  4243     0 20476 21238  6 16 78  0  0
 6  0      0 344504   4216 44336264    0    0  7214     2 20919 12625 11  1 87  0  0
 4  0      0 350128   4200 44324976    0    0 10726  2173 20417 12351 10  1 88  0  0
 2  1      0 362300   4200 44282484    0    0  7148   714 22469 14468 12  2 86  0  0
 3  0      0 366252   4184 44311680    0    0  7617   133 20487 12364  9  1 90  0  0
 6  0      0 368904   4184 44248152    0    0  5162     6 22910 15221 14  7 80  0  0
 2  0      0 383108   4184 44276780    0    0  5846  1120 21109 12563 11  1 88  0  0
 7  0      0 338348   4184 44274472    0    0  9270     5 21243 12698 10  1 88  0  0
24  0      0 339676   4184 44213036    0    0  6639    18 22976 12700 13 12 74  0  0
12  0      0 371848   4184 44146500    0    0   657   133 18968 7445  5 53 43  0  0
37  0      0 374516   4184 44076212    0    0    16     2 9156 4472  1 48 52  0  0
16  0      0 398412   4184 43971060    0    0   127     0 9967 6018  0 48 52  0  0
 4  0      0 417312   4184 44084392    0    0 17434  1072 23661 14268 16  6 78  1  0
 4  0      0 407672   4184 44139896    0    0  5785     0 19779 11869 11  1 88  0  0
 9  0      0 349544   4184 44051596    0    0  6899     8 20376 12774 10  3 88  0  0
 5  0      0 424628   4184 44059628    0    0  9105   175 24546 15354 13 20 66  1  0
 2  0      0 377164   4184 44070564    0    0  9363     3 21191 12608 11  2 87  0  0
 5  0      0 353360   4184 44040804    0    0  6661     0 20931 12815 12  2 85  0  0
 4  0      0 355144   4180 44034620    0    0  7061     8 21264 12379 11  3 86  0  0
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
21  0      0 358396   4180 43958420    0    0  7595  1749 23258 12299 10 27 63  0  0
 6  1      0 437480   4160 43922152    0    0 17565    14 17059 14928  6 18 74  2  0
 6  0      0 380304   4160 43993932    0    0 10120   168 21519 12798 11  2 87  0  0
 8  0      0 337740   4160 44007432    0    0  6033   520 20872 12461 11  1 88  0  0
13  0      0 349712   4132 43927784    0    0  6777     6 20919 12568 11  2 86  0  0
 6  1      0 351180   4112 43899756    0    0  8640     0 22543 12519 11 10 78  0  0
 6  0      0 356392   4112 43921532    0    0  6206    48 20383 12050 12  1 86  0  0
 6  0      0 355552   4108 43863448    0    0  6106     3 21244 11817  9  9 82  0  0
 3  0      0 364992   7312 43856824    0    0 11283   199 21296 12638 13  2 85  0  0
 4  1      0 371968   7120 43818552    0    0  6715  1534 22322 13305 11  7 81  0  0
debug2: channel 0: window 999365 sent adjust 49211
12  0      0 338540   7120 43822256    0    0  9142     3 21520 12194 13  5 82  0  0
 8  0      0 386016   7112 43717136    0    0  2123     3 20465 11466  8 20 72  0  0
 8  0      0 352388   7112 43715872    0    0 10366    51 25758 13879 16 19 65  0  0
20  0      0 351472   7112 43701060    0    0 13091    10 23766 12832 11 11 77  1  0
 2  0      0 386820   7112 43587520    0    0   482   210 17187 6773  3 69 28  0  0
64  0      0 401956   7112 43473728    0    0     0     5 10796 9487  0 55 44  0  0


On Tue, Apr 2, 2013 at 12:56 AM, Vasilis Ventirozos <v.ventirozos@gmail.com> wrote:

Hello, i think that your system during the checkpoint pauses all clients in order to flush all data from controller's cache to the disks if i were you i'd try to tune my checkpoint parameters better, if that doesn't work, show us some vmstat output please

Vasilis Ventirozos

---------- Forwarded message ----------
From: "Armand du Plessis" <adp@bank.io>
Date: Apr 2, 2013 1:37 AM
Subject: [PERFORM] Problems with pg_locks explosion
To: "pgsql-performance" <pgsql-performance@postgresql.org>
Cc:

[Apologies, I first sent this to the incorrect list, postgres-admin, in the event you receive it twice]

Hi there,

I'm hoping someone on the list can shed some light on an issue I'm having with our Postgresql cluster. I'm literally tearing out my hair and don't have a deep enough understanding of Postgres to find the problem. 

What's happening is I had severe disk/io issues on our original Postgres cluster (9.0.8) and switched to a new instance with a RAID-0 volume array. The machine's CPU usage would hover around 30% and our database would run lightning fast with pg_locks hovering between 100-200. 

Within a few seconds something would trigger a massive increase in pg_locks so that it suddenly shoots up to 4000-8000. At this point everything dies. Queries that usually take a few milliseconds takes minutes and everything is unresponsive until I restart postgres. 

The instance still idles at this point. The only clue I could find was that it usually starts a few minutes after the checkpoint entries appear in my logs. 

Any suggestions would really be appreciated. It's killing our business at the moment. I can supply more info if required but pasted what I thought would be useful below. Not sure what else to change in the settings. 

Kind regards,

Armand



It's on Amazon EC2 - 
* cc2.8xlarge instance type 
* 6 volumes in RAID-0 configuration. (1000 PIOPS) 

60.5 GiB of memory
88 EC2 Compute Units (2 x Intel Xeon E5-2670, eight-core)
3370 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)
EBS-Optimized Available: No**
API name: cc2.8xlarge


postgresql.conf
fsync = off
full_page_writes = off
default_statistics_target = 100
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
effective_cache_size = 48GB
work_mem = 64MB
wal_buffers = -1
checkpoint_segments = 128
shared_buffers = 32GB
max_connections = 800
effective_io_concurrency = 3 # Down from 6

# - Background Writer -

#bgwriter_delay = 200ms                 # 10-10000ms between rounds
#bgwriter_lru_maxpages = 100            # 0-1000 max buffers written/round
#bgwriter_lru_multiplier = 2.0          # 0-10.0 multipler on buffers scanned/round


$ free
             total       used       free     shared    buffers     cached
Mem:      61368192   60988180     380012          0        784   44167172
-/+ buffers/cache:   16820224   44547968
Swap:            0          0          0

$ top -c 
top - 21:55:51 up 12 days, 12:41,  4 users,  load average: 6.03, 16.10, 24.15
top - 21:55:54 up 12 days, 12:41,  4 users,  load average: 6.03, 15.94, 24.06
Tasks: 837 total,   6 running, 831 sleeping,   0 stopped,   0 zombie
Cpu(s): 15.7%us,  1.7%sy,  0.0%ni, 81.6%id,  0.3%wa,  0.0%hi,  0.6%si,  0.0%st
Mem:  61368192k total, 54820988k used,  6547204k free,     9032k buffer

[ec2-user@ip-10-155-231-112 ~]$ sudo iostat
Linux 3.2.39-6.88.amzn1.x86_64 ()  04/01/2013  _x86_64_ (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          21.00    0.00    1.10    0.26    0.00   77.63

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
xvda              0.21         5.00         2.22    5411830    2401368
xvdk             98.32      1774.67       969.86 1919359965 1048932113
xvdj             98.28      1773.68       969.14 1918288697 1048156776
xvdi             98.29      1773.69       969.61 1918300250 1048662470
xvdh             98.24      1773.92       967.54 1918544618 1046419936
xvdg             98.27      1774.15       968.85 1918790636 1047842846
xvdf             98.32      1775.56       968.69 1920316435 1047668172
md127           733.85     10645.68      5813.70 11513598393 6287682313

What bugs me on this is the throughput percentage on the volumes in Cloudwatch is 100% on all volumes. 

The problems seem to overlap with checkpoints. 

2013-04-01 21:31:35.592 UTC,,,26877,,5159fa5f.68fd,1,,2013-04-01 21:21:35 UTC,,0,LOG,00000,"checkpoint starting: time",,,,,,,,,""
2013-04-01 21:40:35.033 UTC,,,26877,,5159fa5f.68fd,2,,2013-04-01 21:21:35 UTC,,0,LOG,00000,"checkpoint complete: wrote 100635 buffers (2.4%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=539.439 s, sync=0.000 s, total=539.441 s; sync files=0, longest=0.000 s, average=0.000 s",,,,,,,,,""
2013-04-01 21:41:35.093 UTC,,,26877,,5159fa5f.68fd,3,,2013-04-01 21:21:35 UTC,,0,LOG,00000,"checkpoint starting: time",,,,,,,,,""





Re: Problems with pg_locks explosion

From
Armand du Plessis
Date:
Apologies, the checkpoint log entry was a few seconds after I sent this email. Now pg_locks on 7000. 

And vmstat:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 7  0      0 351212   7032 37412872    0    0   167    91    0    1 21  1 78  0  0
 4  0      0 350600   7040 37374464    0    0  6077   332 12889 6634  7  3 90  0  0
40  0      0 346244   7040 37310604    0    0  3687  2638 16355 5517  7 31 61  0  0
27  0      0 385620   7040 37206560    0    0    69  1587 14483 4108  3 75 22  0  0


On Tue, Apr 2, 2013 at 1:09 AM, Armand du Plessis <adp@bank.io> wrote:
Thanks for the reply. 

I've now updated the background writer settings to:

# - Background Writer -

bgwriter_delay = 200ms                  # 10-10000ms between rounds
bgwriter_lru_maxpages = 400             # 0-1000 max buffers written/round
bgwriter_lru_multiplier = 2.0           # 0-10.0 multipler on buffers scanned/round

checkpoint_segments = 128
checkpoint_timeout = 25min  

It's still happening at the moment, this time without any checkpoint entries in the log :(

Below the output from vmstat. I'm not sure what to look for in there?

Thanks again,

Armand


$ sudo vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 485800   4224 44781700    0    0   167    91    1    0 21  1 78  0  0
 7  0      0 353920   4224 44836176    0    0  6320    54 21371 12921 11  2 87  0  0
32  0      0 352220   4232 44749544    0    0  1110     8 19414 9620  6 42 52  0  0
 3  0      0 363044   4232 44615772    0    0    59  1943 11185 3774  0 81 18  0  0
48  0      0 360076   4240 44550744    0    0     0    34 9563 5210  0 74 26  0  0
33  0      0 413708   4240 44438248    0    0    92   962 11250 8169  0 61 39  0  0
109  0      0 418080   4240 44344596    0    0   605  3490 10098 6216  1 49 50  0  0
58  0      0 425388   4240 44286528    0    0     5    10 10794 2470  1 91  8  0  0
53  0      0 435864   4240 44243000    0    0    11     0 9755 2428  0 92  8  0  0
12  0      0 440792   4248 44213164    0    0   134     5 7883 3038  0 51 49  0  0
 3  0      0 440360   4256 44158684    0    0   548   146 8450 3930  2 27 70  0  0
 2  0      0 929236   4256 44248608    0    0 10466   845 22575 14196 20  5 74  0  0
 4  0      0 859160   4256 44311828    0    0  7120    61 20890 12835 12  1 86  0  0
 4  0      0 685308   4256 44369404    0    0  6110    24 20645 12545 13  1 85  0  0
 4  0      0 695440   4256 44396304    0    0  5351  1208 19529 11781 11  1 88  0  0
 4  0      0 628276   4256 44468116    0    0  9202     0 19875 12172  9  1 89  0  0
 6  0      0 579716   4256 44503848    0    0  3799    22 19223 11772 10  1 88  0  0
 3  1      0 502948   4256 44539784    0    0  3721  6700 20620 11939 13  1 85  0  0
 4  0      0 414120   4256 44583456    0    0  3860   856 19801 12092 10  1 89  0  0
 6  0      0 349240   4256 44642880    0    0  6122    48 19834 11933 11  2 87  0  0
 3  0      0 400536   4256 44535872    0    0  6287     5 18945 11461 10  1 89  0  0
 3  0      0 364256   4256 44592412    0    0  5487  2018 20145 12344 11  1 87  0  0
 7  0      0 343732   4256 44598784    0    0  4209    24 19099 11482 10  1 88  0  0
 6  0      0 339608   4236 44576768    0    0  6805   151 18821 11333  9  2 89  0  0
 9  1      0 339364   4236 44556884    0    0  2597  4339 19205 11918 11  3 85  0  0
24  0      0 341596   4236 44480368    0    0  6165  5309 19353 11562 11  4 84  1  0
30  0      0 359044   4236 44416452    0    0  1364     6 12638 6138  5 28 67  0  0
 4  0      0 436468   4224 44326500    0    0  3704  1264 11346 7545  4 27 68  0  0
 3  1      0 459736   4224 44384788    0    0  6541     8 20159 12097 11  1 88  0  0
 8  1      0 347812   4224 44462100    0    0 12292  2860 20851 12377  9  1 89  1  0
 1  0      0 379752   4224 44402396    0    0  5849   147 20171 12253 11  1 88  0  0
 4  0      0 453692   4216 44243480    0    0  6546   269 20689 13028 12  2 86  0  0
 8  0      0 390160   4216 44259768    0    0  4243     0 20476 21238  6 16 78  0  0
 6  0      0 344504   4216 44336264    0    0  7214     2 20919 12625 11  1 87  0  0
 4  0      0 350128   4200 44324976    0    0 10726  2173 20417 12351 10  1 88  0  0
 2  1      0 362300   4200 44282484    0    0  7148   714 22469 14468 12  2 86  0  0
 3  0      0 366252   4184 44311680    0    0  7617   133 20487 12364  9  1 90  0  0
 6  0      0 368904   4184 44248152    0    0  5162     6 22910 15221 14  7 80  0  0
 2  0      0 383108   4184 44276780    0    0  5846  1120 21109 12563 11  1 88  0  0
 7  0      0 338348   4184 44274472    0    0  9270     5 21243 12698 10  1 88  0  0
24  0      0 339676   4184 44213036    0    0  6639    18 22976 12700 13 12 74  0  0
12  0      0 371848   4184 44146500    0    0   657   133 18968 7445  5 53 43  0  0
37  0      0 374516   4184 44076212    0    0    16     2 9156 4472  1 48 52  0  0
16  0      0 398412   4184 43971060    0    0   127     0 9967 6018  0 48 52  0  0
 4  0      0 417312   4184 44084392    0    0 17434  1072 23661 14268 16  6 78  1  0
 4  0      0 407672   4184 44139896    0    0  5785     0 19779 11869 11  1 88  0  0
 9  0      0 349544   4184 44051596    0    0  6899     8 20376 12774 10  3 88  0  0
 5  0      0 424628   4184 44059628    0    0  9105   175 24546 15354 13 20 66  1  0
 2  0      0 377164   4184 44070564    0    0  9363     3 21191 12608 11  2 87  0  0
 5  0      0 353360   4184 44040804    0    0  6661     0 20931 12815 12  2 85  0  0
 4  0      0 355144   4180 44034620    0    0  7061     8 21264 12379 11  3 86  0  0
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
21  0      0 358396   4180 43958420    0    0  7595  1749 23258 12299 10 27 63  0  0
 6  1      0 437480   4160 43922152    0    0 17565    14 17059 14928  6 18 74  2  0
 6  0      0 380304   4160 43993932    0    0 10120   168 21519 12798 11  2 87  0  0
 8  0      0 337740   4160 44007432    0    0  6033   520 20872 12461 11  1 88  0  0
13  0      0 349712   4132 43927784    0    0  6777     6 20919 12568 11  2 86  0  0
 6  1      0 351180   4112 43899756    0    0  8640     0 22543 12519 11 10 78  0  0
 6  0      0 356392   4112 43921532    0    0  6206    48 20383 12050 12  1 86  0  0
 6  0      0 355552   4108 43863448    0    0  6106     3 21244 11817  9  9 82  0  0
 3  0      0 364992   7312 43856824    0    0 11283   199 21296 12638 13  2 85  0  0
 4  1      0 371968   7120 43818552    0    0  6715  1534 22322 13305 11  7 81  0  0
debug2: channel 0: window 999365 sent adjust 49211
12  0      0 338540   7120 43822256    0    0  9142     3 21520 12194 13  5 82  0  0
 8  0      0 386016   7112 43717136    0    0  2123     3 20465 11466  8 20 72  0  0
 8  0      0 352388   7112 43715872    0    0 10366    51 25758 13879 16 19 65  0  0
20  0      0 351472   7112 43701060    0    0 13091    10 23766 12832 11 11 77  1  0
 2  0      0 386820   7112 43587520    0    0   482   210 17187 6773  3 69 28  0  0
64  0      0 401956   7112 43473728    0    0     0     5 10796 9487  0 55 44  0  0


On Tue, Apr 2, 2013 at 12:56 AM, Vasilis Ventirozos <v.ventirozos@gmail.com> wrote:

Hello, i think that your system during the checkpoint pauses all clients in order to flush all data from controller's cache to the disks if i were you i'd try to tune my checkpoint parameters better, if that doesn't work, show us some vmstat output please

Vasilis Ventirozos

---------- Forwarded message ----------
From: "Armand du Plessis" <adp@bank.io>
Date: Apr 2, 2013 1:37 AM
Subject: [PERFORM] Problems with pg_locks explosion
To: "pgsql-performance" <pgsql-performance@postgresql.org>
Cc:

[Apologies, I first sent this to the incorrect list, postgres-admin, in the event you receive it twice]

Hi there,

I'm hoping someone on the list can shed some light on an issue I'm having with our Postgresql cluster. I'm literally tearing out my hair and don't have a deep enough understanding of Postgres to find the problem. 

What's happening is I had severe disk/io issues on our original Postgres cluster (9.0.8) and switched to a new instance with a RAID-0 volume array. The machine's CPU usage would hover around 30% and our database would run lightning fast with pg_locks hovering between 100-200. 

Within a few seconds something would trigger a massive increase in pg_locks so that it suddenly shoots up to 4000-8000. At this point everything dies. Queries that usually take a few milliseconds takes minutes and everything is unresponsive until I restart postgres. 

The instance still idles at this point. The only clue I could find was that it usually starts a few minutes after the checkpoint entries appear in my logs. 

Any suggestions would really be appreciated. It's killing our business at the moment. I can supply more info if required but pasted what I thought would be useful below. Not sure what else to change in the settings. 

Kind regards,

Armand



It's on Amazon EC2 - 
* cc2.8xlarge instance type 
* 6 volumes in RAID-0 configuration. (1000 PIOPS) 

60.5 GiB of memory
88 EC2 Compute Units (2 x Intel Xeon E5-2670, eight-core)
3370 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)
EBS-Optimized Available: No**
API name: cc2.8xlarge


postgresql.conf
fsync = off
full_page_writes = off
default_statistics_target = 100
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
effective_cache_size = 48GB
work_mem = 64MB
wal_buffers = -1
checkpoint_segments = 128
shared_buffers = 32GB
max_connections = 800
effective_io_concurrency = 3 # Down from 6

# - Background Writer -

#bgwriter_delay = 200ms                 # 10-10000ms between rounds
#bgwriter_lru_maxpages = 100            # 0-1000 max buffers written/round
#bgwriter_lru_multiplier = 2.0          # 0-10.0 multipler on buffers scanned/round


$ free
             total       used       free     shared    buffers     cached
Mem:      61368192   60988180     380012          0        784   44167172
-/+ buffers/cache:   16820224   44547968
Swap:            0          0          0

$ top -c 
top - 21:55:51 up 12 days, 12:41,  4 users,  load average: 6.03, 16.10, 24.15
top - 21:55:54 up 12 days, 12:41,  4 users,  load average: 6.03, 15.94, 24.06
Tasks: 837 total,   6 running, 831 sleeping,   0 stopped,   0 zombie
Cpu(s): 15.7%us,  1.7%sy,  0.0%ni, 81.6%id,  0.3%wa,  0.0%hi,  0.6%si,  0.0%st
Mem:  61368192k total, 54820988k used,  6547204k free,     9032k buffer

[ec2-user@ip-10-155-231-112 ~]$ sudo iostat
Linux 3.2.39-6.88.amzn1.x86_64 ()  04/01/2013  _x86_64_ (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          21.00    0.00    1.10    0.26    0.00   77.63

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
xvda              0.21         5.00         2.22    5411830    2401368
xvdk             98.32      1774.67       969.86 1919359965 1048932113
xvdj             98.28      1773.68       969.14 1918288697 1048156776
xvdi             98.29      1773.69       969.61 1918300250 1048662470
xvdh             98.24      1773.92       967.54 1918544618 1046419936
xvdg             98.27      1774.15       968.85 1918790636 1047842846
xvdf             98.32      1775.56       968.69 1920316435 1047668172
md127           733.85     10645.68      5813.70 11513598393 6287682313

What bugs me on this is the throughput percentage on the volumes in Cloudwatch is 100% on all volumes. 

The problems seem to overlap with checkpoints. 

2013-04-01 21:31:35.592 UTC,,,26877,,5159fa5f.68fd,1,,2013-04-01 21:21:35 UTC,,0,LOG,00000,"checkpoint starting: time",,,,,,,,,""
2013-04-01 21:40:35.033 UTC,,,26877,,5159fa5f.68fd,2,,2013-04-01 21:21:35 UTC,,0,LOG,00000,"checkpoint complete: wrote 100635 buffers (2.4%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=539.439 s, sync=0.000 s, total=539.441 s; sync files=0, longest=0.000 s, average=0.000 s",,,,,,,,,""
2013-04-01 21:41:35.093 UTC,,,26877,,5159fa5f.68fd,3,,2013-04-01 21:21:35 UTC,,0,LOG,00000,"checkpoint starting: time",,,,,,,,,""






Re: Problems with pg_locks explosion

From
Vasilis Ventirozos
Date:
Try these SQL statements , they will give you more information about whats happening in your server lock-wise

SELECT
locktype, virtualtransaction,transactionid,nspname,relname,mode,granted,
cast(date_trunc('second',query_start) AS timestamp) AS query_start,
substr(current_query,1,25) AS query
FROM pg_locks
LEFT OUTER JOIN pg_class ON (pg_locks.relation = pg_class.oid)
LEFT OUTER JOIN pg_namespace ON (pg_namespace.oid = pg_class.
relnamespace), pg_stat_activity
WHERE
NOT pg_locks.pid=pg_backend_pid() AND pg_locks.pid=pg_stat_activity.procpid;


SELECT
locked.pid AS locked_pid, locker.pid AS locker_pid, locked_act.usename AS locked_user, locker_act.usename AS locker_user,
locked.virtualtransaction, locked.transactionid, locked.locktype
FROM
pg_locks locked, pg_locks locker, pg_stat_activity locked_act, pg_stat_activity locker_act
WHERE
locker.granted=true AND  locked.granted=false AND locked.pid=locked_act.procpid AND
locker.pid=locker_act.procpid AND (locked.virtualtransaction=locker.virtualtransaction OR locked.transactionid=locker.transactionid);

SELECT
locked.pid AS locked_pid, locker.pid AS locker_pid, locked_act.usename AS locked_user, locker_act.usename AS locker_user,
locked.virtualtransaction, locked.transactionid, relname
FROM
pg_locks locked
LEFT OUTER JOIN pg_class ON (locked.relation = pg_class.oid), pg_locks locker,pg_stat_activity locked_act, pg_stat_activity locker_act
WHERE
locker.granted=true AND locked.granted=false AND locked.pid=locked_act.procpid AND locker.pid=locker_act.procpid AND locked.relation=locker.relation;

Vasilis Ventirozos

Re: Problems with pg_locks explosion

From
Mark Kirkwood
Date:
In addition to tuning the various Postgres config knobs you may need to
look at how your AWS server is set up. If your load is causing an IO
stall then *symptoms* of this will be lots of locks...

You have quite a lot of memory (60G), so look at tuning the
vm.dirty_background_ratio, vm.dirty_ratio sysctls to avoid trying to
*suddenly* write out many gigs of dirty buffers.

Your provisioned volumes are much better than the default AWS ones, but
are still not hugely fast (i.e 1000 IOPS is about 8 MB/s worth of
Postgres 8k buffers). So you may need to look at adding more volumes
into the array, or adding some separate ones and putting pg_xlog
directory on 'em.

However before making changes I would recommend using iostat or sar to
monitor how volumes are handling the load (I usually choose a 1 sec
granularity and look for 100% util and high - server hundred ms -
awaits). Also iotop could be enlightening.

Regards

Mark

On 02/04/13 11:35, Armand du Plessis wrote:
>
> It's on Amazon EC2 -
> * cc2.8xlarge instance type
> * 6 volumes in RAID-0 configuration. (1000 PIOPS)
>
> 60.5 GiB of memory
> 88 EC2 Compute Units (2 x Intel Xeon E5-2670, eight-core)
> 3370 GB of instance storage
> 64-bit platform
> I/O Performance: Very High (10 Gigabit Ethernet)
> EBS-Optimized Available: No**
> API name: cc2.8xlarge
>



Re: Problems with pg_locks explosion

From
Jeff Janes
Date:
On Mon, Apr 1, 2013 at 3:35 PM, Armand du Plessis <adp@bank.io> wrote:
[Apologies, I first sent this to the incorrect list, postgres-admin, in the event you receive it twice]

Hi there,

I'm hoping someone on the list can shed some light on an issue I'm having with our Postgresql cluster. I'm literally tearing out my hair and don't have a deep enough understanding of Postgres to find the problem. 

What's happening is I had severe disk/io issues on our original Postgres cluster (9.0.8)
and switched to a new instance with a RAID-0 volume array.

What was the old instance IO?  Did you do IO benchmarking on both?
 
The machine's CPU usage would hover around 30% and our database would run lightning fast with pg_locks hovering between 100-200. 

Within a few seconds something would trigger a massive increase in pg_locks so that it suddenly shoots up to 4000-8000. At this point everything dies. Queries that usually take a few milliseconds takes minutes and everything is unresponsive until I restart postgres. 

I think that pg_locks is pretty much a red herring.  All it means is that you have a lot more active connections than you used to.   All active connections are going to hold various locks, while most idle connections (other than 'idle in transaction') connections will not hold any.

Although I doubt it will solve this particular problem, you should probably use a connection pooler.



shared_buffers = 32GB

That seems very high.  There are reports that using >8 GB leads to precisely the type of problem you are seeing (checkpoint associated freezes).  Although I've never seen those reports when fsync=off.  

I thought you might be suffering from the problem solved in release 9.1 by item "Merge duplicate fsync requests (Robert Haas, Greg Smith)", but then I realized that with fsync=off it could not be that.

 

max_connections = 800

That also is very high.


The problems seem to overlap with checkpoints. 

2013-04-01 21:31:35.592 UTC,,,26877,,5159fa5f.68fd,1,,2013-04-01 21:21:35 UTC,,0,LOG,00000,"checkpoint starting: time",,,,,,,,,""
2013-04-01 21:40:35.033 UTC,,,26877,,5159fa5f.68fd,2,,2013-04-01 21:21:35 UTC,,0,LOG,00000,"checkpoint complete: wrote 100635 buffers (2.4%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=539.439 s, sync=0.000 s, total=539.441 s; sync files=0, longest=0.000 s, average=0.000 s",,,,,,,,,""
2013-04-01 21:41:35.093 UTC,,,26877,,5159fa5f.68fd,3,,2013-04-01 21:21:35 UTC,,0,LOG,00000,"checkpoint starting: time",,,,,,,,,""


I think you changed checkpoint_timout from default (5 min) to 10 minutes, without telling us.  Anyway, this is where it would be nice to know how much of the 539.439 s in the write phase was spent blocking on writes, and how much was spent napping.  But that info is not collected by pgsql.

Your top output looked for it was a time at which there were no problems, and it didn't include the top processes, so it wasn't very informative.
 
If you could upgrade to 9.2 and capture some data with track_io_timing, that could be useful.

Cheers,

Jeff

Fwd: Problems with pg_locks explosion

From
Armand du Plessis
Date:
Thanks Mark, 

I had a look at the iostat output (on a 5s interval) and pasted it below. The utilization and waits seems low. Included a sample below, #1 taken during normal operation and then when the locks happen it basically drops to 0 across the board. My (mis)understanding of the IOPS was that it would be 1000 IOPS per/volume and when in RAID0 should give me quite a bit higher throughput than in a single EBS volume setup. (My naive envelop calculation was #volumes * PIOPS = Effective IOPS :/)


Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
xvdk              0.00     0.00  141.60    0.00  5084.80     0.00    35.91     0.43    3.06   0.51   7.28
xvdj              0.00     0.00  140.40    0.40  4614.40    24.00    32.94     0.49    3.45   0.52   7.28
xvdi              0.00     0.00  123.00    2.00  4019.20   163.20    33.46     0.33    2.63   0.68   8.48
xvdh              0.00     0.00  139.80    0.80  4787.20    67.20    34.53     0.52    3.73   0.55   7.68
xvdg              0.00     0.00  143.80    0.20  4804.80    16.00    33.48     0.86    6.03   0.72  10.40
xvdf              0.00     0.00  146.40    0.00  4758.40     0.00    32.50     0.55    3.76   0.55   8.00
md127             0.00     0.00  831.20    3.40 27867.20   270.40    33.71     0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00  100.00    0.00    0.00    0.00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
xvda              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
xvdk              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
xvdj              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
xvdi              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
xvdh              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
xvdg              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
xvdf              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md127             0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

It only spikes to 100% util when the server restarts. What bugs me though is Cloud Metrics show 100% Throughput on all the volumes despite the output above. 

I'm looking into  vm.dirty_background_ratio, vm.dirty_ratio sysctls. Is there any guidance or links available that would be useful as a starting point? 

Thanks again for the help, I really appreciate it. 

Regards,

Armand

On Tue, Apr 2, 2013 at 2:11 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
In addition to tuning the various Postgres config knobs you may need to look at how your AWS server is set up. If your load is causing an IO stall then *symptoms* of this will be lots of locks...

You have quite a lot of memory (60G), so look at tuning the vm.dirty_background_ratio, vm.dirty_ratio sysctls to avoid trying to *suddenly* write out many gigs of dirty buffers.

Your provisioned volumes are much better than the default AWS ones, but are still not hugely fast (i.e 1000 IOPS is about 8 MB/s worth of Postgres 8k buffers). So you may need to look at adding more volumes into the array, or adding some separate ones and putting pg_xlog directory on 'em.

However before making changes I would recommend using iostat or sar to monitor how volumes are handling the load (I usually choose a 1 sec granularity and look for 100% util and high - server hundred ms - awaits). Also iotop could be enlightening.

Regards

Mark

Re: Problems with pg_locks explosion

From
Armand du Plessis
Date:
Hi Jeff, 

Sorry I should've mentioned the new instance is Postgres 9.2.3. The old instance IO maxed out the disk/io available on a single EBS volume on AWS. It had 2000 PIOPS but was constantly bottlenecked. I assumed that striping 6 1000 IOPS volumes in RAID-0 would give me some breathing space on that front, and looking at the iostat (just included in previous email) it seems to be doing OK. 

I actually had pg_pool running as a test but to avoid having too many moving parts in the change removed it from the equation. Need to look into the proper configuration so it doesn't saturate my cluster worse than I'm doing myself.

I've commented inline. 

Regards,

Armand

PS. This is probably the most helpful mailing list I've ever come across. Starting to feel a little more that it can be solved. 


On Tue, Apr 2, 2013 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Mon, Apr 1, 2013 at 3:35 PM, Armand du Plessis <adp@bank.io> wrote:
[Apologies, I first sent this to the incorrect list, postgres-admin, in the event you receive it twice]

Hi there,

I'm hoping someone on the list can shed some light on an issue I'm having with our Postgresql cluster. I'm literally tearing out my hair and don't have a deep enough understanding of Postgres to find the problem. 

What's happening is I had severe disk/io issues on our original Postgres cluster (9.0.8)
and switched to a new instance with a RAID-0 volume array.

What was the old instance IO?  Did you do IO benchmarking on both?
 
The machine's CPU usage would hover around 30% and our database would run lightning fast with pg_locks hovering between 100-200. 

Within a few seconds something would trigger a massive increase in pg_locks so that it suddenly shoots up to 4000-8000. At this point everything dies. Queries that usually take a few milliseconds takes minutes and everything is unresponsive until I restart postgres. 

I think that pg_locks is pretty much a red herring.  All it means is that you have a lot more active connections than you used to.   All active connections are going to hold various locks, while most idle connections (other than 'idle in transaction') connections will not hold any.

Although I doubt it will solve this particular problem, you should probably use a connection pooler.



shared_buffers = 32GB

That seems very high.  There are reports that using >8 GB leads to precisely the type of problem you are seeing (checkpoint associated freezes).  Although I've never seen those reports when fsync=off.  

I thought you might be suffering from the problem solved in release 9.1 by item "Merge duplicate fsync requests (Robert Haas, Greg Smith)", but then I realized that with fsync=off it could not be that.

 

max_connections = 800

That also is very high.


The problems seem to overlap with checkpoints. 

2013-04-01 21:31:35.592 UTC,,,26877,,5159fa5f.68fd,1,,2013-04-01 21:21:35 UTC,,0,LOG,00000,"checkpoint starting: time",,,,,,,,,""
2013-04-01 21:40:35.033 UTC,,,26877,,5159fa5f.68fd,2,,2013-04-01 21:21:35 UTC,,0,LOG,00000,"checkpoint complete: wrote 100635 buffers (2.4%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=539.439 s, sync=0.000 s, total=539.441 s; sync files=0, longest=0.000 s, average=0.000 s",,,,,,,,,""
2013-04-01 21:41:35.093 UTC,,,26877,,5159fa5f.68fd,3,,2013-04-01 21:21:35 UTC,,0,LOG,00000,"checkpoint starting: time",,,,,,,,,""


I think you changed checkpoint_timout from default (5 min) to 10 minutes, without telling us.  Anyway, this is where it would be nice to know how much of the 539.439 s in the write phase was spent blocking on writes, and how much was spent napping.  But that info is not collected by pgsql.

I did actually change it to 25 minutes. Apologies it was probably lost in the text of a previous email. Here's the changed settings:

# - Background Writer -

bgwriter_delay = 200ms                  # 10-10000ms between rounds
bgwriter_lru_maxpages = 400             # 0-1000 max buffers written/round
bgwriter_lru_multiplier = 2.0           # 0-10.0 multipler on buffers scanned/round

checkpoint_segments = 128
checkpoint_timeout = 25min  

It seems to be lasting longer with these settings. 
 

Your top output looked for it was a time at which there were no problems, and it didn't include the top processes, so it wasn't very informative.
 
If you could upgrade to 9.2 and capture some data with track_io_timing, that could be useful.

I'm looking into track_io_timing.
 

Cheers,

Jeff

Re: Problems with pg_locks explosion

From
Armand du Plessis
Date:
I've run an EXPLAIN ANALYZE on one of the queries that appeared in the pg_locks (although like you say that might be a red herring) both during normal response times (2) and also after the locks backlog materialized (1) 

The output below, I've just blanked out some columns. The IO timings do seem an order of magnitude slower but not excessive unless I'm reading it wrong. 

"Limit  (cost=2364.19..2365.44 rows=500 width=177) (actual time=6501.103..6507.196 rows=500 loops=1)"
"  Output: 
"  Buffers: shared hit=7163 read=137"
"  I/O Timings: read=107.771"
"  ->  Sort  (cost=2364.19..2365.56 rows=549 width=177) (actual time=6501.095..6503.216 rows=500 loops=1)"
"        Output: 
"        Sort Key: messages.created_at"
"        Sort Method: quicksort  Memory: 294kB"
"        Buffers: shared hit=7163 read=137"
"        I/O Timings: read=107.771"
"        ->  Nested Loop  (cost=181.19..2339.21 rows=549 width=177) (actual time=6344.410..6495.377 rows=783 loops=1)"
"              Output: 
"              Buffers: shared hit=7160 read=137"
"              I/O Timings: read=107.771"
"              ->  Nested Loop  (cost=181.19..1568.99 rows=549 width=177) (actual time=6344.389..6470.549 rows=783 loops=1)"
"                    Output: 
"                    Buffers: shared hit=3931 read=137"
"                    I/O Timings: read=107.771"
"                    ->  Bitmap Heap Scan on public.messages  (cost=181.19..798.78 rows=549 width=177) (actual time=6344.342..6436.117 rows=783 loops=1)"
"                          Output: 
"                          Recheck Cond: 
"                          Buffers: shared hit=707 read=137"
"                          I/O Timings: read=107.771"
"                          ->  BitmapOr  (cost=181.19..181.19 rows=549 width=0) (actual time=6344.226..6344.226 rows=0 loops=1)"
"                                Buffers: shared hit=120 read=20"
"                                I/O Timings: read=37.085"
"                                ->  Bitmap Index Scan on messages_sender_type_created_at_idx  (cost=0.00..23.41 rows=309 width=0) (actual time=6343.358..6343.358 rows=366 loops=1)"
"                                      Index Cond: 
"                                      Buffers: shared hit=26 read=15"
"                                      I/O Timings: read=36.977"
"                                ->  Bitmap Index Scan on messages_type_sender_recipient_created_at  
"                                      Buffers: shared hit=94 read=5"
"                                      I/O Timings: read=0.108"
"                    ->  Index Only Scan using profiles_pkey on public.profiles  (cost=0.00..1.39 rows=1 width=4) (actual time=0.018..0.024 rows=1 loops=783)"
"                          Output: profiles.id"
"                          Index Cond: (profiles.id = messages.sender)"
"                          Heap Fetches: 661"
"                          Buffers: shared hit=3224"
"              ->  Index Only Scan using profiles_pkey on public.profiles recipient_profiles_messages  (cost=0.00..1.39 rows=1 width=4) (actual time=0.014..0.018 rows=1 loops=783)"
"                    Output: recipient_profiles_messages.id"
"                    Index Cond: (recipient_profiles_messages.id = messages.recipient)"
"                    Heap Fetches: 667"
"                    Buffers: shared hit=3229"
"Total runtime: 6509.328 ms"



"Limit  (cost=2366.57..2367.82 rows=500 width=177) (actual time=73.284..76.296 rows=500 loops=1)"
"  Output: various columns"
"  Buffers: shared hit=6738 read=562"
"  I/O Timings: read=19.212"
"  ->  Sort  (cost=2366.57..2367.94 rows=549 width=177) (actual time=73.276..74.300 rows=500 loops=1)"
"        Output: various columns"
"        Sort Key: messages.created_at"
"        Sort Method: quicksort  Memory: 294kB"
"        Buffers: shared hit=6738 read=562"
"        I/O Timings: read=19.212"
"        ->  Nested Loop  (cost=181.19..2341.59 rows=549 width=177) (actual time=3.556..69.866 rows=783 loops=1)"
"              Output: various columns
"              Buffers: shared hit=6735 read=562"
"              I/O Timings: read=19.212"
"              ->  Nested Loop  (cost=181.19..1570.19 rows=549 width=177) (actual time=3.497..53.820 rows=783 loops=1)"
"                    Output: various columns
"                    Buffers: shared hit=3506 read=562"
"                    I/O Timings: read=19.212"
"                    ->  Bitmap Heap Scan on public.messages  (cost=181.19..798.78 rows=549 width=177) (actual time=3.408..32.906 rows=783 loops=1)"
"                          Output: various columns
"                          Recheck Cond: ()
"                          Buffers: shared hit=282 read=562"
"                          I/O Timings: read=19.212"
"                          ->  BitmapOr  (cost=181.19..181.19 rows=549 width=0) (actual time=3.279..3.279 rows=0 loops=1)"
"                                Buffers: shared hit=114 read=26"
"                                I/O Timings: read=1.755"
"                                ->  Bitmap Index Scan on messages_sender_type_created_at_idx  (cost=0.00..23.41 rows=309 width=0) (actual time=1.882..1.882 rows=366 loops=1)"
"                                      Index Cond: 
"                                      Buffers: shared hit=25 read=16"
"                                      I/O Timings: read=1.085"
"                                ->  Bitmap Index Scan on 
"                                      Buffers: shared hit=89 read=10"
"                                      I/O Timings: read=0.670"
"                    ->  Index Only Scan using profiles_pkey on public.profiles  (cost=0.00..1.40 rows=1 width=4) (actual time=0.012..0.015 rows=1 loops=783)"
"                          Output: profiles.id"
"                          Index Cond: (profiles.id = messages.sender)"
"                          Heap Fetches: 654"
"                          Buffers: shared hit=3224"
"              ->  Index Only Scan using profiles_pkey on public.profiles recipient_profiles_messages  (cost=0.00..1.40 rows=1 width=4) (actual time=0.007..0.009 rows=1 loops=783)"
"                    Output: recipient_profiles_messages.id"
"                    Index Cond: (recipient_profiles_messages.id = messages.recipient)"
"                    Heap Fetches: 647"
"                    Buffers: shared hit=3229"
"Total runtime: 77.528 ms"



On Tue, Apr 2, 2013 at 2:40 AM, Armand du Plessis <adp@bank.io> wrote:
Hi Jeff, 

Sorry I should've mentioned the new instance is Postgres 9.2.3. The old instance IO maxed out the disk/io available on a single EBS volume on AWS. It had 2000 PIOPS but was constantly bottlenecked. I assumed that striping 6 1000 IOPS volumes in RAID-0 would give me some breathing space on that front, and looking at the iostat (just included in previous email) it seems to be doing OK. 

I actually had pg_pool running as a test but to avoid having too many moving parts in the change removed it from the equation. Need to look into the proper configuration so it doesn't saturate my cluster worse than I'm doing myself.

I've commented inline. 

Regards,

Armand

PS. This is probably the most helpful mailing list I've ever come across. Starting to feel a little more that it can be solved. 


On Tue, Apr 2, 2013 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Mon, Apr 1, 2013 at 3:35 PM, Armand du Plessis <adp@bank.io> wrote:
[Apologies, I first sent this to the incorrect list, postgres-admin, in the event you receive it twice]

Hi there,

I'm hoping someone on the list can shed some light on an issue I'm having with our Postgresql cluster. I'm literally tearing out my hair and don't have a deep enough understanding of Postgres to find the problem. 

What's happening is I had severe disk/io issues on our original Postgres cluster (9.0.8)
and switched to a new instance with a RAID-0 volume array.

What was the old instance IO?  Did you do IO benchmarking on both?
 
The machine's CPU usage would hover around 30% and our database would run lightning fast with pg_locks hovering between 100-200. 

Within a few seconds something would trigger a massive increase in pg_locks so that it suddenly shoots up to 4000-8000. At this point everything dies. Queries that usually take a few milliseconds takes minutes and everything is unresponsive until I restart postgres. 

I think that pg_locks is pretty much a red herring.  All it means is that you have a lot more active connections than you used to.   All active connections are going to hold various locks, while most idle connections (other than 'idle in transaction') connections will not hold any.

Although I doubt it will solve this particular problem, you should probably use a connection pooler.



shared_buffers = 32GB

That seems very high.  There are reports that using >8 GB leads to precisely the type of problem you are seeing (checkpoint associated freezes).  Although I've never seen those reports when fsync=off.  

I thought you might be suffering from the problem solved in release 9.1 by item "Merge duplicate fsync requests (Robert Haas, Greg Smith)", but then I realized that with fsync=off it could not be that.

 

max_connections = 800

That also is very high.


The problems seem to overlap with checkpoints. 

2013-04-01 21:31:35.592 UTC,,,26877,,5159fa5f.68fd,1,,2013-04-01 21:21:35 UTC,,0,LOG,00000,"checkpoint starting: time",,,,,,,,,""
2013-04-01 21:40:35.033 UTC,,,26877,,5159fa5f.68fd,2,,2013-04-01 21:21:35 UTC,,0,LOG,00000,"checkpoint complete: wrote 100635 buffers (2.4%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=539.439 s, sync=0.000 s, total=539.441 s; sync files=0, longest=0.000 s, average=0.000 s",,,,,,,,,""
2013-04-01 21:41:35.093 UTC,,,26877,,5159fa5f.68fd,3,,2013-04-01 21:21:35 UTC,,0,LOG,00000,"checkpoint starting: time",,,,,,,,,""


I think you changed checkpoint_timout from default (5 min) to 10 minutes, without telling us.  Anyway, this is where it would be nice to know how much of the 539.439 s in the write phase was spent blocking on writes, and how much was spent napping.  But that info is not collected by pgsql.

I did actually change it to 25 minutes. Apologies it was probably lost in the text of a previous email. Here's the changed settings:

# - Background Writer -

bgwriter_delay = 200ms                  # 10-10000ms between rounds
bgwriter_lru_maxpages = 400             # 0-1000 max buffers written/round
bgwriter_lru_multiplier = 2.0           # 0-10.0 multipler on buffers scanned/round

checkpoint_segments = 128
checkpoint_timeout = 25min  

It seems to be lasting longer with these settings. 
 

Your top output looked for it was a time at which there were no problems, and it didn't include the top processes, so it wasn't very informative.
 
If you could upgrade to 9.2 and capture some data with track_io_timing, that could be useful.

I'm looking into track_io_timing.
 

Cheers,

Jeff


Re: Problems with pg_locks explosion

From
Steven Crandell
Date:
Armand,

All of the symptoms you describe line up perfectly with a problem I had recently when upgrading DB hardware.
Everything ran find until we hit some threshold somewhere at which point the locks would pile up in the thousands just as you describe, all while we were not I/O bound.

I was moving from a DELL 810 that used a flex memory bridge to a DELL 820 that used round robin on their quad core intels.
(Interestingly we also found out that DELL is planning on rolling back to the flex memory bridge later this year.)

Any chance you could find out if your old processors might have been using flex while you're new processors might be using round robin?

-s


On Mon, Apr 1, 2013 at 5:59 PM, Armand du Plessis <adp@bank.io> wrote:
I've run an EXPLAIN ANALYZE on one of the queries that appeared in the pg_locks (although like you say that might be a red herring) both during normal response times (2) and also after the locks backlog materialized (1) 

The output below, I've just blanked out some columns. The IO timings do seem an order of magnitude slower but not excessive unless I'm reading it wrong. 

"Limit  (cost=2364.19..2365.44 rows=500 width=177) (actual time=6501.103..6507.196 rows=500 loops=1)"
"  Output: 
"  Buffers: shared hit=7163 read=137"
"  I/O Timings: read=107.771"
"  ->  Sort  (cost=2364.19..2365.56 rows=549 width=177) (actual time=6501.095..6503.216 rows=500 loops=1)"
"        Output: 
"        Sort Key: messages.created_at"
"        Sort Method: quicksort  Memory: 294kB"
"        Buffers: shared hit=7163 read=137"
"        I/O Timings: read=107.771"
"        ->  Nested Loop  (cost=181.19..2339.21 rows=549 width=177) (actual time=6344.410..6495.377 rows=783 loops=1)"
"              Output: 
"              Buffers: shared hit=7160 read=137"
"              I/O Timings: read=107.771"
"              ->  Nested Loop  (cost=181.19..1568.99 rows=549 width=177) (actual time=6344.389..6470.549 rows=783 loops=1)"
"                    Output: 
"                    Buffers: shared hit=3931 read=137"
"                    I/O Timings: read=107.771"
"                    ->  Bitmap Heap Scan on public.messages  (cost=181.19..798.78 rows=549 width=177) (actual time=6344.342..6436.117 rows=783 loops=1)"
"                          Output: 
"                          Recheck Cond: 
"                          Buffers: shared hit=707 read=137"
"                          I/O Timings: read=107.771"
"                          ->  BitmapOr  (cost=181.19..181.19 rows=549 width=0) (actual time=6344.226..6344.226 rows=0 loops=1)"
"                                Buffers: shared hit=120 read=20"
"                                I/O Timings: read=37.085"
"                                ->  Bitmap Index Scan on messages_sender_type_created_at_idx  (cost=0.00..23.41 rows=309 width=0) (actual time=6343.358..6343.358 rows=366 loops=1)"
"                                      Index Cond: 
"                                      Buffers: shared hit=26 read=15"
"                                      I/O Timings: read=36.977"
"                                ->  Bitmap Index Scan on messages_type_sender_recipient_created_at  
"                                      Buffers: shared hit=94 read=5"
"                                      I/O Timings: read=0.108"
"                    ->  Index Only Scan using profiles_pkey on public.profiles  (cost=0.00..1.39 rows=1 width=4) (actual time=0.018..0.024 rows=1 loops=783)"
"                          Output: profiles.id"
"                          Index Cond: (profiles.id = messages.sender)"
"                          Heap Fetches: 661"
"                          Buffers: shared hit=3224"
"              ->  Index Only Scan using profiles_pkey on public.profiles recipient_profiles_messages  (cost=0.00..1.39 rows=1 width=4) (actual time=0.014..0.018 rows=1 loops=783)"
"                    Output: recipient_profiles_messages.id"
"                    Index Cond: (recipient_profiles_messages.id = messages.recipient)"
"                    Heap Fetches: 667"
"                    Buffers: shared hit=3229"
"Total runtime: 6509.328 ms"



"Limit  (cost=2366.57..2367.82 rows=500 width=177) (actual time=73.284..76.296 rows=500 loops=1)"
"  Output: various columns"
"  Buffers: shared hit=6738 read=562"
"  I/O Timings: read=19.212"
"  ->  Sort  (cost=2366.57..2367.94 rows=549 width=177) (actual time=73.276..74.300 rows=500 loops=1)"
"        Output: various columns"
"        Sort Key: messages.created_at"
"        Sort Method: quicksort  Memory: 294kB"
"        Buffers: shared hit=6738 read=562"
"        I/O Timings: read=19.212"
"        ->  Nested Loop  (cost=181.19..2341.59 rows=549 width=177) (actual time=3.556..69.866 rows=783 loops=1)"
"              Output: various columns
"              Buffers: shared hit=6735 read=562"
"              I/O Timings: read=19.212"
"              ->  Nested Loop  (cost=181.19..1570.19 rows=549 width=177) (actual time=3.497..53.820 rows=783 loops=1)"
"                    Output: various columns
"                    Buffers: shared hit=3506 read=562"
"                    I/O Timings: read=19.212"
"                    ->  Bitmap Heap Scan on public.messages  (cost=181.19..798.78 rows=549 width=177) (actual time=3.408..32.906 rows=783 loops=1)"
"                          Output: various columns
"                          Recheck Cond: ()
"                          Buffers: shared hit=282 read=562"
"                          I/O Timings: read=19.212"
"                          ->  BitmapOr  (cost=181.19..181.19 rows=549 width=0) (actual time=3.279..3.279 rows=0 loops=1)"
"                                Buffers: shared hit=114 read=26"
"                                I/O Timings: read=1.755"
"                                ->  Bitmap Index Scan on messages_sender_type_created_at_idx  (cost=0.00..23.41 rows=309 width=0) (actual time=1.882..1.882 rows=366 loops=1)"
"                                      Index Cond: 
"                                      Buffers: shared hit=25 read=16"
"                                      I/O Timings: read=1.085"
"                                ->  Bitmap Index Scan on 
"                                      Buffers: shared hit=89 read=10"
"                                      I/O Timings: read=0.670"
"                    ->  Index Only Scan using profiles_pkey on public.profiles  (cost=0.00..1.40 rows=1 width=4) (actual time=0.012..0.015 rows=1 loops=783)"
"                          Output: profiles.id"
"                          Index Cond: (profiles.id = messages.sender)"
"                          Heap Fetches: 654"
"                          Buffers: shared hit=3224"
"              ->  Index Only Scan using profiles_pkey on public.profiles recipient_profiles_messages  (cost=0.00..1.40 rows=1 width=4) (actual time=0.007..0.009 rows=1 loops=783)"
"                    Output: recipient_profiles_messages.id"
"                    Index Cond: (recipient_profiles_messages.id = messages.recipient)"
"                    Heap Fetches: 647"
"                    Buffers: shared hit=3229"
"Total runtime: 77.528 ms"



On Tue, Apr 2, 2013 at 2:40 AM, Armand du Plessis <adp@bank.io> wrote:
Hi Jeff, 

Sorry I should've mentioned the new instance is Postgres 9.2.3. The old instance IO maxed out the disk/io available on a single EBS volume on AWS. It had 2000 PIOPS but was constantly bottlenecked. I assumed that striping 6 1000 IOPS volumes in RAID-0 would give me some breathing space on that front, and looking at the iostat (just included in previous email) it seems to be doing OK. 

I actually had pg_pool running as a test but to avoid having too many moving parts in the change removed it from the equation. Need to look into the proper configuration so it doesn't saturate my cluster worse than I'm doing myself.

I've commented inline. 

Regards,

Armand

PS. This is probably the most helpful mailing list I've ever come across. Starting to feel a little more that it can be solved. 


On Tue, Apr 2, 2013 at 2:21 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Mon, Apr 1, 2013 at 3:35 PM, Armand du Plessis <adp@bank.io> wrote:
[Apologies, I first sent this to the incorrect list, postgres-admin, in the event you receive it twice]

Hi there,

I'm hoping someone on the list can shed some light on an issue I'm having with our Postgresql cluster. I'm literally tearing out my hair and don't have a deep enough understanding of Postgres to find the problem. 

What's happening is I had severe disk/io issues on our original Postgres cluster (9.0.8)
and switched to a new instance with a RAID-0 volume array.

What was the old instance IO?  Did you do IO benchmarking on both?
 
The machine's CPU usage would hover around 30% and our database would run lightning fast with pg_locks hovering between 100-200. 

Within a few seconds something would trigger a massive increase in pg_locks so that it suddenly shoots up to 4000-8000. At this point everything dies. Queries that usually take a few milliseconds takes minutes and everything is unresponsive until I restart postgres. 

I think that pg_locks is pretty much a red herring.  All it means is that you have a lot more active connections than you used to.   All active connections are going to hold various locks, while most idle connections (other than 'idle in transaction') connections will not hold any.

Although I doubt it will solve this particular problem, you should probably use a connection pooler.



shared_buffers = 32GB

That seems very high.  There are reports that using >8 GB leads to precisely the type of problem you are seeing (checkpoint associated freezes).  Although I've never seen those reports when fsync=off.  

I thought you might be suffering from the problem solved in release 9.1 by item "Merge duplicate fsync requests (Robert Haas, Greg Smith)", but then I realized that with fsync=off it could not be that.

 

max_connections = 800

That also is very high.


The problems seem to overlap with checkpoints. 

2013-04-01 21:31:35.592 UTC,,,26877,,5159fa5f.68fd,1,,2013-04-01 21:21:35 UTC,,0,LOG,00000,"checkpoint starting: time",,,,,,,,,""
2013-04-01 21:40:35.033 UTC,,,26877,,5159fa5f.68fd,2,,2013-04-01 21:21:35 UTC,,0,LOG,00000,"checkpoint complete: wrote 100635 buffers (2.4%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=539.439 s, sync=0.000 s, total=539.441 s; sync files=0, longest=0.000 s, average=0.000 s",,,,,,,,,""
2013-04-01 21:41:35.093 UTC,,,26877,,5159fa5f.68fd,3,,2013-04-01 21:21:35 UTC,,0,LOG,00000,"checkpoint starting: time",,,,,,,,,""


I think you changed checkpoint_timout from default (5 min) to 10 minutes, without telling us.  Anyway, this is where it would be nice to know how much of the 539.439 s in the write phase was spent blocking on writes, and how much was spent napping.  But that info is not collected by pgsql.

I did actually change it to 25 minutes. Apologies it was probably lost in the text of a previous email. Here's the changed settings:

# - Background Writer -

bgwriter_delay = 200ms                  # 10-10000ms between rounds
bgwriter_lru_maxpages = 400             # 0-1000 max buffers written/round
bgwriter_lru_multiplier = 2.0           # 0-10.0 multipler on buffers scanned/round

checkpoint_segments = 128
checkpoint_timeout = 25min  

It seems to be lasting longer with these settings. 
 

Your top output looked for it was a time at which there were no problems, and it didn't include the top processes, so it wasn't very informative.
 
If you could upgrade to 9.2 and capture some data with track_io_timing, that could be useful.

I'm looking into track_io_timing.
 

Cheers,

Jeff



Re: Problems with pg_locks explosion

From
Armand du Plessis
Date:
Hi Steven, 

Sounds very familiar. Painfully familiar :( 

But I really don't know. All I can see is that in this particular configuration the instance has 2 x Intel Xeon E5-2670, eight-core processors. I can't find any info on whether it's flex or round robin. AWS typically don't make the underlying hardware known. The exception is on the chip-types on the higher-end instance types which is where I got the info above from. 

Below is an excerpt from atop when the problem occur. The CPUs jump to high sys usage, not sure if that was similar to what you saw?

How did you get it resolved in the end?

ATOP - ip-10-155-231-112                                                                      2013/04/02  01:25:40                 ------                                                                        2s elapsed
59;169H 0   70.15s  |              |  user   8.19s |               |               |              |               | #proc   1015  |              |  #zombie    0 |               | clones     0 |               |               |               |              |  #exit      2 |
CPU | sys    3182%  |              |  user     30% |               | irq       1%  |              |               |               | idle      0% |               | wait      0%  |              |               |               | steal     0%  |              |  guest     0% |
cpu | sys      98%  |              |  user      1% |               | irq       1%  |              |               |               | idle      0% |               | cpu000 w  0%  |              |               |               | steal     0%  |              |  guest     0% |
cpu | sys      96%  |              |  user      4% |               | irq       0%  |              |               |               | idle      0% |               | cpu001 w  0%  |              |               |               | steal     0%  |              |  guest     0% |
cpu | sys     100%  |              |  user      0% |               | irq       0%  |              |               |               | idle      0% |               | cpu002 w  0%  |              |               |               | steal     0%  |              |  guest     0% |
cpu | sys      99%  |              |  user      1% |               | irq       0%  |              |               |               | idle      0% |               | cpu003 w  0%  |              |               |               | steal     0%  |              |  guest     0% |
cpu | sys     100%  |              |  user      0% |               | irq       0%  |              |               |               | idle      0% |               | cpu004 w  0%  |              |               |               | steal     0%  |              |  guest     0% |
cpu | sys     100%  |              |  user      0% |               | irq       0%  |              |               |               | idle      0% |               | cpu005 w  0%  |              |               |               | steal     0%  |              |  guest     0% |
cpu | sys      98%  |              |  user      2% |               | irq       0%  |              |               |               | idle      0% |               | cpu006 w  0%  |              |               |               | steal     0%  |              |  guest     0% |
cpu | sys      99%  |              |  user      1% |               | irq       0%  |              |               |               | idle      0% |               | cpu007 w  0%  |              |               |               | steal     0%  |              |  guest     0% |
cpu | sys      99%  |              |  user      1% |               | irq       0%  |              |               |               | idle      0% |               | cpu008 w  0%  |              |               |               | steal     0%  |              |  guest     0% |
cpu | sys     100%  |              |  user      0% |               | irq       0%  |              |               |               | idle      0% |               | cpu009 w  0%  |              |               |               | steal     0%  |              |  guest     0% |
cpu | sys      99%  |              |  user      1% |               | irq       0%  |              |               |               | idle      0% |               | cpu010 w  0%  |              |               |               | steal     0%  |              |  guest     0% |
cpu | sys     100%  |              |  user      0% |               | irq       0%  |              |               |               | idle      0% |               | cpu011 w  0%  |              |               |               | steal     0%  |              |  guest     0% |
cpu | sys      99%  |              |  user      1% |               | irq       0%  |              |               |               | idle      0% |               | cpu012 w  0%  |              |               |               | steal     0%  |              |  guest     0% |
cpu | sys      97%  |              |  user      3% |               | irq       0%  |              |               |               | idle      0% |               | cpu013 w  0%  |              |               |               | steal     0%  |              |  guest     0% |
cpu | sys     100%  |              |  user      0% |               | irq       0%  |              |               |               | idle      0% |               | cpu014 w  0%  |              |               |               | steal     0%  |              |  guest     0% |
cpu | sys     100%  |              |  user      0% |               | irq       0%  |              |               |               | idle      0% |               | cpu015 w  0%  |              |               |               | steal     0%  |              |  guest     0% |
cpu | sys     100%  |              |  user      0% |               | irq       0%  |              |               |               | idle      0% |               | cpu016 w  0%  |              |               |               | steal     0%  |              |  guest     0% |
cpu | sys      82%  |              |  user     18% |               | irq       0%  |              |               |               | idle      0% |               | cpu017 w  0%  |              |               |               | steal     0%  |              |  guest     0% |
cpu | sys     100%  |              |  user      0% |               | irq       0%  |              |               |               | idle      0% |               | cpu018 w  0%  |              |               |               | steal     0%  |              |  guest     0% |
cpu | sys     100%  |              |  user      0% |               | irq       0%  |              |               |               | idle      0% |               | cpu019 w  0%  |              |               |               | steal     0%  |              |  guest     0% |
cpu | sys     100%  |              |  user      0% |               | irq       0%  |              |               |               | idle      0% |               | cpu020 w  0%  |              |               |               | steal     0%  |              |  guest     0% |
cpu | sys     100%  |              |  user      0% |               | irq       0%  |              |               |               | idle      0% |               | cpu021 w  0%  |              |               |               | steal     0%  |              |  guest     0% |
cpu | sys     100%  |              |  user      0% |               | irq       0%  |              |               |               | idle      0% |               | cpu022 w  0%  |              |               |               | steal     0%  |              |  guest     0% |
cpu | sys     100%  |              |  user      0% |               | irq       0%  |              |               |               | idle      0% |               | cpu023 w  0%  |              |               |               | steal     0%  |              |  guest     0% |
cpu | sys     100%  |              |  user      0% |               | irq       0%  |              |               |               | idle      0% |               | cpu024 w  0%  |              |               |               | steal     0%  |              |  guest     0% |
cpu | sys     100%  |              |  user      0% |               | irq       0%  |              |               |               | idle      0% |               | cpu025 w  0%  |              |               |               | steal     0%  |              |  guest     0% |
cpu | sys     100%  |              |  user      0% |               | irq       0%  |              |               |               | idle      0% |               | cpu026 w  0%  |              |               |               | steal     0%  |              |  guest     0% |
cpu | sys     100%  |              |  user      0% |               | irq       0%  |              |               |               | idle      0% |               | cpu027 w  0%  |              |               |               | steal     0%  |              |  guest     0% |
cpu | sys      99%  |              |  user      1% |               | irq       0%  |              |               |               | idle      0% |               | cpu028 w  0%  |              |               |               | steal     0%  |              |  guest     0% |
cpu | sys     100%  |              |  user      0% |               | irq       0%  |              |               |               | idle      0% |               | cpu029 w  0%  |              |               |               | steal     0%  |              |  guest     0% |
cpu | sys     100%  |              |  user      0% |               | irq       0%  |              |               |               | idle      0% |               | cpu030 w  0%  |              |               |               | steal     0%  |              |  guest     0% |
cpu | sys      99%  |              |  user      1% |               | irq       0%  |              |               |               | idle      0% |               | cpu031 w  0%  |              |               |               | steal     0%  |              |  guest     0% |
CPL | avg1   90.60  |              |  avg5   60.80 |               |               | avg15  39.77 |               |               |              |  csw     1011 |               | intr   17568 |               |               |               |              |  numcpu    32 |
MEM | tot    58.5G  |              |  free  418.4M |               | cache  45.0G  | dirty   0.6M |               | buff    5.8M  |              |  slab  501.2M |               |              |               |               |               |              |               |
SWP | tot     0.0M  |              |  free    0.0M |               |               |              |               |               |              |               |               |              |               |               | vmcom  49.8G  |              |  vmlim  29.3G |
PAG | scan    1858  |              |               |               | stall 0  |              |               |               |              |               |               |              |  swin       0 |               |               |              |  swout      0 |
NET | transport     | tcpi     318 |          | tcpo     392  | udpi      34  |         |  udpo      39 | tcpao      0  |              |  tcppo      2 | tcprs      0  |              |  tcpie      0 |  tcpor      0 |               | udpnp      0 |  udpip      0 |
NET | network       |              |  ipi      357 |               | ipo      397  | ipfrw 0 |               | deliv    357  |              |               |               |              |               |               | icmpi      0  |              |  icmpo      0 |
NET | eth0    ----  |              |  pcki     318 | pcko     358  |               | si  200 Kbps |  so  947 Kbps |               | coll       0 |               | mlti       0  | erri       0 |               |  erro       0 | drpi       0  |              |  drpo       0 |
NET | lo      ----  |              |  pcki      39 | pcko      39  |               | si   79 Kbps |  so   79 Kbps |               | coll       0 |               | mlti       0  | erri       0 |               |  erro       0 | drpi       0  |              |  drpo       0 |
debug2: channel 0: window 997757 sent adjust 50819


On Tue, Apr 2, 2013 at 3:07 AM, Steven Crandell <steven.crandell@gmail.com> wrote:
Armand,

All of the symptoms you describe line up perfectly with a problem I had recently when upgrading DB hardware.
Everything ran find until we hit some threshold somewhere at which point the locks would pile up in the thousands just as you describe, all while we were not I/O bound.

I was moving from a DELL 810 that used a flex memory bridge to a DELL 820 that used round robin on their quad core intels.
(Interestingly we also found out that DELL is planning on rolling back to the flex memory bridge later this year.)

Any chance you could find out if your old processors might have been using flex while you're new processors might be using round robin?

-s


Re: Fwd: Problems with pg_locks explosion

From
Mark Kirkwood
Date:
Yeah, as I understand it you should have 6000 IOPS available for the md
device (ideally).

The iostats you display certainly look benign... but the key time to be
sampling would be when you see the lock list explode - could look very
different then.

Re vm.dirty* - I would crank the values down by a factor of 5:

vm.dirty_background_ratio = 1 (down from 5)
vm.dirty_ratio = 2 (down from 10)

Assuming of course that you actually are seeing an IO stall (which
should be catchable via iostat or iotop)... and not some other issue.
Otherwise leave 'em alone and keep looking :-)

Cheers

Mark


On 02/04/13 13:31, Armand du Plessis wrote:
>
>
> I had a look at the iostat output (on a 5s interval) and pasted it
> below. The utilization and waits seems low. Included a sample below, #1
> taken during normal operation and then when the locks happen it
> basically drops to 0 across the board. My (mis)understanding of the IOPS
> was that it would be 1000 IOPS per/volume and when in RAID0 should give
> me quite a bit higher throughput than in a single EBS volume setup. (My
> naive envelop calculation was #volumes * PIOPS = Effective IOPS :/)
>
>
> I'm looking into  vm.dirty_background_ratio, vm.dirty_ratio sysctls. Is
> there any guidance or links available that would be useful as a starting
> point?
>
> Thanks again for the help, I really appreciate it.
>



Re: Problems with pg_locks explosion

From
Jeff Janes
Date:
On Monday, April 1, 2013, Mark Kirkwood wrote:

Your provisioned volumes are much better than the default AWS ones, but are still not hugely fast (i.e 1000 IOPS is about 8 MB/s worth of Postgres 8k buffers). So you may need to look at adding more volumes into the array, or adding some separate ones and putting pg_xlog directory on 'em.

However before making changes I would recommend using iostat or sar to monitor how volumes are handling the load (I usually choose a 1 sec granularity and look for 100% util and high - server hundred ms - awaits). Also iotop could be enlightening.

Hi Mark,

Do you have experience using these tools with AWS?  When using non-DAS in other contexts, I've noticed that these tools often give deranged results, because the kernel doesn't correctly know what time to attribute to "network" and what to attribute to "disk".  But I haven't looked into it on AWS EBS, maybe they do a better job there.
 
Thanks for any insight,

Jeff

Re: Problems with pg_locks explosion

From
Jeff Janes
Date:
On Monday, April 1, 2013, Armand du Plessis wrote:
I've run an EXPLAIN ANALYZE on one of the queries that appeared in the pg_locks (although like you say that might be a red herring) both during normal response times (2) and also after the locks backlog materialized (1) 

The output below, I've just blanked out some columns. The IO timings do seem an order of magnitude slower but not excessive unless I'm reading it wrong. 

"Limit  (cost=2364.19..2365.44 rows=500 width=177) (actual time=6501.103..6507.196 rows=500 loops=1)"
"  Output: 
"  Buffers: shared hit=7163 read=137"
"  I/O Timings: read=107.771"
 
...
 

"Limit  (cost=2366.57..2367.82 rows=500 width=177) (actual time=73.284..76.296 rows=500 loops=1)"
"  Output: various columns"
"  Buffers: shared hit=6738 read=562"
"  I/O Timings: read=19.212"

You are correct that the different in IO timing for reads is not nearly enough to explain the difference, but the ratio is still large enough to perhaps be suggestive.  It could be be that all the extra time is spent in IO writes (not reported here).  If you turn on track_io_timing on system-wide you could check the write times in pg_stat_database.  

(Write time has an attribution problem.  I need to make room for my data, so I write out someone else's.  Is the time spent attributed to the one doing the writing, or the one who owns the data written?)

But it is perhaps looking like it might not be IO at all, but rather some kind of internal kernel problem, such as the "zone reclaim" and "huge pages" and memory interleaving, which have been discussed elsewhere in this list for high CPU high RAM machines.  I would summarize it for you, but I don't understand it, and don't have ready access to machines with 64 CPUs and 128 GB of RAM in order to explore it for myself.

But if that is the case, then using a connection pooler to restrict the number of simultaneously active connections might actually be a big win (despite what I said previously).  
 
Cheers,

Jeff

Re: Problems with pg_locks explosion

From
Armand du Plessis
Date:
Thanks Jeff, yup, I'm actually busy setting up pg_pool now. Preliminary results looks promising after switching some client nodes to connect through the pool. 

Here's the output of pg_stat_dabatase but also doesn't seem to be spending more time there either. I'll have a look through the archives for the posts you refer to. Certainly a simple 'upgrade' that's gone south badly here. On paper and with earlier tests with partial load this instance is much better suited for our workload, bizarre. 

Thanks for all the help. 


user=# select * from pg_stat_database;
 datid |      datname      | numbackends | xact_commit | xact_rollback | blks_read |   blks_hit   | tup_returned | tup_fetched  | tup_inserted | tup_updated | tup_deleted | conflicts | temp_files | temp_bytes  | deadlocks | blk_read_time | blk_write_time |          stats_
reset          
-------+-------------------+-------------+-------------+---------------+-----------+--------------+--------------+--------------+--------------+-------------+-------------+-----------+------------+-------------+-----------+---------------+----------------+
 17671 | production        |         232 |   524583730 |        213187 | 623507632 | 403590060861 | 256853219556 | 185283046675 |     95086454 |    66720609 |     4449894 |         0 |         18 | 12076277760 |         0 |   4689747.806 |       4526.539 | 2013-03-29 16:1
8:09.334432+00




On Tue, Apr 2, 2013 at 8:08 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Monday, April 1, 2013, Armand du Plessis wrote:
I've run an EXPLAIN ANALYZE on one of the queries that appeared in the pg_locks (although like you say that might be a red herring) both during normal response times (2) and also after the locks backlog materialized (1) 

The output below, I've just blanked out some columns. The IO timings do seem an order of magnitude slower but not excessive unless I'm reading it wrong. 

"Limit  (cost=2364.19..2365.44 rows=500 width=177) (actual time=6501.103..6507.196 rows=500 loops=1)"
"  Output: 
"  Buffers: shared hit=7163 read=137"
"  I/O Timings: read=107.771"
 
...
 

"Limit  (cost=2366.57..2367.82 rows=500 width=177) (actual time=73.284..76.296 rows=500 loops=1)"
"  Output: various columns"
"  Buffers: shared hit=6738 read=562"
"  I/O Timings: read=19.212"

You are correct that the different in IO timing for reads is not nearly enough to explain the difference, but the ratio is still large enough to perhaps be suggestive.  It could be be that all the extra time is spent in IO writes (not reported here).  If you turn on track_io_timing on system-wide you could check the write times in pg_stat_database.  

(Write time has an attribution problem.  I need to make room for my data, so I write out someone else's.  Is the time spent attributed to the one doing the writing, or the one who owns the data written?)

But it is perhaps looking like it might not be IO at all, but rather some kind of internal kernel problem, such as the "zone reclaim" and "huge pages" and memory interleaving, which have been discussed elsewhere in this list for high CPU high RAM machines.  I would summarize it for you, but I don't understand it, and don't have ready access to machines with 64 CPUs and 128 GB of RAM in order to explore it for myself.

But if that is the case, then using a connection pooler to restrict the number of simultaneously active connections might actually be a big win (despite what I said previously).  
 
Cheers,

Jeff

Re: Problems with pg_locks explosion

From
Mark Kirkwood
Date:
On 02/04/13 19:08, Jeff Janes wrote:
> On Monday, April 1, 2013, Mark Kirkwood wrote:
>
>
>     Your provisioned volumes are much better than the default AWS ones,
>     but are still not hugely fast (i.e 1000 IOPS is about 8 MB/s worth
>     of Postgres 8k buffers). So you may need to look at adding more
>     volumes into the array, or adding some separate ones and putting
>     pg_xlog directory on 'em.
>
>     However before making changes I would recommend using iostat or sar
>     to monitor how volumes are handling the load (I usually choose a 1
>     sec granularity and look for 100% util and high - server hundred ms
>     - awaits). Also iotop could be enlightening.
>
>
> Hi Mark,
>
> Do you have experience using these tools with AWS?  When using non-DAS
> in other contexts, I've noticed that these tools often give deranged
> results, because the kernel doesn't correctly know what time to
> attribute to "network" and what to attribute to "disk".  But I haven't
> looked into it on AWS EBS, maybe they do a better job there.
> Thanks for any insight,
>



Hi Jeff,

That is a very good point. I did notice a reasonable amount of network
traffic on the graphs posted previously, along with a suspiciously low
amount of IO for md127 (which I assume is the raid0 array)...and
wondered if iostat was not seeing IO fully, however it slipped my mind
(I am on leave with kittens - so claim that for the purrrfect excuse)!

However I don't recall there being a problem with the io tools for
standard EBS volumes - but I haven't benchmarked AWS for a over a year,
so things could be different now - and I have no experience with these
new provisioned volumes.

Armand - it might be instructive to do some benchmarking (with another
host and volume set) where you do something like:

$ dd if=/dev/zero of=file bs=8k count=1000000

and see if iostat and friends actually show you doing IO as expected!





Re: Problems with pg_locks explosion

From
Armand du Plessis
Date:
I had my reservations about my almost 0% IO usage on the raid0 array as well. I'm looking at the numbers in atop and it doesn't seem to reflect the aggregate of the volumes as one would expect. I'm just happy I am seeing numbers on the volumes, they're not too bad. 

One thing I was wondering, as a last possible IO resort. Provisioned EBS volumes requires that you maintain a wait queue of 1 for every 200 provisioned IOPS to get reliable IO. My wait queue hovers between 0-1 and with the 1000 IOPS it should be 5. Even thought about artificially pushing more IO to the volumes but I think Jeff's right, there's some internal kernel voodoo at play here. I have a feeling it'll be under control with pg_pool (if I can just get the friggen setup there right) and then I'll have more time to dig into it deeper. 

Apologies to the kittens for the interrupting your leave :) 


On Tue, Apr 2, 2013 at 8:25 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
On 02/04/13 19:08, Jeff Janes wrote:
On Monday, April 1, 2013, Mark Kirkwood wrote:


    Your provisioned volumes are much better than the default AWS ones,
    but are still not hugely fast (i.e 1000 IOPS is about 8 MB/s worth
    of Postgres 8k buffers). So you may need to look at adding more
    volumes into the array, or adding some separate ones and putting
    pg_xlog directory on 'em.

    However before making changes I would recommend using iostat or sar
    to monitor how volumes are handling the load (I usually choose a 1
    sec granularity and look for 100% util and high - server hundred ms
    - awaits). Also iotop could be enlightening.


Hi Mark,

Do you have experience using these tools with AWS?  When using non-DAS
in other contexts, I've noticed that these tools often give deranged
results, because the kernel doesn't correctly know what time to
attribute to "network" and what to attribute to "disk".  But I haven't
looked into it on AWS EBS, maybe they do a better job there.
Thanks for any insight,




Hi Jeff,

That is a very good point. I did notice a reasonable amount of network traffic on the graphs posted previously, along with a suspiciously low amount of IO for md127 (which I assume is the raid0 array)...and wondered if iostat was not seeing IO fully, however it slipped my mind (I am on leave with kittens - so claim that for the purrrfect excuse)!

However I don't recall there being a problem with the io tools for standard EBS volumes - but I haven't benchmarked AWS for a over a year, so things could be different now - and I have no experience with these new provisioned volumes.

Armand - it might be instructive to do some benchmarking (with another host and volume set) where you do something like:

$ dd if=/dev/zero of=file bs=8k count=1000000

and see if iostat and friends actually show you doing IO as expected!




Re: Problems with pg_locks explosion

From
Mark Kirkwood
Date:
Also it is worth checking what your sysctl vm.zone_reclaim_mode is set
to - if 1 then override to 0. As Jeff mentioned, this gotcha for larger
cpu number machines has been discussed at length on this list - but
still traps us now and again!

Cheers

Mark

On 02/04/13 19:33, Armand du Plessis wrote:
> I had my reservations about my almost 0% IO usage on the raid0 array as
> well. I'm looking at the numbers in atop and it doesn't seem to reflect
> the aggregate of the volumes as one would expect. I'm just happy I am
> seeing numbers on the volumes, they're not too bad.
>
> One thing I was wondering, as a last possible IO resort. Provisioned EBS
> volumes requires that you maintain a wait queue of 1 for every 200
> provisioned IOPS to get reliable IO. My wait queue hovers between 0-1
> and with the 1000 IOPS it should be 5. Even thought about artificially
> pushing more IO to the volumes but I think Jeff's right, there's some
> internal kernel voodoo at play here. I have a feeling it'll be under
> control with pg_pool (if I can just get the friggen setup there right)
> and then I'll have more time to dig into it deeper.
>
> Apologies to the kittens for the interrupting your leave :)
>



Re: Problems with pg_locks explosion

From
Armand du Plessis
Date:
Touch wood but I think I found the problem thanks to these pointers. I checked the vm.zone_reclaim_mode and mine was set to 0. However just before the locking starts I can see many of my CPUs flashing red and jump to high percentage sys usage. When I look at top it's the migration kernel tasks that seem to trigger it. 

So it seems it was a bit trigger happy with task migrations, setting the kernel.sched_migration_cost to 5000000 (5ms) seemed to have resolved my woes. I'm yet to see locks climb and it's been running stable for a bit. This post was invaluable in explaining the cause -> http://www.postgresql.org/message-id/50E4AAB1.9040902@optionshouse.com

# Postgres Kernel Tweaks
kernel.sched_migration_cost = 5000000
# kernel.sched_autogroup_enabled = 0

The second recommended setting 'sched_autogroup_enabled' is not available on the kernel I'm running but it doesn't seem to be a problem. 

Again, thanks again for the help. It was seriously appreciated. Long night was long. 

If things change and the problem pops up again I'll update you guys. 

Cheers,

Armand


On Tue, Apr 2, 2013 at 8:43 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
Also it is worth checking what your sysctl vm.zone_reclaim_mode is set to - if 1 then override to 0. As Jeff mentioned, this gotcha for larger cpu number machines has been discussed at length on this list - but still traps us now and again!

Cheers

Mark


On 02/04/13 19:33, Armand du Plessis wrote:
I had my reservations about my almost 0% IO usage on the raid0 array as
well. I'm looking at the numbers in atop and it doesn't seem to reflect
the aggregate of the volumes as one would expect. I'm just happy I am
seeing numbers on the volumes, they're not too bad.

One thing I was wondering, as a last possible IO resort. Provisioned EBS
volumes requires that you maintain a wait queue of 1 for every 200
provisioned IOPS to get reliable IO. My wait queue hovers between 0-1
and with the 1000 IOPS it should be 5. Even thought about artificially
pushing more IO to the volumes but I think Jeff's right, there's some
internal kernel voodoo at play here. I have a feeling it'll be under
control with pg_pool (if I can just get the friggen setup there right)
and then I'll have more time to dig into it deeper.

Apologies to the kittens for the interrupting your leave :)



Re: Problems with pg_locks explosion

From
Armand du Plessis
Date:
Jumped the gun a bit. the problem still exists like before. But it's definitely on the right track, below is the output from top in the seconds before the cluster locks up. For some reason still insisting on moving tasks around despite bumping the sched_migration_cost cost up to 100ms. 

77 root      RT   0     0    0    0 S 32.3  0.0  13:55.20 [migration/24]                                                                                                                                                                                                     
26512 postgres  20   0 8601m 7388 4992 R 32.3  0.0   0:02.17 postgres: other_user xxxx(52944) INSERT                                                                                                                                                 
   38 root      RT   0     0    0    0 S 31.3  0.0  17:26.15 [migration/11]                                                                                                                                                                                                        65 root      RT   0     0    0    0 S 30.0  0.0  13:18.66 [migration/20]                                                                                                                                                                                                     
   62 root      RT   0     0    0    0 S 29.7  0.0  12:58.81 [migration/19]                                                                                                                                                                                                     
   47 root      RT   0     0    0    0 S 29.0  0.0  18:16.43 [migration/14]                                                                                                                                                                                                     
   29 root      RT   0     0    0    0 S 28.7  0.0  25:21.47 [migration/8]                                                                                                                                                                                                      
   71 root      RT   0     0    0    0 S 28.4  0.0  13:20.31 [migration/22]                                                                                                                                                                                                     
   95 root      RT   0     0    0    0 S 23.8  0.0  13:37.31 [migration/30]                                                                                                                                                                                                     
26518 postgres  20   0 8601m 9684 5228 S 21.2  0.0   0:01.89 postgres: other_user xxxxx(52954) INSERT                                                                                                                                                 
    6 root      RT   0     0    0    0 S 20.5  0.0  39:17.72 [migration/0]                                                                                                                                                                                                      
   41 root      RT   0     0    0    0 S 19.6  0.0  18:21.36 [migration/12]                                                                                                                                                                                                     
   68 root      RT   0     0    0    0 S 19.6  0.0  13:04.62 [migration/21]                                                                                                                                                                                                     
   74 root      RT   0     0    0    0 S 18.9  0.0  13:39.41 [migration/23]                                                                                                                                                                                                     
  305 root      20   0     0    0    0 S 18.3  0.0  11:34.52 [kworker/27:1]                                                                                                                                                                                                     
   44 root      RT   0     0    0    0 S 17.0  0.0  18:30.71 [migration/13]                                                                                                                                                                                                     
   89 root      RT   0     0    0    0 S 16.0  0.0  12:13.42 [migration/28]                                                                                                                                                                                                     
    7 root      RT   0     0    0    0 S 15.3  0.0  21:58.56 [migration/1]                                                                                                                                                                                                      
   35 root      RT   0     0    0    0 S 15.3  0.0  20:02.05 [migration/10]                                                                                                                                                                                                     
   53 root      RT   0     0    0    0 S 14.0  0.0  12:51.46 [migration/16]                                                                                                                                                                                                     
11254 root       0 -20 21848 7532 2788 S 11.7  0.0  22:35.66 atop 1                                                                                                                                                                                                             
   14 root      RT   0     0    0    0 S 10.8  0.0  19:36.56 [migration/3]                                                                                                                                                                                                      
26463 postgres  20   0 8601m 7492 5100 R 10.8  0.0   0:00.33 postgres: other_user xxxxx(32835) INSERT                                                                                                                                                   
   32 root      RT   0     0    0    0 S 10.1  0.0  20:46.18 [migration/9]                                                                                                                                                                                                      
16793 root      20   0     0    0    0 S  6.5  0.0   1:12.72 [kworker/25:0]                                                                                                                                                                                                     
   20 root      RT   0     0    0    0 S  5.5  0.0  18:51.81 [migration/5]                                                                                                                                                                                                      
   48 root      20   0     0    0    0 S  5.5  0.0   3:52.93 [kworker/14:0]   


On Tue, Apr 2, 2013 at 10:16 AM, Armand du Plessis <adp@bank.io> wrote:
Touch wood but I think I found the problem thanks to these pointers. I checked the vm.zone_reclaim_mode and mine was set to 0. However just before the locking starts I can see many of my CPUs flashing red and jump to high percentage sys usage. When I look at top it's the migration kernel tasks that seem to trigger it. 

So it seems it was a bit trigger happy with task migrations, setting the kernel.sched_migration_cost to 5000000 (5ms) seemed to have resolved my woes. I'm yet to see locks climb and it's been running stable for a bit. This post was invaluable in explaining the cause -> http://www.postgresql.org/message-id/50E4AAB1.9040902@optionshouse.com

# Postgres Kernel Tweaks
kernel.sched_migration_cost = 5000000
# kernel.sched_autogroup_enabled = 0

The second recommended setting 'sched_autogroup_enabled' is not available on the kernel I'm running but it doesn't seem to be a problem. 

Again, thanks again for the help. It was seriously appreciated. Long night was long. 

If things change and the problem pops up again I'll update you guys. 

Cheers,

Armand


On Tue, Apr 2, 2013 at 8:43 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
Also it is worth checking what your sysctl vm.zone_reclaim_mode is set to - if 1 then override to 0. As Jeff mentioned, this gotcha for larger cpu number machines has been discussed at length on this list - but still traps us now and again!

Cheers

Mark

Re: Problems with pg_locks explosion

From
Armand du Plessis
Date:
It's now running as expected, I made a few other tweaks to get it to an operational state again. So just for closure on this dark period below some notes. 

There was two triggers that caused the almost instant backlog of locks. As suspected the one was scheduler that caused endless problems whenever it started migrating tasks. This would lock up the database (and server) for a second or more after which a few thousand locks existed. 

Setting the kernel.sched_migration_cost to 5ms didn't have the desired effect. The scheduler would still stop the world and break down for a few seconds. After much anguish and research (this is a pretty good explanation of the scheduler tunables http://events.linuxfoundation.org/slides/2011/linuxcon/lcna2011_rajan.pdf) I adjusted :

sysctl -w kernel.sched_min_granularity_ns=9000000
sysctl -w kernel.sched_wakeup_granularity_ns=12000000

Since then I haven't had an interruption from migration. 

I also switched hugepage to not be as aggressive, it was also intrusive and my Postgres not configured to use it. "echo madvise > /sys/kernel/mm/transparent_hugepage/defrag"

After these changes things started smoothing out and running like it should. I also found that if you are running on striped EBS volumes you should really try and get them busy to get consistent performance. I checked with Amazon and the usage I see on the individual modules were correct. (Not the RAID figure but the numbers on the individual volumes, they were idling)

Thanks again for all the help and suggestions. 

Armand


On Tue, Apr 2, 2013 at 11:55 AM, Armand du Plessis <adp@bank.io> wrote:
Jumped the gun a bit. the problem still exists like before. But it's definitely on the right track, below is the output from top in the seconds before the cluster locks up. For some reason still insisting on moving tasks around despite bumping the sched_migration_cost cost up to 100ms. 

77 root      RT   0     0    0    0 S 32.3  0.0  13:55.20 [migration/24]                                                                                                                                                                                                     
26512 postgres  20   0 8601m 7388 4992 R 32.3  0.0   0:02.17 postgres: other_user xxxx(52944) INSERT                                                                                                                                                 
   38 root      RT   0     0    0    0 S 31.3  0.0  17:26.15 [migration/11]                                                                                                                                                                                                        65 root      RT   0     0    0    0 S 30.0  0.0  13:18.66 [migration/20]                                                                                                                                                                                                     
   62 root      RT   0     0    0    0 S 29.7  0.0  12:58.81 [migration/19]                                                                                                                                                                                                     
   47 root      RT   0     0    0    0 S 29.0  0.0  18:16.43 [migration/14]                                                                                                                                                                                                     
   29 root      RT   0     0    0    0 S 28.7  0.0  25:21.47 [migration/8]                                                                                                                                                                                                      
   71 root      RT   0     0    0    0 S 28.4  0.0  13:20.31 [migration/22]                                                                                                                                                                                                     
   95 root      RT   0     0    0    0 S 23.8  0.0  13:37.31 [migration/30]                                                                                                                                                                                                     
26518 postgres  20   0 8601m 9684 5228 S 21.2  0.0   0:01.89 postgres: other_user xxxxx(52954) INSERT                                                                                                                                                 
    6 root      RT   0     0    0    0 S 20.5  0.0  39:17.72 [migration/0]                                                                                                                                                                                                      
   41 root      RT   0     0    0    0 S 19.6  0.0  18:21.36 [migration/12]                                                                                                                                                                                                     
   68 root      RT   0     0    0    0 S 19.6  0.0  13:04.62 [migration/21]                                                                                                                                                                                                     
   74 root      RT   0     0    0    0 S 18.9  0.0  13:39.41 [migration/23]                                                                                                                                                                                                     
  305 root      20   0     0    0    0 S 18.3  0.0  11:34.52 [kworker/27:1]                                                                                                                                                                                                     
   44 root      RT   0     0    0    0 S 17.0  0.0  18:30.71 [migration/13]                                                                                                                                                                                                     
   89 root      RT   0     0    0    0 S 16.0  0.0  12:13.42 [migration/28]                                                                                                                                                                                                     
    7 root      RT   0     0    0    0 S 15.3  0.0  21:58.56 [migration/1]                                                                                                                                                                                                      
   35 root      RT   0     0    0    0 S 15.3  0.0  20:02.05 [migration/10]                                                                                                                                                                                                     
   53 root      RT   0     0    0    0 S 14.0  0.0  12:51.46 [migration/16]                                                                                                                                                                                                     
11254 root       0 -20 21848 7532 2788 S 11.7  0.0  22:35.66 atop 1                                                                                                                                                                                                             
   14 root      RT   0     0    0    0 S 10.8  0.0  19:36.56 [migration/3]                                                                                                                                                                                                      
26463 postgres  20   0 8601m 7492 5100 R 10.8  0.0   0:00.33 postgres: other_user xxxxx(32835) INSERT                                                                                                                                                   
   32 root      RT   0     0    0    0 S 10.1  0.0  20:46.18 [migration/9]                                                                                                                                                                                                      
16793 root      20   0     0    0    0 S  6.5  0.0   1:12.72 [kworker/25:0]                                                                                                                                                                                                     
   20 root      RT   0     0    0    0 S  5.5  0.0  18:51.81 [migration/5]                                                                                                                                                                                                      
   48 root      20   0     0    0    0 S  5.5  0.0   3:52.93 [kworker/14:0]   


On Tue, Apr 2, 2013 at 10:16 AM, Armand du Plessis <adp@bank.io> wrote:
Touch wood but I think I found the problem thanks to these pointers. I checked the vm.zone_reclaim_mode and mine was set to 0. However just before the locking starts I can see many of my CPUs flashing red and jump to high percentage sys usage. When I look at top it's the migration kernel tasks that seem to trigger it. 

So it seems it was a bit trigger happy with task migrations, setting the kernel.sched_migration_cost to 5000000 (5ms) seemed to have resolved my woes. I'm yet to see locks climb and it's been running stable for a bit. This post was invaluable in explaining the cause -> http://www.postgresql.org/message-id/50E4AAB1.9040902@optionshouse.com

# Postgres Kernel Tweaks
kernel.sched_migration_cost = 5000000
# kernel.sched_autogroup_enabled = 0

The second recommended setting 'sched_autogroup_enabled' is not available on the kernel I'm running but it doesn't seem to be a problem. 

Again, thanks again for the help. It was seriously appreciated. Long night was long. 

If things change and the problem pops up again I'll update you guys. 

Cheers,

Armand


On Tue, Apr 2, 2013 at 8:43 AM, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
Also it is worth checking what your sysctl vm.zone_reclaim_mode is set to - if 1 then override to 0. As Jeff mentioned, this gotcha for larger cpu number machines has been discussed at length on this list - but still traps us now and again!

Cheers

Mark