Thread: Problems with PG 9.3
Hello,
We recently moved our production database from 8.4 to 9.3. It's been just two months, and the parallel backup and recovery features have been of great use to us.
But with 9.3 we have also run into a lot of trouble.
Some background:
We run loaders to load data into the DB. These loaders are Perl programs which run on a different server from the DB server. The amount of data loaded each day is pretty high.
Problem:
Some of the loaders are performing badly: they get stuck, or an EOF happens on the client connection and things just die. There is one piece which gets stuck daily ... but once killed and restarted it runs normally.
In one episode today, a loader got stuck for 2 hours 15 minutes. I killed the whole process, restarted it, and it finished within 10 minutes. What I observed was that the DB load hadn't changed or shifted significantly between the kill and the end of the program. At first we thought it was the loaders, but now it looks as if it's the Postgres database.
The piece that was stuck for 2 hours was trying to insert 37K rows into a 500K-row table, which is pretty small considering PG's potential. But we can't work out where the bottleneck is.
Kindly help me find a solution, or the right place to dig in to find the exact problem; it's been troubling some of our loaders for quite some time now.
Please let me know if you need further information regarding the settings on our DB server.
Regards
Dhruv
Have you tried attaching strace to one of the loaders? That might give a clue where it's getting stuck.

- Jay

Sent from my iPad
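A minimal sketch of attaching strace as suggested here (<pid> is a placeholder for the stuck loader process, or for its backend as reported by pg_stat_activity):

strace -p <pid> -f -tt -T -o /tmp/loader.strace
# -f follows child processes; -tt and -T add timestamps and per-call durations,
# so you can see which system call the process is sitting in while it appears stuck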
Right now all we are doing is logging: on each statement in a SQL function we raise an INFO message.
This way we know at which step it gets stuck. Today it got stuck on the insert statement.
And when we ran the same SQL function after killing the process, it ran normally without any issues.
-Dhruv
On Fri, Aug 22, 2014 at 3:11 PM, <jayknowsunix@gmail.com> wrote:
Have you tried attaching strace to one of the loaders? That might give a clue where it's getting stuck.
-
Jay
Sent from my iPad
> On Aug 22, 2014, at 3:23 PM, Dhruv Shukla <dhruvshukla82@gmail.com> wrote:
Regards
Dhruv
404-551-2578
Dhruv Shukla wrote:
> Right now all we are doing is logging: on each statement in a SQL function we raise an INFO message.
> This way we know at which step it gets stuck. Today it got stuck on the insert statement.
> And when we ran the same SQL function after killing the process, it ran normally without any issues.

That sounds a lot like locking problems.

Examine pg_locks when a process gets stuck and see if it is waiting for a lock and who is holding the lock.

Yours,
Laurenz Albe
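A minimal check along those lines, to run while a loader is stuck ("yourdb" is a placeholder; in 9.3 pg_stat_activity has a boolean "waiting" column):

psql -d yourdb -c "SELECT pid, waiting, state, now() - query_start AS runtime, query
                   FROM pg_stat_activity
                   WHERE state <> 'idle'
                   ORDER BY runtime DESC;"

# any pg_locks row with granted = false is a backend waiting on a lock
psql -d yourdb -c "SELECT locktype, relation::regclass, mode, pid, granted
                   FROM pg_locks
                   WHERE NOT granted;"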
We have a script which notifies us by email if there are any locks that a process is waiting for, but nothing has come through ...
Here is one PID which has been stuck since midnight and is still running, but it doesn't look like it will finish anytime soon. I ran it manually with strace attached.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
64144 postgres 20 0 81.2g 4.1g 4.1g R 100.0 1.1 537:59.86 postmaster
I even checked by running each of the lock-monitoring queries individually, but nothing came up.
https://wiki.postgresql.org/wiki/Lock_Monitoring
CentOS 6.3, Postgres 9.3.3, PostGIS.
On Mon, Aug 25, 2014 at 1:54 AM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:
That sounds a lot like locking problems.

Examine pg_locks when a process gets stuck and see if it is waiting for a lock
and who is holding the lock.
Regards
Dhruv
404-551-2578
Other informational details about configurations are:
shared_buffers = 80GB # min 128kB
# (change requires restart)
temp_buffers = 2GB # min 800kB
work_mem = 2GB # min 64kB
maintenance_work_mem = 16GB # min 1MB
checkpoint_segments = 64 # in logfile segments, min 1, 16MB each
#wal_buffers = -1 # min 32kB, -1 sets based on shared_buffers
#checkpoint_timeout = 5min # range 30s-1h
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
#checkpoint_warning = 30s # 0 disables
effective_cache_size = 130GB
Let me know if you need more information.
On Mon, Aug 25, 2014 at 10:04 AM, Dhruv Shukla <dhruvshukla82@gmail.com> wrote:
Regards
Dhruv
404-551-2578
Dhruv Shukla <dhruvshukla82@gmail.com> wrote:
> Other informational details about configurations are:
> shared_buffers = 80GB
> temp_buffers = 2GB
> work_mem = 2GB
> maintenance_work_mem = 16GB

Well, at the default max_connections of 100 that could easily result in the server trying to allocate about 567GB of RAM. If your number of connections is more than 100, adjust proportionally higher. If the server doesn't have that, you could see extreme swapping or various other problems as the OS tries to survive.

> effective_cache_size = 130GB

And the planner will be generating plans based on the assumption that the combination of shared_buffers and the OS cache will have this much RAM available for caching, so if you have max_connections = 100 this configuration would only make sense on a machine with RAM of 617GB or more, plus whatever is needed for the OS and anything besides PostgreSQL that you want to run on the machine.

How much RAM is on the machine (or VM)?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
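A back-of-the-envelope version of that arithmetic (in GB), assuming one work_mem and one temp_buffers allocation per connection plus the default three autovacuum workers; the exact total depends on how many work_mem-sized allocations each query actually makes, so treat these figures as rough:

echo $(( 80 + 100 * (2 + 2) + 3 * 16 ))   # shared_buffers + 100*(work_mem+temp_buffers) + 3*maintenance_work_mem = 528
echo $(( 80 + 1000 * (2 + 2) ))           # with max_connections = 1000, roughly 4TB before maintenance_work_mem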
Kevin,
Currently we have max_connections set to 1000.
And the server has 384GB of RAM.
It has 4 octa-core CPUs, hyperthreaded, so 64 logical cores in total.
-Dhruv
On Mon, Aug 25, 2014 at 10:51 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
How much RAM is on the machine (or VM)?
Regards
Dhruv
404-551-2578
Some more information about memory:
$free -m
total used free shared buffers cached
Mem: 387551 295326 92224 0 49 207903
-/+ buffers/cache: 87373 300177
Swap: 4095 46 4049
$ cat /proc/swaps
Filename Type Size Used Priority
/dev/dm-1 partition 4194296 48100 -1
The DB resides on its own physical server and isn't hosted on a VM.
-Dhruv
On Mon, Aug 25, 2014 at 11:24 AM, Dhruv Shukla <dhruvshukla82@gmail.com> wrote:
Regards
Dhruv
404-551-2578
Dhruv Shukla <dhruvshukla82@gmail.com> wrote:
> Kevin Grittner <kgrittn@ymail.com> wrote:
>> Dhruv Shukla <dhruvshukla82@gmail.com> wrote:
>>> Other informational details about configurations are:
>>> shared_buffers = 80GB
>>> temp_buffers = 2GB
>>> work_mem = 2GB
>>> maintenance_work_mem = 16GB
>> How much RAM is on the machine (or VM)?
> Currently we have max_connections set to 1000.

The temp_buffers setting is a limit *per connection*, and once RAM for a temporary table is allocated to a connection it is never released; so even when all connections are idle they could be setting aside 2TB of RAM for possible use for caching temporary tables.

The work_mem setting is an attempt to limit RAM per node of an executing query; one connection can create many allocations of the size set for work_mem. Since not all queries have nodes that require such allocations, and not all connections are necessarily active at the same time, a general rule of thumb is to allow for one work_mem allocation per connection allowed by max_connections. So these settings can easily result in another 2TB of allocations, beyond the temp_buffers mentioned above.

A high shared_buffers setting can result in "write gluts" at the OS level when a high percentage of that memory becomes dirty and is dumped to the OS in a short time. This can result in the OS appearing to "lock up" for as long as it takes it to flush all of the dirty data to disk. I don't know what your system is capable of, but most database servers I've seen fall between 20MB/second and 200MB/second. That would correspond to apparent stalls lasting between 6.8 minutes and 1.1 hours. By tweaking the bgwriter settings and the OS dirty settings you can drastically reduce this, but I don't think you've indicated having done that, so 80GB can be expected to cause apparent lockups of those durations.

A high shared_buffers setting also makes you more vulnerable to long stalls because of transparent huge page operations in the OS.

> And the server has 384GB of RAM.

And overcommitting RAM by a factor of more than 10x is not going to be pretty in any event.

If I could not reduce max_connections, I would set work_mem to no more than 100MB and temp_buffers to no more than 50MB. I would drop maintenance_work_mem to 2GB. I would probably drastically reduce shared_buffers and would most certainly make autovacuum and bgwriter more aggressive than the default.

If you make those changes and still see a problem, only then is it worth looking at other possible causes.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
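A sketch of the kind of changes described above, with illustrative values only (the 100MB/50MB/2GB figures come from the message; the bgwriter, dirty-memory and THP settings are assumptions to show where the knobs live, not tested recommendations):

# postgresql.conf
work_mem = 100MB
temp_buffers = 50MB
maintenance_work_mem = 2GB
bgwriter_delay = 100ms              # make the background writer run more often than the 200ms default
bgwriter_lru_maxpages = 1000        # and let it write more pages per round

# OS side, to limit the write glut and THP stalls
sysctl -w vm.dirty_background_bytes=268435456   # start background writeback at ~256MB dirty
sysctl -w vm.dirty_bytes=1073741824             # hard limit of ~1GB dirty before writers block
echo never > /sys/kernel/mm/transparent_hugepage/enabled
# on RHEL/CentOS 6 the path may be /sys/kernel/mm/redhat_transparent_hugepage/enabled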
Kevin,
This has been valuable information. Thanks a lot for it.
I was able to reduce max_connections down to 200, but I wanted to run the other parameters by you before I go ahead and make changes.
Now here comes the dilemma for me: in the morning we run memory-intensive processes which easily use 4-10GB of resident memory each. I thought of using work_mem = 1GB, which should commit around 200GB of RAM. And then there are other processes which use temporary tables holding around 200K records.
A temp_buffers setting of 1GB sounds good, but I am slightly worried it could slow down those memory-intensive processes.
Planning on getting shared buffers down to 60GB from 80GB.
OS:
vm.dirty_background_ratio = 10
Regarding vm.dirty_background_ratio, what value do you think would be reasonable for this kind of scenario?
Thanks,
Dhruv
On Mon, Aug 25, 2014 at 12:53 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
Regards
Dhruv
404-551-2578
Hello All,
We changed the settings to:

max_connections = 200
work_mem = 1GB
temp_buffers = 1GB
shared_buffers = 70GB
maintenance_work_mem = 2GB
effective_cache_size = 180GB
max_locks_per_transaction = 128
autovacuum_max_workers = 12
checkpoint_segments = 256
And we ran strace on one of the programs that was getting stuck and dying by itself. Here is the strace log (run from another machine):
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "N\0\0\0LSINFO\0C00000\0M -"..., 16384, 0, NULL, NULL) = 77
write(2, "INFO: ---> did 2\n", 30INFO: ---> did 2
) = 30
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "N\0\0\0MSINFO\0C00000\0M -"..., 16384, 0, NULL, NULL) = 78
write(2, "INFO: ---> did 14\n", 31INFO: ---> did 14
) = 31
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1 <unfinished ...>
On viewing the server logs on the Postgres server, we found the error:
FATAL: connection to client lost
There have been some improvements; a few of the DB functions that used to get stuck now run fine. But the main ones are still failing.
-Dhruv
On Mon, Aug 25, 2014 at 1:48 PM, Dhruv Shukla <dhruvshukla82@gmail.com> wrote:
Regards
Dhruv
404-551-2578
On Tue, Aug 26, 2014 at 9:27 AM, Dhruv Shukla <dhruvshukla82@gmail.com> wrote:
> On viewing the server logs on the Postgres server, we found the error:
> FATAL: connection to client lost

This sounds like a firewall issue to me, but it could be a backend running out of memory and the oom killer in linux killing it. I believe the oom killer logs to /var/log/syslog or /var/log/messages. Got anything in those logs about that time for the oom killer?

If not, it could be a firewall, if you have one, between the app and db boxes. If that's the case you can stop it from happening by lowering the tcp keepalive settings.
On Tue, Aug 26, 2014 at 9:57 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> This sounds like a firewall issue to me, but it could be a backend
> running out of memory and the oom killer in linux killing it.

Please note that the oom killer is deadly to a postgres box, as it can kill the postmaster, which is a bad thing. It should be disabled by setting vm.overcommit_memory=2 in /etc/sysctl.conf and running sysctl -p

--
To understand recursion, one must first understand recursion.
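A quick sketch of both checks (log path as mentioned above; the grep patterns are typical OOM-killer messages, so adjust if your syslog wording differs):

grep -i -E 'out of memory|oom-killer|killed process' /var/log/messages

# disable overcommit and persist it across reboots
echo 'vm.overcommit_memory = 2' >> /etc/sysctl.conf
sysctl -p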
Scott,
Nothing appeared in /var/log/messages about the oom killer on either box, so it could well be a firewall issue; I will look into it and let you know. The current keepalive settings we have are:
tcp_keepalive_time=7200
tcp_keepalive_intvl=75
tcp_keepalive_probes=9
Thanks,
Dhruv Shukla
On Tue, Aug 26, 2014 at 10:58 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
Please note that the oom killer is deadly to a postgres box, as it can
kill the postmaster which is a bad thing. It should be disabled by
setting vm.overcommit_memory=2 in /etc/sysctl.conf and running sysctl -p
Regards
Dhruv
404-551-2578
On Tue, Aug 26, 2014 at 10:24 AM, Dhruv Shukla <dhruvshukla82@gmail.com> wrote:
> tcp_keepalive_time=7200
> tcp_keepalive_intvl=75
> tcp_keepalive_probes=9

In my experience dropping the keepalive to 300, and retries to 2, will keep connections alive without sending out a flood of keepalive pings.
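For example (300 and 2 are the values suggested above; leaving the probe interval at its 75s default is an assumption):

sysctl -w net.ipv4.tcp_keepalive_time=300
sysctl -w net.ipv4.tcp_keepalive_probes=2
sysctl -w net.ipv4.tcp_keepalive_intvl=75
# add the same keys to /etc/sysctl.conf so they survive a reboot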
It's been 15 hours now since the DB was restarted, and things have started to get stuck again. They are apparently taking too long to finish with these settings ... any further suggestions?
-Dhruv.
On Tue, Aug 26, 2014 at 11:32 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
In my experience dropping the keepalive to 300, and retries to 2, will
keep connections alive without sending out a flood of keepalive pings.
Regards
Dhruv
404-551-2578
On Tue, Aug 26, 2014 at 12:22 PM, Dhruv Shukla <dhruvshukla82@gmail.com> wrote:
> It's been 15 hours now since the DB was restarted, and things have started to
> get stuck again. They are apparently taking too long to finish with these
> settings ... any further suggestions?

Troubleshoot it while it's stuck. If your app isn't stopping / erroring out when it loses connection then it's broken and someone needs to code real error handling into it (or you're using a language that's fundamentally broken in terms of handling network errors). Esp because with a lower tcp keepalive the app should be told that the conn died in < 10 minutes.

So I'm going on the assumption that you're losing connection. YOU need to figure out why. Tools like netstat and strace etc are useful here. If a backend is crashing out, there'll be an entry in the pg logs; if networking is killing it then maybe a firewall will have logs; if the oom is killing it then the linux logs on the db server will say. Use tools like sar and sysstat and zabbix and other monitoring packages to see if you're running out of ram and oom is killing processes.

I assume you've lowered your work_mem etc down to something more reasonable, like 16Meg etc. And that you restarted the server after dropping max conns down to 200. Note that 200 is still far too many, and you need to look into a pooler to reduce that number down to < 2 x CPU cores. Anything over that is counter productive and likely to cause performance issues.

Using netstat -an can you find matching connections from the stalled machine to the db server? If not you've lost network connection. If there's no obvious cause in pg or sys logs on the db server then it's networking.
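For the connection check, something like this on both the app box and the db box (5432 assumed as the Postgres port):

netstat -an | grep 5432 | grep ESTABLISHED
# if the stalled loader's connection shows up on the app box but not on the db server
# (or vice versa), the connection has been dropped somewhere in between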
Scott,
Thanks for all this valuable information. I will look into it and debug it.
But now things have really started slowing down. Some programs have started to throw errors like:
ERROR: DB COPY failed: DBD::Pg::db pg_endcopy
failed: ERROR: out of shared memory
HINT: You might need to increase max_locks_per_transaction.
These are things which were working before with the higher memory settings. And some are just running slowly, literally slowly, even though netstat shows they are still connected.
Thanks,
Dhruv
On Tue, Aug 26, 2014 at 2:01 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
Regards
Dhruv
404-551-2578
On Tue, Aug 26, 2014 at 1:32 PM, Dhruv Shukla <dhruvshukla82@gmail.com> wrote:
> ERROR: DB COPY failed: DBD::Pg::db pg_endcopy
> failed: ERROR: out of shared memory
> HINT: You might need to increase max_locks_per_transaction.

OK, increasing max_locks_per_transaction is pretty safe. What did you change shared_buffers from and to? Setting it a bit higher isn't terribly dangerous, the way work_mem x max_conns can be.
Here are all the changes:

max_connections = 1000 -> 200
work_mem = 4GB -> 1GB
temp_buffers = 2GB -> 1GB
shared_buffers = 80GB -> 70GB
maintenance_work_mem = 16GB -> 2GB
effective_cache_size = 130GB -> 180GB
max_locks_per_transaction = 64 -> 128
autovacuum_max_workers = 3 -> 12
checkpoint_segments = 64 -> 256
-Dhruv
On Tue, Aug 26, 2014 at 2:37 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
Regards
Dhruv
404-551-2578
On Tue, Aug 26, 2014 at 1:44 PM, Dhruv Shukla <dhruvshukla82@gmail.com> wrote:

OK, so while this is happening, what does your memory look like on the db box? Is it running low on memory? Or does it have plenty left over for kernel cache? Is it releasing a lot of kernel cache to make room for work_mem? You need to profile your server while this is happening. vmstat, iostat, and htop are good tools to start with. Are you running out of memory, CPU, or IO?
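A minimal profiling sketch along those lines, to run while a loader is stuck (the 5-second sampling interval is an arbitrary choice):

vmstat 5
iostat -x 5
top -b -n 1 -u postgres | head -30   # one-shot snapshot of the busiest postgres backends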
So have you had any luck so far?