Thread: MusicBrainz postgres performance issues

MusicBrainz postgres performance issues

From
Robert Kaye
Date:
Hi!

We at MusicBrainz have been having trouble with our Postgres install for the past few days. I’ve collected all the relevant information here:

  http://blog.musicbrainz.org/2015/03/15/postgres-troubles/

If anyone could provide tips, suggestions or other relevant advice for what to poke at next, we would love it.

Thanks!

--

--ruaok        

Robert Kaye     --     rob@musicbrainz.org     --    http://musicbrainz.org

Re: MusicBrainz postgres performance issues

From
Josh Krupka
Date:
It sounds like you've hit the postgres basics; what about some of the linux checklist items?

what does free -m show on your db server?

If the load problem really is being caused by swapping when things really shouldn't be swapping, it could be a matter of adjusting your swappiness - what does cat /proc/sys/vm/swappiness show on your server?
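
For reference, checking and lowering it looks roughly like this (the values shown are illustrative, not from your box):

    $ cat /proc/sys/vm/swappiness
    60
    $ sudo sysctl vm.swappiness=10

(sysctl changes don't survive a reboot - put vm.swappiness = 10 in /etc/sysctl.conf to make it stick.)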

There are other linux memory management things that can cause postgres and the server running it to throw fits like THP and zone reclaim.  I don't have enough info about your system to say they are the cause either, but check out the many postings here and other places on the detrimental effect that those settings *can* have.  That would at least give you another angle to investigate.

Re: MusicBrainz postgres performance issues

From
Andreas Kretschmer
Date:
Robert Kaye <rob@musicbrainz.org> wrote:

> Hi!
>
> We at MusicBrainz have been having trouble with our Postgres install for the
> past few days. I’ve collected all the relevant information here:
>
>   http://blog.musicbrainz.org/2015/03/15/postgres-troubles/
>
> If anyone could provide tips, suggestions or other relevant advice for what to
> poke at next, we would love it.


just a wild guess: raid-controller BBU faulty


Andreas
--
Really, I'm not out to destroy Microsoft. That will just be a completely
unintentional side effect.                              (Linus Torvalds)
"If I was god, I would recompile penguin with --enable-fly."   (unknown)
Kaufbach, Saxony, Germany, Europe.              N 51.05082°, E 13.56889°


Re: MusicBrainz postgres performance issues

From
Robert Kaye
Date:

On Mar 15, 2015, at 12:13 PM, Josh Krupka <jkrupka@gmail.com> wrote:

It sounds like you've hit the postgres basics; what about some of the linux checklist items?

what does free -m show on your db server?

             total       used       free     shared    buffers     cached
Mem:         48295      31673      16622          0          5      12670
-/+ buffers/cache:      18997      29298
Swap:        22852       2382      20470


If the load problem really is being caused by swapping when things really shouldn't be swapping, it could be a matter of adjusting your swappiness - what does cat /proc/sys/vm/swappiness show on your server?

0

We adjusted that too, but no effect.

(I’ve updated the blog post with these two comments)


There are other linux memory management things that can cause postgres and the server running it to throw fits like THP and zone reclaim.  I don't have enough info about your system to say they are the cause either, but check out the many postings here and other places on the detrimental effect that those settings *can* have.  That would at least give you another angle to investigate.

If there are specific things you’d like to know, I’d be happy to be a human proxy. :)

Thanks!

--

--ruaok        

Robert Kaye     --     rob@musicbrainz.org     --    http://musicbrainz.org

Re: MusicBrainz postgres performance issues

From
Robert Kaye
Date:
> On Mar 15, 2015, at 12:41 PM, Andreas Kretschmer <akretschmer@spamfence.net> wrote:
>
> just a wild guess: raid-controller BBU faulty

We don’t have a BBU in this server, but at least we have redundant power supplies.

In any case, how would a faulty battery possibly cause this?

--

--ruaok

Robert Kaye     --     rob@musicbrainz.org     --    http://musicbrainz.org



Re: MusicBrainz postgres performance issues

From
Rural Hunter
Date:
Please check this if it helps: http://ubuntuforums.org/showthread.php?t=2258734

On 2015/3/15 18:54, Robert Kaye wrote:
> Hi!
>
> We at MusicBrainz have been having trouble with our Postgres install for the past few days. I’ve collected all the relevant information here:
>
>   http://blog.musicbrainz.org/2015/03/15/postgres-troubles/
>
> If anyone could provide tips, suggestions or other relevant advice for what to poke at next, we would love it.
>
> Thanks!
>
> --
>
> --ruaok
>
> Robert Kaye     --     rob@musicbrainz.org     --    http://musicbrainz.org

Re: MusicBrainz postgres performance issues

From
Andres Freund
Date:
On 2015-03-15 13:08:13 +0100, Robert Kaye wrote:
> > On Mar 15, 2015, at 12:41 PM, Andreas Kretschmer <akretschmer@spamfence.net> wrote:
> >
> > just a wild guess: raid-controller BBU faulty
>
> We don’t have a BBU in this server, but at least we have redundant power supplies.
>
> In any case, how would a fault batter possibly cause this?

Many controllers disable write-back caching when the battery is dead.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: MusicBrainz postgres performance issues

From
Josh Krupka
Date:


On Sun, Mar 15, 2015 at 8:07 AM, Robert Kaye <rob@musicbrainz.org> wrote:
what does free -m show on your db server?

             total       used       free     shared    buffers     cached
Mem:         48295      31673      16622          0          5      12670
-/+ buffers/cache:      18997      29298
Swap:        22852       2382      20470


Hmm that's definitely odd that it's swapping since it has plenty of free memory at the moment.  Is it still under heavy load right now?  Has the output of free consistently looked like that during your trouble times?
 

If the load problem really is being caused by swapping when things really shouldn't be swapping, it could be a matter of adjusting your swappiness - what does cat /proc/sys/vm/swappiness show on your server?

0

We adjusted that too, but no effect.

(I’ve updated the blog post with these two comments)

Had that been updated a while ago, or just now?
 

There are other linux memory management things that can cause postgres and the server running it to throw fits like THP and zone reclaim.  I don't have enough info about your system to say they are the cause either, but check out the many postings here and other places on the detrimental effect that those settings *can* have.  That would at least give you another angle to investigate.

If there are specific things you’d like to know, I’ve be happy to be a human proxy. :)


If zone reclaim is enabled (I think linux usually decides whether or not to enable it at boot time depending on the numa architecture) it sometimes avoids using memory on remote numa nodes if it thinks that memory access is too expensive.  This can lead to way too much disk access (not sure if it would actually make linux swap or not...) and lots of ram sitting around doing nothing instead of being used for fs cache like it should be.  Check to see if zone reclaim is enabled with this command: cat /proc/sys/vm/zone_reclaim_mode.  If your server is a numa one, you can install the numactl utility and look at the numa layout with this: numactl --hardware
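
For example (output values are illustrative):

    $ cat /proc/sys/vm/zone_reclaim_mode
    1
    $ numactl --hardware
    available: 2 nodes (0-1)
    ...

A value of 1 means zone reclaim is on; echo 0 > /proc/sys/vm/zone_reclaim_mode (as root) turns it off.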

I'm not sure how THP would cause lots of swapping, but it's worth checking in general: cat /sys/kernel/mm/transparent_hugepage/enabled.  If it's spending too much time trying to compact memory pages it can cause stalls in your processes.  To get the thp metrics do egrep 'trans|thp' /proc/vmstat
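
For example (the bracketed value is the active mode; sample output only):

    $ cat /sys/kernel/mm/transparent_hugepage/enabled
    [always] madvise never
    $ egrep 'trans|thp' /proc/vmstat
    thp_fault_alloc 102345
    thp_collapse_alloc 998
    ...
    $ echo never > /sys/kernel/mm/transparent_hugepage/enabled   # as root; disables THP until reboot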

Re: MusicBrainz postgres performance issues

From
Ilya Kosmodemiansky
Date:
Hi!

What does your pg_stat_bgwriter show over one day?
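
For example, a daily sample along these lines (a sketch; column names per the 9.x catalog):

    $ psql -c "SELECT checkpoints_timed, checkpoints_req, buffers_checkpoint,
               buffers_clean, maxwritten_clean, buffers_backend, buffers_alloc
               FROM pg_stat_bgwriter;"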


On Mar 15, 2015, at 11:54, Robert Kaye <rob@musicbrainz.org> wrote:

Hi!

We at MusicBrainz have been having trouble with our Postgres install for the past few days. I’ve collected all the relevant information here:

  http://blog.musicbrainz.org/2015/03/15/postgres-troubles/

If anyone could provide tips, suggestions or other relevant advice for what to poke at next, we would love it.

Thanks!

--

--ruaok        

Robert Kaye     --     rob@musicbrainz.org     --    http://musicbrainz.org

Re: MusicBrainz postgres performance issues

From
Tomas Vondra
Date:
On 15.3.2015 13:07, Robert Kaye wrote:
>
>> If the load problem really is being caused by swapping when things
>> really shouldn't be swapping, it could be a matter of adjusting your
>> swappiness - what does cat /proc/sys/vm/swappiness show on your server?
>
> 0
>
> We adjusted that too, but no effect.
>
> (I’ve updated the blog post with these two comments)

IMHO setting swappiness to 0 is way too aggressive. Just set it to
something like 10-20; that works better in my experience.


>> There are other linux memory management things that can cause
>> postgres and the server running it to throw fits like THP and zone
>> reclaim. I don't have enough info about your system to say they are
>> the cause either, but check out the many postings here and other
>> places on the detrimental effect that those settings *can* have.
>> That would at least give you another angle to investigate.
>
> If there are specific things you’d like to know, I’ve be happy to be a
> human proxy. :)

I'd start with vm.* configuration, so the output from this:

# sysctl -a | grep '^vm.*'

and possibly /proc/meminfo. I'm especially interested in the overcommit
settings, because per the free output you provided there's ~16GB of free
RAM.

BTW what amounts of data are we talking about? How large is the database
and how large is the active set?


I also noticed you use kernel 3.2 - that's not the best kernel version
for PostgreSQL - see [1] or [2] for example.

[1]

https://medium.com/postgresql-talk/benchmarking-postgresql-with-different-linux-kernel-versions-on-ubuntu-lts-e61d57b70dd4

[2]
http://www.databasesoup.com/2014/09/why-you-need-to-avoid-linux-kernel-32.html



--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: MusicBrainz postgres performance issues

From
Ilya Kosmodemiansky
Date:


> On Mar 15, 2015, at 13:45, Josh Krupka <jkrupka@gmail.com> wrote:
> Hmm that's definitely odd that it's swapping since it has plenty of free memory at the moment.  Is it still under
heavyload right now?  Has the output of free consistently looked like that during your trouble times? 

And it seems better to disable swappiness




Re: MusicBrainz postgres performance issues

From
Andres Freund
Date:
On 2015-03-15 13:07:25 +0100, Robert Kaye wrote:
>
> > On Mar 15, 2015, at 12:13 PM, Josh Krupka <jkrupka@gmail.com> wrote:
> >
> > It sounds like you've hit the postgres basics, what about some of the linux check list items?
> >
> > what does free -m show on your db server?
>
>              total       used       free     shared    buffers     cached
> Mem:         48295      31673      16622          0          5      12670
> -/+ buffers/cache:      18997      29298
> Swap:        22852       2382      20470

Could you post /proc/meminfo instead? That gives a fair bit more
information.

Also:
* What hardware is this running on?
* Why do you need 500 connections (that are nearly all used) when you
  have a pgbouncer in front of the database? That's not going to be
  efficient.
* Do you have any data tracking the state connections are in?
  I.e. whether they're idle or not? The connections graph on the page
  you linked doesn't give that information.
* You're apparently not graphing CPU usage. How busy are the CPUs? How
  much time is spent in the kernel (i.e. system)?
* Consider installing perf (linux-utils-$something) and doing a
  systemwide profile.
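
For example, a first pass could look like this (the package name varies by distro, e.g. linux-tools-common on Ubuntu):

    $ perf record -a -g -- sleep 60
    $ perf report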

3.2 isn't the greatest kernel around, efficiency wise. At some point you
might want to upgrade to something newer. I've seen remarkable
differences around this.

You really should upgrade postgres to a newer major version one of these
days. Especially 9.2 can give you a remarkable improvement in
performance with many connections in a read-mostly workload.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: MusicBrainz postgres performance issues

From
Scott Marlowe
Date:
On Sun, Mar 15, 2015 at 7:50 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2015-03-15 13:07:25 +0100, Robert Kaye wrote:
>>
>> > On Mar 15, 2015, at 12:13 PM, Josh Krupka <jkrupka@gmail.com> wrote:
>> >
>> > It sounds like you've hit the postgres basics, what about some of the linux check list items?
>> >
>> > what does free -m show on your db server?
>>
>>              total       used       free     shared    buffers     cached
>> Mem:         48295      31673      16622          0          5      12670
>> -/+ buffers/cache:      18997      29298
>> Swap:        22852       2382      20470
>
> Could you post /proc/meminfo instead? That gives a fair bit more
> information.
>
> Also:
> * What hardware is this running on?
> * Why do you need 500 connections (that are nearly all used) when you
>   have a pgbouncer in front of the database? That's not going to be
>   efficient.
> * Do you have any data tracking the state connections are in?
>   I.e. whether they're idle or not? The connections graph on you linked
>   doesn't give that information?
> * You're apparently not graphing CPU usage. How busy are the CPUs? How
>   much time is spent in the kernel (i.e. system)?

htop is a great tool for watching the CPU cores live. Red == kernel btw.

> * Consider installing perf (linux-utils-$something) and doing a
>   systemwide profile.
>
> 3.2 isn't the greatest kernel around, efficiency wise. At some point you
> might want to upgrade to something newer. I've seen remarkable
> differences around this.

That is an understatement. Here's a nice article on why it's borked:

http://www.databasesoup.com/2014/09/why-you-need-to-avoid-linux-kernel-32.html

Had a 32 core machine with big RAID BBU and 512GB memory that was
dying using the 3.2 kernel. Went to 3.11 and load went from 20-40
down to around 5.

> You really should upgrade postgres to a newer major version one of these
> days. Especially 9.2. can give you a remarkable improvement in
> performance with many connections in a read mostly workload.

Agreed. ubuntu 12.04 with kernel 3.11/3.13 with pg 9.2 has been a
great improvement over debian squeeze and pg 8.4 that we were running
at work until recently.

As for the OP: if you've got swap activity causing issues when there's
plenty of free space, just TURN IT OFF.

swapoff -a

I do this on all my big memory servers that don't really need swap,
esp when I was using the 3.2 kernel, which seems broken as regards swap
on bigger memory machines.


Re: MusicBrainz postgres performance issues

From
"Joshua D. Drake"
Date:
On 03/15/2015 05:08 AM, Robert Kaye wrote:
>
>
>> On Mar 15, 2015, at 12:41 PM, Andreas Kretschmer <akretschmer@spamfence.net> wrote:
>>
>> just a wild guess: raid-controller BBU faulty
>
> We don’t have a BBU in this server, but at least we have redundant power supplies.
>
> In any case, how would a fault batter possibly cause this?

The controller would turn off the cache.

JD

>
> --
>
> --ruaok
>
> Robert Kaye     --     rob@musicbrainz.org     --    http://musicbrainz.org
>
>
>


--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, @cmdpromptinc

Now I get it: your service is designed for a customer
base that grew up with Facebook, watches Japanese seizure
robot anime, and has the attention span of a gnat.
I'm not that user., "Tyler Riddle"



Re: MusicBrainz postgres performance issues

From
"Joshua D. Drake"
Date:
On 03/15/2015 09:43 AM, Scott Marlowe wrote:

>> * Consider installing perf (linux-utils-$something) and doing a
>>    systemwide profile.
>>
>> 3.2 isn't the greatest kernel around, efficiency wise. At some point you
>> might want to upgrade to something newer. I've seen remarkable
>> differences around this.

Not at some point, now. 3.2 - 3.8 are undeniably broken for PostgreSQL.

>
> That is an understatement. Here's a nice article on why it's borked:
>
> http://www.databasesoup.com/2014/09/why-you-need-to-avoid-linux-kernel-32.html
>
> Had a 32 core machine with big RAID BBU and 512GB memory that was
> dying using 3.2 kernel. went to 3.11 and it went from a load of 20 to
> 40 to a load of 5.

Yep, I can confirm this behavior.

>
>> You really should upgrade postgres to a newer major version one of these
>> days. Especially 9.2. can give you a remarkable improvement in
>> performance with many connections in a read mostly workload.

Seconded.

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, @cmdpromptinc

Now I get it: your service is designed for a customer
base that grew up with Facebook, watches Japanese seizure
robot anime, and has the attention span of a gnat.
I'm not that user., "Tyler Riddle"



Re: MusicBrainz postgres performance issues

From
Scott Marlowe
Date:
On Sun, Mar 15, 2015 at 10:43 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Sun, Mar 15, 2015 at 7:50 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> On 2015-03-15 13:07:25 +0100, Robert Kaye wrote:
>>>
>>> > On Mar 15, 2015, at 12:13 PM, Josh Krupka <jkrupka@gmail.com> wrote:
>>> >
>>> > It sounds like you've hit the postgres basics, what about some of the linux check list items?
>>> >
>>> > what does free -m show on your db server?
>>>
>>>              total       used       free     shared    buffers     cached
>>> Mem:         48295      31673      16622          0          5      12670
>>> -/+ buffers/cache:      18997      29298
>>> Swap:        22852       2382      20470
>>
>> Could you post /proc/meminfo instead? That gives a fair bit more
>> information.
>>
>> Also:
>> * What hardware is this running on?
>> * Why do you need 500 connections (that are nearly all used) when you
>>   have a pgbouncer in front of the database? That's not going to be
>>   efficient.
>> * Do you have any data tracking the state connections are in?
>>   I.e. whether they're idle or not? The connections graph on you linked
>>   doesn't give that information?
>> * You're apparently not graphing CPU usage. How busy are the CPUs? How
>>   much time is spent in the kernel (i.e. system)?
>
> htop is a great tool for watching the CPU cores live. Red == kernel btw.
>
>> * Consider installing perf (linux-utils-$something) and doing a
>>   systemwide profile.
>>
>> 3.2 isn't the greatest kernel around, efficiency wise. At some point you
>> might want to upgrade to something newer. I've seen remarkable
>> differences around this.
>
> That is an understatement. Here's a nice article on why it's borked:
>
> http://www.databasesoup.com/2014/09/why-you-need-to-avoid-linux-kernel-32.html
>
> Had a 32 core machine with big RAID BBU and 512GB memory that was
> dying using 3.2 kernel. went to 3.11 and it went from a load of 20 to
> 40 to a load of 5.
>
>> You really should upgrade postgres to a newer major version one of these
>> days. Especially 9.2. can give you a remarkable improvement in
>> performance with many connections in a read mostly workload.
>
> Agreed. ubuntu 12.04 with kernel 3.11/3.13 with pg 9.2 has been a
> great improvement over debian squeeze and pg 8.4 that we were running
> at work until recently.
>
> As for the OP. if you've got swap activity causing issues when there's
> plenty of free space just TURN IT OFF.
>
> swapoff -a
>
> I do this on all my big memory servers that don't really need swap,
> esp when I was using hte 3.2 kernel which seems broken as regards swap
> on bigger memory machines.

OK I've now read your blog post. A few pointers I'd make.

shared_mem of 12G is almost always too large. I'd drop it down to ~1G or so.

64MB work_mem AND max_connections = 500 is a recipe for disaster. No
db can actively process 500 queries at once without going kaboom, and
having 64MB work_mem means it will go kaboom long before it reaches
500 active connections. Lower that and let pgbouncer handle the extra
connections for you.

Get some monitoring installed if you don't already have it so you can
track memory usage, cpu usage, disk usage etc. Zabbix or Nagios work
well. Without some kind of system monitoring you're missing half the
information you need to troubleshoot with.

Install iotop, sysstat, and htop. Configure sysstat to collect data so
you can use sar to see what the machine's been doing in the past few
days etc. Set it to 1 minute intervals in the /etc/cron.d/sysstat
file.
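
On a Debian/Ubuntu box that's roughly (a sketch; the helper script name differs on other distros):

    # /etc/cron.d/sysstat
    * * * * * root command -v debian-sa1 > /dev/null && debian-sa1 1 1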

Do whatever you have to to get kernel 3.11 or greater on that machine
(or a new one). You don't have to upgrade pg just yet but the upgrade
of the kernel is essential.

Good luck. Let us know what you find and if you can get that machine
back on its feet.

--
To understand recursion, one must first understand recursion.


Re: MusicBrainz postgres performance issues

From
Scott Marlowe
Date:
On Sun, Mar 15, 2015 at 11:09 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:

Clarification:

> 64MB work mem AND max_connections = 500 is a recipe for disaster. No
> db can actively process 500 queries at once without going kaboom, ad
> having 64MB work_mem means it will go kaboom long before it reaches
> 500 active connections. Lower that and let pgbouncer handle the extra
> connections for you.

Lower max_connections. work_mem 64MB is fine as long as
max_connections is something reasonable (reasonable is generally #CPU
cores * 2 or so).

work_mem is per sort. A single query could easily use 2 or 4x work_mem
all by itself. You can see how having hundreds of active connections
each using 64MB or more at the same time can kill your server.
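
The worst-case arithmetic, assuming every one of 500 connections runs a query doing two 64MB sorts at once:

    $ echo $((500 * 2 * 64))MB
    64000MB

i.e. ~64GB of potential work_mem demand on a 48GB machine.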


Re: MusicBrainz postgres performance issues

From
Andres Freund
Date:
On 2015-03-15 11:09:34 -0600, Scott Marlowe wrote:
> shared_mem of 12G is almost always too large. I'd drop it down to ~1G or so.

I think that's outdated wisdom, i.e. not generally true. I've now seen
a significant number of systems where a larger shared_buffers can help
quite massively.  The primary case where it can, in my experience, go
bad is write-mostly databases where every buffer acquisition has to
write out dirty data while holding locks. Especially during relation
extension that's bad.  A new enough kernel, a sane filesystem
(i.e. not ext3) and sane checkpoint configuration take care of most of
the other disadvantages.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: MusicBrainz postgres performance issues

From
Ilya Kosmodemiansky
Date:
On Sun, Mar 15, 2015 at 8:20 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2015-03-15 11:09:34 -0600, Scott Marlowe wrote:
>> shared_mem of 12G is almost always too large. I'd drop it down to ~1G or so.
>
> I think that's a outdated wisdom, i.e. not generally true.

Quite agreed, with the note that a properly configured controller with BBU is needed.


> A new enough kernel, a sane filesystem
> (i.e. not ext3) and sane checkpoint configuration takes care of most of
> the other disadvantages.

Most likely. And better to be sure that the filesystem is mounted without barriers.

And I agree with Scott - 64MB work_mem AND max_connections = 500 is a
recipe for disaster. The problem could be the session mode of
pgbouncer. If you can work with transaction mode - do it.
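
A pgbouncer.ini sketch of that (the pool sizes here are illustrative, not a recommendation for this particular box):

    [pgbouncer]
    pool_mode = transaction
    default_pool_size = 32
    max_client_conn = 1000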


Best regards,
Ilya Kosmodemiansky,

PostgreSQL-Consulting.com
tel. +14084142500
cell. +4915144336040
ik@postgresql-consulting.com


Re: MusicBrainz postgres performance issues

From
Andres Freund
Date:
On 2015-03-15 20:42:51 +0300, Ilya Kosmodemiansky wrote:
> On Sun, Mar 15, 2015 at 8:20 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2015-03-15 11:09:34 -0600, Scott Marlowe wrote:
> >> shared_mem of 12G is almost always too large. I'd drop it down to ~1G or so.
> >
> > I think that's a outdated wisdom, i.e. not generally true.
>
> Quite agreed. With note, that proper configured controller with BBU is needed.

That imo doesn't really have anything to do with it. The primary benefit
of a BBU with writeback caching is accelerating (near-)synchronous
writes. Like the WAL. But, besides influencing the default for
wal_buffers, a larger shared_buffers doesn't change the amount of
synchronous writes.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: MusicBrainz postgres performance issues

From
Ilya Kosmodemiansky
Date:
On Sun, Mar 15, 2015 at 8:46 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> That imo doesn't really have anything to do with it. The primary benefit
> of a BBU with writeback caching is accelerating (near-)synchronous
> writes. Like the WAL.

My point was that without a proper raid controller (today a bbu is
surely needed for the controller to be a proper one), plus heavy writes
of any kind, it is absolutely impossible to live with large
shared_buffers and without io problems.

>
> Greetings,
>
> Andres Freund
>
> --
>  Andres Freund                     http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services



--
Ilya Kosmodemiansky,

PostgreSQL-Consulting.com
tel. +14084142500
cell. +4915144336040
ik@postgresql-consulting.com


Re: MusicBrainz postgres performance issues

From
Scott Marlowe
Date:
On Sun, Mar 15, 2015 at 11:46 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2015-03-15 20:42:51 +0300, Ilya Kosmodemiansky wrote:
>> On Sun, Mar 15, 2015 at 8:20 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > On 2015-03-15 11:09:34 -0600, Scott Marlowe wrote:
>> >> shared_mem of 12G is almost always too large. I'd drop it down to ~1G or so.
>> >
>> > I think that's a outdated wisdom, i.e. not generally true.
>>
>> Quite agreed. With note, that proper configured controller with BBU is needed.
>
> That imo doesn't really have anything to do with it. The primary benefit
> of a BBU with writeback caching is accelerating (near-)synchronous
> writes. Like the WAL. But, besides influencing the default for
> wal_buffers, a larger shared_buffers doesn't change the amount of
> synchronous writes.

Here's the problem with a large shared_buffers on a machine that's
getting pushed into swap. It starts to swap BUFFERs. Once buffers
start getting swapped you're not just losing performance, that huge
shared_buffers is now working against you because what you THINK are
buffers in RAM to make things faster are in fact blocks on a hard
drive being swapped in and out during reads. It's the exact opposite
of fast. :)


Re: MusicBrainz postgres performance issues

From
Andres Freund
Date:
On 2015-03-15 12:25:07 -0600, Scott Marlowe wrote:
> Here's the problem with a large shared_buffers on a machine that's
> getting pushed into swap. It starts to swap BUFFERs. Once buffers
> start getting swapped you're not just losing performance, that huge
> shared_buffers is now working against you because what you THINK are
> buffers in RAM to make things faster are in fact blocks on a hard
> drive being swapped in and out during reads. It's the exact opposite
> of fast. :)

IMNSHO that's tackling things from the wrong end. If 12GB of shared
buffers drive your 48GB dedicated OLTP postgres server into swapping out
actively used pages, the problem isn't the 12GB of shared buffers, but
that you require so much memory for other things. That needs to be
fixed.

But! We haven't even established that swapping is an actual problem
here. The ~2GB of swapped out memory could just as well be the java raid
controller management monstrosity or something similar. Those pages
won't ever be used and thus can better be used to buffer IO.

You can check what's actually swapped out using:
grep ^VmSwap /proc/[0-9]*/status|grep -v '0 kB'

For swapping to be actually harmful you need to have pages that are
regularly swapped in. vmstat will tell.
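
For example:

    $ vmstat 1 5

Sustained nonzero values in the si/so columns mean pages are actively being swapped in and out; occasional blips are harmless.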

In a concurrent OLTP workload (~450 established connections do suggest
that) with a fair amount of data keeping the hot data set in
shared_buffers can significantly reduce problems. Constantly searching
for victim buffers isn't a nice thing, and that will happen if your most
frequently used data doesn't fit into s_b.  On the other hand, if your
data set is so large that even the hottest part doesn't fit into memory
(perhaps because there's no hottest part as there's no locality at all),
a smaller shared buffers can make things more efficient, because the
search for replacement buffers is cheaper with a smaller shared buffers
setting.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: MusicBrainz postgres performance issues

From
"michael@sqlexec.com"
Date:
How many CPUs are in play here on the PG cluster server?

cat /proc/cpuinfo | grep processor | wc -l

I see you've got pg_stat_statements enabled - what SQL do you see during this heavy load time?  And does EXPLAIN on them show a lot of sorting activity that would require more work_mem?
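
For example (a sketch; column names per the 9.x pg_stat_statements extension):

    $ psql -c "SELECT query, calls, total_time FROM pg_stat_statements
               ORDER BY total_time DESC LIMIT 10;"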

Please enable log_checkpoints, so we can see if your checkpoint_segments is adequate.
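
That's a one-line postgresql.conf change plus a reload:

    log_checkpoints = on

    $ psql -c "SELECT pg_reload_conf();"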


Re: MusicBrainz postgres performance issues

From
Tomas Vondra
Date:
On 15.3.2015 18:54, Ilya Kosmodemiansky wrote:
> On Sun, Mar 15, 2015 at 8:46 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> That imo doesn't really have anything to do with it. The primary
>> benefit of a BBU with writeback caching is accelerating
>> (near-)synchronous writes. Like the WAL.
>
> My point was, that having no proper raid controller (today bbu
> surely needed for the controller to be a proper one) + heavy writes
> of any kind, it is absolutely impossible to live with large
> shared_buffers and without io problems.

That is not really true, IMHO.

The benefit of the write cache is that it can absorb a certain amount of
writes, equal to the size of the cache (nowadays usually 512MB or 1GB),
without forcing them to disks.

It however still has to flush the dirty data to the drives later, but
that side usually has much lower throughput - e.g. while you can easily
write several GB/s to the controller, the drives usually handle only
~1MB/s of random writes each (I assume rotational drives here).

But if you do a lot of random writes (which is likely the case for
write-heavy databases), you'll fill the write cache pretty soon and will
be bounded by the drives anyway.

The controller really can't help with sequential writes much, because
the drives already handle that quite well. And SSDs are a completely
different story of course.

That does not mean the write cache is useless - it can absorb short
bursts of random writes, fix the write hole with RAID5, offload the
parity computation, etc. Whenever someone asks me whether they should
buy a RAID controller with write cache for their database server, my
answer is "absolutely yes" in 95.23% of cases ...

... but really it's not something that magically changes the limits for
write-heavy databases - the main limit are still the drives.

regards
Tomas

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: MusicBrainz postgres performance issues

From
Andres Freund
Date:
On 2015-03-15 20:54:51 +0300, Ilya Kosmodemiansky wrote:
> On Sun, Mar 15, 2015 at 8:46 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > That imo doesn't really have anything to do with it. The primary benefit
> > of a BBU with writeback caching is accelerating (near-)synchronous
> > writes. Like the WAL.
>
> My point was, that having no proper raid controller (today bbu surely
> needed for the controller to be a proper one) + heavy writes of any
> kind, it is absolutely impossible to live with large shared_buffers
> and without io problems.

And my point is that you're mostly wrong. What a raid controller's
writeback cache usefully accelerates is synchronous writes. I.e. writes
that the application waits for. Usually raid controllers don't have much
chance to reorder the queued writes (i.e. turning neighboring writes
into one larger sequential write). What they do excel at is making
synchronous writes to disk return faster because the data is only
written to the controller's memory, not to actual storage. They're
also good at avoiding actual writes to disk when the *same* page is
written to multiple times in a short amount of time.

In postgres, writes for data that goes through shared_buffers are usually
asynchronous. We write them to the OS's page cache when a page is needed
for other contents, is undirtied by the bgwriter, or written out during
a checkpoint; but do *not* wait for the write to hit the disk.  The
controller's write back cache doesn't hugely help here, because it
doesn't make *that* much of a difference whether the dirty data stays in
the kernel's page cache or in the controller.

In contrast to that, writes to the WAL are often more or less
synchronous. We actually wait (via fdatasync()/fsync() syscalls) for
writes to hit disk in a bunch of scenarios, most commonly when
committing a transaction. Unless synchronous_commit = off, every COMMIT
in a transaction that wrote data implies a fdatasync() of a file in
pg_xlog (well, we do optimize that in some conditions, but let's leave
that out for now).

Additionally, if there are many smaller/interleaved transactions, we
will write()/fdatasync() out the same 8kb WAL page repeatedly. Every time
a transaction commits (and some other things) the page that the commit
record is on will be flushed. As the WAL records for insertions,
updates, deletes, commits are frequently much smaller than 8kb, that will
often happen 20-100 times for the same page in OLTP scenarios with narrow
rows.  That's why synchronous_commit = off can be such a huge win for
OLTP write workloads without a writeback cache - synchronous writes are
turned into asynchronous writes, and repetitive writes to the same page
are avoided. It also explains why synchronous_commit = off has much less
of an effect for bulk write workloads: As there are no synchronous disk
writes due to WAL flushes at commit time (there are only very few
commits), synchronous commit doesn't have much of an effect.


That said, there's a couple reasons why you're not completely wrong:

Historically, when using ext3 with data=ordered and some other
filesystems, synchronous writes to one file forced *all* other
previously dirtied data to also be flushed. That means that if you have
pg_xlog and the normal relation files on the same filesystem, the
synchronous writes to the WAL will not only have to write out the new
WAL (often not that much data), but also all the other dirty data.  The
OS will often be unable to do efficient write combining in that case,
because a) there's not yet that much data there, b) postgres doesn't
order writes during checkpoints.  That means that WAL writes will
suddenly have to write out much more data => COMMITs are slow.  That's
where the suggestion to keep pg_xlog on a separate partition largely comes
from.

Writes going through shared_buffers are sometimes indirectly turned into
synchronous writes (from the OS's perspective at least, which means
they'll be done at a higher priority). That happens when the checkpointer
fsync()s all the files at the end of a checkpoint. When things are going
well and checkpoints are executed infrequently and completely guided by
time (i.e. triggered solely by checkpoint_timeout, and not
checkpoint_segments) that's usually not too bad. You'll see a relatively
small latency spike for transactions.
Unfortunately the ext* filesystems have an implementation problem here,
which can make this problem much worse: The way writes are prioritized
during an fsync() can stall out concurrent synchronous reads/writes
pretty badly. That's much less of a problem with e.g. xfs. Which is why
I'd atm not suggest ext4 for write intensive applications.

The other situation where this can lead to big problems is if your
checkpoints aren't scheduled by time (you can recognize that by enabling
log_checkpoints and checking a) that time is the trigger, b) they're
actually happening in an interval consistent with checkpoint_timeout). If
the relation files are not written out in a smoothed out fashion
(configured by checkpoint_completion_target) a *lot* of dirty buffers
can exist in the OS's page cache. Especially because the default 'dirty'
settings in linux on servers with a lot of IO are often completely
insane; especially with older kernels (pretty much everything before
3.11 is badly affected).  The important thing to do here is to configure
checkpoint_timeout, checkpoint_segments and checkpoint_completion_target
in a consistent way.  In my opinion the default checkpoint_timeout is
*way* too low; and just leads to a large increase in overall writes (due
to more frequent checkpoints, repeated writes to the same page aren't
coalesced) *and* an increase in WAL volume (many more full_page_writes).
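
A consistent trio in postgresql.conf might look like this (values illustrative, not tuned for this server; in 9.5+ checkpoint_segments is replaced by max_wal_size):

    checkpoint_timeout = 30min
    checkpoint_segments = 64
    checkpoint_completion_target = 0.9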


Lots more could be written about this topic; but I think I've blathered
on enough for the moment ;)

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: MusicBrainz postgres performance issues

From
Tomas Vondra
Date:
On 15.3.2015 23:47, Andres Freund wrote:
> On 2015-03-15 12:25:07 -0600, Scott Marlowe wrote:
>> Here's the problem with a large shared_buffers on a machine that's
>> getting pushed into swap. It starts to swap BUFFERs. Once buffers
>> start getting swapped you're not just losing performance, that huge
>> shared_buffers is now working against you because what you THINK are
>> buffers in RAM to make things faster are in fact blocks on a hard
>> drive being swapped in and out during reads. It's the exact opposite
>> of fast. :)
>
> IMNSHO that's tackling things from the wrong end. If 12GB of shared
> buffers drive your 48GB dedicated OLTP postgres server into swapping
> out actively used pages, the problem isn't the 12GB of shared
> buffers, but that you require so much memory for other things. That
> needs to be fixed.

I second this opinion.

As was already pointed out, the 500 connections is rather insane
(assuming the machine does not have hundreds of cores).

If there are memory pressure issues, it's likely because many queries
are performing memory-expensive operations at the same time (might even
be a bad estimate causing hashagg to use much more than work_mem).


> But! We haven't even established that swapping is an actual problem
> here. The ~2GB of swapped out memory could just as well be the java raid
> controller management monstrosity or something similar. Those pages
> won't ever be used and thus can better be used to buffer IO.
>
> You can check what's actually swapped out using:
> grep ^VmSwap /proc/[0-9]*/status|grep -v '0 kB'
>
> For swapping to be actually harmful you need to have pages that are
> regularly swapped in. vmstat will tell.

I've already asked for vmstat logs, so let's wait.

> In a concurrent OLTP workload (~450 established connections do
> suggest that) with a fair amount of data keeping the hot data set in
> shared_buffers can significantly reduce problems. Constantly
> searching for victim buffers isn't a nice thing, and that will happen
> if your most frequently used data doesn't fit into s_b. On the other
> hand, if your data set is so large that even the hottest part doesn't
> fit into memory (perhaps because there's no hottest part as there's
> no locality at all), a smaller shared buffers can make things more
> efficient, because the search for replacement buffers is cheaper with
> a smaller shared buffers setting.

I've met many systems with max_connections values this high, and it was
mostly idle connections because of separate connection pools on each
application server. So mostly idle (90% of the time), but at peak time
all the application servers want to do stuff at the same time. And it
all goes KABOOOM! just like here.


--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: MusicBrainz postgres performance issues

From
"michael@sqlexec.com"
Date:
Why is 500 connections "insane"?  We got 32 CPUs with 96GB and 3000 max connections, and we are doing fine, even when hitting our max concurrent connection peaks around 4500.  At a previous site, we were using 2000 max connections on 24 CPUs and 64GB RAM, with about 1500 max concurrent connections.  So I wouldn't be too hasty in saying more than 500 is asking for trouble.  Just as long as you've got your kernel resources set high enough to sustain it (SHMMAX, SHMALL, SEMMNI, and ulimits), and RAM for work_mem.

Re: MusicBrainz postgres performance issues

From
Tomas Vondra
Date:
On 16.3.2015 00:55, michael@sqlexec.com wrote:
> Why is 500 connections "insane". We got 32 CPU with 96GB and 3000
> max connections, and we are doing fine, even when hitting our max
> concurrent connection peaks around 4500. At a previous site, we were
> using 2000 max connections on 24 CPU and 64GB RAM, with about 1500
> max concurrent connections. So I wouldn't be too hasty in saying more
> than 500 is asking for trouble. Just as long as you got your kernel
> resources set high enough to sustain it (SHMMAX, SHMALL, SEMMNI, and
> ulimits), and RAM for work_mem.

If all the connections are active at the same time (i.e. running
queries), they have to share the 32 cores somehow. Or I/O, if that's the
bottleneck.

In other words, you're not improving the throughput of the system,
you're merely increasing latencies. And it may easily happen that the
latency increase is not linear, but grows faster - because of locking,
context switches and other process-related management.

Imagine you have a query taking 1 second of CPU time. If you have 64
such queries running concurrently on 32 cores, each gets only 1/2 a CPU
and so takes >=2 seconds. With 500 queries, it's >=15 seconds each, etc.
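
The arithmetic for the 500-query case:

    $ echo "scale=1; 500 / 32" | bc
    15.6

i.e. each 1-second-of-CPU query sees about 1/15th of a core, so >=15 seconds of wall time.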

If those queries are acquiring the same locks (e.g. updating the same
rows, or so), you can imagine what happens ...

Also, if part of the query required a certain amount of memory for part
of the plan, it now holds that memory for much longer too. That only
increases the chance of OOM issues.

It may work fine when most of the connections are idle, but it makes
storms like this possible.


--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: MusicBrainz postgres performance issues

From
Andres Freund
Date:
(please quote properly)

On 2015-03-15 19:55:23 -0400, michael@sqlexec.com wrote:
> Why is 500 connections "insane".  We got 32 CPU with 96GB and 3000 max
> connections, and we are doing fine, even when hitting our max concurrent
> connection peaks around 4500.  At a previous site, we were using 2000 max
> connections on 24 CPU and 64GB RAM, with about 1500 max concurrent
> connections.  So I wouldn't be too hasty in saying more than 500 is asking
> for trouble.  Just as long as you got your kernel resources set high enough
> to sustain it (SHMMAX, SHMALL, SEMMNI, and ulimits), and RAM for work_mem.

It may work acceptably in some scenarios, but it can lead to significant
problems. Several things in postgres scale linearly (from the
algorithmic point of view; often CPU characteristics like cache sizes
make it worse) with max_connections, most notably acquiring a
snapshot.  It usually works ok enough if you don't have a high number of
queries per second, but if you do, you can run into horrible contention
problems.  Absurdly enough that matters *more* on bigger machines with
several sockets. It's especially bad on 4+ socket systems.

The other aspect is that such a high number of full connections usually
just isn't helpful for throughput. Not even the most massive NUMA
systems (~256 hardware threads is the realistic max atm IIRC) can
process 4.5k queries at the same time.  It'll often be much more
efficient if all connections above a certain number aren't allocated a
full postgres backend, with all its overhead, but use a much more
lightweight pooler connection.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: MusicBrainz postgres performance issues

From
"michael@sqlexec.com"
Date:
I agree with your counter argument about how high max_connections "can" cause problems, but max_connections may not be part of the problem here.  There's a bunch of "it depends" stuff in there based on workload details, # of CPUs, RAM, etc.

I'm still waiting to find out how many CPUs are on this DB server.  Did I miss it somewhere earlier in the thread?

Sunday, March 15, 2015 8:07 PM

If all the connections are active at the same time (i.e. running
queries), they have to share the 32 cores somehow. Or I/O, if that's the
bottleneck.

In other words, you're not improving the throughput of the system,
you're merely increasing latencies. And it may easily happen that the
latency increase is not linear, but grows faster - because of locking,
context switches and other process-related management.

Imagine you have a query taking 1 second of CPU time. If you have 64
such queries running concurrently on 32 cores, each gets only 1/2 a CPU
and so takes >=2 seconds. With 500 queries, it's >=15 seconds per, etc.

If those queries are acquiring the same locks (e.g. updating the same
rows, or so), you can imagine what happens ...

Also, if part of the query required a certain amount of memory for part
of the plan, it now holds that memory for much longer too. That only
increases the change of OOM issues.

It may work fine when most of the connections are idle, but it makes
storms like this possible.


Sunday, March 15, 2015 7:55 PM
Why is 500 connections "insane".  We got 32 CPU with 96GB and 3000 max connections, and we are doing fine, even when hitting our max concurrent connection peaks around 4500.  At a previous site, we were using 2000 max connections on 24 CPU and 64GB RAM, with about 1500 max concurrent connections.  So I wouldn't be too hasty in saying more than 500 is asking for trouble.  Just as long as you got your kernel resources set high enough to sustain it (SHMMAX, SHMALL, SEMMNI, and ulimits), and RAM for work_mem.
Sunday, March 15, 2015 7:41 PM
On 15.3.2015 23:47, Andres Freund wrote:
> On 2015-03-15 12:25:07 -0600, Scott Marlowe wrote:
>> Here's the problem with a large shared_buffers on a machine that's
>> getting pushed into swap. It starts to swap BUFFERs. Once buffers
>> start getting swapped you're not just losing performance, that huge
>> shared_buffers is now working against you because what you THINK are
>> buffers in RAM to make things faster are in fact blocks on a hard
>> drive being swapped in and out during reads. It's the exact opposite
>> of fast. :)
>
> IMNSHO that's tackling things from the wrong end. If 12GB of shared
> buffers drive your 48GB dedicated OLTP postgres server into swapping
> out actively used pages, the problem isn't the 12GB of shared
> buffers, but that you require so much memory for other things. That
> needs to be fixed.
I second this opinion.

As was already pointed out, the 500 connections is rather insane
(assuming the machine does not have hundreds of cores).

If there are memory pressure issues, it's likely because many queries
are performing memory-expensive operations at the same time (might even
be a bad estimate causing hashagg to use much more than work_mem).


> But! We haven't even established that swapping is an actual problem
> here. The ~2GB of swapped out memory could just as well be the java
> raid controller management monstrosity or something similar. Those
> pages won't ever be touched again, and the RAM they occupy is better
> used to buffer IO.
>
> You can check what's actually swapped out using:
> grep ^VmSwap /proc/[0-9]*/status | grep -v '0 kB'
>
> For swapping to be actually harmful you need to have pages that are
> regularly swapped in. vmstat will tell.
I've already asked for vmstat logs, so let's wait.
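
(For reference, a slightly expanded version of that check, printing each
process name next to its swapped-out size; a sketch for a typical Linux
/proc layout:)

# list name and VmSwap for every process with anything swapped out
for d in /proc/[0-9]*; do
    swap=$(grep '^VmSwap' "$d/status" 2>/dev/null | grep -v ' 0 kB')
    [ -n "$swap" ] && printf '%s (%s): %s\n' "${d#/proc/}" \
        "$(cat "$d/comm" 2>/dev/null)" "$swap"
done

# then watch the si/so columns (pages swapped in/out per second);
# sustained non-zero si is the harmful case described above
vmstat 1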

> In a concurrent OLTP workload (~450 established connections do
> suggest that) with a fair amount of data, keeping the hot data set in
> shared_buffers can significantly reduce problems. Constantly
> searching for victim buffers isn't a nice thing, and that will happen
> if your most frequently used data doesn't fit into s_b. On the other
> hand, if your data set is so large that even the hottest part doesn't
> fit into memory (perhaps because there's no hottest part, as there's
> no locality at all), a smaller shared_buffers can make things more
> efficient, because the search for replacement buffers is cheaper with
> a smaller shared_buffers setting.
I've met many systems with max_connections values this high, and it was
mostly idle connections because of separate connection pools on each
application server. So mostly idle (90% of the time), but at peak time
all the application servers want to do stuff at the same time. And it
all goes KABOOOM! just like here.



Re: MusicBrainz postgres performance issues

From
Gavin Flower
Date:
On 16/03/15 13:07, Tomas Vondra wrote:
> On 16.3.2015 00:55, michael@sqlexec.com wrote:
>> Why is 500 connections "insane". We got 32 CPU with 96GB and 3000
>> max connections, and we are doing fine, even when hitting our max
>> concurrent connection peaks around 4500. At a previous site, we were
>> using 2000 max connections on 24 CPU and 64GB RAM, with about 1500
>> max concurrent connections. So I wouldn't be too hasty in saying more
>> than 500 is asking for trouble. Just as long as you got your kernel
>> resources set high enough to sustain it (SHMMAX, SHMALL, SEMMNI, and
>> ulimits), and RAM for work_mem.
[...]
> Also, if a query needed a certain amount of memory for part of its
> plan, it now holds that memory for much longer too. That only
> increases the chance of OOM issues.
>
[...]

Also, you could get a situation where a small number of queries (with
their data, relevant indexes, working memory and so on) all just fit
into RAM, but the extra queries suddenly reduce the available RAM so
that even these well-behaved queries start spilling to disk, on top of
the time required to process the extra queries themselves. So a nicely
behaved system could suddenly get a lot worse, even before you consider
additional lock contention and other nasty things!
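
(A quick way to see that spill effect in action; a sketch using a
synthetic sort and a deliberately tiny work_mem, so nothing here depends
on the MusicBrainz schema:)

psql <<'SQL'
SET work_mem = '1MB';
EXPLAIN ANALYZE SELECT g FROM generate_series(1, 1000000) AS g ORDER BY g;
SQL

# look for "Sort Method: external merge  Disk: ..." in the plan; rerun
# with a larger work_mem and the same sort reports quicksort in memory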

It all depends...


Cheers,
Gavin


Re: MusicBrainz postgres performance issues

From
Roxanne Reid-Bennett
Date:
On 3/15/2015 6:54 AM, Robert Kaye wrote:
> Hi!
>
> We at MusicBrainz have been having trouble with our Postgres install
> for the past few days. I’ve collected all the relevant information
> here:
>
>   http://blog.musicbrainz.org/2015/03/15/postgres-troubles/
>
> If anyone could provide tips, suggestions or other relevant advice
> for what to poke at next, we would love it.
Robert,

Wow - You've engaged the wizards indeed.

I haven't heard or seen anything that would answer my *second* question when faced with this (my first would have been "what changed")....

What is the database actually trying to do when it spikes?  e.g. what queries are running?
Is there any pattern to the specific activity when it spikes (exactly the same query, the same query with different data, or even just the same tables, the same users, or the same apps)?
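
(One way to answer both questions while the spike is happening; a
sketch, assuming PostgreSQL 9.2 or later where pg_stat_activity has the
state and query columns:)

psql -c "SELECT pid, state, now() - query_start AS runtime, query
           FROM pg_stat_activity
          WHERE state <> 'idle'
          ORDER BY runtime DESC;"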

I know from experience that well-behaved queries can stop being well behaved if the underlying data changes.

and for the experts... what would a corrupt index do to memory usage?

Roxanne

Re: MusicBrainz postgres performance issues

From
Robert Kaye
Date:
On March 16, 2015 at 3:24:34 AM, Roxanne Reid-Bennett (rox@tara-lu.com) wrote:
> Robert,
>
> Wow - You've engaged the wizards indeed.
>
> I haven't heard or seen anything that would answer my *second* question
> when faced with this (my first would have been "what changed")....

Yes, indeed — I feel honored to have so many people chime in on this issue.

The problem was that nothing abnormal was happening: the same queries were running that hadn’t given us any problems for months. We undid everything that had recently been changed in an effort to address “what changed”. Nothing helped, which is what had us so perplexed.

However, I am glad to report that our problems are fixed and that our server is back to humming along nicely. 

What we changed:

1. As was pointed out here, max_connections of 500 was in fact insanely high, especially in light of using PGbouncer. Before we used PGbouncer we needed a lot more connections, and when we started using PGbouncer, we never reduced this number.

2. Our server_lifetime was set far too high (1 hour). Josh Berkus suggested lowering that to 5 minutes.

3. We reduced the number of PGbouncer active connections to the DB (a minimal config sketch follows below).
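
(To make that concrete, a minimal pgbouncer.ini sketch; the values are
illustrative rather than our actual settings — the thread only
establishes that server_lifetime came down from the 3600-second default
to 5 minutes and that the pool shrank:)

; illustrative sketch, not the actual MusicBrainz config
[databases]
musicbrainz = host=127.0.0.1 port=5432 dbname=musicbrainz

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
max_client_conn = 1000    ; client slots are cheap to accept...
default_pool_size = 30    ; ...but only this many real backends per db/user
server_lifetime = 300     ; down from the 3600s default, per Josh Berkus
server_idle_timeout = 60  ; retire idle server connections promptly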

What we learned:

1. We had too many backends.

2. The backends were being kept around for too long by PGbouncer.

3. This caused too many idle backends to kick around. Once we exhausted physical RAM, we started swapping.

4. Linux 3.2 apparently has some less than desirable swap behaviours. Once we started swapping, everything went nuts. 

Going forward we’re going to upgrade our kernel the next time we have downtime for our site, and the rest should be sorted now.
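
(The relevant before/after checks, as a sketch for a typical Linux box:)

uname -r                              # 3.0 through 3.8 are the suspect kernels
cat /proc/sys/vm/swappiness
cat /proc/sys/vm/zone_reclaim_mode    # non-zero can hurt on NUMA hardware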

I wanted to thank everyone who contributed their thoughts to this thread — THANK YOU.

And as I said to Josh earlier: "Postgres rocks our world. I’m immensely pleased that once again the problems were our own stupidity and not PG’s fault. In over 10 years of us using PG, it has never been PG’s fault. Not once."

And thus we’re one tiny bit smarter today. Thank you everyone!


P.S. If anyone would still like to get some more information about this problem for their own edification, please let me know. Given that we’ve fixed the issue, I don’t want to spam this list by responding to all the questions that were posed.


--

--ruaok        

Robert Kaye     --     rob@musicbrainz.org     --    http://musicbrainz.org

Re: MusicBrainz postgres performance issues

From
Thomas Kellerer
Date:
Robert Kaye wrote on 16.03.2015 at 13:59:
> However, I am glad to report that our problems are fixed and that our
> server is back to humming along nicely.
>
> And as I said to Josh earlier: "Postgres rocks our world. I’m
> immensely pleased that once again the problems were our own stupidity
> and not PG’s fault. In over 10 years of us using PG, it has never
> been PG’s fault. Not once.”
>
> And thus we’re one tiny bit smarter today. Thank you everyone!
>

I think it would be nice if you could amend your blog posting to include the solution that you found.

Otherwise this will simply stick around as yet another unsolved performance problem.

Thomas

Re: MusicBrainz postgres performance issues

From
Andreas Kretschmer
Date:
Robert Kaye <rob@musicbrainz.org> wrote:

> However, I am glad to report that our problems are fixed and that our server is
> back to humming along nicely.
>
> What we changed:
>
> 1. As was pointed out here, max_connections of 500 was in fact insanely
> high, especially in light of using PGbouncer. Before we used PGbouncer we
> needed a lot more connections, and when we started using PGbouncer, we
> never reduced this number.
>
> 2. Our server_lifetime was set far too high (1 hour). Josh Berkus suggested
> lowering that to 5 minutes.
>
> 3. We reduced the number of PGbouncer active connections to the DB.
>


Many thanks for the feedback!


Andreas
--
Really, I'm not out to destroy Microsoft. That will just be a completely
unintentional side effect.                              (Linus Torvalds)
"If I was god, I would recompile penguin with --enable-fly."   (unknown)
Kaufbach, Saxony, Germany, Europe.              N 51.05082°, E 13.56889°


Re: MusicBrainz postgres performance issues

From
Robert Kaye
Date:
> On Mar 16, 2015, at 2:22 PM, Thomas Kellerer <spam_eater@gmx.net> wrote:
>
> I think it would be nice if you could amend your blog posting to include the solution that you found.
>
> Otherwise this will simply stick around as yet another unsolved performance problem.


Good thinking:

  http://blog.musicbrainz.org/2015/03/16/postgres-troubles-resolved/

I’ve also updated the original post with the link to the above. Case closed. :)

--

--ruaok

Robert Kaye     --     rob@musicbrainz.org     --    http://musicbrainz.org



Re: MusicBrainz postgres performance issues

From
Josh Berkus
Date:
On 03/16/2015 05:59 AM, Robert Kaye wrote:
> 4. Linux 3.2 apparently has some less than desirable swap behaviours.
> Once we started swapping, everything went nuts.

Relevant to this:

http://www.databasesoup.com/2014/09/why-you-need-to-avoid-linux-kernel-32.html

Anybody who is on Linux Kernels 3.0 to 3.8 really needs to upgrade soon.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: MusicBrainz postgres performance issues

From
Joao Junior
Date:
Robert, many thanks for the feedback!

Could you post your new pgbouncer config file?

How many postgresql processes do you have at the OS now with this new conf?

How many clients from the app servers hit your pgbouncer?


Regards,

2015-03-16 11:32 GMT-03:00 Robert Kaye <rob@musicbrainz.org>:

> On Mar 16, 2015, at 2:22 PM, Thomas Kellerer <spam_eater@gmx.net> wrote:
>
> I think it would be nice if you could amend your blog posting to include the solution that you found.
>
> Otherwise this will simply stick around as yet another unsolved performance problem.


Good thinking:

  http://blog.musicbrainz.org/2015/03/16/postgres-troubles-resolved/

I’ve also updated the original post with the link to the above. Case closed. :)

--

--ruaok

Robert Kaye     --     rob@musicbrainz.org     --    http://musicbrainz.org




Re: MusicBrainz postgres performance issues

From
Jim Nasby
Date:
On 3/15/15 7:17 PM, michael@sqlexec.com wrote:
Please avoid top-posting.

> I agree with your counter argument about how a high max_connections "can"
> cause problems, but max_connections may not be part of the problem here.
> There's a bunch of "it depends" in there based on workload details, #
> cpus, RAM, etc.

Sure, but the big, huge danger with a very large max_connections is that
you now have a large grenade with the pin pulled out. If *anything*
happens to disturb the server and push the active connection count past
the number of actual cores, the box is going to fall over and not recover.

In contrast, if max_connections is <= the number of cores this is far
less likely to happen. Each connection will get a CPU to run on, and as
long as they're not all clamoring for the same locks the server will be
making forward progress. Clients may have to wait in the pool for a free
connection for some time, but once they get one their work will get done.
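
(A common starting point for sizing that active pool is the formula from
the PostgreSQL wiki's "Number Of Database Connections" page; a heuristic
to tune from, not a hard rule:)

# pool_size ~= (core_count * 2) + effective_spindle_count
# e.g. 32 cores and 8 effective spindles suggest a pool of roughly 72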

> I'm still waiting to find out how many CPUs are on this DB server.  Did I
> miss it somewhere in the email thread below?

http://blog.musicbrainz.org/2015/03/15/postgres-troubles/ might show it
somewhere...
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


Re: MusicBrainz postgres performance issues

From
Scott Marlowe
Date:
On Mon, Mar 16, 2015 at 6:59 AM, Robert Kaye <rob@musicbrainz.org> wrote:
>
> 4. Linux 3.2 apparently has some less than desirable swap behaviours. Once
> we started swapping, everything went nuts.

On older machines I used to just turn off swap altogether, especially if
I wasn't running out of memory but swap was engaging anyway. swappiness
= 0 didn't help, nothing did; I just kept seeing kswapd working its
butt off doing nothing but hitting the swap partition.
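
(What that looks like in practice; a sketch, and think twice before
doing it on a box that can genuinely run out of RAM:)

sudo swapoff -a    # swaps everything back into RAM and disables swap
# also comment out the swap entry in /etc/fstab to keep it off at reboot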

So glad to be off those old kernels.