Thread: OT: Performance of VM

OT: Performance of VM

From
Thomas Güttler
Date:
This is a bit off-topic, since it is not about the performance of PG itself.

But maybe some have the same issue.

We run PostgreSQL in virtual machines that are provided by our customer.

We are not responsible for the hypervisor and do not have access to it.

The IO performance of our application was terribly slow yesterday.

The users blamed us, but it seems that there was something wrong with the hypervisor.

Next time I would like to have reliable figures to back up my guess that the hypervisor (and not our
application) is the bottleneck.

My vague strategy is to run an IO performance check every N minutes and record the numbers.

Of course I could do some dirty scripting, but I would like to avoid reinventing the wheel. I guess this has already
been solved by people who have more brains and more experience than I have :-)
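
For illustration only, the "check every N minutes and record the numbers" idea could be sketched roughly as below. This is a minimal sketch, not an established tool: the function names, the 1 MiB payload size and the CSV layout are all assumptions made up for this example. It times a write plus fsync to a temporary file and appends the result to a CSV file.

```python
import csv
import os
import tempfile
import time


def time_fsync_write(directory, size_bytes=1024 * 1024):
    """Write size_bytes to a temp file in directory, fsync it, return seconds taken."""
    payload = os.urandom(size_bytes)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            start = time.perf_counter()
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # force the data through to the (virtual) disk
            return time.perf_counter() - start
    finally:
        os.unlink(path)  # always remove the probe file again


def record_sample(csv_path, directory):
    """Append one (unix_timestamp, seconds) row to csv_path and return the latency."""
    elapsed = time_fsync_write(directory)
    with open(csv_path, "a", newline="") as handle:
        csv.writer(handle).writerow([int(time.time()), f"{elapsed:.6f}"])
    return elapsed
```

Calling `record_sample("/var/tmp/io_probe.csv", "/var/lib/postgresql")` from cron every N minutes would build up a history; both paths are hypothetical and the probe directory should sit on the same volume as the database for the numbers to mean anything.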

What do you suggest to get some reliable figures?

Regards,
   Thomas Güttler

-- 
Thomas Guettler http://www.thomas-guettler.de/
I am looking for feedback: https://github.com/guettli/programming-guidelines


Re: OT: Performance of VM

From
Andreas Kretschmer
Date:

On 05.02.2018 at 14:14, Thomas Güttler wrote:
> What do you suggest to get some reliable figures? 

sar is often recommended, see 
https://blog.2ndquadrant.com/in-the-defense-of-sar/.

Can you exclude other reasons like vacuum / vacuum freeze?



Regards, Andreas

-- 
2ndQuadrant - The PostgreSQL Support Company.
www.2ndQuadrant.com



Re: OT: Performance of VM

From
Andrew Kerber
Date:
Have them check the memory and CPU allocation of the hypervisor, make sure it's not overallocated. Make sure the partitions for storage are aligned (see here: https://blogs.vmware.com/vsphere/2011/08/guest-os-partition-alignment.html). Install tuned, and enable the throughput-performance profile. Oracle has a problem with transparent hugepages, Postgres may well have the same problem, so consider disabling transparent hugepages. There is no reason why performance on a VM would be worse than performance on a physical server.
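
As a rough sketch of how the transparent hugepages setting could be checked from inside the guest: on Linux the active mode is the bracketed word in /sys/kernel/mm/transparent_hugepage/enabled (for example "always madvise [never]"). The helper names below are made up for this example, and whether disabling THP actually helps a given Postgres workload remains something to measure.

```python
from pathlib import Path

# Standard sysfs location on Linux; absent on other systems.
THP_PATH = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_setting(text):
    """Return the active value from a line such as 'always madvise [never]'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def current_thp_setting(path=THP_PATH):
    """Return the active THP mode, or None if the file cannot be read."""
    try:
        return parse_thp_setting(path.read_text())
    except OSError:
        return None
```

Logging the result of `current_thp_setting()` alongside other host facts makes it easy to confirm later that the setting did not silently change after a reboot or image update.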


--
Andrew W. Kerber

'If at first you dont succeed, dont take up skydiving.'

Re: OT: Performance of VM

From
Andreas Kretschmer
Date:

On 05.02.2018 at 17:22, Andrew Kerber wrote:
> Oracle has a problem with transparent hugepages, postgres may well 
> have the same problem, so consider disabling transparent hugepages. 

yes, that's true.


Regards, Andreas

-- 
2ndQuadrant - The PostgreSQL Support Company.
www.2ndQuadrant.com



Details after Load Peak was: OT: Performance of VM

From
Thomas Güttler
Date:

On 05.02.2018 at 14:26, Andreas Kretschmer wrote:
>
> On 05.02.2018 at 14:14, Thomas Güttler wrote:
>> What do you suggest to get some reliable figures? 
> 
> sar is often recommended, see https://blog.2ndquadrant.com/in-the-defense-of-sar/.
> 
> Can you exclude other reasons like vacuum / vacuum freeze?

In the current case it was a problem in the hypervisor.

But I want to be prepared for the next time.

The tool sar looks good. This way I can generate a chart where I can see peaks. Nice.

.... But one thing is still unclear. Imagine I see a peak in the chart. The peak
was some hours ago. AFAIK sar has only the aggregated numbers.

But I need to know details if I want to answer the question "Why?". The peak
has gone and ps/top/iotop don't help me anymore.

Any idea?

Regards,
   Thomas Güttler





-- 
Thomas Guettler http://www.thomas-guettler.de/
I am looking for feedback: https://github.com/guettli/programming-guidelines


Re: Details after Load Peak was: OT: Performance of VM

From
Alan Hodgson
Date:
On Tue, 2018-02-06 at 15:31 +0100, Thomas Güttler wrote:
> .... But one thing is still unclear. Imagine I see a peak in the chart. The peak
> was some hours ago. AFAIK sar has only the aggregated numbers.
>
> But I need to know details if I want to answer the question "Why?". The peak
> has gone and ps/top/iotop don't help me anymore.


The typical solution is to store stats on everything you can think of with munin, cacti, ganglia, or similar systems.

I know with ganglia at least, in addition to all the many details it already tracks on a system and the many plugins already available for it, you can write your own plugins or simple agents, so you can keep stats on anything you can code around.

Munin's probably the easiest to try out, though.
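
To give an idea of how small a custom plugin can be, here is a minimal sketch of a Munin plugin. Munin calls a plugin with the argument "config" to get the graph description and without arguments to get the current values. The field name `load1` and the choice of the 1-minute load average as the metric are assumptions picked only to keep the example short.

```python
import os
import sys


def config_lines():
    """Lines describing the graph, printed when Munin calls the plugin with 'config'."""
    return [
        "graph_title Load average (1 min)",
        "graph_vlabel load",
        "graph_category system",
        "load1.label load 1 min",
    ]


def value_lines():
    """Lines with the current value, printed on a normal run."""
    load1, _, _ = os.getloadavg()
    return [f"load1.value {load1:.2f}"]


def main(argv):
    lines = config_lines() if argv[1:2] == ["config"] else value_lines()
    print("\n".join(lines))


if __name__ == "__main__":
    main(sys.argv)
```

Any number that a script can compute (query latency, fsync time, replication lag) can be reported the same way by swapping out `value_lines`.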

Re: OT: Performance of VM

From
Robert Klemme
Date:
On Mon, Feb 5, 2018 at 5:22 PM, Andrew Kerber <andrew.kerber@gmail.com> wrote:
> Have them check the memory and CPU allocation of the hypervisor, make sure
> it's not overallocated. Make sure the partitions for storage are aligned (see
> here:
> https://blogs.vmware.com/vsphere/2011/08/guest-os-partition-alignment.html)
> . Install tuned, and enable the throughput performance profile. Oracle has a
> problem with transparent hugepages, postgres may well have the same problem,
> so consider disabling transparent hugepages.  There is no reason why
> performance on a VM would be worse than performance on a physical server.

Not in theory. But in practice, if you run anything in a VM, as in
this case, you do not know what else is running on that box.
Analyzing these issues can be really cumbersome and tricky. This is
why I am generally skeptical of running a resource-intensive
application like an RDBMS in a VM. To get halfway predictable results
you want at least a minimum of resources (CPU, memory, IO bandwidth)
reserved for that VM.

Anecdote: we once had a customer run our application in a VM (which is
supported) and complain about slowness. Eventually we found out that
they had overcommitted memory: not in total across all VMs, which is
common, but this single VM had been configured with more memory than
was physically available in the machine.

Kind regards

robert

-- 
[guy, jim, charlie].each {|him| remember.him do |as, often| as.you_can
- without end}
http://blog.rubybestpractices.com/


Re: OT: Performance of VM

From
Andrew Kerber
Date:
I am a consultant who specializes in virtualizing Oracle enterprise-level workloads. I’m picking up Postgres as a
secondary skill. You are right: if you don’t manage it properly, you can have problems running enterprise workloads on
VMs. But it can be done with proper management. And the HA and DR advantages of virtual systems are huge.



Re: Details after Load Peak was: OT: Performance of VM

From
"Gunnar \"Nick\" Bluth"
Date:
On 06.02.2018 at 15:31, Thomas Güttler wrote:
>
> On 05.02.2018 at 14:26, Andreas Kretschmer wrote:
>>
>> On 05.02.2018 at 14:14, Thomas Güttler wrote:
>>> What do you suggest to get some reliable figures?
>>
>> sar is often recommended, see
>> https://blog.2ndquadrant.com/in-the-defense-of-sar/.
>>
>> Can you exclude other reasons like vacuum / vacuum freeze?
>
> In the current case it was a problem in the hypervisor.
>
> But I want to be prepared for the next time.
>
> The tool sar looks good. This way I can generate a chart where I can see
> peaks. Nice.
>
> .... But one thing is still unclear. Imagine I see a peak in the chart.
> The peak
> was some hours ago. AFAIK sar has only the aggregated numbers.
>
> But I need to know details if I want to answer the question "Why?". The
> peak
> has gone and ps/top/iotop don't help me anymore.
>
> Any idea?

I love atop (atoptool.nl) for exactly that kind of situation. It will
save a snapshot every 10 minutes by default, which you can then simply
"scroll" back to. It has helped me pinpoint nightly issues countless times.

Only really available for Linux though (in case you're on *BSD).
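
To illustrate the kind of per-process data such a snapshot contains (this is a rough sketch of the idea, not how atop itself is implemented), the cumulative IO counters that Linux exposes in /proc/<pid>/io can be collected as shown below. The function names are made up for this example, and reading other users' processes normally requires root.

```python
import os


def parse_proc_io(text):
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[key.strip()] = int(value)
    return counters


def snapshot(proc_root="/proc"):
    """Return {pid: (command, read_bytes, write_bytes)} for all readable processes."""
    result = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "io")) as handle:
                io = parse_proc_io(handle.read())
            with open(os.path.join(base, "comm")) as handle:
                command = handle.read().strip()
        except OSError:
            continue  # process exited or permission denied
        result[int(entry)] = (command, io.get("read_bytes", 0), io.get("write_bytes", 0))
    return result
```

Because the counters are cumulative, the difference between two stored snapshots shows which processes did the reading and writing during a past interval.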

Best regards,
--
Gunnar "Nick" Bluth
RHCE/SCLA

Mobil +49 172 8853339
Email: gunnar.bluth@pro-open.de
_____________________________________________________________
In 1984 mainstream users were choosing VMS over UNIX.
Ten years later they are choosing Windows over UNIX.
What part of that message aren't you getting? - Tom Payne




Re: Details after Load Peak was: OT: Performance of VM

From
Micky Gough
Date:
+1 for atop. Be sure to adjust the sampling interval so it suits your needs. It'll tell you what caused the spike.

Alternatively you could probably use sysdig, but I expect that'd result in a fair performance hit if your system is already struggling.

Micky




Re: OT: Performance of VM

From
Mark Kirkwood
Date:

On 11/02/18 00:20, Robert Klemme wrote:
> On Mon, Feb 5, 2018 at 5:22 PM, Andrew Kerber <andrew.kerber@gmail.com> wrote:
>> Have them check the memory and CPU allocation of the hypervisor, make sure
>> it's not overallocated. Make sure the partitions for storage are aligned (see
>> here:
>> https://blogs.vmware.com/vsphere/2011/08/guest-os-partition-alignment.html)
>> . Install tuned, and enable the throughput performance profile. Oracle has a
>> problem with transparent hugepages, postgres may well have the same problem,
>> so consider disabling transparent hugepages.  There is no reason why
>> performance on a VM would be worse than performance on a physical server.
> Not theoretically. But in practice if you have anything run in a VM
> like in this case you do not know what else is working on that box.
> Analyzing these issues can be really cumbersome and tricky. This is
> why I am generally skeptical of running a resource intensive
> application like a RDBMS in a VM. To get halfway predictable results
> you want at least a minimum of resources (CPU, memory, IO bandwidth)
> reserved for that VM.
>
> Anecdote: we once had a customer run our application in a VM (which is
> supported) and complain about slowness. Eventually we found out that
> they over committed memory - not in sum for all VMs which is common,
> but this single VM had been configured to have more memory than was
> physically available in the machine.
>

Agreed. If you can get the IO layer to have some type of guaranteed
performance (e.g. AWS Provisioned IOPS), then that is a big help. However
(as you say above) debugging memory and CPU contention (from within the
guest) is tricky indeed.
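
One of the few CPU contention hints visible from inside a Linux guest is "steal" time, the eighth counter on the aggregate cpu line of /proc/stat. A minimal sketch for turning two samples into a percentage follows. The function names are made up for this example, and it is an assumption to verify that the hypervisor in question reports steal time at all (not all of them do).

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text):
    """Return a dict of the first eight counters from the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return dict(zip(FIELDS, map(int, parts[1:9])))
    raise ValueError("no aggregate cpu line found")


def steal_percent(before, after):
    """Percentage of elapsed CPU time reported as steal between two samples."""
    total = sum(after.values()) - sum(before.values())
    if total <= 0:
        return 0.0
    return 100.0 * (after["steal"] - before["steal"]) / total


def read_sample(path="/proc/stat"):
    """Read and parse the current counters (Linux only)."""
    with open(path) as handle:
        return parse_cpu_line(handle.read())
```

Taking `read_sample()` twice a few seconds apart and passing both results to `steal_percent` gives a number that can be recorded over time; a sustained high value suggests the host is busy with other guests.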

Anecdote: we concluded that a VM needed more CPU, so went from 8 to 16 - performance
got significantly *worse*. We prevailed on the devops guys (this was
*not* AWS) to migrate the VM to a less busy host. Everything was fine
thereafter.

regards
Mark