Thread: [GENERAL] UDP buffer drops / statistics collector

[GENERAL] UDP buffer drops / statistics collector

From
Tim Kane
Date:
Hi all,

I'm seeing sporadic (but frequent) UDP buffer drops on a host that so far I've not been able to resolve.

The drops are originating from postgres processes, and from what I know - the only UDP traffic generated by postgres should be consumed by the statistics collector - but for whatever reason, it's failing to read the packets quickly enough.

Interestingly, I'm seeing these drops occur even when the system is idle..  but every 15 minutes or so (not consistently enough to isolate any particular activity) we'll see in the order of ~90 packets dropped at a time.

I'm running 9.6.2, but the issue was previously occurring on 9.2.4 (on the same hardware)


If it's relevant..  there are two instances of postgres running (and consequently, 2 instances of the stats collector process) though 1 of those instances is most definitely idle for most of the day.

In an effort to try to resolve the problem, I've increased (x2) the UDP recv buffer sizes on the host - but it seems to have had no effect.

cat /proc/sys/net/core/rmem_max
1677216

The following parameters are configured

track_activities on
track_counts     on
track_functions  none
track_io_timing  off


There are approximately 80-100 connections at any given time.

It seems that the issue started a few weeks ago, around the time of a reboot on the given host... but it's difficult to know what (if anything) has changed, or why :-/


Incidentally... the documentation doesn't seem to have any mention of UDP whatsoever.  I'm going to use this as an opportunity to dive into the source - but perhaps it's worth improving the documentation around this?

My next step is to try disabling track_activities and track_counts to see if they improve matters any, but I wouldn't expect these to generate enough data to flood the UDP buffers :-/

Any ideas?



Re: [GENERAL] UDP buffer drops / statistics collector

From
Tim Kane
Date:
Okay, so I've run an strace on the collector process during a buffer drop event.
I can see evidence of a recvfrom loop pulling in a maximum of 142kb.

While I've had already increased rmem_max, it would appear this is not being observed by the kernel.
rmem_default is set to 124kb, which would explain the above read maxing out just slightly beyond this (presuming a ring buffer filling up behind the read).

I'm going to try increasing rmem_default and see if it has any positive effect.. (and then investigate why the kernel doesn't want to consider rmem_max)..





On Tue, Apr 18, 2017 at 8:05 AM Tim Kane <tim.kane@gmail.com> wrote:
Hi all,

I'm seeing sporadic (but frequent) UDP buffer drops on a host that so far I've not been able to resolve.

The drops are originating from postgres processes, and from what I know - the only UDP traffic generated by postgres should be consumed by the statistics collector - but for whatever reason, it's failing to read the packets quickly enough.

Interestingly, I'm seeing these drops occur even when the system is idle..  but every 15 minutes or so (not consistently enough to isolate any particular activity) we'll see in the order of ~90 packets dropped at a time.

I'm running 9.6.2, but the issue was previously occurring on 9.2.4 (on the same hardware)


If it's relevant..  there are two instances of postgres running (and consequently, 2 instances of the stats collector process) though 1 of those instances is most definitely idle for most of the day.

In an effort to try to resolve the problem, I've increased (x2) the UDP recv buffer sizes on the host - but it seems to have had no effect.

cat /proc/sys/net/core/rmem_max
1677216

The following parameters are configured

track_activities on
track_counts     on
track_functions  none
track_io_timing  off


There are approximately 80-100 connections at any given time.

It seems that the issue started a few weeks ago, around the time of a reboot on the given host... but it's difficult to know what (if anything) has changed, or why :-/


Incidentally... the documentation doesn't seem to have any mention of UDP whatsoever.  I'm going to use this as an opportunity to dive into the source - but perhaps it's worth improving the documentation around this?

My next step is to try disabling track_activities and track_counts to see if they improve matters any, but I wouldn't expect these to generate enough data to flood the UDP buffers :-/

Any ideas?



Re: [GENERAL] UDP buffer drops / statistics collector

From
Tim Kane
Date:
Well, this is frustrating..
The buffer drops are still occurring - so I thought it worth trying use a ramdisk and set stats_temp_directory accordingly.

I've reloaded the instance, and can see that the stats directory is now being populated in the new location.  Except - there is one last file (pgss_query_texts.stat) that continues to be updated in the old pg_stat_tmp path..  Is that supposed to happen?


Fairly similar to this guy (but not quite the same).

I can see the packets arriving and being consumed by the collector..  and, the collector is indeed updating in the new stats_temp_directory.. just not for that one file.


It also failed to resolve the buffer drops.. At this point, I'm not sure I expected it to.  They tend to occur semi-regularly (every 8-13 minutes) but I can't correlate them with any kind of activity (and if I'm honest, it's possibly starting to drive me a little bit mad).




On Tue, Apr 18, 2017 at 2:53 PM Tim Kane <tim.kane@gmail.com> wrote:
Okay, so I've run an strace on the collector process during a buffer drop event.
I can see evidence of a recvfrom loop pulling in a maximum of 142kb.

While I've had already increased rmem_max, it would appear this is not being observed by the kernel.
rmem_default is set to 124kb, which would explain the above read maxing out just slightly beyond this (presuming a ring buffer filling up behind the read).

I'm going to try increasing rmem_default and see if it has any positive effect.. (and then investigate why the kernel doesn't want to consider rmem_max)..





On Tue, Apr 18, 2017 at 8:05 AM Tim Kane <tim.kane@gmail.com> wrote:
Hi all,

I'm seeing sporadic (but frequent) UDP buffer drops on a host that so far I've not been able to resolve.

The drops are originating from postgres processes, and from what I know - the only UDP traffic generated by postgres should be consumed by the statistics collector - but for whatever reason, it's failing to read the packets quickly enough.

Interestingly, I'm seeing these drops occur even when the system is idle..  but every 15 minutes or so (not consistently enough to isolate any particular activity) we'll see in the order of ~90 packets dropped at a time.

I'm running 9.6.2, but the issue was previously occurring on 9.2.4 (on the same hardware)


If it's relevant..  there are two instances of postgres running (and consequently, 2 instances of the stats collector process) though 1 of those instances is most definitely idle for most of the day.

In an effort to try to resolve the problem, I've increased (x2) the UDP recv buffer sizes on the host - but it seems to have had no effect.

cat /proc/sys/net/core/rmem_max
1677216

The following parameters are configured

track_activities on
track_counts     on
track_functions  none
track_io_timing  off


There are approximately 80-100 connections at any given time.

It seems that the issue started a few weeks ago, around the time of a reboot on the given host... but it's difficult to know what (if anything) has changed, or why :-/


Incidentally... the documentation doesn't seem to have any mention of UDP whatsoever.  I'm going to use this as an opportunity to dive into the source - but perhaps it's worth improving the documentation around this?

My next step is to try disabling track_activities and track_counts to see if they improve matters any, but I wouldn't expect these to generate enough data to flood the UDP buffers :-/

Any ideas?



Re: [GENERAL] UDP buffer drops / statistics collector

From
Alban Hertroys
Date:
> On 19 Apr 2017, at 20:36, Tim Kane <tim.kane@gmail.com> wrote:
>
> Well, this is frustrating..
> The buffer drops are still occurring - so I thought it worth trying use a ramdisk and set stats_temp_directory
accordingly.
>
> I've reloaded the instance, and can see that the stats directory is now being populated in the new location.  Except
-there is one last file (pgss_query_texts.stat) that continues to be updated in the old pg_stat_tmp path..  Is that
supposedto happen? 
>
>
> Fairly similar to this guy (but not quite the same).
> https://www.postgresql.org/message-id/D6E71BEFAD7BEB4FBCD8AE74FADB1265011BB40FC749@win-8-eml-ex1.eml.local
>
> I can see the packets arriving and being consumed by the collector..  and, the collector is indeed updating in the
newstats_temp_directory.. just not for that one file. 
>
>
> It also failed to resolve the buffer drops.. At this point, I'm not sure I expected it to.  They tend to occur
semi-regularly(every 8-13 minutes) but I can't correlate them with any kind of activity (and if I'm honest, it's
possiblystarting to drive me a little bit mad). 

This rings a bell for me. I recently had a similar issue in an MMO (Windows) where every 15 minutes I would get a
numberof consecutive freezes in-game. You could set your alarm by it, so regular. 

That suddenly went away after I rearranged my home-network (for unrelated reasons), which incidentally moved several
connectionsfrom the switch the game-system was connected to to another switch. I never pinpointed it to UDP, but then
again,TCP would correct for the lost transfers (probably at the cost of UDP traffic). 

Perhaps you have a switch somewhere that's overburdened?

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.



Re: [GENERAL] UDP buffer drops / statistics collector

From
Tim Kane
Date:

Ok, fixed it! :D
Posting here for future me (others like me).

It would seem (having not read kernel source) that increasing the kernel buffer sizes (rmin_default / rmin_max) does not take effect for any processes that are already bound or listening to a port/socket.  I had previously assumed this was a global kernel buffer, perhaps not.

I had been trying to manage these buffer sizes to resolve the UDP drop issues, but I had not at any time restarted the statistics collector process.  I restarted the cluster in a moment of last resort (something I had tried numerous times before playing with the buffer sizes) and lo and behold!! no more buffer drops!

Problem solved.
The pgss_query_texts.stat still wants to live in the default pg_stat_tmp directory, wether by design or not..  but that's a non-issue for me now.

Thanks for listening :)



On Wed, Apr 19, 2017 at 7:36 PM Tim Kane <tim.kane@gmail.com> wrote:
Well, this is frustrating..
The buffer drops are still occurring - so I thought it worth trying use a ramdisk and set stats_temp_directory accordingly.

I've reloaded the instance, and can see that the stats directory is now being populated in the new location.  Except - there is one last file (pgss_query_texts.stat) that continues to be updated in the old pg_stat_tmp path..  Is that supposed to happen?


Fairly similar to this guy (but not quite the same).

I can see the packets arriving and being consumed by the collector..  and, the collector is indeed updating in the new stats_temp_directory.. just not for that one file.


It also failed to resolve the buffer drops.. At this point, I'm not sure I expected it to.  They tend to occur semi-regularly (every 8-13 minutes) but I can't correlate them with any kind of activity (and if I'm honest, it's possibly starting to drive me a little bit mad).




On Tue, Apr 18, 2017 at 2:53 PM Tim Kane <tim.kane@gmail.com> wrote:
Okay, so I've run an strace on the collector process during a buffer drop event.
I can see evidence of a recvfrom loop pulling in a maximum of 142kb.

While I've had already increased rmem_max, it would appear this is not being observed by the kernel.
rmem_default is set to 124kb, which would explain the above read maxing out just slightly beyond this (presuming a ring buffer filling up behind the read).

I'm going to try increasing rmem_default and see if it has any positive effect.. (and then investigate why the kernel doesn't want to consider rmem_max)..





On Tue, Apr 18, 2017 at 8:05 AM Tim Kane <tim.kane@gmail.com> wrote:
Hi all,

I'm seeing sporadic (but frequent) UDP buffer drops on a host that so far I've not been able to resolve.

The drops are originating from postgres processes, and from what I know - the only UDP traffic generated by postgres should be consumed by the statistics collector - but for whatever reason, it's failing to read the packets quickly enough.

Interestingly, I'm seeing these drops occur even when the system is idle..  but every 15 minutes or so (not consistently enough to isolate any particular activity) we'll see in the order of ~90 packets dropped at a time.

I'm running 9.6.2, but the issue was previously occurring on 9.2.4 (on the same hardware)


If it's relevant..  there are two instances of postgres running (and consequently, 2 instances of the stats collector process) though 1 of those instances is most definitely idle for most of the day.

In an effort to try to resolve the problem, I've increased (x2) the UDP recv buffer sizes on the host - but it seems to have had no effect.

cat /proc/sys/net/core/rmem_max
1677216

The following parameters are configured

track_activities on
track_counts     on
track_functions  none
track_io_timing  off


There are approximately 80-100 connections at any given time.

It seems that the issue started a few weeks ago, around the time of a reboot on the given host... but it's difficult to know what (if anything) has changed, or why :-/


Incidentally... the documentation doesn't seem to have any mention of UDP whatsoever.  I'm going to use this as an opportunity to dive into the source - but perhaps it's worth improving the documentation around this?

My next step is to try disabling track_activities and track_counts to see if they improve matters any, but I wouldn't expect these to generate enough data to flood the UDP buffers :-/

Any ideas?



Re: [GENERAL] UDP buffer drops / statistics collector

From
Jan de Visser
Date:
On Thursday, April 20, 2017 3:38:42 AM EDT Tim Kane wrote:
> The pgss_query_texts.stat still wants to live in the default *pg_stat_tmp*
> directory, wether by design or not..  but that's a non-issue for me now.

A 30 second investigation of the source seems to indicate that that directory
is hardcoded (i.e. not configurable). But I may be missing some of the
intricacies of the GUC mechanisms.