Thread: stats for network traffic WIP

stats for network traffic WIP

From
Nigel Heron
Date:
Hi, I've been using postgres for many years but never took the time to play with the code until now. As a learning experience I came up with this WIP patch to keep track of the number of bytes sent and received by the server over its communication sockets. Counters are kept per database, per connection and globally/shared.
The counters are incremented for TCP (remote and localhost) and for Unix sockets. The major WIP issue so far is that connections using SSL aren't counted properly. If there's any interest, I'll keep working on it.

a few functions are added:
- pg_stat_get_bytes_sent() returns the total count of outgoing bytes for the whole cluster (all dbs and all connections including replication)
- pg_stat_get_bytes_received() same but for incoming data
- pg_stat_get_db_bytes_sent(oid) returns count of outgoing bytes for a specific database
- pg_stat_get_db_bytes_received(oid) same but for incoming data

"bytes_sent" and "bytes_received" columns are added to:
- pg_stat_get_activity function
- pg_stat_activity view
- pg_stat_database view
- pg_stat_replication view

The counters are reset with the existing reset functions, but a new parameter value is added for the shared stats call (I named it "socket" for lack of imagination), e.g. pg_stat_reset_shared('socket').

some benefits of the patch:
- can be used to track bandwidth usage of postgres, useful if the host isn't a dedicated db server, where host level statistics would include other traffic.
- can track bandwidth usage of streaming replication.
- can be used to find misbehaving connections.
- can be used in multi-user/multi-database clusters for resource usage tracking.
- competing databases have such metrics.
- could also be added to pg_stat_statements for extra debugging.
- etc.?

some negatives:
- extra code is called for each send() and recv(), I haven't measured the performance impact yet. (but can be turned off using track_counts=off)
- stats collector has more work to do.
- some stats structs are changed which will cause an error while trying to load them from disk the first time and the old stats will be lost.
- PL functions that create their own sockets aren't tracked.
- sockets from FDW calls aren't tracked.
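
As a rough illustration of the counting idea (not the patch itself, which is C code inside the backend), the sketch below wraps a Python socket so that every send() and recv() bumps a byte counter. The class name CountingSocket and its attributes are hypothetical and exist only for this example.

```python
import socket


class CountingSocket:
    """Wrap a socket and count the bytes that pass through send() and recv()."""

    def __init__(self, sock):
        self._sock = sock
        self.bytes_sent = 0
        self.bytes_received = 0

    def send(self, data):
        # send() may transmit fewer bytes than requested, so count its return value.
        n = self._sock.send(data)
        self.bytes_sent += n
        return n

    def recv(self, bufsize):
        data = self._sock.recv(bufsize)
        self.bytes_received += len(data)
        return data

    def close(self):
        self._sock.close()


def demo():
    # socketpair() gives two connected sockets without needing a network.
    a, b = socket.socketpair()
    client, server = CountingSocket(a), CountingSocket(b)
    try:
        client.send(b"SELECT 1")
        server.recv(1024)
        server.send(b"ok")
        client.recv(1024)
        return (client.bytes_sent, client.bytes_received,
                server.bytes_sent, server.bytes_received)
    finally:
        client.close()
        server.close()
```

The extra work per call is one addition, which is why the open question is less the arithmetic than how often the counters get reported to the stats collector.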

To debug the counters, I'm using clients connected through haproxy to generate traffic and then comparing haproxy's stats with what pg stores in pg_stat/global.stat on shutdown. Attached is a very basic python script that can read the global.stat file (it takes the DATADIR as a parameter).

Any feedback is appreciated,
-nigel.
Attachment

Re: stats for network traffic WIP

From
Stephen Frost
Date:
Nigel,

* Nigel Heron (nheron@querymetrics.com) wrote:
> Hi, I've been using postgres for many years but never took the time to play
> with the code until now. As a learning experience i came up with this WIP
> patch to keep track of the # of bytes sent and received by the server over
> it's communication sockets. Counters are kept per database, per connection
> and globally/shared.

Very neat idea.  Please add it to the current commitfest
(http://commitfest.postgresql.org) and, ideally, someone will get in and
review it during the next CM.
Thanks!
    Stephen

Re: stats for network traffic WIP

From
Mike Blackwell
Date:
I added this to the current CF, and am starting to review it as I have time.

__________________________________________________________________________________
Mike Blackwell | Technical Analyst, Distribution Services/Rollout Management | RR Donnelley
1750 Wallace Ave | St Charles, IL 60174-3401
Office: 630.313.7818
Mike.Blackwell@rrd.com
http://www.rrdonnelley.com




On Mon, Oct 21, 2013 at 11:32 AM, Stephen Frost <sfrost@snowman.net> wrote:
> Very neat idea.  Please add it to the current commitfest
> (http://commitfest.postgresql.org) and, ideally, someone will get in and
> review it during the next CM.

Re: stats for network traffic WIP

From
Nigel Heron
Date:
Hi, thanks, I'm still actively working on this patch. I've gotten the
traffic counters working when using SSL enabled clients (includes the
SSL overhead now) but I still have the walsender transfers under SSL
to work on.
I'll post an updated patch when I have it figured out.
Since the patch changes some views in pg_catalog, a regression test
fails. I'm not sure what to do next: change the regression test in
the patch, or wait until the review phase?

I was also thinking of adding global counters for the stats collector
(pg_stat* file read/write bytes + packets lost) and also log file I/O
(bytes written for txt and csv formats). Any interest?

-nigel.

On Wed, Oct 23, 2013 at 12:50 PM, Mike Blackwell <mike.blackwell@rrd.com> wrote:
> I added this to the current CF, and am starting to review it as I have time.



Re: stats for network traffic WIP

From
Mike Blackwell
Date:
Sounds good.  I personally don't have any interest in log file i/o counters, but that's just me.  I wonder if stats collector counters might be useful... I seem to recall an effort to improve that area.  Maybe not enough use to take the performance hit on a regular basis, though.


Re: stats for network traffic WIP

From
Atri Sharma
Date:
On Thu, Oct 24, 2013 at 12:23 AM, Mike Blackwell <mike.blackwell@rrd.com> wrote:
> Sounds good.  I personally don't have any interest in log file i/o counters,
> but that's just me.  I wonder if stats collector counters might be useful...
> I seem to recall an effort to improve that area.  Maybe not enough use to
> take the performance hit on a regular basis, though.
>


+1.

I tend to be a bit touchy about any changes to code that runs
frequently. We need to seriously test if the overhead added by this
patch is worth it.

IMO, the idea is pretty good. It's just that we need to do some wide
spectrum performance testing. That's only my thought though.

Regards,

Atri




Re: stats for network traffic WIP

From
Mike Blackwell
Date:

On Wed, Oct 23, 2013 at 1:58 PM, Atri Sharma <atri.jiit@gmail.com> wrote:
> IMO, the idea is pretty good. Its just that we need to do some wide
> spectrum performance testing. Thats only my thought though.


I'm looking at trying to do some performance testing on this.  Any suggestions on test scenarios, etc? 




Re: stats for network traffic WIP

From
Nigel Heron
Date:
On Wed, Oct 23, 2013 at 2:58 PM, Atri Sharma <atri.jiit@gmail.com> wrote:
> On Thu, Oct 24, 2013 at 12:23 AM, Mike Blackwell <mike.blackwell@rrd.com> wrote:
>> Sounds good.  I personally don't have any interest in log file i/o counters,
>> but that's just me.  I wonder if stats collector counters might be useful...
>> I seem to recall an effort to improve that area.  Maybe not enough use to
>> take the performance hit on a regular basis, though.
>>
>
>
> +1.
>
> I tend to be a bit touchy about any changes to code that runs
> frequently. We need to seriously test if the overhead added by this
> patch is worth it.
>
> IMO, the idea is pretty good. Its just that we need to do some wide
> spectrum performance testing. Thats only my thought though.
>

I didn't implement the code yet, but my impression is that since it
will be the stats collector gathering counters about itself there will
be very little overhead (no message passing, etc.): just a few int
calculations and storing a few more bytes in the global stats file.
The log file I/O tracking would generate some overhead though, similar
to network stats tracking.
I think the stats collector concerns voiced previously on the list
were more about per-relation stats, which create a lot of I/O on servers
with many tables. Adding global stats doesn't seem as bad to me.

-nigel.



Re: stats for network traffic WIP

From
Atri Sharma
Date:
On Thu, Oct 24, 2013 at 12:30 AM, Mike Blackwell <mike.blackwell@rrd.com> wrote:
>
> On Wed, Oct 23, 2013 at 1:58 PM, Atri Sharma <atri.jiit@gmail.com> wrote:
>
>>
>> IMO, the idea is pretty good. Its just that we need to do some wide
>> spectrum performance testing. Thats only my thought though.
>
>
>
> I'm looking at trying to do some performance testing on this.  Any
> suggestions on test scenarios, etc?

Umm...Lots of clients together would be the first obvious testing that
comes to my mind.

One thing to look at would be erratic clients. If some clients connect
and disconnect within a short span of time, we should look if the
collector works fine there.

Also, we should verify the accuracy of the statistics collected. A
small deviation is fine, but we should do a formal test, just to be
sure.

Does anyone think that the new untracked ports introduced by the patch
could pose a problem? I am not sure there.

I haven't taken a deep look at the patch yet, but I will try to do so.
However, since I will be in Dublin next week, my inputs may be
delayed a bit. The plus side is that I will discuss this
with lots of people there.

Adding myself as the co-reviewer specifically for testing
purposes, if it's OK with you.

Regards,

Atri









Re: stats for network traffic WIP

From
Mike Blackwell
Date:


On Wed, Oct 23, 2013 at 2:10 PM, Atri Sharma <atri.jiit@gmail.com> wrote:
> Adding myself as the co reviewer specifically for the testing
> purposes, if its ok with you.

It's perfectly fine with me.  Please do!



Re: stats for network traffic WIP

From
Nigel Heron
Date:
On Wed, Oct 23, 2013 at 2:44 PM, Nigel Heron <nheron@querymetrics.com> wrote:
> Hi, thanks, I'm still actively working on this patch. I've gotten the
> traffic counters working when using SSL enabled clients (includes the
> ssl overhead now) but I still have the walsender transfers under SSL
> to work on.
> I'll post an updated patch when i have it figured out.
> Since the patch changes some views in pg_catalog, a regression test
> fails .. i'm not sure what to do next. Change the regression test in
> the patch, or wait until the review phase?
>

here's v2 of the patch including the regression test update.
I omitted socket counters for walreceivers; I couldn't get them
working under SSL. Since they are using the frontend libpq libs I
would have to duplicate a lot of the code in the backend to be able to
instrument them under SSL (add OpenSSL BIO custom send/recv like the
backend has), and I'm not sure it's worth it. We can get the data from the
master's pg_stat_replication view anyway. I'm open to suggestions.

So, for now, the counters only track sockets created from an inbound
(client to server) connection.


-nigel.

Attachment

Re: stats for network traffic WIP

From
Nigel Heron
Date:
>
> So, for now, the counters only track sockets created from an inbound
> (client to server) connection.

here's v3 of the patch (rebase and cleanup).

-nigel.

Attachment

Re: stats for network traffic WIP

From
Greg Stark
Date:

On Mon, Oct 21, 2013 at 5:14 AM, Nigel Heron <nheron@querymetrics.com> wrote:
> - can be used to find misbehaving connections.
> - can be used in multi-user/multi-database clusters for resource usage tracking.
> - competing databases have such metrics.

The most interesting thing that I could see calculating from these stats would require also knowing how much time was spent waiting on writes and reads on the network. With the cumulative time spent as well as the count of syscalls you can calculate the average latency over any time period between two snapshots. However that would involve adding two gettimeofday calls which would be quite likely to cause a noticeable impact on some architectures. Unless there's already a pair of gettimeofday calls you can piggy back onto?
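
To make the calculation concrete, here is a minimal sketch of the snapshot arithmetic described above. It assumes two hypothetical cumulative counters, a syscall count and a total wait time, neither of which exists in the patch as posted; the names IoSnapshot and average_latency are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class IoSnapshot:
    """Cumulative counters sampled at one point in time (hypothetical fields)."""
    syscalls: int          # total send()/recv() calls so far
    wait_seconds: float    # total time spent waiting in those calls


def average_latency(earlier, later):
    """Average wait per syscall between two snapshots, or None if no calls occurred."""
    calls = later.syscalls - earlier.syscalls
    if calls <= 0:
        # No traffic in the interval, or the counters were reset in between.
        return None
    return (later.wait_seconds - earlier.wait_seconds) / calls
```

For example, going from 100 calls and 0.5 s of waiting to 300 calls and 1.5 s gives 1.0 s over 200 calls, or 5 ms per call.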


--
greg

Re: stats for network traffic WIP

From
Nigel Heron
Date:
On Tue, Oct 29, 2013 at 11:26 AM, Nigel Heron <nheron@querymetrics.com> wrote:
>>
>> So, for now, the counters only track sockets created from an inbound
>> (client to server) connection.
>
> here's v3 of the patch (rebase and cleanup).
>

Hi,
here's v4 of the patch. I added documentation and a new global view
called "pg_stat_socket" (includes bytes_sent, bytes_received and
stats_reset time)

thanks,
-nigel.

Attachment

Re: stats for network traffic WIP

From
Mike Blackwell
Date:
Patch applies and builds against git HEAD (as of 6790e738031089d5).  "make check" runs cleanly as well.

The new features appear to work as advertised as far as I've been able to check.

The code looks good as far as I can see.  Documentation patches are included for the new features.

Still to be tested: 
the counts for streaming replication (no replication setup here to test against yet).


Re: stats for network traffic WIP

From
Mike Blackwell
Date:
Also still to be tested: performance impact.  

On Fri, Nov 8, 2013 at 9:33 AM, Mike Blackwell <mike.blackwell@rrd.com> wrote:
> Still to be tested:
> the counts for streaming replication (no replication setup here to test against yet).


Re: stats for network traffic WIP

From
Nigel Heron
Date:
On Thu, Nov 7, 2013 at 8:21 PM, Greg Stark <stark@mit.edu> wrote:
>
>
> The most interesting thing that I could see calculating from these stats
> would require also knowing how much time was spent waiting on writes and
> reads on the network. With the cumulative time spent as well as the count of
> syscalls you can calculate the average latency over any time period between
> two snapshots. However that would involve adding two gettimeofday calls
> which would be quite likely to cause a noticeable impact on some
> architectures. Unless there's already a pair of gettimeofday calls you can
> piggy back onto?
>
>

Adding timing instrumentation to each send() and recv() would require
over 50 calls to gettimeofday for a simple psql -c "SELECT 1", while
the client was waiting. That would add ~40usec extra time (estimated
using pg_test_timing on my laptop without TSC). It might be more
overhead than it's worth.
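
The figures above work out to roughly 0.8 usec per call or less (about 40 usec spread over 50-plus calls). A rough way to get a feel for this kind of number is sketched below. It is only loosely analogous to pg_test_timing: it times Python's clock, so the result includes interpreter overhead and is not comparable to gettimeofday called from C. The function names are made up for the example.

```python
import time


def timer_overhead_ns(n_calls=100_000):
    """Return the average cost in nanoseconds of one clock read, loop overhead included."""
    start = time.perf_counter_ns()
    for _ in range(n_calls):
        time.perf_counter_ns()
    end = time.perf_counter_ns()
    return (end - start) / n_calls


def estimated_overhead_us(per_call_ns, calls):
    """Total added time in microseconds for a given number of clock reads."""
    return per_call_ns * calls / 1000.0
```

With an assumed 800 ns per call, estimated_overhead_us(800, 50) gives 40.0, matching the estimate above.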

-nigel.



Re: stats for network traffic WIP

From
Peter Eisentraut
Date:
On Fri, 2013-11-08 at 10:01 -0500, Nigel Heron wrote:
> here's v4 of the patch. I added documentation and a new global view
> called "pg_stat_socket" (includes bytes_sent, bytes_received and
> stats_reset time)

Your patch needs to be rebased:

CONFLICT (content): Merge conflict in src/test/regress/expected/rules.out




Re: stats for network traffic WIP

From
Nigel Heron
Date:
On Wed, Nov 13, 2013 at 11:27 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On Fri, 2013-11-08 at 10:01 -0500, Nigel Heron wrote:
>> here's v4 of the patch. I added documentation and a new global view
>> called "pg_stat_socket" (includes bytes_sent, bytes_received and
>> stats_reset time)
>
> Your patch needs to be rebased:
>
> CONFLICT (content): Merge conflict in src/test/regress/expected/rules.out
>

Hi,
here's a rebased patch with some additions.

an overview of its current state...

a new pg_stat_socket global view:
- total bytes sent and received
- bytes sent and received for user backends
- bytes sent and received for wal senders
- total connection attempts
- successful connections to user backends
- successful connections to wal senders
- stats reset time
pg_stat_reset_shared('socket') resets the counters

added to pg_stat_database view:
- bytes sent and received per db
- successful connections per db
pg_stat_reset() resets the counters

added to pg_stat_activity view:
- bytes sent and received per backend

added to pg_stat_replication view:
- bytes sent and received per wal sender

using the existing track_counts guc to enable/disable these stats.
-nigel.

Attachment

Re: stats for network traffic WIP

From
Mike Blackwell
Date:
This patch looks good to me.  It applies, builds, and runs the regression tests.  Documentation is included and it seems to do what it says.  I don't consider myself a code expert, but as far as I can see it looks fine.  This is a pretty straightforward enhancement to the existing pg_stat_* code.

If no one has any objections, I'll mark it ready for committer.

Mike

On Thu, Nov 14, 2013 at 11:29 PM, Nigel Heron <nheron@querymetrics.com> wrote:
> here's a rebased patch with some additions.

Re: stats for network traffic WIP

From
Atri Sharma
Date:
On Tue, Nov 19, 2013 at 11:43 PM, Mike Blackwell <mike.blackwell@rrd.com> wrote:
> This patch looks good to me.  It applies, builds, and runs the regression
> tests.  Documentation is included and it seems to do what it says.  I don't
> consider myself a code expert, but as far as I can see it looks fine.  This
> is a pretty straightforward enhancement to the existing pg_stat_* code.
>
> If no one has any objections, I'll mark it ready for committer.
>
> Mike

I agree.

I had a discussion with Mike yesterday, and looked at the
performance-sensitive areas of the patch. I think the impact would be
pretty low, and since the global counter is incremented with race
conditions in mind, I think that the statistics collected will be
valid.

So, I have no objections to the patch being marked as ready for committer.

Regards,

Atri




Re: stats for network traffic WIP

From
Fujii Masao
Date:
On Wed, Nov 20, 2013 at 3:18 AM, Atri Sharma <atri.jiit@gmail.com> wrote:
> On Tue, Nov 19, 2013 at 11:43 PM, Mike Blackwell <mike.blackwell@rrd.com> wrote:
>> This patch looks good to me.  It applies, builds, and runs the regression
>> tests.  Documentation is included and it seems to do what it says.  I don't
>> consider myself a code expert, but as far as I can see it looks fine.  This
>> is a pretty straightforward enhancement to the existing pg_stat_* code.
>>
>> If no one has any objections, I'll mark it ready for committer.
>>
>> Mike
>
> I agree.
>
> I had a discussion with Mike yesterday, and took the performance areas
> in the patch. I think the impact would be pretty low and since the
> global counter being incremented is incremented with keeping race
> conditions in mind, I think that the statistics collected will be
> valid.
>
> So, I have no objections to the patch being marked as ready for committer.

Could you share the performance numbers? I'm really concerned about
the performance overhead caused by this patch.

Here are the comments from me:

All the restrictions of this feature should be documented. For example,
this feature doesn't track the bytes of the data transferred by FDW.
It's worth documenting that kind of information.

ISTM that this feature doesn't support SSL case. Why not?

The amount of data transferred by walreceiver also should be tracked,
I think.

I just wonder how conn_received, conn_backend and conn_walsender
are useful.

Regards,

-- 
Fujii Masao



Re: stats for network traffic WIP

From
Atri Sharma
Date:


> On 07-Dec-2013, at 23:47, Fujii Masao <masao.fujii@gmail.com> wrote:
>
>> On Wed, Nov 20, 2013 at 3:18 AM, Atri Sharma <atri.jiit@gmail.com> wrote:
>>> On Tue, Nov 19, 2013 at 11:43 PM, Mike Blackwell <mike.blackwell@rrd.com> wrote:
>>> This patch looks good to me.  It applies, builds, and runs the regression
>>> tests.  Documentation is included and it seems to do what it says.  I don't
>>> consider myself a code expert, but as far as I can see it looks fine.  This
>>> is a pretty straightforward enhancement to the existing pg_stat_* code.
>>>
>>> If no one has any objections, I'll mark it ready for committer.
>>>
>>> Mike
>>
>> I agree.
>>
>> I had a discussion with Mike yesterday, and took the performance areas
>> in the patch. I think the impact would be pretty low and since the
>> global counter being incremented is incremented with keeping race
>> conditions in mind, I think that the statistics collected will be
>> valid.
>>
>> So, I have no objections to the patch being marked as ready for committer.
>
> Could you share the performance numbers? I'm really concerned about
> the performance overhead caused by this patch.
I did some pgbench tests specifically with an increasing number of clients, as that is the kind of workload that can
show slowness due to an increase in work in the commonly used functions. Let me see if I can get the numbers
and see where I kept them.

>
> Here are the comments from me:
>
> All the restrictions of this feature should be documented. For example,
> this feature doesn't track the bytes of the data transferred by FDW.
> It's worth documenting that kind of information.

+1

>
> ISTM that this feature doesn't support SSL case. Why not?
>
> The amount of data transferred by walreceiver also should be tracked,
> I think.
>
Yes, I agree. WAL receiver data transfer can be problematic sometimes as well, so it should be tracked.

Regards,

Atri


Re: stats for network traffic WIP

From
Nigel Heron
Date:
On Sat, Dec 7, 2013 at 1:17 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>
> Could you share the performance numbers? I'm really concerned about
> the performance overhead caused by this patch.
>

I've tried pgbench in select mode with small data sets to avoid disk
io and didn't see any difference. That was on my old core2duo laptop
though .. I'll have to retry it on some server class multi core
hardware.

I could create a new GUC to turn on/off this feature. Currently, it
uses "track_counts".

> Here are the comments from me:
>
> All the restrictions of this feature should be documented. For example,
> this feature doesn't track the bytes of the data transferred by FDW.
> It's worth documenting that kind of information.
>

OK. It also doesn't account for DNS resolution, Bonjour traffic and
any traffic generated from PL functions that create their own sockets.

> ISTM that this feature doesn't support SSL case. Why not?

It does support SSL, see my_sock_read() and my_sock_write() in
backend/libpq/be-secure.c

> The amount of data transferred by walreceiver also should be tracked,
> I think.

I'll have to take another look at it. I might be able to create SSL
BIO functions in libpqwalreceiver.c and change some other functions
(eg. libpqrcv_send) to return byte counts instead of void to get it
working.

> I just wonder how conn_received, conn_backend and conn_walsender
> are useful.

I thought of it mostly for monitoring software usage (eg. cacti,
nagios) to track connections/sec which might be used for capacity
planning, confirm connection pooler settings, monitoring abuse, etc.
Eg. If your conn_walsender is increasing and you have a fixed set of
slaves it could show a network issue.
The information is available in the logs if "log_connections" GUC is
on but it requires parsing and access to log files to extract. With
the increasing popularity of hosted postgres services without OS or
log access, I think more metrics should be available through system
views.
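
As a sketch of how a monitoring tool might turn these cumulative counters into connections/sec, the helper below takes two snapshots as plain dicts. The counter names come from the patch as discussed here; the function name and the way the snapshots are obtained are assumptions for the example.

```python
def connection_rates(earlier, later, interval_seconds):
    """Per-second rates for each counter between two snapshots (dicts of cumulative counts)."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return {name: (later[name] - earlier[name]) / interval_seconds
            for name in earlier}


# Two made-up snapshots taken 60 seconds apart.
before = {"conn_received": 100, "conn_backend": 90, "conn_walsender": 2}
after = {"conn_received": 700, "conn_backend": 630, "conn_walsender": 2}
rates = connection_rates(before, after, 60)
```

A conn_walsender rate that stays above zero with a fixed set of slaves would be the reconnect pattern described above.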

-nigel.



Re: stats for network traffic WIP

From
Fujii Masao
Date:
On Tue, Dec 10, 2013 at 6:56 AM, Nigel Heron <nheron@querymetrics.com> wrote:
> On Sat, Dec 7, 2013 at 1:17 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>
>> Could you share the performance numbers? I'm really concerned about
>> the performance overhead caused by this patch.
>>
>
> I've tried pgbench in select mode with small data sets to avoid disk
> io and didn't see any difference. That was on my old core2duo laptop
> though .. I'll have to retry it on some server class multi core
> hardware.

When I ran pgbench -i -s 100 in four parallel, I saw the performance difference
between the master and the patched one. I ran the following commands.

    psql -c "checkpoint"
    for i in $(seq 1 4); do time pgbench -i -s100 -q db$i & done

The results are:

* Master
  10000000 of 10000000 tuples (100%) done (elapsed 13.91 s, remaining 0.00 s).
  10000000 of 10000000 tuples (100%) done (elapsed 14.03 s, remaining 0.00 s).
  10000000 of 10000000 tuples (100%) done (elapsed 14.01 s, remaining 0.00 s).
  10000000 of 10000000 tuples (100%) done (elapsed 14.13 s, remaining 0.00 s).

  It took almost 14.0 seconds to store 10000000 tuples.

* Patched
  10000000 of 10000000 tuples (100%) done (elapsed 14.90 s, remaining 0.00 s).
  10000000 of 10000000 tuples (100%) done (elapsed 15.05 s, remaining 0.00 s).
  10000000 of 10000000 tuples (100%) done (elapsed 15.42 s, remaining 0.00 s).
  10000000 of 10000000 tuples (100%) done (elapsed 15.70 s, remaining 0.00 s).

  It took almost 15.0 seconds to store 10000000 tuples.

Thus, I'm afraid that enabling network statistics would cause serious
performance degradation. Thoughts?
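
For reference, the relative slowdown implied by the eight elapsed times
above can be worked out as follows. This is only arithmetic on the reported
numbers (a single run of four sessions each), not a new measurement; the
function name is made up for the example.

```python
from statistics import mean

# Elapsed times in seconds, copied from the pgbench -i output above.
master = [13.91, 14.03, 14.01, 14.13]
patched = [14.90, 15.05, 15.42, 15.70]


def slowdown_percent(baseline, candidate):
    """Relative increase of the mean elapsed time, in percent."""
    base = mean(baseline)
    return (mean(candidate) - base) / base * 100.0


if __name__ == "__main__":
    print(f"master mean:  {mean(master):.2f} s")
    print(f"patched mean: {mean(patched):.2f} s")
    print(f"slowdown:     {slowdown_percent(master, patched):.1f} %")
```

The means are about 14.0 s and 15.3 s, so the patched build is roughly 9%
slower on this data-load test.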

Regards,

-- 
Fujii Masao



Re: stats for network traffic WIP

From
Atri Sharma
Date:
On Tue, Dec 10, 2013 at 10:59 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Tue, Dec 10, 2013 at 6:56 AM, Nigel Heron <nheron@querymetrics.com> wrote:
>> On Sat, Dec 7, 2013 at 1:17 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>>
>>> Could you share the performance numbers? I'm really concerned about
>>> the performance overhead caused by this patch.
>>>
>>
>> I've tried pgbench in select mode with small data sets to avoid disk
>> io and didn't see any difference. That was on my old core2duo laptop
>> though .. I'll have to retry it on some server class multi core
>> hardware.
>
> When I ran pgbench -i -s 100 in four parallel, I saw the performance difference
> between the master and the patched one. I ran the following commands.
>
>     psql -c "checkpoint"
>     for i in $(seq 1 4); do time pgbench -i -s100 -q db$i & done
>
> The results are:
>
> * Master
>   10000000 of 10000000 tuples (100%) done (elapsed 13.91 s, remaining 0.00 s).
>   10000000 of 10000000 tuples (100%) done (elapsed 14.03 s, remaining 0.00 s).
>   10000000 of 10000000 tuples (100%) done (elapsed 14.01 s, remaining 0.00 s).
>   10000000 of 10000000 tuples (100%) done (elapsed 14.13 s, remaining 0.00 s).
>
>   It took almost 14.0 seconds to store 10000000 tuples.
>
> * Patched
>   10000000 of 10000000 tuples (100%) done (elapsed 14.90 s, remaining 0.00 s).
>   10000000 of 10000000 tuples (100%) done (elapsed 15.05 s, remaining 0.00 s).
>   10000000 of 10000000 tuples (100%) done (elapsed 15.42 s, remaining 0.00 s).
>   10000000 of 10000000 tuples (100%) done (elapsed 15.70 s, remaining 0.00 s).
>
>   It took almost 15.0 seconds to store 10000000 tuples.
>
> Thus, I'm afraid that enabling network statistics would cause serious
> performance
> degradation. Thought?

Hmm, I think I did not push it this high. The performance numbers here
are a cause for worry.

Another point I may mention is that we may be able to isolate a few
points of performance degradation and work on them, because I still
feel that the patch as a whole does not cause a serious slowdown;
rather, a few specific points may.

However, the above numbers do bring back the performance concerns
originally voiced. I guess I was testing with too few clients for the
gap to show up significantly.

Regards,

Atri



Re: stats for network traffic WIP

From
Robert Haas
Date:
On Tue, Dec 10, 2013 at 12:29 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Tue, Dec 10, 2013 at 6:56 AM, Nigel Heron <nheron@querymetrics.com> wrote:
>> On Sat, Dec 7, 2013 at 1:17 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>>
>>> Could you share the performance numbers? I'm really concerned about
>>> the performance overhead caused by this patch.
>>>
>>
>> I've tried pgbench in select mode with small data sets to avoid disk
>> io and didn't see any difference. That was on my old core2duo laptop
>> though .. I'll have to retry it on some server class multi core
>> hardware.
>
> When I ran pgbench -i -s 100 in four parallel, I saw the performance difference
> between the master and the patched one. I ran the following commands.
>
>     psql -c "checkpoint"
>     for i in $(seq 1 4); do time pgbench -i -s100 -q db$i & done
>
> The results are:
>
> * Master
>   10000000 of 10000000 tuples (100%) done (elapsed 13.91 s, remaining 0.00 s).
>   10000000 of 10000000 tuples (100%) done (elapsed 14.03 s, remaining 0.00 s).
>   10000000 of 10000000 tuples (100%) done (elapsed 14.01 s, remaining 0.00 s).
>   10000000 of 10000000 tuples (100%) done (elapsed 14.13 s, remaining 0.00 s).
>
>   It took almost 14.0 seconds to store 10000000 tuples.
>
> * Patched
>   10000000 of 10000000 tuples (100%) done (elapsed 14.90 s, remaining 0.00 s).
>   10000000 of 10000000 tuples (100%) done (elapsed 15.05 s, remaining 0.00 s).
>   10000000 of 10000000 tuples (100%) done (elapsed 15.42 s, remaining 0.00 s).
>   10000000 of 10000000 tuples (100%) done (elapsed 15.70 s, remaining 0.00 s).
>
>   It took almost 15.0 seconds to store 10000000 tuples.
>
> Thus, I'm afraid that enabling network statistics would cause serious
> performance
> degradation. Thought?

Yes, I think the overhead of this patch is far, far too high to
contemplate applying it.  It sends a stats collector message after
*every socket operation*.  Once per transaction would likely be too
much overhead already (think: pgbench -S) but once per socket op is
insane.

Moreover, even if we found some way to reduce that overhead to an
acceptable level, I think a lot of people would be unhappy about the
statsfile bloat.  Unfortunately, the bottom line here is that, until
someone overhauls the stats collector infrastructure to make
incremental updates to the statsfile cheap, we really can't afford to
add much of anything in the way of new statistics.  So I fear this
patch is doomed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: stats for network traffic WIP

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Yes, I think the overhead of this patch is far, far too high to
> contemplate applying it.  It sends a stats collector message after
> *every socket operation*.  Once per transaction would likely be too
> much overhead already (think: pgbench -S) but once per socket op is
> insane.

Oh, is that what the problem is?  That seems trivially fixable --- only
flush the data to the collector once per query or so.  I'd be a bit
inclined to add it to the existing transaction-end messages instead of
adding any new traffic.
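
A minimal sketch of that batching idea, written in Python for brevity
rather than as backend C code: byte counts are accumulated in local
variables on every socket operation and reported only when flush() is
called, e.g. at transaction end. The class name and the send_to_collector
callback are hypothetical and do not correspond to the actual pgstat API.

```python
class SocketCounters:
    """Accumulate per-backend byte counts locally; report them in batches."""

    def __init__(self, send_to_collector):
        # send_to_collector(sent, received) stands in for whatever message
        # would carry the totals to the stats collector (an assumption).
        self._send = send_to_collector
        self.pending_sent = 0
        self.pending_received = 0

    def on_send(self, nbytes):
        # Called after every send(): just an integer add, no message.
        self.pending_sent += nbytes

    def on_recv(self, nbytes):
        # Called after every recv(): likewise purely local.
        self.pending_received += nbytes

    def flush(self):
        # Called once per transaction (or query): emit one message at most.
        if self.pending_sent == 0 and self.pending_received == 0:
            return
        self._send(self.pending_sent, self.pending_received)
        self.pending_sent = 0
        self.pending_received = 0
```

With this shape the per-socket-op cost is an addition, and the number of
collector messages is bounded by the number of flush points instead of the
number of send()/recv() calls.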

> Moreover, even if we found some way to reduce that overhead to an
> acceptable level, I think a lot of people would be unhappy about the
> statsfile bloat.

This could be a bigger problem, but what are we aggregating over?
If the stats are only recorded at say the database level, that's not
going to take much space.

Having said that, I can't get very excited about this feature anyway,
so I'm fine with rejecting the patch.  I'm not sure that enough people
care to justify any added overhead at all.  The long and the short of
it is that network traffic generally is what it is, for any given query
workload, and so it's not clear what's the point of counting it.
        regards, tom lane



Re: stats for network traffic WIP

From
Peter Eisentraut
Date:
On 12/10/13, 5:08 PM, Tom Lane wrote:
> Having said that, I can't get very excited about this feature anyway,
> so I'm fine with rejecting the patch.  I'm not sure that enough people
> care to justify any added overhead at all.  The long and the short of
> it is that network traffic generally is what it is, for any given query
> workload, and so it's not clear what's the point of counting it.

Also, if we add this, the next guy is going to want to add CPU
statistics, memory statistics, etc.

Is there a reason why you can't get this directly from the OS?



Re: stats for network traffic WIP

From
Atri Sharma
Date:
On Wed, Dec 11, 2013 at 11:12 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On 12/10/13, 5:08 PM, Tom Lane wrote:
>> Having said that, I can't get very excited about this feature anyway,
>> so I'm fine with rejecting the patch.  I'm not sure that enough people
>> care to justify any added overhead at all.  The long and the short of
>> it is that network traffic generally is what it is, for any given query
>> workload, and so it's not clear what's the point of counting it.
>
> Also, if we add this, the next guy is going to want to add CPU
> statistics, memory statistics, etc.
>
> Is there a reason why you can't get this directly from the OS?

I would say that it's more of a convenience to track the usage directly
from the database instead of setting up OS infrastructure to store it.

That said, it should be possible to do it directly at the OS level. Can
we think of adding this to pgtop, though?

I am just musing here.

Regards,

Atri



Re: stats for network traffic WIP

From
Tom Lane
Date:
Atri Sharma <atri.jiit@gmail.com> writes:
> On Wed, Dec 11, 2013 at 11:12 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
>> Is there a reason why you can't get this directly from the OS?

> I would say that its more of a convenience to track the usage directly
> from the database instead of setting up OS infrastructure to store it.

The thing that I'm wondering is why the database would be the right place
to be measuring it at all.  If you've got a network usage problem,
aggregate usage across everything on the server is probably what you
need to be worried about, and PG can't tell you that.
        regards, tom lane



Re: stats for network traffic WIP

From
Greg Stark
Date:
I could see this being interesting for FDW plan nodes if the stats were
visible in explain. Possibly also time spent waiting on network reads and
writes.

I have a harder time seeing why it's useful to have these stats in
aggregate, but I suppose if you had lots of FDW connections or lots of
streaming slaves you might want to be able to identify which ones are not
getting used or are dominating your network usage.

-- 
greg

On 11 Dec 2013 10:52, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> Atri Sharma <atri.jiit@gmail.com> writes:
> > On Wed, Dec 11, 2013 at 11:12 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> >> Is there a reason why you can't get this directly from the OS?
>
> > I would say that its more of a convenience to track the usage directly
> > from the database instead of setting up OS infrastructure to store it.
>
> The thing that I'm wondering is why the database would be the right place
> to be measuring it at all.  If you've got a network usage problem,
> aggregate usage across everything on the server is probably what you
> need to be worried about, and PG can't tell you that.
>
>                         regards, tom lane


Re: stats for network traffic WIP

From
Jim Nasby
Date:
On 12/11/13 12:51 PM, Tom Lane wrote:
> Atri Sharma <atri.jiit@gmail.com> writes:
>> On Wed, Dec 11, 2013 at 11:12 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
>>> Is there a reason why you can't get this directly from the OS?
>
>> I would say that its more of a convenience to track the usage directly
>> from the database instead of setting up OS infrastructure to store it.
>
> The thing that I'm wondering is why the database would be the right place
> to be measuring it at all.  If you've got a network usage problem,
> aggregate usage across everything on the server is probably what you
> need to be worried about, and PG can't tell you that.

Except how many folks that care about performance that much don't have dedicated database servers?

BTW, since someone mentioned CPU etc, what I'd be interested in is being
able to see what OS-level resources were consumed by individual queries.
You can already get that to a degree via explain (at least for memory and
buffer reads), but it'd be very useful to see what queries are CPU or
IO-bound.

-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: stats for network traffic WIP

From
Craig Ringer
Date:
On 12/12/2013 02:51 AM, Tom Lane wrote:
> The thing that I'm wondering is why the database would be the right place
> to be measuring it at all.  If you've got a network usage problem,
> aggregate usage across everything on the server is probably what you
> need to be worried about, and PG can't tell you that.

I suspect this feature would be useful for when you want to try to drill
down and figure out what's having network issues - specifically, to
associate network behaviour with individual queries, individual users,
application_name, etc.

One sometimes faces the same issue with I/O: I know PostgreSQL is doing
lots of I/O, but what exactly is causing the I/O? Especially if you
can't catch it at the time it happens, it can be quite tricky to go from
"there's lots of I/O" to "this query changed from using synchronized
seqscans to doing an index-only scan that's hammering the cache".

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: stats for network traffic WIP

From
Stephen Frost
Date:
* Craig Ringer (craig@2ndquadrant.com) wrote:
> On 12/12/2013 02:51 AM, Tom Lane wrote:
> > The thing that I'm wondering is why the database would be the right place
> > to be measuring it at all.  If you've got a network usage problem,
> > aggregate usage across everything on the server is probably what you
> > need to be worried about, and PG can't tell you that.
>
> I suspect this feature would be useful for when you want to try to drill
> down and figure out what's having network issues - specifically, to
> associate network behaviour with individual queries, individual users,
> application_name, etc.
>
> One sometimes faces the same issue with I/O: I know PostgreSQL is doing
> lots of I/O, but what exactly is causing the I/O? Especially if you
> can't catch it at the time it happens, it can be quite tricky to go from
> "there's lots of I/O" to "this query changed from using synchronized
> seqscans to doing an index-only scan that's hammering the cache".

Agreed.  My other thought on this is that there's a lot to be said for
having everything you need available through one tool- kinda like how
Emacs users rarely go outside of it.. :)  And then there's also the
consideration that DBAs may not have access to the host system at all,
or not to the level needed to do similar analysis there.
Thanks,
    Stephen

Re: stats for network traffic WIP

From
Robert Haas
Date:
On Wed, Dec 18, 2013 at 8:47 AM, Stephen Frost <sfrost@snowman.net> wrote:
> Agreed.  My other thought on this is that there's a lot to be said for
> having everything you need available through one tool- kinda like how
> Emacs users rarely go outside of it.. :)  And then there's also the
> consideration that DBAs may not have access to the host system at all,
> or not to the level needed to do similar analysis there.

I completely agree with this, and yet I still think we should reject
the patch, because I think the overhead is going to be intolerable.

Now, the fact is, the monitoring facilities we have in PostgreSQL
today are not nearly good enough.  Other products do better.  I cringe
every time I tell someone to attach strace to a long-running autovac
process to find out what block number it's currently on, so we can
estimate when it will finish; or every time we need data about lwlock
contention and the only way to get it is to use perf, or recompile
with LWLOCK_STATS defined.  These are not fun conversations to have
with customers who are in production.

On the other hand, there's not much value in adding monitoring
features that are going to materially harm performance, and a lot of
the monitoring features that get proposed die on the vine for exactly
that reason.  I think the root of the problem is that our stats
infrastructure is a streaming pile of crap.  A number of people have
worked diligently to improve it and that work has not been fruitless,
but the current situation is still not very good.  In many ways, this
situation reminds me of the situation with EXPLAIN a few years ago.
People kept proposing useful extensions to EXPLAIN which we did not
adopt because they required creating (and perhaps reserving) far too
many keywords.  Now that we have the extensible options syntax,
EXPLAIN has options for COSTS, BUFFERS, TIMING, and FORMAT, all of
which have proven to be worth their weight in code, at least IMHO.

I am really not sure what a better infrastructure for stats collection
should look like, but I know that until we get one, a lot of
monitoring patches that would be really nice to have are going to get
shot down because of concerns about performance, and specifically
stats file bloat.  Fixing that problem figures to be unglamorous, but
I'll buy whoever does it a beer (or another beverage of your choice).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: stats for network traffic WIP

From
Stephen Frost
Date:
* Robert Haas (robertmhaas@gmail.com) wrote:
> On Wed, Dec 18, 2013 at 8:47 AM, Stephen Frost <sfrost@snowman.net> wrote:
> > Agreed.  My other thought on this is that there's a lot to be said for
> > having everything you need available through one tool- kinda like how
> > Emacs users rarely go outside of it.. :)  And then there's also the
> > consideration that DBAs may not have access to the host system at all,
> > or not to the level needed to do similar analysis there.
>
> I completely agree with this, and yet I still think we should reject
> the patch, because I think the overhead is going to be intolerable.

That's a fair point and I'm fine with rejecting it on the grounds that
the overhead is too much.  Hopefully that encourages the author to go
back and review Tom's comments and consider how the overhead could be
reduced or eliminated.  We absolutely need better monitoring and I have
had many of the same strace-involving conversations.  perf is nearly out
of the question as it's often not even installed and can be terribly
risky (I once had to get a prod box hard-reset after running perf on it
for mere moments because it never came back enough to let us do a clean
restart).

> I think the root of the problem is that our stats
> infrastructure is a streaming pile of crap.

+1
Thanks,
    Stephen

Re: stats for network traffic WIP

From
Bruce Momjian
Date:
On Wed, Dec 18, 2013 at 03:41:24PM -0500, Robert Haas wrote:
> On the other hand, there's not much value in adding monitoring
> features that are going to materially harm performance, and a lot of
> the monitoring features that get proposed die on the vine for exactly
> that reason.  I think the root of the problem is that our stats
> infrastructure is a streaming pile of crap.  A number of people have

"streaming"?  I can't imagine what that looks like.  ;-)

I think the larger point is that network is only one of many things we
need to address, so this needs a holistic approach that looks at all
needs and creates infrastructure to address it.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +