Thread: [HACKERS] [PATCH] guc-ify the formerly hard-coded MAX_SEND_SIZE to max_wal_send

[HACKERS] [PATCH] guc-ify the formerly hard-coded MAX_SEND_SIZE to max_wal_send

From: Jonathon Nelson

Attached please find a patch for PostgreSQL 9.4 which changes the maximum amount of data that the wal sender will send at any point in time from the hard-coded value of 128KiB to a user-controllable value up to 16MiB. It has been primarily tested under 9.4 but there has been some testing with 9.5.
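
For those who haven't looked at walsender.c: the hard-coded cap is the MAX_SEND_SIZE constant (XLOG_BLCKSZ * 16, i.e. 128kB with the default 8kB block size) that XLogSendPhysical() uses to limit each hand-off to libpq. A minimal sketch of the shape of the change -- illustration only, not the attached patch; the GUC name comes from the subject line and the bounds from the description above, everything else is assumed:

#include <stddef.h>

#define XLOG_BLCKSZ     8192                    /* PostgreSQL's default WAL block size */
#define MAX_SEND_SIZE   (XLOG_BLCKSZ * 16)      /* the existing hard-coded 128kB cap */

/* GUC-backed replacement: cap in kB, adjustable without recompiling. */
static int max_wal_send = MAX_SEND_SIZE / 1024; /* default 128; range 4 .. 16384 */

static size_t
wal_send_cap(void)
{
    return (size_t) max_wal_send * 1024;
}

The send loop would then clamp each hand-off to wal_send_cap() bytes instead of MAX_SEND_SIZE.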

In our lab environment and with a 16MiB setting, we saw substantially better network utilization (almost 2x!), primarily over high bandwidth delay product links.

--
Jon Nelson
Dyn / Principal Software Engineer

Hi,

On 2017-01-05 12:55:44 -0600, Jonathon Nelson wrote:
> Attached please find a patch for PostgreSQL 9.4 which changes the maximum
> amount of data that the wal sender will send at any point in time from the
> hard-coded value of 128KiB to a user-controllable value up to 16MiB. It has
> been primarily tested under 9.4 but there has been some testing with 9.5.
> 
> In our lab environment and with a 16MiB setting, we saw substantially
> better network utilization (almost 2x!), primarily over high bandwidth
> delay product links.

That's a bit odd - shouldn't the OS network stack take care of this in
both cases?  I mean either is too big for TCP packets (including jumbo
frames).  What type of OS and network is involved here?

Greetings,

Andres Freund



Re: [HACKERS] [PATCH] guc-ify the formerly hard-coded MAX_SEND_SIZE to max_wal_send

From: Jonathon Nelson


On Thu, Jan 5, 2017 at 1:01 PM, Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2017-01-05 12:55:44 -0600, Jonathon Nelson wrote:
> Attached please find a patch for PostgreSQL 9.4 which changes the maximum
> amount of data that the wal sender will send at any point in time from the
> hard-coded value of 128KiB to a user-controllable value up to 16MiB. It has
> been primarily tested under 9.4 but there has been some testing with 9.5.
>
> In our lab environment and with a 16MiB setting, we saw substantially
> better network utilization (almost 2x!), primarily over high bandwidth
> delay product links.

That's a bit odd - shouldn't the OS network stack take care of this in
both cases?  I mean either is too big for TCP packets (including jumbo
frames).  What type of OS and network is involved here?

In our test lab, we make use of multiple flavors of Linux. No jumbo frames. We simulated anything from 0 to 160ms RTT (with varying degrees of jitter, packet loss, etc.) using tc. Even with everything fairly clean, at 80ms RTT there was a 2x improvement in performance.

--
Jon Nelson
Dyn / Principal Software Engineer

Re: [HACKERS] [PATCH] guc-ify the formerly hard-coded MAX_SEND_SIZE to max_wal_send

From: Kevin Grittner
On Thu, Jan 5, 2017 at 7:32 PM, Jonathon Nelson <jdnelson@dyn.com> wrote:
> On Thu, Jan 5, 2017 at 1:01 PM, Andres Freund <andres@anarazel.de> wrote:
>> On 2017-01-05 12:55:44 -0600, Jonathon Nelson wrote:

>>> In our lab environment and with a 16MiB setting, we saw substantially
>>> better network utilization (almost 2x!), primarily over high bandwidth
>>> delay product links.
>>
>> That's a bit odd - shouldn't the OS network stack take care of this in
>> both cases?  I mean either is too big for TCP packets (including jumbo
>> frames).  What type of OS and network is involved here?
>
> In our test lab, we make use of multiple flavors of Linux. No jumbo frames.
> We simulated anything from 0 to 160ms RTT (with varying degrees of jitter,
> packet loss, etc.) using tc. Even with everything fairly clean, at 80ms RTT
> there was a 2x improvement in performance.

Is there compression and/or encryption being performed by the
network layers?  My experience with both is that they run faster on
bigger chunks of data, and that might happen before the data is
broken into packets.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH] guc-ify the formerly hard-coded MAX_SEND_SIZE to max_wal_send

From: Jonathon Nelson


On Fri, Jan 6, 2017 at 8:52 AM, Kevin Grittner <kgrittn@gmail.com> wrote:
On Thu, Jan 5, 2017 at 7:32 PM, Jonathon Nelson <jdnelson@dyn.com> wrote:
> On Thu, Jan 5, 2017 at 1:01 PM, Andres Freund <andres@anarazel.de> wrote:
>> On 2017-01-05 12:55:44 -0600, Jonathon Nelson wrote:

>>> In our lab environment and with a 16MiB setting, we saw substantially
>>> better network utilization (almost 2x!), primarily over high bandwidth
>>> delay product links.
>>
>> That's a bit odd - shouldn't the OS network stack take care of this in
>> both cases?  I mean either is too big for TCP packets (including jumbo
>> frames).  What type of OS and network is involved here?
>
> In our test lab, we make use of multiple flavors of Linux. No jumbo frames.
> We simulated anything from 0 to 160ms RTT (with varying degrees of jitter,
> packet loss, etc.) using tc. Even with everything fairly clean, at 80ms RTT
> there was a 2x improvement in performance.

Is there compression and/or encryption being performed by the
network layers?  My experience with both is that they run faster on
bigger chunks of data, and that might happen before the data is
broken into packets.

There is no compression or encryption. The testing was with and without various forms of hardware offload, etc. but otherwise there is no magic up these sleeves.

--
Jon Nelson
Dyn / Principal Software Engineer

On 1/5/17 12:55 PM, Jonathon Nelson wrote:
> Attached please find a patch for PostgreSQL 9.4 which changes the
> maximum amount of data that the wal sender will send at any point in
> time from the hard-coded value of 128KiB to a user-controllable value up
> to 16MiB. It has been primarily tested under 9.4 but there has been some
> testing with 9.5.

To make sure this doesn't get lost, please add it to 
https://commitfest.postgresql.org. Please verify the patch will apply 
against current HEAD and pass make check-world.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



On 5 January 2017 at 19:01, Andres Freund <andres@anarazel.de> wrote:
> That's a bit odd - shouldn't the OS network stack take care of this in
> both cases?  I mean either is too big for TCP packets (including jumbo
> frames).  What type of OS and network is involved here?

2x may be plausible. The first 128k goes out, then the rest queues up
until the first ack comes back. Then the next 128kB goes out again
without waiting... I think this is what Nagle is supposed to actually
address but either it may be off by default these days or our usage
pattern may be defeating it in some way.

-- 
greg



On 8 January 2017 at 17:26, Greg Stark <stark@mit.edu> wrote:
> On 5 January 2017 at 19:01, Andres Freund <andres@anarazel.de> wrote:
>> That's a bit odd - shouldn't the OS network stack take care of this in
>> both cases?  I mean either is too big for TCP packets (including jumbo
>> frames).  What type of OS and network is involved here?
>
> 2x may be plausible. The first 128k goes out, then the rest queues up
> until the first ack comes back. Then the next 128kB goes out again
> without waiting... I think this is what Nagle is supposed to actually
> address but either it may be off by default these days or our usage
> pattern may be defeating it in some way.

Hm. That wasn't very clear.  And the more I think about it, it's not right.

The first block of data -- one byte in the worst case, 128kB in our
case -- gets put in the output buffers and, since there's nothing
stopping it, it immediately gets sent out. Then all the subsequent data
gets put in output buffers but buffers up due to Nagle until there's
a full packet of data buffered, the ack arrives, or the timeout
expires, at which point the buffered data drains efficiently in full
packets. Eventually it all drains away and the next 128kB arrives and
is sent out immediately.

So most packets are full size with the occasional 128kB packet thrown
in whenever the buffer empties. And I think even when the 128kB packet
is pending Nagle only stops small packets, not full packets, and the
window should allow more than one packet of data to be pending.

So, uh, forget what I said. Nagle should be our friend here.
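
(If anyone wants to rule Nagle in or out empirically, it is the per-socket behaviour controlled by TCP_NODELAY, so a quick experiment is to toggle it on the walsender's socket and re-run the transfer. A minimal sketch, assuming an ordinary connected BSD socket:)

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* nodelay = 1 disables Nagle, 0 re-enables it; returns -1 on error. */
static int
set_tcp_nodelay(int sock, int nodelay)
{
    return setsockopt(sock, IPPROTO_TCP, TCP_NODELAY,
                      &nodelay, sizeof(nodelay));
}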

I think you should get network dumps and use xplot to understand
what's really happening. c.f.
https://fasterdata.es.net/assets/Uploads/20131016-TCPDumpTracePlot.pdf


-- 
greg



Re: [HACKERS] [PATCH] guc-ify the formerly hard-coded MAX_SEND_SIZE to max_wal_send

From: Jonathon Nelson


On Sun, Jan 8, 2017 at 11:36 AM, Greg Stark <stark@mit.edu> wrote:
On 8 January 2017 at 17:26, Greg Stark <stark@mit.edu> wrote:
> On 5 January 2017 at 19:01, Andres Freund <andres@anarazel.de> wrote:
>> That's a bit odd - shouldn't the OS network stack take care of this in
>> both cases?  I mean either is too big for TCP packets (including jumbo
>> frames).  What type of OS and network is involved here?
>
> 2x may be plausible. The first 128k goes out, then the rest queues up
> until the first ack comes back. Then the next 128kB goes out again
> without waiting... I think this is what Nagle is supposed to actually
> address but either it may be off by default these days or our usage
> pattern may be defeating it in some way.

Hm. That wasn't very clear.  And the more I think about it, it's not right.

The first block of data -- one byte in the worst case, 128kB in our
case -- gets put in the output buffers and, since there's nothing
stopping it, it immediately gets sent out. Then all the subsequent data
gets put in output buffers but buffers up due to Nagle until there's
a full packet of data buffered, the ack arrives, or the timeout
expires, at which point the buffered data drains efficiently in full
packets. Eventually it all drains away and the next 128kB arrives and
is sent out immediately.

So most packets are full size with the occasional 128kB packet thrown
in whenever the buffer empties. And I think even when the 128kB packet
is pending Nagle only stops small packets, not full packets, and the
window should allow more than one packet of data to be pending.

So, uh, forget what I said. Nagle should be our friend here.

[I have not done a rigorous analysis here, but...]

I *think* libpq is the culprit here.

walsender says "Hey, libpq - please send (up to) 128KB of data!" and doesn't "return" until it's "sent". Then it sends more.  Regardless of the underlying cause (nagle, tcp congestion control algorithms, umpteen different combos of hardware and settings, etc..) in almost every test I saw improvement (usually quite a bit). This was most easily observable with high bandwidth-delay product links, but my time in the lab is somewhat limited.
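
As a rough back-of-the-envelope illustration of why chunk size could matter at all on a long fat pipe (numbers assumed for illustration, not taken from our tests): a 1 Gbit/s path at 80ms RTT needs roughly

    BDP = bandwidth * RTT = 125 MB/s * 0.08 s = 10 MB

in flight to stay full, so a 128kB hand-off is almost two orders of magnitude below that, while 16MB comfortably covers it. Whether the send path actually stalls between hand-offs is, of course, exactly the open question.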

I calculated "performance" using the simplest measurement possible: how long it took for Y volume of data to get transferred, measured over a long-enough interval (typically 1800 seconds) for TCP windows to open up, etc.

--
Jon Nelson
Dyn / Principal Software Engineer


On Sat, Jan 7, 2017 at 7:48 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
On 1/5/17 12:55 PM, Jonathon Nelson wrote:
Attached please find a patch for PostgreSQL 9.4 which changes the
maximum amount of data that the wal sender will send at any point in
time from the hard-coded value of 128KiB to a user-controllable value up
to 16MiB. It has been primarily tested under 9.4 but there has been some
testing with 9.5.

To make sure this doesn't get lost, please add it to https://commitfest.postgresql.org. Please verify the patch will apply against current HEAD and pass make check-world.

Attached please find a revision of the patch, changed in the following ways:

1. removed a call to debug2.
2. applies cleanly against master (as of 8c5722948e831c1862a39da2bb5d793a6f2aabab)
3. one small indentation fix, one small verbiage fix.
4. switched to calculating the upper bound using XLOG_SEG_SIZE rather than hard-coding 16384.
5. the git author is - obviously - different.

make check-world passes.
I have added it to the commitfest.
I have verified with strace that up to 16MB sends are being used.
I have verified that the GUC properly grumps about values greater than XLOG_SEG_SIZE / 1024 or smaller than 4.
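
For reviewers, usage is the obvious thing; a sketch, assuming the value is in kB as the bounds above imply:

    # postgresql.conf
    max_wal_send = 16384    # kB; 16MB, the largest value accepted with 16MB WAL segments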

--
Jon
On 1/9/17 11:33 PM, Jon Nelson wrote:
> 
> On Sat, Jan 7, 2017 at 7:48 PM, Jim Nasby <Jim.Nasby@bluetreble.com
> <mailto:Jim.Nasby@bluetreble.com>> wrote:
> 
>     On 1/5/17 12:55 PM, Jonathon Nelson wrote:
> 
>         Attached please find a patch for PostgreSQL 9.4 which changes the
>         maximum amount of data that the wal sender will send at any point in
>         time from the hard-coded value of 128KiB to a user-controllable
>         value up
>         to 16MiB. It has been primarily tested under 9.4 but there has
>         been some
>         testing with 9.5.
> 
> 
>     To make sure this doesn't get lost, please add it to
>     https://commitfest.postgresql.org
>     <https://commitfest.postgresql.org>. Please verify the patch will
>     apply against current HEAD and pass make check-world.
> 
> 
> Attached please find a revision of the patch, changed in the following ways:
> 
> 1. removed a call to debug2.
> 2. applies cleanly against master (as of
> 8c5722948e831c1862a39da2bb5d793a6f2aabab)
> 3. one small indentation fix, one small verbiage fix.
> 4. switched to calculating the upper bound using XLOG_SEG_SIZE rather
> than hard-coding 16384.
> 5. the git author is - obviously - different.
> 
> make check-world passes.
> I have added it to the commitfest.
> I have verified with strace that up to 16MB sends are being used.
> I have verified that the GUC properly grumps about values greater than
> XLOG_SEG_SIZE / 1024 or smaller than 4.

This patch applies cleanly on cccbdde and compiles.  However,
documentation in config.sgml is needed.

The concept is simple enough though there seems to be some argument
about whether or not the patch is necessary.  In my experience 128K
should be more than large enough for a chunk size, but I'll buy the
argument that libpq is acting as a barrier in this case.

I'm marking this patch "Waiting on Author" for required documentation.

-- 
-David
david@pgmasters.net





On Thu, Mar 16, 2017 at 9:59 AM, David Steele <david@pgmasters.net> wrote:
On 1/9/17 11:33 PM, Jon Nelson wrote:
>
> On Sat, Jan 7, 2017 at 7:48 PM, Jim Nasby <Jim.Nasby@bluetreble.com
> <mailto:Jim.Nasby@bluetreble.com>> wrote:
>
>     On 1/5/17 12:55 PM, Jonathon Nelson wrote:
>
>         Attached please find a patch for PostgreSQL 9.4 which changes the
>         maximum amount of data that the wal sender will send at any point in
>         time from the hard-coded value of 128KiB to a user-controllable
>         value up
>         to 16MiB. It has been primarily tested under 9.4 but there has
>         been some
>         testing with 9.5.
>
>
>     To make sure this doesn't get lost, please add it to
>     https://commitfest.postgresql.org
>     <https://commitfest.postgresql.org>. Please verify the patch will
>     apply against current HEAD and pass make check-world.
>
>
> Attached please find a revision of the patch, changed in the following ways:
>
> 1. removed a call to debug2.
> 2. applies cleanly against master (as of
> 8c5722948e831c1862a39da2bb5d793a6f2aabab)
> 3. one small indentation fix, one small verbiage fix.
> 4. switched to calculating the upper bound using XLOG_SEG_SIZE rather
> than hard-coding 16384.
> 5. the git author is - obviously - different.
>
> make check-world passes.
> I have added it to the commitfest.
> I have verified with strace that up to 16MB sends are being used.
> I have verified that the GUC properly grumps about values greater than
> XLOG_SEG_SIZE / 1024 or smaller than 4.

This patch applies cleanly on cccbdde and compiles.  However,
documentation in config.sgml is needed.

The concept is simple enough though there seems to be some argument
about whether or not the patch is necessary.  In my experience 128K
should be more than large enough for a chunk size, but I'll buy the
argument that libpq is acting as a barrier in this case.

I'm marking this patch "Waiting on Author" for required documentation.

Thank you for testing and the comments.  I have some updates:

- I set up a network at home and - in some very quick testing - was unable to observe any obvious performance difference regardless of chunk size
- Before I could get any real testing done, one of the machines I was using for testing died and won't even POST, which has put a damper on said testing (as you might imagine).
- There is a small issue with the patch: a lower-bound of 4 is not appropriate; it should be XLOG_BLCKSZ / 1024 (I can submit an updated patch if that is appropriate)
- I am, at this time, unable to replicate the earlier results however I can't rule them out, either.


--
Jon

On Mon, Jan 9, 2017 at 4:27 PM, Jonathon Nelson <jdnelson@dyn.com> wrote:
> [I have not done a rigorous analysis here, but...]
>
> I *think* libpq is the culprit here.
>
> walsender says "Hey, libpq - please send (up to) 128KB of data!" and doesn't
> "return" until it's "sent". Then it sends more.  Regardless of the
> underlying cause (nagle, tcp congestion control algorithms, umpteen
> different combos of hardware and settings, etc..) in almost every test I saw
> improvement (usually quite a bit). This was most easily observable with high
> bandwidth-delay product links, but my time in the lab is somewhat limited.

This seems plausible to me.  If it takes X amount of time for the
upper layers to put Y amount of data into libpq's buffers, that
imposes some limit on overall throughput.
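
(Spelled out: if each Y bytes of WAL costs at least X seconds of hand-off before the next chunk can start, throughput can never exceed Y / X, no matter how fast the link is.)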

I mean, is it not sufficient to know that the performance improvement
is happening?  If it's happening, there's an explanation for why it's
happening.

It would be good if somebody else could try to reproduce these results, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On 3/16/17 11:53 AM, Jon Nelson wrote:
> 
> 
> On Thu, Mar 16, 2017 at 9:59 AM, David Steele <david@pgmasters.net
> <mailto:david@pgmasters.net>> wrote:
> 
>     On 1/9/17 11:33 PM, Jon Nelson wrote:
>     >
>     > On Sat, Jan 7, 2017 at 7:48 PM, Jim Nasby <Jim.Nasby@bluetreble.com <mailto:Jim.Nasby@bluetreble.com>
>     > <mailto:Jim.Nasby@bluetreble.com <mailto:Jim.Nasby@bluetreble.com>>> wrote:
>     >
>     >     On 1/5/17 12:55 PM, Jonathon Nelson wrote:
>     >
>     >         Attached please find a patch for PostgreSQL 9.4 which changes the
>     >         maximum amount of data that the wal sender will send at any point in
>     >         time from the hard-coded value of 128KiB to a user-controllable
>     >         value up
>     >         to 16MiB. It has been primarily tested under 9.4 but there has
>     >         been some
>     >         testing with 9.5.
>     >
>     >
>     >     To make sure this doesn't get lost, please add it to
>     >     https://commitfest.postgresql.org <https://commitfest.postgresql.org>
>     >     <https://commitfest.postgresql.org
>     <https://commitfest.postgresql.org>>. Please verify the patch will
>     >     apply against current HEAD and pass make check-world.
>     >
>     >
>     > Attached please find a revision of the patch, changed in the following ways:
>     >
>     > 1. removed a call to debug2.
>     > 2. applies cleanly against master (as of
>     > 8c5722948e831c1862a39da2bb5d793a6f2aabab)
>     > 3. one small indentation fix, one small verbiage fix.
>     > 4. switched to calculating the upper bound using XLOG_SEG_SIZE rather
>     > than hard-coding 16384.
>     > 5. the git author is - obviously - different.
>     >
>     > make check-world passes.
>     > I have added it to the commitfest.
>     > I have verified with strace that up to 16MB sends are being used.
>     > I have verified that the GUC properly grumps about values greater than
>     > XLOG_SEG_SIZE / 1024 or smaller than 4.
> 
>     This patch applies cleanly on cccbdde and compiles.  However,
>     documentation in config.sgml is needed.
> 
>     The concept is simple enough though there seems to be some argument
>     about whether or not the patch is necessary.  In my experience 128K
>     should be more than large enough for a chunk size, but I'll buy the
>     argument that libpq is acting as a barrier in this case.
> 
>     I'm marking this patch "Waiting on Author" for required documentation.
> 
> 
> Thank you for testing and the comments.  I have some updates:
> 
> - I set up a network at home and - in some very quick testing - was
> unable to observe any obvious performance difference regardless of chunk
> size
> - Before I could get any real testing done, one of the machines I was
> using for testing died and won't even POST, which has put a damper on
> said testing (as you might imagine).
> - There is a small issue with the patch: a lower-bound of 4 is not
> appropriate; it should be XLOG_BLCKSZ / 1024 (I can submit an updated
> patch if that is appropriate)
> - I am, at this time, unable to replicate the earlier results however I
> can't rule them out, either.

My recommendation is that we mark this patch "Returned with Feedback" to
allow you time to test and refine the patch.  You can resubmit once it
is ready.

Thanks,
-- 
-David
david@pgmasters.net



On 3/16/17 11:56 AM, David Steele wrote:
>
> My recommendation is that we mark this patch "Returned with Feedback" to
> allow you time to test and refine the patch.  You can resubmit once it
> is ready.

This submission has been marked "Returned with Feedback".  Please feel 
free to resubmit to a future commitfest.

Thanks,
-- 
-David
david@pgmasters.net