Thread: BUG #5585: SSL problems with long COPYs

BUG #5585: SSL problems with long COPYs

From
"Karl Denninger"
Date:
The following bug has been logged online:

Bug reference:      5585
Logged by:          Karl Denninger
Email address:      karl@denninger.net
PostgreSQL version: 8.4.4
Operating system:   FreeBSD 8.0
Description:        SSL problems with long COPYs
Details:

This is a copy of a message I posted this evening on the SLONY list.

Synopsis: With SSL ON a large table copy containing a BYTEA field fails
repeatedly a few minutes into the operation.  The table itself is some ~25gb
in size, and a dump and restore of the table onto a different machine (or
the same machine) can be completed without incident (e.g. the data is known
to not be corrupt.)

However, during a COPY performed by SLONY 2.0.4, the system says it lost the
connection.  It in fact did not - the copy is running on the server,
although the output is closed (thus it is dumping into a closed file
descriptor)

Perusal of the server logs discloses that the SSL module is apparently
internally faulting, although the actual error message is less than useful
in trying to track down the cause.

Here is the copy of the email I just sent to the SLONY list.

Note that the copy, while not yet complete, is proceeding normally (and
about 20% in - a good 10x further than it ever got with SSL enabled) - with
SSL turned off.  It is far enough along at this point that I am reasonably
confident that with SSL off it will complete as expected.

Karl Denninger wrote:
> > I upgraded somewhat-recently from 2.0.2 to 2.0.4, and now I've got a
> > serious problem.
> >
> > The reason for the "gotta do it now" was that somehow one of the tables
> > got out of sync, and a delete was failing to propagate - hanging the
> > process.
> >
> > OK, ok, so 2.0.2 with Postgres 8.4.4 is a bit old and mismatched.  So I
> > upgraded to 2.0.4 on all the nodes, and told the subscriber to reload -
> > ditched the client config and re-subscribed the sets.
> >
> > All went well until a very large table came up - it failed.
> >
> > There's no error in the logs indicating why, other than the following:
> >
> > Jul 31 22:52:53 dbms TICKER[70295]: [153-1] CONFIG
remoteWorkerThread_3:
> > copy table "public"."images"
> > Jul 31 22:52:53 dbms TICKER[70295]: [154-1] CONFIG
remoteWorkerThread_3:
> > Begin COPY of table "public"."images"
> > Jul 31 22:54:24 dbms TICKER[70295]: [155-1] ERROR
remoteWorkerThread_3:
> > PGgetCopyData() server closed the connection unexpectedly
> > Jul 31 22:54:24 dbms TICKER[70295]: [155-2]     This probably means the
> > server terminated abnormally
> > Jul 31 22:54:24 dbms TICKER[70295]: [155-3]     before or while
> > processing the request.
> > Jul 31 22:54:24 dbms TICKER[70295]: [156-1] WARN
remoteWorkerThread_3:
> > data copy for set 1 failed 1 times - sleep 15 seconds
> >
> > And in 15 seconds, the entire process of trying to re-init the node
> > starts over - from the beginning!
> >
> > Near as I can tell, it's failing pretty early on.
> >
> > The source host is fine.  This particular table contains a BYTEA field,
> > and it's BIG.  ~20ish gigs big.  But I've re-initialized in the past
> > without problems.  I tried going back to 2.0.2, and that still fails.
> > Both servers are running with encoding set to SQL_ASCII, if it matters.
> >
> > When it fails the SERVER's COPY is still running - so the client is
> > definitely wrong on the reported error.  I have NOTHING in the server's
> > SLON log and there are no comms problems between the two hosts.
> >
> > I'm going to run a dump of the table and see if I can manually bring it
> > over to the other host and load it.  There's nothing going on with the
> > master that implicates the data being damaged......
> >
> > Ideas?
> >
> > -- Karl
> >
Oh boy, I think I know what's going on....

This looks like a problem in the SSL code (!!!!!)

I shut off SSL and I'm now ~4x further than where the copy has
previously failed.  For obvious reasons this is decidedly un-good - a
perusal of the server's log (not slony's, but the server's) disclosed
several SSL errors in the log, which would account for the problem and
what the client thought were disconnects - but weren't.

Specifically, I got a bunch of these...

Jul 31 23:00:00 tickerforum postgres[27093]: [9593-2] STATEMENT:  copy
"public"."images"
("post_ordinal","ordinal","caption","image","login","file_type","thumb","thu
mb_width","thumb_height","hidden")
to stdout;
Jul 31 23:00:00 tickerforum postgres[27093]: [9594-1] LOG:  SSL error:
internal error
Jul 31 23:00:00 tickerforum postgres[27093]: [9594-2] STATEMENT:  copy
"public"."images"
("post_ordinal","ordinal","caption","image","login","file_type","thumb","thu
mb_width","thumb_height","hidden")
to stdout;
Jul 31 23:00:01 tickerforum postgres[27093]: [9595-1] LOG:  SSL error:
internal error

This isn't SLONY's issue, but it's definitely a problem.  I'll report it
over on the Postgres list in the morning...

In the meantime, I think I can get the copy to go with SSL off, and then
turn it back on once the copy is complete.

Re: BUG #5585: SSL problems with long COPYs

From
Alex Hunsaker
Date:
On Sun, Aug 1, 2010 at 00:08, Karl Denninger <karl@denninger.net> wrote:
>
> The following bug has been logged online:
>
> Bug reference: =C2=A0 =C2=A0 =C2=A05585
> Logged by: =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Karl Denninger
> Email address: =C2=A0 =C2=A0 =C2=A0karl@denninger.net
> PostgreSQL version: 8.4.4
> Operating system: =C2=A0 FreeBSD 8.0
> Description: =C2=A0 =C2=A0 =C2=A0 =C2=A0SSL problems with long COPYs
> Details:
>
> This is a copy of a message I posted this evening on the SLONY list.
>
> Synopsis: With SSL ON a large table copy containing a BYTEA field fails
> repeatedly a few minutes into the operation.

My guess is its due to the server or client disabling ssl
renegotiation, per the docs:

ssl_renegotiation_limit (integer)
Specifies how much data can flow over an SSL encrypted connection
before renegotiation of the session will take place. Renegotiation of
the session decreases the chance of doing cryptanalysis when large
amounts of data are sent, but it also carries a large performance
penalty. The sum of sent and received traffic is used to check the
limit. If the parameter is set to 0, renegotiation is disabled. The
default is 512MB.

Note: SSL libraries from before November 2009 are insecure when using
SSL renegotiation, due to a vulnerability in the SSL protocol. As a
stop-gap fix for this vulnerability, some vendors also shipped SSL
libraries incapable of doing renegotiation. If any of these libraries
are in use on the client or server, SSL renegotiation should be
disabled.

Id try setting that to 0 in your postgresql.conf and see if it still fails.

Re: BUG #5585: SSL problems with long COPYs

From
Karl Denninger
Date:
Alex Hunsaker wrote:
On Sun, Aug 1, 2010 at 00:08, Karl Denninger <karl@denninger.net> wrote: 
The following bug has been logged online:

Bug reference:      5585
Logged by:          Karl Denninger
Email address:      karl@denninger.net
PostgreSQL version: 8.4.4
Operating system:   FreeBSD 8.0
Description:        SSL problems with long COPYs
Details:

This is a copy of a message I posted this evening on the SLONY list.

Synopsis: With SSL ON a large table copy containing a BYTEA field fails
repeatedly a few minutes into the operation.   
My guess is its due to the server or client disabling ssl
renegotiation, per the docs:

ssl_renegotiation_limit (integer)
Specifies how much data can flow over an SSL encrypted connection
before renegotiation of the session will take place. Renegotiation of
the session decreases the chance of doing cryptanalysis when large
amounts of data are sent, but it also carries a large performance
penalty. The sum of sent and received traffic is used to check the
limit. If the parameter is set to 0, renegotiation is disabled. The
default is 512MB.

Note: SSL libraries from before November 2009 are insecure when using
SSL renegotiation, due to a vulnerability in the SSL protocol. As a
stop-gap fix for this vulnerability, some vendors also shipped SSL
libraries incapable of doing renegotiation. If any of these libraries
are in use on the client or server, SSL renegotiation should be
disabled.

Id try setting that to 0 in your postgresql.conf and see if it still fails.
 
I will attempt this but it is at least somewhat unlikely to be the cause, as prior to the failure two tables of over 1GB each did correctly transfer.  They did not, however, have any binary (bytea) fields in them.

-- Karl

Attachment

Re: BUG #5585: SSL problems with long COPYs

From
Stefan Kaltenbrunner
Date:
On 08/01/2010 08:33 AM, Karl Denninger wrote:
> Alex Hunsaker wrote:
>> On Sun, Aug 1, 2010 at 00:08, Karl Denninger<karl@denninger.net>  wrote:
>>
>>> The following bug has been logged online:
>>>
>>> Bug reference:      5585
>>> Logged by:          Karl Denninger
>>> Email address:karl@denninger.net
>>> PostgreSQL version: 8.4.4
>>> Operating system:   FreeBSD 8.0
>>> Description:        SSL problems with long COPYs
>>> Details:
>>>
>>> This is a copy of a message I posted this evening on the SLONY list.
>>>
>>> Synopsis: With SSL ON a large table copy containing a BYTEA field fails
>>> repeatedly a few minutes into the operation.
>>>
>>
>> My guess is its due to the server or client disabling ssl
>> renegotiation, per the docs:
>>
>> ssl_renegotiation_limit (integer)
>> Specifies how much data can flow over an SSL encrypted connection
>> before renegotiation of the session will take place. Renegotiation of
>> the session decreases the chance of doing cryptanalysis when large
>> amounts of data are sent, but it also carries a large performance
>> penalty. The sum of sent and received traffic is used to check the
>> limit. If the parameter is set to 0, renegotiation is disabled. The
>> default is 512MB.
>>
>> Note: SSL libraries from before November 2009 are insecure when using
>> SSL renegotiation, due to a vulnerability in the SSL protocol. As a
>> stop-gap fix for this vulnerability, some vendors also shipped SSL
>> libraries incapable of doing renegotiation. If any of these libraries
>> are in use on the client or server, SSL renegotiation should be
>> disabled.
>>
>> Id try setting that to 0 in your postgresql.conf and see if it still fails.
>>
>>
> I will attempt this but it is at least somewhat unlikely to be the
> cause, as prior to the failure two tables of over 1GB each did correctly
> transfer.  They did not, however, have any binary (bytea) fields in them.

how exactly did you measure the 1GB? If that's the on-disk size of the
table (maybe even including indexes) it is entirely believable that the
amount of data transfered through COPY on the SSL-session was much less
than 512MB.
Given the symthoms reported I would agree with Alex on suspecting a
broken openssl library.



Stefan

Re: BUG #5585: SSL problems with long COPYs

From
Karl Denninger
Date:
Stefan Kaltenbrunner wrote:
> On 08/01/2010 08:33 AM, Karl Denninger wrote:
>> Alex Hunsaker wrote:
>>> On Sun, Aug 1, 2010 at 00:08, Karl Denninger<karl@denninger.net>
>>> wrote:
>>>
>>>> The following bug has been logged online:
>>>>
>>>> Bug reference:      5585
>>>> Logged by:          Karl Denninger
>>>> Email address:karl@denninger.net
>>>> PostgreSQL version: 8.4.4
>>>> Operating system:   FreeBSD 8.0
>>>> Description:        SSL problems with long COPYs
>>>> Details:
>>>>
>>>> This is a copy of a message I posted this evening on the SLONY list.
>>>>
>>>> Synopsis: With SSL ON a large table copy containing a BYTEA field
>>>> fails
>>>> repeatedly a few minutes into the operation.
>>>>
>>>
>>> My guess is its due to the server or client disabling ssl
>>> renegotiation, per the docs:
>>>
>>> ssl_renegotiation_limit (integer)
>>> Specifies how much data can flow over an SSL encrypted connection
>>> before renegotiation of the session will take place. Renegotiation of
>>> the session decreases the chance of doing cryptanalysis when large
>>> amounts of data are sent, but it also carries a large performance
>>> penalty. The sum of sent and received traffic is used to check the
>>> limit. If the parameter is set to 0, renegotiation is disabled. The
>>> default is 512MB.
>>>
>>> Note: SSL libraries from before November 2009 are insecure when using
>>> SSL renegotiation, due to a vulnerability in the SSL protocol. As a
>>> stop-gap fix for this vulnerability, some vendors also shipped SSL
>>> libraries incapable of doing renegotiation. If any of these libraries
>>> are in use on the client or server, SSL renegotiation should be
>>> disabled.
>>>
>>> Id try setting that to 0 in your postgresql.conf and see if it still
>>> fails.
>>>
>>>
>> I will attempt this but it is at least somewhat unlikely to be the
>> cause, as prior to the failure two tables of over 1GB each did correctly
>> transfer.  They did not, however, have any binary (bytea) fields in
>> them.
>
> how exactly did you measure the 1GB? If that's the on-disk size of the
> table (maybe even including indexes) it is entirely believable that
> the amount of data transfered through COPY on the SSL-session was much
> less than 512MB.
> Given the symthoms reported I would agree with Alex on suspecting a
> broken openssl library.
The reported copy table size in the SLON log.   It exceeded 1GB for two
of the tables the successfully came over before the error.

-- Karl

Attachment

Re: BUG #5585: SSL problems with long COPYs

From
Tom Lane
Date:
Karl Denninger <karl@denninger.net> writes:
> Stefan Kaltenbrunner wrote:
>> how exactly did you measure the 1GB?

> The reported copy table size in the SLON log.   It exceeded 1GB for two
> of the tables the successfully came over before the error.

Hmm, I'm not sure how Slony comes by that number, so this might or might
not be meaningful.  I agree with the other respondents that the symptom
sounds exactly like broken renegotiation --- the earliest security
patches to close the openssl CVE hole resulted in failures exactly like
this whenever the server tried to force key renegotiation.  You might
check whether libssl was recently updated on either the server or client
machine.

            regards, tom lane

Re: BUG #5585: SSL problems with long COPYs

From
Karl Denninger
Date:
Tom Lane wrote:
Karl Denninger <karl@denninger.net> writes: 
Stefan Kaltenbrunner wrote:   
how exactly did you measure the 1GB?     
 
The reported copy table size in the SLON log.   It exceeded 1GB for two
of the tables the successfully came over before the error.   
Hmm, I'm not sure how Slony comes by that number, so this might or might
not be meaningful.  I agree with the other respondents that the symptom
sounds exactly like broken renegotiation --- the earliest security
patches to close the openssl CVE hole resulted in failures exactly like
this whenever the server tried to force key renegotiation.  You might
check whether libssl was recently updated on either the server or client
machine.
		regards, tom lane 
I've turned off SSL for now and the copy appears to be proceeding normally.  I'll see if I can isolate this further once I have the replication stabilized again - I have a test box I can set up as a third node off one of the others on a higher-bandwidth connection, which will make testing this in some more detail easier and faster.

I've also shut off SSL renegotiation for now and will leave it off until I can figure out what's up - if it breaks during my testing with renegotiation off (or not) I'll update here.

-- Karl
Attachment

Re: BUG #5585: SSL problems with long COPYs

From
Karl Denninger
Date:
Tom Lane wrote:
Karl Denninger <karl@denninger.net> writes: 
Stefan Kaltenbrunner wrote:   
how exactly did you measure the 1GB.
The reported copy table size in the SLON log.   It exceeded 1GB for two
of the tables the successfully came over before the error.   
Hmm, I'm not sure how Slony comes by that number, so this might or might
not be meaningful.  I agree with the other respondents that the symptom
sounds exactly like broken renegotiation --- the earliest security
patches to close the openssl CVE hole resulted in failures exactly like
this whenever the server tried to force key renegotiation.  You might
check whether libssl was recently updated on either the server or client
machine. 
I set the ssl_renegotiation off and the copy is now being attempted (and is well past where it failed before) with SSL on.

There's a second problem in that SLONY appears to have a memory management issue that I've tickled with a COPY of this particular table, and it's a bad one - it may preclude me from being able to resync at all - but that's not Postgres' fault.

Looks like this bug report can be closed as the issue does not appear to be yours beyond the SSL issue that is documented.

(Whether Postgress 9's internal replication will solve this for me when it is released is something I'm not sure about - I think the answer is "no", since if I'm reading the docs correctly Postgres 9 requires that both master and slave be in sync via some other method before the replication is enabled - that is, it's not capable of taking a "raw" (populated with empty tables or not) new system and bringing it into sync and then replicating from there.  That's a major problem in a "live" environment if there's a failure of some sort and you want to bring the system that failed back into the cluster......)

-- Karl
Attachment

Re: BUG #5585: SSL problems with long COPYs

From
Alex Hunsaker
Date:
On Sun, Aug 1, 2010 at 14:02, Karl Denninger <karl@denninger.net> wrote:

> (Whether Postgress 9's internal replication will solve this for me when it
> is released is something I'm not sure about - I think the answer is "no",
> since if I'm reading the docs correctly Postgres 9 requires that both master
> and slave be in sync via some other method before the replication is enabled

Not only that, but you cant sync certain tables or databases.
However, I have had great success with a simple:
select pg_start_backup();
... rsync database dir to slave, fixup postgresql.conf,recovery.conf
select pg_stop_backup();

That lets you "sync" up the slave with 0 downtime. (the io from rsync
can be a problem tho)