Thread: BUG #5585: SSL problems with long COPYs
The following bug has been logged online: Bug reference: 5585 Logged by: Karl Denninger Email address: karl@denninger.net PostgreSQL version: 8.4.4 Operating system: FreeBSD 8.0 Description: SSL problems with long COPYs Details: This is a copy of a message I posted this evening on the SLONY list. Synopsis: With SSL ON a large table copy containing a BYTEA field fails repeatedly a few minutes into the operation. The table itself is some ~25gb in size, and a dump and restore of the table onto a different machine (or the same machine) can be completed without incident (e.g. the data is known to not be corrupt.) However, during a COPY performed by SLONY 2.0.4, the system says it lost the connection. It in fact did not - the copy is running on the server, although the output is closed (thus it is dumping into a closed file descriptor) Perusal of the server logs discloses that the SSL module is apparently internally faulting, although the actual error message is less than useful in trying to track down the cause. Here is the copy of the email I just sent to the SLONY list. Note that the copy, while not yet complete, is proceeding normally (and about 20% in - a good 10x further than it ever got with SSL enabled) - with SSL turned off. It is far enough along at this point that I am reasonably confident that with SSL off it will complete as expected. Karl Denninger wrote: > > I upgraded somewhat-recently from 2.0.2 to 2.0.4, and now I've got a > > serious problem. > > > > The reason for the "gotta do it now" was that somehow one of the tables > > got out of sync, and a delete was failing to propagate - hanging the > > process. > > > > OK, ok, so 2.0.2 with Postgres 8.4.4 is a bit old and mismatched. So I > > upgraded to 2.0.4 on all the nodes, and told the subscriber to reload - > > ditched the client config and re-subscribed the sets. > > > > All went well until a very large table came up - it failed. > > > > There's no error in the logs indicating why, other than the following: > > > > Jul 31 22:52:53 dbms TICKER[70295]: [153-1] CONFIG remoteWorkerThread_3: > > copy table "public"."images" > > Jul 31 22:52:53 dbms TICKER[70295]: [154-1] CONFIG remoteWorkerThread_3: > > Begin COPY of table "public"."images" > > Jul 31 22:54:24 dbms TICKER[70295]: [155-1] ERROR remoteWorkerThread_3: > > PGgetCopyData() server closed the connection unexpectedly > > Jul 31 22:54:24 dbms TICKER[70295]: [155-2] This probably means the > > server terminated abnormally > > Jul 31 22:54:24 dbms TICKER[70295]: [155-3] before or while > > processing the request. > > Jul 31 22:54:24 dbms TICKER[70295]: [156-1] WARN remoteWorkerThread_3: > > data copy for set 1 failed 1 times - sleep 15 seconds > > > > And in 15 seconds, the entire process of trying to re-init the node > > starts over - from the beginning! > > > > Near as I can tell, it's failing pretty early on. > > > > The source host is fine. This particular table contains a BYTEA field, > > and it's BIG. ~20ish gigs big. But I've re-initialized in the past > > without problems. I tried going back to 2.0.2, and that still fails. > > Both servers are running with encoding set to SQL_ASCII, if it matters. > > > > When it fails the SERVER's COPY is still running - so the client is > > definitely wrong on the reported error. I have NOTHING in the server's > > SLON log and there are no comms problems between the two hosts. > > > > I'm going to run a dump of the table and see if I can manually bring it > > over to the other host and load it. There's nothing going on with the > > master that implicates the data being damaged...... > > > > Ideas? > > > > -- Karl > > Oh boy, I think I know what's going on.... This looks like a problem in the SSL code (!!!!!) I shut off SSL and I'm now ~4x further than where the copy has previously failed. For obvious reasons this is decidedly un-good - a perusal of the server's log (not slony's, but the server's) disclosed several SSL errors in the log, which would account for the problem and what the client thought were disconnects - but weren't. Specifically, I got a bunch of these... Jul 31 23:00:00 tickerforum postgres[27093]: [9593-2] STATEMENT: copy "public"."images" ("post_ordinal","ordinal","caption","image","login","file_type","thumb","thu mb_width","thumb_height","hidden") to stdout; Jul 31 23:00:00 tickerforum postgres[27093]: [9594-1] LOG: SSL error: internal error Jul 31 23:00:00 tickerforum postgres[27093]: [9594-2] STATEMENT: copy "public"."images" ("post_ordinal","ordinal","caption","image","login","file_type","thumb","thu mb_width","thumb_height","hidden") to stdout; Jul 31 23:00:01 tickerforum postgres[27093]: [9595-1] LOG: SSL error: internal error This isn't SLONY's issue, but it's definitely a problem. I'll report it over on the Postgres list in the morning... In the meantime, I think I can get the copy to go with SSL off, and then turn it back on once the copy is complete.
On Sun, Aug 1, 2010 at 00:08, Karl Denninger <karl@denninger.net> wrote: > > The following bug has been logged online: > > Bug reference: =C2=A0 =C2=A0 =C2=A05585 > Logged by: =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Karl Denninger > Email address: =C2=A0 =C2=A0 =C2=A0karl@denninger.net > PostgreSQL version: 8.4.4 > Operating system: =C2=A0 FreeBSD 8.0 > Description: =C2=A0 =C2=A0 =C2=A0 =C2=A0SSL problems with long COPYs > Details: > > This is a copy of a message I posted this evening on the SLONY list. > > Synopsis: With SSL ON a large table copy containing a BYTEA field fails > repeatedly a few minutes into the operation. My guess is its due to the server or client disabling ssl renegotiation, per the docs: ssl_renegotiation_limit (integer) Specifies how much data can flow over an SSL encrypted connection before renegotiation of the session will take place. Renegotiation of the session decreases the chance of doing cryptanalysis when large amounts of data are sent, but it also carries a large performance penalty. The sum of sent and received traffic is used to check the limit. If the parameter is set to 0, renegotiation is disabled. The default is 512MB. Note: SSL libraries from before November 2009 are insecure when using SSL renegotiation, due to a vulnerability in the SSL protocol. As a stop-gap fix for this vulnerability, some vendors also shipped SSL libraries incapable of doing renegotiation. If any of these libraries are in use on the client or server, SSL renegotiation should be disabled. Id try setting that to 0 in your postgresql.conf and see if it still fails.
Alex Hunsaker wrote:
-- Karl
I will attempt this but it is at least somewhat unlikely to be the cause, as prior to the failure two tables of over 1GB each did correctly transfer. They did not, however, have any binary (bytea) fields in them.On Sun, Aug 1, 2010 at 00:08, Karl Denninger <karl@denninger.net> wrote:The following bug has been logged online: Bug reference: 5585 Logged by: Karl Denninger Email address: karl@denninger.net PostgreSQL version: 8.4.4 Operating system: FreeBSD 8.0 Description: SSL problems with long COPYs Details: This is a copy of a message I posted this evening on the SLONY list. Synopsis: With SSL ON a large table copy containing a BYTEA field fails repeatedly a few minutes into the operation.My guess is its due to the server or client disabling ssl renegotiation, per the docs: ssl_renegotiation_limit (integer) Specifies how much data can flow over an SSL encrypted connection before renegotiation of the session will take place. Renegotiation of the session decreases the chance of doing cryptanalysis when large amounts of data are sent, but it also carries a large performance penalty. The sum of sent and received traffic is used to check the limit. If the parameter is set to 0, renegotiation is disabled. The default is 512MB. Note: SSL libraries from before November 2009 are insecure when using SSL renegotiation, due to a vulnerability in the SSL protocol. As a stop-gap fix for this vulnerability, some vendors also shipped SSL libraries incapable of doing renegotiation. If any of these libraries are in use on the client or server, SSL renegotiation should be disabled. Id try setting that to 0 in your postgresql.conf and see if it still fails.
-- Karl
Attachment
On 08/01/2010 08:33 AM, Karl Denninger wrote: > Alex Hunsaker wrote: >> On Sun, Aug 1, 2010 at 00:08, Karl Denninger<karl@denninger.net> wrote: >> >>> The following bug has been logged online: >>> >>> Bug reference: 5585 >>> Logged by: Karl Denninger >>> Email address:karl@denninger.net >>> PostgreSQL version: 8.4.4 >>> Operating system: FreeBSD 8.0 >>> Description: SSL problems with long COPYs >>> Details: >>> >>> This is a copy of a message I posted this evening on the SLONY list. >>> >>> Synopsis: With SSL ON a large table copy containing a BYTEA field fails >>> repeatedly a few minutes into the operation. >>> >> >> My guess is its due to the server or client disabling ssl >> renegotiation, per the docs: >> >> ssl_renegotiation_limit (integer) >> Specifies how much data can flow over an SSL encrypted connection >> before renegotiation of the session will take place. Renegotiation of >> the session decreases the chance of doing cryptanalysis when large >> amounts of data are sent, but it also carries a large performance >> penalty. The sum of sent and received traffic is used to check the >> limit. If the parameter is set to 0, renegotiation is disabled. The >> default is 512MB. >> >> Note: SSL libraries from before November 2009 are insecure when using >> SSL renegotiation, due to a vulnerability in the SSL protocol. As a >> stop-gap fix for this vulnerability, some vendors also shipped SSL >> libraries incapable of doing renegotiation. If any of these libraries >> are in use on the client or server, SSL renegotiation should be >> disabled. >> >> Id try setting that to 0 in your postgresql.conf and see if it still fails. >> >> > I will attempt this but it is at least somewhat unlikely to be the > cause, as prior to the failure two tables of over 1GB each did correctly > transfer. They did not, however, have any binary (bytea) fields in them. how exactly did you measure the 1GB? If that's the on-disk size of the table (maybe even including indexes) it is entirely believable that the amount of data transfered through COPY on the SSL-session was much less than 512MB. Given the symthoms reported I would agree with Alex on suspecting a broken openssl library. Stefan
Stefan Kaltenbrunner wrote: > On 08/01/2010 08:33 AM, Karl Denninger wrote: >> Alex Hunsaker wrote: >>> On Sun, Aug 1, 2010 at 00:08, Karl Denninger<karl@denninger.net> >>> wrote: >>> >>>> The following bug has been logged online: >>>> >>>> Bug reference: 5585 >>>> Logged by: Karl Denninger >>>> Email address:karl@denninger.net >>>> PostgreSQL version: 8.4.4 >>>> Operating system: FreeBSD 8.0 >>>> Description: SSL problems with long COPYs >>>> Details: >>>> >>>> This is a copy of a message I posted this evening on the SLONY list. >>>> >>>> Synopsis: With SSL ON a large table copy containing a BYTEA field >>>> fails >>>> repeatedly a few minutes into the operation. >>>> >>> >>> My guess is its due to the server or client disabling ssl >>> renegotiation, per the docs: >>> >>> ssl_renegotiation_limit (integer) >>> Specifies how much data can flow over an SSL encrypted connection >>> before renegotiation of the session will take place. Renegotiation of >>> the session decreases the chance of doing cryptanalysis when large >>> amounts of data are sent, but it also carries a large performance >>> penalty. The sum of sent and received traffic is used to check the >>> limit. If the parameter is set to 0, renegotiation is disabled. The >>> default is 512MB. >>> >>> Note: SSL libraries from before November 2009 are insecure when using >>> SSL renegotiation, due to a vulnerability in the SSL protocol. As a >>> stop-gap fix for this vulnerability, some vendors also shipped SSL >>> libraries incapable of doing renegotiation. If any of these libraries >>> are in use on the client or server, SSL renegotiation should be >>> disabled. >>> >>> Id try setting that to 0 in your postgresql.conf and see if it still >>> fails. >>> >>> >> I will attempt this but it is at least somewhat unlikely to be the >> cause, as prior to the failure two tables of over 1GB each did correctly >> transfer. They did not, however, have any binary (bytea) fields in >> them. > > how exactly did you measure the 1GB? If that's the on-disk size of the > table (maybe even including indexes) it is entirely believable that > the amount of data transfered through COPY on the SSL-session was much > less than 512MB. > Given the symthoms reported I would agree with Alex on suspecting a > broken openssl library. The reported copy table size in the SLON log. It exceeded 1GB for two of the tables the successfully came over before the error. -- Karl
Attachment
Karl Denninger <karl@denninger.net> writes: > Stefan Kaltenbrunner wrote: >> how exactly did you measure the 1GB? > The reported copy table size in the SLON log. It exceeded 1GB for two > of the tables the successfully came over before the error. Hmm, I'm not sure how Slony comes by that number, so this might or might not be meaningful. I agree with the other respondents that the symptom sounds exactly like broken renegotiation --- the earliest security patches to close the openssl CVE hole resulted in failures exactly like this whenever the server tried to force key renegotiation. You might check whether libssl was recently updated on either the server or client machine. regards, tom lane
Tom Lane wrote:
I've also shut off SSL renegotiation for now and will leave it off until I can figure out what's up - if it breaks during my testing with renegotiation off (or not) I'll update here.
-- Karl
I've turned off SSL for now and the copy appears to be proceeding normally. I'll see if I can isolate this further once I have the replication stabilized again - I have a test box I can set up as a third node off one of the others on a higher-bandwidth connection, which will make testing this in some more detail easier and faster.Karl Denninger <karl@denninger.net> writes:Stefan Kaltenbrunner wrote:how exactly did you measure the 1GB?The reported copy table size in the SLON log. It exceeded 1GB for two of the tables the successfully came over before the error.Hmm, I'm not sure how Slony comes by that number, so this might or might not be meaningful. I agree with the other respondents that the symptom sounds exactly like broken renegotiation --- the earliest security patches to close the openssl CVE hole resulted in failures exactly like this whenever the server tried to force key renegotiation. You might check whether libssl was recently updated on either the server or client machine. regards, tom lane
I've also shut off SSL renegotiation for now and will leave it off until I can figure out what's up - if it breaks during my testing with renegotiation off (or not) I'll update here.
-- Karl
Attachment
Tom Lane wrote:
There's a second problem in that SLONY appears to have a memory management issue that I've tickled with a COPY of this particular table, and it's a bad one - it may preclude me from being able to resync at all - but that's not Postgres' fault.
Looks like this bug report can be closed as the issue does not appear to be yours beyond the SSL issue that is documented.
(Whether Postgress 9's internal replication will solve this for me when it is released is something I'm not sure about - I think the answer is "no", since if I'm reading the docs correctly Postgres 9 requires that both master and slave be in sync via some other method before the replication is enabled - that is, it's not capable of taking a "raw" (populated with empty tables or not) new system and bringing it into sync and then replicating from there. That's a major problem in a "live" environment if there's a failure of some sort and you want to bring the system that failed back into the cluster......)
-- Karl
I set the ssl_renegotiation off and the copy is now being attempted (and is well past where it failed before) with SSL on.Karl Denninger <karl@denninger.net> writes:Stefan Kaltenbrunner wrote:how exactly did you measure the 1GB.The reported copy table size in the SLON log. It exceeded 1GB for two of the tables the successfully came over before the error.Hmm, I'm not sure how Slony comes by that number, so this might or might not be meaningful. I agree with the other respondents that the symptom sounds exactly like broken renegotiation --- the earliest security patches to close the openssl CVE hole resulted in failures exactly like this whenever the server tried to force key renegotiation. You might check whether libssl was recently updated on either the server or client machine.
There's a second problem in that SLONY appears to have a memory management issue that I've tickled with a COPY of this particular table, and it's a bad one - it may preclude me from being able to resync at all - but that's not Postgres' fault.
Looks like this bug report can be closed as the issue does not appear to be yours beyond the SSL issue that is documented.
(Whether Postgress 9's internal replication will solve this for me when it is released is something I'm not sure about - I think the answer is "no", since if I'm reading the docs correctly Postgres 9 requires that both master and slave be in sync via some other method before the replication is enabled - that is, it's not capable of taking a "raw" (populated with empty tables or not) new system and bringing it into sync and then replicating from there. That's a major problem in a "live" environment if there's a failure of some sort and you want to bring the system that failed back into the cluster......)
-- Karl
Attachment
On Sun, Aug 1, 2010 at 14:02, Karl Denninger <karl@denninger.net> wrote: > (Whether Postgress 9's internal replication will solve this for me when it > is released is something I'm not sure about - I think the answer is "no", > since if I'm reading the docs correctly Postgres 9 requires that both master > and slave be in sync via some other method before the replication is enabled Not only that, but you cant sync certain tables or databases. However, I have had great success with a simple: select pg_start_backup(); ... rsync database dir to slave, fixup postgresql.conf,recovery.conf select pg_stop_backup(); That lets you "sync" up the slave with 0 downtime. (the io from rsync can be a problem tho)