Thread: BUG #3266: SSL broken pipes kill the machine and fill the disk
The following bug has been logged online:

Bug reference: 3266
Logged by: Peter Koczan
Email address: pjkoczan@gmail.com
PostgreSQL version: 8.2.4
Operating system: CentOS Linux 4.4 (RHEL 4) running on Pentium 4
Description: SSL broken pipes kill the machine and fill the disk
Details:

If a connection using SSL is terminated on the client side before a query completes, postgres keeps trying to write to the broken connection, driving CPU and load very high and filling the postgres syslog (I have that pointed to /var/log/pglog) with ~2000 of the following messages per second.

May 10 14:45:01 mitchell postgres[10340]: [15729-1] LOG: SSL SYSCALL error: Broken pipe

This quickly fills up the /var partition on the server.

To replicate the problem:
1. Connect to a running server using an SSL connection. Using psql is fine.
2. Begin a query on any table. For full effect the query should be expensive and large.
3. Kill psql *on the client side* BEFORE the query finishes (don't do anything to the server side connection).
4. 'tail -f' wherever the postgres server output and error is going to.
5. Wait a few seconds while the server gets all of its data.
6. See thousands of error messages fill up your terminal on the server.

This has also happened when people stop web browsers in the middle of serving up a postgresql-driven web page, but this is harder to replicate.

The server usually does stop eventually, but only after about 3 hours for a query that normally takes 20 seconds. During this time, the server is slow to the point of being unusable.
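To illustrate the "Broken pipe" part of the report, here is a minimal sketch in C, assuming a POSIX system. It uses a plain pipe rather than an SSL socket, and the function name count_broken_pipe_failures is made up for this example; it is not PostgreSQL code. It only shows that once the reading side is gone, every further write fails straight away with EPIPE, so a loop that keeps retrying never makes progress.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/*
 * Write to a pipe whose read end is already closed, `attempts` times,
 * and return how many of those writes failed with EPIPE ("Broken pipe").
 * Returns -1 if the pipe could not be created.
 * Hypothetical helper for illustration only.
 */
int count_broken_pipe_failures(int attempts)
{
    int fds[2];
    int failures = 0;

    assert(attempts >= 0);

    /*
     * Ignore SIGPIPE for the whole process; otherwise the first failed
     * write would terminate the program instead of returning EPIPE.
     */
    signal(SIGPIPE, SIG_IGN);

    if (pipe(fds) != 0)
        return -1;
    close(fds[0]);              /* the "client" goes away */

    for (int i = 0; i < attempts; i++)
    {
        if (write(fds[1], "x", 1) < 0 && errno == EPIPE)
            failures++;         /* every retry fails the same way */
    }

    close(fds[1]);
    return failures;
}
```

Calling count_broken_pipe_failures(5) counts five failures out of five attempts. If each such failure also produces a log line, the log grows as fast as the loop can spin.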
I didn't see any comment on this. Seems like a problem.

-- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com
This looks a lot like bug #2829 (except that one is on Windows), as I mentioned here: http://archives.postgresql.org/pgsql-hackers/2007-05/msg00461.php

I haven't looked into the actual code, but Tom had a suggestion in the original bug. AFAIK nobody has implemented it yet (at least I haven't :)

//Magnus
Magnus Hagander <magnus@hagander.net> writes:
> This looks a lot like bug #2829 (except that one is Windows), as I
> mentioned here:
> http://archives.postgresql.org/pgsql-hackers/2007-05/msg00461.php
> Haven't looked into the actual code, though, but Tom had a suggestion in
> the original bug, but AFAIK nobody has done that yet (at least not me.:)

I reproduced this on my own machine, and basically the problem seems to be that secure_write() has been coded to bleat on every failure. This behavior overrides the intelligence that was put into pqcomm.c's internal_flush() a long time ago to not report consecutive write failures ... which worked fine at the time it was written, because it was just calling send() not secure_write().

secure_write is obviously inconsistent anyway, since it doesn't elog anything in the non-SSL path.

Proposed fix:

1. For cases where errno conveys all the useful info (ie, SSL_ERROR_SYSCALL), secure_write should elog nothing and just let its caller do it, same as the plain send() path.

2. For SSL protocol errors (SSL_ERROR_SSL), we do want to print the error at least once. It is not clear whether repeated calls would be likely to produce the same failure, and we don't have any cheap way to tell whether the messages are duplicate. I'm inclined to leave that path alone until/unless we get reports of many duplicate messages from it.

regards, tom lane
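The idea of not reporting consecutive write failures can be sketched as follows. This is not the code from pqcomm.c or be-secure.c; it is a self-contained illustration in C in which every name (write_fn, broken_writer, working_writer, send_reporting_once, messages_logged) is hypothetical. The low-level writer stays silent and only sets errno, and the caller logs a failure only when the errno differs from the one it last reported.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical writer callback: returns bytes written, or -1 with errno set. */
typedef ssize_t (*write_fn)(const void *buf, size_t len);

static int last_reported_errno = 0;   /* errno of the last failure we logged */
static int messages_logged = 0;       /* counter kept only for demonstration */

/* Stand-in for a dead connection: every write fails with EPIPE, silently. */
ssize_t broken_writer(const void *buf, size_t len)
{
    (void) buf;
    (void) len;
    errno = EPIPE;
    return -1;
}

/* Stand-in for a healthy connection: pretends everything was written. */
ssize_t working_writer(const void *buf, size_t len)
{
    (void) buf;
    return (ssize_t) len;
}

/*
 * Make one write attempt through `writer`. A failure is logged only if
 * its errno differs from the last one reported; a success clears that
 * memory so a later failure is reported again.
 * Returns 0 on success, -1 on failure.
 */
int send_reporting_once(write_fn writer, const void *buf, size_t len)
{
    assert(writer != NULL && buf != NULL);

    if (writer(buf, len) < 0)
    {
        int err = errno;        /* save before calling anything else */

        if (err != last_reported_errno)
        {
            last_reported_errno = err;
            messages_logged++;
            fprintf(stderr, "could not send data to client: %s\n",
                    strerror(err));
        }
        return -1;
    }

    last_reported_errno = 0;
    return 0;
}
```

With this arrangement, a thousand consecutive calls through broken_writer produce a single log line instead of a thousand. The design choice is the one described in item 1 of the proposed fix: the layer that knows only errno says nothing, and the layer that remembers what it already reported does the logging.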
One more quick addendum...I tried this with non-SSL connections, and this problem did *not* arise when using non-SSL connections.
Yes, #2829 seems quite similar to my plight. I did take a look through the code tree, and there appear to be checks for an EINTR status within loops in src/backend/libpq/pqcomm.c (line 725 in function pq_recvbuf and line 1057 in function internal_flush) that could point to the problem. I don't know enough about OpenSSL, and it took me a long time to find out as much as I did.

FYI, I compiled against OpenSSL 0.9.8d, if that makes any difference.

Peter
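For readers unfamiliar with the pattern Peter describes, a loop that checks for EINTR generally has the shape below. This is a generic POSIX sketch in C, not the code at the pqcomm.c line numbers cited above, and write_fully is a name invented for this example. The point is that only an interrupted call is retried; any other error, EPIPE included, ends the loop at once.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Write all `len` bytes of `buf` to `fd`.
 * Retries only when write() was interrupted by a signal (EINTR);
 * any other error, such as EPIPE, ends the loop immediately.
 * Returns 0 on success, -1 on failure with errno set by write().
 * Hypothetical helper for illustration only.
 */
int write_fully(int fd, const char *buf, size_t len)
{
    assert(buf != NULL);

    while (len > 0)
    {
        ssize_t n = write(fd, buf, len);

        if (n < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted: safe to retry */
            return -1;          /* hard failure: do not spin */
        }
        buf += n;               /* skip past what was written */
        len -= (size_t) n;
    }
    return 0;
}
```

A loop like this is harmless as long as the underlying call reports hard errors in a way the loop recognizes. Trouble starts if a layer in between turns a permanent failure into something that looks retryable, or logs on every attempt.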