Thread: Win32 hard crash problem
Hello,

Dave Cramer and I have dealt with a company today running 8.1.4 on Windows 2003. The application is a web app that runs via JDBC/Hibernate.

The application will function perfectly for about 2-3 weeks and then we will receive a:

"server sent data (\"D\" message) without prior row description (\"T\" message)");

(not escaped of course).

Subsequent connections to the database will fail (such as pgAdmin) and Windows must be completely rebooted.

I did ask if they were able to kill the process via the Task Manager. Instead they opt to use the service options, and when that fails (which is always) they reboot the machine entirely.

PostgreSQL will also not recover on its own (e.g., auto restart and roll through the logs).

The good news is that on reboot the problem goes away for another 2-3 weeks.

I have verified that they are doing all requisite routine maintenance. I currently have the customer running hardware checks to verify validity of the hardware but...

Any thoughts?

Sincerely,

Joshua D. Drake

--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/
"Joshua D. Drake" <jd@commandprompt.com> writes: > Dave Cramer and I have dealt with a company today running 8.1.4 on > Windows 2003. The application is a web app that runs via JDBC/Hibernate. > The application will function perfectly for about 2/3 weeks and then we > will receive a: > "server sent data (\"D\" message) without prior row description (\"T\" > message)"); That sounds suspiciously close to the time from boot to wraparound of GetTickCount: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/sysinfo/base/gettickcount.asp M$ list this as 49 days but that's the time to wrap clear around to zero; the value overflows and goes negative in 24.85 days if I've done the math correctly. My bet is something depending on GetTickCount to measure elapsed time (and no, it's not used in the core Postgres code, but you've got plenty of other possible culprits in that stack). BTW, are you sure this is coming from JDBC? I see the exact same message text in libpq:libpq_gettext("server sent data (\"D\" message) without prior row description (\"T\" message)\n")); Maybe the JDBC driver uses the identical message wording but my thought is to look for something going through libpq. > Any thoughts? I suppose "get a real operating system" won't go over well? regards, tom lane
Tom Lane wrote:
> BTW, are you sure this is coming from JDBC? I see the exact same message text in libpq:
> libpq_gettext("server sent data (\"D\" message) without prior row description (\"T\" message)\n"));
> Maybe the JDBC driver uses the identical message wording but my thought is to look for something going through libpq.

The error is server side. I was just describing the environment.

>> Any thoughts?
>
> I suppose "get a real operating system" won't go over well?

Tried that, I got nervous laughter on the other end ;)

Joshua D. Drake
"Joshua D. Drake" <jd@commandprompt.com> writes: > Tom Lane wrote: >> BTW, are you sure this is coming from JDBC? I see the exact same >> message text in libpq: >> libpq_gettext("server sent data (\"D\" message) without prior row description (\"T\" message)\n")); >> Maybe the JDBC driver uses the identical message wording but my thought >> is to look for something going through libpq. > The error is server side. I was just describing the environment. I can entirely assure you that that error message is not present in the server code. regards, tom lane
Tom Lane wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
>> The error is server side. I was just describing the environment.
>
> I can entirely assure you that that error message is not present in the server code.

OK, let me be more clear. The message is being thrown via PostgreSQL. I am getting it per the message I posted:

http://projects.commandprompt.com/public/pgsql/browser/trunk/pgsql/src/interfaces/libpq/fe-protocol2.c?rev=22194
http://projects.commandprompt.com/public/pgsql/browser/trunk/pgsql/src/interfaces/libpq/fe-protocol3.c?rev=25989

It is libpq and the protocol, not the backend, that is giving me the message. When I said server, I was referring to PostgreSQL inclusively, not the driver that was actually connecting.

Sincerely,

Joshua D. Drake
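To make the error text concrete, here is a simplified sketch of the kind of client-side ordering check that produces it. This is not libpq's code (libpq is written in C and tracks a pending result object); the function and exception names are hypothetical. The only facts taken as given are the protocol 3.0 message type bytes: 'T' RowDescription, 'D' DataRow, 'C' CommandComplete, 'Z' ReadyForQuery.

```python
class ProtocolError(Exception):
    """Raised when backend messages arrive in an unexpected order."""


def count_data_rows(message_types):
    """Walk a sequence of backend message type bytes and count data rows.

    A 'D' (DataRow) is only valid after a 'T' (RowDescription) has
    opened a result set; 'C' and 'Z' close the current result set.
    """
    have_row_description = False
    rows = 0
    for msg in message_types:
        if msg == "T":
            have_row_description = True
        elif msg == "D":
            if not have_row_description:
                raise ProtocolError(
                    'server sent data ("D" message) without prior '
                    'row description ("T" message)'
                )
            rows += 1
        elif msg in ("C", "Z"):
            # Result set finished; the next one needs its own 'T'.
            have_row_description = False
    return rows


print(count_data_rows("TDDCZ"))  # well-formed stream with two rows
try:
    count_data_rows("DCZ")  # data row with no preceding description
except ProtocolError as exc:
    print("error:", exc)
```

The point of the sketch is that the message is raised by whatever is reading the byte stream, so a client that sees it has lost sync with what the backend (or something in between) actually sent.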
On 31-Aug-06, at 6:01 PM, Tom Lane wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
>> The error is server side. I was just describing the environment.
>
> I can entirely assure you that that error message is not present in the server code.

Well that's even more interesting because it doesn't exist in the JDBC driver either.

Dave
Dave Cramer wrote:
> On 31-Aug-06, at 6:01 PM, Tom Lane wrote:
>> I can entirely assure you that that error message is not present in the server code.
>
> Well that's even more interesting because it doesn't exist in the jdbc driver either.

Conclusion: they are using libpq in some form, so you should investigate that.

Is there a way to alter the tick counter, so that a test run does not need to take the full 3 weeks?

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
> That sounds suspiciously close to the time from boot to wraparound of GetTickCount:
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/sysinfo/base/gettickcount.asp
> M$ list this as 49 days but that's the time to wrap clear around to zero; the value overflows and goes negative in 24.85 days if I've done the math correctly.
>
> My bet is something depending on GetTickCount to measure elapsed time (and no, it's not used in the core Postgres code, but you've got plenty of other possible culprits in that stack).

This doesn't quite make sense. The only reason we have to reboot is because PostgreSQL no longer responds. The system itself is fine.

Sincerely,

Joshua D. Drake
"Joshua D. Drake" <jd@commandprompt.com> writes: >> My bet is something depending on GetTickCount to measure elapsed time >> (and no, it's not used in the core Postgres code, but you've got plenty >> of other possible culprits in that stack). > This doesn't quite make sense. The only reason we have to reboot is > because PostgreSQL no longer responds. The system itself is fine. The Windows kernel may still work, but that doesn't mean that everything Postgres depends on still works. I'm wondering about (a) the TCP stack (and that includes 3rd party firewalls and such, not only the core Windows code); (b) timing or threading stuff inside the application that's using libpq, which the only thing we know about so far is that it's *not* JDBC/Hibernate. regards, tom lane
Tom Lane wrote:
> The Windows kernel may still work, but that doesn't mean that everything Postgres depends on still works. I'm wondering about (a) the TCP stack (and that includes 3rd party firewalls and such, not only the core Windows code); (b) timing or threading stuff inside the application that's using libpq, which the only thing we know about so far is that it's *not* JDBC/Hibernate.

/me grumbles in a not so polite way about Windows.

Which means we need to start stripping it down. Gah, I actually argued *for* this port too. Next time slap me.

Joshua D. Drake
Alvaro Herrera wrote:
> Conclusion: they are using libpq in some form, so you should investigate that.
>
> Is there a way to alter the tick counter, so that a test run does not need to take the full 3 weeks?

Sure, it is a registry entry... so we could (in theory) shrink that quite a bit. However, I am confused: if we don't use it, what that is connecting to libpq would trigger it?

I know they are using pgAdmin...

Joshua D. Drake
"Joshua D. Drake" <jd@commandprompt.com> writes: > Which means we need to start stripping it down. Gah, I actually argued > *for* this port to. Next time slap me. Well, before you invest a lot of time barking up what might be the wrong tree, there is a very easy test you can use to check the GetTickCount theory: keep closer track of time-since-boot on the affected systems. If that idea is right, it won't be "two or three weeks" between boot and problems appearing, it'll be 24.85 days on the nose. It shouldn't take much except waiting to either falsify the theory or make it look pretty convincing. regards, tom lane
On 31/8/06 23:34, "Joshua D. Drake" <jd@commandprompt.com> wrote:
> Sure it is a registry entry... so we could (in theory) shrink that quite a bit.. However I am confused, if we don't use it, what that is connecting to libpq would trigger it?
>
> I know they are using pgAdmin...

Are they using pgAgent? That's the only part of pgAdmin that does any sort of timing that I can think of offhand (other than the query tool timer, which only runs whilst a query is running). Even then it's done indirectly through wxWidgets, so I'm not familiar with how it's implemented at the win32 API level.

If it were pgAdmin (or any other client) though, how would that lock up the entire PostgreSQL instance, but not the rest of the server?

Regards, Dave.
> The Windows kernel may still work, but that doesn't mean that everything Postgres depends on still works. I'm wondering about (a) the TCP stack (and that includes 3rd party firewalls and such, not only the core Windows code); (b) timing or threading stuff inside the application that's using libpq, which the only thing we know about so far is that it's *not* JDBC/Hibernate.

How about getting a simple backtrace from a couple of the stuck postgres processes? And from the postmaster, which should be accepting new connections... Or does that also hang completely?

How to get one? Well, since we don't have the MSVC build yet (yeah, yeah, eventually), you can only get a semi-backtrace that only looks at exported symbols. You can get this using Process Explorer (thread tab, click stack), using WinDBG or using Visual Studio (you'll need VS 2005, and you need to check the option for "Load DLL exports" in options->debugging->native).

Oh, btw, if there is a 3rd party firewall on the box, the standard recommendation of uninstalling it definitely sounds like a good plan :-)

//Magnus
Oops, going backwards through the mails it seems :)

> Subsequent connections to the database will fail (such as pgAdmin) and Windows must be completely rebooted.

Fail in what way? Hang, not connect, or get an error msg?

> PostgreSQL will also not recover on its own (e.g; auto restart and roll through the logs).

What do you mean by this? It doesn't start upon reboot? What is needed to make it start?

//Magnus
> The Windows kernel may still work, but that doesn't mean that everything Postgres depends on still works.

It may be a listen socket that is no longer responding, which may be caused by a handle leak. Next time it blocks, look at the handle counts (e.g. with handle.exe from Sysinternals).

You could also look at the handle count now with Task Manager and see if it increases constantly (handle.exe shows you the details).

Andreas
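If the handle counts are written down at regular intervals, the "increases constantly" check can be made mechanical. The sketch below assumes the counts have already been collected by hand (for example from Task Manager); the function name and the sample numbers are hypothetical, and nothing here calls a Windows API.

```python
def looks_like_leak(samples, min_growth=0):
    """Return True if handle counts never drop and grow overall.

    `samples` is a list of handle counts taken at regular intervals,
    oldest first. `min_growth` is the total increase to ignore as noise.
    """
    if len(samples) < 2:
        return False
    never_drops = all(later >= earlier
                      for earlier, later in zip(samples, samples[1:]))
    return never_drops and (samples[-1] - samples[0]) > min_growth


# Hypothetical daily readings for two processes.
steady_climb = [400, 450, 520, 610]
normal_churn = [400, 380, 410, 395]

print("steady climb:", looks_like_leak(steady_climb))
print("normal churn:", looks_like_leak(normal_churn))
```

A count that only ever climbs between restarts is the pattern to watch for; one that rises and falls with load is ordinary behaviour.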
Magnus Hagander wrote:
>> PostgreSQL will also not recover on its own (e.g; auto restart and roll through the logs).
>
> What do you mean by this? It doesn't start upon reboot? What is needed to make it start?

It means that PostgreSQL doesn't recover on its own. On Linux, if a backend crashes, all of PostgreSQL will restart and come back up if it can.

On Win32 it doesn't.

Joshua D. Drake
On 9/5/06, Joshua D. Drake <jd@commandprompt.com> wrote:
> It means that postgresql doesn't recover on its own. On linux if a backend crashes all of PostgreSQL will restart and come back up if it can.
>
> On Win32 it doesn't.

It does for me, or at least it did when I used to work with Windows :). I think it just doesn't restart for this particular type of crash. I had a couple of similarly weird, undetectable Windows problems that I could never quite figure out until I got hired by another company and left that monster behind for good.

merlin
Magnus Hagander wrote:
> Ah, I thought you meant that the database recovery process (that runs after a crash) failed and lost data. But it's not data-loss then, it just took a reboot to fix it?

Right, but "just took a reboot to fix it" isn't very confidence inspiring ;)

> I think we're somehow seeing a complete postmaster hang, where it's either not able to kill off the backends as required, or just not capable of accepting new connections after that. Which makes a stacktrace from the postmaster the most interesting one to look at.

I have asked the customer to also look in the Task Manager and see if there was one particular process that was eating CPU, and whether that process can be killed. If that process can be killed and PostgreSQL comes back clean, then that is a step.

However, debugging this beast is a pain. I take it mingw doesn't have a gdb we can use?

Joshua D. Drake
> It means that postgresql doesn't recover on its own. On linux if a backend crashes all of PostgreSQL will restart and come back up if it can.
>
> On Win32 it doesn't.

Ah, I thought you meant that the database recovery process (that runs after a crash) failed and lost data. But it's not data-loss then, it just took a reboot to fix it?

I think we're somehow seeing a complete postmaster hang, where it's either not able to kill off the backends as required, or just not capable of accepting new connections after that. Which makes a stacktrace from the postmaster the most interesting one to look at.

//Magnus
"Merlin Moncure" <mmoncure@gmail.com> writes: > On 9/5/06, Joshua D. Drake <jd@commandprompt.com> wrote: >> Magnus Hagander wrote: >>> What do you mean by this? It doesn't start upon reboot? What is needed >>> to make it start? >> >> It means that postgresql doesn't recover on its own. On linux if a >> backend crashes all of PostgreSQL will restart and come back up if it can. >> >> On Win32 it doesn't. > it does for me, at least for me when I used to work with windows :). > I think it just doesn't restart for this particular type of crash. As best I can tell, Josh isn't describing a crash at all. Something (possibly in the TCP stack) has locked up, but there's no way for the postmaster to know there's anything wrong, and probably no way for the postmaster to fix it if it did know. Restarting backends certainly isn't going to fix a communication problem. Josh failed to answer the most important question though: >> Subsequent connections to the database will fail (such as pgAdmin) >> and Windows must be completely rebooted. > > Fail in what way. Hang, not connect, or get an error msg? regards, tom lane
> Josh failed to answer the most important question though:

Sorry.

>>> Subsequent connections to the database will fail (such as pgAdmin) and Windows must be completely rebooted.
>> Fail in what way. Hang, not connect, or get an error msg?

Just verified with customer. Once the problem occurs the first time, the customer will continually get the same error message for each subsequent connection attempt:

server sent data ("D" message) without prior row description ("T" message)

Joshua D. Drake
"Joshua D. Drake" <jd@commandprompt.com> writes: >>> Fail in what way. Hang, not connect, or get an error msg? > Just verified with customer. Once the problem occurs the first time, the > customer will continually get the same error message for each subsequent > connection attempt: > server sent data ("D" message) without prior row description ("T" message) During the connection attempt? I don't think libpq can report that message until it tries to do a regular query (might be wrong though). Is the client using some application that's going to issue a query immediately on connecting? It would be useful to turn on log_connections and log_statement (and perhaps crank log_min_messages all the way up to DEBUG5) to see if we can get anything in the postmaster log giving a hint what actually happens here. A TCP sniff of the connection attempt traffic would be pretty useful too. regards, tom lane
Tom Lane wrote:
> During the connection attempt? I don't think libpq can report that message until it tries to do a regular query (might be wrong though). Is the client using some application that's going to issue a query immediately on connecting?

What I've been wondering all along is whether they are using a connection pool.

--
Alvaro Herrera
Tom Lane wrote:
> During the connection attempt? I don't think libpq can report that message until it tries to do a regular query (might be wrong though). Is the client using some application that's going to issue a query immediately on connecting?

Well, windows ;) Customer says that they double click pgAdmin and they get that message. I have informed them on how to increase to debug5 and hopefully we get something from that; of course it will likely be 24.85 days from now ;)

> It would be useful to turn on log_connections and log_statement (and perhaps crank log_min_messages all the way up to DEBUG5) to see if we can get anything in the postmaster log giving a hint what actually happens here. A TCP sniff of the connection attempt traffic would be pretty useful too.

Sincerely,

Joshua D. Drake
Alvaro Herrera wrote:
> What I've been wondering all along is whether they are using a connection pool.

Yes, they are using a connection pool. A Java-based one.

Sincerely,

Joshua D. Drake
On Tue, 5 Sep 2006, Joshua D. Drake wrote:
> Right, but "just took a reboot to fix it" isn't very confidence inspiring ;)

Are you kidding? This is standard procedure for troubleshooting Windows problems :)

--
The world is coming to an end. Please log off.
Alvaro Herrera wrote:
> It's quite possible that it's the connection pool that gets confused, and not PostgreSQL itself. It would be interesting if they change the connection setting when the "hang" next occurs, to point directly to PostgreSQL bypassing the connection pool.

Well, except that when they are connecting with pgAdmin (which wouldn't go through the connection pool) they get the error as well.

Joshua D. Drake
On 5-Sep-06, at 6:05 PM, Joshua D. Drake wrote:
> Yes they are using a connection pool. A java based one.

Since Java has its own protocol implementation, this is totally unrelated to any libpq error messages. While I've not personally used the pool in question (c3p0), my understanding is that it is pretty robust.

Personally, I'm betting on some Windows TCP/IP weirdness here.

Dave
Joshua D. Drake wrote:
> Alvaro Herrera wrote:
>> It's quite possible that it's the connection pool that gets confused, and not PostgreSQL itself. It would be interesting if they change the connection setting when the "hang" next occurs, to point directly to PostgreSQL bypassing the connection pool.
>
> Well except when they are connecting with Pgadmin (which wouldn't go through the connection pool) they get the error as well.

Are you assuming, or did they/you verify that this is indeed the case? I see no reason to assume that pgAdmin can't connect via a pool.

--
Alvaro Herrera
Dave Cramer <pg@fastcrypt.com> writes:
> On 5-Sep-06, at 6:05 PM, Joshua D. Drake wrote:
>> Yes they are using a connection pool. A java based one.
> Since java has its own protocol implementation, this is totally unrelated to any libpq error messages.

Another important point that we've not been given information on: when pgAdmin/libpq starts failing like this, exactly what is happening with the connection pool? Is it still able to issue queries, and if not what happens exactly?

regards, tom lane
Joshua D. Drake wrote:
> Alvaro Herrera wrote:
>> What I've been wondering all along is whether they are using a connection pool.
>
> Yes they are using a connection pool. A java based one.

It's quite possible that it's the connection pool that gets confused, and not PostgreSQL itself. It would be interesting if they change the connection setting when the "hang" next occurs, to point directly to PostgreSQL bypassing the connection pool.

OTOH the connection pool may be the thing with the TickCounter problem.

--
Alvaro Herrera
Alvaro Herrera wrote: > Joshua D. Drake wrote: >> Alvaro Herrera wrote: >>> Joshua D. Drake wrote: >>>> Alvaro Herrera wrote: > >>>>> What I've been wondering all along is whether they are using a >>>>> connection pool. >>>> Yes they are using a connection pool. A java based one. >>> It's quite possible that it's the connection pool that gets confused, >>> and not PostgreSQL itself. It would be interesting if they change the >>> connection setting when the "hang" next occurs, to point directly to >>> PostgreSQL bypassing the connection pool. >> Well except when they are connecting with Pgadmin (which wouldn't go >> through the connection pool) they get the error as well. > > Are you assuming, or did they/you verify that this is indeed the case? > I see no reason to assume that pgAdmin can't connect via a pool. > Verified. They do not connect to the connection pool for pgadmin. Although I would think pgadmin might have problems connecting to a java based pool. If I recall, (I could be cranked) JDBC apps can't use pgpool for example. Sincerely, Joshua D. Drake -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutionssince 1997 http://www.commandprompt.com/
Tom Lane wrote: > Dave Cramer <pg@fastcrypt.com> writes: >> On 5-Sep-06, at 6:05 PM, Joshua D. Drake wrote: >>> Yes they are using a connection pool. A java based one. > >> Since java has it's own protocol implementation, this is totally >> unrelated to any libpq error messages. > > Another important point that we've not been given information on: > when pgAdmin/libpq starts failing like this, exactly what is happening > with the connection pool? Is it still able to issue queries, and > if not what happens exactly? No, when this happens everything stops. The only thing they get back is that message until they reboot the server. The web app (via java/connection pool), pgAdmin both give the same error. Which now that I think about it, seems odd if the message is coming from libpq yes? Sincerely, Joshua D. Drake > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster > -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutionssince 1997 http://www.commandprompt.com/
-----Original Message-----
From: "Joshua D. Drake" <jd@commandprompt.com>
To: "Joshua D. Drake" <jd@commandprompt.com>; "Tom Lane" <tgl@sss.pgh.pa.us>; "Merlin Moncure" <mmoncure@gmail.com>; "Magnus Hagander" <mha@sollentuna.net>; "PostgreSQL-development" <pgsql-hackers@postgresql.org>
Sent: 05/09/06 23:27
Subject: Re: [HACKERS] Win32 hard crash problem

> Well except when they are connecting with Pgadmin (which wouldn't go
> through the connection pool) they get the error as well.

It wouldn't? It's just a 'regular' libpq app. Doesn't say much for the connection pool if it cannot handle a simple libpq connection.

/D
Joshua D. Drake wrote: > Tom Lane wrote: > >Dave Cramer <pg@fastcrypt.com> writes: > >>On 5-Sep-06, at 6:05 PM, Joshua D. Drake wrote: > >>>Yes they are using a connection pool. A java based one. > > > >>Since java has it's own protocol implementation, this is totally > >>unrelated to any libpq error messages. > > > >Another important point that we've not been given information on: > >when pgAdmin/libpq starts failing like this, exactly what is happening > >with the connection pool? Is it still able to issue queries, and > >if not what happens exactly? > > No, when this happens everything stops. The only thing they get back is > that message until they reboot the server. The web app (via > java/connection pool), pgAdmin both give the same error. Actually Dave Cramer told me that if the postmaster was stopped and then restarted, it would start answering fine again. Which would make a lot of sense. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera wrote: > Joshua D. Drake wrote: >> Tom Lane wrote: >>> Dave Cramer <pg@fastcrypt.com> writes: >>>> On 5-Sep-06, at 6:05 PM, Joshua D. Drake wrote: >>>>> Yes they are using a connection pool. A java based one. >>>> Since java has it's own protocol implementation, this is totally >>>> unrelated to any libpq error messages. >>> Another important point that we've not been given information on: >>> when pgAdmin/libpq starts failing like this, exactly what is happening >>> with the connection pool? Is it still able to issue queries, and >>> if not what happens exactly? >> No, when this happens everything stops. The only thing they get back is >> that message until they reboot the server. The web app (via >> java/connection pool), pgAdmin both give the same error. > > Actually Dave Cramer told me that if the postmaster was stopped and then > restarted, it would start answering fine again. Which would make a lot > of sense. I already said that ;). The problem IS NOT that we can't restart the system and get postgresql back. It is that it happens at all. Joshua D. Drake > -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutionssince 1997 http://www.commandprompt.com/
Hello,

O.k. to recap:

OS: Win2k3 SP1
PostgreSQL: 8.1.2
Application Server: JBoss
Connection Pooler: c3p0
JDBC Version: 8.1.404, also verified with 8.0.311

Problem: After 2 to 3 weeks, PostgreSQL will begin issuing the following message:

server sent data ("D" message) without prior row description ("T" message)

This message will present itself if connection attempts are made from the Web Application (Java/JDBC), or locally via pgAdmin. Once the error message is received, all subsequent connection attempts will also result in that same message. We do not know if the error occurs before or after authentication.

The only known resolution is to reboot Windows. Using the service control panel to shut down PostgreSQL will fail once the message is received. It is unknown whether using the Task Manager to kill the processes individually will work.

Sincerely,
Joshua D. Drake

--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/
Joshua D. Drake wrote: > Alvaro Herrera wrote: > >Joshua D. Drake wrote: > >>Tom Lane wrote: > >>>Dave Cramer <pg@fastcrypt.com> writes: > >>>>On 5-Sep-06, at 6:05 PM, Joshua D. Drake wrote: > >>>>>Yes they are using a connection pool. A java based one. > >>>>Since java has it's own protocol implementation, this is totally > >>>>unrelated to any libpq error messages. > >>>Another important point that we've not been given information on: > >>>when pgAdmin/libpq starts failing like this, exactly what is happening > >>>with the connection pool? Is it still able to issue queries, and > >>>if not what happens exactly? > >>No, when this happens everything stops. The only thing they get back is > >>that message until they reboot the server. The web app (via > >>java/connection pool), pgAdmin both give the same error. > > > >Actually Dave Cramer told me that if the postmaster was stopped and then > >restarted, it would start answering fine again. Which would make a lot > >of sense. > > I already said that ;). The problem IS NOT that we can't restart the > system and get postgresql back. It is that it happens at all. It is quite different a bug that can only be fixed by "rebooting the server" (which to me means taking the operating system down and starting it afresh) than one that can be fixed by restarting the PostgreSQL server (_without_ taking the operating system down). I've been reading "reboot" all along -- sorry if I missed an email saying otherwise. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Joshua D. Drake wrote:
> The only known resolution is to reboot Windows. Using the service
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> control panel to shutdown postgresql will fail once the message is
> received. It is unknown if using the task master to individually kill
> processes will work.

This is the part that doesn't match what Dave told me. The bit about failing to shut the postmaster down is the first I've heard of it.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes: > Joshua D. Drake wrote: >> I already said that ;). The problem IS NOT that we can't restart the >> system and get postgresql back. It is that it happens at all. > It is quite different a bug that can only be fixed by "rebooting the > server" (which to me means taking the operating system down and starting > it afresh) than one that can be fixed by restarting the PostgreSQL > server (_without_ taking the operating system down). I've been reading > "reboot" all along -- sorry if I missed an email saying otherwise. It sounds to me like we don't actually know that, because the client doesn't know how to restart the postmaster without rebooting the OS. (Josh says "pg_ctl stop" doesn't work in this state, which is a tad interesting in itself, because that doesn't go through a connection request.) It would be useful to try killing off the postgres processes via task manager and then see if a new postmaster can be started and if things then behave normally, or if a reboot is truly needed. The bottom line here is that all we have so far are client-side observations ("I get this message") and we have no clue what state the postmaster thinks it's in. We really need more information. regards, tom lane
> It sounds to me like we don't actually know that, because the client > doesn't know how to restart the postmaster without rebooting the OS. > (Josh says "pg_ctl stop" doesn't work in this state, which is a tad > interesting in itself, because that doesn't go through a connection > request.) It would be useful to try killing off the postgres processes > via task manager and then see if a new postmaster can be started and if > things then behave normally, or if a reboot is truly needed. Right, and I have asked that the next time this happens that they try and use the task manager to kill the process. > > The bottom line here is that all we have so far are client-side > observations ("I get this message") and we have no clue what state > the postmaster thinks it's in. We really need more information. > Yes, unfortunately there isn't much more to be had for another 2 weeks ;) Sincerely, Joshua D. Drake > regards, tom lane > -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutionssince 1997 http://www.commandprompt.com/
On 5-Sep-06, at 7:00 PM, Joshua D. Drake wrote:
> Tom Lane wrote:
>> Dave Cramer <pg@fastcrypt.com> writes:
>>> On 5-Sep-06, at 6:05 PM, Joshua D. Drake wrote:
>>>> Yes they are using a connection pool. A java based one.
>>> Since java has it's own protocol implementation, this is totally
>>> unrelated to any libpq error messages.
>> Another important point that we've not been given information on:
>> when pgAdmin/libpq starts failing like this, exactly what is
>> happening with the connection pool? Is it still able to issue
>> queries, and if not what happens exactly?
>
> No, when this happens everything stops. The only thing they get
> back is that message until they reboot the server. The web app (via
> java/connection pool), pgAdmin both give the same error.
>
> Which now that I think about it, seems odd if the message is coming
> from libpq yes?

Yes, this is very odd. AFAICS, this message does not exist in the Java driver. So it would be interesting to get the actual logs from the client.
"Joshua D. Drake" <jd@commandprompt.com> writes: > Yes, unfortunately there isn't much more to be had for another 2 weeks ;) I trust they've got the reboot time and they will know exactly how long from reboot to problem? I'm not all that sold on the "GetTickCount overflow" theory, but certainly we ought not be missing a chance to test or disprove it. regards, tom lane
Tom Lane wrote: > "Joshua D. Drake" <jd@commandprompt.com> writes: >> Yes, unfortunately there isn't much more to be had for another 2 weeks ;) > > I trust they've got the reboot time and they will know exactly how long > from reboot to problem? I'm not all that sold on the "GetTickCount > overflow" theory, but certainly we ought not be missing a chance to test > or disprove it. Yes I documented all conversations and disclaimers :) Joshua D. Drake > > regards, tom lane > -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutionssince 1997 http://www.commandprompt.com/
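For comparing the reboot-to-failure interval against the GetTickCount theory, the two thresholds can be worked out directly. GetTickCount returns milliseconds since boot as a 32-bit value, so it wraps to zero after 2^32 ms and would go negative after 2^31 ms if some code treated it as signed. The helper name and the one-day tolerance below are assumptions made for illustration, not anything taken from the thread.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000

# Time until the 32-bit millisecond counter wraps back to zero.
UNSIGNED_WRAP_DAYS = 2**32 / MS_PER_DAY

# Time until the value would turn negative if read as a signed 32-bit int.
SIGNED_OVERFLOW_DAYS = 2**31 / MS_PER_DAY


def could_be_tick_overflow(uptime_days, tolerance_days=1.0):
    """Return True if an observed uptime at failure is near either threshold.

    tolerance_days is a hypothetical slack value, chosen only for this sketch.
    """
    thresholds = (SIGNED_OVERFLOW_DAYS, UNSIGNED_WRAP_DAYS)
    return any(abs(uptime_days - t) <= tolerance_days for t in thresholds)


if __name__ == "__main__":
    print(f"unsigned wrap:   {UNSIGNED_WRAP_DAYS:.2f} days")
    print(f"signed overflow: {SIGNED_OVERFLOW_DAYS:.2f} days")
    for days in (14, 21, 25):
        print(days, could_be_tick_overflow(days))
```

A failure that consistently lands close to one of the two thresholds would support the theory; failures at clearly different uptimes would count against it.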
I'm a bit afraid to engage in this thread, but I've also seen a reproducible case where a libpq client stops working and 'vacuum analyze' helped. It happened on Windows Server 2003 and XP with PostgreSQL 8.1.4. I don't have the client source code, so I can't say more, but the customer's developer said the same behaviour was observed on Linux with 8.1.0 and went away in 8.1.4. They said that this happens only with row statistics enabled. The client inserts some data in a transaction, the backend writes 'COMMIT' to the log, but the client keeps waiting for something, and a 'vacuum analyze' of the whole database in some magic way pushed the process along. I've got their installation CD and will try to investigate this problem. Any suggestions? I'm not familiar with W32 at all.

Oleg

On Tue, 5 Sep 2006, Tom Lane wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
>> Yes, unfortunately there isn't much more to be had for another 2 weeks ;)
>
> I trust they've got the reboot time and they will know exactly how long
> from reboot to problem? I'm not all that sold on the "GetTickCount
> overflow" theory, but certainly we ought not be missing a chance to test
> or disprove it.
>
> regards, tom lane

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
> >>>> Yes they are using a connection pool. A java based one.
> >>> Since java has it's own protocol implementation, this is totally
> >>> unrelated to any libpq error messages.
> >> Another important point that we've not been given information on:
> >> when pgAdmin/libpq starts failing like this, exactly what is
> >> happening with the connection pool? Is it still able to issue
> >> queries, and if not what happens exactly?
> >
> > No, when this happens everything stops. The only thing they get back
> > is that message until they reboot the server. The web app (via
> > java/connection pool), pgAdmin both give the same error.
> >
> > Which now that I think about it, seems odd if the message is coming
> > from libpq yes?
> Yes, this is very odd, AFICS, this message does not exist in the
> java driver. So.... it would be interesting to get the actual logs
> from the client.

Definitely - that error msg showing up in the web app really doesn't make sense. However, are we sure that the error message is *exactly* the same, word for word, or is it possible that it's just "the same in what it says" but with different words? I assume there are screendumps to verify this ;-)

Another point that at least I don't know - what kind of connection pool is it? Is it an external one (like pgpool) to which the java app connects (using FE/BE protocol, emulating a "proper postmaster" but pooling access to the database), or is it running inside the app server (like for example .net connection pooling does, which simply means that when you run the Open() method on the connection object it will pick something off an *internal* pool)?

//Magnus
Magnus Hagander wrote:
> Another point that at least I don't know - what kind of connection pool
> is it? Is it an external one (like pgpool) to which the java app
> connects (using FE/BE protocol, emulating a "proper postmaster" but
> pooling access to the database), or is it running inside the app server
> (like for example .net connection pooling does, which simply means that
> when you run the Open() method on the connection object it will pick
> something off an *internal* pool)?

Googling for c3p0 [1] shows that it is a Java-based connection pool that implements connection pooling using the JDBC API, i.e. it is an *internal* pool running inside the app server's JVM. pgAdmin cannot in any case connect through this pool.

Best Regards
Michael Paesold

[1] http://sourceforge.net/projects/c3p0
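To make the internal/external distinction concrete, here is a minimal sketch of an in-process pool. It is not how c3p0 is implemented; the class names and the fake connection object are assumptions made up for this illustration. The point is only that "opening" a connection hands back an object that already lives inside the application's own process, so nothing outside that process (pgAdmin, for instance) has anything to connect to.

```python
import queue


class FakeConnection:
    """Stand-in for a real database connection (hypothetical, illustration only)."""

    def __init__(self, conn_id):
        self.conn_id = conn_id


class InternalPool:
    """In-process pool: only code running in the same process can borrow from it."""

    def __init__(self, size):
        self._idle = queue.Queue()
        for i in range(size):
            # In a real pool each entry would be a direct connection
            # from this process to the postmaster.
            self._idle.put(FakeConnection(i))

    def open(self):
        # "Opening" just takes an idle connection off the internal queue.
        # Raises queue.Empty if every connection is currently borrowed.
        return self._idle.get_nowait()

    def close(self, conn):
        # "Closing" returns the connection to the pool instead of dropping it.
        self._idle.put(conn)


if __name__ == "__main__":
    pool = InternalPool(size=2)
    conn = pool.open()
    print("borrowed connection", conn.conn_id)
    pool.close(conn)
```

An external pooler such as pgpool would instead listen on a socket and speak the FE/BE protocol, which is what would let an unrelated client connect through it.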
> > server sent data ("D" message) without prior row description ("T" > > message) > > During the connection attempt? I don't think libpq can report that > message until it tries to do a regular query (might be wrong > though). > Is the client using some application that's going to issue a query > immediately on connecting? In the case of pgAdmin, it does. It will set datestyle, load a list of dbs etc. //Magnus
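As a rough illustration of when a client can raise this message at all, the sketch below walks a stream of backend messages in the version 3 wire format (one type byte, then a 4-byte big-endian length that counts itself, then the payload) and complains if a DataRow ('D') shows up before any RowDescription ('T'). This is a toy check written for this thread, not libpq's actual code; the function names are made up and the payloads are left empty rather than being valid message bodies.

```python
import struct


class ProtocolError(Exception):
    pass


def frame(msg_type, payload=b""):
    """Build one backend message: type byte, int32 length (counts itself), payload."""
    return msg_type + struct.pack("!I", len(payload) + 4) + payload


def check_stream(data):
    """Count DataRow messages, enforcing that a 'T' precedes any 'D'."""
    have_row_description = False
    rows = 0
    pos = 0
    while pos < len(data):
        msg_type = data[pos:pos + 1]
        (length,) = struct.unpack("!I", data[pos + 1:pos + 5])
        pos += 1 + length  # skip the type byte plus the length-prefixed body
        if msg_type == b"T":
            have_row_description = True
        elif msg_type == b"D":
            if not have_row_description:
                raise ProtocolError(
                    'server sent data ("D" message) without prior '
                    'row description ("T" message)'
                )
            rows += 1
        elif msg_type == b"Z":
            # ReadyForQuery ends the query cycle; the next result set
            # needs its own RowDescription.
            have_row_description = False
    return rows


if __name__ == "__main__":
    good = frame(b"T") + frame(b"D") + frame(b"D") + frame(b"C") + frame(b"Z", b"I")
    print(check_stream(good))
    try:
        check_stream(frame(b"D"))
    except ProtocolError as exc:
        print("error:", exc)
```

Since the check only makes sense while reading query results, a client that issues queries immediately after connecting (as pgAdmin does) could plausibly surface the error during what looks like a connection attempt.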
On 6-Sep-06, at 3:27 AM, Magnus Hagander wrote: >>>>>> Yes they are using a connection pool. A java based one. >>>>> Since java has it's own protocol implementation, this is >> totally >>>>> unrelated to any libpq error messages. >>>> Another important point that we've not been given information >> on: >>>> when pgAdmin/libpq starts failing like this, exactly what is >>>> happening with the connection pool? Is it still able to issue >>>> queries, and if not what happens exactly? >>> >>> No, when this happens everything stops. The only thing they get >> back >>> is that message until they reboot the server. The web app (via >>> java/connection pool), pgAdmin both give the same error. >>> >>> Which now that I think about it, seems odd if the message is >> coming >>> from libpq yes? >> Yes, this is very odd, AFICS, this message does not exist in the >> java driver. So.... it would be interesting to get the actual logs >> from the client. > > Definitly - that error msg showing up in the web app really doesn't > make > sense. However, are we sure that the error message is *exactly* the > same, word for word, or is it possible that it's just "the same in > what > it says" but with different words? I assume there are screendumps to > verify this ;-) I looked at the code in the jdbc driver and it doesn't even do this check > > > Another point that at least I don't know - what kind of connection > pool > is it? Is it an external one (like pgpool) to which the java app > connects (using FE/BE protocol, emulating a "proper postmaster" but > pooling access to the database), or is it running inside the app > server > (like for example .net connection pooling does, which simply means > that > when you run the Open() method on the connection object it will pick > something off an *internal* pool)? It's an internal pool, and the client has told me off list they have removed it and are using the jdbc driver pool. 
At this point I'm confused as to what they really are using, but as they have contracted Command Prompt to fix this for them, I am no longer in the private loop. Dave > > //Magnus > > > ---------------------------(end of > broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster >
"Joshua D. Drake" <jd@commandprompt.com> writes: > O.k. to recap: > > This message will present itself, if connection attempts are made from the Web > Application (Java/JDBC), or locally via PgAdmin. Once the error message is > received, all subsequent connection attempts will also result in that same > message. We do not know if the error occurs before or after authentication. I think other people have claimed that this message is in libpq and not in JDBC source code which is inconsistent with this description. > The only known resolution is to reboot Windows. Using the service control panel > to shutdown postgresql will fail once the message is received. It is unknown if > using the task master to individually kill processes will work. This contradicts your previous email about restarting the postmaster working. I think you have to sit down and write down *exactly* what sequence of actions cause what results. Describing them in shorthand like "if connection attempts are made" is leading to a lot of speculation instead of systematic deductions. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark wrote: > "Joshua D. Drake" <jd@commandprompt.com> writes: > >> O.k. to recap: >> >> This message will present itself, if connection attempts are made from the Web >> Application (Java/JDBC), or locally via PgAdmin. Once the error message is >> received, all subsequent connection attempts will also result in that same >> message. We do not know if the error occurs before or after authentication. > > I think other people have claimed that this message is in libpq and not in > JDBC source code which is inconsistent with this description. Yes I am fully aware of that. I am only relaying what the customer said. > >> The only known resolution is to reboot Windows. Using the service control panel >> to shutdown postgresql will fail once the message is received. It is unknown if >> using the task master to individually kill processes will work. > > This contradicts your previous email about restarting the postmaster working. No, it doesn't. I never said restarting the postmaster would work. I said rebooting windows, allows postgresql to come back up. Those are entirely different things. Sincerely, Joshua D. Drake -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutionssince 1997 http://www.commandprompt.com/
Joshua D. Drake wrote:
> Gregory Stark wrote:
> >"Joshua D. Drake" <jd@commandprompt.com> writes:
> >>The only known resolution is to reboot Windows. Using the service
> >>control panel to shutdown postgresql will fail once the message is
> >>received. It is unknown if using the task master to individually
> >>kill processes will work.
> >
> >This contradicts your previous email about restarting the postmaster
> >working.
>
> No, it doesn't. I never said restarting the postmaster would work. I
> said rebooting windows, allows postgresql to come back up. Those are
> entirely different things.

Yup. It was me who said that restarting the postmaster solved the problem. That's what Dave Cramer told me. But maybe Dave was not certain about that: he did use the word "reboot", and when I asked for confirmation about whether this was an actual reboot of the machine or just a postmaster "reboot", he said it was the latter. But this may have been a supposition. Sorry for the confusion.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
"Joshua D. Drake" <jd@commandprompt.com> writes: > Yes I am fully aware of that. I am only relaying what the customer said. Yeah sorry, I guess what I sent was pretty obvious to you. I should stop confusing -general with -hackers :) -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Joshua D. Drake wrote:
> Tom Lane wrote:
>> "Joshua D. Drake" <jd@commandprompt.com> writes:
>>> Yes, unfortunately there isn't much more to be had for another 2
>>> weeks ;)
>>
>> I trust they've got the reboot time and they will know exactly how long
>> from reboot to problem? I'm not all that sold on the "GetTickCount
>> overflow" theory, but certainly we ought not be missing a chance to test
>> or disprove it.
>
> Yes I documented all conversations and disclaimers :)

O.k. further on this.. the crashing is happening quickly now but not predictably. (as in sometimes a week sometimes 2 days). I just now got them to send some further logs... Interestingly:

2006-09-28 16:38:37.406 LOG: could not send data to client: An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full.

That log entry is the last (of consequence) entry before the machine says:

2006-09-28 16:40:36.921 LOG: received fast shutdown request
2006-09-28 16:40:36.921 LOG: aborting any active transactions
2006-09-28 16:40:36.921 FATAL: terminating connection due to administrator command

On the ERROR side of things I have a bunch of standard, unique key violations etc... AND:

postgresql-2006-09-27_000000.log:2006-09-27 23:49:57.671 FATAL: could not read from statistics collector pipe: No error

I have requested a clean run with entire log at DEBUG2. Hopefully that will give us more info.

Sincerely,
Joshua D. Drake

--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/
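When lining up log entries like these, it helps to compute the gaps between them rather than eyeball the timestamps. The sketch below does that for the socket-buffer error and the shutdown request quoted in the message above. The parsing assumes the timestamp is the first 23 characters of each line in "YYYY-MM-DD HH:MM:SS.mmm" form, which matches these excerpts but depends on the server's log_line_prefix setting; the function names are made up for this example.

```python
from datetime import datetime

# Two entries copied from the log excerpt in the thread.
SOCKET_ERROR = (
    "2006-09-28 16:38:37.406 LOG: could not send data to client: "
    "An operation on a socket could not be performed because the system "
    "lacked sufficient buffer space or because a queue was full."
)
SHUTDOWN_REQUEST = "2006-09-28 16:40:36.921 LOG: received fast shutdown request"


def parse_timestamp(line):
    """Parse the leading timestamp (assumed to be the first 23 characters)."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")


def gap_seconds(earlier_line, later_line):
    """Seconds elapsed between two log lines."""
    delta = parse_timestamp(later_line) - parse_timestamp(earlier_line)
    return delta.total_seconds()


if __name__ == "__main__":
    gap = gap_seconds(SOCKET_ERROR, SHUTDOWN_REQUEST)
    print(f"{gap:.3f} seconds between socket error and shutdown request")
```

The same helper can be pointed at any pair of lines from a full DEBUG2 log once one is available.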
"Joshua D. Drake" <jd@commandprompt.com> writes: > O.k. further on this.. the crashing is happening quickly now but not > predictably. (as in sometimes a week sometimes 2 days). OK, that seems to eliminate the GetTickCount-overflow theory anyway. > That log entry is the last (of consequence) entry before the machine says: > 2006-09-28 16:40:36.921 LOG: received fast shutdown request Oh? That's pretty interesting on a Windows machine, because AFAIK there wouldn't be any standard mechanism that might tie into our homegrown signal facility. Anyone have a theory on what might trigger a SIGINT to the postmaster, other than intentional pg_ctl invocation? regards, tom lane
Tom Lane wrote: > "Joshua D. Drake" <jd@commandprompt.com> writes: >> O.k. further on this.. the crashing is happening quickly now but not >> predictably. (as in sometimes a week sometimes 2 days). > > OK, that seems to eliminate the GetTickCount-overflow theory anyway. > >> That log entry is the last (of consequence) entry before the machine says: >> 2006-09-28 16:40:36.921 LOG: received fast shutdown request > > Oh? That's pretty interesting on a Windows machine, because AFAIK there > wouldn't be any standard mechanism that might tie into our homegrown > signal facility. Anyone have a theory on what might trigger a SIGINT > to the postmaster, other than intentional pg_ctl invocation? Well the other option would be a windows restart. On windows would that send a SIGINT to the backend? Joshua D. Drake > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings > -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutionssince 1997 http://www.commandprompt.com/
> > That log entry is the last (of consequence) entry before > the machine says: > > 2006-09-28 16:40:36.921 LOG: received fast shutdown request > > Oh? That's pretty interesting on a Windows machine, because > AFAIK there wouldn't be any standard mechanism that might tie > into our homegrown signal facility. Anyone have a theory on > what might trigger a SIGINT to the postmaster, other than > intentional pg_ctl invocation? pg_ctl will send SIGINT to the postmaster when the service is stopped, or when windows is shutting down. Do you get anything about the postgresql service in the eventlog within say a minute of this happening? (before or after) Could it be a backend or the postmaster trying to send a signal to a different backend, that for some reason sends it to the wrong process? //Magnus
Magnus Hagander wrote: >>> That log entry is the last (of consequence) entry before >> the machine says: >>> 2006-09-28 16:40:36.921 LOG: received fast shutdown request >> Oh? That's pretty interesting on a Windows machine, because >> AFAIK there wouldn't be any standard mechanism that might tie >> into our homegrown signal facility. Anyone have a theory on >> what might trigger a SIGINT to the postmaster, other than >> intentional pg_ctl invocation? > > pg_ctl will send SIGINT to the postmaster when the service is stopped, > or when windows is shutting down. O.k. that pretty much confirms my suspicion then. The SIGINT likely came from the user rebooting windows. > > Do you get anything about the postgresql service in the eventlog within > say a minute of this happening? (before or after) Too late to say now :( I will have to follow up with them. Sincerely, Joshua D. Drake -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutionssince 1997 http://www.commandprompt.com/
IIRC there is no real SIGINT on Windows, so it can only come from a postgres program. The windows shutdown could be calling pg_ctl to stop the service, of course. cheers andrew Joshua D. Drake wrote: > Magnus Hagander wrote: >>>> That log entry is the last (of consequence) entry before >>> the machine says: >>>> 2006-09-28 16:40:36.921 LOG: received fast shutdown request >>> Oh? That's pretty interesting on a Windows machine, because >>> AFAIK there wouldn't be any standard mechanism that might tie >>> into our homegrown signal facility. Anyone have a theory on >>> what might trigger a SIGINT to the postmaster, other than >>> intentional pg_ctl invocation? >> >> pg_ctl will send SIGINT to the postmaster when the service is stopped, >> or when windows is shutting down. > > O.k. that pretty much confirms my suspicion then. The SIGINT likely came > from the user rebooting windows. > >> >> Do you get anything about the postgresql service in the eventlog within >> say a minute of this happening? (before or after) > > Too late to say now :( I will have to follow up with them. >
> IIRC there is no real SIGINT on Windows, so it can only come > from a postgres program. The windows shutdown could be > calling pg_ctl to stop the service, of course. Well, not quite that, but it will send a service command to the running pg_ctl (which is our "service supervisor"), which *will* respond with a SIGINT to the postmaster. //Magnus
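For reference, the postmaster's shutdown modes map onto signals as follows: SIGTERM requests a smart shutdown, SIGINT a fast shutdown, and SIGQUIT an immediate shutdown. The sketch below uses that mapping to work backwards from a log line to the signal the postmaster must have received. The lookup function is an illustration written for this thread, and on Windows the "signal" is delivered through PostgreSQL's own emulation rather than by the OS, as discussed above.

```python
# Signal -> shutdown mode, as documented for the postmaster.
SHUTDOWN_MODES = {
    "SIGTERM": "smart",
    "SIGINT": "fast",
    "SIGQUIT": "immediate",
}


def signal_for_log_line(line):
    """Return the signal name implied by a 'received ... shutdown request' log line.

    Returns None if the line is not a shutdown-request message.
    """
    for signal_name, mode in SHUTDOWN_MODES.items():
        if f"received {mode} shutdown request" in line:
            return signal_name
    return None


if __name__ == "__main__":
    line = "2006-09-28 16:40:36.921 LOG: received fast shutdown request"
    print(signal_for_log_line(line))
```

So the "received fast shutdown request" entry is what one would expect from the service-stop path described here: a service command reaches the running pg_ctl, which then sends SIGINT to the postmaster.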