Thread: Win32 hard crash problem

Win32 hard crash problem

From

"Joshua D. Drake"

Date:

31 August 2006, 17:47:57

Hello,

Dave Cramer and I have dealt with a company today running 8.1.4 on 
Windows 2003. The application is a web app that runs via JDBC/Hibernate.

The application will function perfectly for about two to three weeks and then we 
will receive a:

"server sent data (\"D\" message) without prior row description (\"T\" 
message)");

(not escaped of course).

Subsequent connections to the database will fail (such as pgAdmin) and 
Windows must be completely rebooted. I did ask if they were able to kill 
the process via the task manager. Instead they opt to use the service 
options and when that fails (which is always) they reboot the machine 
entirely.

PostgreSQL will also not recover on its own (e.g., auto-restart and roll 
through the logs).

The good news is that on reboot the problem goes away for two to three weeks. 
I have verified that they are doing all requisite routine maintenance.

I currently have the customer running hardware checks to verify validity 
of the hardware but...

Any thoughts?

Sincerely,


Joshua D. Drake



-- 
   === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
   Providing the most comprehensive PostgreSQL solutions since 1997
             http://www.commandprompt.com/

Re: Win32 hard crash problem

From

Tom Lane

Date:

31 August 2006, 18:51:17

"Joshua D. Drake" <jd@commandprompt.com> writes:
> Dave Cramer and I have dealt with a company today running 8.1.4 on 
> Windows 2003. The application is a web app that runs via JDBC/Hibernate.
> The application will function perfectly for about 2/3 weeks and then we 
> will receive a:
> "server sent data (\"D\" message) without prior row description (\"T\" 
> message)");

That sounds suspiciously close to the time from boot to wraparound of
GetTickCount:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/sysinfo/base/gettickcount.asp
M$ list this as 49 days but that's the time to wrap clear around to
zero; the value overflows and goes negative in 24.85 days if I've
done the math correctly.
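
As a quick check of that arithmetic, here is a minimal sketch. Python is used purely for illustration, and the only assumption is the one stated in the message: the tick counter is a 32-bit count of milliseconds since boot.

```python
# The tick counter is assumed to be a 32-bit count of milliseconds since boot.
MS_PER_DAY = 24 * 60 * 60 * 1000

# Read as a signed 32-bit integer, the value turns negative after 2**31 ms.
days_until_negative = 2**31 / MS_PER_DAY

# Read as an unsigned 32-bit integer, it wraps back to zero after 2**32 ms.
days_until_zero = 2**32 / MS_PER_DAY

print(f"goes negative after {days_until_negative:.3f} days")
print(f"wraps to zero after {days_until_zero:.3f} days")
```

Both figures line up with the message: a little under 25 days for the sign flip and roughly 49 days for the full wrap.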

My bet is something depending on GetTickCount to measure elapsed time
(and no, it's not used in the core Postgres code, but you've got plenty
of other possible culprits in that stack).
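
To make the suspected failure mode concrete, the sketch below shows one hypothetical way a timeout check can break at the sign flip. It is not taken from any component in this stack; the function names and the scenario are assumptions for illustration only.

```python
def to_int32(value):
    """Interpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value >= 0x8000_0000 else value


def naive_timed_out(start_tick, now_tick, timeout_ms):
    """Fragile pattern: ticks held as signed 32-bit values and compared directly."""
    deadline = to_int32(start_tick + timeout_ms)
    return to_int32(now_tick) >= deadline


def wrap_safe_timed_out(start_tick, now_tick, timeout_ms):
    """Unsigned modular subtraction stays correct across a single wrap."""
    elapsed = (now_tick - start_tick) & 0xFFFFFFFF
    return elapsed >= timeout_ms


SIGN_FLIP = 2**31
start = SIGN_FLIP - 2000   # timer armed 2 s before the sign flip
now = SIGN_FLIP + 3000     # checked 5 s later, after the flip
timeout = 1000             # 1 s timeout, which has long since expired

print(naive_timed_out(start, now, timeout))      # signed comparison
print(wrap_safe_timed_out(start, now, timeout))  # modular comparison
```

In the naive version the current tick has gone negative while the deadline is still a large positive number, so the timer does not fire; code waiting on such a timer would appear to hang.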

BTW, are you sure this is coming from JDBC?  I see the exact same
message text in libpq:
    libpq_gettext("server sent data (\"D\" message) without prior row description (\"T\" message)\n"));
Maybe the JDBC driver uses the identical message wording but my thought
is to look for something going through libpq.

> Any thoughts?

I suppose "get a real operating system" won't go over well?
        regards, tom lane

Re: Win32 hard crash problem

From

"Joshua D. Drake"

Date:

31 August 2006, 18:59:25

Tom Lane wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
>> Dave Cramer and I have dealt with a company today running 8.1.4 on 
>> Windows 2003. The application is a web app that runs via JDBC/Hibernate.
>> The application will function perfectly for about 2/3 weeks and then we 
>> will receive a:
>> "server sent data (\"D\" message) without prior row description (\"T\" 
>> message)");
> 
> That sounds suspiciously close to the time from boot to wraparound of
> GetTickCount:
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/sysinfo/base/gettickcount.asp
> M$ list this as 49 days but that's the time to wrap clear around to
> zero; the value overflows and goes negative in 24.85 days if I've
> done the math correctly.
> 
> My bet is something depending on GetTickCount to measure elapsed time
> (and no, it's not used in the core Postgres code, but you've got plenty
> of other possible culprits in that stack).
> 
> BTW, are you sure this is coming from JDBC?  I see the exact same
> message text in libpq:
>  libpq_gettext("server sent data (\"D\" message) without prior row description (\"T\" message)\n"));
> Maybe the JDBC driver uses the identical message wording but my thought
> is to look for something going through libpq.

The error is server side. I was just describing the environment.

> 
>> Any thoughts?
> 
> I suppose "get a real operating system" won't go over well?

Tried that, I got nervous laughter on the other end ;)

Joshua D. Drake

> 
>             regards, tom lane
> 



Re: Win32 hard crash problem

From

Tom Lane

Date:

31 August 2006, 19:02:11

"Joshua D. Drake" <jd@commandprompt.com> writes:
> Tom Lane wrote:
>> BTW, are you sure this is coming from JDBC?  I see the exact same
>> message text in libpq:
>> libpq_gettext("server sent data (\"D\" message) without prior row description (\"T\" message)\n"));
>> Maybe the JDBC driver uses the identical message wording but my thought
>> is to look for something going through libpq.

> The error is server side. I was just describing the environment.

I can entirely assure you that that error message is not present in the
server code.
        regards, tom lane

Re: Win32 hard crash problem

From

"Joshua D. Drake"

Date:

31 August 2006, 19:10:49

Tom Lane wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
>> Tom Lane wrote:
>>> BTW, are you sure this is coming from JDBC?  I see the exact same
>>> message text in libpq:
>>> libpq_gettext("server sent data (\"D\" message) without prior row description (\"T\" message)\n"));
>>> Maybe the JDBC driver uses the identical message wording but my thought
>>> is to look for something going through libpq.
> 
>> The error is server side. I was just describing the environment.
> 
> I can entirely assure you that that error message is not present in the
> server code.

OK, let me be more clear. The message is being thrown via PostgreSQL. I am 
getting it exactly as in the message I posted:

http://projects.commandprompt.com/public/pgsql/browser/trunk/pgsql/src/interfaces/libpq/fe-protocol2.c?rev=22194
http://projects.commandprompt.com/public/pgsql/browser/trunk/pgsql/src/interfaces/libpq/fe-protocol3.c?rev=25989

It is libpq and the protocol, not the backend, that is giving me the 
message. When I said server, I was referring to PostgreSQL inclusively, 
not the driver that was actually connecting.

Sincerely,

Joshua D. Drake



> 
>             regards, tom lane
> 



Re: Win32 hard crash problem

From

Dave Cramer

Date:

31 August 2006, 19:19:21

On 31-Aug-06, at 6:01 PM, Tom Lane wrote:

> "Joshua D. Drake" <jd@commandprompt.com> writes:
>> Tom Lane wrote:
>>> BTW, are you sure this is coming from JDBC?  I see the exact same
>>> message text in libpq:
>>> libpq_gettext("server sent data (\"D\" message) without prior row  
>>> description (\"T\" message)\n"));
>>> Maybe the JDBC driver uses the identical message wording but my  
>>> thought
>>> is to look for something going through libpq.
>
>> The error is server side. I was just describing the environment.
>
> I can entirely assure you that that error message is not present in  
> the
> server code.
Well that's even more interesting because it doesn't exist in the  
jdbc driver either.

Dave
>
>             regards, tom lane
>
> ---------------------------(end of  
> broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>        subscribe-nomail command to majordomo@postgresql.org so that  
> your
>        message can get through to the mailing list cleanly
>

Re: Win32 hard crash problem

From

Alvaro Herrera

Date:

31 August 2006, 19:29:47

Dave Cramer wrote:
> 
> On 31-Aug-06, at 6:01 PM, Tom Lane wrote:
> 
> >"Joshua D. Drake" <jd@commandprompt.com> writes:
> >>Tom Lane wrote:
> >>>BTW, are you sure this is coming from JDBC?  I see the exact same
> >>>message text in libpq:
> >>>libpq_gettext("server sent data (\"D\" message) without prior row  
> >>>description (\"T\" message)\n"));
> >>>Maybe the JDBC driver uses the identical message wording but my
> >>>thought is to look for something going through libpq.
> >
> >>The error is server side. I was just describing the environment.
> >
> >I can entirely assure you that that error message is not present in
> >the server code.
> Well that's even more interesting because it doesn't exist in the  
> jdbc driver either.

Conclusion: they are using libpq in some form, so you should investigate
that.

Is there a way to alter the tick counter, so that a test run does not
need to take the full 3 weeks?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Win32 hard crash problem

From

"Joshua D. Drake"

Date:

31 August 2006, 19:29:53

> 
> That sounds suspiciously close to the time from boot to wraparound of
> GetTickCount:
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/sysinfo/base/gettickcount.asp
> M$ list this as 49 days but that's the time to wrap clear around to
> zero; the value overflows and goes negative in 24.85 days if I've
> done the math correctly.
> 
> My bet is something depending on GetTickCount to measure elapsed time
> (and no, it's not used in the core Postgres code, but you've got plenty
> of other possible culprits in that stack).

This doesn't quite make sense. The only reason we have to reboot is 
because PostgreSQL no longer responds. The system itself is fine.

Sincerely,

Joshua D. Drake



Re: Win32 hard crash problem

From

Tom Lane

Date:

31 August 2006, 19:39:46

"Joshua D. Drake" <jd@commandprompt.com> writes:
>> My bet is something depending on GetTickCount to measure elapsed time
>> (and no, it's not used in the core Postgres code, but you've got plenty
>> of other possible culprits in that stack).

> This doesn't quite make sense. The only reason we have to reboot is 
> because PostgreSQL no longer responds. The system itself is fine.

The Windows kernel may still work, but that doesn't mean that everything
Postgres depends on still works.  I'm wondering about (a) the TCP stack
(and that includes 3rd party firewalls and such, not only the core
Windows code); (b) timing or threading stuff inside the application
that's using libpq, which the only thing we know about so far is that
it's *not* JDBC/Hibernate.
        regards, tom lane

Re: Win32 hard crash problem

From

"Joshua D. Drake"

Date:

31 August 2006, 19:47:33

Tom Lane wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
>>> My bet is something depending on GetTickCount to measure elapsed time
>>> (and no, it's not used in the core Postgres code, but you've got plenty
>>> of other possible culprits in that stack).
> 
>> This doesn't quite make sense. The only reason we have to reboot is 
>> because PostgreSQL no longer responds. The system itself is fine.
> 
> The Windows kernel may still work, but that doesn't mean that everything
> Postgres depends on still works.  I'm wondering about (a) the TCP stack
> (and that includes 3rd party firewalls and such, not only the core
> Windows code); (b) timing or threading stuff inside the application
> that's using libpq, which the only thing we know about so far is that
> it's *not* JDBC/Hibernate.

/me grumbles in a not so polite way about Windows.

Which means we need to start stripping it down. Gah, I actually argued 
*for* this port too. Next time slap me.

Joshua D. Drake


> 
>             regards, tom lane
> 



Re: Win32 hard crash problem

From

"Joshua D. Drake"

Date:

31 August 2006, 19:56:06

Alvaro Herrera wrote:
> Dave Cramer wrote:
>> On 31-Aug-06, at 6:01 PM, Tom Lane wrote:
>>
>>> "Joshua D. Drake" <jd@commandprompt.com> writes:
>>>> Tom Lane wrote:
>>>>> BTW, are you sure this is coming from JDBC?  I see the exact same
>>>>> message text in libpq:
>>>>> libpq_gettext("server sent data (\"D\" message) without prior row  
>>>>> description (\"T\" message)\n"));
>>>>> Maybe the JDBC driver uses the identical message wording but my
>>>>> thought is to look for something going through libpq.
>>>> The error is server side. I was just describing the environment.
>>> I can entirely assure you that that error message is not present in
>>> the server code.
>> Well that's even more interesting because it doesn't exist in the  
>> jdbc driver either.
> 
> Conclusion: they are using libpq in some form, so you should investigate
> that.
> 
> Is there a way to alter the tick counter, so that a test run does not
> need to take the full 3 weeks?
> 

Sure, it is a registry entry... so we could (in theory) shrink that quite 
a bit. However, I am confused: if PostgreSQL doesn't use GetTickCount, what 
that connects through libpq would trigger it?

I know they are using pgAdmin...

Joshua D. Drake



Re: Win32 hard crash problem

From

Tom Lane

Date:

31 August 2006, 19:56:25

"Joshua D. Drake" <jd@commandprompt.com> writes:
> Which means we need to start stripping it down. Gah, I actually argued 
> *for* this port to. Next time slap me.

Well, before you invest a lot of time barking up what might be the wrong
tree, there is a very easy test you can use to check the GetTickCount
theory: keep closer track of time-since-boot on the affected systems.
If that idea is right, it won't be "two or three weeks" between boot and
problems appearing, it'll be 24.85 days on the nose.  It shouldn't take
much except waiting to either falsify the theory or make it look pretty
convincing.
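
That test can be sketched in a few lines. The helper name, the one-hour tolerance, and the timestamps below are all hypothetical, chosen only to illustrate the comparison; real boot and failure times would come from the affected machine.

```python
from datetime import datetime, timedelta

# Uptime at which a signed 32-bit millisecond counter turns negative.
SIGN_FLIP_UPTIME = timedelta(milliseconds=2**31)


def matches_tick_overflow(boot_time, failure_time, tolerance=timedelta(hours=1)):
    """Return True if the failure happened within `tolerance` of the sign flip."""
    uptime = failure_time - boot_time
    return abs(uptime - SIGN_FLIP_UPTIME) <= tolerance


# Hypothetical timestamps, for illustration only.
boot = datetime(2006, 8, 1, 9, 0, 0)
near_flip = boot + timedelta(days=24, hours=20, minutes=31)
too_early = boot + timedelta(days=17)

print(matches_tick_overflow(boot, near_flip))
print(matches_tick_overflow(boot, too_early))
```

A failure that lands within the tolerance on several consecutive boots would make the theory look convincing; one that lands at 17 days would falsify it.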
        regards, tom lane

Re: Win32 hard crash problem

From

Dave Page

Date:

31 August 2006, 20:21:08

On 31/8/06 23:34, "Joshua D. Drake" <jd@commandprompt.com> wrote:

> Sure it is a registry entry... so we could (in theory) shrink that quite
> a bit.. However I am confused, if we don't use it, what that is
> connecting to libpq would trigger it?
> 
> I know they are using pgAAdmin...

Are they using pgAgent? That's the only part of pgAdmin that does any
sort of timing that I can think of offhand (other than the query tool timer,
which only runs whilst a query is running). Even then it's done indirectly
through wxWidgets, so I'm not familiar with how it's implemented at the win32
API level.

If it were pgAdmin (or any other client) though, how would that lock up the
entire PostgreSQL instance, but not the rest of the server?

Regards, Dave.

Re: Win32 hard crash problem

From

"Magnus Hagander"

Date:

01 September 2006, 05:01:36

> >> My bet is something depending on GetTickCount to measure elapsed
> >> time (and no, it's not used in the core Postgres code, but you've
> >> got plenty of other possible culprits in that stack).
>
> > This doesn't quite make sense. The only reason we have to reboot
> > is because PostgreSQL no longer responds. The system itself is fine.
>
> The Windows kernel may still work, but that doesn't mean that
> everything Postgres depends on still works.  I'm wondering about
> (a) the TCP stack (and that includes 3rd party firewalls and such,
> not only the core Windows code); (b) timing or threading stuff
> inside the application that's using libpq, which the only thing we
> know about so far is that it's *not* JDBC/Hibernate.

How about getting a simple backtrace from a couple of the stuck postgres
processes? And from the postmaster which should be accepting new
connections... Or does that also hang completely?

How to get one? Well, since we don't have the MSVC build yet (yeah,
yeah, eventually), you can only get a semi-backtrace that only looks at
exported symbols. You can get this using process explorer (thread tab,
click stack), using WinDBG or using Visual Studio (you'll need VS 2005,
and you need to check the option for "Load DLL exports" in
options->debugging->native).


Oh, btw, if there is a 3rd party firewall on the box, the standard
recommendation of uninstalling it definitely sounds like a good plan :-)

//Magnus

Re: Win32 hard crash problem

From

"Magnus Hagander"

Date:

01 September 2006, 05:03:40

Oops, going backwards through the mails it seems :)

> Subsequent connections to the database will fail (such as pgAdmin)
> and Windows must be completely rebooted.

Fail in what way? Hang, not connect, or get an error msg?

> PostgreSQL will also not recover on its own (e.g; auto restart and
> roll through the logs).

What do you mean by this? It doesn't start upon reboot? What is needed
to make it start?


//Magnus

Re: Win32 hard crash problem

From

"Zeugswetter Andreas DCP SD"

Date:

01 September 2006, 05:27:30

> >> My bet is something depending on GetTickCount to measure elapsed
> >> time (and no, it's not used in the core Postgres code, but you've
> >> got plenty of other possible culprits in that stack).
>
> > This doesn't quite make sense. The only reason we have to reboot is
> > because PostgreSQL no longer responds. The system itself is fine.
>
> The Windows kernel may still work, but that doesn't mean that
> everything Postgres depends on still works.

It may be a listen socket that is no longer responding. This may be
because of a handle leak. Next time it blocks, look at the handle counts
(e.g. with handle.exe from Sysinternals).

You could also look at the handle count now with Task Manager and see if it
increases constantly. (handle.exe shows you the details.)
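
If the handle counts are written down at regular intervals, the "increases constantly" check can be sketched as below. The function name, the 90% threshold, and the sample numbers are hypothetical and only illustrate the idea; they are not measurements from the affected system.

```python
def grows_steadily(samples, min_fraction=0.9):
    """Return True if nearly every sample is higher than the one before it."""
    if len(samples) < 2:
        return False
    pairs = list(zip(samples, samples[1:]))
    increases = sum(1 for prev, cur in pairs if cur > prev)
    return increases / len(pairs) >= min_fraction


# Hypothetical handle counts sampled at a fixed interval.
leaking = [410, 455, 498, 560, 611, 672]
steady = [410, 402, 415, 408, 411, 405]

print(grows_steadily(leaking))
print(grows_steadily(steady))
```

A count that fluctuates around a stable level is normal; one that climbs at almost every sample is the pattern worth investigating with handle.exe.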

Andreas

Re: Win32 hard crash problem

From

"Joshua D. Drake"

Date:

05 September 2006, 17:57:59

Magnus Hagander wrote:
> Oops, going backwards through the mails it seems :)
> 
>> Subsequent connections to the database will fail (such as pgAdmin)
>> and Windows must be completely rebooted.
> 
> Fail in what way. Hang, not connect, or get an error msg?
> 
>> PostgreSQL will also not recover on its own (e.g; auto restart and
>> roll through the logs).
> 
> What do you mean by this? It doesn't start upon reboot? What is needed
> to make it start?

It means that PostgreSQL doesn't recover on its own. On Linux, if a 
backend crashes, all of PostgreSQL will restart and come back up if it can.

On Win32 it doesn't.

Joshua D. Drake


> 
> 
> //Magnus
> 
> 
> 



Re: Win32 hard crash problem

From

"Merlin Moncure"

Date:

05 September 2006, 18:07:37

On 9/5/06, Joshua D. Drake <jd@commandprompt.com> wrote:
> Magnus Hagander wrote:
> > What do you mean by this? It doesn't start upon reboot? What is needed
> > to make it start?
>
> It means that postgresql doesn't recover on its own. On linux if a
> backend crashes all of PostgreSQL will restart and come back up if it can.
>
> On Win32 it doesn't.

It does for me, or at least it did when I used to work with Windows :).
I think it just doesn't restart for this particular type of crash.  I
had a couple of similarly weird undetectable Windows problems that I
could never quite figure out until I got hired by another company and
left that monster behind for good.

merlin

Re: Win32 hard crash problem

From

"Joshua D. Drake"

Date:

05 September 2006, 18:14:12

Magnus Hagander wrote:
>>>> PostgreSQL will also not recover on its own (e.g; auto restart and 
>>>> roll through the logs).
>>> What do you mean by this? It doesn't start upon reboot? What is
>>> needed to make it start?
>> It means that postgresql doesn't recover on its own. On linux 
>> if a backend crashes all of PostgreSQL will restart and come 
>> back up if it can.
>>
>> On Win32 it doesn't.
> 
> Ah, I thought you meant that the database recovery process (that runs
> after a crash) failed and lost data. But it's not data-loss then, it
> just took a reboot to fix it?

Right, but "just took a reboot to fix it" isn't very confidence inspiring ;)

> 
> I think we're somehow seeing a complete postmaster hang, where it's
> either not able to kill off the backends as required, or just not
> capable of accepting new connections after that. Which makes a
> stacktrace from the postmaster the most interesting one to look at.

I have asked the customer to also look in Task Manager to see whether one 
particular process was eating CPU, and whether that process can be 
killed. If it can be killed and PostgreSQL comes back clean, then that 
is a step.

However, debugging this beast is a pain. I take it mingw doesn't have a 
gdb we can use?

> 
> //Magnus
> 



Re: Win32 hard crash problem

From

"Magnus Hagander"

Date:

05 September 2006, 18:18:14

> >> PostgreSQL will also not recover on its own (e.g; auto restart and
> >> roll through the logs).
> >
> > What do you mean by this? It doesn't start upon reboot? What is
> > needed to make it start?
>
> It means that postgresql doesn't recover on its own. On linux
> if a backend crashes all of PostgreSQL will restart and come
> back up if it can.
>
> On Win32 it doesn't.

Ah, I thought you meant that the database recovery process (that runs
after a crash) failed and lost data. But it's not data-loss then, it
just took a reboot to fix it?

I think we're somehow seeing a complete postmaster hang, where it's
either not able to kill off the backends as required, or just not
capable of accepting new connections after that. Which makes a
stacktrace from the postmaster the most interesting one to look at.

//Magnus

Re: Win32 hard crash problem

From

Tom Lane

Date:

05 September 2006, 18:21:53

"Merlin Moncure" <mmoncure@gmail.com> writes:
> On 9/5/06, Joshua D. Drake <jd@commandprompt.com> wrote:
>> Magnus Hagander wrote:
>>> What do you mean by this? It doesn't start upon reboot? What is needed
>>> to make it start?
>> 
>> It means that postgresql doesn't recover on its own. On linux if a
>> backend crashes all of PostgreSQL will restart and come back up if it can.
>> 
>> On Win32 it doesn't.

> it does for me, at least for me when I used to work with windows :).
> I think it just doesn't restart for this particular type of crash.

As best I can tell, Josh isn't describing a crash at all.  Something
(possibly in the TCP stack) has locked up, but there's no way for the
postmaster to know there's anything wrong, and probably no way for the
postmaster to fix it if it did know.  Restarting backends certainly
isn't going to fix a communication problem.

Josh failed to answer the most important question though:

>> Subsequent connections to the database will fail (such as pgAdmin)
>> and Windows must be completely rebooted.
> 
> Fail in what way. Hang, not connect, or get an error msg?
        regards, tom lane

Re: Win32 hard crash problem

From

"Joshua D. Drake"

Date:

05 September 2006, 18:39:24

> Josh failed to answer the most important question though:

Sorry.

> 
>>> Subsequent connections to the database will fail (such as pgAdmin)
>>> and Windows must be completely rebooted.
>> Fail in what way. Hang, not connect, or get an error msg?

Just verified with customer. Once the problem occurs the first time, the 
customer will continually get the same error message for each subsequent 
connection attempt:

server sent data ("D" message) without prior row description ("T" message)


> 
>             regards, tom lane
> 



Re: Win32 hard crash problem

From

Tom Lane

Date:

05 September 2006, 18:53:13

"Joshua D. Drake" <jd@commandprompt.com> writes:
>>> Fail in what way. Hang, not connect, or get an error msg?

> Just verified with customer. Once the problem occurs the first time, the 
> customer will continually get the same error message for each subsequent 
> connection attempt:

> server sent data ("D" message) without prior row description ("T" message)

During the connection attempt?  I don't think libpq can report that
message until it tries to do a regular query (might be wrong though).
Is the client using some application that's going to issue a query
immediately on connecting?
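
For readers unfamiliar with the message, here is a deliberately simplified model of the check behind it. This is not libpq's actual code; the class and function names are made up, and only the single-letter message types (T = RowDescription, D = DataRow, C = CommandComplete, Z = ReadyForQuery) come from the protocol.

```python
class ProtocolError(Exception):
    """Raised when the message stream violates the expected ordering."""


def read_result(message_types):
    """Consume message type codes for one query; return the number of data rows."""
    have_row_description = False
    rows = 0
    for msg in message_types:
        if msg == "T":      # RowDescription: describes the columns that follow
            have_row_description = True
        elif msg == "D":    # DataRow: only valid after a RowDescription
            if not have_row_description:
                raise ProtocolError(
                    'server sent data ("D" message) '
                    'without prior row description ("T" message)'
                )
            rows += 1
        elif msg == "C":    # CommandComplete: the next result needs its own T
            have_row_description = False
        elif msg == "Z":    # ReadyForQuery: end of this query cycle
            break
    return rows


print(read_result(["T", "D", "D", "C", "Z"]))
```

In this model the error can only appear while the client is reading a query result, which is why seeing it "during the connection attempt" is surprising and suggests a query is issued immediately after connecting.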

It would be useful to turn on log_connections and log_statement (and
perhaps crank log_min_messages all the way up to DEBUG5) to see if we
can get anything in the postmaster log giving a hint what actually
happens here.  A TCP sniff of the connection attempt traffic would be
pretty useful too.
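
A sketch of what those settings might look like in postgresql.conf on an 8.1-era server follows. The value 'all' for log_statement is an assumption (the message only says to turn it on), and DEBUG5 is extremely verbose, so this would normally be a temporary change.

```
# postgresql.conf: temporary diagnostic logging (sketch, values are assumptions)
log_connections = on          # log each successful connection
log_statement = 'all'         # log every statement
log_min_messages = debug5     # most verbose server log level
```

These settings take effect after a configuration reload.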
        regards, tom lane

Re: Win32 hard crash problem

From

Alvaro Herrera

Date:

05 September 2006, 19:01:20

Tom Lane wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
> >>> Fail in what way. Hang, not connect, or get an error msg?
> 
> > Just verified with customer. Once the problem occurs the first time, the 
> > customer will continually get the same error message for each subsequent 
> > connection attempt:
> 
> > server sent data ("D" message) without prior row description ("T" message)
> 
> During the connection attempt?  I don't think libpq can report that
> message until it tries to do a regular query (might be wrong though).
> Is the client using some application that's going to issue a query
> immediately on connecting?

What I've been wondering all along is whether they are using a
connection pool.


Re: Win32 hard crash problem

From

"Joshua D. Drake"

Date:

05 September 2006, 19:05:57

Tom Lane wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
>>>> Fail in what way. Hang, not connect, or get an error msg?
> 
>> Just verified with customer. Once the problem occurs the first time, the 
>> customer will continually get the same error message for each subsequent 
>> connection attempt:
> 
>> server sent data ("D" message) without prior row description ("T" message)
> 
> During the connection attempt?  I don't think libpq can report that
> message until it tries to do a regular query (might be wrong though).
> Is the client using some application that's going to issue a query
> immediately on connecting?

Well, Windows ;) The customer says that they double-click pgAdmin and they 
get that message. I have told them how to raise logging to DEBUG5 and 
hopefully we get something from that; of course it will likely be 24.85 
days from now ;)

> 
> It would be useful to turn on log_connections and log_statement (and
> perhaps crank log_min_messages all the way up to DEBUG5) to see if we
> can get anything in the postmaster log giving a hint what actually
> happens here.  A TCP sniff of the connection attempt traffic would be
> pretty useful too.
> 

Sincerely,

Joshua D. Drake



Re: Win32 hard crash problem

From

"Joshua D. Drake"

Date:

05 September 2006, 19:06:05

Alvaro Herrera wrote:
> Tom Lane wrote:
>> "Joshua D. Drake" <jd@commandprompt.com> writes:
>>>>> Fail in what way. Hang, not connect, or get an error msg?
>>> Just verified with customer. Once the problem occurs the first time, the 
>>> customer will continually get the same error message for each subsequent 
>>> connection attempt:
>>> server sent data ("D" message) without prior row description ("T" message)
>> During the connection attempt?  I don't think libpq can report that
>> message until it tries to do a regular query (might be wrong though).
>> Is the client using some application that's going to issue a query
>> immediately on connecting?
> 
> What I've been wondering all along is whether they are using a
> connection pool.
> 

Yes, they are using a connection pool. A Java-based one.

Sincerely,

Joshua D. Drake



Re: Win32 hard crash problem

From

Jeremy Drake

Date:

05 September 2006, 19:11:54

On Tue, 5 Sep 2006, Joshua D. Drake wrote:

> Right, but "just took a reboot to fix it" isn't very confidence inspiring ;)

Are you kidding?  This is standard procedure for troubleshooting Windows
problems :)

--
The world is coming to an end.  Please log off.

Re: Win32 hard crash problem

From

"Joshua D. Drake"

Date:

05 September 2006, 19:26:57

Alvaro Herrera wrote:
> Joshua D. Drake wrote:
>> Alvaro Herrera wrote:
>>> Tom Lane wrote:
>>>> "Joshua D. Drake" <jd@commandprompt.com> writes:
>>>>>>> Fail in what way. Hang, not connect, or get an error msg?
>>>>> Just verified with customer. Once the problem occurs the first time, the 
>>>>> customer will continually get the same error message for each subsequent 
>>>>> connection attempt:
>>>>> server sent data ("D" message) without prior row description ("T" 
>>>>> message)
>>>> During the connection attempt?  I don't think libpq can report that
>>>> message until it tries to do a regular query (might be wrong though).
>>>> Is the client using some application that's going to issue a query
>>>> immediately on connecting?
>>> What I've been wondering all along is whether they are using a
>>> connection pool.
>> Yes they are using a connection pool. A java based one.
> 
> It's quite possible that it's the connection pool that gets confused,
> and not PostgreSQL itself.  It would be interesting if they change the
> connection setting when the "hang" next occurs, to point directly to
> PostgreSQL bypassing the connection pool.

Well, except that when they connect with pgAdmin (which wouldn't go 
through the connection pool) they get the error as well.

Joshua D. Drake

> 
> OTOH the connection pool may be the thing with the TickCounter problem.
> 




Re: Win32 hard crash problem

From

Dave Cramer

Date:

05 September 2006, 19:33:11

On 5-Sep-06, at 6:05 PM, Joshua D. Drake wrote:

> Alvaro Herrera wrote:
>> Tom Lane wrote:
>>> "Joshua D. Drake" <jd@commandprompt.com> writes:
>>>>>> Fail in what way. Hang, not connect, or get an error msg?
>>>> Just verified with customer. Once the problem occurs the first  
>>>> time, the customer will continually get the same error message  
>>>> for each subsequent connection attempt:
>>>> server sent data ("D" message) without prior row description  
>>>> ("T" message)
>>> During the connection attempt?  I don't think libpq can report that
>>> message until it tries to do a regular query (might be wrong  
>>> though).
>>> Is the client using some application that's going to issue a query
>>> immediately on connecting?
>> What I've been wondering all along is whether they are using a
>> connection pool.
>
> Yes they are using a connection pool. A java based one.
Since Java has its own protocol implementation, this is totally
unrelated to any libpq error messages.

While I've not personally used the pool in question (c3p0) my  
understanding is that it is pretty robust.

Personally, I'm betting on some windows TCP/IP weirdness here.

Dave

Re: Win32 hard crash problem

From

Alvaro Herrera

Date:

05 September 2006, 19:35:26

Joshua D. Drake wrote:
> Alvaro Herrera wrote:
> >Joshua D. Drake wrote:
> >>Alvaro Herrera wrote:

> >>>What I've been wondering all along is whether they are using a
> >>>connection pool.
> >>Yes they are using a connection pool. A java based one.
> >
> >It's quite possible that it's the connection pool that gets confused,
> >and not PostgreSQL itself.  It would be interesting if they change the
> >connection setting when the "hang" next occurs, to point directly to
> >PostgreSQL bypassing the connection pool.
> 
> Well except when they are connecting with Pgadmin (which wouldn't go 
> through the connection pool) they get the error as well.

Are you assuming, or did they/you verify that this is indeed the case?
I see no reason to assume that pgAdmin can't connect via a pool.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Win32 hard crash problem

From

Tom Lane

Date:

05 September 2006, 19:36:00

Dave Cramer <pg@fastcrypt.com> writes:
> On 5-Sep-06, at 6:05 PM, Joshua D. Drake wrote:
>> Yes they are using a connection pool. A java based one.

> Since java has it's own protocol implementation, this is totally  
> unrelated to any libpq error messages.

Another important point that we've not been given information on:
when pgAdmin/libpq starts failing like this, exactly what is happening
with the connection pool?  Is it still able to issue queries, and
if not what happens exactly?
        regards, tom lane

Re: Win32 hard crash problem

From

Alvaro Herrera

Date:

05 September 2006, 19:41:33

Joshua D. Drake wrote:
> Alvaro Herrera wrote:
> >Tom Lane wrote:
> >>"Joshua D. Drake" <jd@commandprompt.com> writes:
> >>>>>Fail in what way. Hang, not connect, or get an error msg?
> >>>Just verified with customer. Once the problem occurs the first time, the 
> >>>customer will continually get the same error message for each subsequent 
> >>>connection attempt:
> >>>server sent data ("D" message) without prior row description ("T" 
> >>>message)
> >>During the connection attempt?  I don't think libpq can report that
> >>message until it tries to do a regular query (might be wrong though).
> >>Is the client using some application that's going to issue a query
> >>immediately on connecting?
> >
> >What I've been wondering all along is whether they are using a
> >connection pool.
> 
> Yes they are using a connection pool. A java based one.

It's quite possible that it's the connection pool that gets confused,
and not PostgreSQL itself.  It would be interesting if they change the
connection setting when the "hang" next occurs, to point directly to
PostgreSQL bypassing the connection pool.

OTOH the connection pool may be the thing with the TickCounter problem.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Win32 hard crash problem

From

"Joshua D. Drake"

Date:

05 September 2006, 19:44:51

Alvaro Herrera wrote:
> Joshua D. Drake wrote:
>> Alvaro Herrera wrote:
>>> Joshua D. Drake wrote:
>>>> Alvaro Herrera wrote:
> 
>>>>> What I've been wondering all along is whether they are using a
>>>>> connection pool.
>>>> Yes they are using a connection pool. A java based one.
>>> It's quite possible that it's the connection pool that gets confused,
>>> and not PostgreSQL itself.  It would be interesting if they change the
>>> connection setting when the "hang" next occurs, to point directly to
>>> PostgreSQL bypassing the connection pool.
>> Well except when they are connecting with Pgadmin (which wouldn't go 
>> through the connection pool) they get the error as well.
> 
> Are you assuming, or did they/you verify that this is indeed the case?
> I see no reason to assume that pgAdmin can't connect via a pool.
> 

Verified. They do not connect to the connection pool for pgadmin.

Although I would think pgadmin might have problems connecting to a java 
based pool. If I recall, (I could be cranked) JDBC apps can't use pgpool 
for example.

Sincerely,

Joshua D. Drake




Re: Win32 hard crash problem

From

"Joshua D. Drake"

Date:

05 September 2006, 20:00:21

Tom Lane wrote:
> Dave Cramer <pg@fastcrypt.com> writes:
>> On 5-Sep-06, at 6:05 PM, Joshua D. Drake wrote:
>>> Yes they are using a connection pool. A java based one.
> 
>> Since java has it's own protocol implementation, this is totally  
>> unrelated to any libpq error messages.
> 
> Another important point that we've not been given information on:
> when pgAdmin/libpq starts failing like this, exactly what is happening
> with the connection pool?  Is it still able to issue queries, and
> if not what happens exactly?

No, when this happens everything stops. The only thing they get back is 
that message until they reboot the server. The web app (via 
Java/connection pool) and pgAdmin both give the same error.

Which, now that I think about it, seems odd if the message is coming from 
libpq, yes?

Sincerely,

Joshua D. Drake



Re: Win32 hard crash problem

From

"Dave Page"

Date:

05 September 2006, 20:00:59


-----Original Message-----
From: "Joshua D. Drake" <jd@commandprompt.com>
To: "Joshua D. Drake" <jd@commandprompt.com>; "Tom Lane" <tgl@sss.pgh.pa.us>; "Merlin Moncure" <mmoncure@gmail.com>;
"Magnus Hagander" <mha@sollentuna.net>; "PostgreSQL-development" <pgsql-hackers@postgresql.org>
Sent: 05/09/06 23:27
Subject: Re: [HACKERS] Win32 hard crash problem


> Well except when they are connecting with Pgadmin (which wouldn't go
> through the connection pool) they get the error as well.

It wouldn't? It's just a 'regular' libpq app. Doesn't say much for the connection pool if it cannot handle a simple
libpq connection.

/D

Re: Win32 hard crash problem

From

Alvaro Herrera

Date:

05 September 2006, 20:07:57

Joshua D. Drake wrote:
> Tom Lane wrote:
> >Dave Cramer <pg@fastcrypt.com> writes:
> >>On 5-Sep-06, at 6:05 PM, Joshua D. Drake wrote:
> >>>Yes they are using a connection pool. A java based one.
> >
> >>Since java has it's own protocol implementation, this is totally  
> >>unrelated to any libpq error messages.
> >
> >Another important point that we've not been given information on:
> >when pgAdmin/libpq starts failing like this, exactly what is happening
> >with the connection pool?  Is it still able to issue queries, and
> >if not what happens exactly?
> 
> No, when this happens everything stops. The only thing they get back is 
> that message until they reboot the server. The web app (via 
> java/connection pool), pgAdmin both give the same error.

Actually Dave Cramer told me that if the postmaster was stopped and then
restarted, it would start answering fine again.  Which would make a lot
of sense.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Win32 hard crash problem

From

"Joshua D. Drake"

Date:

05 September 2006, 20:11:51

Alvaro Herrera wrote:
> Joshua D. Drake wrote:
>> Tom Lane wrote:
>>> Dave Cramer <pg@fastcrypt.com> writes:
>>>> On 5-Sep-06, at 6:05 PM, Joshua D. Drake wrote:
>>>>> Yes they are using a connection pool. A java based one.
>>>> Since java has it's own protocol implementation, this is totally  
>>>> unrelated to any libpq error messages.
>>> Another important point that we've not been given information on:
>>> when pgAdmin/libpq starts failing like this, exactly what is happening
>>> with the connection pool?  Is it still able to issue queries, and
>>> if not what happens exactly?
>> No, when this happens everything stops. The only thing they get back is 
>> that message until they reboot the server. The web app (via 
>> java/connection pool), pgAdmin both give the same error.
> 
> Actually Dave Cramer told me that if the postmaster was stopped and then
> restarted, it would start answering fine again.  Which would make a lot
> of sense.

I already said that ;). The problem IS NOT that we can't restart the 
system and get postgresql back. It is that it happens at all.

Joshua D. Drake



Re: Win32 hard crash problem

From

"Joshua D. Drake"

Date:

05 September 2006, 20:23:54

Hello,

O.k. to recap:

OS: Win2k3 SP1
PostgreSQL: 8.1.2
Application Server: Jboss
Connection Pooler: c3p0
JDBC Version: 8.1.404, Also verified with 8.0.311

Problem:

After 2/3 weeks, PostgreSQL will begin issuing the following message:

server sent data ("D" message) without prior row description ("T" message)

This message appears when connection attempts are made from 
the Web Application (Java/JDBC), or locally via pgAdmin. Once the error 
message is received, all subsequent connection attempts will also result 
in that same message. We do not know if the error occurs before or after 
authentication.

The only known resolution is to reboot Windows. Using the service 
control panel to shut down PostgreSQL will fail once the message is 
received. It is unknown whether using the task manager to individually kill 
processes will work.

Sincerely,

Joshua D. Drake





Re: Win32 hard crash problem

From

Alvaro Herrera

Date:

05 September 2006, 20:24:08

Joshua D. Drake wrote:
> Alvaro Herrera wrote:
> >Joshua D. Drake wrote:
> >>Tom Lane wrote:
> >>>Dave Cramer <pg@fastcrypt.com> writes:
> >>>>On 5-Sep-06, at 6:05 PM, Joshua D. Drake wrote:
> >>>>>Yes they are using a connection pool. A java based one.
> >>>>Since java has it's own protocol implementation, this is totally  
> >>>>unrelated to any libpq error messages.
> >>>Another important point that we've not been given information on:
> >>>when pgAdmin/libpq starts failing like this, exactly what is happening
> >>>with the connection pool?  Is it still able to issue queries, and
> >>>if not what happens exactly?
> >>No, when this happens everything stops. The only thing they get back is 
> >>that message until they reboot the server. The web app (via 
> >>java/connection pool), pgAdmin both give the same error.
> >
> >Actually Dave Cramer told me that if the postmaster was stopped and then
> >restarted, it would start answering fine again.  Which would make a lot
> >of sense.
> 
> I already said that ;). The problem IS NOT that we can't restart the 
> system and get postgresql back. It is that it happens at all.

A bug that can only be fixed by "rebooting the server" (which to me means
taking the operating system down and starting it afresh) is quite different
from one that can be fixed by restarting the PostgreSQL server (_without_
taking the operating system down).  I've been reading "reboot" all along --
sorry if I missed an email saying otherwise.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Win32 hard crash problem

From

Alvaro Herrera

Date:

05 September 2006, 20:30:30

Joshua D. Drake wrote:

> The only known resolution is to reboot Windows. Using the service
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> control panel to shutdown postgresql will fail once the message is 
> received. It is unknown if using the task master to individually kill 
> processes will work.

This is the part that doesn't match what Dave told me.

The part about failing to shut the postmaster down is the first I've
heard of it.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Win32 hard crash problem

From

Tom Lane

Date:

05 September 2006, 20:36:41

Alvaro Herrera <alvherre@commandprompt.com> writes:
> Joshua D. Drake wrote:
>> I already said that ;). The problem IS NOT that we can't restart the 
>> system and get postgresql back. It is that it happens at all.

> It is quite different a bug that can only be fixed by "rebooting the
> server" (which to me means taking the operating system down and starting
> it afresh) than one that can be fixed by restarting the PostgreSQL
> server (_without_ taking the operating system down).  I've been reading
> "reboot" all along -- sorry if I missed an email saying otherwise.

It sounds to me like we don't actually know that, because the client
doesn't know how to restart the postmaster without rebooting the OS.
(Josh says "pg_ctl stop" doesn't work in this state, which is a tad
interesting in itself, because that doesn't go through a connection
request.)  It would be useful to try killing off the postgres processes
via task manager and then see if a new postmaster can be started and if
things then behave normally, or if a reboot is truly needed.

The bottom line here is that all we have so far are client-side
observations ("I get this message") and we have no clue what state
the postmaster thinks it's in.  We really need more information.
        regards, tom lane

Re: Win32 hard crash problem

From

"Joshua D. Drake"

Date:

05 September 2006, 20:38:32

> It sounds to me like we don't actually know that, because the client
> doesn't know how to restart the postmaster without rebooting the OS.
> (Josh says "pg_ctl stop" doesn't work in this state, which is a tad
> interesting in itself, because that doesn't go through a connection
> request.)  It would be useful to try killing off the postgres processes
> via task manager and then see if a new postmaster can be started and if
> things then behave normally, or if a reboot is truly needed.

Right, and I have asked that the next time this happens that they try 
and use the task manager to kill the process.

> 
> The bottom line here is that all we have so far are client-side
> observations ("I get this message") and we have no clue what state
> the postmaster thinks it's in.  We really need more information.
> 

Yes, unfortunately there isn't much more to be had for another 2 weeks ;)

Sincerely,

Joshua D. Drake



Re: Win32 hard crash problem

From

Dave Cramer

Date:

05 September 2006, 21:29:28

On 5-Sep-06, at 7:00 PM, Joshua D. Drake wrote:

> Tom Lane wrote:
>> Dave Cramer <pg@fastcrypt.com> writes:
>>> On 5-Sep-06, at 6:05 PM, Joshua D. Drake wrote:
>>>> Yes they are using a connection pool. A java based one.
>>> Since java has it's own protocol implementation, this is totally   
>>> unrelated to any libpq error messages.
>> Another important point that we've not been given information on:
>> when pgAdmin/libpq starts failing like this, exactly what is  
>> happening
>> with the connection pool?  Is it still able to issue queries, and
>> if not what happens exactly?
>
> No, when this happens everything stops. The only thing they get  
> back is that message until they reboot the server. The web app (via  
> java/connection pool), pgAdmin both give the same error.
>
> Which now that I think about it, seems odd if the message is coming  
> from libpq yes?
Yes, this is very odd. AFAICS, this message does not exist in the Java
driver, so it would be interesting to get the actual logs from
the client.


Re: Win32 hard crash problem

From

Tom Lane

Date:

06 September 2006, 00:09:31

"Joshua D. Drake" <jd@commandprompt.com> writes:
> Yes, unfortunately there isn't much more to be had for another 2 weeks ;)

I trust they've got the reboot time and they will know exactly how long
from reboot to problem?  I'm not all that sold on the "GetTickCount
overflow" theory, but certainly we ought not be missing a chance to test
or disprove it.
        regards, tom lane
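
For reference, a minimal sketch of how a recorded reboot time could be checked against the GetTickCount theory. GetTickCount returns milliseconds since boot as a 32-bit value, so it wraps after 2^32 ms, roughly 49.7 days. The helper name `near_tick_wrap` and the one-hour tolerance are illustrative assumptions, not anything taken from PostgreSQL or from the customer's setup.

```python
from datetime import datetime, timedelta

# A 32-bit millisecond counter (as returned by GetTickCount) wraps after 2**32 ms.
TICK_WRAP = timedelta(milliseconds=2**32)


def near_tick_wrap(boot, failure, tolerance=timedelta(hours=1)):
    """Return True if `failure` is within `tolerance` of a multiple of the
    wrap interval after `boot`.

    Hypothetical helper for illustration; the tolerance is an arbitrary assumption.
    """
    uptime = failure - boot
    if uptime < TICK_WRAP - tolerance:
        return False  # the counter cannot have wrapped yet
    remainder = uptime % TICK_WRAP
    return min(remainder, TICK_WRAP - remainder) <= tolerance


if __name__ == "__main__":
    boot = datetime(2006, 9, 1)
    print(TICK_WRAP.total_seconds() / 86400)                  # wrap interval in days
    print(near_tick_wrap(boot, boot + timedelta(days=14)))    # two weeks of uptime
```

With the exact boot and failure timestamps from the customer, a check like this would either line the failures up with the wrap interval or rule it out.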

Re: Win32 hard crash problem

From

"Joshua D. Drake"

Date:

06 September 2006, 00:25:37

Tom Lane wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
>> Yes, unfortunately there isn't much more to be had for another 2 weeks ;)
> 
> I trust they've got the reboot time and they will know exactly how long
> from reboot to problem?  I'm not all that sold on the "GetTickCount
> overflow" theory, but certainly we ought not be missing a chance to test
> or disprove it.

Yes I documented all conversations and disclaimers :)

Joshua D. Drake


Re: Win32 hard crash problem

From

Oleg Bartunov

Date:

06 September 2006, 01:16:02

I'm a bit afraid to engage in this thread, but I've also seen a
reproducible case where a libpq client stops working and 'vacuum analyze'
helped. It happened on Windows Server 2003 and XP with PostgreSQL 8.1.4.
I don't have the client source code, so I can't say more, but the customer's
developer said the same behaviour was observed on Linux with 8.1.0 and went
away in 8.1.4. They said that this happens only with row statistics enabled.
The client inserts some data in a transaction, the backend writes 'COMMIT'
to the log, but the client waits for something, and a 'vacuum analyze' of the
whole database in some magic way unblocks the process.

I've got their installation CD and will try to investigate this problem.
Any suggestions? I'm not familiar with Win32 at all.

Oleg

On Tue, 5 Sep 2006, Tom Lane wrote:

> "Joshua D. Drake" <jd@commandprompt.com> writes:
>> Yes, unfortunately there isn't much more to be had for another 2 weeks ;)
>
> I trust they've got the reboot time and they will know exactly how long
> from reboot to problem?  I'm not all that sold on the "GetTickCount
> overflow" theory, but certainly we ought not be missing a chance to test
> or disprove it.
>
>             regards, tom lane
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: Win32 hard crash problem

From

"Magnus Hagander"

Date:

06 September 2006, 04:27:36

> >>>> Yes they are using a connection pool. A java based one.
> >>> Since java has it's own protocol implementation, this is
> totally
> >>> unrelated to any libpq error messages.
> >> Another important point that we've not been given information
> on:
> >> when pgAdmin/libpq starts failing like this, exactly what is
> >> happening with the connection pool?  Is it still able to issue
> >> queries, and if not what happens exactly?
> >
> > No, when this happens everything stops. The only thing they get
> back
> > is that message until they reboot the server. The web app (via
> > java/connection pool), pgAdmin both give the same error.
> >
> > Which now that I think about it, seems odd if the message is
> coming
> > from libpq yes?
> Yes, this is very odd, AFICS, this message does not exist in the
> java driver. So.... it would be interesting to get the actual logs
> from the client.

Definitely - that error msg showing up in the web app really doesn't make
sense. However, are we sure that the error message is *exactly* the
same, word for word, or is it possible that it's just "the same in what
it says" but with different words? I assume there are screendumps to
verify this ;-)


Another point that at least I don't know - what kind of connection pool
is it? Is it an external one (like pgpool) to which the java app
connects (using FE/BE protocol, emulating a "proper postmaster" but
pooling access to the database), or is it running inside the app server
(like for example .net connection pooling does, which simply means that
when you run the Open() method on the connection object it will pick
something off an *internal* pool)?

//Magnus

Re: Win32 hard crash problem

From

Michael Paesold

Date:

06 September 2006, 05:18:28

Magnus Hagander wrote:
> Another point that at least I don't know - what kind of connection pool
> is it? Is it an external one (like pgpool) to which the java app
> connects (using FE/BE protocol, emulating a "proper postmaster" but
> pooling access to the database), or is it running inside the app server
> (like for example .net connection pooling does, which simply means that
> when you run the Open() method on the connection object it will pick
> something off an *internal* pool)?

Googling for c3p0 [1] shows that it is a Java-based connection pool that 
implements connection pooling using the JDBC API, i.e. it is an *internal* 
pool running inside the app server's JVM. pgAdmin cannot in any case 
connect through this pool.

Best Regards
Michael Paesold

[1] http://sourceforge.net/projects/c3p0

Re: Win32 hard crash problem

From

"Magnus Hagander"

Date:

06 September 2006, 05:33:53

> > server sent data ("D" message) without prior row description ("T"
> > message)
>
> During the connection attempt?  I don't think libpq can report that
> message until it tries to do a regular query (might be wrong
> though).
> Is the client using some application that's going to issue a query
> immediately on connecting?

In the case of pgAdmin, it does. It will set datestyle, load a list of
dbs etc.


//Magnus
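
To make the error text concrete, here is a simplified sketch of the kind of client-side check it implies. In the version 3 frontend/backend protocol each backend message is a one-byte type followed by a 4-byte big-endian length (which counts itself) and the payload, and a query result normally starts with RowDescription ('T') before any DataRow ('D'). This is not libpq's code; `frame`, `check_result_stream` and `ProtocolError` are hypothetical names, and the sketch ignores every message type except T, D, C and Z.

```python
import struct


class ProtocolError(Exception):
    pass


def frame(msg_type, payload=b""):
    """Build one backend message: type byte, int32 length (includes itself), payload."""
    return msg_type + struct.pack("!I", len(payload) + 4) + payload


def check_result_stream(data):
    """Walk a byte stream of well-formed backend messages and return the types seen.

    Raises ProtocolError if a DataRow ('D') arrives with no RowDescription ('T')
    for the current result.
    """
    seen = []
    have_row_description = False
    pos = 0
    while pos < len(data):
        msg_type = data[pos:pos + 1]
        (length,) = struct.unpack("!I", data[pos + 1:pos + 5])
        pos += 1 + length  # skip the type byte plus the length-prefixed body
        if msg_type == b"T":
            have_row_description = True
        elif msg_type == b"D" and not have_row_description:
            raise ProtocolError(
                'server sent data ("D" message) without prior row description ("T" message)'
            )
        elif msg_type in (b"C", b"Z"):
            have_row_description = False  # result finished; the next one needs its own 'T'
        seen.append(msg_type.decode("ascii"))
    return seen
```

A stream of `frame(b"T", ...)`, `frame(b"D", ...)`, `frame(b"C", ...)`, `frame(b"Z", b"I")` passes, while a stream that begins with a 'D' frame raises the error.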

Re: Win32 hard crash problem

From

Dave Cramer

Date:

06 September 2006, 09:32:51

On 6-Sep-06, at 3:27 AM, Magnus Hagander wrote:

>>>>>> Yes they are using a connection pool. A java based one.
>>>>> Since java has it's own protocol implementation, this is
>> totally
>>>>> unrelated to any libpq error messages.
>>>> Another important point that we've not been given information
>> on:
>>>> when pgAdmin/libpq starts failing like this, exactly what is
>>>> happening with the connection pool?  Is it still able to issue
>>>> queries, and if not what happens exactly?
>>>
>>> No, when this happens everything stops. The only thing they get
>> back
>>> is that message until they reboot the server. The web app (via
>>> java/connection pool), pgAdmin both give the same error.
>>>
>>> Which now that I think about it, seems odd if the message is
>> coming
>>> from libpq yes?
>> Yes, this is very odd, AFICS, this message does not exist in the
>> java driver. So.... it would be interesting to get the actual logs
>> from the client.
>
> Definitly - that error msg showing up in the web app really doesn't  
> make
> sense. However, are we sure that the error message is *exactly* the
> same, word for word, or is it possible that it's just "the same in  
> what
> it says" but with different words? I assume there are screendumps to
> verify this ;-)

I looked at the code in the JDBC driver and it doesn't even do this
check.


>
>
> Another point that at least I don't know - what kind of connection  
> pool
> is it? Is it an external one (like pgpool) to which the java app
> connects (using FE/BE protocol, emulating a "proper postmaster" but
> pooling access to the database), or is it running inside the app  
> server
> (like for example .net connection pooling does, which simply means  
> that
> when you run the Open() method on the connection object it will pick
> something off an *internal* pool)?
It's an internal pool, and the client has told me off list they have  
removed it and are using the jdbc driver pool.

At this point I'm confused as to what they really are using, but as  
they have contracted Command Prompt to fix this for them, I am no  
longer in the private loop.

Dave

Re: Win32 hard crash problem

From

Gregory Stark

Date:

06 September 2006, 17:18:05

"Joshua D. Drake" <jd@commandprompt.com> writes:

> O.k. to recap:
>
> This message will present itself, if connection attempts are made from the Web
> Application (Java/JDBC), or locally via PgAdmin. Once the error message is
> received, all subsequent connection attempts will also result in that same
> message. We do not know if the error occurs before or after authentication.

I think other people have claimed that this message is in libpq and not in
the JDBC source code, which is inconsistent with this description.

> The only known resolution is to reboot Windows. Using the service control panel
> to shutdown postgresql will fail once the message is received. It is unknown if
> using the task master to individually kill processes will work.

This contradicts your previous email about restarting the postmaster working.

I think you have to sit down and write down *exactly* what sequence of actions
cause what results. Describing them in shorthand like "if connection attempts
are made" is leading to a lot of speculation instead of systematic deductions.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com

Re: Win32 hard crash problem

From

"Joshua D. Drake"

Date:

06 September 2006, 22:42:22

Gregory Stark wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
> 
>> O.k. to recap:
>>
>> This message will present itself, if connection attempts are made from the Web
>> Application (Java/JDBC), or locally via PgAdmin. Once the error message is
>> received, all subsequent connection attempts will also result in that same
>> message. We do not know if the error occurs before or after authentication.
> 
> I think other people have claimed that this message is in libpq and not in
> JDBC source code which is inconsistent with this description.

Yes I am fully aware of that. I am only relaying what the customer said.

> 
>> The only known resolution is to reboot Windows. Using the service control panel
>> to shutdown postgresql will fail once the message is received. It is unknown if
>> using the task master to individually kill processes will work.
> 
> This contradicts your previous email about restarting the postmaster working.

No, it doesn't. I never said restarting the postmaster would work. I 
said rebooting windows, allows postgresql to come back up. Those are 
entirely different things.

Sincerely,

Joshua D. Drake



Re: Win32 hard crash problem

From

Alvaro Herrera

Date:

07 September 2006, 00:06:38

Joshua D. Drake wrote:
> Gregory Stark wrote:
> >"Joshua D. Drake" <jd@commandprompt.com> writes:

> >>The only known resolution is to reboot Windows. Using the service
> >>control panel to shutdown postgresql will fail once the message is
> >>received. It is unknown if using the task master to individually
> >>kill processes will work.
> >
> >This contradicts your previous email about restarting the postmaster 
> >working.
> 
> No, it doesn't. I never said restarting the postmaster would work. I
> said rebooting windows, allows postgresql to come back up. Those are 
> entirely different things.

Yup.  It was me who said that restarting the postmaster solved the
problem.  That's what Dave Cramer told me.  But maybe Dave was not
certain about that -- he did use the word "reboot" and I asked for
confirmation about whether this was an actual reboot of the machine 
or just a postmaster "reboot", and he said it was the latter.  But this
may have been a supposition.

Sorry for the confusion.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Win32 hard crash problem

From

Gregory Stark

Date:

07 September 2006, 06:38:20

"Joshua D. Drake" <jd@commandprompt.com> writes:

> Yes I am fully aware of that. I am only relaying what the customer said.

Yeah sorry, I guess what I sent was pretty obvious to you. I should stop
confusing -general with -hackers :)

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com

Re: Win32 hard crash problem

From

"Joshua D. Drake"

Date:

29 September 2006, 13:56:58

Joshua D. Drake wrote:
> Tom Lane wrote:
>> "Joshua D. Drake" <jd@commandprompt.com> writes:
>>> Yes, unfortunately there isn't much more to be had for another 2
>>> weeks ;)
>>
>> I trust they've got the reboot time and they will know exactly how long
>> from reboot to problem?  I'm not all that sold on the "GetTickCount
>> overflow" theory, but certainly we ought not be missing a chance to test
>> or disprove it.
> 
> Yes I documented all conversations and disclaimers :)

OK, further on this: the crashes are now happening more quickly, but not
predictably (sometimes after a week, sometimes after 2 days). I just now got
them to send some further logs... Interestingly:

2006-09-28 16:38:37.406  LOG:  could not send data to client: An
operation on a socket could not be performed because the system lacked
sufficient buffer space or because a queue was full.

That log entry is the last (of consequence) entry before the machine says:

2006-09-28 16:40:36.921  LOG:  received fast shutdown request
2006-09-28 16:40:36.921  LOG:  aborting any active transactions
2006-09-28 16:40:36.921  FATAL:  terminating connection due to
administrator command

On the ERROR side of things I have a bunch of standard, unique key
violations etc... AND:

postgresql-2006-09-27_000000.log:2006-09-27 23:49:57.671  FATAL:  could
not read from statistics collector pipe: No error

I have requested a clean run with entire log at DEBUG2. Hopefully that
will give us more info.

Sincerely,

Joshua D. Drake



Re: Win32 hard crash problem

From

Tom Lane

Date:

29 September 2006, 18:55:37

"Joshua D. Drake" <jd@commandprompt.com> writes:
> O.k. further on this.. the crashing is happening quickly now but not
> predictably. (as in sometimes a week sometimes 2 days).

OK, that seems to eliminate the GetTickCount-overflow theory anyway.
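
For reference, the arithmetic behind that theory can be sketched as follows. This is a minimal illustration, assuming the counter in question is the 32-bit millisecond value that GetTickCount returns; the function names tick_wrap_days and tick_elapsed are hypothetical and appear nowhere in PostgreSQL.

```c
#include <stdint.h>

/* Number of distinct values of a 32-bit millisecond counter (2^32). */
#define TICK_WRAP_MS 4294967296.0

/* Days from boot until a 32-bit millisecond counter wraps to zero. */
double tick_wrap_days(void)
{
    const double ms_per_day = 1000.0 * 60.0 * 60.0 * 24.0;
    return TICK_WRAP_MS / ms_per_day;   /* roughly 49.7 days */
}

/*
 * Wrap-safe elapsed time between two counter readings.
 * Unsigned subtraction is modulo 2^32, so the result stays correct
 * across a single wraparound between the two readings.
 */
uint32_t tick_elapsed(uint32_t start, uint32_t now)
{
    return (uint32_t) (now - start);
}
```

A counter of this kind wraps once per roughly 49.7 days of uptime, so failures that recur after two days or a week, as reported above, do not line up with that period.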

> That log entry is the last (of consequence) entry before the machine says:
> 2006-09-28 16:40:36.921  LOG:  received fast shutdown request

Oh?  That's pretty interesting on a Windows machine, because AFAIK there
wouldn't be any standard mechanism that might tie into our homegrown
signal facility.  Anyone have a theory on what might trigger a SIGINT
to the postmaster, other than intentional pg_ctl invocation?
        regards, tom lane

Re: Win32 hard crash problem

From

"Joshua D. Drake"

Date:

29 September 2006, 23:31:35

Tom Lane wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
>> O.k. further on this.. the crashing is happening quickly now but not
>> predictably. (as in sometimes a week sometimes 2 days).
> 
> OK, that seems to eliminate the GetTickCount-overflow theory anyway.
> 
>> That log entry is the last (of consequence) entry before the machine says:
>> 2006-09-28 16:40:36.921  LOG:  received fast shutdown request
> 
> Oh?  That's pretty interesting on a Windows machine, because AFAIK there
> wouldn't be any standard mechanism that might tie into our homegrown
> signal facility.  Anyone have a theory on what might trigger a SIGINT
> to the postmaster, other than intentional pg_ctl invocation?

Well, the other option would be a Windows restart. On Windows, would that
send a SIGINT to the backend?

Joshua D. Drake





Re: Win32 hard crash problem

From

"Magnus Hagander"

Date:

01 October 2006, 16:20:37

> > That log entry is the last (of consequence) entry before
> the machine says:
> > 2006-09-28 16:40:36.921  LOG:  received fast shutdown request
>
> Oh?  That's pretty interesting on a Windows machine, because
> AFAIK there wouldn't be any standard mechanism that might tie
> into our homegrown signal facility.  Anyone have a theory on
> what might trigger a SIGINT to the postmaster, other than
> intentional pg_ctl invocation?

pg_ctl will send SIGINT to the postmaster when the service is stopped,
or when windows is shutting down.

Do you get anything about the postgresql service in the eventlog within
say a minute of this happening? (before or after)


Could it be a backend or the postmaster trying to send a signal to a
different backend, that for some reason sends it to the wrong process?

//Magnus

Re: Win32 hard crash problem

From

"Joshua D. Drake"

Date:

01 October 2006, 19:46:04

Magnus Hagander wrote:
>>> That log entry is the last (of consequence) entry before 
>> the machine says:
>>> 2006-09-28 16:40:36.921  LOG:  received fast shutdown request
>> Oh?  That's pretty interesting on a Windows machine, because 
>> AFAIK there wouldn't be any standard mechanism that might tie 
>> into our homegrown signal facility.  Anyone have a theory on 
>> what might trigger a SIGINT to the postmaster, other than 
>> intentional pg_ctl invocation?
> 
> pg_ctl will send SIGINT to the postmaster when the service is stopped,
> or when windows is shutting down. 

O.k. that pretty much confirms my suspicion then. The SIGINT likely came
from the user rebooting windows.

> 
> Do you get anything about the postgresql service in the eventlog within
> say a minute of this happening? (before or after)

Too late to say now :( I will have to follow up with them.


Sincerely,

Joshua D. Drake


Re: Win32 hard crash problem

From

"Andrew Dunstan"

Date:

01 October 2006, 21:23:11

IIRC there is no real SIGINT on Windows, so it can only come from a
postgres program. The windows shutdown could be calling pg_ctl to stop the
service, of course.

cheers

andrew

Joshua D. Drake wrote:
> Magnus Hagander wrote:
>>>> That log entry is the last (of consequence) entry before
>>> the machine says:
>>>> 2006-09-28 16:40:36.921  LOG:  received fast shutdown request
>>> Oh?  That's pretty interesting on a Windows machine, because
>>> AFAIK there wouldn't be any standard mechanism that might tie
>>> into our homegrown signal facility.  Anyone have a theory on
>>> what might trigger a SIGINT to the postmaster, other than
>>> intentional pg_ctl invocation?
>>
>> pg_ctl will send SIGINT to the postmaster when the service is stopped,
>> or when windows is shutting down.
>
> O.k. that pretty much confirms my suspicion then. The SIGINT likely came
> from the user rebooting windows.
>
>>
>> Do you get anything about the postgresql service in the eventlog within
>> say a minute of this happening? (before or after)
>
> Too late to say now :( I will have to follow up with them.
>

Re: Win32 hard crash problem

From

"Magnus Hagander"

Date:

02 October 2006, 03:05:33

> IIRC there is no real SIGINT on Windows, so it can only come
> from a postgres program. The windows shutdown could be
> calling pg_ctl to stop the service, of course.

Well, not quite that, but it will send a service command to the running
pg_ctl (which is our "service supervisor"), which *will* respond with a
SIGINT to the postmaster.
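
The chain described above can be modelled roughly as below. This is only a sketch under stated assumptions, not pg_ctl's actual code: the MODEL_ constants mirror the values of the Win32 service control codes so the example builds without the Windows headers, and signal_for_service_control is a hypothetical helper name.

```c
#include <signal.h>

/*
 * Values match SERVICE_CONTROL_STOP, SERVICE_CONTROL_INTERROGATE and
 * SERVICE_CONTROL_SHUTDOWN from the Win32 headers; they are redefined
 * under MODEL_ names so this sketch is portable.
 */
#define MODEL_SERVICE_CONTROL_STOP        0x00000001UL
#define MODEL_SERVICE_CONTROL_INTERROGATE 0x00000004UL
#define MODEL_SERVICE_CONTROL_SHUTDOWN    0x00000005UL

/*
 * Hypothetical helper: given a control code delivered to the service
 * supervisor, return the signal it would forward to the postmaster,
 * or 0 if the code does not lead to a shutdown.
 */
int signal_for_service_control(unsigned long control)
{
    switch (control)
    {
        case MODEL_SERVICE_CONTROL_STOP:      /* service stopped by an admin */
        case MODEL_SERVICE_CONTROL_SHUTDOWN:  /* Windows itself is shutting down */
            return SIGINT;                    /* postmaster: fast shutdown */
        default:
            return 0;                         /* e.g. interrogate: no signal */
    }
}
```

In this model both a service stop and a system shutdown end in the same SIGINT, which is why the "received fast shutdown request" log line alone cannot distinguish the two.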


//Magnus