Thread: Questions about connection clean-up and "invalid page header"

Questions about connection clean-up and "invalid page header"

From
Herouth Maoz
Date:
Hi Everybody.

I have two questions.

1. We have a system that is accessed by Crystal reports which is in turned
controlled by another (3rd party) system. Now, when a report takes too long or
the user cancels it, it doesn't send a cancel request to Postgres. It just
kills the Crystal process that works on it.

As a result, the query is left alive on the Postgres backend. Eventually I get
the message "Unexpected End of file" and the query is cancelled. But this
doesn't happen soon enough for me - these are usually very heavy queries, and
I'd like them to be cleaned up as soon as possible if the client connection
has ended.

Is there a parameter to set in the configuration or some other means to
shorten the time before an abandoned backend's query is cancelled?

2. I get the following message in my development database:

vacuumdb: vacuuming of database "reports" failed: ERROR:  invalid page header
in block 6200 of relation "rb"

I had this already a couple of months ago. Looking around the web, I saw this
error is supposed to indicate a hardware error. I informed my sysadmin, but
since this is just the dev system and the data was not important, I did a
TRUNCATE TABLE on the "rb" relation, and the errors stopped...

But now the error is back, and I'm a bit suspicious. If this is a hardware
issue, it's rather suspicious that it returned in the exact same relation
after I did a "truncate table". I have many other relations in the system,
ones that fill up a lot faster. So I suspect this might be a PostgreSQL issue
after all. What can I do about this?


We are currently using PostgreSQL v. 8.3.1 on the server side.

TIA,
Herouth

Re: Questions about connection clean-up and "invalid page header"

From
Scott Marlowe
Date:
On Sun, Jan 24, 2010 at 3:17 AM, Herouth Maoz <herouth@unicell.co.il> wrote:
> Hi Everybody.
>
> I have two questions.
>
> 1. We have a system that is accessed by Crystal reports which is in turned
> controlled by another (3rd party) system. Now, when a report takes too long or
> the user cancels it, it doesn't send a cancel request to Postgres. It just
> kills the Crystal process that works on it.
>
> As a result, the query is left alive on the Postgres backend. Eventually I get
> the message "Unexpected End of file" and the query is cancelled. But this
> doesn't happen soon enough for me - these are usually very heavy queries, and
> I'd like them to be cleaned up as soon as possible if the client connection
> has ended.

The real solution is to fix the application.  But I understand
sometimes you can't do that.

> Is there a parameter to set in the configuration or some other means to
> shorten the time before an abandoned backend's query is cancelled?

You can shorten the tcp_keepalive settings so that dead connections
get detected faster.

> 2. I get the following message in my development database:
>
> vacuumdb: vacuuming of database "reports" failed: ERROR:  invalid page header
> in block 6200 of relation "rb"
>
> I had this already a couple of months ago. Looking around the web, I saw this
> error is supposed to indicate a hardware error. I informed my sysadmin, but
> since this is just the dev system and the data was not important, I did a
> TRUNCATE TABLE on the "rb" relation, and the errors stopped...
>
> But now the error is back, and I'm a bit suspicious. If this is a hardware
> issue, it's rather suspicious that it returned in the exact same relation
> after I did a "truncate table". I have many other relations in the system,
> ones that fill up a lot faster. So I suspect this might be a PostgreSQL issue
> after all. What can I do about this?

Might be, but not very likely.  I and many others run pgsql in
production environments where it handles thousands of updates /
inserts per minute with no corruption.  We run on server class
hardware with ECC memory and large RAID arrays with no corruption.

Have you run something as simple as memtest86+ on your machine to see
if it's got bad memory?

> We are currently using PostgreSQL v. 8.3.1 on the server side.

You should really update to the latest 8.3.x version (around 8.3.8 or
so).  It's simple and easy, and it's possible you've hit a bug in an
older version of 8.3.

Re: Questions about connection clean-up and "invalid page header"

From
Herouth Maoz
Date:

Scott Marlowe wrote:

You can shorten the tcp_keepalive settings so that dead connections
get detected faster. 
Thanks, I'll ask my sysadmin to do that.

Might be, but not very likely.  I and many others run pgsql in
production environments where it handles thousands of updates /
inserts per minute with no corruption.  We run on server class
hardware with ECC memory and large RAID arrays with no corruption. 
Someone pointed out to me, though, that comparing data warehouse systems to production systems is like Apples and Oranges - we also have a production system that, as you say, makes millions of inserts and updates per hour. It works very well with PostgreSQL - a lot better than with Sybase with which we worked previously. But the reports system on which I work makes bulk inserts using calculations based on complicated joins and each transaction is long and memory-consuming, as opposed to the production system, where each transaction takes a few milliseconds and is cleared immediately.

So far this only happened to me in the development server, and if it really is a matter of hardware, I'm not worried. What I am worried is if there really is some sort of bug that may carry to our production reports system.
Have you run something as simple as memtest86+ on your machine to see
if it's got bad memory? 
I'll tell my sysadmin to do that. Thank you.
 
We are currently using PostgreSQL v. 8.3.1 on the server side.   
You should really update to the latest 8.3.x version (around 8.3.8 or
so).  It's simple and easy, and it's possible you've hit a bug in an
older version of 8.3. 
OK, I'll also try to get that done.

Thanks for your help,
Herouth

Re: Questions about connection clean-up and "invalid page header"

From
Greg Stark
Date:
On Mon, Jan 25, 2010 at 8:15 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
>> Is there a parameter to set in the configuration or some other means to
>> shorten the time before an abandoned backend's query is cancelled?
>
> You can shorten the tcp_keepalive settings so that dead connections
> get detected faster.
>

This won't help. The TCP connection is already being closed (or I
think only half-closed). The problem is that in the Unix socket API
you don't find out about that unless you check or try to read or write
to it.

The tcp_keepalive setting would only come into play if the remote
machine crashed or was disconnected from the network.



--
greg

Re: Questions about connection clean-up and "invalid page header"

From
Herouth Maoz
Date:

Greg Stark wrote:

On Mon, Jan 25, 2010 at 8:15 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote: 
Is there a parameter to set in the configuration or some other means to
shorten the time before an abandoned backend's query is cancelled?     
You can shorten the tcp_keepalive settings so that dead connections
get detected faster.
   
This won't help. The TCP connection is already being closed (or I
think only half-closed). The problem is that in the Unix socket API
you don't find out about that unless you check or try to read or write
to it.

The tcp_keepalive setting would only come into play if the remote
machine crashed or was disconnected from the network. 
That's the situation I'm having, so it's OK. Crystal, being a Windows application, obviously runs on a different server than the database itself, so the connection between them is TCP/IP, not Unix domain sockets. And furthermore, that was exactly the problem as I described it - the fact that the third party software, instead of somehow instructing Crystal to send a cancel request to PostgreSQL, instead just kills the client process on the Windows side.

Herouth

Re: Questions about connection clean-up and "invalid page header"

From
Greg Stark
Date:
On Mon, Jan 25, 2010 at 11:37 AM, Herouth Maoz <herouth@unicell.co.il> wrote:
> The tcp_keepalive setting would only come into play if the remote
> machine crashed or was disconnected from the network.
>
>
> That's the situation I'm having, so it's OK. Crystal, being a Windows
> application, obviously runs on a different server than the database itself,
> so the connection between them is TCP/IP, not Unix domain sockets.

The unix socket api is used for both unix domain sockets and internet
domain sockets. The point is that in the api there's no way to find
out about a connection the other side has closed except for when you
write or read from it or when you explicitly check.


> And
> furthermore, that was exactly the problem as I described it - the fact that
> the third party software, instead of somehow instructing Crystal to send a
> cancel request to PostgreSQL, instead just kills the client process on the
> Windows side.

Killing the client process doesn't mean the machine has crashed or
been disconnected from the network. I'm assuming Crystal isn't
crashing the machine just to stop the report... And even if it did and
tcp_keepalives kicked in the server *still* wouldn't notice until it
checked or tried to read or write to that socket.

--
greg

Re: Questions about connection clean-up and "invalid page header"

From
Herouth Maoz
Date:

Greg Stark wrote:

On Mon, Jan 25, 2010 at 11:37 AM, Herouth Maoz <herouth@unicell.co.il> wrote: 
The tcp_keepalive setting would only come into play if the remote
machine crashed or was disconnected from the network.


That's the situation I'm having, so it's OK. Crystal, being a Windows
application, obviously runs on a different server than the database itself,
so the connection between them is TCP/IP, not Unix domain sockets.   
The unix socket api is used for both unix domain sockets and internet
domain sockets. The point is that in the api there's no way to find
out about a connection the other side has closed except for when you
write or read from it or when you explicitly check.

 
And
furthermore, that was exactly the problem as I described it - the fact that
the third party software, instead of somehow instructing Crystal to send a
cancel request to PostgreSQL, instead just kills the client process on the
Windows side.   
Killing the client process doesn't mean the machine has crashed or
been disconnected from the network. I'm assuming Crystal isn't
crashing the machine just to stop the report... And even if it did and
tcp_keepalives kicked in the server *still* wouldn't notice until it
checked or tried to read or write to that socket.
 
Well, I assume by the fact that eventually I get an "Unexpected end of file" message for those queries, that something does go in and check them. Do you have any suggestion as to how to cause the postgresql server to do so earlier?

Herouth

Re: Questions about connection clean-up and "invalid page header"

From
Greg Stark
Date:
On Mon, Jan 25, 2010 at 1:16 PM, Herouth Maoz <herouth@unicell.co.il> wrote:
> Well, I assume by the fact that eventually I get an "Unexpected end of file"
> message for those queries, that something does go in and check them. Do you
> have any suggestion as to how to cause the postgresql server to do so
> earlier?

No, Postgres pretty intentionally doesn't check because checking would
be quite slow.

If this is a plpgsql function looping you can put a RAISE NOTICE in
the loop periodically. I suppose you could write such a function and
add it to your query but whether it does what you want will depend on
the query plan.

--
greg