Thread: Need help with error

Need help with error

From: Steven Saner
Using Postgres 7.0 on BSDI 4.1

For the last several days we are getting errors that look like this:

Error: cannot write block 0 of krftmp4 [adm] blind.

An interesting thing is that in this example, krftmp4 is a table that
the user that got this error message would not have accessed in any
way. The user was trying to do an update of a different table in the
adm database. When this happens, it seems that the backend dies, which
ends up causing the backend connections for all users to die.

Can someone give me an idea of what this error message means, and
perhaps how to fix it?

I haven't been keeping up with the list lately, so please forgive me
if this has been covered recently.

Thanks.

Re: Need help with error

From: Tom Lane
Steven Saner <ssaner@pantheranet.com> writes:
> Using Postgres 7.0 on BSDI 4.1
> For the last several days we are getting errors that look like this:

> Error: cannot write block 0 of krftmp4 [adm] blind.

> An interesting thing is that in this example, krftmp4 is a table that
> the user that got this error message would not have accessed in any
> way.

Right --- that's implicit in the blind-write logic.  A blind write
means trying to dump out a dirty page from the shared buffer pool
that belongs to a relation your own backend hasn't touched.
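
For illustration only, here is a rough C sketch of the idea.  This is
made-up code, not the actual 7.0 bufmgr/smgr logic, and the path handling
is invented; the point is just that a blind write has to open the
relation's file by name (since this backend never opened it), seek to the
block, and write the page out:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192                     /* PostgreSQL page size */

/* Hypothetical helper: flush one dirty page of a relation this backend
 * has never opened.  dbpath/relname stand in for the real name lookup. */
static int
blind_write(const char *dbpath, const char *relname,
            unsigned blocknum, const char *page)
{
    char path[1024];
    int  fd;

    snprintf(path, sizeof(path), "%s/%s", dbpath, relname);

    /* The extra open() is the fragile part: it fails if the file has been
     * removed or renamed, or if the open-file table is already full. */
    fd = open(path, O_RDWR, 0);
    if (fd < 0)
    {
        fprintf(stderr, "cannot write block %u of %s blind: %s\n",
                blocknum, relname, strerror(errno));
        return -1;
    }

    if (lseek(fd, (off_t) blocknum * BLCKSZ, SEEK_SET) < 0 ||
        write(fd, page, BLCKSZ) != (ssize_t) BLCKSZ)
    {
        fprintf(stderr, "cannot write block %u of %s blind: %s\n",
                blocknum, relname, strerror(errno));
        close(fd);
        return -1;
    }

    close(fd);
    return 0;
}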

Since the write fails, the dirty block remains in the shared buffer
pool, waiting for some other backend to try to dump it again and fail
again :-(

The simplest recovery method is to restart the postmaster, causing a new
buffer pool to be set up.

However, from a developer's perspective, I'm more interested in finding
out how you got into this state in the first place.  We thought we'd
fixed all the bugs that could give rise to orphaned dirty blocks, which
was the cause of this type of error in all the cases we'd seen so far.
Perhaps there is still a remaining bug of that kind, or maybe you've
found a new way to cause this problem.  Do you have time to do some
investigation before you restart the postmaster?

One thing I'd like to know is why the write is failing in the first
place.  Have you deleted or renamed the krftmp4 table, or its containing
database adm, probably not too long before these errors started
appearing?

> When this happens, it seems that the backend dies, which
> ends up causing the backend connections for all users to die.

That shouldn't be happening either; blind write failure is classed as
a simple ERROR, not a FATAL error.  Does any message appear in the
postmaster log?  Is a corefile dumped, and if so what do you get from
a backtrace?

            regards, tom lane

Re: Need help with error

From: "Philip Poles"
Greetings...

I'm not sure if this is relevant, but I've seen similar errors occur when there
are too many open files on the filesystem (running Linux RH 6.2).  I'm not sure
if this problem is in the backend or the Linux kernel, or somewhere else, not
being very conversant in such matters myself, but I did have our admin increase
the limit for number of open files.
As far as I recall, when this happens, the postmaster tries to reset all
currently running backends.  I don't think I've seen it dump core, but I can
reproduce the situation fairly easily (by running a hundred or so concurrent
seven-table join queries) to find out... I'll try it on Friday if I get a
chance.

Steven, I have no knowledge of how BSDI behaves, but might this have something
to do with your problem?
It seems to me as though postgres winds up with a LOT of open files when
processing complex queries - is this actually the case, or should I be looking
elsewhere for the cause of this problem?
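
One rough way to check on Linux is to count what a given backend has open
under /proc/<pid>/fd.  A small throwaway C program (just a sketch, nothing
postgres-specific) would be something like:

#include <dirent.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
    char path[64];
    DIR *dir;
    struct dirent *ent;
    int count = 0;

    if (argc < 2)
    {
        fprintf(stderr, "usage: %s <backend pid>\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof(path), "/proc/%s/fd", argv[1]);

    if ((dir = opendir(path)) == NULL)
    {
        perror(path);
        return 1;
    }
    while ((ent = readdir(dir)) != NULL)
    {
        if (ent->d_name[0] != '.')      /* skip "." and ".." */
            count++;
    }
    closedir(dir);

    printf("%s: %d open file descriptors\n", path, count);
    return 0;
}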

    -Philip

P.S. Tom - I haven't actually been able to reproduce that problem I was having
with hash indices...it just went away...and I know nothing has changed, except
maybe the load on the server...I'll keep trying, maybe I can get a bug report in
about it after all.

Re: Need help with error

From: Tom Lane
"Philip Poles" <philip@surfen.com> writes:
> I'm not sure if this is relevant, but I've seen similar errors occur
> when there are too many open files on the filesystem (running Linux RH
> 6.2).  I'm not sure if this problem is in the backend or the Linux
> kernel, or somewhere else, not being very conversant in such matters
> myself, but I did have our admin increase the limit for number of open
> files.

Good point.  If you are at the system limit on number of open files,
blind writes would fail where regular writes do not (in 7.0.* and
earlier --- this is fixed for 7.1).  However, if your kernel file
table is full, the symptoms are usually visible all over the place,
not just in Postgres, so I'm not sure that this explains Steven's
problem.
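
If you want to see where you stand, the per-process limit is easy to
check; the kernel-wide file table is a separate limit (on Linux it's
/proc/sys/fs/file-max).  A quick sketch, not Postgres code:

#include <stdio.h>
#include <sys/resource.h>

int
main(void)
{
    struct rlimit rl;

    /* RLIMIT_NOFILE is the per-process open-file limit.  open() reports
     * EMFILE when this limit is hit, and ENFILE when the system-wide
     * file table is full. */
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
    {
        perror("getrlimit");
        return 1;
    }
    printf("open files per process: soft %ld, hard %ld\n",
           (long) rl.rlim_cur, (long) rl.rlim_max);
    return 0;
}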

I would recommend to Steven that he update to 7.0.2 soon, since there
is some additional debugging logic in 7.0.2 that logs the kernel error
code when a blind write fails --- that would give us some more info
about the underlying problem.  Of course an update implies a postmaster
restart, which will make the problem go away, so unless he knows how to
reproduce it I'd prefer to investigate first...

            regards, tom lane

Re: Need help with error

From: Steven Saner
On Wed, Jul 05, 2000 at 04:29:16PM -0400, Tom Lane wrote:
> Steven Saner <ssaner@pantheranet.com> writes:
> > Using Postgres 7.0 on BSDI 4.1
> > For the last several days we are getting errors that look like this:
>
> > Error: cannot write block 0 of krftmp4 [adm] blind.
>
> > An interesting thing is that in this example, krftmp4 is a table that
> > the user that got this error message would not have accessed in any
> > way.
>
> Right --- that's implicit in the blind-write logic.  A blind write
> means trying to dump out a dirty page from the shared buffer pool
> that belongs to a relation your own backend hasn't touched.
>
> Since the write fails, the dirty block remains in the shared buffer
> pool, waiting for some other backend to try to dump it again and fail
> again :-(
>
> The simplest recovery method is to restart the postmaster, causing a new
> buffer pool to be set up.
>
> However, from a developer's perspective, I'm more interested in finding
> out how you got into this state in the first place.  We thought we'd
> fixed all the bugs that could give rise to orphaned dirty blocks, which
> was the cause of this type of error in all the cases we'd seen so far.
> Perhaps there is still a remaining bug of that kind, or maybe you've
> found a new way to cause this problem.  Do you have time to do some
> investigation before you restart the postmaster?
>
> One thing I'd like to know is why the write is failing in the first
> place.  Have you deleted or renamed the krftmp4 table, or its containing
> database adm, probably not too long before these errors started
> appearing?

Well, we have had this database version/configuration in operation for
a month or so. We rebooted the server on July 1 as part of our normal
maintenance procedure, and it was only after that that we began seeing
these problems. The adm database has been around for a long time. The
krftmp4 table is not the only table I have seen listed in these error
messages.

> > When this happens, it seems that the backend dies, which
> > ends up causing the backend connections for all users to die.
>
> That shouldn't be happening either; blind write failure is classed as
> a simple ERROR, not a FATAL error.  Does any message appear in the
> postmaster log?  Is a corefile dumped, and if so what do you get from
> a backtrace?

Unfortunately, I don't believe there is any postmaster log. I will
probably have to restart the postmaster and redirect its stdout to a
file or something. As far as I can tell, no core files are being
created.

I will probably restart the postmaster tonight and make sure that
logging is being done. Then if it happens again we might have more to
go on.

Steve


Re: Need help with error

From: Tom Lane
Steven Saner <ssaner@pantheranet.com> writes:
> I will probably restart the postmaster tonight and make sure that
> logging is being done. Then if it happens again we might have more to
> go on.

If you're going to do that, please install 7.0.2 first, else the log
likely won't tell us much anyway ...

            regards, tom lane