Thread: Need help with error
Using Postgres 7.0 on BSDI 4.1. For the last several days we have been getting errors that look like this:

Error: cannot write block 0 of krftmp4 [adm] blind

An interesting thing is that in this example, krftmp4 is a table that the user who got this error message would not have accessed in any way. The user was trying to update a different table in the adm database.

When this happens, it seems that the backend dies, which ends up causing the backend connections for all users to die.

Can someone give me an idea of what this error message means, and perhaps how to fix it? I haven't been keeping up with the list lately, so please forgive me if this has been covered recently. Thanks.
Steven Saner <ssaner@pantheranet.com> writes:
> For the last several days we are getting errors that look like this:
> Error: cannot write block 0 of krftmp4 [adm] blind.
> An interesting thing is that in this example, krftmp4 is a table that
> the user that got this error message would not have accessed in any
> way.

Right --- that's implicit in the blind-write logic. A blind write means trying to dump out a dirty page from the shared buffer pool that belongs to a relation your own backend hasn't touched.

Since the write fails, the dirty block remains in the shared buffer pool, waiting for some other backend to try to dump it again and fail again :-(

The simplest recovery method is to restart the postmaster, causing a new buffer pool to be set up.

However, from a developer's perspective, I'm more interested in finding out how you got into this state in the first place. We thought we'd fixed all the bugs that could give rise to orphaned dirty blocks, which was the cause of this type of error in all the cases we'd seen so far. Perhaps there is still a remaining bug of that kind, or maybe you've found a new way to cause this problem. Do you have time to do some investigation before you restart the postmaster?

One thing I'd like to know is why the write is failing in the first place. Have you deleted or renamed the krftmp4 table, or its containing database adm, probably not too long before these errors started appearing?

> When this happens, it seems that the backend dies, which
> ends up causing the backend connections for all users to die.

That shouldn't be happening either; blind write failure is classed as a simple ERROR, not a FATAL error. Does any message appear in the postmaster log? Is a corefile dumped, and if so what do you get from a backtrace?

regards, tom lane
Greetings...

I'm not sure if this is relevant, but I've seen similar errors occur when there are too many open files on the filesystem (running Linux RH 6.2). I'm not sure if the problem is in the backend, the Linux kernel, or somewhere else, not being very conversant in such matters myself, but I did have our admin increase the limit on the number of open files. As far as I recall, when this happens, the postmaster tries to reset all currently running backends. I don't think I've seen it dump core, but I can reproduce the situation fairly easily (by running a hundred or so concurrent 7-table join queries) to find out... I'll try it on Friday, if I get a chance.

Steven, I have no knowledge of how BSDI behaves, but might this have something to do with your problem?

It seems to me as though postgres winds up with a LOT of open files when processing complex queries --- is this actually the case, or should I be looking elsewhere for the cause of this problem?

-Philip

P.S. Tom - I haven't actually been able to reproduce that problem I was having with hash indices... it just went away... and I know nothing has changed, except maybe the load on the server... I'll keep trying, maybe I can get a bug report in about it after all.

----- Original Message -----
From: Tom Lane <tgl@sss.pgh.pa.us>
To: Steven Saner <ssaner@pantheranet.com>
Cc: <pgsql-general@postgresql.org>
Sent: Wednesday, July 05, 2000 4:29 PM
Subject: Re: [GENERAL] Need help with error

[quoted message snipped]
"Philip Poles" <philip@surfen.com> writes: > I'm not sure if this is relevant, but I've seen similar errors occur > when there are too many open files on the filesystem (running Linux RH > 6.2). I'm not sure if this problem is in the backend or the Linux > kernal, or somewhere else, not being very conversant in such matters > myself, but I did have our admin increase the limit for number of open > files. Good point. If you are at the system limit on number of open files, blind writes would fail where regular writes do not (in 7.0.* and earlier --- this is fixed for 7.1). However, if your kernel file table is full, the symptoms are usually visible all over the place not just in Postgres, so I'm not sure that this explains Steven's problem. I would recommend to Steven that he update to 7.0.2 soon, since there is some additional debugging logic in 7.0.2 that logs the kernel error code when a blind write fails --- that would give us some more info about the underlying problem. Of course an update implies a postmaster restart which will make the problem go away, so unless he knows how to reproduce it I'd prefer to investigate first... regards, tom lane
On Wed, Jul 05, 2000 at 04:29:16PM -0400, Tom Lane wrote:
> [...]
>
> One thing I'd like to know is why the write is failing in the first
> place. Have you deleted or renamed the krftmp4 table, or its containing
> database adm, probably not too long before these errors started
> appearing?

Well, we have had this database version/configuration in operation for a month or so. We rebooted the server on July 1 as part of our normal maintenance procedure. It has been after that that we have begun seeing these problems. The adm database has been around for a long time. The krftmp4 table is not the only table that I have seen listed in these error messages.
> > When this happens, it seems that the backend dies, which
> > ends up causing the backend connections for all users to die.
>
> That shouldn't be happening either; blind write failure is classed as
> a simple ERROR, not a FATAL error. Does any message appear in the
> postmaster log? Is a corefile dumped, and if so what do you get from
> a backtrace?

Unfortunately, I don't believe that there is any postmaster log. I will probably have to restart the postmaster and redirect stdout to a file or something. As far as I can tell, there are no core files being created.

I will probably restart the postmaster tonight and make sure that logging is being done. Then if it happens again we might have more to go on.

Steve
Steven Saner <ssaner@pantheranet.com> writes:
> I will probably restart the postmaster tonight and make sure that
> logging is being done. Then if it happens again we might have more to
> go on.

If you're going to do that, please install 7.0.2 first, else the log likely won't tell us much anyway ...

regards, tom lane