Thread: database errors

database errors

From
Michael Brusser
Date:
Our customer has problems with Pg 7.3.2 on Solaris.
There are numerous errors in the app. server log and in the database log,
including these:

LOG:  open of /mnt_c1t2d0s0/<some-path>/postgresql/pg_xlog/0000000000000001
(log file 0, segment 1) failed: No such file or directory
LOG:  invalid primary checkpoint record
LOG:  open of /mnt_c1t2d0s0/<some-path>/postgresql/pg_xlog/0000000000000001
(log file 0, segment 1) failed: No such file or directory
LOG: invalid secondary checkpoint record
PANIC:  unable to locate a valid checkpoint record
LOG:  startup process (pid 16527) was terminated by signal 6
LOG:  aborting startup due to startup process failure
...
ERROR:  Cannot insert a duplicate key into unique index cr_pk
PANIC:  RecordTransactionAbort: xact 55143 already committed
LOG:  server process (pid 22185) was terminated by signal 6
LOG:  terminating any other active server processes
WARNING:  Message from PostgreSQL backend:
        The Postmaster has informed me that some other backend
        died abnormally and possibly corrupted shared memory.
        I have rolled back the current transaction and am
        going to terminate your database system connection and exit.
        Please reconnect to the database system and repeat your query.

LOG:  all server processes terminated; reinitializing shared memory and
semaphores
LOG:  database system was interrupted at 2004-05-10 10:51:01 CDT
LOG: checkpoint record is at 0/30005E0
LOG:  redo record is at 0/30005E0; undo record is at 0/0; shutdown TRUE
LOG:  next transaction id: 53340; next oid: 57982
LOG:  database system was not properly shut down; automatic recovery in
progress
LOG:  redo starts at 0/3000620
LOG:  ReadRecord: record with zero length at 0/3000930
LOG:  redo done at 0/3000908
WARNING:  XLogFlush: request 0/A970F68 is not satisfied --- flushed only to
0/3000930
WARNING:  XLogFlush: request 0/A970FA8 is not satisfied --- flushed only to
0/3000930
WARNING:  XLogFlush: request 0/A970E00 is not satisfied --- flushed only to
0/3000930
WARNING:  XLogFlush: request 0/A970E40 is not satisfied --- flushed only to
0/3000930
FATAL:  The database system is starting up
...
----------------------------------------------
We've had "Cannot insert a duplicate key into unique index" in the past.
We ran pg_resetxlog and reloaded the database - this helped.

I wonder if message
"open of /mnt_c1t2d0s0/... (log file 0, segment 1) failed: No such file or
directory"
may indicate some kind of NFS problem.

Anything else I need to look at?

Thanks in advance,
Mike.




Re: database errors

From
Tom Lane
Date:
Michael Brusser <michael@synchronicity.com> writes:
> I wonder if message
> "open of /mnt_c1t2d0s0/... (log file 0, segment 1) failed: No such file or
> directory"
> may indicate some kind of NFS problem.

Running a database over NFS is widely considered a horrid idea --- the
NFS protocol is simply too prone to data loss.  I think you may have
a sterling example here of why not to do it :-(

The messages you quote certainly read like a badly corrupted database to
me.  In the case of a local filesystem I'd be counseling you to start
running memory and disk diagnostics.  That may still be appropriate
here, but you had better also reconsider the decision to use NFS.

If you're absolutely set on using NFS, one possibly useful tip is to
make sure it's a hard mount not a soft mount.  If your systems support
NFS-over-TCP instead of UDP, that might be worth trying too.
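
As a rough illustration of the mount advice above, the sketch below checks
a comma-separated NFS mount option string for the two things mentioned:
soft mounts and UDP transport. The function name and the sample option
strings are hypothetical; how the option string is obtained (for example
from the system's mount table) is left out because it differs by platform.

```python
def check_nfs_options(options):
    """Return a list of warnings for a comma-separated NFS mount option string.

    This only inspects the text of the options; it does not query the system.
    """
    opts = [o.strip() for o in options.split(",") if o.strip()]
    warnings = []

    # A soft mount lets the client give up and return an I/O error to the
    # application after a timeout; a hard mount keeps retrying instead.
    if "soft" in opts:
        warnings.append("soft mount in use; prefer a hard mount")
    elif "hard" not in opts:
        warnings.append("neither 'hard' nor 'soft' given; check the platform default")

    # Transport may appear as a bare 'udp' option or as 'proto=udp'.
    proto = next((o.split("=", 1)[1] for o in opts if o.startswith("proto=")), None)
    if "udp" in opts or proto == "udp":
        warnings.append("UDP transport in use; try NFS over TCP if supported")

    return warnings


# Hypothetical option strings, for illustration only.
print(check_nfs_options("rw,soft,proto=udp"))
print(check_nfs_options("rw,hard,proto=tcp"))
```

An empty result means only that neither of the two settings was flagged; it
says nothing about whether NFS is otherwise safe for the database.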

Also I would strongly advise an update to PG 7.3.6.  7.3.2 has serious
known bugs.
        regards, tom lane


Re: database errors

From
Michael Brusser
Date:
It looks like the "No such file or directory" errors followed by the abort
signal resulted from manually removing the logs. pg_resetxlog took care of
this, but other problems persisted.

I got a copy of the database and installed it on the local partition.
It does seem badly corrupted; these are some of the hard errors.

pg_dump fails and dumps core:

pg_dump: ERROR:  XLogFlush: request 0/A971020 is not satisfied --- flushed only to 0/5000050 ... lost synchronization
with server, resetting connection

looking at the core file:
(dbx) where 15
=>[1] _libc_kill(0x0, 0x6, 0x0, 0xffffffff, 0x2eaf00, 0xff135888), at 0xff19f938
  [2] abort(0xff1bc004, 0xff1c3a4c, 0x0, 0x7efefeff, 0x21c08, 0x2404c4), at 0xff13596c
  [3] elog(0x14, 0x267818, 0x0, 0xa971020, 0x0, 0x5006260), at 0x2407dc
  [4] XLogFlush(0xffbee908, 0xffbee908, 0x827e0, 0x0, 0x0, 0x0), at 0x78530
  [5] BufferSync(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x18df2c
  [6] FlushBufferPool(0x2, 0x1e554, 0x0, 0x30000, 0x0, 0xffbeea79), at 0x18e5c4
  [7] CreateCheckPoint(0x0, 0x0, 0x82c00, 0xff1bc004, 0x2212c, 0x83534), at 0x7d93c
  [8] BootstrapMain(0x5, 0xffbeec50, 0x10, 0xffbeec50, 0xffbeebc8, 0xffbeebc8), at 0x836bc
  [9] SSDataBase(0x3, 0x40a24a8a, 0x2e3800, 0x4, 0x2212c, 0x16f504), at 0x172590
  [10] ServerLoop(0x5091, 0x2e398c, 0x2e3800, 0xff1c2940, 0xff1bc004, 0xff1c2940), at 0x16f3a0
  [11] PostmasterMain(0x1, 0x323ad0, 0x2af000, 0x0, 0x65720000, 0x65720000), at 0x16ef88
  [12] main(0x1, 0xffbef68c, 0xffbef694, 0x2eaf08, 0x0, 0x0), at 0x12864c
======================
(I don't have the debug build at the moment to get more details)


this query fails:
LOG:  query: select count (1) from note_links_aux;
ERROR:  XLogFlush: request 0/A971020 is not satisfied --- flushed only to
0/5006260

drop table fails:
drop table note_links_aux;
ERROR:  getObjectDescription: Rule 17019 does not exist

Are there any pointers as to why this could happen, aside
from potential memory and disk problems?

As for NFS... I know how strongly the PostgreSQL community advises
against it, but we have to face it: our customers ARE running on NFS
and they WILL be running on NFS.
Is there such a thing as "better" and "worse" NFS versions?
(I made a note of what was said about hard mount vs. soft mount, etc)

Tom, you recommended upgrading from 7.3.2 to 7.3.6.
Our next release is using v. 7.3.4 (maybe it's not too late to upgrade).
Would v. 7.3.6 provide more protection against problems like this?

Thank you,
Mike


> -----Original Message-----
... ...
> The messages you quote certainly read like a badly corrupted database to
> me.  In the case of a local filesystem I'd be counseling you to start
> running memory and disk diagnostics.  That may still be appropriate
> here, but you had better also reconsider the decision to use NFS.
>
> If you're absolutely set on using NFS, one possibly useful tip is to
> make sure it's a hard mount not a soft mount.  If your systems support
> NFS-over-TCP instead of UDP, that might be worth trying too.
>
> Also I would strongly advise an update to PG 7.3.6.  7.3.2 has serious
> known bugs.
>
>             regards, tom lane
>




Re: database errors

From
Tom Lane
Date:
Michael Brusser <michael@synchronicity.com> writes:
> It looks that "No such file or directory" followed by the abort signal
> resulted from manually removing logs. pg_resetxlog took care of this,
> but other problems persisted.

> pg_dump: ERROR:  XLogFlush: request 0/A971020 is not satisfied ---
>   flushed only to 0/5000050 ... lost synchronization with server, resetting
> connection

Okay, you have a page with an LSN of A971020 which is past end of XLOG
(5000050).  You may have created this problem for yourself by doing
pg_resetxlog with poorly chosen parameters.  You could try redoing it
with an XLOG start address larger than that (I'd suggest quite a bit
larger, since there's no reason to believe that this is the
latest-modified page in the whole DB).
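
To make the comparison above concrete, here is a minimal sketch that parses
the two LSNs from the error message and checks which one is larger. The
helper names are hypothetical, and the 16 MB segment size used to map an
LSN to a log segment is an assumption (it is consistent with the
"log file 0, segment 1" naming in the logs earlier in the thread, but it is
not stated there).

```python
SEGMENT_SIZE = 16 * 1024 * 1024  # assumed WAL segment size in bytes


def parse_lsn(text):
    """Convert 'xlogid/xrecoff' hex text such as '0/A971020' to an (int, int) tuple."""
    log_id, offset = text.split("/")
    return int(log_id, 16), int(offset, 16)


def lsn_is_past_end(page_lsn, xlog_end):
    """True if the page's LSN lies beyond the current end of XLOG."""
    # Tuples compare element by element: log id first, then offset.
    return parse_lsn(page_lsn) > parse_lsn(xlog_end)


def segment_of(lsn_text):
    """Return (log file, segment) containing the LSN, under the assumed segment size."""
    log_id, offset = parse_lsn(lsn_text)
    return log_id, offset // SEGMENT_SIZE


# Values taken from the pg_dump error quoted above.
print(lsn_is_past_end("0/A971020", "0/5000050"))  # True
print(segment_of("0/A971020"), segment_of("0/5000050"))
```

Under that assumption the offending page points into a later segment than
the one XLOG currently ends in, which is why a new XLOG start address would
have to be chosen beyond it, with some margin.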

Theory B is that this particular page is corrupted and the LSN is just
trash.  But that seems less likely, since 7.3.4 has checks that test the
other page header fields fairly well.  Usually all the header fields are
garbage if any are.

> drop table fails:
> drop table note_links_aux;
> ERROR:  getObjectDescription: Rule 17019 does not exist

This looks like plain old corruption ...

> Our next release is using v 7.3.4. (maybe it's not too late to upgrade)
> Would v. 7.3.6 provide more protection against problems like this?

Read the release notes.  But I can't think of any reason to take the
time to update and not go all the way to the latest dot-release in your
branch.  It's not going to be any harder, and it will get you more bug
fixes.
        regards, tom lane


Re: database errors

From
Simon Riggs
Date:
On Fri, 2004-05-14 at 02:00, Tom Lane wrote:
> Michael Brusser <michael@synchronicity.com> writes:
> > It looks that "No such file or directory" followed by the abort signal
> > resulted from manually removing logs. pg_resetxlog took care of this,
> > but other problems persisted.
> 
> > pg_dump: ERROR:  XLogFlush: request 0/A971020 is not satisfied ---
> >   flushed only to 0/5000050 ... lost synchronization with server, resetting
> > connection
> 
> Okay, you have a page with an LSN of A971020 which is past end of XLOG
> (5000050).  You may have created this problem for yourself by doing
> pg_resetxlog with poorly chosen parameters. 

Michael,

From reading these error logs, it would appear that this system has been
very strangely configured indeed.

The recommendations for usage are fairly clear
- don't use it on NFS... not because we hate NFS... it's just unsuited to
the task of serving files to a database system
- don't delete the transaction logs manually... they get recycled soon
enough anyhow

[ Is there a connection between the fact that it is on NFS and the logs
have been manually deleted? We know that SQLServer allows a "truncate
transaction log" facility... is that something that you were expecting
to see and trying to emulate with PostgreSQL? Were you trying to stop
NFS writes taking place? ]

Your logs are rated very low. Is the transaction rate very low on this
system or has the system recently been set up? If it is the latter, then
it's not too late to change.

Even if the transaction rate is low, what is the benefit of using NFS?
PostgreSQL offers client/server access - so why not use that instead?

Best Regards,

Simon Riggs



Re: database errors

From
Michael Brusser
Date:
> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
>
> > > pg_dump: ERROR:  XLogFlush: request 0/A971020 is not satisfied ---
> > >   flushed only to 0/5000050 ... lost synchronization with
> server, resetting
> > > connection
> >
> > Okay, you have a page with an LSN of A971020 which is past end of XLOG
> > (5000050).  You may have created this problem for yourself by doing
> > pg_resetxlog with poorly chosen parameters.
>
> Michael,
>
> From reading these error logs, it would appear that this system has been
> very strangely configured indeed.
>
> The recommendations for usage are fairly clear
> - don't use it on NFS... not because we hate NFS... it's just unsuited to
> the task of serving files to a database system
> - don't delete the transaction logs manually... they get recycled soon
> enough anyhow
>
> [ Is there a connection between the fact that it is on NFS and the logs
> have been manually deleted?

From what I know this was an attempt to make things better after they
ran into bad problems. There's no direct indication these problems
were in any way related to NFS, but I can't exclude this chance either.
They ran pg_resetxlog without any arguments, then ran it with -f.
(Perhaps this was done more than once) At some point they deleted the logs.
And the errors I posted above were generated after I got the copy of this
database and started experimenting with it.

> We know that SQLServer allows a "truncate transaction log" facility....
> is that something that you were expecting to see and trying to emulate
> with PostgreSQL? Were you trying to stop NFS writes taking place?
No, I don't think this was the idea.

> Your logs are rated very low. Is the transaction rate very low on this
> system or has the system recently been set up?
This was a very fresh database indeed.

> ... what is the benefit of using NFS?
> PostgreSQL offers client/server access - so why not use that instead?

We don't have full control over this. The database is a relatively small
piece of a larger system, which includes the customized Apache server and
a number of other modules as well. Setting up the system involves some rules
and restrictions; one of them is that we don't yet support installing the
database server on a different host (if this is what you meant).
We may actually support it soon; this is not a problem.
But NFS is another issue entirely - our customers often install the database
on NFS.
I am not sure if we can ever prevent it...
Thank you,
Mike

P.S. This is not the first time I'm bringing my problems to this list,
and I sincerely want to thank you folks for your responsiveness and help...

>
> Best Regards,
>
> Simon Riggs




Re: database errors

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> On Fri, 2004-05-14 at 02:00, Tom Lane wrote:
>> Okay, you have a page with an LSN of A971020 which is past end of XLOG
>> (5000050).  You may have created this problem for yourself by doing
>> pg_resetxlog with poorly chosen parameters. 

> Is there a way to know exactly what those parameters should be?

Not a very good one.  The thing about pg_resetxlog (which perhaps is
underemphasized in the documentation) is that it is by definition a
wizard's tool: if you need to use it then the software has failed,
and so it would be rather foolish to assume that the software can
give you reliable information about how to use the recovery tool.

Having said that, though, one could certainly imagine some kind of
scanning tool that gives you a better picture of what you have,
for instance statistics about all the page LSNs in the database.
I'd still want some human judgement in the loop, but gathering
raw data is what computers are good at.
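
A minimal sketch of the kind of raw-data gathering described above: read a
relation file page by page and collect the LSN stored at the start of each
page header. Everything about the layout here is an assumption to be checked
against the actual installation: an 8192-byte page size, an 8-byte LSN (two
32-bit unsigned integers) at offset 0 of each page, and big-endian byte
order by default. The function names are hypothetical, and the example runs
on synthetic in-memory pages rather than real database files.

```python
import io
import struct

PAGE_SIZE = 8192  # assumed block size; adjust if the build differs


def page_lsns(stream, byteorder=">"):
    """Yield (page_number, (xlogid, xrecoff)) for each full page in a binary stream.

    byteorder is a struct prefix: '>' for big-endian, '<' for little-endian.
    """
    page_no = 0
    while True:
        page = stream.read(PAGE_SIZE)
        if len(page) < PAGE_SIZE:
            break  # end of file or a trailing partial page
        # Assumed layout: the first 8 bytes of the page header hold the LSN.
        xlogid, xrecoff = struct.unpack(byteorder + "II", page[:8])
        yield page_no, (xlogid, xrecoff)
        page_no += 1


def summarize(lsns):
    """Return page count plus smallest and largest LSN, or None if there are no pages."""
    values = [lsn for _, lsn in lsns]
    if not values:
        return None
    return {"pages": len(values), "min": min(values), "max": max(values)}


def fake_page(xlogid, xrecoff):
    """Build a synthetic page with only the LSN field filled in (illustration only)."""
    return struct.pack(">II", xlogid, xrecoff) + bytes(PAGE_SIZE - 8)


# Two synthetic pages carrying LSN values that appear in the logs above.
data = io.BytesIO(fake_page(0, 0x3000930) + fake_page(0, 0xA971020))
stats = summarize(page_lsns(data))
print("pages=%d max=%X/%X" % (stats["pages"], stats["max"][0], stats["max"][1]))
```

The output is only input for human judgement: a corrupted page can carry a
garbage LSN, so the maximum should be read alongside the other values rather
than trusted on its own.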

If you feel like working on that, be my guest (but please finish
PITR first ;-))

> I was looking at writing an aggregate to allow use of xmax/xmin within a
> max function, then generate some SQL to run against every table.

Um.  Bear in mind that the only time you will want this info is when you
have a nonfunctional database.  Within-SQL tools will not save your
bacon in that situation.  I was thinking of some sort of standalone
tool (think pg_filedump on steroids...)
        regards, tom lane