Thread: database errors
Our customer has problems with Pg 7.3.2 on Solaris. There are numerous
errors in the app server log and in the database log, including these:

LOG:  open of /mnt_c1t2d0s0/<some-path>/postgresql/pg_xlog/0000000000000001 (log file 0, segment 1) failed: No such file or directory
LOG:  invalid primary checkpoint record
LOG:  open of /mnt_c1t2d0s0/<some-path>/postgresql/pg_xlog/0000000000000001 (log file 0, segment 1) failed: No such file or directory
LOG:  invalid secondary checkpoint record
PANIC:  unable to locate a valid checkpoint record
LOG:  startup process (pid 16527) was terminated by signal 6
LOG:  aborting startup due to startup process failure
...
ERROR:  Cannot insert a duplicate key into unique index cr_pk
PANIC:  RecordTransactionAbort: xact 55143 already committed
LOG:  server process (pid 22185) was terminated by signal 6
LOG:  terminating any other active server processes
WARNING:  Message from PostgreSQL backend:
        The Postmaster has informed me that some other backend
        died abnormally and possibly corrupted shared memory.
        I have rolled back the current transaction and am
        going to terminate your database system connection and exit.
        Please reconnect to the database system and repeat your query.
LOG:  all server processes terminated; reinitializing shared memory and semaphores
LOG:  database system was interrupted at 2004-05-10 10:51:01 CDT
LOG:  checkpoint record is at 0/30005E0
LOG:  redo record is at 0/30005E0; undo record is at 0/0; shutdown TRUE
LOG:  next transaction id: 53340; next oid: 57982
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/3000620
LOG:  ReadRecord: record with zero length at 0/3000930
LOG:  redo done at 0/3000908
WARNING:  XLogFlush: request 0/A970F68 is not satisfied --- flushed only to 0/3000930
WARNING:  XLogFlush: request 0/A970FA8 is not satisfied --- flushed only to 0/3000930
WARNING:  XLogFlush: request 0/A970E00 is not satisfied --- flushed only to 0/3000930
WARNING:  XLogFlush: request 0/A970E40 is not satisfied --- flushed only to 0/3000930
FATAL:  The database system is starting up
...
----------------------------------------------

We've had "Cannot insert a duplicate key into unique index" in the past.
We ran pg_resetxlog and reloaded the database - this helped.

I wonder if the message
"open of /mnt_c1t2d0s0/... (log file 0, segment 1) failed: No such file or directory"
may indicate some kind of NFS problem.

Anything else I need to look at?
Thanks in advance,
Mike.
Michael Brusser <michael@synchronicity.com> writes:
> I wonder if message
> "open of /mnt_c1t2d0s0/... (log file 0, segment 1) failed: No such file or
> directory"
> may indicate some kind of NFS problem.

Running a database over NFS is widely considered a horrid idea --- the
NFS protocol is simply too prone to data loss.  I think you may have a
sterling example here of why not to do it :-(

The messages you quote certainly read like a badly corrupted database
to me.  In the case of a local filesystem I'd be counseling you to
start running memory and disk diagnostics.  That may still be
appropriate here, but you had better also reconsider the decision to
use NFS.

If you're absolutely set on using NFS, one possibly useful tip is to
make sure it's a hard mount not a soft mount.  If your systems support
NFS-over-TCP instead of UDP, that might be worth trying too.

Also I would strongly advise an update to PG 7.3.6.  7.3.2 has serious
known bugs.

			regards, tom lane
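[For what it's worth, a hard, interruptible NFS mount over TCP on Solaris
might look like the sketch below. The server name and export path are
placeholders, and the exact option names should be checked against
mount_nfs(1M) on the release in question:]

```shell
# Hypothetical example only: hard mount (retries forever instead of
# silently timing out like a soft mount) over TCP, NFSv3.
# "nfsserver:/export/pgdata" and the mount point are placeholders.
mount -F nfs -o hard,intr,proto=tcp,vers=3 nfsserver:/export/pgdata /mnt_c1t2d0s0
```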
It looks like "No such file or directory" followed by the abort signal
resulted from manually removing logs. pg_resetxlog took care of this,
but other problems persisted.

I got a copy of the database and installed it on a local partition.
It does seem badly corrupted; these are some hard errors.

pg_dump fails and dumps core:
pg_dump: ERROR:  XLogFlush: request 0/A971020 is not satisfied --- flushed only to 0/5000050
... lost synchronization with server, resetting connection

Looking at the core file:
(dbx) where 15
=>[1] _libc_kill(0x0, 0x6, 0x0, 0xffffffff, 0x2eaf00, 0xff135888), at 0xff19f938
  [2] abort(0xff1bc004, 0xff1c3a4c, 0x0, 0x7efefeff, 0x21c08, 0x2404c4), at 0xff13596c
  [3] elog(0x14, 0x267818, 0x0, 0xa971020, 0x0, 0x5006260), at 0x2407dc
  [4] XLogFlush(0xffbee908, 0xffbee908, 0x827e0, 0x0, 0x0, 0x0), at 0x78530
  [5] BufferSync(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x18df2c
  [6] FlushBufferPool(0x2, 0x1e554, 0x0, 0x30000, 0x0, 0xffbeea79), at 0x18e5c4
  [7] CreateCheckPoint(0x0, 0x0, 0x82c00, 0xff1bc004, 0x2212c, 0x83534), at 0x7d93c
  [8] BootstrapMain(0x5, 0xffbeec50, 0x10, 0xffbeec50, 0xffbeebc8, 0xffbeebc8), at 0x836bc
  [9] SSDataBase(0x3, 0x40a24a8a, 0x2e3800, 0x4, 0x2212c, 0x16f504), at 0x172590
  [10] ServerLoop(0x5091, 0x2e398c, 0x2e3800, 0xff1c2940, 0xff1bc004, 0xff1c2940), at 0x16f3a0
  [11] PostmasterMain(0x1, 0x323ad0, 0x2af000, 0x0, 0x65720000, 0x65720000), at 0x16ef88
  [12] main(0x1, 0xffbef68c, 0xffbef694, 0x2eaf08, 0x0, 0x0), at 0x12864c
======================
(I don't have a debug build at the moment to get more details)

This query fails:
LOG:  query: select count (1) from note_links_aux;
ERROR:  XLogFlush: request 0/A971020 is not satisfied --- flushed only to 0/5006260

Drop table fails:
drop table note_links_aux;
ERROR:  getObjectDescription: Rule 17019 does not exist

Are there any pointers as to why this could happen, aside from
potential memory and disk problems?

As for NFS...
I know how strongly the PostgreSQL community advises against it, but we
have to face it: our customers ARE running on NFS and they WILL be
running on NFS. Is there such a thing as "better" and "worse" NFS
versions? (I made a note of what was said about hard mount vs. soft
mount, etc.)

Tom, you recommended an upgrade from 7.3.2 to 7.3.6. Our next release
is using v7.3.4 (maybe it's not too late to upgrade). Would v7.3.6
provide more protection against problems like this?

Thank you,
Mike

> -----Original Message-----
...
> The messages you quote certainly read like a badly corrupted database to
> me.  In the case of a local filesystem I'd be counseling you to start
> running memory and disk diagnostics.  That may still be appropriate
> here, but you had better also reconsider the decision to use NFS.
>
> If you're absolutely set on using NFS, one possibly useful tip is to
> make sure it's a hard mount not a soft mount.  If your systems support
> NFS-over-TCP instead of UDP, that might be worth trying too.
>
> Also I would strongly advise an update to PG 7.3.6.  7.3.2 has serious
> known bugs.
>
> 			regards, tom lane
Michael Brusser <michael@synchronicity.com> writes:
> It looks like "No such file or directory" followed by the abort signal
> resulted from manually removing logs. pg_resetxlog took care of this,
> but other problems persisted.

> pg_dump: ERROR: XLogFlush: request 0/A971020 is not satisfied ---
> flushed only to 0/5000050 ... lost synchronization with server, resetting
> connection

Okay, you have a page with an LSN of A971020 which is past end of XLOG
(5000050).  You may have created this problem for yourself by doing
pg_resetxlog with poorly chosen parameters.  You could try redoing it
with an XLOG start address larger than that (I'd suggest quite a bit
larger, since there's no reason to believe that this is the
latest-modified page in the whole DB).

Theory B is that this particular page is corrupted and the LSN is just
trash.  But that seems less likely, since 7.3.4 has checks that test
the other page header fields fairly well.  Usually all the header
fields are garbage if any are.

> drop table fails:
> drop table note_links_aux;
> ERROR: getObjectDescription: Rule 17019 does not exist

This looks like plain old corruption ...

> Our next release is using v7.3.4 (maybe it's not too late to upgrade).
> Would v7.3.6 provide more protection against problems like this?

Read the release notes.  But I can't think of any reason to take the
time to update and not go all the way to the latest dot-release in your
branch.  It's not going to be any harder, and it will get you more bug
fixes.

			regards, tom lane
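[Tom's advice can be made concrete with a little arithmetic: with the
7.3-era default of 16 MB WAL segments, the segment holding a given page
LSN is just the low word of the LSN divided by the segment size. The
sketch below is an illustration only; the "pg_resetxlog -l fileid,seg"
syntax mentioned in the comment should be verified against the --help
output of the version actually installed:]

```python
# Sketch: given the highest page LSN reported in the errors, find the
# 16 MB WAL segment it falls in, so pg_resetxlog can be pointed past it.
# Assumes the 7.3-era default segment size; check your build.
XLOG_SEG_SIZE = 16 * 1024 * 1024   # default WAL segment size

def segment_for_lsn(xlogid, xrecoff):
    """Return (fileid, seg) of the WAL segment containing LSN xlogid/xrecoff."""
    return xlogid, xrecoff // XLOG_SEG_SIZE

# The failing request was LSN 0/A971020:
fileid, seg = segment_for_lsn(0x0, 0x0A971020)
print(fileid, seg)   # -> 0 10
# So something comfortably larger, e.g. "pg_resetxlog -l 0,16 $PGDATA",
# would restart WAL well past every LSN reported so far (hypothetical
# invocation -- confirm the -l option exists in your pg_resetxlog).
```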
On Fri, 2004-05-14 at 02:00, Tom Lane wrote:
> Michael Brusser <michael@synchronicity.com> writes:
> > It looks like "No such file or directory" followed by the abort signal
> > resulted from manually removing logs. pg_resetxlog took care of this,
> > but other problems persisted.
>
> > pg_dump: ERROR: XLogFlush: request 0/A971020 is not satisfied ---
> > flushed only to 0/5000050 ... lost synchronization with server, resetting
> > connection
>
> Okay, you have a page with an LSN of A971020 which is past end of XLOG
> (5000050).  You may have created this problem for yourself by doing
> pg_resetxlog with poorly chosen parameters.

Michael,

From reading these error logs, it would appear that this system has
been very strangely configured indeed.

The recommendations for usage are fairly clear:
- don't use it on NFS... not because we hate NFS... it's just unsuited
  to the task of serving files to a database system
- don't delete the transaction logs manually... they get recycled soon
  enough anyhow

[ Is there a connection between the fact that it is on NFS and the logs
have been manually deleted? We know that SQL Server allows a "truncate
transaction log" facility... is that something that you were expecting
to see and trying to emulate with PostgreSQL? Were you trying to stop
NFS writes taking place? ]

Your log segment numbers are very low. Is the transaction rate very low
on this system, or has the system recently been set up? If it is the
latter, then it's not too late to change.

Even if the transaction rate is low, what is the benefit of using NFS?
PostgreSQL offers client/server access - so why not use that instead?

Best Regards, Simon Riggs
> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
>
> > > pg_dump: ERROR: XLogFlush: request 0/A971020 is not satisfied ---
> > > flushed only to 0/5000050 ... lost synchronization with server, resetting
> > > connection
> >
> > Okay, you have a page with an LSN of A971020 which is past end of XLOG
> > (5000050).  You may have created this problem for yourself by doing
> > pg_resetxlog with poorly chosen parameters.
>
> Michael,
>
> From reading these error logs, it would appear that this system has been
> very strangely configured indeed.
>
> The recommendations for usage are fairly clear:
> - don't use it on NFS... not because we hate NFS... it's just unsuited
>   to the task of serving files to a database system
> - don't delete the transaction logs manually... they get recycled soon
>   enough anyhow
>
> [ Is there a connection between the fact that it is on NFS and the logs
> have been manually deleted?

From what I know this was an attempt to make things better after they
ran into bad problems. There's no direct indication these problems were
in any way related to NFS, but I can't exclude that chance either.

They ran pg_resetxlog without any arguments, then ran it with -f.
(Perhaps this was done more than once.) At some point they deleted the
logs. The errors I posted above were generated after I got the copy of
this database and started experimenting with it.

> We know that SQL Server allows a "truncate transaction log" facility...
> is that something that you were expecting to see and trying to emulate
> with PostgreSQL? Were you trying to stop NFS writes taking place?

No, I don't think this was the idea.

> Your log segment numbers are very low. Is the transaction rate very low
> on this system or has the system recently been set up?

This was a very fresh database indeed.

> ... what is the benefit of using NFS?
> PostgreSQL offers client/server access - so why not use that instead?

We don't have full control over this.
The database is a relatively small piece of a larger system, which
includes a customized Apache server and a number of other modules as
well. Setting up the system involves some rules and restrictions; one
of them is that we don't yet support installing the database server on
a different host. (If this is what you meant.) We may actually support
it soon; this is not a problem.

But NFS is an entirely different issue - our customers often install
the database on NFS. I am not sure we can ever prevent it...

Thank you,
Mike

P.S. This is not the first time I'm bringing my problems to this list,
and I sincerely want to thank you folks for your responsiveness and
help...

> Best Regards,
>
> Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> On Fri, 2004-05-14 at 02:00, Tom Lane wrote:
>> Okay, you have a page with an LSN of A971020 which is past end of XLOG
>> (5000050).  You may have created this problem for yourself by doing
>> pg_resetxlog with poorly chosen parameters.

> Is there a way to know exactly what those parameters should be?

Not a very good one.  The thing about pg_resetxlog (which perhaps is
underemphasized in the documentation) is that it is by definition a
wizard's tool: if you need to use it then the software has failed, and
so it would be rather foolish to assume that the software can give you
reliable information about how to use the recovery tool.

Having said that, though, one could certainly imagine some kind of
scanning tool that gives you a better picture of what you have, for
instance statistics about all the page LSNs in the database.  I'd still
want some human judgement in the loop, but gathering raw data is what
computers are good at.  If you feel like working on that, be my guest
(but please finish PITR first ;-))

> I was looking at writing an aggregate to allow use of xmax/xmin within a
> max function, then generate some SQL to run against every table.

Um.  Bear in mind that the only time you will want this info is when
you have a nonfunctional database.  Within-SQL tools will not save your
bacon in that situation.  I was thinking of some sort of standalone
tool (think pg_filedump on steroids...)

			regards, tom lane