Thread: Can't connect (2 dbs) or login (2 others)
We have a cluster with four nodes, each running a postgres 7.4.8 database. Due to a large amount of pilot error and possibly hardware problems (still trying to get to the bottom of it all), two of the databases won't start, and I can't log in to the two others with any registered user.

Two of the nodes have logs that look like this:

2007-09-22 07:06:05 [3055] LOG: could not create IPv6 socket: Address family not supported by protocol
2007-09-22 07:06:05 [3060] LOG: database system shutdown was interrupted at 2007-09-22 06:56:27 PDT
2007-09-22 07:06:05 [3060] LOG: could not open file "/var/lib/pgsql/data/pg_xlog/0000000000000000" (log file 0, segment 0): No such file or directory
2007-09-22 07:06:05 [3060] LOG: invalid primary checkpoint record
2007-09-22 07:06:05 [3060] LOG: could not open file "/var/lib/pgsql/data/pg_xlog/0000000000000000" (log file 0, segment 0): No such file or directory
2007-09-22 07:06:05 [3060] LOG: invalid secondary checkpoint record
2007-09-22 07:06:05 [3060] PANIC: could not locate a valid checkpoint record
2007-09-22 07:06:05 [3055] LOG: startup process (PID 3060) was terminated by signal 6
2007-09-22 07:06:05 [3055] LOG: aborting startup due to startup process failure

The other two look like this:

LOG: database system was shut down at 2007-09-22 05:52:26 PDT
LOG: checkpoint record is at 0/9B0B8C
LOG: redo record is at 0/9B0B8C; undo record is at 0/0; shutdown TRUE
LOG: next transaction ID: 536; next OID: 17142
LOG: database system is ready
FATAL: IDENT authentication failed for user "xxx"
FATAL: IDENT authentication failed for user "yyy"

Can someone tell me what is going wrong and how I can recover?

Morris
"Morris Goldstein" <morris.x.goldstein@gmail.com> writes:
> Two of the nodes have logs that look like this:
>
> 2007-09-22 07:06:05 [3060] LOG: could not open file
> "/var/lib/pgsql/data/pg_xlog/0000000000000000" (log file 0, segment
> 0): No such file or directory
> 2007-09-22 07:06:05 [3060] LOG: invalid primary checkpoint record
> 2007-09-22 07:06:05 [3060] LOG: could not open file
> "/var/lib/pgsql/data/pg_xlog/0000000000000000" (log file 0, segment
> 0): No such file or directory
> 2007-09-22 07:06:05 [3060] LOG: invalid secondary checkpoint record
> 2007-09-22 07:06:05 [3060] PANIC: could not locate a valid checkpoint record

This is not good. Is it really complaining about log file 0? I can hardly credit that a database you care about wouldn't have used more log space than that.

You might be able to recover, at least to the extent of getting some data out of these DBs, by using pg_resetxlog. Read its man page *carefully* beforehand.
http://www.postgresql.org/docs/7.4/static/app-pgresetxlog.html

> The other two look like this:
>
> FATAL: IDENT authentication failed for user "xxx"

This looks like user error, ie, trying to log in as a different Postgres user than your Unix username is. You can hardly have been accustomed to doing that if you were normally using IDENT authentication before. If you want to switch to some other authentication method, read the docs
http://www.postgresql.org/docs/7.4/static/client-authentication.html
and edit pg_hba.conf.

BTW, 7.4.8 is pretty old; you really ought to be on something more current. The latest release in that branch is 7.4.18 --- you are missing two years worth of bug fixes.

			regards, tom lane
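A pg_hba.conf change of the sort described above might look roughly like this on a 7.4 server (a sketch only; the database, user, and address fields are placeholders to adapt to your own layout, replacing entries that previously read "ident sameuser"):

```
# pg_hba.conf -- hypothetical sketch: md5 passwords instead of ident
# TYPE  DATABASE  USER  ADDRESS        METHOD
local   all       all                  md5
host    all       all   127.0.0.1/32   md5
```

The postmaster re-reads pg_hba.conf on SIGHUP, so the change does not require a full restart.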
On 9/22/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Morris Goldstein" <morris.x.goldstein@gmail.com> writes:
> > Two of the nodes have logs that look like this:
> >
> > 2007-09-22 07:06:05 [3060] LOG: could not open file
> > "/var/lib/pgsql/data/pg_xlog/0000000000000000" (log file 0, segment
> > 0): No such file or directory
> > 2007-09-22 07:06:05 [3060] LOG: invalid primary checkpoint record
> > 2007-09-22 07:06:05 [3060] LOG: could not open file
> > "/var/lib/pgsql/data/pg_xlog/0000000000000000" (log file 0, segment
> > 0): No such file or directory
> > 2007-09-22 07:06:05 [3060] LOG: invalid secondary checkpoint record
> > 2007-09-22 07:06:05 [3060] PANIC: could not locate a valid checkpoint record
>
> This is not good. Is it really complaining about log file 0? I can
> hardly credit that a database you care about wouldn't have used more log
> space than that.

There is much more log space. The pg_xlog files are 0000000900000C2-0000000A0000006C.

> You might be able to recover, at least to the extent of getting some
> data out of these DBs, by using pg_resetxlog. Read its man page
> *carefully* beforehand.
> http://www.postgresql.org/docs/7.4/static/app-pgresetxlog.html
>
> > The other two look like this:
> >
> > FATAL: IDENT authentication failed for user "xxx"
>
> This looks like user error, ie, trying to log in as a different Postgres
> user than your Unix username is. You can hardly have been accustomed to
> doing that if you were normally using IDENT authentication before.
> If you want to switch to some other authentication method, read the docs
> http://www.postgresql.org/docs/7.4/static/client-authentication.html
> and edit pg_hba.conf.

No, it isn't user error. I tried logging in myself with our usual ids and they were rejected. Here is the end of strace output on an attempt to connect using psql.
open("/lib/libnss_files.so.2", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\240\31"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=46740, ...}) = 0
mmap2(NULL, 41616, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xd26000
mmap2(0xd2f000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x8) = 0xd2f000
close(3) = 0
mprotect(0xd2f000, 4096, PROT_READ) = 0
munmap(0xb7f2f000, 17508) = 0
open("/etc/passwd", O_RDONLY) = 3
fcntl64(3, F_GETFD) = 0
fcntl64(3, F_SETFD, FD_CLOEXEC) = 0
fstat64(3, {st_mode=S_IFREG|0644, st_size=1435, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f33000
read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1435
close(3) = 0
munmap(0xb7f33000, 4096) = 0
stat64("/root/.pgpass", 0xbf9b3288) = -1 ENOENT (No such file or directory)
socket(PF_FILE, SOCK_STREAM, 0) = 3
fcntl64(3, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
connect(3, {sa_family=AF_FILE, path="/tmp/.s.PGSQL.5432"}, 110) = -1 ENOENT (No such file or directory)
close(3) = 0
write(2, "psql: could not connect to serve"..., 163psql: could not connect to server: No such file or directory
	Is the server running locally and accepting
	connections on Unix domain socket "/tmp/.s.PGSQL.5432"?
) = 163
exit_group(2) = ?
Process 11653 detached

> BTW, 7.4.8 is pretty old; you really ought to be on something more
> current. The latest release in that branch is 7.4.18 --- you are
> missing two years worth of bug fixes.

The fun and games started during an upgrade to newer software, which includes 7.4.18.

Morris
Sorry to reply to myself, but here's a bit more info. That strace shows a crash. The node that was denying logins is now complaining about checkpoint file 000...000. It appears that a few attempts to start convert a db that rejects logins into one that crashes on startup. (When I first started examining the damage, I was able to start all four dbs but not log in.)

Morris
On 9/22/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Morris Goldstein" <morris.x.goldstein@gmail.com> writes:
> > ...
> > 2007-09-22 07:06:05 [3060] LOG: could not open file
> > "/var/lib/pgsql/data/pg_xlog/0000000000000000" (log file 0, segment 0): No such file or directory
> > ...
>
> You might be able to recover, at least to the extent of getting some
> data out of these DBs, by using pg_resetxlog. Read its man page
> *carefully* beforehand.
> http://www.postgresql.org/docs/7.4/static/app-pgresetxlog.html

I looked at the pg_resetxlog documentation and have a question. Here is output from pg_resetxlog -n:

pg_control values:

pg_control version number:            72
Catalog version number:               200310211
Current log file ID:                  0
Next log file segment:                1
Latest checkpoint's StartUpID:        12
Latest checkpoint's NextXID:          536
Latest checkpoint's NextOID:          17142
Database block size:                  8192
Blocks per segment of large relation: 131072
Maximum length of identifiers:        64
Maximum number of function arguments: 32
Date/time type storage:               floating-point numbers
Maximum length of locale name:        128
LC_COLLATE:                           C
LC_CTYPE:                             C

Is it safe to run pg_resetxlog without the -l argument? Or should I specify -l, constructing an argument from the last pg_xlog filename?

Morris
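One way to turn the newest pg_xlog file name into candidate -l values, assuming the 7.4 convention that a segment name is 16 hex digits (8 for the log file ID, 8 for the segment number). This is a sketch only; double-check the exact -l syntax against the 7.4 pg_resetxlog page before running anything:

```shell
# Sketch: split the newest pg_xlog segment name into the two hex numbers
# that 7.4's pg_resetxlog -l option takes (log file ID, segment number),
# then advance one segment past it. File name taken from the post above.
last=0000000A0000006C

fileid=$(printf '%d' "0x$(echo "$last" | cut -c1-8)")   # log file ID
seg=$(printf '%d' "0x$(echo "$last" | cut -c9-16)")     # segment number

# Suggested invocation (printed, not run): start WAL past the last file.
printf 'pg_resetxlog -l 0x%X,0x%X /var/lib/pgsql/data\n' "$fileid" "$((seg + 1))"
```

The point of adding one to the segment number is to guarantee the new WAL starting address is beyond anything already on disk.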
"Morris Goldstein" <morris.x.goldstein@gmail.com> writes:
> I looked at the pg_resetxlog documentation and have a question. Here
> is output from pg_resetxlog -n:

[ snipped to just the non-constant numbers ]

> Current log file ID:                  0
> Next log file segment:                1
> Latest checkpoint's StartUpID:        12
> Latest checkpoint's NextXID:          536
> Latest checkpoint's NextOID:          17142

Ick. This looks *exactly* like what pg_control would contain immediately after initdb, in a 7.4 database. I suppose that someone tried an initdb as a recovery method, without understanding the consequences.

> Is it safe to run pg_resetxlog without the -l argument?

No, or at least it will not improve your situation. All those numbers need to be higher, probably a lot higher. See the advice in the pg_resetxlog man page about deriving reasonable values.

			regards, tom lane
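The man page's rule of thumb for a safe NextXID can be sketched like this, assuming 7.4's pg_clog layout in which each segment file covers 1048576 transactions (the segment name below is made up for illustration):

```shell
# Sketch of the pg_resetxlog docs' NextXID rule of thumb: take the
# numerically largest file name in $PGDATA/pg_clog, add one, and
# multiply by 1048576 (transactions covered per pg_clog segment).
largest=4   # hypothetical: suppose "0004" is the largest pg_clog file
next_xid=$(( (largest + 1) * 1048576 ))
echo "pg_resetxlog -x $next_xid ..."
```

There is no comparably easy way to derive the next OID, but the docs note that a too-small OID is not fatal the way a too-small XID is.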
Thanks for your help with pg_resetxlog. It recovered all of our databases, and it looks like we got lucky in that no updates were lost.
We are deciding on the goals for our next release, and one of the issues on the table is an upgrade to postgres 8. Can you comment on the improvements in performance and especially reliability over postgres 7.4? In particular, if the risk of pg_xlog corruption is lower, that would be an especially effective argument for the move to postgres 8 while the pain of the recent episode is still vivid.
Morris
On 9/25/07, Morris Goldstein <morris.x.goldstein@gmail.com> wrote:
> Thanks for your help with pg_resetxlog. It recovered all of our databases,
> and it looks like we got lucky in that no updates were lost.
>
> We are deciding on the goals for our next release, and one of the issues on
> the table is an upgrade to postgres 8. Can you comment on the improvements
> in performance and especially reliability over postgres 7.4? In particular,
> if the risk of pg_xlog corruption is lower, that would be an especially
> effective argument for the move to postgres 8 while the pain of the recent
> episode is still vivid.

My experience has been that 7.4 was rock solid stable. But our uses may not be similar to yours.

I would question how you managed to get your servers into this state. If one server out of four had this problem, I would have said to examine your system to see if it has bad memory, CPU, or drive arrays. But since it hit all of your machines, and at about the same time, I tend to think that someone did something to these machines that caused this issue, and it's not a 7.4.x problem.

Did you update / upgrade kernels, device drivers, hardware, etc? What is common between all these systems besides postgresql? Was there a power outage? All machines had the same admin one day who had a brain cramp and did something stupid?

Simply put, we need more info on how this happened.
On 9/25/07, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> But since it hit all of your machines, and at about the same time, I
> tend to think that someone did something to these machines that caused
> this issue, and it's not a 7.4.x problem.

I'm sure it is pilot error, and we're still trying to figure out exactly which pilot and what error.

> Did you update / upgrade kernels, device drivers, hardware, etc?
> What is common between all these systems besides postgresql? Was
> there a power outage? All machines had the same admin one day who had
> a brain cramp and did something stupid?

This occurred as part of an upgrade -- new OS, kernel, drivers.

> Simply put, we need more info on how this happened.

We've recovered. There is root cause analysis going on. The question is whether I can use an argument about 8.0 vs. 7.4 reliability from this fiasco to help us get to 8.0.

8.0 actually is more reliable than 7.4, I assume.

Morris
"Morris Goldstein" <morris.x.goldstein@gmail.com> writes:
> We've recovered. There is root cause analysis going on. The question is
> whether I can use an argument about 8.0 vs. 7.4 reliability from this fiasco
> to help us get to 8.0.
>
> 8.0 actually is more reliable than 7.4, I assume.

I don't know that I'd make that argument for those two versions. If you compare 7.4.x to 8.2.x, then yes I'd say the later version is noticeably more stable. If nothing else, its ability to defend itself against transaction wraparound is a big leg up over 7.4.

			regards, tom lane
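On a pre-8.1 server the wraparound exposure mentioned above can be checked by hand; the 7.4 routine-maintenance docs suggest a query along these lines (shown as a sketch, to be run in psql against a live server):

```sql
-- Age of the oldest transaction ID in each database; values creeping
-- toward 2 billion mean wraparound danger, and on 7.4 only regular
-- database-wide VACUUMs push them back down.
SELECT datname, age(datfrozenxid) FROM pg_database;
```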
On 9/25/07, Morris Goldstein <morris.x.goldstein@gmail.com> wrote:
> On 9/25/07, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> > But since it hit all of your machines, and at about the same time, I
> > tend to think that someone did something to these machines that caused
> > this issue, and it's not a 7.4.x problem.
>
> I'm sure it is pilot error, and we're still trying to figure out exactly
> which pilot and what error.
>
> This occurred as part of an upgrade -- new OS, kernel, drivers.
>
> We've recovered. There is root cause analysis going on. The question is
> whether I can use an argument about 8.0 vs. 7.4 reliability from this fiasco
> to help us get to 8.0.
>
> 8.0 actually is more reliable than 7.4, I assume.

Well, if you're going to upgrade, look at 8.1 as a minimum, 8.2 if possible. I can't say for sure that 8.0, 8.1 or 8.2 would have handled this much better, but having something like Point In Time Recovery or other forms of replication readily available could have certainly limited your downtime in this instance.

I would highly recommend 8.2.5 as your upgrade target. Look over the release notes for 8.0, 8.1 and 8.2. I will say that 8.2 is noticeably faster than 7.4, and is at least as stable for me.