Thread: Can't connect (2 dbs) or login (2 others)

Can't connect (2 dbs) or login (2 others)

From
"Morris Goldstein"
Date:
We have a cluster with four nodes, each running a postgres 7.4.8
database. Due to a large amount of pilot error and possibly hardware
problems (still trying to get to the bottom of it all), two of the
databases won't start, and I can't login to two others, with any
registered user.

Two of the nodes have logs that look like this:

2007-09-22 07:06:05 [3055] LOG:  could not create IPv6 socket: Address
family not supported by protocol
2007-09-22 07:06:05 [3060] LOG:  database system shutdown was
interrupted at 2007-09-22 06:56:27 PDT
2007-09-22 07:06:05 [3060] LOG:  could not open file
"/var/lib/pgsql/data/pg_xlog/0000000000000000" (log file 0, segment
0): No such file or directory
2007-09-22 07:06:05 [3060] LOG:  invalid primary checkpoint record
2007-09-22 07:06:05 [3060] LOG:  could not open file
"/var/lib/pgsql/data/pg_xlog/0000000000000000" (log file 0, segment
0): No such file or directory
2007-09-22 07:06:05 [3060] LOG:  invalid secondary checkpoint record
2007-09-22 07:06:05 [3060] PANIC:  could not locate a valid checkpoint record
2007-09-22 07:06:05 [3055] LOG:  startup process (PID 3060) was
terminated by signal 6
2007-09-22 07:06:05 [3055] LOG:  aborting startup due to startup process failure

The other two look like this:

LOG:  database system was shut down at 2007-09-22 05:52:26 PDT
LOG:  checkpoint record is at 0/9B0B8C
LOG:  redo record is at 0/9B0B8C; undo record is at 0/0; shutdown TRUE
LOG:  next transaction ID: 536; next OID: 17142
LOG:  database system is ready
FATAL:  IDENT authentication failed for user "xxx"
FATAL:  IDENT authentication failed for user "yyy"


Can someone tell what is going wrong and how I can recover?

Morris

Re: Can't connect (2 dbs) or login (2 others)

From
Tom Lane
Date:
"Morris Goldstein" <morris.x.goldstein@gmail.com> writes:
> Two of the nodes have logs that look like this:

> 2007-09-22 07:06:05 [3060] LOG:  could not open file
> "/var/lib/pgsql/data/pg_xlog/0000000000000000" (log file 0, segment
> 0): No such file or directory
> 2007-09-22 07:06:05 [3060] LOG:  invalid primary checkpoint record
> 2007-09-22 07:06:05 [3060] LOG:  could not open file
> "/var/lib/pgsql/data/pg_xlog/0000000000000000" (log file 0, segment
> 0): No such file or directory
> 2007-09-22 07:06:05 [3060] LOG:  invalid secondary checkpoint record
> 2007-09-22 07:06:05 [3060] PANIC:  could not locate a valid checkpoint record

This is not good.  Is it really complaining about log file 0?  I can
hardly credit that a database you care about wouldn't have used more log
space than that.

You might be able to recover, at least to the extent of getting some
data out of these DBs, by using pg_resetxlog.  Read its man page
*carefully* beforehand.
http://www.postgresql.org/docs/7.4/static/app-pgresetxlog.html

> The other two look like this:

> FATAL:  IDENT authentication failed for user "xxx"

This looks like user error, ie, trying to log in as a different Postgres
user than your Unix username is.  You can hardly have been accustomed to
doing that if you were normally using IDENT authentication before.
If you want to switch to some other authentication method, read the docs
http://www.postgresql.org/docs/7.4/static/client-authentication.html
and edit pg_hba.conf.

BTW, 7.4.8 is pretty old; you really ought to be on something more
current.  The latest release in that branch is 7.4.18 --- you are
missing two years worth of bug fixes.

            regards, tom lane

Re: Can't connect (2 dbs) or login (2 others)

From
"Morris Goldstein"
Date:
On 9/22/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Morris Goldstein" <morris.x.goldstein@gmail.com> writes:
> > Two of the nodes have logs that look like this:
>
> > 2007-09-22 07:06:05 [3060] LOG:  could not open file
> > "/var/lib/pgsql/data/pg_xlog/0000000000000000" (log file 0, segment
> > 0): No such file or directory
> > 2007-09-22 07:06:05 [3060] LOG:  invalid primary checkpoint record
> > 2007-09-22 07:06:05 [3060] LOG:  could not open file
> > "/var/lib/pgsql/data/pg_xlog/0000000000000000" (log file 0, segment
> > 0): No such file or directory
> > 2007-09-22 07:06:05 [3060] LOG:  invalid secondary checkpoint record
> > 2007-09-22 07:06:05 [3060] PANIC:  could not locate a valid checkpoint record
>
> This is not good.  Is it really complaining about log file 0?  I can
> hardly credit that a database you care about wouldn't have used more log
> space than that.

There is much more log space. The pg_xlog files are 0000000900000C2-
0000000A0000006C.

>
> You might be able to recover, at least to the extent of getting some
> data out of these DBs, by using pg_resetxlog.  Read its man page
> *carefully* beforehand.
> http://www.postgresql.org/docs/7.4/static/app-pgresetxlog.html
>
> > The other two look like this:
>
> > FATAL:  IDENT authentication failed for user "xxx"
>
> This looks like user error, ie, trying to log in as a different Postgres
> user than your Unix username is.  You can hardly have been accustomed to
> doing that if you were normally using IDENT authentication before.
> If you want to switch to some other authentication method, read the docs
> http://www.postgresql.org/docs/7.4/static/client-authentication.html
> and edit pg_hba.conf.

No, it isn't user error. I tried logging in myself with our usual ids and they
were rejected. Here is the end of strace output on an attempt to connect
using psql.

open("/lib/libnss_files.so.2", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\240\31"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=46740, ...}) = 0
mmap2(NULL, 41616, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
0) = 0xd26000
mmap2(0xd2f000, 8192, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x8) = 0xd2f000
close(3)                                = 0
mprotect(0xd2f000, 4096, PROT_READ)     = 0
munmap(0xb7f2f000, 17508)               = 0
open("/etc/passwd", O_RDONLY)           = 3
fcntl64(3, F_GETFD)                     = 0
fcntl64(3, F_SETFD, FD_CLOEXEC)         = 0
fstat64(3, {st_mode=S_IFREG|0644, st_size=1435, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0xb7f33000
read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1435
close(3)                                = 0
munmap(0xb7f33000, 4096)                = 0
stat64("/root/.pgpass", 0xbf9b3288)     = -1 ENOENT (No such file or directory)
socket(PF_FILE, SOCK_STREAM, 0)         = 3
fcntl64(3, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
connect(3, {sa_family=AF_FILE, path="/tmp/.s.PGSQL.5432"}, 110) = -1
ENOENT (No such file or directory)
close(3)                                = 0
write(2, "psql: could not connect to serve"..., 163psql: could not
connect to server: No such file or directory
    Is the server running locally and accepting
    connections on Unix domain socket "/tmp/.s.PGSQL.5432"?
) = 163
exit_group(2)                           = ?
Process 11653 detached

>
> BTW, 7.4.8 is pretty old; you really ought to be on something more
> current.  The latest release in that branch is 7.4.18 --- you are
> missing two years worth of bug fixes.

The fun and games started during an upgrade to newer software, which
includes 7.4.18.

Morris

Re: Can't connect (2 dbs) or login (2 others)

From
"Morris Goldstein"
Date:
Sorry to reply to myself but here's a bit more info. That strace shows
a crash. The node that was denying logins is now complaining about
checkpoint file 000...000. It appears to be the case that a few
attempts to start converts a db that rejects logins to one that
crashes on startup. (When I first started examining the damage, I was
able to start all four dbs but not login.)

Morris

Re: Can't connect (2 dbs) or login (2 others)

From
"Morris Goldstein"
Date:
On 9/22/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Morris Goldstein" <morris.x.goldstein@gmail.com> writes:
> > ...
> > 2007-09-22 07:06:05 [3060] LOG:  could not open file
> > "/var/lib/pgsql/data/pg_xlog/0000000000000000" (log file 0, segment 0): No such file or directory
> > ...
>
> ...
>
> You might be able to recover, at least to the extent of getting some
> data out of these DBs, by using pg_resetxlog.  Read its man page
> *carefully* beforehand.
> http://www.postgresql.org/docs/7.4/static/app-pgresetxlog.html

I looked at the pg_resetxlog documentation and have a question. Here
is output from pg_resetxlog -n:

pg_control values:

pg_control version number:            72
Catalog version number:               200310211
Current log file ID:                  0
Next log file segment:                1
Latest checkpoint's StartUpID:        12
Latest checkpoint's NextXID:          536
Latest checkpoint's NextOID:          17142
Database block size:                  8192
Blocks per segment of large relation: 131072
Maximum length of identifiers:        64
Maximum number of function arguments: 32
Date/time type storage:               floating-point numbers
Maximum length of locale name:        128
LC_COLLATE:                           C
LC_CTYPE:                             C

Is it safe to run pg_resetxlog without the -l argument? Or should I
specify -l, constructing an argument from the last pg_xlog filename?

Morris

Re: Can't connect (2 dbs) or login (2 others)

From
Tom Lane
Date:
"Morris Goldstein" <morris.x.goldstein@gmail.com> writes:
> I looked at the pg_resetxlog documentation and have a question. Here
> is output from pg_resetxlog -n:

[ snipped to just the non-constant numbers ]

> Current log file ID:                  0
> Next log file segment:                1
> Latest checkpoint's StartUpID:        12
> Latest checkpoint's NextXID:          536
> Latest checkpoint's NextOID:          17142

Ick.  This looks *exactly* like what pg_control would contain
immediately after initdb, in a 7.4 database.  I suppose that someone
tried an initdb as a recovery method, without understanding the
consequences.

> Is it safe to run pg_resetxlog without the -l argument?

No, or at least it will not improve your situation.  All those numbers
need to be higher, probably a lot higher.  See the advice in the
pg_resetxlog man page about deriving reasonable values.

            regards, tom lane

Re: Can't connect (2 dbs) or login (2 others)

From
"Morris Goldstein"
Date:
Thanks for your help with pg_resetxlog. It recovered all of our databases, and it looks like we got lucky in that no updates were lost.

We are deciding on the goals for our next release, and one of the issues on the table is an upgrade to postgres 8. Can you comment on the improvements in performance and especially reliability over postgres 7.4? In particular, if the risk of pg_xlog corruption is lower, that would support the move to postgres 8 in a particularly effective way, while the pain of the recent episode is still vivid.

Morris

Re: Can't connect (2 dbs) or login (2 others)

From
"Scott Marlowe"
Date:
On 9/25/07, Morris Goldstein <morris.x.goldstein@gmail.com> wrote:
> Thanks for your help with pg_resetxlog. It recovered all of our databases,
> and it looks like we got lucky in that no updates were lost.
>
> We are deciding on the goals for our next release, and one of the issues on
> the table is an upgrade to postgres 8. Can you comment on the improvements
> in performance and especially reliability over postgres 7.4? In particular,
> if the risk of pg_xlog corruption is lower, that would support the move to
> postgres 8 in a particularly effective way, while the pain of the recent
> episode is still vivid.

My experience has been that 7.4 was rock solid stable.  But our uses
may not be similar to yours. I would question how you managed to get
your servers into this state.  if one server out of four had this
problem I would have said to examine your system to see if it has bad
memory, CPU, or drive arrays.

But since it hit all of your machines, and at about the same time, I
tend to think that someone did something to these machines that caused
this issue, and it's not a 7.4.x problem.

Did you update / upgrade kernels, device drivers, hardware, etc...
What is common between all these systems besides postgresql?  Was
there a power outage?  All machines had the same admin one day who had
a brain cramp and did something stupid?

Simply put, we need more info on how this happened.

Re: Can't connect (2 dbs) or login (2 others)

From
"Morris Goldstein"
Date:
On 9/25/07, Scott Marlowe <scott.marlowe@gmail.com> wrote:
But since it hit all of your machines, and at about the same time, I
tend to think that someone did something to these machines that caused
this issue, and it's not a 7.4.x problem.

I'm sure it is pilot error, and we're still trying to figure out exactly which pilot and what error.

Did you update / upgrade kernels, device drivers, hardware, etc...
What is common between all these systems besides postgresql?  Was
there a power outage?  All machines had the same admin one day who had
a brain cramp and did something stupid?

This occurred as part of an upgrade -- new OS, kernel, drivers.

Simply put, we need more info on how this happened.

We've recovered. There is root cause analysis going on. The question is whether I can use an argument about 8.0 vs. 7.4 reliability from this fiasco to help us get to 8.0.
8.0 actually is more reliable than 7.4, I assume.

Morris

Re: Can't connect (2 dbs) or login (2 others)

From
Tom Lane
Date:
"Morris Goldstein" <morris.x.goldstein@gmail.com> writes:
> We've recovered. There is root cause analysis going on. The question is
> whether I can use an argument about 8.0 vs. 7.4 reliability from this fiasco
> to help us get to 8.0.
> 8.0 actually is more reliable than 7.4, I assume.

I don't know that I'd make that argument for those two versions.
If you compare 7.4.x to 8.2.x, then yes I'd say the later version
is noticeably more stable.  If nothing else, its ability to defend
itself against transaction wraparound is a big leg up over 7.4.

            regards, tom lane

Re: Can't connect (2 dbs) or login (2 others)

From
"Scott Marlowe"
Date:
On 9/25/07, Morris Goldstein <morris.x.goldstein@gmail.com> wrote:
> On 9/25/07, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> > But since it hit all of your machines, and at about the same time, I
> > tend to think that someone did something to these machines that caused
> > this issue, and it's not a 7.4.x problem.
>
> I'm sure it is pilot error, and we're still trying to figure out exactly
> which pilot and what error.
>
> > Did you update / upgrade kernels, device drivers, hardware, etc...
> > What is common between all these systems besides postgresql?  Was
> > there a power outage?  All machines had the same admin one day who had
> > a brain cramp and did something stupid?
>
> This occurred as part of an upgrade -- new OS, kernel, drivers.
>
> > Simply put, we need more info on how this happened.
> >
>
> We've recovered. There is root cause analysis going on. The question is
> whether I can use an argument about 8.0 vs. 7.4 reliability from this fiasco
> to help us get to 8.0.
> 8.0 actually is more reliable than 7.4, I assume.

Well,if you're going to upgrade look at 8.1 as a minimum, 8.2 if possible.

I can't say for sure that 8.0, 8.1 or 8.2 would have handled this much
better, but having something like Point In Time Replication or other
forms of replication readily available could have certainly limited
your downtime in this instance.

I would highly recommend 8.2.5 as your upgrade target.  Look over the
release notes for 8.0, 8.1 and 8.2.

I will say that 8.2 is noticeably faster than 7.4, and is at least as
stable for me.