Re: pg_dumpall problem - duplicated users - Mailing list pgsql-admin

From Tom Lane
Subject Re: pg_dumpall problem - duplicated users
Date
Msg-id 13804.1125670088@sss.pgh.pa.us
In response to pg_dumpall problem - duplicated users  (Przemysław Nogaj <pn@tey.pl>)
List pgsql-admin
Bartosz Nowak <grubby@go2.pl> writes:
>   Tom Lane wrote:
>>> Yeah. We need to look at the contents of the infomask for these rows.

>  Item   1 -- Length:   95  Offset: 8096 (0x1fa0)  Flags: USED
>   XID: min (1)  CMIN|XMAX: 50469  CMAX|XVAC: 0
>   Block Id: 0  linp Index: 29   Attributes: 8   Size: 24
>   infomask: 0x0501 (HASNULL|XMIN_COMMITTED|XMAX_COMMITTED)
>   t_bits: [0]: 0x1f

>   1fa0: 01000000 25c50000 00000000 00000000  ....%...........
>   1fb0: 1d000800 0105181f 706f7374 67726573  ........postgres

Well, you are definitely suffering from transaction ID wraparound.
Other rows in the table bear XIDs as high as 2559800779, so your
current transaction counter is at least that much.  Transaction 50469,
which obsoleted this row, was a *long* time ago.
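
As a quick check on how widespread the damage is, the sort of query
shown in the routine-vacuuming documentation reports each database's
XID age (it relies on age() and pg_database.datfrozenxid, and can be
run from any database):

    psql -d template1 -c "SELECT datname, age(datfrozenxid) FROM pg_database;"

Numbers up around two billion mean that database is at or past the
wraparound point.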

The reason that this row is suddenly being seen as valid, rather than
deleted, is that the XID counter has wrapped around and so 50469
is now seen as "in the future" not "in the past" --- that is, the code
thinks the row is deleted, but by a concurrent transaction that started
later than our own transaction.  So it's valid, but you can't delete it
because it's already deleted.
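(Roughly speaking, XID comparison is circular: modulo 2^32, the two
billion XIDs following a given XID count as "newer" and the two billion
before it as "older".  Here 50469 minus your current counter of at
least 2559800779 comes out to about 1.7 billion modulo 2^32, which lands
in the "newer" half, so 50469 now looks like the future.)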

I can't see any evidence of data corruption.  I think you just forgot
to vacuum pg_shadow for a very long time.  This row should have been
physically deleted long ago --- but evidently VACUUM was never run
across pg_shadow, until it was too late.

While you could perhaps manually repair this one row with a hex editor,
I don't think that will get you out of trouble.  The odds are good that
there are similar problems elsewhere, including both system catalogs and
user tables (ie, if you weren't vacuuming pg_shadow, what else weren't
you vacuuming?).

I'm not sure that you are going to be able to get out of this without
losing data.  Here's what I'd try (but bear in mind that I just made up
this procedure on the spot); a rough command-line sketch of the whole
sequence follows step 6:

1. Stop the postmaster and take a physical dump of the $PGDATA tree,
if you don't have one already.  This will at least let you get back to
where you are now if the rest doesn't work.

2. Restart the postmaster, but don't let anyone else in (for safety
I'd suggest modifying pg_hba.conf to guarantee this).  You're going
to be down for a little bit :-(

3. Do a VACUUM FREEZE (database-wide) in *each* database, or at least
each one you care about recovering.  *Don't* use FULL.

4. Stop the postmaster again.  Use pg_controldata to note the current
transaction counter ("latest checkpoint's NextXID").  Then use
pg_resetxlog with the -x switch to deliberately back up the XID counter
to something less than 2 billion.

5. Restart the postmaster, and try to pg_dump everything.  Also do
whatever sanity tests you can think of --- look for missing and
duplicated rows in particular.

6. If you think you have a good dump, initdb and reload the dump,
and you're back in business.  If it didn't work, well, revert to
your last backup :-(
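
Roughly, in command-line terms (paths, database names, and the
pg_resetxlog value below are placeholders, not real values):

    # 1. stop the postmaster and snapshot the data directory
    pg_ctl -D $PGDATA stop
    tar cf /somewhere/safe/pgdata-backup.tar $PGDATA

    # 2. edit pg_hba.conf to lock out everyone but yourself, then restart
    pg_ctl -D $PGDATA start

    # 3. database-wide VACUUM FREEZE (no table name, and not FULL),
    #    once in every database you want to recover
    psql -d template1 -c "VACUUM FREEZE"
    psql -d yourdb -c "VACUUM FREEZE"

    # 4. stop again, note the XID counter, then push it back below 2 billion
    pg_ctl -D $PGDATA stop
    pg_controldata $PGDATA              # "Latest checkpoint's NextXID"
    pg_resetxlog -x <xid-below-2-billion> $PGDATA

    # 5. restart, dump everything, sanity-check the dump
    pg_ctl -D $PGDATA start
    pg_dumpall > /somewhere/safe/alldump.sql

    # 6. only if the dump checks out: initdb a fresh $PGDATA and
    #    restore with psql -f alldump.sql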

After you get out of your immediate problem, you had better look at
your vacuuming procedures and find out why you are in this fix.  See
http://www.postgresql.org/docs/7.3/static/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND
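
Even a dumb nightly cron entry, run as the postgres user, is enough to
stay clear of the wraparound limit (tune the frequency to your update
load; busy tables will want vacuuming much more often than this):

    # vacuum (-a: all databases) and analyze (-z) quietly at 03:00 daily
    0 3 * * *   vacuumdb -a -z -q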

            regards, tom lane
