Re: pgsql 8.3.3 "Invalid page header" leads to "FATAL: segment too big"

From Achilleas Mantzios
Subject Re: pgsql 8.3.3 "Invalid page header" leads to "FATAL: segment too big"
Msg-id 201004231206.38741.achill@matrix.gatewaynet.com
In response to Re: pgsql 8.3.3 "Invalid page header" leads to "FATAL: segment too big"  (Cédric Villemain <cedric.villemain.debian@gmail.com>)
List pgsql-admin
On Thursday 22 April 2010 19:02:00, Cédric Villemain wrote:
> 2010/4/22 Achilleas Mantzios <achill@matrix.gatewaynet.com>:
> > On Thursday 22 April 2010 16:53:05, Cédric Villemain wrote:
> >> 2010/4/22 Achilleas Mantzios <achill@matrix.gatewaynet.com>:
> >> > Hello,
> >> > I have this serious problem on one of our remote vessels (comms are done with minicom over the remote satellite modem).
> >> > I think this server has been subject to some sort of constant resets or hardware failures.
> >> > Initially, I had this problem:
> >> > ERROR:  invalid page header in block 672720 of relation "pg_toast_125716009"
> >> >
> >> > This toast table corresponds to a table named "mail_message",
> >> >                                Table "public.mail_message"
> >> >  Column   |       Type        |                         Modifiers
> >> > -----------+-------------------+-----------------------------------------------------------
> >> >  msgno     | mail_msgno_domain | not null default nextval('mail_msgno_sequence'::regclass)
> >> >  msgsource | bytea             |
> >> > Indexes:
> >> >    "mail_message_key" PRIMARY KEY, btree (msgno)
> >> >
> >> > (obviously the TOAST table serves the msgsource varlena), the contents of which are not of vital importance.
> >> > I tried REINDEXing, with no success, and after that I tried zeroing the offending block on disk as per this
> >> > suggestion by Tom: http://old.nabble.com/invalid-page-header-td11981154.html
> >> >
> >> > I found the oid of the table:
> >> > SELECT tableoid,1 from pg_toast.pg_toast_125716009 order by chunk_id LIMIT 1
> >> >  tableoid  | ?column?
> >> > -----------+----------
> >> >  125716013 |        1
> >> >
> >> > (and just to verify)
> >> > SELECT relname from pg_class where oid=125716013;
> >> >      relname
> >> > --------------------
> >> >  pg_toast_125716009
> >> >
> >> > Then I did the following (as I said, I do not need the contents of msgsource, yet the contents of msgno are vital):
> >> > dd if=/dev/zero of=/usr/local/var/lib/pgsql/data/base/125714957/125716013 seek=672720 bs=8192 count=1
> >>
> >> Segments have a 1 GB maximum size. You have to find which segment the
> >> faulty block lives in, and convert the block number from the error report
> >> to the corresponding offset within that segment.
> >>
> >
> > Thanx,
> > Taking a look at /usr/local/src/postgresql-8.3.3/src/backend/storage/smgr/md.c,
> > I see that the error comes from the function mdnblocks:
> >
> > if (nblocks > ((BlockNumber) RELSEG_SIZE))
> >     elog(FATAL, "segment too big");
> >
> > That means that some segment file is bigger than RELSEG_SIZE blocks.
> > At least on my system:
> > #define BLCKSZ      8192
> > #define RELSEG_SIZE (0x40000000 / BLCKSZ)
> > So no segment file can be bigger than RELSEG_SIZE blocks (i.e. 2^30 bytes = 1 GB).
> >
> > Currently I don't have any access to the machine, but tomorrow I will check the file sizes.
> >
> > Can anyone shed some light on how to identify all the segment files of a table?
> > The first one has the same name as the tableoid.
> > What about the subsequent segments?
>
> Your execution of dd made your first segment bigger than expected.
> The other segments have the same name with a .1, .2, etc. suffix.
>
> You have to shrink your first segment back to the correct size.
> Check what happens; you should then be back to the original error.
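
So, to spot the overgrown segment and see its siblings, listing the data files should look something like this (path and file number taken from earlier in this thread; all but the last segment should be exactly 1 GB):

ls -l /usr/local/var/lib/pgsql/data/base/125714957/125716013*
# expect 125716013, 125716013.1, 125716013.2, ... each at most
# 1073741824 bytes (131072 blocks * 8192 bytes); anything bigger
# is the segment that dd overgrew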
>
> And, I have never used it, but I think the purpose of the
> zero_damaged_pages parameter is to allow postgresql itself to zero the
> bad block (reading
> src/backend/storage/buffer/bufmgr.c confirms that).
>
> *BUT* take care that it will zero *every* bad page, perhaps not only
> the one trapping an error.
>
> In those situations, it is good to make a snapshot of the pgdata
> directory, in case your fingers surf too fast on the keyboard...
>
> If you don't want to activate zero_damaged_pages, then go and calculate
> which block in which segment you have to zero.
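
For reference, letting postgresql do the zeroing itself would be something like the sketch below (untested here): zero_damaged_pages is superuser-only and session-local in this form, and the seqscan forces every page of the toast table to be read, so any page with a bad header gets zeroed and logged instead of raising an error. "mydb" stands in for the real database name.

psql -d mydb -c "SET zero_damaged_pages = on;
SELECT count(*) FROM pg_toast.pg_toast_125716009;"
# count(*) seqscans the whole toast table, touching every block;
# the SET dies with the session, so the dangerous setting cannot linger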
>
> Side note: it may be useful to have the relevant information in the
> error message...
>


Many thanks, Cédric Villemain and Tom.
What I did first was to correct the first segment file, with something like:
dd if=216293737 of=216293737.good seek=0 bs=8192 count=131072
which keeps only the first 131072 blocks (2^30 bytes = 1 GB) and drops everything past them.
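
(This is also why the first segment was broken in the first place: seek=672720 with bs=8192 in the earlier dd means writing at byte offset 672720 * 8192, about 5.1 GB into a file capped at 1 GB, so mdnblocks found an oversized segment.) A quick sanity check on the repaired file:

ls -l 216293737.good
# should be exactly 1073741824 bytes (131072 blocks * 8192 bytes/block)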

After that was done and the postgresql backend was restarted, I was back to the original
situation with the invalid page header, as noted before.

I stopped the backend and calculated the exact segment file and offset where the problem was.
The block with the invalid header was block number 672720.
Each segment contains at most 131072 blocks, with all but the last
containing exactly 131072 blocks.
So my problematic segment was the one with number:
672720 / 131072 = 5
and the block offset inside this segment was:
672720 - (5 * 131072) = 17360
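
In shell terms (assuming the default BLCKSZ = 8192 and RELSEG_SIZE = 131072 from md.c above):

BAD_BLOCK=672720
BLOCKS_PER_SEG=131072                  # RELSEG_SIZE = 0x40000000 / 8192
echo $((BAD_BLOCK / BLOCKS_PER_SEG))   # segment file suffix: 5
echo $((BAD_BLOCK % BLOCKS_PER_SEG))   # block offset inside it: 17360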

So I tried to zero that particular block with:

dd conv=notrunc if=/dev/zero of=216293737.5.GOOD seek=17360 bs=8192 count=1

(conv=notrunc keeps dd from truncating the output file right after the block it writes.)

I started postgresql and it threw out a warning about initializing this zero page.
After that, it reported an invalid page header for block 672740;
I repeated the same procedure, and this time I was able to go further and even reindex the whole database,
which went fine.
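
(Same arithmetic for the second bad block: 672740 lives in the same segment, 20 blocks further in.)

echo $((672740 / 131072)) $((672740 % 131072))   # prints: 5 17380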

As the ultimate test (and a useful one at this point), I tried to pg_dump the database.
Unfortunately one table seems to be in error:

pg_dump: SQL command failed
pg_dump: Error message from server: ERROR:  compressed data is corrupt
pg_dump: The command was: COPY public.mail_entity (msgno, entno, entparentno....

I think the initial issue of this thread is solved; I'll come back with news on the pg_dump issue.

> >
> >> >
> >> > However, after that, unfortunately I get constant postgresql server restarts with:
> >> > FATAL:  segment too big
> >> > server closed the connection unexpectedly
> >> >        This probably means the server terminated abnormally
> >> >        before or while processing the request.
> >> > The connection to the server was lost. Attempting reset: Succeeded.
> >> >
> >> > Is there anything I can do to salvage the situation?
> >> >
> >> > (One of) the hard parts here is that I have neither physical nor network access to the server
> >> > (only ultra-expensive, unreliable satellite comms).
> >> >
> >> > Thanks for any hints...
> >> >
> >> > --
> >> > Achilleas Mantzios
> >> >
> >
> >
> >
> > --
> > Achilleas Mantzios
> >
>
>
>
> --
> Cédric Villemain
>



--
Achilleas Mantzios
