Thread: trying to run PITR recovery

trying to run PITR recovery

From
Warren Little
Date:
Hello,
I'm testing my PITR recovery procedures and something doesn't look right.
The following is from the logs upon starting postgres with recovery.conf file in place

 2007-03-23 05:56:00 MDTLOG:  database system was interrupted at 2007-03-18 05:09:15 MDT
@ 2007-03-23 05:56:00 MDTLOG:  starting archive recovery
@ 2007-03-23 05:56:00 MDTLOG:  restore_command = "cp /data/pgLocal/archive/WAL_restore/%f "%p""
cp: cannot stat `/data/pgLocal/archive/WAL_restore/00000001.history': No such file or directory
@ 2007-03-23 05:56:00 MDTLOG:  restored log file "000000010000011A000000EE.004E0060.backup" from archive
@ 2007-03-23 05:56:00 MDTLOG:  restored log file "000000010000011A000000EE" from archive
@ 2007-03-23 05:56:00 MDTLOG:  checkpoint record is at 11A/EE4E0060
@ 2007-03-23 05:56:00 MDTLOG:  redo record is at 11A/EE4E0060; undo record is at 0/0; shutdown FALSE
@ 2007-03-23 05:56:00 MDTLOG:  next transaction ID: 2415965426; next OID: 81701223
@ 2007-03-23 05:56:00 MDTLOG:  next MultiXactId: 43380; next MultiXactOffset: 92368
@ 2007-03-23 05:56:00 MDTLOG:  automatic recovery in progress
@ 2007-03-23 05:56:01 MDTLOG:  redo starts at 11A/EE4E00B0
@ 2007-03-23 05:56:15 MDTLOG:  restored log file "000000010000011A000000EF" from archive
@ 2007-03-23 05:56:27 MDTLOG:  restored log file "000000010000011A000000F0" from archive
@ 2007-03-23 05:56:31 MDTLOG:  restored log file "000000010000011A000000F1" from archive
@ 2007-03-23 05:56:34 MDTLOG:  restored log file "000000010000011A000000F2" from archive
@ 2007-03-23 05:56:40 MDTLOG:  restored log file "000000010000011A000000F3" from archive
@ 2007-03-23 05:56:50 MDTLOG:  restored log file "000000010000011A000000F4" from archive
@ 2007-03-23 05:57:02 MDTLOG:  restored log file "000000010000011A000000F5" from archive
@ 2007-03-23 05:57:07 MDTLOG:  restored log file "000000010000011A000000F6" from archive
@ 2007-03-23 05:57:11 MDTLOG:  restored log file "000000010000011A000000F7" from archive
@ 2007-03-23 05:57:14 MDTLOG:  restored log file "000000010000011A000000F8" from archive
@ 2007-03-23 05:57:18 MDTLOG:  restored log file "000000010000011A000000F9" from archive
@ 2007-03-23 05:57:21 MDTLOG:  restored log file "000000010000011A000000FA" from archive
@ 2007-03-23 05:57:27 MDTLOG:  restored log file "000000010000011A000000FB" from archive
@ 2007-03-23 05:57:29 MDTLOG:  restored log file "000000010000011A000000FC" from archive
@ 2007-03-23 05:57:33 MDTLOG:  restored log file "000000010000011A000000FD" from archive
@ 2007-03-23 05:57:35 MDTLOG:  incorrect resource manager data checksum in record at 11A/FD492B20
@ 2007-03-23 05:57:35 MDTLOG:  redo done at 11A/FD492210
@ 2007-03-23 05:57:36 MDTLOG:  restored log file "000000010000011A000000FD" from archive
@ 2007-03-23 05:57:36 MDTLOG:  archive recovery complete
@ 2007-03-23 05:57:36 MDTLOG:  could not truncate directory "pg_multixact/members": apparent wraparound
@ 2007-03-23 05:59:00 MDTLOG:  database system is ready
@ 2007-03-23 05:59:00 MDTLOG:  transaction ID wrap limit is 3065701724, limited by database "postgres"
@ 2007-03-23 12:14:04 MDTLOG:  autovacuum: processing database "postgres"

My concern is that there were many more logfiles to be played following "00000010000011A000000FD" 
(ie 000000010000011E0000005C) yet it appears the recovery stop at that point and called it good.
I would assume all WAL logs would be restored.

The recovery.conf was simple: 
restore_command = 'cp /data/pgLocal/archive/WAL_restore/%f "%p"'

The backup file 000000010000011A000000EE.004E0060.backup" contained:
START WAL LOCATION: 11A/EE4E0060 (file 000000010000011A000000EE)
STOP WAL LOCATION: 11A/EFF68AB8 (file 000000010000011A000000EF)
CHECKPOINT LOCATION: 11A/EE4E0060
START TIME: 2007-03-17 20:29:16 MDT
LABEL: 076_pgdata.tar
STOP TIME: 2007-03-18 05:16:17 MDT

Does the line:  incorrect resource manager data checksum in record at 11A/FD492B20
mean there is a corrupted WAL log file?

Any insight here would be helpful 
version PG 8.1.2 64 bit Linux


thanks


Warren Little
Chief Technology Officer
Meridias Capital Inc
ph 866.369.7763



AYUDA URGENTE CON TABLA PG_SHADOW

From
"Daniel Ricardo Medina"
Date:

Ayuda por favor

Cuando ejecuto este comando me sale una doble tupla del usuario postgres, esto me error cuando ejecuto el comando Pg_dump para realizar el respectivo backup..

 

radar=# select * from pg_user;

usename | usesysid | usecreatedb | usesuper | usecatupd | passwd | valuntil | useconfig

-----------+----------+-------------+----------+-----------+----------+----------+-----------

postgres | 1 | t | t | t | ******** | |

postgres | 1 | t | t | t | ******** | |

userradar | 100 | f | f | f | ******** | |

(3 rows)

He intentado borrar los usuario directamente de la tabla subiendo postgres com postgres -O -P -D directorio

Y corriendo estos comandos..

select * from pg_shadow;

delete from pg_shadow;

CREATE USER postgres WITH CREATEDB CREATEUSER PASSWORD '***' SYSID 1; CREATE USER userradar PASSWORD '***'

Pero me crea los mismos tres usuarios en la tabla pg_shadow, el problema es que no he podido hacer el backup y ya no tengo espacio en la particion donde tengo montada la base

 

 

 

 

 

 

Ayuda por favor

Cuando ejecuto este comando me sale una doble tupla del usuario postgres, esto me error cuando ejecuto el comando Pg_dump para realizar el respectivo backup..

 

radar=# select * from pg_user;

usename | usesysid | usecreatedb | usesuper | usecatupd | passwd | valuntil | useconfig

-----------+----------+-------------+----------+-----------+----------+----------+-----------

postgres | 1 | t | t | t | ******** | |

postgres | 1 | t | t | t | ******** | |

userradar | 100 | f | f | f | ******** | |

(3 rows)

He intentado borrar los usuario directamente de la tabla subiendo postgres com postgres -O -P -D directorio

Y corriendo estos comandos..

select * from pg_shadow;

delete from pg_shadow;

CREATE USER postgres WITH CREATEDB CREATEUSER PASSWORD '***' SYSID 1; CREATE USER userradar PASSWORD '***'

Pero me crea los mismos tres usuarios en la tabla pg_shadow, el problema es que no he podido hacer el backup y ya no tengo espacio en la particion donde tengo montada la base

 

 

 

 

 

 
Daniel Ricardo Medina
Ing. de Desarrollo
Computec S.A.
Tel 260 71 11 Ext 1640
 


De: pgsql-admin-owner@postgresql.org [mailto:pgsql-admin-owner@postgresql.org] En nombre de Warren Little
Enviado el: Friday, March 23, 2007 6:16 PM
Para: pgsql-admin@postgresql.org
Asunto: [ADMIN] trying to run PITR recovery

Hello,
I'm testing my PITR recovery procedures and something doesn't look right.
The following is from the logs upon starting postgres with recovery.conf file in place

 2007-03-23 05:56:00 MDTLOG:  database system was interrupted at 2007-03-18 05:09:15 MDT
@ 2007-03-23 05:56:00 MDTLOG:  starting archive recovery
@ 2007-03-23 05:56:00 MDTLOG:  restore_command = "cp /data/pgLocal/archive/WAL_restore/%f "%p""
cp: cannot stat `/data/pgLocal/archive/WAL_restore/00000001.history': No such file or directory
@ 2007-03-23 05:56:00 MDTLOG:  restored log file "000000010000011A000000EE.004E0060.backup" from archive
@ 2007-03-23 05:56:00 MDTLOG:  restored log file "000000010000011A000000EE" from archive
@ 2007-03-23 05:56:00 MDTLOG:  checkpoint record is at 11A/EE4E0060
@ 2007-03-23 05:56:00 MDTLOG:  redo record is at 11A/EE4E0060; undo record is at 0/0; shutdown FALSE
@ 2007-03-23 05:56:00 MDTLOG:  next transaction ID: 2415965426; next OID: 81701223
@ 2007-03-23 05:56:00 MDTLOG:  next MultiXactId: 43380; next MultiXactOffset: 92368
@ 2007-03-23 05:56:00 MDTLOG:  automatic recovery in progress
@ 2007-03-23 05:56:01 MDTLOG:  redo starts at 11A/EE4E00B0
@ 2007-03-23 05:56:15 MDTLOG:  restored log file "000000010000011A000000EF" from archive
@ 2007-03-23 05:56:27 MDTLOG:  restored log file "000000010000011A000000F0" from archive
@ 2007-03-23 05:56:31 MDTLOG:  restored log file "000000010000011A000000F1" from archive
@ 2007-03-23 05:56:34 MDTLOG:  restored log file "000000010000011A000000F2" from archive
@ 2007-03-23 05:56:40 MDTLOG:  restored log file "000000010000011A000000F3" from archive
@ 2007-03-23 05:56:50 MDTLOG:  restored log file "000000010000011A000000F4" from archive
@ 2007-03-23 05:57:02 MDTLOG:  restored log file "000000010000011A000000F5" from archive
@ 2007-03-23 05:57:07 MDTLOG:  restored log file "000000010000011A000000F6" from archive
@ 2007-03-23 05:57:11 MDTLOG:  restored log file "000000010000011A000000F7" from archive
@ 2007-03-23 05:57:14 MDTLOG:  restored log file "000000010000011A000000F8" from archive
@ 2007-03-23 05:57:18 MDTLOG:  restored log file "000000010000011A000000F9" from archive
@ 2007-03-23 05:57:21 MDTLOG:  restored log file "000000010000011A000000FA" from archive
@ 2007-03-23 05:57:27 MDTLOG:  restored log file "000000010000011A000000FB" from archive
@ 2007-03-23 05:57:29 MDTLOG:  restored log file "000000010000011A000000FC" from archive
@ 2007-03-23 05:57:33 MDTLOG:  restored log file "000000010000011A000000FD" from archive
@ 2007-03-23 05:57:35 MDTLOG:  incorrect resource manager data checksum in record at 11A/FD492B20
@ 2007-03-23 05:57:35 MDTLOG:  redo done at 11A/FD492210
@ 2007-03-23 05:57:36 MDTLOG:  restored log file "000000010000011A000000FD" from archive
@ 2007-03-23 05:57:36 MDTLOG:  archive recovery complete
@ 2007-03-23 05:57:36 MDTLOG:  could not truncate directory "pg_multixact/members": apparent wraparound
@ 2007-03-23 05:59:00 MDTLOG:  database system is ready
@ 2007-03-23 05:59:00 MDTLOG:  transaction ID wrap limit is 3065701724, limited by database "postgres"
@ 2007-03-23 12:14:04 MDTLOG:  autovacuum: processing database "postgres"

My concern is that there were many more logfiles to be played following "00000010000011A000000FD" 
(ie 000000010000011E0000005C) yet it appears the recovery stop at that point and called it good.
I would assume all WAL logs would be restored.

The recovery.conf was simple: 
restore_command = 'cp /data/pgLocal/archive/WAL_restore/%f "%p"'

The backup file 000000010000011A000000EE.004E0060.backup" contained:
START WAL LOCATION: 11A/EE4E0060 (file 000000010000011A000000EE)
STOP WAL LOCATION: 11A/EFF68AB8 (file 000000010000011A000000EF)
CHECKPOINT LOCATION: 11A/EE4E0060
START TIME: 2007-03-17 20:29:16 MDT
LABEL: 076_pgdata.tar
STOP TIME: 2007-03-18 05:16:17 MDT

Does the line:  incorrect resource manager data checksum in record at 11A/FD492B20
mean there is a corrupted WAL log file?

Any insight here would be helpful 
version PG 8.1.2 64 bit Linux


thanks


Warren Little
Chief Technology Officer
Meridias Capital Inc
ph 866.369.7763



URGENT TABLE PG_SHADOW CORRUTEP

From
"Daniel Ricardo Medina"
Date:
help please
 
when I execute  this query
 
 
select * from pg_shadow; 
 
 

usename | usesysid | usecreatedb | usesuper | usecatupd | passwd | valuntil | useconfig

-----------+----------+-------------+----------+-----------+----------+----------+-----------

postgres | 1 | t | t | t | ******** | |

postgres | 1 | t | t | t | ******** | |

userradar | 100 | f | f | f | ******** | |

(3 rows)

 
the result are  once tuples with the same username "postgres" but the first don´t have password. in additon, when  I execute   pg_dump the result is
 
 
pg_dump: saving encoding
pg_dump: saving database definition
pg_dump: SQL command failed
pg_dump: Error message from server: ERROR:  more than one row returned by a subq
uery used as an expression
pg_dump: The command was: SELECT (SELECT usename FROM pg_user WHERE usesysid = d
atdba
) as dba, pg_encoding_to_char(encoding) as encoding, datpath FROM pg_databa
se WHERE datname = 'radar'
pg_dump: *** aborted because of error
 
because there are two users with SYSID  = 1, the result is in red
 
I have try to run this command in single user mod
 

select * from pg_shadow;

delete from pg_shadow;

CREATE USER postgres WITH CREATEDB CREATEUSER PASSWORD '***' SYSID 1;

CREATE USER userradar PASSWORD '***'

but the result is the same.....
 
thanks.....
 
 
 
 
Daniel Ricardo Medina
Ing. de Desarrollo
Computec S.A.
Tel 260 71 11 Ext 1640
 


De: Daniel Ricardo Medina
Enviado el: Friday, March 23, 2007 6:38 PM
Para: pgsql-admin@postgresql.org
Asunto: AYUDA URGENTE CON TABLA PG_SHADOW

Ayuda por favor

Cuando ejecuto este comando me sale una doble tupla del usuario postgres, esto me error cuando ejecuto el comando Pg_dump para realizar el respectivo backup..

 

radar=# select * from pg_user;

usename | usesysid | usecreatedb | usesuper | usecatupd | passwd | valuntil | useconfig

-----------+----------+-------------+----------+-----------+----------+----------+-----------

postgres | 1 | t | t | t | ******** | |

postgres | 1 | t | t | t | ******** | |

userradar | 100 | f | f | f | ******** | |

(3 rows)

He intentado borrar los usuario directamente de la tabla subiendo postgres com postgres -O -P -D directorio

Y corriendo estos comandos..

select * from pg_shadow;

delete from pg_shadow;

CREATE USER postgres WITH CREATEDB CREATEUSER PASSWORD '***' SYSID 1; CREATE USER userradar PASSWORD '***'

Pero me crea los mismos tres usuarios en la tabla pg_shadow, el problema es que no he podido hacer el backup y ya no tengo espacio en la particion donde tengo montada la base

 

 

 

 

 

 

Ayuda por favor

Cuando ejecuto este comando me sale una doble tupla del usuario postgres, esto me error cuando ejecuto el comando Pg_dump para realizar el respectivo backup..

 

radar=# select * from pg_user;

usename | usesysid | usecreatedb | usesuper | usecatupd | passwd | valuntil | useconfig

-----------+----------+-------------+----------+-----------+----------+----------+-----------

postgres | 1 | t | t | t | ******** | |

postgres | 1 | t | t | t | ******** | |

userradar | 100 | f | f | f | ******** | |

(3 rows)

He intentado borrar los usuario directamente de la tabla subiendo postgres com postgres -O -P -D directorio

Y corriendo estos comandos..

select * from pg_shadow;

delete from pg_shadow;

CREATE USER postgres WITH CREATEDB CREATEUSER PASSWORD '***' SYSID 1; CREATE USER userradar PASSWORD '***'

Pero me crea los mismos tres usuarios en la tabla pg_shadow, el problema es que no he podido hacer el backup y ya no tengo espacio en la particion donde tengo montada la base

 

 

 

 

 

 
Daniel Ricardo Medina
Ing. de Desarrollo
Computec S.A.
Tel 260 71 11 Ext 1640
 


De: pgsql-admin-owner@postgresql.org [mailto:pgsql-admin-owner@postgresql.org] En nombre de Warren Little
Enviado el: Friday, March 23, 2007 6:16 PM
Para: pgsql-admin@postgresql.org
Asunto: [ADMIN] trying to run PITR recovery

Hello,
I'm testing my PITR recovery procedures and something doesn't look right.
The following is from the logs upon starting postgres with recovery.conf file in place

 2007-03-23 05:56:00 MDTLOG:  database system was interrupted at 2007-03-18 05:09:15 MDT
@ 2007-03-23 05:56:00 MDTLOG:  starting archive recovery
@ 2007-03-23 05:56:00 MDTLOG:  restore_command = "cp /data/pgLocal/archive/WAL_restore/%f "%p""
cp: cannot stat `/data/pgLocal/archive/WAL_restore/00000001.history': No such file or directory
@ 2007-03-23 05:56:00 MDTLOG:  restored log file "000000010000011A000000EE.004E0060.backup" from archive
@ 2007-03-23 05:56:00 MDTLOG:  restored log file "000000010000011A000000EE" from archive
@ 2007-03-23 05:56:00 MDTLOG:  checkpoint record is at 11A/EE4E0060
@ 2007-03-23 05:56:00 MDTLOG:  redo record is at 11A/EE4E0060; undo record is at 0/0; shutdown FALSE
@ 2007-03-23 05:56:00 MDTLOG:  next transaction ID: 2415965426; next OID: 81701223
@ 2007-03-23 05:56:00 MDTLOG:  next MultiXactId: 43380; next MultiXactOffset: 92368
@ 2007-03-23 05:56:00 MDTLOG:  automatic recovery in progress
@ 2007-03-23 05:56:01 MDTLOG:  redo starts at 11A/EE4E00B0
@ 2007-03-23 05:56:15 MDTLOG:  restored log file "000000010000011A000000EF" from archive
@ 2007-03-23 05:56:27 MDTLOG:  restored log file "000000010000011A000000F0" from archive
@ 2007-03-23 05:56:31 MDTLOG:  restored log file "000000010000011A000000F1" from archive
@ 2007-03-23 05:56:34 MDTLOG:  restored log file "000000010000011A000000F2" from archive
@ 2007-03-23 05:56:40 MDTLOG:  restored log file "000000010000011A000000F3" from archive
@ 2007-03-23 05:56:50 MDTLOG:  restored log file "000000010000011A000000F4" from archive
@ 2007-03-23 05:57:02 MDTLOG:  restored log file "000000010000011A000000F5" from archive
@ 2007-03-23 05:57:07 MDTLOG:  restored log file "000000010000011A000000F6" from archive
@ 2007-03-23 05:57:11 MDTLOG:  restored log file "000000010000011A000000F7" from archive
@ 2007-03-23 05:57:14 MDTLOG:  restored log file "000000010000011A000000F8" from archive
@ 2007-03-23 05:57:18 MDTLOG:  restored log file "000000010000011A000000F9" from archive
@ 2007-03-23 05:57:21 MDTLOG:  restored log file "000000010000011A000000FA" from archive
@ 2007-03-23 05:57:27 MDTLOG:  restored log file "000000010000011A000000FB" from archive
@ 2007-03-23 05:57:29 MDTLOG:  restored log file "000000010000011A000000FC" from archive
@ 2007-03-23 05:57:33 MDTLOG:  restored log file "000000010000011A000000FD" from archive
@ 2007-03-23 05:57:35 MDTLOG:  incorrect resource manager data checksum in record at 11A/FD492B20
@ 2007-03-23 05:57:35 MDTLOG:  redo done at 11A/FD492210
@ 2007-03-23 05:57:36 MDTLOG:  restored log file "000000010000011A000000FD" from archive
@ 2007-03-23 05:57:36 MDTLOG:  archive recovery complete
@ 2007-03-23 05:57:36 MDTLOG:  could not truncate directory "pg_multixact/members": apparent wraparound
@ 2007-03-23 05:59:00 MDTLOG:  database system is ready
@ 2007-03-23 05:59:00 MDTLOG:  transaction ID wrap limit is 3065701724, limited by database "postgres"
@ 2007-03-23 12:14:04 MDTLOG:  autovacuum: processing database "postgres"

My concern is that there were many more logfiles to be played following "00000010000011A000000FD" 
(ie 000000010000011E0000005C) yet it appears the recovery stop at that point and called it good.
I would assume all WAL logs would be restored.

The recovery.conf was simple: 
restore_command = 'cp /data/pgLocal/archive/WAL_restore/%f "%p"'

The backup file 000000010000011A000000EE.004E0060.backup" contained:
START WAL LOCATION: 11A/EE4E0060 (file 000000010000011A000000EE)
STOP WAL LOCATION: 11A/EFF68AB8 (file 000000010000011A000000EF)
CHECKPOINT LOCATION: 11A/EE4E0060
START TIME: 2007-03-17 20:29:16 MDT
LABEL: 076_pgdata.tar
STOP TIME: 2007-03-18 05:16:17 MDT

Does the line:  incorrect resource manager data checksum in record at 11A/FD492B20
mean there is a corrupted WAL log file?

Any insight here would be helpful 
version PG 8.1.2 64 bit Linux


thanks


Warren Little
Chief Technology Officer
Meridias Capital Inc
ph 866.369.7763



Re: URGENT TABLE PG_SHADOW CORRUTEP

From
Tom Lane
Date:
"Daniel Ricardo Medina" <dmedina@computec.com.co> writes:
> I have try to run this command in single user mod
>
> select * from pg_shadow;
> delete from pg_shadow;
> CREATE USER postgres WITH CREATEDB CREATEUSER PASSWORD '***' SYSID 1;
> CREATE USER userradar PASSWORD '***'

> but the result is the same.....

What do you see if you look in pg_shadow right after doing the DELETE?
What PG version is this?

            regards, tom lane

Re: URGENT TABLE PG_SHADOW CORRUTEP

From
"Peter Koczan"
Date:
I saw something along these lines recently when I was running stuff from postgres 7.4.7 (we recently upgraded to 8.2.3). Basically, a unique index somehow fails and allows duplicate rows...but I don't know why.

Deleting doesn't work because postgres only looks for the record shown by the index (though this may have been fixed in more recent releases). At least this is the behaviour I observed.

The way I fixed it was to drop the unique index, remove the offending rows, reinsert only one of the offending rows (so things would once again be unique), and remake the index. It worked.

There are two things that concern me:
1. Trying to do this on a system table is likely ill-advised at best, and dangerous at worst.
2. If the table only has 3 rows in it and it got corrupted, there's something really wrong. My table in question was a user table that had almost a million rows.

On 3/25/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Daniel Ricardo Medina" <dmedina@computec.com.co> writes:
> I have try to run this command in single user mod
>
> select * from pg_shadow;
> delete from pg_shadow;
> CREATE USER postgres WITH CREATEDB CREATEUSER PASSWORD '***' SYSID 1;
> CREATE USER userradar PASSWORD '***'

> but the result is the same.....

What do you see if you look in pg_shadow right after doing the DELETE?
What PG version is this?

                        regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

                http://www.postgresql.org/about/donate

Re: trying to run PITR recovery

From
Tom Lane
Date:
Warren Little <Warren.Little@MeridiasCapital.com> writes:
> I'm testing my PITR recovery procedures and something doesn't look
> right.
> The following is from the logs upon starting postgres with
> recovery.conf file in place

> @ 2007-03-23 05:57:33 MDTLOG:  restored log file
> "000000010000011A000000FD" from archive
> @ 2007-03-23 05:57:35 MDTLOG:  incorrect resource manager data
> checksum in record at 11A/FD492B20
> @ 2007-03-23 05:57:35 MDTLOG:  redo done at 11A/FD492210

> My concern is that there were many more logfiles to be played
> following "00000010000011A000000FD"
> (ie 000000010000011E0000005C) yet it appears the recovery stop at
> that point and called it good.

Perhaps you have a bad copy of that particular log segment file?
The "incorrect checksum" failure is normal in certain circumstances
(eg, where the system crashes partway through writing out a large
WAL record), which is why the code treats it as indicating the end
of WAL.  But in this case it looks more like corrupt data ...

            regards, tom lane

Re: URGENT TABLE PG_SHADOW CORRUTEP

From
"Daniel Ricardo Medina"
Date:
After delete all, postgres said no exist the user id 1...



The version is 7.4.8

Daniel Ricardo Medina
Ing. de Desarrollo
Computec S.A.
Tel 260 71 11 Ext 1640

-----Mensaje original-----
De: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Enviado el: Sunday, March 25, 2007 1:52 PM
Para: Daniel Ricardo Medina
CC: pgsql-admin@postgresql.org
Asunto: Re: [ADMIN] URGENT TABLE PG_SHADOW CORRUTEP

"Daniel Ricardo Medina" <dmedina@computec.com.co> writes:
> I have try to run this command in single user mod
>
> select * from pg_shadow;
> delete from pg_shadow;
> CREATE USER postgres WITH CREATEDB CREATEUSER PASSWORD '***' SYSID 1;
> CREATE USER userradar PASSWORD '***'

> but the result is the same.....

What do you see if you look in pg_shadow right after doing the DELETE?
What PG version is this?

            regards, tom lane

Re: URGENT TABLE PG_SHADOW CORRUTEP

From
Tom Lane
Date:
"Daniel Ricardo Medina" <dmedina@computec.com.co> writes:
> After delete all, postgres said no exist the user id 1...

Maybe reindexing pg_shadow would help.

            regards, tom lane

Re: URGENT TABLE PG_SHADOW CORRUTEP

From
"Daniel Ricardo Medina"
Date:
ok thank, in single mode  i try to erase two unique index on pg_shadow but postgres show this messages
 
this index is used by system postgres... can´t erase it....
 
 
 
Daniel Ricardo Medina
Ing. de Desarrollo
Computec S.A.
Tel 260 71 11 Ext 1640
 


De: Peter Koczan [mailto:pjkoczan@gmail.com]
Enviado el: Sunday, March 25, 2007 7:28 PM
Para: Tom Lane
CC: Daniel Ricardo Medina; pgsql-admin@postgresql.org
Asunto: Re: [ADMIN] URGENT TABLE PG_SHADOW CORRUTEP

I saw something along these lines recently when I was running stuff from postgres 7.4.7 (we recently upgraded to 8.2.3). Basically, a unique index somehow fails and allows duplicate rows...but I don't know why.

Deleting doesn't work because postgres only looks for the record shown by the index (though this may have been fixed in more recent releases). At least this is the behaviour I observed.

The way I fixed it was to drop the unique index, remove the offending rows, reinsert only one of the offending rows (so things would once again be unique), and remake the index. It worked.

There are two things that concern me:
1. Trying to do this on a system table is likely ill-advised at best, and dangerous at worst.
2. If the table only has 3 rows in it and it got corrupted, there's something really wrong. My table in question was a user table that had almost a million rows.

On 3/25/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Daniel Ricardo Medina" <dmedina@computec.com.co> writes:
> I have try to run this command in single user mod
>
> select * from pg_shadow;
> delete from pg_shadow;
> CREATE USER postgres WITH CREATEDB CREATEUSER PASSWORD '***' SYSID 1;
> CREATE USER userradar PASSWORD '***'

> but the result is the same.....

What do you see if you look in pg_shadow right after doing the DELETE?
What PG version is this?

                        regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

                http://www.postgresql.org/about/donate

Re: trying to run PITR recovery

From
"Simon Riggs"
Date:
On Fri, 2007-03-23 at 17:16 -0600, Warren Little wrote:

> My concern is that there were many more logfiles to be played
> following "00000010000011A000000FD"
> (ie 000000010000011E0000005C) yet it appears the recovery stop at that
> point and called it good.
> I would assume all WAL logs would be restored.

I'm interested in your feedback here. How would you like it to have
acted?

The WAL file was clearly corrupt.

1. Don't continue and don't come up. Have the recovery fail. In order to
bring the server up, we would have to restart recovery with an
additional command to say "I note that my recovery has failed and would
like recovery to come up at the last possible point."

2. Attempt to continue after we fail the CRC check. This is both
dangerous and in many cases won't work either, since this is one of the
normal ending points.

3. Continue after a CRC check, don't attempt to apply the records, just
look at them to determine if they look correct. i.e. see if the CRC
error applies to just that record

4. Add a command to ignore specific WAL records
ignore_record = '11A/FD492B20'

These may also not work very well at all, since many records depend upon
previous data changes, so could quickly end in further errors.

What would you suggest?

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: trying to run PITR recovery

From
Warren Little
Date:
Simon,
I have no issues with how the error was handled, just the notification that an error was encountered.

@ 2007-03-23 05:57:33 MDTLOG:  restored log file
"000000010000011A000000FD" from archive
@ 2007-03-23 05:57:35 MDTLOG:  incorrect resource manager data
checksum in record at 11A/FD492B20
@ 2007-03-23 05:57:35 MDTLOG:  redo done at 11A/FD492210

The first message says it restored the file, the second message looks like an error, but for myself, who does this process very seldom, its hard to tell what exactly transpired.

On slightly different topic, is there some way to determine the timeline of the corrupted segment, ie what was the original time of the last restored transaction.





On Mar 30, 2007, at 5:16 AM, Simon Riggs wrote:

On Fri, 2007-03-23 at 17:16 -0600, Warren Little wrote:

My concern is that there were many more logfiles to be played
following "00000010000011A000000FD" 
(ie 000000010000011E0000005C) yet it appears the recovery stop at that
point and called it good.
I would assume all WAL logs would be restored.

I'm interested in your feedback here. How would you like it to have
acted?

The WAL file was clearly corrupt. 

1. Don't continue and don't come up. Have the recovery fail. In order to
bring the server up, we would have to restart recovery with an
additional command to say "I note that my recovery has failed and would
like recovery to come up at the last possible point."

2. Attempt to continue after we fail the CRC check. This is both
dangerous and in many cases won't work either, since this is one of the
normal ending points.

3. Continue after a CRC check, don't attempt to apply the records, just
look at them to determine if they look correct. i.e. see if the CRC
error applies to just that record

4. Add a command to ignore specific WAL records
ignore_record = '11A/FD492B20'

These may also not work very well at all, since many records depend upon
previous data changes, so could quickly end in further errors.

What would you suggest?

-- 
  Simon Riggs             
  EnterpriseDB   http://www.enterprisedb.com



Warren Little
Chief Technology Officer
Meridias Capital Inc
ph 866.369.7763



Re: trying to run PITR recovery

From
"Simon Riggs"
Date:
On Fri, 2007-03-30 at 08:23 -0600, Warren Little wrote:

> On slightly different topic, is there some way to determine the
> timeline of the corrupted segment, ie what was the original time of
> the last restored transaction.

You need to examine the WAL files using xlogdump available from
pgfoundry.org - its not part of the main distribution yet.

Reporting the time of the last commit/abort record seen in the WAL seems
like an easy and useful addition to make.

e.g. LOG:  last transaction originally completed at 2007-02-23 05:57:35

Point-in-time recovery really ought to tell you the point-in-time it has
stopped at sounds blindingly obvious in retrospect.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: trying to run PITR recovery

From
"Simon Riggs"
Date:
On Sun, 2007-03-25 at 20:57 -0400, Tom Lane wrote:
> Warren Little <Warren.Little@MeridiasCapital.com> writes:
> > I'm testing my PITR recovery procedures and something doesn't look
> > right.
> > The following is from the logs upon starting postgres with
> > recovery.conf file in place
>
> > @ 2007-03-23 05:57:33 MDTLOG:  restored log file
> > "000000010000011A000000FD" from archive
> > @ 2007-03-23 05:57:35 MDTLOG:  incorrect resource manager data
> > checksum in record at 11A/FD492B20
> > @ 2007-03-23 05:57:35 MDTLOG:  redo done at 11A/FD492210
>
> > My concern is that there were many more logfiles to be played
> > following "00000010000011A000000FD"
> > (ie 000000010000011E0000005C) yet it appears the recovery stop at
> > that point and called it good.
>
> Perhaps you have a bad copy of that particular log segment file?
> The "incorrect checksum" failure is normal in certain circumstances
> (eg, where the system crashes partway through writing out a large
> WAL record), which is why the code treats it as indicating the end
> of WAL.  But in this case it looks more like corrupt data ...

Tom,

I think there is a problem here. If we stop before the end of logs we
should be incrementing the timeline id.

I'm not going to think about it right now, but I'll think on that next
week. Not the most likely problem we'll ever see, I grant you.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: trying to run PITR recovery

From
Tom Lane
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> I think there is a problem here. If we stop before the end of logs we
> should be incrementing the timeline id.

There is no good reason here to think that we have stopped before the
end of logs, and I don't think I want the code bumping the timeline ID
on every crash restart.

            regards, tom lane

Re: trying to run PITR recovery

From
"Simon Riggs"
Date:
On Fri, 2007-03-30 at 12:40 -0400, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > I think there is a problem here. If we stop before the end of logs we
> > should be incrementing the timeline id.
>
> There is no good reason here to think that we have stopped before the
> end of logs, and I don't think I want the code bumping the timeline ID
> on every crash restart.

The timeline is a protection against confusing ourselves when we have
two log files both called the same thing. In the OP's case, there were
clearly unapplied log files that end up as duplicates. I can see the
difficulty in knowing whether or not to bump the timeline id.

At very least we need to document this, since if the manual's advice
were taken "The archive command should generally be designed to refuse
to overwrite any pre-existing archive file." then the OP's system would
start throwing errors when the first xlog fills after the recovered
system re-enters normal operation.

We should say:

"During recovery it is possible, if you're unlucky, that one of the WAL
files has been damaged. If so, recovery will stop at the point at which
the damage has occurred. It is probable that WAL files higher than the
damaged WAL file exist in the archive. If that is the case, you may need
to begin archiving to a different location, or move the earlier WAL
files out of the archive, to allow the newly restored server to continue
archive operations correctly. If you don't, the server will operate
normally but further archiving may not occur correctly. Take good care
of your archived WAL files or better still take two copies.".

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com