Thread: Cluster seems broken after pg_basebackup

Cluster seems broken after pg_basebackup

From

Guillaume Drolet

Date:

06 February 2015, 13:03:58

Hi,

Yesterday I ran a pg_basebackup of my cluster. Since it has completed, my cluster doesn't work properly. I tried restarting the computer (or service) a few times but I always get the same messages in my logs (it's in French. If someone is willing to help me I can try to translate the logs. Just ask):

2015-02-06 07:11:38 EST LOG: le système de bases de données a été interrompu ; dernier lancement connu à 2015-02-06 07:05:05 EST
2015-02-06 07:11:38 EST LOG: le système de bases de données n'a pas été arrêté proprement ; restauration
automatique en cours
2015-02-06 07:11:38 EST LOG: record with zero length at 24B/2C000160
2015-02-06 07:11:38 EST LOG: la ré-exécution n'est pas nécessaire
2015-02-06 07:11:38 EST LOG: le système de bases de données est prêt pour accepter les connexions
2015-02-06 07:11:38 EST LOG: lancement du processus autovacuum
2015-02-06 07:11:38 EST FATAL: le rôle « 208375PT$ » n'existe pas

Then if I start pgAdmin I get a series of pop-ups I have to click OK to to continue:

An error has ocurred: Column not found in pgSet: "datlastsysoid"
An error has ocurred: Column not found in pgSet: datlastsysoid
An error has ocurred: Column not found in pgSet: oid
An error has ocurred: Column not found in pgSet: encoding
An error has ocurred: Column not found in pgSet: Connection to database broken

And after that, I went back to the log file and there's new information added:

2015-02-06 07:51:05 EST LOG: processus serveur (PID 184) a été arrêté par l'exception 0x80000004
2015-02-06 07:51:05 EST DÉTAIL: Le processus qui a échoué exécutait : SELECT version();
2015-02-06 07:51:05 EST ASTUCE : Voir le fichier d'en-tête C « ntstatus.h » pour une description de la valeur
    hexadécimale.
2015-02-06 07:51:05 EST LOG: arrêt des autres processus serveur actifs
2015-02-06 07:51:05 EST ATTENTION: arrêt de la connexion à cause de l'arrêt brutal d'un autre processus serveur
2015-02-06 07:51:05 EST DÉTAIL: Le postmaster a commandé à ce processus serveur d'annuler la transaction
    courante et de quitter car un autre processus serveur a quitté anormalement
    et qu'il existe probablement de la mémoire partagée corrompue.
2015-02-06 07:51:05 EST ASTUCE : Dans un moment, vous devriez être capable de vous reconnecter à la base de
    données et de relancer votre commande.
2015-02-06 07:51:05 EST LOG: processus d'archivage (PID 692) quitte avec le code de sortie 1
2015-02-06 07:51:05 EST LOG: tous les processus serveur se sont arrêtés, réinitialisation
2015-02-06 07:51:15 EST FATAL: le bloc de mémoire partagé pré-existant est toujours en cours d'utilisation
2015-02-06 07:51:15 EST ASTUCE : Vérifier s'il n'y a pas de vieux processus serveur en cours d'exécution. Si c'est le
    cas, fermez-les.

I was about to try restarting postgresql using the base backup I made yesterday but since this means I'll have to copy my database again (700 GB takes a while...) I am looking for a better solution from more experienced people.

Thanks a lot for helping! Guillaume

Re: Cluster seems broken after pg_basebackup

From

Adrian Klaver

Date:

06 February 2015, 16:29:03

On 02/06/2015 05:03 AM, Guillaume Drolet wrote:
> Hi,
>
> Yesterday I ran a pg_basebackup of my cluster. Since it has completed,
> my cluster doesn't work properly. I tried restarting the computer (or
> service) a few times but I always get the same messages in my logs (it's
> in French. If someone is willing to help me I can try to translate the
> logs. Just ask):

Enter Google Translate:)

First some questions:

1) What Postgres version?

2) What OS(s)? I am assuming Windows from the log info below, but we all
know what assuming gets you.

3) Where were you backing up from and to?

4) Which cluster does not start, the master or the child you created
with pg_basebackup?

>
> 2015-02-06 07:11:38 EST LOG:  le système de bases de données a été
> interrompu ; dernier lancement connu à 2015-02-06 07:05:05 EST
> 2015-02-06 07:11:38 EST LOG:  le système de bases de données n'a pas été
> arrêté proprement ; restauration
>      automatique en cours
> 2015-02-06 07:11:38 EST LOG:  record with zero length at 24B/2C000160
> 2015-02-06 07:11:38 EST LOG:  la ré-exécution n'est pas nécessaire
> 2015-02-06 07:11:38 EST LOG:  le système de bases de données est prêt
> pour accepter les connexions
> 2015-02-06 07:11:38 EST LOG:  lancement du processus autovacuum
> 2015-02-06 07:11:38 EST FATAL:  le rôle « 208375PT$ » n'existe pas

So where is role 208375PT$ supposed to come from?

>
> Then if I start pgAdmin I get a series of pop-ups I have to click OK to
> to continue:
>
> An error has ocurred: Column not found in pgSet: "datlastsysoid"
> An error has ocurred: Column not found in pgSet: datlastsysoid
> An error has ocurred: Column not found in pgSet: oid
> An error has ocurred: Column not found in pgSet: encoding
> An error has ocurred: Column not found in pgSet: Connection to database
> broken

Not sure about that this, someone more versed in pgAdmin will have to
answer.

>
> And after that, I went back to the log file and there's new information
> added:
>
> 2015-02-06 07:51:05 EST LOG:  processus serveur (PID 184) a été arrêté
> par l'exception 0x80000004
> 2015-02-06 07:51:05 EST DÉTAIL:  Le processus qui a échoué exécutait :
> SELECT version();
> 2015-02-06 07:51:05 EST ASTUCE :  Voir le fichier d'en-tête C «
> ntstatus.h » pour une description de la valeur
>      hexadécimale.

Well according to here:

https://msdn.microsoft.com/en-us/library/cc704588.aspx

0x80000004
STATUS_SINGLE_STEP


{EXCEPTION} Single Step A single step or trace operation has just been
completed.

A developer is going to have explain what that means.


> 2015-02-06 07:51:05 EST LOG:  arrêt des autres processus serveur actifs
> 2015-02-06 07:51:05 EST ATTENTION:  arrêt de la connexion à cause de
> l'arrêt brutal d'un autre processus serveur
> 2015-02-06 07:51:05 EST DÉTAIL:  Le postmaster a commandé à ce processus
> serveur d'annuler la transaction
>      courante et de quitter car un autre processus serveur a quitté
> anormalement
>      et qu'il existe probablement de la mémoire partagée corrompue.
> 2015-02-06 07:51:05 EST ASTUCE :  Dans un moment, vous devriez être
> capable de vous reconnecter à la base de
>      données et de relancer votre commande.
> 2015-02-06 07:51:05 EST LOG:  processus d'archivage (PID 692) quitte
> avec le code de sortie 1
> 2015-02-06 07:51:05 EST LOG:  tous les processus serveur se sont
> arrêtés, réinitialisation
> 2015-02-06 07:51:15 EST FATAL:  le bloc de mémoire partagé pré-existant
> est toujours en cours d'utilisation
> 2015-02-06 07:51:15 EST ASTUCE :  Vérifier s'il n'y a pas de vieux
> processus serveur en cours d'exécution. Si c'est le
>      cas, fermez-les.
>
> I was about to try restarting postgresql using the base backup I made
> yesterday but since this means I'll have to copy my database again (700
> GB takes a while...) I am looking for a better solution from more
> experienced people.


My suspicion is you copied at least partly over a running server.

>
> Thanks a lot for helping! Guillaume
>


--
Adrian Klaver
adrian.klaver@aklaver.com

Re: Cluster seems broken after pg_basebackup

From

Guillaume Drolet

Date:

06 February 2015, 17:18:12

Dear Adrian,

Thanks for helping me. Sorry for the lack of details, I had said to myself I had to not forget to give these details but I hit the send button too fast. You know how it is...

I added more info in your reply below.

2015-02-06 11:28 GMT-05:00 Adrian Klaver <adrian.klaver@aklaver.com>:

On 02/06/2015 05:03 AM, Guillaume Drolet wrote:
Hi,

Yesterday I ran a pg_basebackup of my cluster. Since it has completed,
my cluster doesn't work properly. I tried restarting the computer (or
service) a few times but I always get the same messages in my logs (it's
in French. If someone is willing to help me I can try to translate the
logs. Just ask):

Enter Google Translate:)

Not a big fan. I've seen bad misunderstandings happen there!

First some questions:

1) What Postgres version?

9.3

2) What OS(s)? I am assuming Windows from the log info below, but we all know what assuming gets you.

Windows 7

3) Where were you backing up from and to?

Backing up from my only cluster (PGDATA) on disk E, to a backup directory on an other disk (F:) using this command:

pg_basebackup -D "F:\\db_base_backup" -Fp -Xs -R -P --label="basebackup20150205" --username=postgres

What's weird is that I did some successful tests last week on the same system (backing up, archiving, recovering) using the same procedure. Only difference was the cluster, which was much smaller for testing purposes, but located at the same place (i.e. E:\data) and PostgresSQL installed in C:\Programs\...

4) Which cluster does not start, the master or the child you created with pg_basebackup?

The master. I haven't tried the child yet. But I saw that the message about role "208375PT$" is in logs from before the backup too.

2015-02-06 07:11:38 EST LOG: le système de bases de données a été
interrompu ; dernier lancement connu à 2015-02-06 07:05:05 EST
2015-02-06 07:11:38 EST LOG: le système de bases de données n'a pas été
arrêté proprement ; restauration
automatique en cours
2015-02-06 07:11:38 EST LOG: record with zero length at 24B/2C000160
2015-02-06 07:11:38 EST LOG: la ré-exécution n'est pas nécessaire
2015-02-06 07:11:38 EST LOG: le système de bases de données est prêt
pour accepter les connexions
2015-02-06 07:11:38 EST LOG: lancement du processus autovacuum
2015-02-06 07:11:38 EST FATAL: le rôle « 208375PT$ » n'existe pas

So where is role 208375PT$ supposed to come from?

This is the local domain of my machine. I log onto my machine with a local admin account and using domain name 208375PT (I didn't set this part of my machine, the IT guys here at work did). The thing is: I don't understand why it's there in the log file??

Then if I start pgAdmin I get a series of pop-ups I have to click OK to
to continue:

An error has ocurred: Column not found in pgSet: "datlastsysoid"
An error has ocurred: Column not found in pgSet: datlastsysoid
An error has ocurred: Column not found in pgSet: oid
An error has ocurred: Column not found in pgSet: encoding
An error has ocurred: Column not found in pgSet: Connection to database
broken

Not sure about that this, someone more versed in pgAdmin will have to answer.

And after that, I went back to the log file and there's new information
added:

2015-02-06 07:51:05 EST LOG: processus serveur (PID 184) a été arrêté
par l'exception 0x80000004
2015-02-06 07:51:05 EST DÉTAIL: Le processus qui a échoué exécutait :
SELECT version();
2015-02-06 07:51:05 EST ASTUCE : Voir le fichier d'en-tête C «
ntstatus.h » pour une description de la valeur
hexadécimale.

Well according to here:

https://msdn.microsoft.com/en-us/library/cc704588.aspx

0x80000004
STATUS_SINGLE_STEP

{EXCEPTION} Single Step A single step or trace operation has just been completed.

A developer is going to have explain what that means.

2015-02-06 07:51:05 EST LOG: arrêt des autres processus serveur actifs
2015-02-06 07:51:05 EST ATTENTION: arrêt de la connexion à cause de
l'arrêt brutal d'un autre processus serveur
2015-02-06 07:51:05 EST DÉTAIL: Le postmaster a commandé à ce processus
serveur d'annuler la transaction
courante et de quitter car un autre processus serveur a quitté
anormalement
et qu'il existe probablement de la mémoire partagée corrompue.
2015-02-06 07:51:05 EST ASTUCE : Dans un moment, vous devriez être
capable de vous reconnecter à la base de
données et de relancer votre commande.
2015-02-06 07:51:05 EST LOG: processus d'archivage (PID 692) quitte
avec le code de sortie 1
2015-02-06 07:51:05 EST LOG: tous les processus serveur se sont
arrêtés, réinitialisation
2015-02-06 07:51:15 EST FATAL: le bloc de mémoire partagé pré-existant
est toujours en cours d'utilisation
2015-02-06 07:51:15 EST ASTUCE : Vérifier s'il n'y a pas de vieux
processus serveur en cours d'exécution. Si c'est le
cas, fermez-les.

I was about to try restarting postgresql using the base backup I made
yesterday but since this means I'll have to copy my database again (700
GB takes a while...) I am looking for a better solution from more
experienced people.

My suspicion is you copied at least partly over a running server.

How would that be possible? Using the pg_basebackup command I wrote above, it is clear that I wrote the backup on disk F and not E.

While writing this post, I started my backup using:

pg_ctl start -D "F:\db_basebackup"

Similar stuff happened with pgAdmin and the log (message about symbolic link is related to my post from yesterday. I don't know if this could be involved in the current problem):

2015-02-06 12:13:58 EST LOG: le système de bases de données a été interrompu ; dernier lancement connu à 2015-02-05 14:30:34 EST
2015-02-06 12:13:58 EST LOG: création du répertoire manquant « pg_xlog/archive_status » pour les journaux de transactions
2015-02-06 12:13:58 EST LOG: la ré-exécution commence à 24B/28000090
2015-02-06 12:13:58 EST LOG: n'a pas pu supprimer le lien symbolique « pg_tblspc/940585 » : No such file or directory
2015-02-06 12:13:58 EST CONTEXTE : xlog redo drop tablespace: 940585
2015-02-06 12:13:58 EST LOG: état de restauration cohérent atteint à 24B/290000B8
2015-02-06 12:13:58 EST LOG: ré-exécution faite à 24B/290000B8
2015-02-06 12:13:58 EST LOG: la dernière transaction a eu lieu à 2015-02-05 09:06:04.892-05 (moment de la journalisation)
2015-02-06 12:13:59 EST LOG: le système de bases de données est prêt pour accepter les connexions
2015-02-06 12:13:59 EST LOG: lancement du processus autovacuum
2015-02-06 12:14:42 EST LOG: processus serveur (PID 1784) a été arrêté par l'exception 0x80000004
2015-02-06 12:14:42 EST DÉTAIL: Le processus qui a échoué exécutait : SELECT version();
2015-02-06 12:14:42 EST ASTUCE : Voir le fichier d'en-tête C « ntstatus.h » pour une description de la valeur
    hexadécimale.
2015-02-06 12:14:42 EST LOG: arrêt des autres processus serveur actifs
2015-02-06 12:14:42 EST ATTENTION: arrêt de la connexion à cause de l'arrêt brutal d'un autre processus serveur
2015-02-06 12:14:42 EST DÉTAIL: Le postmaster a commandé à ce processus serveur d'annuler la transaction
    courante et de quitter car un autre processus serveur a quitté anormalement
    et qu'il existe probablement de la mémoire partagée corrompue.
2015-02-06 12:14:42 EST ASTUCE : Dans un moment, vous devriez être capable de vous reconnecter à la base de
    données et de relancer votre commande.
2015-02-06 12:14:42 EST LOG: tous les processus serveur se sont arrêtés, réinitialisation

Any ideas where to go from here?

Thanks a lot again.

Thanks a lot for helping! Guillaume

--
Adrian Klaver
adrian.klaver@aklaver.com

Re: Cluster seems broken after pg_basebackup

From

Adrian Klaver

Date:

06 February 2015, 23:45:00

On 02/06/2015 09:17 AM, Guillaume Drolet wrote:
> Dear Adrian,
>
> Thanks for helping me. Sorry for the lack of details, I had said to
> myself I had to not forget to give these details but I hit the send
> button too fast. You know how it is...
>
> I added more info in your reply below.
>
>
>     First some questions:
>
>     1) What Postgres version?
>
>
> 9.3
>

> Windows 7
>
>
>     3) Where were you backing up from and to?
>
>
> Backing up from my only cluster (PGDATA) on disk E, to a backup
> directory on an other disk (F:) using this command:
>
> pg_basebackup -D "F:\\db_base_backup" -Fp -Xs  -R -P
> --label="basebackup20150205" --username=postgres
>
> What's weird is that I did some successful tests last week on the same
> system (backing up, archiving, recovering) using the same procedure.
> Only difference was the cluster, which was much smaller for testing
> purposes, but located at the same place (i.e. E:\data) and PostgresSQL
> installed in C:\Programs\...
>
>
>     4) Which cluster does not start, the master or the child you created
>     with pg_basebackup?
>
>
>
> The master. I haven't tried the child yet. But I saw that the message
> about role "208375PT$" is in logs from before the backup too.
>

> This is the local domain of my machine. I log onto my machine with a
> local admin account and using domain name 208375PT (I didn't set this
> part of my machine, the IT guys here at work did). The thing is: I don't
> understand why it's there in the log file??

Not sure.

What are you using for an authentication method for database login?

>

>         And after that, I went back to the log file and there's new
>         information
>         added:
>
>         2015-02-06 07:51:05 EST LOG:  processus serveur (PID 184) a été
>         arrêté
>         par l'exception 0x80000004
>         2015-02-06 07:51:05 EST DÉTAIL:  Le processus qui a échoué
>         exécutait :
>         SELECT version();
>         2015-02-06 07:51:05 EST ASTUCE :  Voir le fichier d'en-tête C «
>         ntstatus.h » pour une description de la valeur
>               hexadécimale.
>
>
>     Well according to here:
>
>     https://msdn.microsoft.com/en-__us/library/cc704588.aspx
>     <https://msdn.microsoft.com/en-us/library/cc704588.aspx>
>
>     0x80000004
>     STATUS_SINGLE_STEP
>
>
>     {EXCEPTION} Single Step A single step or trace operation has just
>     been completed.
>
>     A developer is going to have explain what that means.
>

>
>
>     My suspicion is you copied at least partly over a running server.
>
>
> How would that be possible? Using the pg_basebackup command I wrote
> above, it is clear that I wrote the backup on disk F and not E.

I was just speculating, I would not put too much stock in it.

>
> While writing this post, I started my backup using:
>
> pg_ctl start -D "F:\db_basebackup"
>
> Similar stuff happened with pgAdmin and the log (message about symbolic
> link is related to my post from yesterday. I don't know if this could be
> involved in the current problem):
>
> 2015-02-06 12:13:58 EST LOG:  le système de bases de données a été
> interrompu ; dernier lancement connu à 2015-02-05 14:30:34 EST
> 2015-02-06 12:13:58 EST LOG:  création du répertoire manquant «
> pg_xlog/archive_status » pour les journaux de transactions
> 2015-02-06 12:13:58 EST LOG:  la ré-exécution commence à 24B/28000090
> 2015-02-06 12:13:58 EST LOG:  n'a pas pu supprimer le lien symbolique «
> pg_tblspc/940585 » : No such file or directory
> 2015-02-06 12:13:58 EST CONTEXTE :  xlog redo drop tablespace: 940585
> 2015-02-06 12:13:58 EST LOG:  état de restauration cohérent atteint à
> 24B/290000B8
> 2015-02-06 12:13:58 EST LOG:  ré-exécution faite à 24B/290000B8
> 2015-02-06 12:13:58 EST LOG:  la dernière transaction a eu lieu à
> 2015-02-05 09:06:04.892-05 (moment de la journalisation)
> 2015-02-06 12:13:59 EST LOG:  le système de bases de données est prêt
> pour accepter les connexions
> 2015-02-06 12:13:59 EST LOG:  lancement du processus autovacuum
> 2015-02-06 12:14:42 EST LOG:  processus serveur (PID 1784) a été arrêté
> par l'exception 0x80000004
> 2015-02-06 12:14:42 EST DÉTAIL:  Le processus qui a échoué exécutait :
> SELECT version();
> 2015-02-06 12:14:42 EST ASTUCE :  Voir le fichier d'en-tête C «
> ntstatus.h » pour une description de la valeur
>      hexadécimale.
> 2015-02-06 12:14:42 EST LOG:  arrêt des autres processus serveur actifs
> 2015-02-06 12:14:42 EST ATTENTION:  arrêt de la connexion à cause de
> l'arrêt brutal d'un autre processus serveur
> 2015-02-06 12:14:42 EST DÉTAIL:  Le postmaster a commandé à ce processus
> serveur d'annuler la transaction
>      courante et de quitter car un autre processus serveur a quitté
> anormalement
>      et qu'il existe probablement de la mémoire partagée corrompue.
> 2015-02-06 12:14:42 EST ASTUCE :  Dans un moment, vous devriez être
> capable de vous reconnecter à la base de
>      données et de relancer votre commande.
> 2015-02-06 12:14:42 EST LOG:  tous les processus serveur se sont
> arrêtés, réinitialisation
>
>
> Any ideas where to go from here?

In both cases the database got to the point below, which would seem to
indicate everything was alright.

2015-02-06 7:11:38 ET LOG: the re-execution is not required
2015-02-06 7:11:38 ET LOG: the database system is ready for
accept connections

Also from what I can see the server crashed at this point:

2015-02-06 12:13:59 LOG IS: launch autovacuum processes
2015-02-06 12:14:42 EST LOG: server process (PID 1784) was arrested by
the exception 0x80000004


Now 0x80000004 is supposed to mean:

STATUS_SINGLE_STEP


{EXCEPTION} Single Step A single step or trace operation has just been
completed.

Some digging indicates this is the result of debugger command. Have no
idea how that would invoked in Postgres running production code. This
leads to my default question when I see unexplained behavior on a
Windows machine; do you have anti-virus machine running against the drives?

>
> Thanks a lot again.
>
>
>         Thanks a lot for helping! Guillaume
>
>
>
>     --
>     Adrian Klaver
>     adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com

Re: Cluster seems broken after pg_basebackup

From

Guillaume Lelarge

Date:

07 February 2015, 06:24:50

Le 6 févr. 2015 17:31, "Adrian Klaver" <adrian.klaver@aklaver.com> a écrit :
>
> On 02/06/2015 05:03 AM, Guillaume Drolet wrote:
>>
>> Hi,
>>
>> Yesterday I ran a pg_basebackup of my cluster. Since it has completed,
>> my cluster doesn't work properly. I tried restarting the computer (or
>> service) a few times but I always get the same messages in my logs (it's
>> in French. If someone is willing to help me I can try to translate the
>> logs. Just ask):
>
>
> Enter Google Translate:)
>

But first, Guillaume, do yourself and everyone else a favor: turn the dam log into English. Set lc_messages to 'C' in postgresql.conf.

> First some questions:
>
> 1) What Postgres version?
>
> 2) What OS(s)? I am assuming Windows from the log info below, but we all know what assuming gets you.
>
> 3) Where were you backing up from and to?
>
> 4) Which cluster does not start, the master or the child you created with pg_basebackup?
>
>
>>
>> 2015-02-06 07:11:38 EST LOG: le système de bases de données a été
>> interrompu ; dernier lancement connu à 2015-02-06 07:05:05 EST
>> 2015-02-06 07:11:38 EST LOG: le système de bases de données n'a pas été
>> arrêté proprement ; restauration
>> automatique en cours
>> 2015-02-06 07:11:38 EST LOG: record with zero length at 24B/2C000160
>> 2015-02-06 07:11:38 EST LOG: la ré-exécution n'est pas nécessaire
>> 2015-02-06 07:11:38 EST LOG: le système de bases de données est prêt
>> pour accepter les connexions
>> 2015-02-06 07:11:38 EST LOG: lancement du processus autovacuum
>> 2015-02-06 07:11:38 EST FATAL: le rôle « 208375PT$ » n'existe pas
>
>
> So where is role 208375PT$ supposed to come from?
>
>
>>
>> Then if I start pgAdmin I get a series of pop-ups I have to click OK to
>> to continue:
>>
>> An error has ocurred: Column not found in pgSet: "datlastsysoid"
>> An error has ocurred: Column not found in pgSet: datlastsysoid
>> An error has ocurred: Column not found in pgSet: oid
>> An error has ocurred: Column not found in pgSet: encoding
>> An error has ocurred: Column not found in pgSet: Connection to database
>> broken
>
>
> Not sure about that this, someone more versed in pgAdmin will have to answer.
>

Usually you see these messages when you're using a pgadmin major release older than a PostgreSQL make release. For a 9.3 release, that would mean a pgadmin older than 1.18.

>
>>
>> And after that, I went back to the log file and there's new information
>> added:
>>
>> 2015-02-06 07:51:05 EST LOG: processus serveur (PID 184) a été arrêté
>> par l'exception 0x80000004
>> 2015-02-06 07:51:05 EST DÉTAIL: Le processus qui a échoué exécutait :
>> SELECT version();
>> 2015-02-06 07:51:05 EST ASTUCE : Voir le fichier d'en-tête C «
>> ntstatus.h » pour une description de la valeur
>> hexadécimale.
>
>
> Well according to here:
>
> https://msdn.microsoft.com/en-us/library/cc704588.aspx
>
> 0x80000004
> STATUS_SINGLE_STEP
>
>
> {EXCEPTION} Single Step A single step or trace operation has just been completed.
>
> A developer is going to have explain what that means.
>
>
>
>> 2015-02-06 07:51:05 EST LOG: arrêt des autres processus serveur actifs
>> 2015-02-06 07:51:05 EST ATTENTION: arrêt de la connexion à cause de
>> l'arrêt brutal d'un autre processus serveur
>> 2015-02-06 07:51:05 EST DÉTAIL: Le postmaster a commandé à ce processus
>> serveur d'annuler la transaction
>> courante et de quitter car un autre processus serveur a quitté
>> anormalement
>> et qu'il existe probablement de la mémoire partagée corrompue.
>> 2015-02-06 07:51:05 EST ASTUCE : Dans un moment, vous devriez être
>> capable de vous reconnecter à la base de
>> données et de relancer votre commande.
>> 2015-02-06 07:51:05 EST LOG: processus d'archivage (PID 692) quitte
>> avec le code de sortie 1
>> 2015-02-06 07:51:05 EST LOG: tous les processus serveur se sont
>> arrêtés, réinitialisation
>> 2015-02-06 07:51:15 EST FATAL: le bloc de mémoire partagé pré-existant
>> est toujours en cours d'utilisation
>> 2015-02-06 07:51:15 EST ASTUCE : Vérifier s'il n'y a pas de vieux
>> processus serveur en cours d'exécution. Si c'est le
>> cas, fermez-les.
>>
>> I was about to try restarting postgresql using the base backup I made
>> yesterday but since this means I'll have to copy my database again (700
>> GB takes a while...) I am looking for a better solution from more
>> experienced people.
>
>
>
> My suspicion is you copied at least partly over a running server.
>

Re: Cluster seems broken after pg_basebackup

From

Guillaume Drolet

Date:

09 February 2015, 16:37:58

2015-02-07 1:24 GMT-05:00 Guillaume Lelarge <guillaume@lelarge.info>:

Le 6 févr. 2015 17:31, "Adrian Klaver" <adrian.klaver@aklaver.com> a écrit :
>
> On 02/06/2015 05:03 AM, Guillaume Drolet wrote:
>>
>> Hi,
>>
>> Yesterday I ran a pg_basebackup of my cluster. Since it has completed,
>> my cluster doesn't work properly. I tried restarting the computer (or
>> service) a few times but I always get the same messages in my logs (it's
>> in French. If someone is willing to help me I can try to translate the
>> logs. Just ask):
>
>
> Enter Google Translate:)
>
But first, Guillaume, do yourself and everyone else a favor: turn the dam log into English. Set lc_messages to 'C' in postgresql.conf.

Thanks for this! I didn't know about this great feature.

> First some questions:
>
> 1) What Postgres version?
>
> 2) What OS(s)? I am assuming Windows from the log info below, but we all know what assuming gets you.
>
> 3) Where were you backing up from and to?
>
> 4) Which cluster does not start, the master or the child you created with pg_basebackup?
>
>
>>
>> 2015-02-06 07:11:38 EST LOG: le système de bases de données a été
>> interrompu ; dernier lancement connu à 2015-02-06 07:05:05 EST
>> 2015-02-06 07:11:38 EST LOG: le système de bases de données n'a pas été
>> arrêté proprement ; restauration
>> automatique en cours
>> 2015-02-06 07:11:38 EST LOG: record with zero length at 24B/2C000160
>> 2015-02-06 07:11:38 EST LOG: la ré-exécution n'est pas nécessaire
>> 2015-02-06 07:11:38 EST LOG: le système de bases de données est prêt
>> pour accepter les connexions
>> 2015-02-06 07:11:38 EST LOG: lancement du processus autovacuum
>> 2015-02-06 07:11:38 EST FATAL: le rôle « 208375PT$ » n'existe pas
>
>
> So where is role 208375PT$ supposed to come from?
>
>
>>
>> Then if I start pgAdmin I get a series of pop-ups I have to click OK to
>> to continue:
>>
>> An error has ocurred: Column not found in pgSet: "datlastsysoid"
>> An error has ocurred: Column not found in pgSet: datlastsysoid
>> An error has ocurred: Column not found in pgSet: oid
>> An error has ocurred: Column not found in pgSet: encoding
>> An error has ocurred: Column not found in pgSet: Connection to database
>> broken
>
>
> Not sure about that this, someone more versed in pgAdmin will have to answer.
>
Usually you see these messages when you're using a pgadmin major release older than a PostgreSQL make release. For a 9.3 release, that would mean a pgadmin older than 1.18.

I'm running pgadmin 1.18.1

>
>>
>> And after that, I went back to the log file and there's new information
>> added:
>>
>> 2015-02-06 07:51:05 EST LOG: processus serveur (PID 184) a été arrêté
>> par l'exception 0x80000004
>> 2015-02-06 07:51:05 EST DÉTAIL: Le processus qui a échoué exécutait :
>> SELECT version();
>> 2015-02-06 07:51:05 EST ASTUCE : Voir le fichier d'en-tête C «
>> ntstatus.h » pour une description de la valeur
>> hexadécimale.
>
>
> Well according to here:
>
> https://msdn.microsoft.com/en-us/library/cc704588.aspx
>
> 0x80000004
> STATUS_SINGLE_STEP
>
>
> {EXCEPTION} Single Step A single step or trace operation has just been completed.
>
> A developer is going to have explain what that means.
>
>
>
>> 2015-02-06 07:51:05 EST LOG: arrêt des autres processus serveur actifs
>> 2015-02-06 07:51:05 EST ATTENTION: arrêt de la connexion à cause de
>> l'arrêt brutal d'un autre processus serveur
>> 2015-02-06 07:51:05 EST DÉTAIL: Le postmaster a commandé à ce processus
>> serveur d'annuler la transaction
>> courante et de quitter car un autre processus serveur a quitté
>> anormalement
>> et qu'il existe probablement de la mémoire partagée corrompue.
>> 2015-02-06 07:51:05 EST ASTUCE : Dans un moment, vous devriez être
>> capable de vous reconnecter à la base de
>> données et de relancer votre commande.
>> 2015-02-06 07:51:05 EST LOG: processus d'archivage (PID 692) quitte
>> avec le code de sortie 1
>> 2015-02-06 07:51:05 EST LOG: tous les processus serveur se sont
>> arrêtés, réinitialisation
>> 2015-02-06 07:51:15 EST FATAL: le bloc de mémoire partagé pré-existant
>> est toujours en cours d'utilisation
>> 2015-02-06 07:51:15 EST ASTUCE : Vérifier s'il n'y a pas de vieux
>> processus serveur en cours d'exécution. Si c'est le
>> cas, fermez-les.
>>
>> I was about to try restarting postgresql using the base backup I made
>> yesterday but since this means I'll have to copy my database again (700
>> GB takes a while...) I am looking for a better solution from more
>> experienced people.
>
>
>
> My suspicion is you copied at least partly over a running server.
>

Re: Cluster seems broken after pg_basebackup

From

Guillaume Drolet

Date:

09 February 2015, 17:40:43

I tried starting the cluster again. Once again everything looked fine at the start (first three lines of this log, in English this time):

2015-02-09 11:40:55 EST LOG: database system was shut down at 2015-02-06 09:50:21 EST
2015-02-09 11:40:55 EST LOG: database system is ready to accept connections
2015-02-09 11:40:55 EST LOG: autovacuum launcher started

Since it seemed to work, I opened the terminal and tried connecting to the database:

C:\Users\admlocal>psql -U postgres -d mortalite

So far so good, I got a connection:

psql (9.3.5)
Attention : l'encodage console (850) diffère de l'encodage Windows (1252).
            Les caractères 8 bits peuvent ne pas fonctionner correctement.
            Voir la section « Notes aux utilisateurs de Windows » de la page
            référence de psql pour les détails.
Saisissez « help » pour l'aide.

mortalite=#

I tried the help command and it worked:

mortalite=# help
Vous utilisez psql, l'interface en ligne de commande de PostgreSQL.
Saisissez:
    \copyright pour les termes de distribution
    \h pour l'aide-mémoire des commandes SQL
    \? pour l'aide-mémoire des commandes psql
    \g ou point-virgule en fin d'instruction pour exécuter la requête
    \q pour quitter

But then when I tried to query the db, it crashed:

mortalite=# \dt
la connexion au serveur a été coupée de façon inattendue
Le serveur s'est peut-être arrêté anormalement avant ou durant le
traitement de la requête.
La connexion au serveur a été perdue. Tentative de réinitialisation : Échec.
!>

And here's the rest of the log file after the crash:

2015-02-09 12:29:19 EST LOG: server process (PID 2240) was terminated by exception 0xC0000005
2015-02-09 12:29:19 EST DETAIL: Failed process was running: SELECT n.nspname as "Schema",
    c.relname as "Name",
    CASE c.relkind WHEN 'r' THEN 'table' WHEN 'v' THEN 'view' WHEN 'm' THEN 'materialized view' WHEN 'i' THEN 'index' WHEN 'S' THEN 'sequence' WHEN 's' THEN 'special' WHEN 'f' THEN 'foreign table' END as "Type",
    pg_catalog.pg_get_userbyid(c.relowner) as "Owner"
    FROM pg_catalog.pg_class c
         LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relkind IN ('r','')
          AND n.nspname <> 'pg_catalog'
          AND n.nspname <> 'information_schema'
          AND n.nspname !~ '^pg_toast'
    AND pg_catalog.pg_table_is_visible(c.oid)
    ORDER BY 1,2;
2015-02-09 12:29:19 EST HINT: See C include file "ntstatus.h" for a description of the hexadecimal value.
2015-02-09 12:29:19 EST LOG: terminating any other active server processes
2015-02-09 12:29:19 EST WARNING: terminating connection because of crash of another server process
2015-02-09 12:29:19 EST DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2015-02-09 12:29:19 EST HINT: In a moment you should be able to reconnect to the database and repeat your command.
2015-02-09 12:29:19 EST LOG: archiver process (PID 4576) exited with exit code 1
2015-02-09 12:29:19 EST LOG: all server processes terminated; reinitializing
2015-02-09 12:29:29 EST FATAL: pre-existing shared memory block is still in use
2015-02-09 12:29:29 EST HINT: Check if there are any old server processes still running, and terminate them.

According to this page, exception 0xC0000005 means STATUS_ACCESS_VIOLATION - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s. This is not of much help to me.

I hope these additional bits of information can help someone figuring out a solution to get my cluster up and running again.

PS. I was thinking of reinstalling PGSQL over my current install but keeping my PGDATA. I've done it in the past for fixing problems with starting the service and it worked. What do you think?

Cheers,

Guillaume

2015-02-09 11:37 GMT-05:00 Guillaume Drolet <droletguillaume@gmail.com>:

2015-02-07 1:24 GMT-05:00 Guillaume Lelarge <guillaume@lelarge.info>:
Le 6 févr. 2015 17:31, "Adrian Klaver" <adrian.klaver@aklaver.com> a écrit :
>
> On 02/06/2015 05:03 AM, Guillaume Drolet wrote:
>>
>> Hi,
>>
>> Yesterday I ran a pg_basebackup of my cluster. Since it has completed,
>> my cluster doesn't work properly. I tried restarting the computer (or
>> service) a few times but I always get the same messages in my logs (it's
>> in French. If someone is willing to help me I can try to translate the
>> logs. Just ask):
>
>
> Enter Google Translate:)
>
But first, Guillaume, do yourself and everyone else a favor: turn the dam log into English. Set lc_messages to 'C' in postgresql.conf.
Thanks for this! I didn't know about this great feature.

> First some questions:
>
> 1) What Postgres version?
>
> 2) What OS(s)? I am assuming Windows from the log info below, but we all know what assuming gets you.
>
> 3) Where were you backing up from and to?
>
> 4) Which cluster does not start, the master or the child you created with pg_basebackup?
>
>
>>
>> 2015-02-06 07:11:38 EST LOG: le système de bases de données a été
>> interrompu ; dernier lancement connu à 2015-02-06 07:05:05 EST
>> 2015-02-06 07:11:38 EST LOG: le système de bases de données n'a pas été
>> arrêté proprement ; restauration
>> automatique en cours
>> 2015-02-06 07:11:38 EST LOG: record with zero length at 24B/2C000160
>> 2015-02-06 07:11:38 EST LOG: la ré-exécution n'est pas nécessaire
>> 2015-02-06 07:11:38 EST LOG: le système de bases de données est prêt
>> pour accepter les connexions
>> 2015-02-06 07:11:38 EST LOG: lancement du processus autovacuum
>> 2015-02-06 07:11:38 EST FATAL: le rôle « 208375PT$ » n'existe pas
>
>
> So where is role 208375PT$ supposed to come from?
>
>
>>
>> Then if I start pgAdmin I get a series of pop-ups I have to click OK to
>> to continue:
>>
>> An error has ocurred: Column not found in pgSet: "datlastsysoid"
>> An error has ocurred: Column not found in pgSet: datlastsysoid
>> An error has ocurred: Column not found in pgSet: oid
>> An error has ocurred: Column not found in pgSet: encoding
>> An error has ocurred: Column not found in pgSet: Connection to database
>> broken
>
>
> Not sure about that this, someone more versed in pgAdmin will have to answer.
>
Usually you see these messages when you're using a pgadmin major release older than a PostgreSQL make release. For a 9.3 release, that would mean a pgadmin older than 1.18.

I'm running pgadmin 1.18.1
>
>>
>> And after that, I went back to the log file and there's new information
>> added:
>>
>> 2015-02-06 07:51:05 EST LOG: processus serveur (PID 184) a été arrêté
>> par l'exception 0x80000004
>> 2015-02-06 07:51:05 EST DÉTAIL: Le processus qui a échoué exécutait :
>> SELECT version();
>> 2015-02-06 07:51:05 EST ASTUCE : Voir le fichier d'en-tête C «
>> ntstatus.h » pour une description de la valeur
>> hexadécimale.
>
>
> Well according to here:
>
> https://msdn.microsoft.com/en-us/library/cc704588.aspx
>
> 0x80000004
> STATUS_SINGLE_STEP
>
>
> {EXCEPTION} Single Step A single step or trace operation has just been completed.
>
> A developer is going to have explain what that means.
>
>
>
>> 2015-02-06 07:51:05 EST LOG: arrêt des autres processus serveur actifs
>> 2015-02-06 07:51:05 EST ATTENTION: arrêt de la connexion à cause de
>> l'arrêt brutal d'un autre processus serveur
>> 2015-02-06 07:51:05 EST DÉTAIL: Le postmaster a commandé à ce processus
>> serveur d'annuler la transaction
>> courante et de quitter car un autre processus serveur a quitté
>> anormalement
>> et qu'il existe probablement de la mémoire partagée corrompue.
>> 2015-02-06 07:51:05 EST ASTUCE : Dans un moment, vous devriez être
>> capable de vous reconnecter à la base de
>> données et de relancer votre commande.
>> 2015-02-06 07:51:05 EST LOG: processus d'archivage (PID 692) quitte
>> avec le code de sortie 1
>> 2015-02-06 07:51:05 EST LOG: tous les processus serveur se sont
>> arrêtés, réinitialisation
>> 2015-02-06 07:51:15 EST FATAL: le bloc de mémoire partagé pré-existant
>> est toujours en cours d'utilisation
>> 2015-02-06 07:51:15 EST ASTUCE : Vérifier s'il n'y a pas de vieux
>> processus serveur en cours d'exécution. Si c'est le
>> cas, fermez-les.
>>
>> I was about to try restarting postgresql using the base backup I made
>> yesterday but since this means I'll have to copy my database again (700
>> GB takes a while...) I am looking for a better solution from more
>> experienced people.
>
>
>
> My suspicion is you copied at least partly over a running server.
>

Re: Cluster seems broken after pg_basebackup

From

Guillaume Lelarge

Date:

09 February 2015, 17:51:41

2015-02-09 18:40 GMT+01:00 Guillaume Drolet <droletguillaume@gmail.com>:

I tried starting the cluster again. Once again everything looked fine at the start (first three lines of this log, in English this time):

2015-02-09 11:40:55 EST LOG: database system was shut down at 2015-02-06 09:50:21 EST
2015-02-09 11:40:55 EST LOG: database system is ready to accept connections
2015-02-09 11:40:55 EST LOG: autovacuum launcher started

Since it seemed to work, I opened the terminal and tried connecting to the database:

C:\Users\admlocal>psql -U postgres -d mortalite

So far so good, I got a connection:

psql (9.3.5)
Attention : l'encodage console (850) diffère de l'encodage Windows (1252).
            Les caractères 8 bits peuvent ne pas fonctionner correctement.
            Voir la section « Notes aux utilisateurs de Windows » de la page
            référence de psql pour les détails.
Saisissez « help » pour l'aide.

mortalite=#

I tried the help command and it worked:

mortalite=# help
Vous utilisez psql, l'interface en ligne de commande de PostgreSQL.
Saisissez:
    \copyright pour les termes de distribution
    \h pour l'aide-mémoire des commandes SQL
    \? pour l'aide-mémoire des commandes psql
    \g ou point-virgule en fin d'instruction pour exécuter la requête
    \q pour quitter

But then when I tried to query the db, it crashed:

mortalite=# \dt
la connexion au serveur a été coupée de façon inattendue
        Le serveur s'est peut-être arrêté anormalement avant ou durant le
        traitement de la requête.
La connexion au serveur a été perdue. Tentative de réinitialisation : Échec.
!>

And here's the rest of the log file after the crash:

2015-02-09 12:29:19 EST LOG: server process (PID 2240) was terminated by exception 0xC0000005
2015-02-09 12:29:19 EST DETAIL: Failed process was running: SELECT n.nspname as "Schema",
    c.relname as "Name",
    CASE c.relkind WHEN 'r' THEN 'table' WHEN 'v' THEN 'view' WHEN 'm' THEN 'materialized view' WHEN 'i' THEN 'index' WHEN 'S' THEN 'sequence' WHEN 's' THEN 'special' WHEN 'f' THEN 'foreign table' END as "Type",
    pg_catalog.pg_get_userbyid(c.relowner) as "Owner"
    FROM pg_catalog.pg_class c
         LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relkind IN ('r','')
          AND n.nspname <> 'pg_catalog'
          AND n.nspname <> 'information_schema'
          AND n.nspname !~ '^pg_toast'
    AND pg_catalog.pg_table_is_visible(c.oid)
    ORDER BY 1,2;
2015-02-09 12:29:19 EST HINT: See C include file "ntstatus.h" for a description of the hexadecimal value.
2015-02-09 12:29:19 EST LOG: terminating any other active server processes
2015-02-09 12:29:19 EST WARNING: terminating connection because of crash of another server process
2015-02-09 12:29:19 EST DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2015-02-09 12:29:19 EST HINT: In a moment you should be able to reconnect to the database and repeat your command.
2015-02-09 12:29:19 EST LOG: archiver process (PID 4576) exited with exit code 1
2015-02-09 12:29:19 EST LOG: all server processes terminated; reinitializing
2015-02-09 12:29:29 EST FATAL: pre-existing shared memory block is still in use
2015-02-09 12:29:29 EST HINT: Check if there are any old server processes still running, and terminate them.

According to this page, exception 0xC0000005 means STATUS_ACCESS_VIOLATION - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s. This is not of much help to me.

I hope these additional bits of information can help someone figuring out a solution to get my cluster up and running again.

PS. I was thinking of reinstalling PGSQL over my current install but keeping my PGDATA. I've done it in the past for fixing problems with starting the service and it worked. What do you think?

The SQL query you see in your log is the result of your \dt. It should work. I don't see why reinstalling PostgreSQL will fix anything here.

You said you did a pg_basebackup. The cluster you try to start is the cluster used with pg_basebackup or a restore done with the pg_basebackup?

Cheers,

Guillaume

2015-02-09 11:37 GMT-05:00 Guillaume Drolet <droletguillaume@gmail.com>:

2015-02-07 1:24 GMT-05:00 Guillaume Lelarge <guillaume@lelarge.info>:
Le 6 févr. 2015 17:31, "Adrian Klaver" <adrian.klaver@aklaver.com> a écrit :
>
> On 02/06/2015 05:03 AM, Guillaume Drolet wrote:
>>
>> Hi,
>>
>> Yesterday I ran a pg_basebackup of my cluster. Since it has completed,
>> my cluster doesn't work properly. I tried restarting the computer (or
>> service) a few times but I always get the same messages in my logs (it's
>> in French. If someone is willing to help me I can try to translate the
>> logs. Just ask):
>
>
> Enter Google Translate:)
>
But first, Guillaume, do yourself and everyone else a favor: turn the dam log into English. Set lc_messages to 'C' in postgresql.conf.
Thanks for this! I didn't know about this great feature.

> First some questions:
>
> 1) What Postgres version?
>
> 2) What OS(s)? I am assuming Windows from the log info below, but we all know what assuming gets you.
>
> 3) Where were you backing up from and to?
>
> 4) Which cluster does not start, the master or the child you created with pg_basebackup?
>
>
>>
>> 2015-02-06 07:11:38 EST LOG: le système de bases de données a été
>> interrompu ; dernier lancement connu à 2015-02-06 07:05:05 EST
>> 2015-02-06 07:11:38 EST LOG: le système de bases de données n'a pas été
>> arrêté proprement ; restauration
>> automatique en cours
>> 2015-02-06 07:11:38 EST LOG: record with zero length at 24B/2C000160
>> 2015-02-06 07:11:38 EST LOG: la ré-exécution n'est pas nécessaire
>> 2015-02-06 07:11:38 EST LOG: le système de bases de données est prêt
>> pour accepter les connexions
>> 2015-02-06 07:11:38 EST LOG: lancement du processus autovacuum
>> 2015-02-06 07:11:38 EST FATAL: le rôle « 208375PT$ » n'existe pas
>
>
> So where is role 208375PT$ supposed to come from?
>
>
>>
>> Then if I start pgAdmin I get a series of pop-ups I have to click OK to
>> to continue:
>>
>> An error has ocurred: Column not found in pgSet: "datlastsysoid"
>> An error has ocurred: Column not found in pgSet: datlastsysoid
>> An error has ocurred: Column not found in pgSet: oid
>> An error has ocurred: Column not found in pgSet: encoding
>> An error has ocurred: Column not found in pgSet: Connection to database
>> broken
>
>
> Not sure about that this, someone more versed in pgAdmin will have to answer.
>
Usually you see these messages when you're using a pgadmin major release older than a PostgreSQL make release. For a 9.3 release, that would mean a pgadmin older than 1.18.

I'm running pgadmin 1.18.1
>
>>
>> And after that, I went back to the log file and there's new information
>> added:
>>
>> 2015-02-06 07:51:05 EST LOG: processus serveur (PID 184) a été arrêté
>> par l'exception 0x80000004
>> 2015-02-06 07:51:05 EST DÉTAIL: Le processus qui a échoué exécutait :
>> SELECT version();
>> 2015-02-06 07:51:05 EST ASTUCE : Voir le fichier d'en-tête C «
>> ntstatus.h » pour une description de la valeur
>> hexadécimale.
>
>
> Well according to here:
>
> https://msdn.microsoft.com/en-us/library/cc704588.aspx
>
> 0x80000004
> STATUS_SINGLE_STEP
>
>
> {EXCEPTION} Single Step A single step or trace operation has just been completed.
>
> A developer is going to have explain what that means.
>
>
>
>> 2015-02-06 07:51:05 EST LOG: arrêt des autres processus serveur actifs
>> 2015-02-06 07:51:05 EST ATTENTION: arrêt de la connexion à cause de
>> l'arrêt brutal d'un autre processus serveur
>> 2015-02-06 07:51:05 EST DÉTAIL: Le postmaster a commandé à ce processus
>> serveur d'annuler la transaction
>> courante et de quitter car un autre processus serveur a quitté
>> anormalement
>> et qu'il existe probablement de la mémoire partagée corrompue.
>> 2015-02-06 07:51:05 EST ASTUCE : Dans un moment, vous devriez être
>> capable de vous reconnecter à la base de
>> données et de relancer votre commande.
>> 2015-02-06 07:51:05 EST LOG: processus d'archivage (PID 692) quitte
>> avec le code de sortie 1
>> 2015-02-06 07:51:05 EST LOG: tous les processus serveur se sont
>> arrêtés, réinitialisation
>> 2015-02-06 07:51:15 EST FATAL: le bloc de mémoire partagé pré-existant
>> est toujours en cours d'utilisation
>> 2015-02-06 07:51:15 EST ASTUCE : Vérifier s'il n'y a pas de vieux
>> processus serveur en cours d'exécution. Si c'est le
>> cas, fermez-les.
>>
>> I was about to try restarting postgresql using the base backup I made
>> yesterday but since this means I'll have to copy my database again (700
>> GB takes a while...) I am looking for a better solution from more
>> experienced people.
>
>
>
> My suspicion is you copied at least partly over a running server.
>

Guillaume.

http://blog.guillaume.lelarge.info

http://www.dalibo.com

Re: Cluster seems broken after pg_basebackup

From

Guillaume Drolet

Date:

09 February 2015, 18:39:52

Guillaume: the cluster I try to start is the one used with pg_basebackup, not the result of the backup.

2015-02-09 12:51 GMT-05:00 Guillaume Lelarge <guillaume@lelarge.info>:

2015-02-09 18:40 GMT+01:00 Guillaume Drolet <droletguillaume@gmail.com>:
I tried starting the cluster again. Once again everything looked fine at the start (first three lines of this log, in English this time):

2015-02-09 11:40:55 EST LOG: database system was shut down at 2015-02-06 09:50:21 EST
2015-02-09 11:40:55 EST LOG: database system is ready to accept connections
2015-02-09 11:40:55 EST LOG: autovacuum launcher started

Since it seemed to work, I opened the terminal and tried connecting to the database:

C:\Users\admlocal>psql -U postgres -d mortalite

So far so good, I got a connection:

psql (9.3.5)
Attention : l'encodage console (850) diffère de l'encodage Windows (1252).
            Les caractères 8 bits peuvent ne pas fonctionner correctement.
            Voir la section « Notes aux utilisateurs de Windows » de la page
            référence de psql pour les détails.
Saisissez « help » pour l'aide.

mortalite=#

I tried the help command and it worked:

mortalite=# help
Vous utilisez psql, l'interface en ligne de commande de PostgreSQL.
Saisissez:
    \copyright pour les termes de distribution
    \h pour l'aide-mémoire des commandes SQL
    \? pour l'aide-mémoire des commandes psql
    \g ou point-virgule en fin d'instruction pour exécuter la requête
    \q pour quitter

But then when I tried to query the db, it crashed:

mortalite=# \dt
la connexion au serveur a été coupée de façon inattendue
        Le serveur s'est peut-être arrêté anormalement avant ou durant le
        traitement de la requête.
La connexion au serveur a été perdue. Tentative de réinitialisation : Échec.
!>

And here's the rest of the log file after the crash:

2015-02-09 12:29:19 EST LOG: server process (PID 2240) was terminated by exception 0xC0000005
2015-02-09 12:29:19 EST DETAIL: Failed process was running: SELECT n.nspname as "Schema",
    c.relname as "Name",
    CASE c.relkind WHEN 'r' THEN 'table' WHEN 'v' THEN 'view' WHEN 'm' THEN 'materialized view' WHEN 'i' THEN 'index' WHEN 'S' THEN 'sequence' WHEN 's' THEN 'special' WHEN 'f' THEN 'foreign table' END as "Type",
    pg_catalog.pg_get_userbyid(c.relowner) as "Owner"
    FROM pg_catalog.pg_class c
         LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relkind IN ('r','')
          AND n.nspname <> 'pg_catalog'
          AND n.nspname <> 'information_schema'
          AND n.nspname !~ '^pg_toast'
    AND pg_catalog.pg_table_is_visible(c.oid)
    ORDER BY 1,2;
2015-02-09 12:29:19 EST HINT: See C include file "ntstatus.h" for a description of the hexadecimal value.
2015-02-09 12:29:19 EST LOG: terminating any other active server processes
2015-02-09 12:29:19 EST WARNING: terminating connection because of crash of another server process
2015-02-09 12:29:19 EST DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2015-02-09 12:29:19 EST HINT: In a moment you should be able to reconnect to the database and repeat your command.
2015-02-09 12:29:19 EST LOG: archiver process (PID 4576) exited with exit code 1
2015-02-09 12:29:19 EST LOG: all server processes terminated; reinitializing
2015-02-09 12:29:29 EST FATAL: pre-existing shared memory block is still in use
2015-02-09 12:29:29 EST HINT: Check if there are any old server processes still running, and terminate them.

According to this page, exception 0xC0000005 means STATUS_ACCESS_VIOLATION - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s. This is not of much help to me.

I hope these additional bits of information can help someone figuring out a solution to get my cluster up and running again.

PS. I was thinking of reinstalling PGSQL over my current install but keeping my PGDATA. I've done it in the past for fixing problems with starting the service and it worked. What do you think?

The SQL query you see in your log is the result of your \dt. It should work. I don't see why reinstalling PostgreSQL will fix anything here.

You said you did a pg_basebackup. The cluster you try to start is the cluster used with pg_basebackup or a restore done with the pg_basebackup?

Cheers,

Guillaume

2015-02-09 11:37 GMT-05:00 Guillaume Drolet <droletguillaume@gmail.com>:

2015-02-07 1:24 GMT-05:00 Guillaume Lelarge <guillaume@lelarge.info>:
Le 6 févr. 2015 17:31, "Adrian Klaver" <adrian.klaver@aklaver.com> a écrit :
>
> On 02/06/2015 05:03 AM, Guillaume Drolet wrote:
>>
>> Hi,
>>
>> Yesterday I ran a pg_basebackup of my cluster. Since it has completed,
>> my cluster doesn't work properly. I tried restarting the computer (or
>> service) a few times but I always get the same messages in my logs (it's
>> in French. If someone is willing to help me I can try to translate the
>> logs. Just ask):
>
>
> Enter Google Translate:)
>
But first, Guillaume, do yourself and everyone else a favor: turn the dam log into English. Set lc_messages to 'C' in postgresql.conf.
Thanks for this! I didn't know about this great feature.

> First some questions:
>
> 1) What Postgres version?
>
> 2) What OS(s)? I am assuming Windows from the log info below, but we all know what assuming gets you.
>
> 3) Where were you backing up from and to?
>
> 4) Which cluster does not start, the master or the child you created with pg_basebackup?
>
>
>>
>> 2015-02-06 07:11:38 EST LOG: le système de bases de données a été
>> interrompu ; dernier lancement connu à 2015-02-06 07:05:05 EST
>> 2015-02-06 07:11:38 EST LOG: le système de bases de données n'a pas été
>> arrêté proprement ; restauration
>> automatique en cours
>> 2015-02-06 07:11:38 EST LOG: record with zero length at 24B/2C000160
>> 2015-02-06 07:11:38 EST LOG: la ré-exécution n'est pas nécessaire
>> 2015-02-06 07:11:38 EST LOG: le système de bases de données est prêt
>> pour accepter les connexions
>> 2015-02-06 07:11:38 EST LOG: lancement du processus autovacuum
>> 2015-02-06 07:11:38 EST FATAL: le rôle « 208375PT$ » n'existe pas
>
>
> So where is role 208375PT$ supposed to come from?
>
>
>>
>> Then if I start pgAdmin I get a series of pop-ups I have to click OK to
>> to continue:
>>
>> An error has ocurred: Column not found in pgSet: "datlastsysoid"
>> An error has ocurred: Column not found in pgSet: datlastsysoid
>> An error has ocurred: Column not found in pgSet: oid
>> An error has ocurred: Column not found in pgSet: encoding
>> An error has ocurred: Column not found in pgSet: Connection to database
>> broken
>
>
> Not sure about that this, someone more versed in pgAdmin will have to answer.
>
Usually you see these messages when you're using a pgadmin major release older than a PostgreSQL make release. For a 9.3 release, that would mean a pgadmin older than 1.18.

I'm running pgadmin 1.18.1
>
>>
>> And after that, I went back to the log file and there's new information
>> added:
>>
>> 2015-02-06 07:51:05 EST LOG: processus serveur (PID 184) a été arrêté
>> par l'exception 0x80000004
>> 2015-02-06 07:51:05 EST DÉTAIL: Le processus qui a échoué exécutait :
>> SELECT version();
>> 2015-02-06 07:51:05 EST ASTUCE : Voir le fichier d'en-tête C «
>> ntstatus.h » pour une description de la valeur
>> hexadécimale.
>
>
> Well according to here:
>
> https://msdn.microsoft.com/en-us/library/cc704588.aspx
>
> 0x80000004
> STATUS_SINGLE_STEP
>
>
> {EXCEPTION} Single Step A single step or trace operation has just been completed.
>
> A developer is going to have explain what that means.
>
>
>
>> 2015-02-06 07:51:05 EST LOG: arrêt des autres processus serveur actifs
>> 2015-02-06 07:51:05 EST ATTENTION: arrêt de la connexion à cause de
>> l'arrêt brutal d'un autre processus serveur
>> 2015-02-06 07:51:05 EST DÉTAIL: Le postmaster a commandé à ce processus
>> serveur d'annuler la transaction
>> courante et de quitter car un autre processus serveur a quitté
>> anormalement
>> et qu'il existe probablement de la mémoire partagée corrompue.
>> 2015-02-06 07:51:05 EST ASTUCE : Dans un moment, vous devriez être
>> capable de vous reconnecter à la base de
>> données et de relancer votre commande.
>> 2015-02-06 07:51:05 EST LOG: processus d'archivage (PID 692) quitte
>> avec le code de sortie 1
>> 2015-02-06 07:51:05 EST LOG: tous les processus serveur se sont
>> arrêtés, réinitialisation
>> 2015-02-06 07:51:15 EST FATAL: le bloc de mémoire partagé pré-existant
>> est toujours en cours d'utilisation
>> 2015-02-06 07:51:15 EST ASTUCE : Vérifier s'il n'y a pas de vieux
>> processus serveur en cours d'exécution. Si c'est le
>> cas, fermez-les.
>>
>> I was about to try restarting postgresql using the base backup I made
>> yesterday but since this means I'll have to copy my database again (700
>> GB takes a while...) I am looking for a better solution from more
>> experienced people.
>
>
>
> My suspicion is you copied at least partly over a running server.
>

--
Guillaume.
http://blog.guillaume.lelarge.info
http://www.dalibo.com

Re: Cluster seems broken after pg_basebackup

From

Adrian Klaver

Date:

09 February 2015, 19:58:57

On 02/09/2015 08:34 AM, Guillaume Drolet wrote:

CCing list so the information stays in the thread.
>
>
> 2015-02-06 18:44 GMT-05:00 Adrian Klaver <adrian.klaver@aklaver.com
> <mailto:adrian.klaver@aklaver.com>>:
>
>     On 02/06/2015 09:17 AM, Guillaume Drolet wrote:
>
>         Dear Adrian,
>
>         Thanks for helping me. Sorry for the lack of details, I had said to
>         myself I had to not forget to give these details but I hit the send
>         button too fast. You know how it is...
>
>         I added more info in your reply below.
>
>
>              First some questions:
>
>              1) What Postgres version?
>
>
>         9.3
>
>
>         Windows 7
>
>
>              3) Where were you backing up from and to?
>
>
>         Backing up from my only cluster (PGDATA) on disk E, to a backup
>         directory on an other disk (F:) using this command:
>
>         pg_basebackup -D "F:\\db_base_backup" -Fp -Xs  -R -P
>         --label="basebackup20150205" --username=postgres
>
>         What's weird is that I did some successful tests last week on
>         the same
>         system (backing up, archiving, recovering) using the same procedure.
>         Only difference was the cluster, which was much smaller for testing
>         purposes, but located at the same place (i.e. E:\data) and
>         PostgresSQL
>         installed in C:\Programs\...
>
>
>              4) Which cluster does not start, the master or the child
>         you created
>              with pg_basebackup?
>
>
>
>         The master. I haven't tried the child yet. But I saw that the
>         message
>         about role "208375PT$" is in logs from before the backup too.
>
>
>         This is the local domain of my machine. I log onto my machine with a
>         local admin account and using domain name 208375PT (I didn't set
>         this
>         part of my machine, the IT guys here at work did). The thing is:
>         I don't
>         understand why it's there in the log file??
>
>
>     Not sure.
>
>     What are you using for an authentication method for database login?
>

  At this moment, for my tests I use md5 for user 'postgres' and trust for
  user 'all'.

>
>
>
>
>                  And after that, I went back to the log file and there's new
>                  information
>                  added:
>
>                  2015-02-06 07:51:05 EST LOG:  processus serveur (PID
>         184) a été
>                  arrêté
>                  par l'exception 0x80000004
>                  2015-02-06 07:51:05 EST DÉTAIL:  Le processus qui a échoué
>                  exécutait :
>                  SELECT version();
>                  2015-02-06 07:51:05 EST ASTUCE :  Voir le fichier
>         d'en-tête C «
>                  ntstatus.h » pour une description de la valeur
>                        hexadécimale.
>
>
>              Well according to here:
>
>         https://msdn.microsoft.com/en-____us/library/cc704588.aspx
>         <https://msdn.microsoft.com/en-__us/library/cc704588.aspx>
>              <https://msdn.microsoft.com/__en-us/library/cc704588.aspx
>         <https://msdn.microsoft.com/en-us/library/cc704588.aspx>>
>
>              0x80000004
>              STATUS_SINGLE_STEP
>
>
>              {EXCEPTION} Single Step A single step or trace operation
>         has just
>              been completed.
>
>              A developer is going to have explain what that means.
>
>
>
>
>              My suspicion is you copied at least partly over a running
>         server.
>
>
>         How would that be possible? Using the pg_basebackup command I wrote
>         above, it is clear that I wrote the backup on disk F and not E.
>
>
>     I was just speculating, I would not put too much stock in it.
>
>
>
>         While writing this post, I started my backup using:
>
>         pg_ctl start -D "F:\db_basebackup"
>
>         Similar stuff happened with pgAdmin and the log (message about
>         symbolic
>         link is related to my post from yesterday. I don't know if this
>         could be
>         involved in the current problem):
>
>         2015-02-06 12:13:58 EST LOG:  le système de bases de données a été
>         interrompu ; dernier lancement connu à 2015-02-05 14:30:34 EST
>         2015-02-06 12:13:58 EST LOG:  création du répertoire manquant «
>         pg_xlog/archive_status » pour les journaux de transactions
>         2015-02-06 12:13:58 EST LOG:  la ré-exécution commence à
>         24B/28000090
>         2015-02-06 12:13:58 EST LOG:  n'a pas pu supprimer le lien
>         symbolique «
>         pg_tblspc/940585 » : No such file or directory
>         2015-02-06 12:13:58 EST CONTEXTE :  xlog redo drop tablespace:
>         940585
>         2015-02-06 12:13:58 EST LOG:  état de restauration cohérent
>         atteint à
>         24B/290000B8
>         2015-02-06 12:13:58 EST LOG:  ré-exécution faite à 24B/290000B8
>         2015-02-06 12:13:58 EST LOG:  la dernière transaction a eu lieu à
>         2015-02-05 09:06:04.892-05 (moment de la journalisation)
>         2015-02-06 12:13:59 EST LOG:  le système de bases de données est
>         prêt
>         pour accepter les connexions
>         2015-02-06 12:13:59 EST LOG:  lancement du processus autovacuum
>         2015-02-06 12:14:42 EST LOG:  processus serveur (PID 1784) a été
>         arrêté
>         par l'exception 0x80000004
>         2015-02-06 12:14:42 EST DÉTAIL:  Le processus qui a échoué
>         exécutait :
>         SELECT version();
>         2015-02-06 12:14:42 EST ASTUCE :  Voir le fichier d'en-tête C «
>         ntstatus.h » pour une description de la valeur
>               hexadécimale.
>         2015-02-06 12:14:42 EST LOG:  arrêt des autres processus serveur
>         actifs
>         2015-02-06 12:14:42 EST ATTENTION:  arrêt de la connexion à cause de
>         l'arrêt brutal d'un autre processus serveur
>         2015-02-06 12:14:42 EST DÉTAIL:  Le postmaster a commandé à ce
>         processus
>         serveur d'annuler la transaction
>               courante et de quitter car un autre processus serveur a quitté
>         anormalement
>               et qu'il existe probablement de la mémoire partagée corrompue.
>         2015-02-06 12:14:42 EST ASTUCE :  Dans un moment, vous devriez être
>         capable de vous reconnecter à la base de
>               données et de relancer votre commande.
>         2015-02-06 12:14:42 EST LOG:  tous les processus serveur se sont
>         arrêtés, réinitialisation
>
>
>         Any ideas where to go from here?
>
>
>     In both cases the database got to the point below, which would seem
>     to indicate everything was alright.
>
>     2015-02-06 7:11:38 ET LOG: the re-execution is not required
>     2015-02-06 7:11:38 ET LOG: the database system is ready for
>     accept connections
>
>     Also from what I can see the server crashed at this point:
>
>     2015-02-06 12:13:59 LOG IS: launch autovacuum processes
>     2015-02-06 12:14:42 EST LOG: server process (PID 1784) was arrested
>     by the exception 0x80000004
>
>
>     Now 0x80000004 is supposed to mean:
>
>     STATUS_SINGLE_STEP
>
>
>     {EXCEPTION} Single Step A single step or trace operation has just
>     been completed.
>
>     Some digging indicates this is the result of debugger command. Have
>     no idea how that would invoked in Postgres running production code.
>     This leads to my default question when I see unexplained behavior on
>     a Windows machine; do you have anti-virus machine running against
>     the drives?
>
>

  Yes I do and I'm not allowed to turn it off (I don't have such
  privileges). But the anti-virus software is running on my other machine
  (same setup) and I've never had such problems. Even on this machine
  that's giving me problems, I spent the two last weeks making tests with
  point-in-time-recovery and everything went fine.

>
>
>
>         Thanks a lot again.
>
>
>                  Thanks a lot for helping! Guillaume
>
>
>
>              --
>              Adrian Klaver
>         adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>
>         <mailto:adrian.klaver@aklaver.__com
>         <mailto:adrian.klaver@aklaver.com>>
>
>
>
>
>     --
>     Adrian Klaver
>     adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com

Re: Cluster seems broken after pg_basebackup

From

Jim Nasby

Date:

09 February 2015, 21:10:45

On 2/9/15 11:51 AM, Guillaume Lelarge wrote:
> According to this page
> <https://msdn.microsoft.com/en-ca/library/cc704588.aspx>, exception
> 0xC0000005 means STATUS_ACCESS_VIOLATION - The instruction at 0x%08lx
> referenced memory at 0x%08lx. The memory could not be %s. This is not of
> much help to me.

In my experience that means that your data is corrupted.

> I hope these additional bits of information can help someone figuring
> out a solution to get my cluster up and running again.
>
> PS. I was thinking of reinstalling PGSQL over my current install but
> keeping my PGDATA. I've done it in the past for fixing problems with
> starting the service and it worked. What do you think?

You could try it, but as Guillaume Drolet mentioned I don't see this
helping.

Since this is happening on your original database, I suspect that's
what's been corrupted. In my experience, this means you either have
faulty hardware, or there's a misconfiguration that means fsync isn't
doing what it's supposed to do.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

Re: Cluster seems broken after pg_basebackup

From

Guillaume Drolet

Date:

09 February 2015, 21:36:20

2015-02-09 16:10 GMT-05:00 Jim Nasby <Jim.Nasby@bluetreble.com>:

On 2/9/15 11:51 AM, Guillaume Lelarge wrote:
According to this page
<https://msdn.microsoft.com/en-ca/library/cc704588.aspx>, exception
0xC0000005 means STATUS_ACCESS_VIOLATION - The instruction at 0x%08lx
referenced memory at 0x%08lx. The memory could not be %s. This is not of
much help to me.

In my experience that means that your data is corrupted.

That wouldn't be too dramatic since I can start over with another base backup. It just takes several hours... :(

What I want to achieve in the end is to transfer my production database from one machine to another. So here's how I did before things started to go bad:

1. On the source machine, I took a base backup using pg_basebackup

2. I installed PGSQL on the destination machine

3. I copied the backup and extracted it on the destination machine

4. I created a PGDATA environment variable pointing to 'data' from the extracted backup. In this backup, most of my data was in a created tablespace so I updated the junction link (in data\pg_tblspc) to reflect the location of the tablespace on the new machine (It seems like when using pg_basebackup, junction links are not saved as junction links but as empty directories).

5. I verified I could connect to my database using psql or pgadmin. It worked.

6. I wanted to move my tablespace to pg_default so I ran ALTER DATABASE mydb SET TABLESPACE pg_default;

7. Ran pg_basebackup on my cluster

8. Tried to connect to database, no success...

It it the right way to proceed for this kind of operation?

Thanks!

I hope these additional bits of information can help someone figuring
out a solution to get my cluster up and running again.

PS. I was thinking of reinstalling PGSQL over my current install but
keeping my PGDATA. I've done it in the past for fixing problems with
starting the service and it worked. What do you think?

You could try it, but as Guillaume Drolet mentioned I don't see this helping.

Since this is happening on your original database, I suspect that's what's been corrupted. In my experience, this means you either have faulty hardware, or there's a misconfiguration that means fsync isn't doing what it's supposed to do.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

Re: Cluster seems broken after pg_basebackup

From

Guillaume Drolet

Date:

10 February 2015, 16:39:18

Adrian, in response to your question:

2015-02-06 07:11:38 EST FATAL: le rôle « 208375PT$ » n'existe pas

So where is role 208375PT$ supposed to come from?

I found that when I stop/start/restart pgsql through the services.msc application in Windows, this message is issued in the log file. This makes sense since the account I use to start the services.msc application is my admin account, 208375PT\Admlocal.

If I use instead the command line with "pg_ctl restart -U postgres", I don't get this message in the log file.

2015-02-06 11:28 GMT-05:00 Adrian Klaver <adrian.klaver@aklaver.com>:

On 02/06/2015 05:03 AM, Guillaume Drolet wrote:
Hi,

Yesterday I ran a pg_basebackup of my cluster. Since it has completed,
my cluster doesn't work properly. I tried restarting the computer (or
service) a few times but I always get the same messages in my logs (it's
in French. If someone is willing to help me I can try to translate the
logs. Just ask):

Enter Google Translate:)

First some questions:

1) What Postgres version?

2) What OS(s)? I am assuming Windows from the log info below, but we all know what assuming gets you.

3) Where were you backing up from and to?

4) Which cluster does not start, the master or the child you created with pg_basebackup?

2015-02-06 07:11:38 EST LOG: le système de bases de données a été
interrompu ; dernier lancement connu à 2015-02-06 07:05:05 EST
2015-02-06 07:11:38 EST LOG: le système de bases de données n'a pas été
arrêté proprement ; restauration
automatique en cours
2015-02-06 07:11:38 EST LOG: record with zero length at 24B/2C000160
2015-02-06 07:11:38 EST LOG: la ré-exécution n'est pas nécessaire
2015-02-06 07:11:38 EST LOG: le système de bases de données est prêt
pour accepter les connexions
2015-02-06 07:11:38 EST LOG: lancement du processus autovacuum
2015-02-06 07:11:38 EST FATAL: le rôle « 208375PT$ » n'existe pas

So where is role 208375PT$ supposed to come from?

Then if I start pgAdmin I get a series of pop-ups I have to click OK to
to continue:

An error has ocurred: Column not found in pgSet: "datlastsysoid"
An error has ocurred: Column not found in pgSet: datlastsysoid
An error has ocurred: Column not found in pgSet: oid
An error has ocurred: Column not found in pgSet: encoding
An error has ocurred: Column not found in pgSet: Connection to database
broken

Not sure about that this, someone more versed in pgAdmin will have to answer.

And after that, I went back to the log file and there's new information
added:

2015-02-06 07:51:05 EST LOG: processus serveur (PID 184) a été arrêté
par l'exception 0x80000004
2015-02-06 07:51:05 EST DÉTAIL: Le processus qui a échoué exécutait :
SELECT version();
2015-02-06 07:51:05 EST ASTUCE : Voir le fichier d'en-tête C «
ntstatus.h » pour une description de la valeur
hexadécimale.

Well according to here:

https://msdn.microsoft.com/en-us/library/cc704588.aspx

0x80000004
STATUS_SINGLE_STEP

{EXCEPTION} Single Step A single step or trace operation has just been completed.

A developer is going to have explain what that means.

2015-02-06 07:51:05 EST LOG: arrêt des autres processus serveur actifs
2015-02-06 07:51:05 EST ATTENTION: arrêt de la connexion à cause de
l'arrêt brutal d'un autre processus serveur
2015-02-06 07:51:05 EST DÉTAIL: Le postmaster a commandé à ce processus
serveur d'annuler la transaction
courante et de quitter car un autre processus serveur a quitté
anormalement
et qu'il existe probablement de la mémoire partagée corrompue.
2015-02-06 07:51:05 EST ASTUCE : Dans un moment, vous devriez être
capable de vous reconnecter à la base de
données et de relancer votre commande.
2015-02-06 07:51:05 EST LOG: processus d'archivage (PID 692) quitte
avec le code de sortie 1
2015-02-06 07:51:05 EST LOG: tous les processus serveur se sont
arrêtés, réinitialisation
2015-02-06 07:51:15 EST FATAL: le bloc de mémoire partagé pré-existant
est toujours en cours d'utilisation
2015-02-06 07:51:15 EST ASTUCE : Vérifier s'il n'y a pas de vieux
processus serveur en cours d'exécution. Si c'est le
cas, fermez-les.

I was about to try restarting postgresql using the base backup I made
yesterday but since this means I'll have to copy my database again (700
GB takes a while...) I am looking for a better solution from more
experienced people.

My suspicion is you copied at least partly over a running server.

Thanks a lot for helping! Guillaume

--
Adrian Klaver
adrian.klaver@aklaver.com

Re: Cluster seems broken after pg_basebackup

From

Adrian Klaver

Date:

11 February 2015, 15:26:46

On 02/10/2015 08:38 AM, Guillaume Drolet wrote:
>
> Adrian, in response to your question:
>
>     2015-02-06 07:11:38 EST FATAL:  le rôle « 208375PT$ » n'existe pas
>
>
> So where is role 208375PT$ supposed to come from?
>
> I found that when I stop/start/restart pgsql through the services.msc
> application in Windows, this message is issued in the log file. This
> makes sense since the account I use to start the services.msc
> application is my admin account, 208375PT\Admlocal.
>
> If I use instead the command line with "pg_ctl restart -U postgres", I
> don't get this message in the log file.
>

Thanks for the explanation. I would say at this point the error is just
noise, but it is nice to know where it came from. Seems a corruption
issue on the master, as others have pointed out, is at the heart of the
matter.


--
Adrian Klaver
adrian.klaver@aklaver.com

Re: Cluster seems broken after pg_basebackup

From

Guillaume Drolet

Date:

20 February 2015, 12:12:34

2015-02-09 16:10 GMT-05:00 Jim Nasby <Jim.Nasby@bluetreble.com>:

On 2/9/15 11:51 AM, Guillaume Lelarge wrote:
According to this page
<https://msdn.microsoft.com/en-ca/library/cc704588.aspx>, exception
0xC0000005 means STATUS_ACCESS_VIOLATION - The instruction at 0x%08lx
referenced memory at 0x%08lx. The memory could not be %s. This is not of
much help to me.

In my experience that means that your data is corrupted.

I hope these additional bits of information can help someone figuring
out a solution to get my cluster up and running again.

PS. I was thinking of reinstalling PGSQL over my current install but
keeping my PGDATA. I've done it in the past for fixing problems with
starting the service and it worked. What do you think?

You could try it, but as Guillaume Drolet mentioned I don't see this helping.

Since this is happening on your original database, I suspect that's what's been corrupted. In my experience, this means you either have faulty hardware, or there's a misconfiguration that means fsync isn't doing what it's supposed to do.

For those interested, I reinstalled the PGSQL binaries over, keeping my PGDATA and tablespace. It stopped the crashes mentioned in the previous posts, at least for now. We'll see if the crashes come back.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com