Thread: pg_basebackup: return value 1: reason?

pg_basebackup: return value 1: reason?

From
Andrej Vanek
Date:
Hello, 

I tried to run pg_basebackup. Return value is 1.

How to find out its reason?
(I suspect that some wal after backup is missing- but how to find out the real reason? How to fix it?)

thanks, Andrej 
--------------details:
environment: CentOS 6.7, postgres 9.5.1
( PostgreSQL 9.5.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-16), 64-bit)

I tried 2 forms of pg_basebackup (-X fetch and -X stream). Both were issued from a script:
# su - postgres -c "/usr/pgsql-9.5/bin/pg_basebackup -h ${DB_MASTER_IP} -D ${GEO_STDBY_DATA} -U pgreplic -P -v -X fetch" 2>${LOG_FILE}.stderr  >> ${LOG_FILE}
# echo $?
1             <--------------pg_basebackup failed!
# cat log.stderr
# cat /var/log/cluster/geo_repair.log.err
transaction log start point: 0/E3000028 on timeline 1
WARNING:  skipping special file "./pg_hba.conf"
WARNING:  skipping special file "./pg_hba.conf.save"
transaction log end point: 0/E30000F8
pg_basebackup: base backup completed            <------------------no reason for pg_basebackup failure!
# cp /tmp/pg_hba.conf /tmp/postgresql.conf /pg_data/
# su - postgres -c "/usr/pgsql-9.5/bin/pg_ctl -D /pg_data/ start"
# tail /pg_data/pg_log/postgresql-Fri.log
`pg_xlog/0000000100000000000000E2' -> `../backups/arc/0000000100000000000000E2'
2016-04-15 23:15:10 CEST:pgreplic@[unknown]:[10667] WARNING:  skipping special file "./pg_hba.conf"
2016-04-15 23:15:10 CEST:pgreplic@[unknown]:[10667] WARNING:  skipping special file "./pg_hba.conf.save"         <---------------recorded in pg_log on master node and copied by pg_basebackup (note time difference between two servers)
2016-04-15 23:15:02 CEST:@:[23321] LOG:  database system was interrupted; last known up at 2016-04-15 23:15:10 CEST
2016-04-15 23:15:02 CEST:postgres@postgres:[23329] FATAL:  the database system is starting up
2016-04-15 23:15:03 CEST:@:[23321] LOG:  entering standby mode
2016-04-15 23:15:03 CEST:@:[23321] LOG:  database system was not properly shut down; automatic recovery in progress <---------something missing from pg_basebackup
2016-04-15 23:15:03 CEST:@:[23321] LOG:  redo starts at 0/E3000028
2016-04-15 23:15:03 CEST:@:[23321] LOG:  consistent recovery state reached at 0/E4000000
2016-04-15 23:15:03 CEST:@:[23295] LOG:  database system is ready to accept read only connections
2016-04-15 23:15:03 CEST:@:[23356] LOG:  started streaming WAL from primary at 0/E4000000 on timeline 1
-------second trial
# su - postgres -c "/usr/pgsql-9.5/bin/pg_basebackup -h ${DB_MASTER_IP} -D ${GEO_STDBY_DATA} -U pgreplic -P -v -X stream" 
# echo $?
1
#  cat /var/log/cluster/geo_repair.log.err 
transaction log start point: 0/E5000028 on timeline 1
pg_basebackup: starting background WAL receiver
WARNING:  skipping special file "./pg_hba.conf"
WARNING:  skipping special file "./pg_hba.conf.save"
transaction log end point: 0/E50000F8
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: could not wait for child process: No child processes        <----what does this mean? I think it failed to start process to fetching wal logs created during backup: but neither on master node neither on pg_basebackup output here is any information about reason.. (max_wal_senders on master is 10: I see no reason to fail).

postgres logs:
`pg_xlog/0000000100000000000000E4' -> `../backups/arc/0000000100000000000000E4'
2016-04-15 23:35:09 CEST:pgreplic@[unknown]:[29035] WARNING:  skipping special file "./pg_hba.conf"
2016-04-15 23:35:09 CEST:pgreplic@[unknown]:[29035] WARNING:  skipping special file "./pg_hba.conf.save"
2016-04-15 23:35:01 CEST:@:[28926] LOG:  database system was interrupted; last known up at 2016-04-15 23:35:09 CEST
2016-04-15 23:35:01 CEST:postgres@postgres:[28938] FATAL:  the database system is starting up
2016-04-15 23:35:02 CEST:@:[28926] LOG:  entering standby mode
2016-04-15 23:35:02 CEST:@:[28926] LOG:  database system was not properly shut down; automatic recovery in progress  <------------this means something missing from pg_basebackup
2016-04-15 23:35:02 CEST:@:[28926] LOG:  redo starts at 0/E5000028
2016-04-15 23:35:02 CEST:@:[28926] LOG:  consistent recovery state reached at 0/E6000000
2016-04-15 23:35:02 CEST:@:[28904] LOG:  database system is ready to accept read only connections
2016-04-15 23:35:02 CEST:@:[28989] LOG:  started streaming WAL from primary at 0/E6000000 on timeline 1

postgres params on master node:
log_line_prefix = '%t:%u@%d:[%p] '
logging_collector = on
wal_buffers = 16MB
max_wal_size = 200MB
log_temp_files = 1MB
max_connections = 170
shared_buffers = 512MB
effective_cache_size = 1500MB
work_mem = 48MB
log_lock_waits = on
log_min_duration_statement = 10000
shared_preload_libraries = 'pg_stat_statements'
include '/var/lib/pgsql/tmp/rep_mode.conf' # added by pgsql RA
wal_level = hot_standby
archive_mode = on
max_wal_senders = 10
hot_standby = on
wal_keep_segments = 128
archive_command = '/opt/postgres/dbconf/archive_command.sh %p %f'
wal_receiver_status_interval = 2
max_standby_streaming_delay = -1
max_standby_archive_delay = -1
restart_after_crash = off
hot_standby_feedback = on

Re: pg_basebackup: return value 1: reason?

From
Adrian Klaver
Date:
On 04/15/2016 03:28 PM, Andrej Vanek wrote:
> Hello,
>
> I tried to run pg_basebackup. Return value is 1.
>
> How to find out its reason?
> (I suspect that some wal after backup is missing- but how to find
> out the real reason? How to fix it?)

First it is not clear to me where you are taking the backup from, the
master or the standby?

Second there is a lot of redirection going on. What happens if you run
the pg_basebackup directly (without doing  su - postgres ...) and use
hardcoded values instead of shell variables?

>
> thanks, Andrej
> --------------details:
> environment: CentOS 6.7, postgres 9.5.1
> ( PostgreSQL 9.5.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7
> 20120313 (Red Hat 4.4.7-16), 64-bit)
>
> I tried 2 forms of pg_basebackup (-X fetch and -X stream). Both were
> issued from a script:
> # su - postgres -c "/usr/pgsql-9.5/bin/pg_basebackup -h ${DB_MASTER_IP}
> -D ${GEO_STDBY_DATA} -U pgreplic -P -v -X fetch" 2>${LOG_FILE}.stderr
>   >> ${LOG_FILE}
> # echo $?
> 1             <--------------pg_basebackup failed!
> # cat log.stderr
> # cat /var/log/cluster/geo_repair.log.err
> transaction log start point: 0/E3000028 on timeline 1
> WARNING:  skipping special file "./pg_hba.conf"
> WARNING:  skipping special file "./pg_hba.conf.save"
> transaction log end point: 0/E30000F8
> pg_basebackup: base backup completed            <------------------no
> reason for pg_basebackup failure!
> # cp /tmp/pg_hba.conf /tmp/postgresql.conf /pg_data/
> # su - postgres -c "/usr/pgsql-9.5/bin/pg_ctl -D /pg_data/ start"
> # tail /pg_data/pg_log/postgresql-Fri.log
> `pg_xlog/0000000100000000000000E2' ->
> `../backups/arc/0000000100000000000000E2'
> 2016-04-15 23:15:10 CEST:pgreplic@[unknown]:[10667] WARNING:  skipping
> special file "./pg_hba.conf"
> 2016-04-15 23:15:10 CEST:pgreplic@[unknown]:[10667] WARNING:  skipping
> special file "./pg_hba.conf.save"         <---------------recorded in
> pg_log on master node and copied by pg_basebackup (note time difference
> between two servers)
> 2016-04-15 23:15:02 CEST:@:[23321] LOG:  database system was
> interrupted; last known up at 2016-04-15 23:15:10 CEST
> 2016-04-15 23:15:02 CEST:postgres@postgres:[23329] FATAL:  the database
> system is starting up
> 2016-04-15 23:15:03 CEST:@:[23321] LOG:  entering standby mode
> 2016-04-15 23:15:03 CEST:@:[23321] LOG:  database system was not
> properly shut down; automatic recovery in progress <---------something
> missing from pg_basebackup
> 2016-04-15 23:15:03 CEST:@:[23321] LOG:  redo starts at 0/E3000028
> 2016-04-15 23:15:03 CEST:@:[23321] LOG:  consistent recovery state
> reached at 0/E4000000
> 2016-04-15 23:15:03 CEST:@:[23295] LOG:  database system is ready to
> accept read only connections
> 2016-04-15 23:15:03 CEST:@:[23356] LOG:  started streaming WAL from
> primary at 0/E4000000 on timeline 1
> -------second trial
> # su - postgres -c "/usr/pgsql-9.5/bin/pg_basebackup -h ${DB_MASTER_IP}
> -D ${GEO_STDBY_DATA} -U pgreplic -P -v -X stream"
> # echo $?
> 1
> #  cat /var/log/cluster/geo_repair.log.err
> transaction log start point: 0/E5000028 on timeline 1
> pg_basebackup: starting background WAL receiver
> WARNING:  skipping special file "./pg_hba.conf"
> WARNING:  skipping special file "./pg_hba.conf.save"
> transaction log end point: 0/E50000F8
> pg_basebackup: waiting for background process to finish streaming ...
> pg_basebackup: could not wait for child process: No child processes
>     <----what does this mean? I think it failed to start process to
> fetching wal logs created during backup: but neither on master node
> neither on pg_basebackup output here is any information about reason..
> (max_wal_senders on master is 10: I see no reason to fail).
>
> postgres logs:
> `pg_xlog/0000000100000000000000E4' ->
> `../backups/arc/0000000100000000000000E4'
> 2016-04-15 23:35:09 CEST:pgreplic@[unknown]:[29035] WARNING:  skipping
> special file "./pg_hba.conf"
> 2016-04-15 23:35:09 CEST:pgreplic@[unknown]:[29035] WARNING:  skipping
> special file "./pg_hba.conf.save"
> 2016-04-15 23:35:01 CEST:@:[28926] LOG:  database system was
> interrupted; last known up at 2016-04-15 23:35:09 CEST
> 2016-04-15 23:35:01 CEST:postgres@postgres:[28938] FATAL:  the database
> system is starting up
> 2016-04-15 23:35:02 CEST:@:[28926] LOG:  entering standby mode
> 2016-04-15 23:35:02 CEST:@:[28926] LOG:  database system was not
> properly shut down; automatic recovery in progress  <------------this
> means something missing from pg_basebackup
> 2016-04-15 23:35:02 CEST:@:[28926] LOG:  redo starts at 0/E5000028
> 2016-04-15 23:35:02 CEST:@:[28926] LOG:  consistent recovery state
> reached at 0/E6000000
> 2016-04-15 23:35:02 CEST:@:[28904] LOG:  database system is ready to
> accept read only connections
> 2016-04-15 23:35:02 CEST:@:[28989] LOG:  started streaming WAL from
> primary at 0/E6000000 on timeline 1
>
> postgres params on master node:
> log_line_prefix = '%t:%u@%d:[%p] '
> logging_collector = on
> wal_buffers = 16MB
> max_wal_size = 200MB
> log_temp_files = 1MB
> max_connections = 170
> shared_buffers = 512MB
> effective_cache_size = 1500MB
> work_mem = 48MB
> log_lock_waits = on
> log_min_duration_statement = 10000
> shared_preload_libraries = 'pg_stat_statements'
> include '/var/lib/pgsql/tmp/rep_mode.conf' # added by pgsql RA
> wal_level = hot_standby
> archive_mode = on
> max_wal_senders = 10
> hot_standby = on
> wal_keep_segments = 128
> archive_command = '/opt/postgres/dbconf/archive_command.sh %p %f'
> wal_receiver_status_interval = 2
> max_standby_streaming_delay = -1
> max_standby_archive_delay = -1
> restart_after_crash = off
> hot_standby_feedback = on
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: pg_basebackup: return value 1: reason?

From
Jerry Sievers
Date:
Andrej Vanek <andrej.vanek.sk@gmail.com> writes:

> Hello, 
>
> I tried to run pg_basebackup. Return value is 1.
>
> How to find out its reason?
> (I suspect that some wal after backup is missing- but how to find out the real reason? How to fix it?)

Well there are more than 80 cases of exit 1 in a couple .c files and
most/all are preceeded unsurprisingly with an fprintf(stderr.

Are you not getting something informative on your display and/or sending
stderr to somewher else?

$ echo $*
./src/bin/pg_basebackup/pg_basebackup.c ./src/backend/replication/basebackup.c

$ grep -h -B 4 -i 'exit.*(1' $* | egrep -hi 'fprintf\(stderr|exit'
            fprintf(stderr, _("%s: directory name too long\n"), progname);
            exit(1);
                fprintf(stderr, _("%s: multiple \"=\" signs in tablespace mapping\n"), progname);
                exit(1);
        fprintf(stderr,
        exit(1);
        fprintf(stderr, _("%s: old directory is not an absolute path in tablespace mapping: %s\n"),
        exit(1);
        fprintf(stderr, _("%s: new directory is not an absolute path in tablespace mapping: %s\n"),
        exit(1);
                fprintf(stderr, _("%s: could not read from ready pipe: %s\n"),
                exit(1);
                fprintf(stderr,
                exit(1);
        fprintf(stderr,
        disconnect_and_exit(1);
        fprintf(stderr,
        disconnect_and_exit(1);
        exit(1);
        fprintf(stderr,
        disconnect_and_exit(1);
        fprintf(stderr, _("%s: could not create background process: %s\n"),
        disconnect_and_exit(1);
        fprintf(stderr, _("%s: could not create background thread: %s\n"),
        disconnect_and_exit(1);
                fprintf(stderr,
                disconnect_and_exit(1);
            fprintf(stderr,
            disconnect_and_exit(1);
            fprintf(stderr, _("%s: could not access directory \"%s\": %s\n"),
            disconnect_and_exit(1);
        fprintf(stderr,
        exit(1);
        fprintf(stderr,
        exit(1);
        fprintf(stderr, _("%s: transfer rate must be greater than zero\n"),
        exit(1);
        fprintf(stderr,
        exit(1);
        fprintf(stderr,
        exit(1);
        fprintf(stderr,
        exit(1);
            fprintf(stderr,
            disconnect_and_exit(1);
            fprintf(stderr, _("%s: could not write to file \"%s\": %s\n"),
            disconnect_and_exit(1);
                    fprintf(stderr,
                    disconnect_and_exit(1);
                    fprintf(stderr,
                    disconnect_and_exit(1);
                fprintf(stderr,
                disconnect_and_exit(1);
            fprintf(stderr,
            disconnect_and_exit(1);
            fprintf(stderr, _("%s: could not create file \"%s\": %s\n"),
            disconnect_and_exit(1);
        fprintf(stderr, _("%s: could not get COPY data stream: %s"),
        disconnect_and_exit(1);
                    fprintf(stderr,
                    disconnect_and_exit(1);
                        fprintf(stderr,
                        disconnect_and_exit(1);
            fprintf(stderr, _("%s: could not read COPY data: %s"),
            disconnect_and_exit(1);
        fprintf(stderr, _("%s: could not get COPY data stream: %s"),
        disconnect_and_exit(1);
            fprintf(stderr, _("%s: could not read COPY data: %s"),
            disconnect_and_exit(1);
                fprintf(stderr, _("%s: invalid tar block header size: %d\n"),
                disconnect_and_exit(1);
                            fprintf(stderr,
                            disconnect_and_exit(1);
                        fprintf(stderr,
                        disconnect_and_exit(1);
                    fprintf(stderr,
                    disconnect_and_exit(1);
                fprintf(stderr, _("%s: could not create file \"%s\": %s\n"),
                disconnect_and_exit(1);
                fprintf(stderr, _("%s: could not write to file \"%s\": %s\n"),
                disconnect_and_exit(1);
        fprintf(stderr,
        disconnect_and_exit(1);
        fprintf(stderr, _("%s: out of memory\n"), progname);
        exit(1);
        fprintf(stderr, _("%s: out of memory\n"), progname);
        disconnect_and_exit(1);
        fprintf(stderr, _("%s: out of memory\n"), progname);
        disconnect_and_exit(1);
        fprintf(stderr, _("%s: out of memory\n"), progname);
        disconnect_and_exit(1);
        fprintf(stderr, _("%s: could not create file \"%s\": %s\n"), progname, filename, strerror(errno));
        disconnect_and_exit(1);
        fprintf(stderr,
        disconnect_and_exit(1);
        exit(1);
        fprintf(stderr, _("%s: incompatible server version %s\n"),
        disconnect_and_exit(1);
        disconnect_and_exit(1);
        disconnect_and_exit(1);
        fprintf(stderr, _("%s: could not send replication command \"%s\": %s"),
        disconnect_and_exit(1);
        fprintf(stderr, _("%s: could not initiate base backup: %s"),
        disconnect_and_exit(1);
        fprintf(stderr,
        disconnect_and_exit(1);
        fprintf(stderr, _("%s: could not get backup header: %s"),
        disconnect_and_exit(1);
        fprintf(stderr, _("%s: no data returned from server\n"), progname);
        disconnect_and_exit(1);
        fprintf(stderr,
        disconnect_and_exit(1);
        fprintf(stderr,
        disconnect_and_exit(1);
        fprintf(stderr,
        disconnect_and_exit(1);
        fprintf(stderr, _("%s: final receive failed: %s"),
        disconnect_and_exit(1);
            fprintf(stderr,
            disconnect_and_exit(1);
            fprintf(stderr, _("%s: could not wait for child process: %s\n"),
            disconnect_and_exit(1);
            fprintf(stderr, _("%s: child %d died, expected %d\n"),
            disconnect_and_exit(1);
        if (!WIFEXITED(status))
            fprintf(stderr, _("%s: child process did not exit normally\n"),
            disconnect_and_exit(1);
        if (WEXITSTATUS(status) != 0)
            fprintf(stderr, _("%s: child process exited with error %d\n"),
                    progname, WEXITSTATUS(status));
            disconnect_and_exit(1);
            fprintf(stderr,
            disconnect_and_exit(1);
            fprintf(stderr, _("%s: could not wait for child thread: %s\n"),
            disconnect_and_exit(1);
            fprintf(stderr, _("%s: could not get child thread exit status: %s\n"),
            disconnect_and_exit(1);
            fprintf(stderr, _("%s: child thread exited with error %u\n"),
            disconnect_and_exit(1);
                    fprintf(stderr,
                    exit(1);
                    fprintf(stderr,
                    exit(1);
                    fprintf(stderr,
                    exit(1);
                    fprintf(stderr,
                    exit(1);
                    fprintf(stderr, _("%s: invalid compression level \"%s\"\n"),
                    exit(1);
                    fprintf(stderr, _("%s: invalid checkpoint argument \"%s\", must be \"fast\" or \"spread\"\n"),
                    exit(1);
                    fprintf(stderr, _("%s: invalid status interval \"%s\"\n"),
                    exit(1);
                fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
                exit(1);
        fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
        exit(1);
        fprintf(stderr, _("%s: no target directory specified\n"), progname);
        fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
        exit(1);
        fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
        exit(1);
        fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
        exit(1);
            fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
            exit(1);
            fprintf(stderr, _("%s: transaction log directory location must be "
            fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
            exit(1);
        fprintf(stderr,
        exit(1);
            fprintf(stderr, _("%s: could not create symbolic link \"%s\": %s\n"),
            exit(1);
        fprintf(stderr, _("%s: symlinks are not supported on this platform\n"));
        exit(1);


>
> thanks, Andrej 
> --------------details:
> environment: CentOS 6.7, postgres 9.5.1
> ( PostgreSQL 9.5.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-16), 64-bit)
>
> I tried 2 forms of pg_basebackup (-X fetch and -X stream). Both were issued from a script:
> # su - postgres -c "/usr/pgsql-9.5/bin/pg_basebackup -h ${DB_MASTER_IP} -D ${GEO_STDBY_DATA} -U pgreplic -P -v -X
fetch"2>${LOG_FILE}.stderr  >> ${LOG_FILE} 
> # echo $?
> 1             <--------------pg_basebackup failed!
> # cat log.stderr
> # cat /var/log/cluster/geo_repair.log.err
> transaction log start point: 0/E3000028 on timeline 1
> WARNING:  skipping special file "./pg_hba.conf"
> WARNING:  skipping special file "./pg_hba.conf.save"
> transaction log end point: 0/E30000F8
> pg_basebackup: base backup completed            <------------------no reason for pg_basebackup failure!
> # cp /tmp/pg_hba.conf /tmp/postgresql.conf /pg_data/
> # su - postgres -c "/usr/pgsql-9.5/bin/pg_ctl -D /pg_data/ start"
> # tail /pg_data/pg_log/postgresql-Fri.log
> `pg_xlog/0000000100000000000000E2' -> `../backups/arc/0000000100000000000000E2'
> 2016-04-15 23:15:10 CEST:pgreplic@[unknown]:[10667] WARNING:  skipping special file "./pg_hba.conf"
> 2016-04-15 23:15:10 CEST:pgreplic@[unknown]:[10667] WARNING:  skipping special file "./pg_hba.conf.save"        
<---------------recordedin pg_log on master node and 
> copied by pg_basebackup (note time difference between two servers)
> 2016-04-15 23:15:02 CEST:@:[23321] LOG:  database system was interrupted; last known up at 2016-04-15 23:15:10 CEST
> 2016-04-15 23:15:02 CEST:postgres@postgres:[23329] FATAL:  the database system is starting up
> 2016-04-15 23:15:03 CEST:@:[23321] LOG:  entering standby mode
> 2016-04-15 23:15:03 CEST:@:[23321] LOG:  database system was not properly shut down; automatic recovery in progress
<---------somethingmissing from pg_basebackup 
> 2016-04-15 23:15:03 CEST:@:[23321] LOG:  redo starts at 0/E3000028
> 2016-04-15 23:15:03 CEST:@:[23321] LOG:  consistent recovery state reached at 0/E4000000
> 2016-04-15 23:15:03 CEST:@:[23295] LOG:  database system is ready to accept read only connections
> 2016-04-15 23:15:03 CEST:@:[23356] LOG:  started streaming WAL from primary at 0/E4000000 on timeline 1
> -------second trial
> # su - postgres -c "/usr/pgsql-9.5/bin/pg_basebackup -h ${DB_MASTER_IP} -D ${GEO_STDBY_DATA} -U pgreplic -P -v -X
stream" 
> # echo $?
> 1
> #  cat /var/log/cluster/geo_repair.log.err 
> transaction log start point: 0/E5000028 on timeline 1
> pg_basebackup: starting background WAL receiver
> WARNING:  skipping special file "./pg_hba.conf"
> WARNING:  skipping special file "./pg_hba.conf.save"
> transaction log end point: 0/E50000F8
> pg_basebackup: waiting for background process to finish streaming ...
> pg_basebackup: could not wait for child process: No child processes        <----what does this mean? I think it
failedto start process to fetching wal logs created 
> during backup: but neither on master node neither on pg_basebackup output here is any information about reason..
(max_wal_senderson master is 10: I see no reason to 
> fail).
>
> postgres logs:
> `pg_xlog/0000000100000000000000E4' -> `../backups/arc/0000000100000000000000E4'
> 2016-04-15 23:35:09 CEST:pgreplic@[unknown]:[29035] WARNING:  skipping special file "./pg_hba.conf"
> 2016-04-15 23:35:09 CEST:pgreplic@[unknown]:[29035] WARNING:  skipping special file "./pg_hba.conf.save"
> 2016-04-15 23:35:01 CEST:@:[28926] LOG:  database system was interrupted; last known up at 2016-04-15 23:35:09 CEST
> 2016-04-15 23:35:01 CEST:postgres@postgres:[28938] FATAL:  the database system is starting up
> 2016-04-15 23:35:02 CEST:@:[28926] LOG:  entering standby mode
> 2016-04-15 23:35:02 CEST:@:[28926] LOG:  database system was not properly shut down; automatic recovery in progress
 <------------thismeans something missing from 
> pg_basebackup
> 2016-04-15 23:35:02 CEST:@:[28926] LOG:  redo starts at 0/E5000028
> 2016-04-15 23:35:02 CEST:@:[28926] LOG:  consistent recovery state reached at 0/E6000000
> 2016-04-15 23:35:02 CEST:@:[28904] LOG:  database system is ready to accept read only connections
> 2016-04-15 23:35:02 CEST:@:[28989] LOG:  started streaming WAL from primary at 0/E6000000 on timeline 1
>
> postgres params on master node:
> log_line_prefix = '%t:%u@%d:[%p] '
> logging_collector = on
> wal_buffers = 16MB
> max_wal_size = 200MB
> log_temp_files = 1MB
> max_connections = 170
> shared_buffers = 512MB
> effective_cache_size = 1500MB
> work_mem = 48MB
> log_lock_waits = on
> log_min_duration_statement = 10000
> shared_preload_libraries = 'pg_stat_statements'
> include '/var/lib/pgsql/tmp/rep_mode.conf' # added by pgsql RA
> wal_level = hot_standby
> archive_mode = on
> max_wal_senders = 10
> hot_standby = on
> wal_keep_segments = 128
> archive_command = '/opt/postgres/dbconf/archive_command.sh %p %f'
> wal_receiver_status_interval = 2
> max_standby_streaming_delay = -1
> max_standby_archive_delay = -1
> restart_after_crash = off
> hot_standby_feedback = on
>

--
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800


Re: pg_basebackup: return value 1: reason?

From
Andrej Vanek
Date:
Hello,

my setup is:
1 master
1 synchronous slave (up and running)
2 asynchronous slave (up and running)
Now I'm setting up asynchronous slave: first step of this setup is pg_basebackup connecting to master.

Indirections are because it is encapsulated in a script. This script is aimed to serve for automated replication recovery. This script is launched by crm_mon -d daemon.

You are right to check without indirections: now I tried to run the command directly from command line without variables- it works fine...

This means: no problem in pg_basebackup itself

but some problem in environment of process launched from crm_mon daemon.
Thanks for your hint (remove indirections). Pg_basebackup works fine when launched from command-line..
Unfortenutely I have still no clue how to solve the issue- su works. I have to find out why pg_basebackup cannot fork when launched from crm_mon.
 
Best Regards, Andrej

2016-04-16 1:17 GMT+02:00 Adrian Klaver <adrian.klaver@aklaver.com>:
On 04/15/2016 03:28 PM, Andrej Vanek wrote:
Hello,

I tried to run pg_basebackup. Return value is 1.

How to find out its reason?
(I suspect that some wal after backup is missing- but how to find
out the real reason? How to fix it?)

First it is not clear to me where you are taking the backup from, the master or the standby?

Second there is a lot of redirection going on. What happens if you run the pg_basebackup directly (without doing  su - postgres ...) and use hardcoded values instead of shell variables?



thanks, Andrej
--------------details:
environment: CentOS 6.7, postgres 9.5.1
( PostgreSQL 9.5.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7
20120313 (Red Hat 4.4.7-16), 64-bit)

I tried 2 forms of pg_basebackup (-X fetch and -X stream). Both were
issued from a script:
# su - postgres -c "/usr/pgsql-9.5/bin/pg_basebackup -h ${DB_MASTER_IP}
-D ${GEO_STDBY_DATA} -U pgreplic -P -v -X fetch" 2>${LOG_FILE}.stderr
  >> ${LOG_FILE}
# echo $?
1             <--------------pg_basebackup failed!
# cat log.stderr
# cat /var/log/cluster/geo_repair.log.err
transaction log start point: 0/E3000028 on timeline 1
WARNING:  skipping special file "./pg_hba.conf"
WARNING:  skipping special file "./pg_hba.conf.save"
transaction log end point: 0/E30000F8
pg_basebackup: base backup completed            <------------------no
reason for pg_basebackup failure!
# cp /tmp/pg_hba.conf /tmp/postgresql.conf /pg_data/
# su - postgres -c "/usr/pgsql-9.5/bin/pg_ctl -D /pg_data/ start"
# tail /pg_data/pg_log/postgresql-Fri.log
`pg_xlog/0000000100000000000000E2' ->
`../backups/arc/0000000100000000000000E2'
2016-04-15 23:15:10 CEST:pgreplic@[unknown]:[10667] WARNING:  skipping
special file "./pg_hba.conf"
2016-04-15 23:15:10 CEST:pgreplic@[unknown]:[10667] WARNING:  skipping
special file "./pg_hba.conf.save"         <---------------recorded in
pg_log on master node and copied by pg_basebackup (note time difference
between two servers)
2016-04-15 23:15:02 CEST:@:[23321] LOG:  database system was
interrupted; last known up at 2016-04-15 23:15:10 CEST
2016-04-15 23:15:02 CEST:postgres@postgres:[23329] FATAL:  the database
system is starting up
2016-04-15 23:15:03 CEST:@:[23321] LOG:  entering standby mode
2016-04-15 23:15:03 CEST:@:[23321] LOG:  database system was not
properly shut down; automatic recovery in progress <---------something
missing from pg_basebackup
2016-04-15 23:15:03 CEST:@:[23321] LOG:  redo starts at 0/E3000028
2016-04-15 23:15:03 CEST:@:[23321] LOG:  consistent recovery state
reached at 0/E4000000
2016-04-15 23:15:03 CEST:@:[23295] LOG:  database system is ready to
accept read only connections
2016-04-15 23:15:03 CEST:@:[23356] LOG:  started streaming WAL from
primary at 0/E4000000 on timeline 1
-------second trial
# su - postgres -c "/usr/pgsql-9.5/bin/pg_basebackup -h ${DB_MASTER_IP}
-D ${GEO_STDBY_DATA} -U pgreplic -P -v -X stream"
# echo $?
1
#  cat /var/log/cluster/geo_repair.log.err
transaction log start point: 0/E5000028 on timeline 1
pg_basebackup: starting background WAL receiver
WARNING:  skipping special file "./pg_hba.conf"
WARNING:  skipping special file "./pg_hba.conf.save"
transaction log end point: 0/E50000F8
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: could not wait for child process: No child processes
    <----what does this mean? I think it failed to start process to
fetching wal logs created during backup: but neither on master node
neither on pg_basebackup output here is any information about reason..
(max_wal_senders on master is 10: I see no reason to fail).

postgres logs:
`pg_xlog/0000000100000000000000E4' ->
`../backups/arc/0000000100000000000000E4'
2016-04-15 23:35:09 CEST:pgreplic@[unknown]:[29035] WARNING:  skipping
special file "./pg_hba.conf"
2016-04-15 23:35:09 CEST:pgreplic@[unknown]:[29035] WARNING:  skipping
special file "./pg_hba.conf.save"
2016-04-15 23:35:01 CEST:@:[28926] LOG:  database system was
interrupted; last known up at 2016-04-15 23:35:09 CEST
2016-04-15 23:35:01 CEST:postgres@postgres:[28938] FATAL:  the database
system is starting up
2016-04-15 23:35:02 CEST:@:[28926] LOG:  entering standby mode
2016-04-15 23:35:02 CEST:@:[28926] LOG:  database system was not
properly shut down; automatic recovery in progress  <------------this
means something missing from pg_basebackup
2016-04-15 23:35:02 CEST:@:[28926] LOG:  redo starts at 0/E5000028
2016-04-15 23:35:02 CEST:@:[28926] LOG:  consistent recovery state
reached at 0/E6000000
2016-04-15 23:35:02 CEST:@:[28904] LOG:  database system is ready to
accept read only connections
2016-04-15 23:35:02 CEST:@:[28989] LOG:  started streaming WAL from
primary at 0/E6000000 on timeline 1

postgres params on master node:
log_line_prefix = '%t:%u@%d:[%p] '
logging_collector = on
wal_buffers = 16MB
max_wal_size = 200MB
log_temp_files = 1MB
max_connections = 170
shared_buffers = 512MB
effective_cache_size = 1500MB
work_mem = 48MB
log_lock_waits = on
log_min_duration_statement = 10000
shared_preload_libraries = 'pg_stat_statements'
include '/var/lib/pgsql/tmp/rep_mode.conf' # added by pgsql RA
wal_level = hot_standby
archive_mode = on
max_wal_senders = 10
hot_standby = on
wal_keep_segments = 128
archive_command = '/opt/postgres/dbconf/archive_command.sh %p %f'
wal_receiver_status_interval = 2
max_standby_streaming_delay = -1
max_standby_archive_delay = -1
restart_after_crash = off
hot_standby_feedback = on



--
Adrian Klaver
adrian.klaver@aklaver.com

Re: pg_basebackup: return value 1: reason?

From
Andrej Vanek
Date:
Hello,

Are you not getting something informative on your display and/or sending
stderr to somewher else?

- in case -X fetch no informative output
- in case -X stream following:
pg_basebackup: could not wait for child process: No child processes

Now after re-test directly from command-line it works. It fails just when launched from a script fired by crm_mon -d -E my-script

Regards, Andrej

2016-04-16 1:18 GMT+02:00 Jerry Sievers <gsievers19@comcast.net>:
Andrej Vanek <andrej.vanek.sk@gmail.com> writes:

> Hello, 
>
> I tried to run pg_basebackup. Return value is 1.
>
> How to find out its reason?
> (I suspect that some wal after backup is missing- but how to find out the real reason? How to fix it?)

Well there are more than 80 cases of exit 1 in a couple .c files and
most/all are preceeded unsurprisingly with an fprintf(stderr.

Are you not getting something informative on your display and/or sending
stderr to somewher else?

$ echo $*
./src/bin/pg_basebackup/pg_basebackup.c ./src/backend/replication/basebackup.c

$ grep -h -B 4 -i 'exit.*(1' $* | egrep -hi 'fprintf\(stderr|exit'
                        fprintf(stderr, _("%s: directory name too long\n"), progname);
                        exit(1);
                                fprintf(stderr, _("%s: multiple \"=\" signs in tablespace mapping\n"), progname);
                                exit(1);
                fprintf(stderr,
                exit(1);
                fprintf(stderr, _("%s: old directory is not an absolute path in tablespace mapping: %s\n"),
                exit(1);
                fprintf(stderr, _("%s: new directory is not an absolute path in tablespace mapping: %s\n"),
                exit(1);
                                fprintf(stderr, _("%s: could not read from ready pipe: %s\n"),
                                exit(1);
                                fprintf(stderr,
                                exit(1);
                fprintf(stderr,
                disconnect_and_exit(1);
                fprintf(stderr,
                disconnect_and_exit(1);
                exit(1);
                fprintf(stderr,
                disconnect_and_exit(1);
                fprintf(stderr, _("%s: could not create background process: %s\n"),
                disconnect_and_exit(1);
                fprintf(stderr, _("%s: could not create background thread: %s\n"),
                disconnect_and_exit(1);
                                fprintf(stderr,
                                disconnect_and_exit(1);
                        fprintf(stderr,
                        disconnect_and_exit(1);
                        fprintf(stderr, _("%s: could not access directory \"%s\": %s\n"),
                        disconnect_and_exit(1);
                fprintf(stderr,
                exit(1);
                fprintf(stderr,
                exit(1);
                fprintf(stderr, _("%s: transfer rate must be greater than zero\n"),
                exit(1);
                fprintf(stderr,
                exit(1);
                fprintf(stderr,
                exit(1);
                fprintf(stderr,
                exit(1);
                        fprintf(stderr,
                        disconnect_and_exit(1);
                        fprintf(stderr, _("%s: could not write to file \"%s\": %s\n"),
                        disconnect_and_exit(1);
                                        fprintf(stderr,
                                        disconnect_and_exit(1);
                                        fprintf(stderr,
                                        disconnect_and_exit(1);
                                fprintf(stderr,
                                disconnect_and_exit(1);
                        fprintf(stderr,
                        disconnect_and_exit(1);
                        fprintf(stderr, _("%s: could not create file \"%s\": %s\n"),
                        disconnect_and_exit(1);
                fprintf(stderr, _("%s: could not get COPY data stream: %s"),
                disconnect_and_exit(1);
                                        fprintf(stderr,
                                        disconnect_and_exit(1);
                                                fprintf(stderr,
                                                disconnect_and_exit(1);
                        fprintf(stderr, _("%s: could not read COPY data: %s"),
                        disconnect_and_exit(1);
                fprintf(stderr, _("%s: could not get COPY data stream: %s"),
                disconnect_and_exit(1);
                        fprintf(stderr, _("%s: could not read COPY data: %s"),
                        disconnect_and_exit(1);
                                fprintf(stderr, _("%s: invalid tar block header size: %d\n"),
                                disconnect_and_exit(1);
                                                        fprintf(stderr,
                                                        disconnect_and_exit(1);
                                                fprintf(stderr,
                                                disconnect_and_exit(1);
                                        fprintf(stderr,
                                        disconnect_and_exit(1);
                                fprintf(stderr, _("%s: could not create file \"%s\": %s\n"),
                                disconnect_and_exit(1);
                                fprintf(stderr, _("%s: could not write to file \"%s\": %s\n"),
                                disconnect_and_exit(1);
                fprintf(stderr,
                disconnect_and_exit(1);
                fprintf(stderr, _("%s: out of memory\n"), progname);
                exit(1);
                fprintf(stderr, _("%s: out of memory\n"), progname);
                disconnect_and_exit(1);
                fprintf(stderr, _("%s: out of memory\n"), progname);
                disconnect_and_exit(1);
                fprintf(stderr, _("%s: out of memory\n"), progname);
                disconnect_and_exit(1);
                fprintf(stderr, _("%s: could not create file \"%s\": %s\n"), progname, filename, strerror(errno));
                disconnect_and_exit(1);
                fprintf(stderr,
                disconnect_and_exit(1);
                exit(1);
                fprintf(stderr, _("%s: incompatible server version %s\n"),
                disconnect_and_exit(1);
                disconnect_and_exit(1);
                disconnect_and_exit(1);
                fprintf(stderr, _("%s: could not send replication command \"%s\": %s"),
                disconnect_and_exit(1);
                fprintf(stderr, _("%s: could not initiate base backup: %s"),
                disconnect_and_exit(1);
                fprintf(stderr,
                disconnect_and_exit(1);
                fprintf(stderr, _("%s: could not get backup header: %s"),
                disconnect_and_exit(1);
                fprintf(stderr, _("%s: no data returned from server\n"), progname);
                disconnect_and_exit(1);
                fprintf(stderr,
                disconnect_and_exit(1);
                fprintf(stderr,
                disconnect_and_exit(1);
                fprintf(stderr,
                disconnect_and_exit(1);
                fprintf(stderr, _("%s: final receive failed: %s"),
                disconnect_and_exit(1);
                        fprintf(stderr,
                        disconnect_and_exit(1);
                        fprintf(stderr, _("%s: could not wait for child process: %s\n"),
                        disconnect_and_exit(1);
                        fprintf(stderr, _("%s: child %d died, expected %d\n"),
                        disconnect_and_exit(1);
                if (!WIFEXITED(status))
                        fprintf(stderr, _("%s: child process did not exit normally\n"),
                        disconnect_and_exit(1);
                if (WEXITSTATUS(status) != 0)
                        fprintf(stderr, _("%s: child process exited with error %d\n"),
                                        progname, WEXITSTATUS(status));
                        disconnect_and_exit(1);
                        fprintf(stderr,
                        disconnect_and_exit(1);
                        fprintf(stderr, _("%s: could not wait for child thread: %s\n"),
                        disconnect_and_exit(1);
                        fprintf(stderr, _("%s: could not get child thread exit status: %s\n"),
                        disconnect_and_exit(1);
                        fprintf(stderr, _("%s: child thread exited with error %u\n"),
                        disconnect_and_exit(1);
                                        fprintf(stderr,
                                        exit(1);
                                        fprintf(stderr,
                                        exit(1);
                                        fprintf(stderr,
                                        exit(1);
                                        fprintf(stderr,
                                        exit(1);
                                        fprintf(stderr, _("%s: invalid compression level \"%s\"\n"),
                                        exit(1);
                                        fprintf(stderr, _("%s: invalid checkpoint argument \"%s\", must be \"fast\" or \"spread\"\n"),
                                        exit(1);
                                        fprintf(stderr, _("%s: invalid status interval \"%s\"\n"),
                                        exit(1);
                                fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
                                exit(1);
                fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
                exit(1);
                fprintf(stderr, _("%s: no target directory specified\n"), progname);
                fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
                exit(1);
                fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
                exit(1);
                fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
                exit(1);
                        fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
                        exit(1);
                        fprintf(stderr, _("%s: transaction log directory location must be "
                        fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
                        exit(1);
                fprintf(stderr,
                exit(1);
                        fprintf(stderr, _("%s: could not create symbolic link \"%s\": %s\n"),
                        exit(1);
                fprintf(stderr, _("%s: symlinks are not supported on this platform\n"));
                exit(1);


>
> thanks, Andrej 
> --------------details:
> environment: CentOS 6.7, postgres 9.5.1
> ( PostgreSQL 9.5.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-16), 64-bit)
>
> I tried 2 forms of pg_basebackup (-X fetch and -X stream). Both were issued from a script:
> # su - postgres -c "/usr/pgsql-9.5/bin/pg_basebackup -h ${DB_MASTER_IP} -D ${GEO_STDBY_DATA} -U pgreplic -P -v -X fetch" 2>${LOG_FILE}.stderr  >> ${LOG_FILE}
> # echo $?
> 1             <--------------pg_basebackup failed!
> # cat log.stderr
> # cat /var/log/cluster/geo_repair.log.err
> transaction log start point: 0/E3000028 on timeline 1
> WARNING:  skipping special file "./pg_hba.conf"
> WARNING:  skipping special file "./pg_hba.conf.save"
> transaction log end point: 0/E30000F8
> pg_basebackup: base backup completed            <------------------no reason for pg_basebackup failure!
> # cp /tmp/pg_hba.conf /tmp/postgresql.conf /pg_data/
> # su - postgres -c "/usr/pgsql-9.5/bin/pg_ctl -D /pg_data/ start"
> # tail /pg_data/pg_log/postgresql-Fri.log
> `pg_xlog/0000000100000000000000E2' -> `../backups/arc/0000000100000000000000E2'
> 2016-04-15 23:15:10 CEST:pgreplic@[unknown]:[10667] WARNING:  skipping special file "./pg_hba.conf"
> 2016-04-15 23:15:10 CEST:pgreplic@[unknown]:[10667] WARNING:  skipping special file "./pg_hba.conf.save"         <---------------recorded in pg_log on master node and
> copied by pg_basebackup (note time difference between two servers)
> 2016-04-15 23:15:02 CEST:@:[23321] LOG:  database system was interrupted; last known up at 2016-04-15 23:15:10 CEST
> 2016-04-15 23:15:02 CEST:postgres@postgres:[23329] FATAL:  the database system is starting up
> 2016-04-15 23:15:03 CEST:@:[23321] LOG:  entering standby mode
> 2016-04-15 23:15:03 CEST:@:[23321] LOG:  database system was not properly shut down; automatic recovery in progress <---------something missing from pg_basebackup
> 2016-04-15 23:15:03 CEST:@:[23321] LOG:  redo starts at 0/E3000028
> 2016-04-15 23:15:03 CEST:@:[23321] LOG:  consistent recovery state reached at 0/E4000000
> 2016-04-15 23:15:03 CEST:@:[23295] LOG:  database system is ready to accept read only connections
> 2016-04-15 23:15:03 CEST:@:[23356] LOG:  started streaming WAL from primary at 0/E4000000 on timeline 1
> -------second trial
> # su - postgres -c "/usr/pgsql-9.5/bin/pg_basebackup -h ${DB_MASTER_IP} -D ${GEO_STDBY_DATA} -U pgreplic -P -v -X stream" 
> # echo $?
> 1
> #  cat /var/log/cluster/geo_repair.log.err 
> transaction log start point: 0/E5000028 on timeline 1
> pg_basebackup: starting background WAL receiver
> WARNING:  skipping special file "./pg_hba.conf"
> WARNING:  skipping special file "./pg_hba.conf.save"
> transaction log end point: 0/E50000F8
> pg_basebackup: waiting for background process to finish streaming ...
> pg_basebackup: could not wait for child process: No child processes        <----what does this mean? I think it failed to start process to fetching wal logs created
> during backup: but neither on master node neither on pg_basebackup output here is any information about reason.. (max_wal_senders on master is 10: I see no reason to
> fail).
>
> postgres logs:
> `pg_xlog/0000000100000000000000E4' -> `../backups/arc/0000000100000000000000E4'
> 2016-04-15 23:35:09 CEST:pgreplic@[unknown]:[29035] WARNING:  skipping special file "./pg_hba.conf"
> 2016-04-15 23:35:09 CEST:pgreplic@[unknown]:[29035] WARNING:  skipping special file "./pg_hba.conf.save"
> 2016-04-15 23:35:01 CEST:@:[28926] LOG:  database system was interrupted; last known up at 2016-04-15 23:35:09 CEST
> 2016-04-15 23:35:01 CEST:postgres@postgres:[28938] FATAL:  the database system is starting up
> 2016-04-15 23:35:02 CEST:@:[28926] LOG:  entering standby mode
> 2016-04-15 23:35:02 CEST:@:[28926] LOG:  database system was not properly shut down; automatic recovery in progress  <------------this means something missing from
> pg_basebackup
> 2016-04-15 23:35:02 CEST:@:[28926] LOG:  redo starts at 0/E5000028
> 2016-04-15 23:35:02 CEST:@:[28926] LOG:  consistent recovery state reached at 0/E6000000
> 2016-04-15 23:35:02 CEST:@:[28904] LOG:  database system is ready to accept read only connections
> 2016-04-15 23:35:02 CEST:@:[28989] LOG:  started streaming WAL from primary at 0/E6000000 on timeline 1
>
> postgres params on master node:
> log_line_prefix = '%t:%u@%d:[%p] '
> logging_collector = on
> wal_buffers = 16MB
> max_wal_size = 200MB
> log_temp_files = 1MB
> max_connections = 170
> shared_buffers = 512MB
> effective_cache_size = 1500MB
> work_mem = 48MB
> log_lock_waits = on
> log_min_duration_statement = 10000
> shared_preload_libraries = 'pg_stat_statements'
> include '/var/lib/pgsql/tmp/rep_mode.conf' # added by pgsql RA
> wal_level = hot_standby
> archive_mode = on
> max_wal_senders = 10
> hot_standby = on
> wal_keep_segments = 128
> archive_command = '/opt/postgres/dbconf/archive_command.sh %p %f'
> wal_receiver_status_interval = 2
> max_standby_streaming_delay = -1
> max_standby_archive_delay = -1
> restart_after_crash = off
> hot_standby_feedback = on
>

--
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800

Re: pg_basebackup: return value 1: reason?

From
Adrian Klaver
Date:
On 04/16/2016 02:24 PM, Andrej Vanek wrote:
> Hello,
>
> my setup is:
> 1 master
> 1 synchronous slave (up and running)
> 2 asynchronous slave (up and running)
> Now I'm setting up asynchronous slave: first step of this setup is
> pg_basebackup connecting to master.
>
> Indirections are because it is encapsulated in a script. This script is
> aimed to serve for automated replication recovery. This script is
> launched by crm_mon -d daemon.
>
> You are right to check without indirections: now I tried to run the
> command directly from command line without variables- it works fine...
>
> This means: no problem in pg_basebackup itself
>
> but some problem in environment of process launched from crm_mon daemon.
> Thanks for your hint (remove indirections). Pg_basebackup works fine
> when launched from command-line..
> Unfortenutely I have still no clue how to solve the issue- su works. I

Is the su - even necessary?

pg_basebackup is a Postgres client program you can specify the user you
want it to connect to using -U.

Or do you need the script to run as postgres in order to get permissions
on wherever you are creating the backup directory?

> have to find out why pg_basebackup cannot fork when launched from crm_mon.


I assume crm_mon is this:

http://linux.die.net/man/8/crm_mon

from Pacemaker.

I do not use Pacemaker, but I am pretty sure that running what is a
monitoring program in daemon mode and then shelling out to another
program is not workable. The docs seem to bear this out:

http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster#Installation

https://github.com/smbambling/pgsql_ha_cluster/wiki/Building-A-Highly-Available-Multi-Node-PostgreSQL-Cluster

> Best Regards, Andrej
>
> 2016-04-16 1:17 GMT+02:00 Adrian Klaver <adrian.klaver@aklaver.com
> <mailto:adrian.klaver@aklaver.com>>:
>

--
Adrian Klaver
adrian.klaver@aklaver.com


Re: pg_basebackup: return value 1: reason?

From
Andrej Vanek
Date:
Hello Adrian,

I tried to use -U without "su"- launched directly by root: same behaviour.
Finally I reverted my script to use standard backup (pg_start_backup; rsync; pg_stop_backup)- this works- the only downside is possible collisions with on-line backup/synchronizaiton of other two nodes on master node...

Back to the pg_basebackup issue: it is clear to me that this is an issue of environment which launched pg_basebackup. 
Possibly either some privileges or  some kernel parameters/limits. Who knows? 
Summary: clusterlab's crm_mon launched a shell script starting pg_basebackup which fails to do some its work (pg_basebackup: could not wait for child process: No child processes)- probably due to some failing system call.

How can I report to clusterlabs: What system call fails in pg_basebackup?

Best Regards, Andrej



2016-04-17 1:09 GMT+02:00 Adrian Klaver <adrian.klaver@aklaver.com>:

Is the su - even necessary?

pg_basebackup is a Postgres client program you can specify the user you want it to connect to using -U.

Or do you need the script to run as postgres in order to get permissions on wherever you are creating the backup directory?

have to find out why pg_basebackup cannot fork when launched from crm_mon.


I assume crm_mon is this:

http://linux.die.net/man/8/crm_mon

from Pacemaker.

I do not use Pacemaker, but I am pretty sure that running what is a monitoring program in daemon mode and then shelling out to another program is not workable. The docs seem to bear this out:

http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster#Installation

https://github.com/smbambling/pgsql_ha_cluster/wiki/Building-A-Highly-Available-Multi-Node-PostgreSQL-Cluster

Re: pg_basebackup: return value 1: reason?

From
Adrian Klaver
Date:
On 04/17/2016 12:13 PM, Andrej Vanek wrote:
> Hello Adrian,
>
> I tried to use -U without "su"- launched directly by root: same behaviour.
> Finally I reverted my script to use standard backup (pg_start_backup;
> rsync; pg_stop_backup)- this works- the only downside is possible
> collisions with on-line backup/synchronizaiton of other two nodes on
> master node...
>
> Back to the pg_basebackup issue: it is clear to me that this is an issue
> of environment which launched pg_basebackup.
> Possibly either some privileges or  some kernel parameters/limits. Who
> knows?
> Summary: clusterlab's crm_mon launched a shell script starting
> pg_basebackup which fails to do some its work (pg_basebackup: could not
> wait for child process: No child processes)- probably due to some
> failing system call.
>
> How can I report to clusterlabs: What system call fails in pg_basebackup?

All I can to do is point you at:

https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD

>
> Best Regards, Andrej
>
>
>
> 2016-04-17 1:09 GMT+02:00 Adrian Klaver <adrian.klaver@aklaver.com
> <mailto:adrian.klaver@aklaver.com>>:
>
>
>     Is the su - even necessary?
>
>     pg_basebackup is a Postgres client program you can specify the user
>     you want it to connect to using -U.
>
>     Or do you need the script to run as postgres in order to get
>     permissions on wherever you are creating the backup directory?
>
>         have to find out why pg_basebackup cannot fork when launched
>         from crm_mon.
>
>
>
>     I assume crm_mon is this:
>
>     http://linux.die.net/man/8/crm_mon
>
>     from Pacemaker.
>
>     I do not use Pacemaker, but I am pretty sure that running what is a
>     monitoring program in daemon mode and then shelling out to another
>     program is not workable. The docs seem to bear this out:
>
>     http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster#Installation
>
>     https://github.com/smbambling/pgsql_ha_cluster/wiki/Building-A-Highly-Available-Multi-Node-PostgreSQL-Cluster
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: pg_basebackup: return value 1: reason?

From
Andrej Vanek
Date:
Hello,
I've given a try once again. 
Two variants used in my script (launched by crm_mon):
1. /usr/pgsql-9.5/bin/pg_basebackup -U pgreplic -h db-other-site -w -D /opt/geo_stdby_data -c fast -vvv -X stream &>> /tmp/log
2. strace -o /tmp/pg_basebackup.log /usr/pgsql-9.5/bin/pg_basebackup -U pgreplic -h db-other-site -w -D /opt/geo_stdby_data -c fast -vvv -X stream &>> /tmp/log

Result:
variant 2. works fine with return code 0 (with strace)
variant 1. fails with error code 1 (without strace)

Any ideas?

Andrej
----------------------details
Output:
Variant 2:
DEBUG:  CommitTransaction
DEBUG:  name: unnamed; blockState:       STARTED; state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1, children:
DEBUG:  received replication command: IDENTIFY_SYSTEM
DEBUG:  received replication command: BASE_BACKUP LABEL 'pg_basebackup base backup'   FAST NOWAIT
-- Mon May 23 17:54:31 CEST 2016 [l1abrnch->l1abrnch:3122/27282:GEO] --INFO-- l1abrnch->l1abrnch (GEO-STDBY-DB / stop: 0): target/returned 0/0 (OK)
transaction log start point: 0/FA000028 on timeline 1
pg_basebackup: starting background WAL receiver
DEBUG:  CommitTransaction
DEBUG:  name: unnamed; blockState:       STARTED; state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1, children:
DEBUG:  received replication command: IDENTIFY_SYSTEM
DEBUG:  received replication command: START_REPLICATION 0/FA000000 TIMELINE 1
WARNING:  skipping special file "./pg_hba.conf"
DEBUG:  standby "pg_basebackup" has now caught up with primary
DEBUG:  write 0/FA000000 flush 0/0 apply 0/0
DEBUG:  removing transaction log backup history file "0000000100000000000000F8.00000028.backup"
transaction log end point: 0/FA0000F8
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: base backup completed
RETVAL=0

Output
Variant 1:
DEBUG:  CommitTransaction
DEBUG:  name: unnamed; blockState:       STARTED; state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1, children:
DEBUG:  received replication command: IDENTIFY_SYSTEM
DEBUG:  received replication command: BASE_BACKUP LABEL 'pg_basebackup base backup'   FAST NOWAIT
-- Mon May 23 17:55:32 CEST 2016 [l1abrnch->l1abrnch:3122/28785:GEO] --INFO-- l1abrnch->l1abrnch (GEO-STDBY-DB / stop: 0): target/returned 0/0 (OK)
transaction log start point: 0/FC000028 on timeline 1
pg_basebackup: starting background WAL receiver
DEBUG:  CommitTransaction
DEBUG:  name: unnamed; blockState:       STARTED; state: INPROGR, xid/subid/cid: 0/1/0, nestlvl: 1, children:
DEBUG:  received replication command: IDENTIFY_SYSTEM
DEBUG:  received replication command: START_REPLICATION 0/FC000000 TIMELINE 1
WARNING:  skipping special file "./pg_hba.conf"
DEBUG:  standby "pg_basebackup" has now caught up with primary
DEBUG:  write 0/FC000000 flush 0/0 apply 0/0
DEBUG:  removing transaction log backup history file "0000000100000000000000FA.00000028.backup"
transaction log end point: 0/FC0000F8
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: could not wait for child process: No child processes
RETVAL=1




2016-04-18 16:12 GMT+02:00 Adrian Klaver <adrian.klaver@aklaver.com>:
On 04/17/2016 12:13 PM, Andrej Vanek wrote:
Hello Adrian,

I tried to use -U without "su"- launched directly by root: same behaviour.
Finally I reverted my script to use standard backup (pg_start_backup;
rsync; pg_stop_backup)- this works- the only downside is possible
collisions with on-line backup/synchronizaiton of other two nodes on
master node...

Back to the pg_basebackup issue: it is clear to me that this is an issue
of environment which launched pg_basebackup.
Possibly either some privileges or  some kernel parameters/limits. Who
knows?
Summary: clusterlab's crm_mon launched a shell script starting
pg_basebackup which fails to do some its work (pg_basebackup: could not
wait for child process: No child processes)- probably due to some
failing system call.

How can I report to clusterlabs: What system call fails in pg_basebackup?

All I can to do is point you at:

https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD


Best Regards, Andrej



2016-04-17 1:09 GMT+02:00 Adrian Klaver <adrian.klaver@aklaver.com
<mailto:adrian.klaver@aklaver.com>>:


    Is the su - even necessary?

    pg_basebackup is a Postgres client program you can specify the user
    you want it to connect to using -U.

    Or do you need the script to run as postgres in order to get
    permissions on wherever you are creating the backup directory?

        have to find out why pg_basebackup cannot fork when launched
        from crm_mon.



    I assume crm_mon is this:

    http://linux.die.net/man/8/crm_mon

    from Pacemaker.

    I do not use Pacemaker, but I am pretty sure that running what is a
    monitoring program in daemon mode and then shelling out to another
    program is not workable. The docs seem to bear this out:

    http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster#Installation

    https://github.com/smbambling/pgsql_ha_cluster/wiki/Building-A-Highly-Available-Multi-Node-PostgreSQL-Cluster




--
Adrian Klaver
adrian.klaver@aklaver.com