Re: WIP/PoC for parallel backup - Mailing list pgsql-hackers

From Kashif Zeeshan
Subject Re: WIP/PoC for parallel backup
Msg-id CAKfXphqhzCr-8ggS9-o_ctMiLm7h+4bkcUP1un087K3sS2EPjw@mail.gmail.com
In response to Re: WIP/PoC for parallel backup  (Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com>)
Responses Re: WIP/PoC for parallel backup  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Hi Asif,

The backup failed with the error "could not connect to server: could not look up local user ID 1000: Too many open files" when max_wal_senders was set to 2000.
The errors were generated for the workers starting from backup worker 1017.
Please note that the backup directory was also not cleaned up after the backup failed.
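For context, each backup worker holds at least one libpq socket plus whatever data files it has open, so with -j 1990 the process can easily exhaust a default per-process descriptor limit of 1024 (even the local user-ID lookup fails once the descriptor table is full). A minimal pre-flight sketch of how the client could fail fast before spawning workers — check_fd_limit() is a hypothetical helper, not something in the patch:

/*
 * Hypothetical pre-flight check (not in the patch): make sure the
 * file-descriptor limit can plausibly cover one connection per worker
 * plus some headroom for data files, and bail out early otherwise.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

static void
check_fd_limit(int numworkers)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
		return;					/* cannot check; proceed as before */

	/* one socket per worker, plus an assumed margin of 64 for data files */
	if ((rlim_t) numworkers + 64 > rl.rlim_cur)
	{
		fprintf(stderr,
				"pg_basebackup: error: %d jobs exceed the open-file limit "
				"(%ld); raise it with ulimit -n or reduce --jobs\n",
				numworkers, (long) rl.rlim_cur);
		exit(1);
	}
}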


Steps
=======
1) Generate data in DB
 ./pgbench -i -s 600 -h localhost  -p 5432 postgres
2) Set max_wal_senders = 2000 in postgresql.conf.
3) Generate the backup


[edb@localhost bin]$ ./pg_basebackup -v -j 1990 -D  /home/edb/Desktop/backup/
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 1/F1000028 on timeline 1
pg_basebackup: starting background WAL receiver
pg_basebackup: created temporary replication slot "pg_basebackup_58692"
pg_basebackup: backup worker (0) created
...
pg_basebackup: backup worker (1017) created
pg_basebackup: error: could not connect to server: could not look up local user ID 1000: Too many open files
pg_basebackup: backup worker (1018) created
pg_basebackup: error: could not connect to server: could not look up local user ID 1000: Too many open files



pg_basebackup: error: could not connect to server: could not look up local user ID 1000: Too many open files
pg_basebackup: backup worker (1989) created
pg_basebackup: error: could not create file "/home/edb/Desktop/backup//global/4183": Too many open files
pg_basebackup: error: could not create file "/home/edb/Desktop/backup//global/3592": Too many open files
pg_basebackup: error: could not create file "/home/edb/Desktop/backup//global/4177": Too many open files
[edb@localhost bin]$


4) The backup directory is not cleaned up


[edb@localhost bin]$
[edb@localhost bin]$ ls  /home/edb/Desktop/backup
base    pg_commit_ts  pg_logical    pg_notify    pg_serial     pg_stat      pg_subtrans  pg_twophase  pg_xact
global  pg_dynshmem   pg_multixact  pg_replslot  pg_snapshots  pg_stat_tmp  pg_tblspc    pg_wal
[edb@localhost bin]$


Kashif Zeeshan
EnterpriseDB


On Thu, Apr 2, 2020 at 2:58 PM Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote:
Hi Asif,

My colleague Kashif Zeeshan reported an issue off-list, posting here, please take a look.

When executing two backups at the same time, a FATAL error is raised due to max_wal_senders, but instead of exiting, the backup reports completion.
And when trying to start the server from the backup cluster, it fails with an error.
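In the transcript below, each worker that fails to connect logs the FATAL error and is silently skipped, yet the run still ends with "base backup completed". A minimal sketch of the fail-fast behaviour one would expect — the helper names spawn_backup_workers() and the cleanup call are illustrative, not taken from the patch:

/*
 * Illustrative only: abort the whole backup if any worker connection
 * cannot be established, instead of continuing with fewer workers and
 * reporting success at the end.
 */
static void
spawn_backup_workers(int numworkers)
{
	for (int i = 0; i < numworkers; i++)
	{
		PGconn	   *conn = GetConnection();	/* one replication conn per worker */

		if (conn == NULL)
		{
			pg_log_error("could not create backup worker %d, aborting", i);
			cleanup_directories_atexit();	/* drop the partial backup dir */
			exit(1);
		}
		/* ... hand conn to the worker thread and continue ... */
	}
}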

[edb@localhost bin]$ ./pgbench -i -s 200 -h localhost -p 5432 postgres
[edb@localhost bin]$ ./pg_basebackup -v -j 8 -D  /home/edb/Desktop/backup/
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/C2000270 on timeline 1
pg_basebackup: starting background WAL receiver
pg_basebackup: created temporary replication slot "pg_basebackup_57849"
pg_basebackup: backup worker (0) created
pg_basebackup: backup worker (1) created
pg_basebackup: backup worker (2) created
pg_basebackup: error: could not connect to server: FATAL:  number of requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: backup worker (3) created
pg_basebackup: error: could not connect to server: FATAL:  number of requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: backup worker (4) created
pg_basebackup: error: could not connect to server: FATAL:  number of requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: backup worker (5) created
pg_basebackup: error: could not connect to server: FATAL:  number of requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: backup worker (6) created
pg_basebackup: error: could not connect to server: FATAL:  number of requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: backup worker (7) created
pg_basebackup: write-ahead log end point: 0/C3000050
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: syncing data to disk ...
pg_basebackup: base backup completed
[edb@localhost bin]$ ./pg_basebackup -v -j 8 -D  /home/edb/Desktop/backup1/
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/C20001C0 on timeline 1
pg_basebackup: starting background WAL receiver
pg_basebackup: created temporary replication slot "pg_basebackup_57848"
pg_basebackup: backup worker (0) created
pg_basebackup: backup worker (1) created
pg_basebackup: backup worker (2) created
pg_basebackup: error: could not connect to server: FATAL:  number of requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: backup worker (3) created
pg_basebackup: error: could not connect to server: FATAL:  number of requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: backup worker (4) created
pg_basebackup: error: could not connect to server: FATAL:  number of requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: backup worker (5) created
pg_basebackup: error: could not connect to server: FATAL:  number of requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: backup worker (6) created
pg_basebackup: error: could not connect to server: FATAL:  number of requested standby connections exceeds max_wal_senders (currently 10)
pg_basebackup: backup worker (7) created
pg_basebackup: write-ahead log end point: 0/C2000348
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: syncing data to disk ...
pg_basebackup: base backup completed

[edb@localhost bin]$ ./pg_ctl -D /home/edb/Desktop/backup1/  -o "-p 5438" start
pg_ctl: directory "/home/edb/Desktop/backup1" is not a database cluster directory

Thanks & Regards,
Rajkumar Raghuwanshi


On Mon, Mar 30, 2020 at 6:28 PM Ahsan Hadi <ahsan.hadi@gmail.com> wrote:


On Mon, Mar 30, 2020 at 3:44 PM Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote:
Thanks Asif,

I have re-verified the reported issues. Except for the standby backup, the others are fixed.

Yes, as Asif mentioned, he is working on the standby issue and on adding bandwidth throttling functionality to parallel backup.

It would be good to get some feedback from Robert on Asif's previous email about the design considerations for standby server support and throttling. I believe all the other points mentioned by Robert in this thread have been addressed by Asif, so it would be good to hear about any other concerns that remain.

Thanks,

-- Ahsan


Thanks & Regards,
Rajkumar Raghuwanshi


On Fri, Mar 27, 2020 at 11:04 PM Asif Rehman <asifr.rehman@gmail.com> wrote:


On Wed, Mar 25, 2020 at 12:22 PM Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote:
Hi Asif,

While testing further, I observed that parallel backup is not able to take a backup of a standby server.

mkdir /tmp/archive_dir
echo "archive_mode='on'">> data/postgresql.conf
echo "archive_command='cp %p /tmp/archive_dir/%f'">> data/postgresql.conf

./pg_ctl -D data -l logs start
./pg_basebackup -p 5432 -Fp -R -D /tmp/slave

echo "primary_conninfo='host=127.0.0.1 port=5432 user=edb'">> /tmp/slave/postgresql.conf
echo "restore_command='cp /tmp/archive_dir/%f %p'">> /tmp/slave/postgresql.conf
echo "promote_trigger_file='/tmp/failover.log'">> /tmp/slave/postgresql.conf

./pg_ctl -D /tmp/slave -l /tmp/slave_logs -o "-p 5433" start -c

[edb@localhost bin]$ ./psql postgres -p 5432 -c "select pg_is_in_recovery();"
 pg_is_in_recovery
-------------------
 f
(1 row)

[edb@localhost bin]$ ./psql postgres -p 5433 -c "select pg_is_in_recovery();"
 pg_is_in_recovery
-------------------
 t
(1 row)

[edb@localhost bin]$ ./pg_basebackup -p 5433 -D /tmp/bkp_s --jobs 6
pg_basebackup: error: could not list backup files: ERROR:  the standby was promoted during online backup
HINT:  This means that the backup being taken is corrupt and should not be used. Try taking another online backup.
pg_basebackup: removing data directory "/tmp/bkp_s"


# the same works fine without parallel backup
[edb@localhost bin]$ ./pg_basebackup -p 5433 -D /tmp/bkp_s --jobs 1
[edb@localhost bin]$ ls /tmp/bkp_s/PG_VERSION
/tmp/bkp_s/PG_VERSION

Thanks & Regards,
Rajkumar Raghuwanshi


On Thu, Mar 19, 2020 at 4:11 PM Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote:
Hi Asif,

In another scenario, the backup data is corrupted for a tablespace. Again, this is not reproducible every time,
but if I run the same set of commands, I get the same error.

[edb@localhost bin]$ ./pg_ctl -D data -l logfile start
waiting for server to start.... done
server started
[edb@localhost bin]$
[edb@localhost bin]$ mkdir /tmp/tblsp
[edb@localhost bin]$ ./psql postgres -p 5432 -c "create tablespace tblsp location '/tmp/tblsp';"
CREATE TABLESPACE
[edb@localhost bin]$ ./psql postgres -p 5432 -c "create database testdb tablespace tblsp;"
CREATE DATABASE
[edb@localhost bin]$ ./psql testdb -p 5432 -c "create table testtbl (a text);"
CREATE TABLE
[edb@localhost bin]$ ./psql testdb -p 5432 -c "insert into testtbl values ('parallel_backup with tablespace');"
INSERT 0 1
[edb@localhost bin]$ ./pg_basebackup -p 5432 -D /tmp/bkp -T /tmp/tblsp=/tmp/tblsp_bkp --jobs 2
[edb@localhost bin]$ ./pg_ctl -D /tmp/bkp -l /tmp/bkp_logs -o "-p 5555" start
waiting for server to start.... done
server started
[edb@localhost bin]$ ./psql postgres -p 5555 -c "select * from pg_tablespace where spcname like 'tblsp%' or spcname = 'pg_default'";
  oid  |  spcname   | spcowner | spcacl | spcoptions
-------+------------+----------+--------+------------
  1663 | pg_default |       10 |        |
 16384 | tblsp      |       10 |        |
(2 rows)

[edb@localhost bin]$ ./psql testdb -p 5555 -c "select * from testtbl";
psql: error: could not connect to server: FATAL:  "pg_tblspc/16384/PG_13_202003051/16385" is not a valid data directory
DETAIL:  File "pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION" is missing.
[edb@localhost bin]$
[edb@localhost bin]$ ls data/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
data/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
[edb@localhost bin]$ ls /tmp/bkp/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
ls: cannot access /tmp/bkp/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION: No such file or directory


Thanks & Regards,
Rajkumar Raghuwanshi


On Mon, Mar 16, 2020 at 6:19 PM Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote:
Hi Asif,

On testing further, I found that pg_basebackup crashed when taking a backup with -R;
this crash is not consistently reproducible.

[edb@localhost bin]$ ./psql postgres -p 5432 -c "create table test (a text);"
CREATE TABLE
[edb@localhost bin]$ ./psql postgres -p 5432 -c "insert into test values ('parallel_backup with -R recovery-conf');"
INSERT 0 1
[edb@localhost bin]$ ./pg_basebackup -p 5432 -j 2 -D /tmp/test_bkp/bkp -R
Segmentation fault (core dumped)

The stack trace looks the same as in the earlier reported crash with tablespaces.
--stack trace
[edb@localhost bin]$ gdb -q -c core.37915 pg_basebackup
Loaded symbols for /lib64/libnss_files.so.2
Core was generated by `./pg_basebackup -p 5432 -j 2 -D /tmp/test_bkp/bkp -R'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000004099ee in worker_get_files (wstate=0xc1e458) at pg_basebackup.c:3175
3175 backupinfo->curr = fetchfile->next;
Missing separate debuginfos, use: debuginfo-install keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64 libcom_err-1.41.12-24.el6.x86_64 libselinux-2.0.94-7.el6.x86_64 openssl-1.0.1e-58.el6_10.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x00000000004099ee in worker_get_files (wstate=0xc1e458) at pg_basebackup.c:3175
#1  0x0000000000408a9e in worker_run (arg=0xc1e458) at pg_basebackup.c:2715
#2  0x0000003921a07aa1 in start_thread (arg=0x7f72207c0700) at pthread_create.c:301
#3  0x00000039212e8c4d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
(gdb)
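For what it's worth, the faulting line suggests fetchfile was read from a list pointer that was never set up. A hedged sketch of the defensive pattern at the crash site — FileInfo and the lock field are hypothetical names, and (per Asif's later reply) the actual fix was simply to initialize the pointer:

/* Illustrative sketch only -- not the patch's actual code. */
FileInfo   *fetchfile = NULL;	/* previously uninitialized => garbage */

pthread_mutex_lock(&backupinfo->lock);
fetchfile = backupinfo->curr;
if (fetchfile != NULL)			/* guard the empty/consumed list case */
	backupinfo->curr = fetchfile->next;	/* the line that crashed (3175) */
pthread_mutex_unlock(&backupinfo->lock);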

Thanks & Regards,
Rajkumar Raghuwanshi


On Mon, Mar 16, 2020 at 2:14 PM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
Hi Asif,
 
Thanks Rajkumar. I have fixed the above issues and have rebased the patch to the latest master (b7f64c64).
(V9 of the patches are attached).

I have done a further review of the patches and here are a few observations:
 
1.
+/*
+ * stop_backup() - ends an online backup
+ *
+ * The function is called at the end of an online backup. It sends out pg_control
+ * file, optionally WAL segments and ending WAL location.
+ */

The comments seem outdated.

Fixed.
 

2. With parallel jobs, maxrate is now not supported. Since we are now fetching
data in multiple threads, throttling seems important here. Can you please
explain why you have disabled it?

3. As we are always fetching a single file, and as Robert suggested, let's rename
SEND_FILES to SEND_FILE instead.

Yes, we are fetching a single file. However, SEND_FILES is still capable of fetching multiple files in one
go; hence the name.


4. Does this work on Windows? I mean, does pthread_create() work on Windows?
I ask because I see that pgbench has its own implementation of
pthread_create() for WIN32, but this patch doesn't.

The patch has been updated to add support for the Windows platform.
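For reference, a minimal shim of the kind pgbench carries on WIN32 — a sketch under the assumption that the workers are plain pthread_create() calls; the actual shim in the updated patch may differ:

#ifdef WIN32
#include <windows.h>
#include <stdlib.h>
#include <errno.h>

typedef HANDLE pthread_t;

typedef struct win32_thread_arg
{
	void	   *(*routine) (void *);
	void	   *arg;
} win32_thread_arg;

/* adapt the POSIX start routine to the WINAPI calling convention */
static DWORD WINAPI
win32_thread_trampoline(LPVOID param)
{
	win32_thread_arg *ta = (win32_thread_arg *) param;
	void	   *(*routine) (void *) = ta->routine;
	void	   *arg = ta->arg;

	free(ta);
	(void) routine(arg);		/* return value discarded in this sketch */
	return 0;
}

static int
pthread_create(pthread_t *thread, void *attr,
			   void *(*routine) (void *), void *arg)
{
	win32_thread_arg *ta = malloc(sizeof(win32_thread_arg));

	(void) attr;				/* thread attributes are ignored */
	if (ta == NULL)
		return ENOMEM;
	ta->routine = routine;
	ta->arg = arg;
	*thread = CreateThread(NULL, 0, win32_thread_trampoline, ta, 0, NULL);
	return (*thread == NULL) ? (int) GetLastError() : 0;
}
#endif							/* WIN32 */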


5. Typos:
tablspace => tablespace
safly => safely

Done.
 
6. parallel_backup_run() needs some comments explaining the PB_* states it goes
through.

7.
+            case PB_FETCH_REL_FILES:    /* fetch files from server */
+                if (backupinfo->activeworkers == 0)
+                {
+                    backupinfo->backupstate = PB_STOP_BACKUP;
+                    free_filelist(backupinfo);
+                }
+                break;
+            case PB_FETCH_WAL_FILES:    /* fetch WAL files from server */
+                if (backupinfo->activeworkers == 0)
+                {
+                    backupinfo->backupstate = PB_BACKUP_COMPLETE;
+                }
+                break;
Done.
 

Why is free_filelist() not called in the PB_FETCH_WAL_FILES case?
Done.

The corrupted tablespace and the crash reported by Rajkumar have been fixed. A pointer
variable remained uninitialized, which in turn caused the system to misbehave.

Attached is the updated set of patches. AFAIK, to complete the parallel backup feature
set, the following sub-features remain:

1- Parallel backup does not work with a standby server. In parallel backup, the server
spawns multiple processes and there is no shared state being maintained, so there is
currently no way to tell the multiple processes whether the standby was promoted during
the backup after START_BACKUP was called.

2- Throttling. Robert previously suggested that we implement throttling on the client side.
However, I found a previous discussion where it was advocated that it be added to the
backend instead [1].

So, it seemed better to reach a consensus before moving the throttling function to the client.
That's why, for the time being, I have disabled it and have asked for suggestions on how
to move forward.
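For the client-side option, a minimal per-worker sketch — an assumption modelled on the backend's throttle() in basebackup.c rather than code from the patch; with multiple threads, either per-worker rate shares or a mutex around the accounting would be needed:

#include <time.h>
#include <unistd.h>

static long throttle_bytes_per_sec;	/* from --max-rate, 0 = disabled */
static long throttled_bytes;		/* bytes seen since throttle_init() */
static struct timespec window_start;

/* call once when the worker starts copying */
static void
throttle_init(long bytes_per_sec)
{
	throttle_bytes_per_sec = bytes_per_sec;
	throttled_bytes = 0;
	clock_gettime(CLOCK_MONOTONIC, &window_start);
}

/* call after each chunk received; sleeps if we are ahead of the budget */
static void
throttle(long bytes)
{
	struct timespec now;
	double		elapsed,
				expected;

	if (throttle_bytes_per_sec == 0)
		return;

	throttled_bytes += bytes;
	clock_gettime(CLOCK_MONOTONIC, &now);
	elapsed = (now.tv_sec - window_start.tv_sec) +
		(now.tv_nsec - window_start.tv_nsec) / 1e9;
	expected = (double) throttled_bytes / throttle_bytes_per_sec;

	if (expected > elapsed)
		usleep((useconds_t) ((expected - elapsed) * 1e6));
}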

It seems to me that we have to maintain a shared state in order to support taking a backup
from a standby. Also, backup progress reporting in the backend
(pg_stat_progress_basebackup) was recently added via commit e65497df. For parallel backup
to update those stats, a shared state will be required as well.

Since multiple pg_basebackup instances can be running at the same time, maintaining a shared
state can become a little complex, unless we disallow taking multiple parallel backups.
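To make that concrete, a rough sketch of what such a backend-side shared state might look like — an assumption, not the patch; the field names are illustrative:

/* One entry per running parallel backup, kept in shared memory. */
typedef struct ParallelBackupState
{
	slock_t		mutex;			/* protects all fields below */
	uint64		backup_id;		/* lets workers attach to the right backup */
	int			nworkers;		/* workers currently attached */
	bool		promoted;		/* standby promoted since START_BACKUP? */
	uint64		bytes_total;	/* feeds pg_stat_progress_basebackup */
	uint64		bytes_streamed;
} ParallelBackupState;

Each worker would check promoted under the mutex before sending a file and error out if it is set, and bump bytes_streamed for progress reporting.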

So, proceeding with this patch, I will be working on:
- throttling to be implemented on the client-side.
- adding a shared state to handle backup from the standby.



--
Asif Rehman
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca



--
Highgo Software (Canada/China/Pakistan)
URL : http://www.highgo.ca
ADDR: 10318 WHALLEY BLVD, Surrey, BC
EMAIL: ahsan.hadi@highgo.ca


--
Regards
====================================
Kashif Zeeshan
Lead Quality Assurance Engineer / Manager

EnterpriseDB Corporation
The Enterprise Postgres Company
