Re: WIP/PoC for parallel backup - Mailing list pgsql-hackers

From Rajkumar Raghuwanshi
Subject Re: WIP/PoC for parallel backup
Date
Msg-id CAKcux6=Wu91dyXWALOzQ7NGX1fkgWHPjZjxZEsFJfOKvrc8pBw@mail.gmail.com
In response to Re: WIP/PoC for parallel backup  (Asif Rehman <asifr.rehman@gmail.com>)
Responses Re: WIP/PoC for parallel backup  (Ahsan Hadi <ahsan.hadi@gmail.com>)
List pgsql-hackers
Thanks Asif,

I have re-verified the reported issues. Except for the standby backup, the others are fixed.

Thanks & Regards,
Rajkumar Raghuwanshi


On Fri, Mar 27, 2020 at 11:04 PM Asif Rehman <asifr.rehman@gmail.com> wrote:


On Wed, Mar 25, 2020 at 12:22 PM Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote:
Hi Asif,

While testing further, I observed that parallel backup is not able to take a backup of a standby server.

mkdir /tmp/archive_dir
echo "archive_mode='on'">> data/postgresql.conf
echo "archive_command='cp %p /tmp/archive_dir/%f'">> data/postgresql.conf

./pg_ctl -D data -l logs start
./pg_basebackup -p 5432 -Fp -R -D /tmp/slave

echo "primary_conninfo='host=127.0.0.1 port=5432 user=edb'">> /tmp/slave/postgresql.conf
echo "restore_command='cp /tmp/archive_dir/%f %p'">> /tmp/slave/postgresql.conf
echo "promote_trigger_file='/tmp/failover.log'">> /tmp/slave/postgresql.conf

./pg_ctl -D /tmp/slave -l /tmp/slave_logs -o "-p 5433" start -c

[edb@localhost bin]$ ./psql postgres -p 5432 -c "select pg_is_in_recovery();"
 pg_is_in_recovery
-------------------
 f
(1 row)

[edb@localhost bin]$ ./psql postgres -p 5433 -c "select pg_is_in_recovery();"
 pg_is_in_recovery
-------------------
 t
(1 row)

[edb@localhost bin]$ ./pg_basebackup -p 5433 -D /tmp/bkp_s --jobs 6
pg_basebackup: error: could not list backup files: ERROR:  the standby was promoted during online backup
HINT:  This means that the backup being taken is corrupt and should not be used. Try taking another online backup.
pg_basebackup: removing data directory "/tmp/bkp_s"


# The same works fine without parallel backup
[edb@localhost bin]$ ./pg_basebackup -p 5433 -D /tmp/bkp_s --jobs 1
[edb@localhost bin]$ ls /tmp/bkp_s/PG_VERSION
/tmp/bkp_s/PG_VERSION

Thanks & Regards,
Rajkumar Raghuwanshi


On Thu, Mar 19, 2020 at 4:11 PM Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote:
Hi Asif,

In another scenario, the backup data for a tablespace is corrupted. Again, this is not reproducible every time,
but when I run the same set of commands I get the same error.

[edb@localhost bin]$ ./pg_ctl -D data -l logfile start
waiting for server to start.... done
server started
[edb@localhost bin]$
[edb@localhost bin]$ mkdir /tmp/tblsp
[edb@localhost bin]$ ./psql postgres -p 5432 -c "create tablespace tblsp location '/tmp/tblsp';"
CREATE TABLESPACE
[edb@localhost bin]$ ./psql postgres -p 5432 -c "create database testdb tablespace tblsp;"
CREATE DATABASE
[edb@localhost bin]$ ./psql testdb -p 5432 -c "create table testtbl (a text);"
CREATE TABLE
[edb@localhost bin]$ ./psql testdb -p 5432 -c "insert into testtbl values ('parallel_backup with tablespace');"
INSERT 0 1
[edb@localhost bin]$ ./pg_basebackup -p 5432 -D /tmp/bkp -T /tmp/tblsp=/tmp/tblsp_bkp --jobs 2
[edb@localhost bin]$ ./pg_ctl -D /tmp/bkp -l /tmp/bkp_logs -o "-p 5555" start
waiting for server to start.... done
server started
[edb@localhost bin]$ ./psql postgres -p 5555 -c "select * from pg_tablespace where spcname like 'tblsp%' or spcname = 'pg_default'";
  oid  |  spcname   | spcowner | spcacl | spcoptions
-------+------------+----------+--------+------------
  1663 | pg_default |       10 |        |
 16384 | tblsp      |       10 |        |
(2 rows)

[edb@localhost bin]$ ./psql testdb -p 5555 -c "select * from testtbl";
psql: error: could not connect to server: FATAL:  "pg_tblspc/16384/PG_13_202003051/16385" is not a valid data directory
DETAIL:  File "pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION" is missing.
[edb@localhost bin]$
[edb@localhost bin]$ ls data/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
data/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
[edb@localhost bin]$ ls /tmp/bkp/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
ls: cannot access /tmp/bkp/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION: No such file or directory


Thanks & Regards,
Rajkumar Raghuwanshi


On Mon, Mar 16, 2020 at 6:19 PM Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote:
Hi Asif,

On testing further, I found that pg_basebackup crashed when taking a backup with -R.
This crash is not consistently reproducible.

[edb@localhost bin]$ ./psql postgres -p 5432 -c "create table test (a text);"
CREATE TABLE
[edb@localhost bin]$ ./psql postgres -p 5432 -c "insert into test values ('parallel_backup with -R recovery-conf');"
INSERT 0 1
[edb@localhost bin]$ ./pg_basebackup -p 5432 -j 2 -D /tmp/test_bkp/bkp -R
Segmentation fault (core dumped)

The stack trace looks the same as in the earlier reported crash with a tablespace.
--stack trace
[edb@localhost bin]$ gdb -q -c core.37915 pg_basebackup
Loaded symbols for /lib64/libnss_files.so.2
Core was generated by `./pg_basebackup -p 5432 -j 2 -D /tmp/test_bkp/bkp -R'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000004099ee in worker_get_files (wstate=0xc1e458) at pg_basebackup.c:3175
3175 backupinfo->curr = fetchfile->next;
Missing separate debuginfos, use: debuginfo-install keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64 libcom_err-1.41.12-24.el6.x86_64 libselinux-2.0.94-7.el6.x86_64 openssl-1.0.1e-58.el6_10.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x00000000004099ee in worker_get_files (wstate=0xc1e458) at pg_basebackup.c:3175
#1  0x0000000000408a9e in worker_run (arg=0xc1e458) at pg_basebackup.c:2715
#2  0x0000003921a07aa1 in start_thread (arg=0x7f72207c0700) at pthread_create.c:301
#3  0x00000039212e8c4d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
(gdb)

Thanks & Regards,
Rajkumar Raghuwanshi


On Mon, Mar 16, 2020 at 2:14 PM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
Hi Asif,
 
Thanks Rajkumar. I have fixed the above issues and have rebased the patch to the latest master (b7f64c64).
(V9 of the patches are attached).

I had a further review of the patches and here are my few observations:
 
1.
+/*
+ * stop_backup() - ends an online backup
+ *
+ * The function is called at the end of an online backup. It sends out pg_control
+ * file, optionally WAL segments and ending WAL location.
+ */

Comments seem out-dated.

Fixed.
 

2. With parallel jobs, maxrate is no longer supported. Since we are now requesting
data in multiple threads, throttling seems important here. Can you please
explain why you have disabled it?

3. As we are always fetching a single file and as Robert suggested, let's rename
SEND_FILES to SEND_FILE instead.

Yes, we are fetching a single file. However, SEND_FILES is still capable of fetching multiple files in one
go, that's why the name.


4. Does this work on Windows? I mean does pthread_create() work on Windows?
I asked this as I see that pgbench has its own implementation for
pthread_create() for WIN32 but this patch doesn't.

The patch is updated to add support for the Windows platform.


5. Typos:
tablspace => tablespace
safly => safely

Done.
 
6. parallel_backup_run() needs some comments explaining the PB_* states it goes
through.

7.
+            case PB_FETCH_REL_FILES:    /* fetch files from server */
+                if (backupinfo->activeworkers == 0)
+                {
+                    backupinfo->backupstate = PB_STOP_BACKUP;
+                    free_filelist(backupinfo);
+                }
+                break;
+            case PB_FETCH_WAL_FILES:    /* fetch WAL files from server */
+                if (backupinfo->activeworkers == 0)
+                {
+                    backupinfo->backupstate = PB_BACKUP_COMPLETE;
+                }
+                break;
Done.
 

Why is free_filelist() not called in the PB_FETCH_WAL_FILES case?
Done.

The corrupted tablespace and crash, reported by Rajkumar, have been fixed. A pointer
variable remained uninitialized which in turn caused the system to misbehave.
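
To illustrate the class of bug described above (this is not the patch's code), here is a minimal sketch of a shared file list that worker threads drain one entry at a time. The names FileNode, FileList, filelist_init, filelist_append and filelist_take_next are hypothetical. The sketch assumes every node is zero-initialized on allocation, so a worker that reads node->next never dereferences garbage, and that the cursor is only advanced under a mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical types; the patch's real structures are not shown in this thread. */
typedef struct FileNode
{
    char             path[1024];
    struct FileNode *next;
} FileNode;

typedef struct FileList
{
    FileNode        *head;
    FileNode        *tail;
    FileNode        *curr;      /* next file to hand to a worker */
    pthread_mutex_t  lock;
} FileList;

/* Start with every pointer explicitly NULL. */
static void
filelist_init(FileList *list)
{
    assert(list != NULL);
    list->head = list->tail = list->curr = NULL;
    pthread_mutex_init(&list->lock, NULL);
}

/* Append a path. calloc zeroes the node, so node->next starts out NULL. */
static int
filelist_append(FileList *list, const char *path)
{
    FileNode *node = calloc(1, sizeof(FileNode));

    if (node == NULL)
        return -1;
    strncpy(node->path, path, sizeof(node->path) - 1);

    pthread_mutex_lock(&list->lock);
    if (list->tail != NULL)
        list->tail->next = node;
    else
        list->head = node;
    list->tail = node;
    if (list->curr == NULL)     /* list was empty or fully consumed */
        list->curr = node;
    pthread_mutex_unlock(&list->lock);
    return 0;
}

/* Hand out the next file, or NULL once the list is exhausted. */
static FileNode *
filelist_take_next(FileList *list)
{
    FileNode *node;

    pthread_mutex_lock(&list->lock);
    node = list->curr;
    if (node != NULL)
        list->curr = node->next;
    pthread_mutex_unlock(&list->lock);
    return node;
}
```

Each worker would call filelist_take_next() in a loop and stop when it returns NULL. Checking the node before reading node->next is what keeps a statement like the one at the crash site safe.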

Attached is the updated set of patches. AFAIK, to complete parallel backup feature
set, there remain three sub-features:

1- Parallel backup does not work with a standby server. In parallel backup, the server
spawns multiple processes and no shared state is maintained between them. So currently
there is no way to tell these processes whether the standby was promoted during the
backup, after START_BACKUP was called.

2- throttling. Robert previously suggested that we implement throttling on the client-side.
However, I found a previous discussion where it was advocated to be added to the
backend instead[1].

So, it was better to have a consensus before moving the throttle function to the client.
That’s why for the time being I have disabled it and have asked for suggestions on it
to move forward.

It seems to me that we have to maintain a shared state in order to support taking backup
from standby. Also, there is a new feature recently committed for backup progress
reporting in the backend (pg_stat_progress_basebackup). This functionality was recently
added via this commit ID: e65497df. For parallel backup to update these stats, a shared
state will be required.

Since multiple pg_basebackup can be running at the same time, maintaining a shared state
can become a little complex, unless we disallow taking multiple parallel backups.

So proceeding on with this patch, I will be working on:
- throttling to be implemented on the client-side.
- adding a shared state to handle backup from the standby.
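
For the client-side throttling item, one possible approach is to compute after each chunk how long the client should sleep so that the average transfer rate stays at or below the limit. The following is only a sketch under assumptions: throttle_delay_us is a hypothetical helper, not part of the patch, and with several workers bytes_sent would have to be the total across all of them (for example, a counter updated under a mutex).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return how many microseconds to sleep so that bytes_sent, spread over the
 * resulting elapsed time, does not exceed max_rate (bytes per second).
 * Returns 0 when the transfer is already at or below the limit.
 * Hypothetical helper for illustration only.
 */
static int64_t
throttle_delay_us(int64_t bytes_sent, int64_t elapsed_us, int64_t max_rate)
{
    int64_t target_us;

    assert(max_rate > 0);
    assert(bytes_sent >= 0 && elapsed_us >= 0);

    /*
     * Time the transfer should have taken at exactly max_rate. Whole
     * seconds and the remainder are converted separately to avoid
     * overflowing on large byte counts.
     */
    target_us = (bytes_sent / max_rate) * INT64_C(1000000)
        + ((bytes_sent % max_rate) * INT64_C(1000000)) / max_rate;

    return (target_us > elapsed_us) ? target_us - elapsed_us : 0;
}
```

A worker would call this after receiving each chunk and sleep for the returned duration. Because the calculation uses totals since the start of the backup, a short burst is compensated by a longer pause later rather than being capped per chunk.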



--
Asif Rehman
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
