Re: WIP/PoC for parallel backup - Mailing list pgsql-hackers

From Rajkumar Raghuwanshi
Subject Re: WIP/PoC for parallel backup
Date
Msg-id CAKcux6mBX+Y0wcmtqEQjW9Y7PvTwN0OyjhLiV78Sc8oCgpkrOw@mail.gmail.com
In response to Re: WIP/PoC for parallel backup  (Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com>)
Responses Re: WIP/PoC for parallel backup  (Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com>)
List pgsql-hackers
Hi Asif,

In another scenario, the backup data for a tablespace is corrupted. Again, this is not reproducible every time,
but when I run the same set of commands I get the same error.

[edb@localhost bin]$ ./pg_ctl -D data -l logfile start
waiting for server to start.... done
server started
[edb@localhost bin]$
[edb@localhost bin]$ mkdir /tmp/tblsp
[edb@localhost bin]$ ./psql postgres -p 5432 -c "create tablespace tblsp location '/tmp/tblsp';"
CREATE TABLESPACE
[edb@localhost bin]$ ./psql postgres -p 5432 -c "create database testdb tablespace tblsp;"
CREATE DATABASE
[edb@localhost bin]$ ./psql testdb -p 5432 -c "create table testtbl (a text);"
CREATE TABLE
[edb@localhost bin]$ ./psql testdb -p 5432 -c "insert into testtbl values ('parallel_backup with tablespace');"
INSERT 0 1
[edb@localhost bin]$ ./pg_basebackup -p 5432 -D /tmp/bkp -T /tmp/tblsp=/tmp/tblsp_bkp --jobs 2
[edb@localhost bin]$ ./pg_ctl -D /tmp/bkp -l /tmp/bkp_logs -o "-p 5555" start
waiting for server to start.... done
server started
[edb@localhost bin]$ ./psql postgres -p 5555 -c "select * from pg_tablespace where spcname like 'tblsp%' or spcname = 'pg_default'";
  oid  |  spcname   | spcowner | spcacl | spcoptions
-------+------------+----------+--------+------------
  1663 | pg_default |       10 |        |
 16384 | tblsp      |       10 |        |
(2 rows)

[edb@localhost bin]$ ./psql testdb -p 5555 -c "select * from testtbl";
psql: error: could not connect to server: FATAL:  "pg_tblspc/16384/PG_13_202003051/16385" is not a valid data directory
DETAIL:  File "pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION" is missing.
[edb@localhost bin]$
[edb@localhost bin]$ ls data/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
data/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
[edb@localhost bin]$ ls /tmp/bkp/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION
ls: cannot access /tmp/bkp/pg_tblspc/16384/PG_13_202003051/16385/PG_VERSION: No such file or directory


Thanks & Regards,
Rajkumar Raghuwanshi


On Mon, Mar 16, 2020 at 6:19 PM Rajkumar Raghuwanshi <rajkumar.raghuwanshi@enterprisedb.com> wrote:
Hi Asif,

On testing further, I found that pg_basebackup crashed when taking a backup with -R.
This crash is not consistently reproducible.

[edb@localhost bin]$ ./psql postgres -p 5432 -c "create table test (a text);"
CREATE TABLE
[edb@localhost bin]$ ./psql postgres -p 5432 -c "insert into test values ('parallel_backup with -R recovery-conf');"
INSERT 0 1
[edb@localhost bin]$ ./pg_basebackup -p 5432 -j 2 -D /tmp/test_bkp/bkp -R
Segmentation fault (core dumped)

The stack trace looks the same as in the earlier reported crash with a tablespace.
--stack trace
[edb@localhost bin]$ gdb -q -c core.37915 pg_basebackup
Loaded symbols for /lib64/libnss_files.so.2
Core was generated by `./pg_basebackup -p 5432 -j 2 -D /tmp/test_bkp/bkp -R'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000004099ee in worker_get_files (wstate=0xc1e458) at pg_basebackup.c:3175
3175 backupinfo->curr = fetchfile->next;
Missing separate debuginfos, use: debuginfo-install keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64 libcom_err-1.41.12-24.el6.x86_64 libselinux-2.0.94-7.el6.x86_64 openssl-1.0.1e-58.el6_10.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x00000000004099ee in worker_get_files (wstate=0xc1e458) at pg_basebackup.c:3175
#1  0x0000000000408a9e in worker_run (arg=0xc1e458) at pg_basebackup.c:2715
#2  0x0000003921a07aa1 in start_thread (arg=0x7f72207c0700) at pthread_create.c:301
#3  0x00000039212e8c4d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
(gdb)
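
The faulting line advances a shared cursor (backupinfo->curr) from fetchfile->next inside a worker thread. The trace alone does not show the cause. One possibility, which is only a guess, is that workers read and advance the cursor without synchronization, or that fetchfile is NULL once the list is exhausted. A minimal sketch of a locked, NULL-checked cursor follows; the struct layouts and next_file() are hypothetical and not taken from the patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for the patch's structures. */
typedef struct FileEntry
{
    const char *path;
    struct FileEntry *next;
} FileEntry;

typedef struct BackupInfo
{
    pthread_mutex_t lock;       /* protects curr */
    FileEntry  *curr;           /* next file to hand out, NULL when done */
} BackupInfo;

/*
 * Hand out the next file to a worker, or NULL when the list is exhausted.
 * The cursor is read and advanced under the lock, so two workers cannot
 * claim the same entry, and a NULL cursor is never dereferenced.
 */
static FileEntry *
next_file(BackupInfo *backupinfo)
{
    FileEntry  *fetchfile;

    assert(backupinfo != NULL);

    pthread_mutex_lock(&backupinfo->lock);
    fetchfile = backupinfo->curr;
    if (fetchfile != NULL)
        backupinfo->curr = fetchfile->next;
    pthread_mutex_unlock(&backupinfo->lock);

    return fetchfile;
}
```

A worker loop would then call next_file() until it returns NULL instead of touching backupinfo->curr directly.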

Thanks & Regards,
Rajkumar Raghuwanshi


On Mon, Mar 16, 2020 at 2:14 PM Jeevan Chalke <jeevan.chalke@enterprisedb.com> wrote:
Hi Asif,
 
Thanks Rajkumar. I have fixed the above issues and rebased the patch onto the latest master (b7f64c64).
(V9 of the patches is attached.)

I reviewed the patches further; here are a few observations:
 
1.
+/*
+ * stop_backup() - ends an online backup
+ *
+ * The function is called at the end of an online backup. It sends out pg_control
+ * file, optionally WAL segments and ending WAL location.
+ */

This comment seems outdated.

2. With parallel jobs, maxrate is no longer supported. Since we are now requesting
data from multiple threads, throttling seems important here. Can you please
explain why you have disabled it?
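
For illustration, one way to keep a single maxrate budget across worker threads is a shared byte counter guarded by a mutex. This is only a sketch under assumptions: SharedThrottle, throttle_init() and throttle_delay_us() are hypothetical names, not part of the patch or of the existing throttling code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical shared state: one instance for all workers. */
typedef struct SharedThrottle
{
    pthread_mutex_t lock;
    uint64_t    rate_bytes_per_sec; /* combined limit for all workers */
    uint64_t    bytes_sent;         /* total bytes accounted so far */
} SharedThrottle;

static void
throttle_init(SharedThrottle *t, uint64_t rate_bytes_per_sec)
{
    assert(rate_bytes_per_sec > 0);
    pthread_mutex_init(&t->lock, NULL);
    t->rate_bytes_per_sec = rate_bytes_per_sec;
    t->bytes_sent = 0;
}

/*
 * Account for nbytes transferred by any worker and return how many
 * microseconds the caller should sleep so that the combined rate stays
 * at or below the limit.  elapsed_us is the time since the backup started.
 */
static uint64_t
throttle_delay_us(SharedThrottle *t, uint64_t nbytes, uint64_t elapsed_us)
{
    uint64_t    due_us;

    pthread_mutex_lock(&t->lock);
    t->bytes_sent += nbytes;
    /* Earliest time at which bytes_sent is allowed at the configured rate. */
    due_us = (t->bytes_sent / t->rate_bytes_per_sec) * UINT64_C(1000000)
        + (t->bytes_sent % t->rate_bytes_per_sec) * UINT64_C(1000000)
        / t->rate_bytes_per_sec;
    pthread_mutex_unlock(&t->lock);

    return (due_us > elapsed_us) ? due_us - elapsed_us : 0;
}
```

Each worker would call throttle_delay_us() after receiving a chunk and sleep for the returned interval, so the limit applies to the sum of all connections rather than to each one.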

3. As we are always fetching a single file, and as Robert suggested, let's rename
SEND_FILES to SEND_FILE.

4. Does this work on Windows? I mean, does pthread_create() work on Windows?
I ask because I see that pgbench has its own implementation of
pthread_create() for WIN32, but this patch doesn't.
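
To make the concern concrete, here is a rough sketch of a thin portability wrapper. It is an assumption about what such a shim could look like, not code from the patch or from pgbench: the bk_thread* names and example_worker() are hypothetical, and the Windows branch relies on _beginthreadex() and WaitForSingleObject() from the Win32 API.

```c
#include <assert.h>
#include <stddef.h>

#ifdef WIN32
#include <windows.h>
#include <process.h>
#else
#include <pthread.h>
#endif

typedef void *(*bk_thread_fn) (void *);

typedef struct bk_thread
{
    bk_thread_fn routine;
    void       *arg;
    void       *result;
#ifdef WIN32
    HANDLE      handle;
#else
    pthread_t   handle;
#endif
} bk_thread;

#ifdef WIN32
/* Adapter: _beginthreadex() expects an unsigned __stdcall function. */
static unsigned __stdcall
bk_thread_trampoline(void *p)
{
    bk_thread  *th = (bk_thread *) p;

    th->result = th->routine(th->arg);
    return 0;
}
#endif

/* Start routine(arg) in a new thread; returns 0 on success, -1 on failure. */
static int
bk_thread_create(bk_thread *th, bk_thread_fn routine, void *arg)
{
    assert(th != NULL && routine != NULL);
    th->routine = routine;
    th->arg = arg;
    th->result = NULL;
#ifdef WIN32
    th->handle = (HANDLE) _beginthreadex(NULL, 0, bk_thread_trampoline,
                                         th, 0, NULL);
    return (th->handle == NULL) ? -1 : 0;
#else
    return (pthread_create(&th->handle, NULL, routine, arg) == 0) ? 0 : -1;
#endif
}

/* Wait for the thread and fetch its return value; returns 0 on success. */
static int
bk_thread_join(bk_thread *th, void **result)
{
#ifdef WIN32
    if (WaitForSingleObject(th->handle, INFINITE) != WAIT_OBJECT_0)
        return -1;
    CloseHandle(th->handle);
    if (result != NULL)
        *result = th->result;
    return 0;
#else
    return (pthread_join(th->handle, result) == 0) ? 0 : -1;
#endif
}

/* Example worker used only to exercise the wrapper: doubles an int. */
static void *
example_worker(void *arg)
{
    int        *n = (int *) arg;

    *n *= 2;
    return arg;
}
```

With something along these lines, worker_run() could be started through bk_thread_create() on both platforms.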

5. Typos:
tablspace => tablespace
safly => safely

6. parallel_backup_run() needs some comments explaining the PB_* states it
goes through.

7.
+            case PB_FETCH_REL_FILES:    /* fetch files from server */
+                if (backupinfo->activeworkers == 0)
+                {
+                    backupinfo->backupstate = PB_STOP_BACKUP;
+                    free_filelist(backupinfo);
+                }
+                break;
+            case PB_FETCH_WAL_FILES:    /* fetch WAL files from server */
+                if (backupinfo->activeworkers == 0)
+                {
+                    backupinfo->backupstate = PB_BACKUP_COMPLETE;
+                }
+                break;

Why is free_filelist() not called in the PB_FETCH_WAL_FILES case?
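
As a sketch of what I have in mind for points 6 and 7: the two transitions quoted above, with the states commented and the list freed in both cases. The struct layout, the per-state comments and advance_state() are my assumptions for illustration; only the PB_* names and free_filelist() come from the patch, and this free_filelist() is a stand-in body. Whether the WAL file list actually needs freeing at that point depends on how the patch allocates it.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum
{
    PB_FETCH_REL_FILES,         /* workers are fetching relation files */
    PB_STOP_BACKUP,             /* relation files done; backup can be stopped */
    PB_FETCH_WAL_FILES,         /* workers are fetching WAL files */
    PB_BACKUP_COMPLETE          /* nothing left to do */
} BackupState;

typedef struct FileNode
{
    struct FileNode *next;
} FileNode;

typedef struct BackupInfo
{
    BackupState backupstate;
    int         activeworkers;
    FileNode   *files;          /* assumed heap-allocated singly linked list */
} BackupInfo;

/* Stand-in: release every node and leave the list empty. */
static void
free_filelist(BackupInfo *backupinfo)
{
    while (backupinfo->files != NULL)
    {
        FileNode   *next = backupinfo->files->next;

        free(backupinfo->files);
        backupinfo->files = next;
    }
}

/* Move to the next state once all workers of the current phase are idle. */
static void
advance_state(BackupInfo *backupinfo)
{
    assert(backupinfo != NULL);

    switch (backupinfo->backupstate)
    {
        case PB_FETCH_REL_FILES:
            if (backupinfo->activeworkers == 0)
            {
                backupinfo->backupstate = PB_STOP_BACKUP;
                free_filelist(backupinfo);
            }
            break;
        case PB_FETCH_WAL_FILES:
            if (backupinfo->activeworkers == 0)
            {
                backupinfo->backupstate = PB_BACKUP_COMPLETE;
                free_filelist(backupinfo);  /* assumption: WAL list is freed too */
            }
            break;
        default:
            break;
    }
}
```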

Thanks
--
Jeevan Chalke
Associate Database Architect & Team Lead, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Phone: +91 20 66449694

Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb

