RE: [PATCH] Speedup truncates of relation forks - Mailing list pgsql-hackers

From Jamison, Kirk
Subject RE: [PATCH] Speedup truncates of relation forks
Date
Msg-id D09B13F772D2274BB348A310EE3027C6502672@g01jpexmbkw24
Whole thread Raw
In response to Re: [PATCH] Speedup truncates of relation forks  (Adrien Nayrat <adrien.nayrat@anayrat.info>)
Responses Re: [PATCH] Speedup truncates of relation forks
List pgsql-hackers
On Wednesday, June 26, 2019 6:10 PM(GMT+9), Adrien Nayrat wrote:
> As far as I remember, you should see "relation" wait events (type lock) on
> standby server. This is due to startup process acquiring AccessExclusiveLock
> for the truncation and other backend waiting to acquire a lock to read the
> table.

Hi Adrien, thank you for taking time to reply.

I understand that RelationTruncate() can block read-only queries on
standby during redo. However, it's difficult for me to reproduce the 
test case where I need to catch that wait for relation lock, because
one has to execute SELECT within the few milliseconds of redoing the
truncation of one table.

Instead, I just measured the whole recovery time, smgr_redo(),
to show the recovery improvement compared to head. Please refer below.

[Recovery Test]
I used the same stored functions and configurations in the previous email
& created "test" db.

$ createdb test
$ psql -d test

1. [Primary] Create 10,000 relations.
    test=# SELECT create_tables(10000);

2. [P] Insert one row in each table.
    test=# SELECT insert_tables(10000);

3. [P] Delete row of each table.
    test=# SELECT delfrom_tables(10000);

4. [Standby] WAL application is stopped at Standby server.
    test=# SELECT pg_wal_replay_pause();

5. [P] VACUUM is executed at Primary side, and measure its execution time.        
    test=# \timing on
    test=# VACUUM;

    Alternatively, you may use:
    $ time psql -d test -c 'VACUUM;'
    (Note: WAL has not replayed on standby because it's been paused.)

6. [P] Wait until VACUUM has finished execution. Then, stop primary server. 
    test=# pg_ctl stop -w

7. [S] Resume WAL replay, then promote standby (failover).
I used a shell script to execute recovery & promote standby server
because it's kinda difficult to measure recovery time. Please refer to the script below.
- "SELECT pg_wal_replay_resume();" is executed and the WAL application is resumed.
- "pg_ctl promote" to promote standby.
- The time difference of "select pg_is_in_recovery();" from "t" to "f" is measured.

shell script:

PGDT=/path_to_storage_directory/

if [ "$1" = "resume" ]; then
    psql -c "SELECT pg_wal_replay_resume();" test
    date +%Y/%m/%d_%H:%M:%S.%3N
    pg_ctl promote -D ${PGDT}
    set +x
    date +%Y/%m/%d_%H:%M:%S.%3N
    while [ 1 ]
    do
        RS=`psql -Atc "select pg_is_in_recovery();" test`        
        if [ ${RS} = "f" ]; then
            break
        fi
    done
    date +%Y/%m/%d_%H:%M:%S.%3N
    set -x
    exit 0
fi


[Test Results]
shared_buffers = 24GB

1. HEAD
(wal replay resumed)
2019/07/01_08:48:50.326
server promoted
2019/07/01_08:49:50.482
2019/07/01_09:02:41.051

 Recovery Time:
 13 min 50.725 s -> Time difference from WAL replay to complete recovery
 12 min 50.569 s -> Time difference of "select pg_is_in_recovery();" from "t" to "f"

2. PATCH
(wal replay resumed)
2019/07/01_07:34:26.766
server promoted
2019/07/01_07:34:57.790
2019/07/01_07:34:57.809

 Recovery Time:    
 31.043 s -> Time difference from WAL replay to complete recovery
 00.019 s -> Time difference of "select pg_is_in_recovery();" from "t" to "f"
 
[Conclusion]
The recovery time significantly improved compared to head
from 13 minutes to 30 seconds.

Any thoughts?
I'd really appreciate your comments/feedback about the patch and/or test.


Regards,
Kirk Jamison

pgsql-hackers by date:

Previous
From: Thomas Munro
Date:
Subject: Re: Optimization of some jsonb functions
Next
From: Dean Rasheed
Date:
Subject: Re: Choosing values for multivariate MCV lists