[Patch] Optimize dropping of relation buffers using dlist - Mailing list pgsql-hackers

From k.jamison@fujitsu.com
Subject [Patch] Optimize dropping of relation buffers using dlist
Date
Msg-id OSBPR01MB3207DCA7EC725FDD661B3EDAEF660@OSBPR01MB3207.jpnprd01.prod.outlook.com
Whole thread Raw
Responses RE: [Patch] Optimize dropping of relation buffers using dlist
List pgsql-hackers

Hi,

 

Currently, we need to scan the WHOLE shared buffers when VACUUM

truncated off any empty pages at end of transaction or when relation

is TRUNCATEd.

As for our customer case, we periodically truncate thousands of tables,

and it's possible to TRUNCATE single table per transaction. This can be

problematic later on during recovery which could take longer, especially

when a sudden failover happens after those TRUNCATEs and when we

have to scan a large-sized shared buffer. In the performance test below,

it took almost 12.5 minutes for recovery to complete for 100GB shared

buffers. But we want to keep failover very short (within 10 seconds).

 

Previously, I made an improvement in speeding the truncates of relation

forks from 3 scans to one scan.[1] This time, the aim of this patch is

to further speedup the invalidation of pages, by linking the cached pages

of the target relation in a doubly-linked list and just traversing it

instead of scanning the whole shared buffers. In DropRelFileNodeBuffers,

we just get the number of target buffers to invalidate for the relation.

There is a significant win in this patch, because we were able to

complete failover and recover in 3 seconds more or less.

 

I performed similar tests to what I did in the speedup truncates of

relations forks.[1][2] However, this time using 100GB shared_buffers.

 

[Machine spec used in testing]

Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz

CPU: 16, Number of cores per socket: 8

RHEL6.5, Memory: 256GB++

 

[Test]

1. (Master) Create table (ex. 10,000 tables). Insert data to tables.

2. (Master) DELETE FROM TABLE (ex. all rows of 10,000 tables)

(Standby) To test with failover, pause the WAL replay on standby server.

(SELECT pg_wal_replay_pause();)

3. (M) psql -c "\timing on" (measures total execution of SQL queries)

4. (M) VACUUM (whole db)

5. (M) Stop primary server. pg_ctl stop -D $PGDATA -w

6. (S) Resume wal replay and promote standby.[2]

 

[Results]

 

A. HEAD (origin/master branch)

A1. Vacuum execution on Primary server

    Time: 730932.408 ms (12:10.932) ~12min 11s

A2. Vacuum + Failover (WAL Recovery on Standby)

    waiting for server to promote...........................

    .................................... stopped waiting

    pg_ctl: server did not promote in time

    2019/10/25_12:13:09.692─┐

    2019/10/25_12:25:43.576─┘

    -->Total: 12min34s

 

B. PATCH

B1. Vacuum execution on Primary/Master

    Time: 6.518333s = 6518.333 ms

B2. Vacuum + Failover (WAL Recovery on Standby)

    2019/10/25_14:17:21.822

    waiting for server to promote...... done

    server promoted

    2019/10/25_14:17:24.827

    2019/10/25_14:17:24.833

    -->Total: 3.011s

 

[Other Notes]

Maybe one disadvantage is that we can have a variable number of

relations, and allocated the same number of relation structures as

the size of shared buffers. I tried to reduce the use of memory when

doing hash table lookup operation by having a fixed size array (100)

or threshold of target buffers to invalidate.

When doing CachedBufLookup() to scan the count of each buffer in the

dlist, I made sure to reduce the number of scans (2x at most).

First, we scan the dlist of cached buffers of relations.

Then store the target buffers in buf_id_array. Non-target buffers

would be removed from dlist but added to temporary dlist.

After reaching end of main dlist, we append the temporary dlist to

tail of main dlist.

I also performed pgbench buffer test, and this patch did not cause

overhead to normal DB access performance.

 

Another one that I'd need feedback of is the use of new dlist operations

for this cached buffer list. I did not use in this patch the existing

Postgres dlist architecture (ilist.h) because I want to save memory space

as much as possible especially when NBuffers become large. Both dlist_node

& dlist_head are 16 bytes. OTOH, two int pointers for this patch is 8 bytes.

 

Hope to hear your feedback and comments.

 

Thanks in advance,

Kirk Jamison

 

[1] https://www.postgresql.org/message-id/flat/D09B13F772D2274BB348A310EE3027C64E2067%40g01jpexmbkw24

[2] https://www.postgresql.org/message-id/D09B13F772D2274BB348A310EE3027C6502672%40g01jpexmbkw24

Attachment

pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: update ALTER TABLE with ATTACH PARTITION lock mode
Next
From: Amit Langote
Date:
Subject: Re: [PATCH] Do not use StdRdOptions in Access Methods