Thread: [Patch] Optimize dropping of relation buffers using dlist
Hi,
Currently, we need to scan the WHOLE of shared buffers when VACUUM
truncates off any empty pages at the end of a relation, or when a relation
is TRUNCATEd.
In our customer's case, thousands of tables are truncated periodically,
possibly with a single table TRUNCATEd per transaction. This becomes
problematic later on during recovery, which can take much longer,
especially when a sudden failover happens after those TRUNCATEs and a
large shared buffer pool has to be scanned. In the performance test below,
recovery took almost 12.5 minutes to complete with 100GB shared buffers,
but we want to keep failover very short (within 10 seconds).
Previously, I made an improvement to speed up the truncation of relation
forks, going from three buffer scans to one. [1] This time, the aim of this
patch is to further speed up the invalidation of pages, by linking the
cached pages of the target relation into a doubly linked list and just
traversing it instead of scanning the whole of shared buffers. In
DropRelFileNodeBuffers, we then only need to visit the target buffers to
invalidate for the relation.
The win from this patch is significant: we were able to complete failover
and recovery in roughly 3 seconds.
I performed tests similar to those I did for the speedup of truncating
relation forks [1][2], however this time using 100GB shared_buffers.
[Machine spec used in testing]
Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
CPU: 16, Number of cores per socket: 8
RHEL6.5, Memory: 256GB++
[Test]
1. (Master) Create table (ex. 10,000 tables). Insert data to tables.
2. (Master) DELETE FROM TABLE (ex. all rows of 10,000 tables)
(Standby) To test with failover, pause the WAL replay on standby server.
(SELECT pg_wal_replay_pause();)
3. (M) psql -c "\timing on" (measures total execution of SQL queries)
4. (M) VACUUM (whole db)
5. (M) Stop primary server. pg_ctl stop -D $PGDATA -w
6. (S) Resume WAL replay and promote the standby. [2]
[Results]
A. HEAD (origin/master branch)
A1. Vacuum execution on Primary server
Time: 730932.408 ms (12:10.932) ~12min 11s
A2. Vacuum + Failover (WAL Recovery on Standby)
waiting for server to promote...........................
.................................... stopped waiting
pg_ctl: server did not promote in time
2019/10/25_12:13:09.692 (start)
2019/10/25_12:25:43.576 (end)
--> Total: 12 min 34 s
B. PATCH
B1. Vacuum execution on Primary/Master
Time: 6518.333 ms (6.518 s)
B2. Vacuum + Failover (WAL Recovery on Standby)
2019/10/25_14:17:21.822
waiting for server to promote...... done
server promoted
2019/10/25_14:17:24.827
2019/10/25_14:17:24.833
-->Total: 3.011s
[Other Notes]
Maybe one disadvantage is that the number of relations is variable, yet we
allocate as many relation structures as there are shared buffers. I tried
to reduce the memory used by the hash table lookup operation by using a
fixed-size array (100 entries) as a threshold on the number of target
buffers to invalidate per pass.
When doing CachedBufLookup() to scan the buffers in the dlist, I made sure
to keep the number of scans low (2x at most).
First, we scan the dlist of cached buffers of the relation and store the
target buffers in buf_id_array. Non-target buffers are removed from the
dlist but added to a temporary dlist. After reaching the end of the main
dlist, we append the temporary dlist to the tail of the main dlist.
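For illustration, here is a minimal, self-contained sketch of that
collect-and-requeue traversal (all names are illustrative rather than the
patch's actual code, and a single next link is used for brevity where the
patch threads prev/next through buffer IDs):

    #include <stdio.h>

    #define NBUFFERS          8
    #define END_OF_LIST       (-1)
    #define BUF_ID_ARRAY_SIZE 100

    /* Links threaded through arrays, indexed by buffer id. */
    static int next_buf[NBUFFERS];
    static int rel_of_buf[NBUFFERS];   /* which relation owns each buffer */

    /*
     * Illustrative sketch, not the patch's code: collect buffers of
     * 'target_rel' from the list starting at *head.  Matching buffers go
     * to buf_id_array; the rest are spliced into a temporary list that is
     * re-appended, so the list is walked once and rebuilt in the same pass.
     */
    static int
    cached_buf_lookup(int *head, int target_rel, int *buf_id_array, int size)
    {
        int     nbufs = 0;
        int     keep_head = END_OF_LIST,
                keep_tail = END_OF_LIST;
        int     cur = *head;

        while (cur != END_OF_LIST && nbufs < size)
        {
            int     nxt = next_buf[cur];

            if (rel_of_buf[cur] == target_rel)
                buf_id_array[nbufs++] = cur;    /* target: report it */
            else
            {
                /* non-target: move to the temporary list */
                if (keep_head == END_OF_LIST)
                    keep_head = cur;
                else
                    next_buf[keep_tail] = cur;
                keep_tail = cur;
                next_buf[cur] = END_OF_LIST;
            }
            cur = nxt;
        }

        /* re-append kept buffers in front of whatever was not visited */
        if (keep_tail != END_OF_LIST)
        {
            next_buf[keep_tail] = cur;
            *head = keep_head;
        }
        else
            *head = cur;

        return nbufs;
    }

    int
    main(void)
    {
        int     head = 0, buf_id_array[BUF_ID_ARRAY_SIZE], i, n;

        for (i = 0; i < NBUFFERS; i++)
        {
            next_buf[i] = (i + 1 < NBUFFERS) ? i + 1 : END_OF_LIST;
            rel_of_buf[i] = i % 2;              /* buffers of rels 0 and 1 */
        }
        n = cached_buf_lookup(&head, 1, buf_id_array, BUF_ID_ARRAY_SIZE);
        for (i = 0; i < n; i++)
            printf("drop buffer %d\n", buf_id_array[i]);
        return 0;
    }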
I also performed a pgbench buffer test, and this patch did not cause any
overhead to normal DB access performance.
Another one that I'd need feedback on is the use of new dlist operations
for this cached buffer list. I did not use the existing Postgres dlist
architecture (ilist.h) in this patch because I want to save as much memory
space as possible, especially when NBuffers becomes large. Both dlist_node
& dlist_head are 16 bytes. OTOH, the two int indexes for this patch are 8 bytes.
Hope to hear your feedback and comments.
Thanks in advance,
Kirk Jamison
[1] https://www.postgresql.org/message-id/flat/D09B13F772D2274BB348A310EE3027C64E2067%40g01jpexmbkw24
[2] https://www.postgresql.org/message-id/D09B13F772D2274BB348A310EE3027C6502672%40g01jpexmbkw24
Hi,
> Another one that I'd need feedback on is the use of new dlist operations
> for this cached buffer list. I did not use the existing Postgres dlist
> architecture (ilist.h) in this patch because I want to save as much memory
> space as possible, especially when NBuffers becomes large. Both dlist_node
> & dlist_head are 16 bytes. OTOH, the two int indexes for this patch are 8 bytes.
In cb_dlist_combine(), the code block below can impact performance
especially for cases when the doubly linked list is long (IOW, many cached buffers).
    /* Point to the tail of main dlist */
    while (curr_main->next != CACHEDBLOCK_END_OF_LIST)
        curr_main = cb_dlist_next(curr_main);
Attached is an improved version of the previous patch, which adds TAIL
pointer information in order to speed up the abovementioned operation.
I stored the tail field in the prev pointer of the head entry (maybe not a
typical approach). A more typical one would be adding a tail field (int
tail) to CachedBufferEnt, but I didn't do that because, as I mentioned in
the previous email, I want to avoid using more memory as much as possible.
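A self-contained sketch of that convention (illustrative code, not the
patch itself): the head entry's prev link holds the tail's index, so
concatenating two lists needs no tail walk at all:

    #include <stdio.h>

    #define END_OF_LIST (-1)

    typedef struct list_ent
    {
        int     prev;
        int     next;
    } list_ent;

    /*
     * Illustrative sketch of the convention described above:
     * entries[head].prev stores the index of the current tail, so the
     * tail is reachable in O(1) and no walk over the list is needed.
     */
    static void
    combine(list_ent *entries, int main_head, int add_head)
    {
        int     main_tail = entries[main_head].prev;    /* O(1) tail lookup */
        int     add_tail = entries[add_head].prev;

        entries[main_tail].next = add_head;             /* splice the lists */
        entries[add_head].prev = main_tail;
        entries[main_head].prev = add_tail;             /* remember new tail */
    }

    int
    main(void)
    {
        list_ent    e[4];
        int         i;

        /* list A: 0 -> 1, list B: 2 -> 3 (prev of each head = its tail) */
        e[0].prev = 1; e[0].next = 1; e[1].prev = 0; e[1].next = END_OF_LIST;
        e[2].prev = 3; e[2].next = 3; e[3].prev = 2; e[3].next = END_OF_LIST;

        combine(e, 0, 2);
        for (i = 0; i != END_OF_LIST; i = e[i].next)
            printf("%d ", i);                           /* prints: 0 1 2 3 */
        printf("\n");
        return 0;
    }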
The patch worked as intended and passed the tests.
Any thoughts?
Regards,
Kirk Jamison
Hi Kirk,

On Tue, Nov 05, 2019 at 09:58:22AM +0000, k.jamison@fujitsu.com wrote:
>Hi,
>
>> Another one that I'd need feedback on is the use of new dlist operations
>> for this cached buffer list. I did not use the existing Postgres dlist
>> architecture (ilist.h) in this patch because I want to save as much memory
>> space as possible, especially when NBuffers becomes large. Both dlist_node
>> & dlist_head are 16 bytes. OTOH, the two int indexes for this patch are 8 bytes.
>
>In cb_dlist_combine(), the code block below can impact performance
>especially for cases when the doubly linked list is long (IOW, many cached buffers).
>    /* Point to the tail of main dlist */
>    while (curr_main->next != CACHEDBLOCK_END_OF_LIST)
>        curr_main = cb_dlist_next(curr_main);
>
>Attached is an improved version of the previous patch, which adds TAIL
>pointer information in order to speed up the abovementioned operation.
>I stored the tail field in the prev pointer of the head entry (maybe not a
>typical approach). A more typical one would be adding a tail field (int
>tail) to CachedBufferEnt, but I didn't do that because, as I mentioned in
>the previous email, I want to avoid using more memory as much as possible.
>The patch worked as intended and passed the tests.
>
>Any thoughts?

A couple of comments based on briefly looking at the patch.

1) I don't think you should / need to expose most of the new stuff in
buf_internals.h. It's only used from buf_internals.c and having all the
various cb_dlist_* functions in .h seems strange.

2) This adds another hashtable maintenance to BufferAlloc etc. but
you've only done tests / benchmark for the case this optimizes. I
think we need to see a benchmark for a workload that allocates and
invalidates a lot of buffers. A pgbench with a workload that fits into
RAM but not into shared buffers would be interesting.

3) I see this triggered a failure on cputube, in the commit_ts TAP test.
That's a bit strange, someone should investigate I guess.

https://travis-ci.org/postgresql-cfbot/postgresql/builds/607563900

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Nov 5, 2019 at 10:34 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> 2) This adds another hashtable maintenance to BufferAlloc etc. but
> you've only done tests / benchmark for the case this optimizes. I
> think we need to see a benchmark for a workload that allocates and
> invalidates a lot of buffers. A pgbench with a workload that fits into
> RAM but not into shared buffers would be interesting.

Yeah, it seems pretty hard to believe that this won't be bad for some
workloads. Not only do you have the overhead of the hash table
operations, but you also have locking overhead around that. A whole new
set of LWLocks where you have to take and release one of them every time
you allocate or invalidate a buffer seems likely to cause a pretty
substantial contention problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thurs, November 7, 2019 1:27 AM (GMT+9), Robert Haas wrote:
> On Tue, Nov 5, 2019 at 10:34 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
> wrote:
> > 2) This adds another hashtable maintenance to BufferAlloc etc. but
> > you've only done tests / benchmark for the case this optimizes. I
> > think we need to see a benchmark for a workload that allocates and
> > invalidates a lot of buffers. A pgbench with a workload that fits into
> > RAM but not into shared buffers would be interesting.
>
> Yeah, it seems pretty hard to believe that this won't be bad for some
> workloads. Not only do you have the overhead of the hash table
> operations, but you also have locking overhead around that. A whole new
> set of LWLocks where you have to take and release one of them every time
> you allocate or invalidate a buffer seems likely to cause a pretty
> substantial contention problem.

I'm sorry for the late reply. Thank you Tomas and Robert for checking this patch.
Attached is the v3 of the patch.
- I moved the unnecessary items from buf_internals.h to cached_buf.c since
  most of those items are only used in that file.
- Fixed the bug of v2. Seems to pass both RT and TAP test now.

Thanks for the advice on the benchmark test. Please refer below for the test
and results.

[Machine spec]
CPU: 16, Number of cores per socket: 8
RHEL6.5, Memory: 240GB

scale: 3125 (about 46GB DB size)
shared_buffers = 8GB

[workload that fits into RAM but not into shared buffers]
pgbench -i -s 3125 cachetest
pgbench -c 16 -j 8 -T 600 cachetest

[Patched]
scaling factor: 3125
query mode: simple
number of clients: 16
number of threads: 8
duration: 600 s
number of transactions actually processed: 8815123
latency average = 1.089 ms
tps = 14691.436343 (including connections establishing)
tps = 14691.482714 (excluding connections establishing)

[Master/Unpatched]
...
number of transactions actually processed: 8852327
latency average = 1.084 ms
tps = 14753.814648 (including connections establishing)
tps = 14753.861589 (excluding connections establishing)

My patch caused a little overhead of about 0.42-0.46%, which I think is small.
Kindly let me know your opinions/comments about the patch or tests, etc.

Thanks,
Kirk Jamison
On Tue, Nov 12, 2019 at 10:49:49AM +0000, k.jamison@fujitsu.com wrote:
>On Thurs, November 7, 2019 1:27 AM (GMT+9), Robert Haas wrote:
>> On Tue, Nov 5, 2019 at 10:34 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
>> wrote:
>> > 2) This adds another hashtable maintenance to BufferAlloc etc. but
>> > you've only done tests / benchmark for the case this optimizes. I
>> > think we need to see a benchmark for a workload that allocates and
>> > invalidates a lot of buffers. A pgbench with a workload that fits into
>> > RAM but not into shared buffers would be interesting.
>>
>> Yeah, it seems pretty hard to believe that this won't be bad for some
>> workloads. Not only do you have the overhead of the hash table
>> operations, but you also have locking overhead around that. A whole new
>> set of LWLocks where you have to take and release one of them every time
>> you allocate or invalidate a buffer seems likely to cause a pretty
>> substantial contention problem.
>
>I'm sorry for the late reply. Thank you Tomas and Robert for checking this patch.
>Attached is the v3 of the patch.
>- I moved the unnecessary items from buf_internals.h to cached_buf.c since
>  most of those items are only used in that file.
>- Fixed the bug of v2. Seems to pass both RT and TAP test now.
>
>Thanks for the advice on the benchmark test. Please refer below for the test
>and results.
>
>[Machine spec]
>CPU: 16, Number of cores per socket: 8
>RHEL6.5, Memory: 240GB
>
>scale: 3125 (about 46GB DB size)
>shared_buffers = 8GB
>
>[workload that fits into RAM but not into shared buffers]
>pgbench -i -s 3125 cachetest
>pgbench -c 16 -j 8 -T 600 cachetest
>
>[Patched]
>scaling factor: 3125
>query mode: simple
>number of clients: 16
>number of threads: 8
>duration: 600 s
>number of transactions actually processed: 8815123
>latency average = 1.089 ms
>tps = 14691.436343 (including connections establishing)
>tps = 14691.482714 (excluding connections establishing)
>
>[Master/Unpatched]
>...
>number of transactions actually processed: 8852327
>latency average = 1.084 ms
>tps = 14753.814648 (including connections establishing)
>tps = 14753.861589 (excluding connections establishing)
>
>My patch caused a little overhead of about 0.42-0.46%, which I think is small.
>Kindly let me know your opinions/comments about the patch or tests, etc.

Now try measuring that with a read-only workload, with prepared statements.
I've tried that on a machine with 16 cores, doing

  # 16 clients
  pgbench -n -S -j 16 -c 16 -M prepared -T 60 test

  # 1 client
  pgbench -n -S -c 1 -M prepared -T 60 test

and the average from 30 runs of each looks like this:

   # clients      master     patched          %
  ---------------------------------------------------------
   1               29690       27833      93.7%
   16             300935      283383      94.1%

That's quite a significant regression, considering it's optimizing an
operation that is expected to be pretty rare (people are generally not
dropping objects as often as they query them).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Nov 13, 2019 4:20AM (GMT+9), Tomas Vondra wrote:
> On Tue, Nov 12, 2019 at 10:49:49AM +0000, k.jamison@fujitsu.com wrote:
> >On Thurs, November 7, 2019 1:27 AM (GMT+9), Robert Haas wrote:
> >> On Tue, Nov 5, 2019 at 10:34 AM Tomas Vondra
> >> <tomas.vondra@2ndquadrant.com> wrote:
> >> > 2) This adds another hashtable maintenance to BufferAlloc etc. but
> >> > you've only done tests / benchmark for the case this optimizes. I
> >> > think we need to see a benchmark for a workload that allocates and
> >> > invalidates a lot of buffers. A pgbench with a workload that fits into
> >> > RAM but not into shared buffers would be interesting.
> >>
> >> Yeah, it seems pretty hard to believe that this won't be bad for some
> >> workloads. Not only do you have the overhead of the hash table
> >> operations, but you also have locking overhead around that. A whole new
> >> set of LWLocks where you have to take and release one of them every time
> >> you allocate or invalidate a buffer seems likely to cause a pretty
> >> substantial contention problem.
> >
> >I'm sorry for the late reply. Thank you Tomas and Robert for checking this patch.
> >Attached is the v3 of the patch.
> >- I moved the unnecessary items from buf_internals.h to cached_buf.c since
> >  most of those items are only used in that file.
> >- Fixed the bug of v2. Seems to pass both RT and TAP test now.
> >
> >Thanks for the advice on the benchmark test. Please refer below for the test
> >and results.
> >
> >[Machine spec]
> >CPU: 16, Number of cores per socket: 8
> >RHEL6.5, Memory: 240GB
> >
> >scale: 3125 (about 46GB DB size)
> >shared_buffers = 8GB
> >
> >[workload that fits into RAM but not into shared buffers]
> >pgbench -i -s 3125 cachetest
> >pgbench -c 16 -j 8 -T 600 cachetest
> >
> >[Patched]
> >number of transactions actually processed: 8815123
> >latency average = 1.089 ms
> >tps = 14691.436343 (including connections establishing)
> >tps = 14691.482714 (excluding connections establishing)
> >
> >[Master/Unpatched]
> >number of transactions actually processed: 8852327
> >latency average = 1.084 ms
> >tps = 14753.814648 (including connections establishing)
> >tps = 14753.861589 (excluding connections establishing)
> >
> >My patch caused a little overhead of about 0.42-0.46%, which I think is small.
> >Kindly let me know your opinions/comments about the patch or tests, etc.
>
> Now try measuring that with a read-only workload, with prepared statements.
> I've tried that on a machine with 16 cores, doing
>
> # 16 clients
> pgbench -n -S -j 16 -c 16 -M prepared -T 60 test
>
> # 1 client
> pgbench -n -S -c 1 -M prepared -T 60 test
>
> and the average from 30 runs of each looks like this:
>
>    # clients      master     patched          %
>   ---------------------------------------------------------
>    1               29690       27833      93.7%
>    16             300935      283383      94.1%
>
> That's quite a significant regression, considering it's optimizing an
> operation that is expected to be pretty rare (people are generally not
> dropping objects as often as they query them).

I updated the patch and reduced the lock contention of the new LWLock,
with tunable definitions in the code, and instead of using only the rnode
as the hash key, I also added the modulo of the block number:

    #define NUM_MAP_PARTITIONS_FOR_REL  128  /* relation-level */
    #define NUM_MAP_PARTITIONS_IN_REL     4  /* block-level */
    #define NUM_MAP_PARTITIONS \
        (NUM_MAP_PARTITIONS_FOR_REL * NUM_MAP_PARTITIONS_IN_REL)

I executed the read-only workload benchmark again; the regression now sits
at 3.10% (reduced from v3's 6%).

Average of 10 runs, 16 clients, read-only, prepared query mode:

[Master]
num of txn processed: 11,950,983.67
latency average = 0.080 ms
tps = 199,182.24
tps = 199,189.54

[V4 Patch]
num of txn processed: 11,580,256.36
latency average = 0.083 ms
tps = 193,003.52
tps = 193,010.76

I checked the wait event statistics (non-impactful events omitted) and got
the following below. I reset the stats before running the pgbench script,
then showed the stats right after the run.

[Master]
 wait_event_type |      wait_event       |  calls   | microsec
-----------------+-----------------------+----------+----------
 Client          | ClientRead            |    25116 | 49552452
 IO              | DataFileRead          | 14467109 | 92113056
 LWLock          | buffer_mapping        |   204618 |  1364779

[Patch V4]
 wait_event_type |      wait_event       |  calls   | microsec
-----------------+-----------------------+----------+----------
 Client          | ClientRead            |   111393 | 68773946
 IO              | DataFileRead          | 14186773 | 90399833
 LWLock          | buffer_mapping        |   463844 |  4025198
 LWLock          | cached_buf_tranche_id |    83390 |   336080

It seems the buffer_mapping LWLock wait time is roughly 3x higher (with
about twice the calls). However, I'd like to continue working on this patch
in the next commitfest, and further reduce its impact on read-only
workloads.

Regards,
Kirk Jamison
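Reading that description, the partition for a given buffer is presumably
derived from both the relation hash and the block number; a sketch of such
a mapping (my guess at the combination, not necessarily the patch's exact
formula):

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_MAP_PARTITIONS_FOR_REL  128  /* relation-level */
    #define NUM_MAP_PARTITIONS_IN_REL     4  /* block-level */

    /*
     * Guessed mapping: the rnode hash picks a relation-level partition,
     * and blocks of one relation spread over only
     * NUM_MAP_PARTITIONS_IN_REL sub-partitions, so dropping one relation
     * touches at most 4 partition locks while unrelated relations rarely
     * collide on the same lock.
     */
    static uint32_t
    cached_buf_partition(uint32_t rnode_hash, uint32_t block_num)
    {
        return (rnode_hash % NUM_MAP_PARTITIONS_FOR_REL) *
               NUM_MAP_PARTITIONS_IN_REL +
               (block_num % NUM_MAP_PARTITIONS_IN_REL);
    }

    int
    main(void)
    {
        /* e.g. a relation whose hash is 0xDEADBEEF, block 42 */
        printf("partition = %u\n", cached_buf_partition(0xDEADBEEFu, 42));
        return 0;
    }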
Hi,

I have updated the patch (v5). I tried to reduce the lock waiting times by
using a spinlock when inserting/deleting buffers in the new hash table, and
an exclusive lock when doing a lookup for buffers to be dropped. In summary,
instead of scanning the whole buffer pool in shared buffers, we just
traverse the doubly-linked list of linked buffers for the target relation
and block.

In order to understand how this patch affects performance, I also measured
the cache hit rates in addition to benchmarking the DB with various shared
buffer size settings.

Using the same machine specs, I used the default pgbench script for a
read-only workload with prepared statements, and executed about 15 runs for
varying shared buffer sizes.

pgbench -i -s 3200 test   //(about 48GB db size)
pgbench -S -n -M prepared -c 16 -j 16 -T 60 test

[TPS Regression]
 shbuf  | tps(master)  | tps(patch)   | %reg
--------+--------------+--------------+-------
 5GB    | 195,737.23   | 191,422.23   |  2.23
 10GB   | 197,067.93   | 194,011.66   |  1.55
 20GB   | 200,241.18   | 200,425.29   | -0.09
 40GB   | 208,772.81   | 209,807.38   | -0.50
 50GB   | 215,684.33   | 218,955.43   | -1.52

[CACHE HIT RATE]
 shbuf  | master       | patch
--------+--------------+----------
 10GB   | 0.141536     | 0.141485
 20GB   | 0.330088     | 0.329894
 30GB   | 0.573383     | 0.573377
 40GB   | 0.819499     | 0.819264
 50GB   | 0.999237     | 0.999577

For this workload, the regression increases for shared_buffers sizes below
20GB. However, the cache hit rate for both master and patch at 20GB shbuf is
only about 33%. Therefore, I think we can consider this kind of workload
with a low shared_buffers size a "special case", because in terms of DB
performance tuning we want the DB to have as high a cache hit rate as
possible (99.9%, or maybe let's say 80% is acceptable). In this workload,
the ideal shared_buffers size would be around 40GB, more or less, to hit
that acceptable cache hit rate. Looking at this patch's performance results,
within the acceptable cache hit rate there is at least no regression, and
the results also show almost similar tps compared to master.

Your feedback about the patch and tests is welcome.

Regards,
Kirk Jamison
Hi,

I have rebased the patch to keep the CFbot happy. Apparently, in the
previous patch there was a possibility of an infinite loop when allocating
buffers, so I fixed that part and also removed some whitespace.

Kindly check the attached V6 patch.
Any thoughts on this?

Regards,
Kirk Jamison
On Tue, Feb 4, 2020 at 4:57 AM k.jamison@fujitsu.com <k.jamison@fujitsu.com> wrote:
> Kindly check the attached V6 patch.
> Any thoughts on this?

Unfortunately, I don't have time for detailed review of this. I am
suspicious that there are substantial performance regressions that you
just haven't found yet. I would not take the position that this is a
completely hopeless approach, or anything like that, but neither would I
conclude that the tests shown so far are anywhere near enough to be
confident that there are no problems.

Also, systems with very large shared_buffers settings are becoming more
common, and probably will continue to become more common, so I don't think
we can dismiss that as an edge case any more. People don't want to run
with an 8GB cache on a 1TB server.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

I know this might already be late in the CommitFest, but attached is the
latest version of the patch. The previous version only included the buffer
invalidation improvement for VACUUM. The new patch adds the same routine
for TRUNCATE WAL replay.

In summary, this patch aims to improve the buffer invalidation process of
VACUUM and TRUNCATE. Although it may not be a common use case, our customer
uses these commands a lot. Recovery and WAL replay of these commands can
take time depending on the size of the database buffers. So this patch
optimizes that using the newly-added auxiliary cache and doubly linked list
in shared memory, so that we don't need to scan the shared buffers anymore.

As for the performance and how it affects the read-only workloads:
using pgbench, I've kept the overhead to a minimum, less than 1%.
I'll post follow-up results.

Although the additional hash table utilizes shared memory, there's a
significant performance gain for both TRUNCATE and VACUUM from execution
to recovery.

Regards,
Kirk Jamison
On Wednesday, March 25, 2020 3:25 PM, Kirk Jamison wrote:
> As for the performance and how it affects the read-only workloads:
> using pgbench, I've kept the overhead to a minimum, less than 1%.
> I'll post follow-up results.

Here are the follow-up results. I executed tests similar to those at the
top of the thread. I hope the performance test results shown below will
suffice. If not, I'd appreciate any feedback with regards to the tests or
the patch itself.

A. VACUUM execution + Failover test - 100GB shared_buffers

1. 1000 tables (18MB)
1.1. Execution Time
- [MASTER] 77755.218 ms (01:17.755)
- [PATCH] Execution Time: 2147.914 ms (00:02.148)
1.2. Failover Time (Recovery WAL Replay):
- [MASTER] 01:37.084 (1 min 37.084 s)
- [PATCH] 1627 ms (1.627 s)

2. 10000 tables (110MB)
2.1. Execution Time
- [MASTER] 844174.572 ms (14:04.175) ~14 min 4.175 s
- [PATCH] 75678.559 ms (01:15.679) ~1 min 15.679 s
2.2. Failover Time:
- [MASTER] est. 14 min++ (I didn't measure anymore because recovery takes
  as much time as the execution.)
- [PATCH] 01:25.559 (1 min 25.559 s)

Significant performance results for VACUUM.

B. TPS Regression for READ-ONLY workload (PREPARED QUERY MODE, NO VACUUM)

# [16 Clients] - pgbench -n -S -j 16 -c 16 -M prepared -T 60 cachetest
 shbuf  | Master    | Patch     | % reg
--------+-----------+-----------+--------
 128MB  | 77,416.76 | 77,162.78 |  0.33%
 1GB    | 81,941.30 | 81,812.05 |  0.16%
 2GB    | 84,273.69 | 84,356.38 | -0.10%
 100GB  | 83,807.30 | 83,924.68 | -0.14%

# [1 Client] - pgbench -n -S -c 1 -M prepared -T 60 cachetest
 shbuf  | Master    | Patch     | % reg
--------+-----------+-----------+--------
 128MB  | 12,044.54 | 12,037.13 |  0.06%
 1GB    | 12,736.57 | 12,774.77 | -0.30%
 2GB    | 12,948.98 | 13,159.90 | -1.63%
 100GB  | 12,982.98 | 13,064.04 | -0.62%

Both were run 10 times; the average tps and % regression are shown above.
At most, only minimal overhead was caused by the patch. In the other cases,
it showed higher tps compared to master.

If it does not make it into this CF, I hope to receive feedback in the
future on how to proceed. Thanks in advance!

Regards,
Kirk Jamison
Hi,

Since the last posted version of the patch fails, attached is a rebased version.
Written upthread were performance results and some benefits and challenges.
I'd appreciate your feedback/comments.

Regards,
Kirk Jamison
On 17.06.2020 09:14, k.jamison@fujitsu.com wrote:
> Hi,
>
> Since the last posted version of the patch fails, attached is a rebased version.
> Written upthread were performance results and some benefits and challenges.
> I'd appreciate your feedback/comments.
>
> Regards,
> Kirk Jamison

As far as I understand, this patch can provide a significant improvement of
performance only in the case of recovery of truncates of a large number of
tables. You have added a shared hash of relation buffers, and it certainly
adds some extra overhead. According to your latest results this overhead is
quite small. But it will be hard to prove that there will be no noticeable
regression at some workloads.

I wonder if you have considered the case of a local hash (maintained only
during recovery)?

If there is after-crash recovery, then there will be no concurrent access
to shared buffers and this hash will be up-to-date. In case of a
hot-standby replica we can use some simple invalidation (just one flag or
counter which indicates that the buffer cache was updated). This hash can
also be constructed on demand when DropRelFileNodeBuffers is called the
first time (so we have to scan all buffers once, but subsequent drop
operations will be fast).

I have not thought much about it, but it seems to me that as far as this
problem only affects recovery, we do not need a shared hash for it.
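A rough fragment of how that on-demand local hash might look in the startup
process (hypothetical and untested; it assumes the usual bufmgr/dynahash
headers, and RecoveryBufHashEnt is made up for illustration):

    /*
     * Hypothetical sketch of the idea above: a backend-local hash, built
     * on the first DropRelFileNodeBuffers() call during recovery, mapping
     * RelFileNode -> per-relation buffer list.
     */
    typedef struct RecoveryBufHashEnt
    {
        RelFileNode rnode;          /* hash key; must be first */
        int         first_buf_id;   /* head of per-relation buffer list */
    } RecoveryBufHashEnt;

    static HTAB *recovery_buf_hash = NULL;

    static void
    build_recovery_buf_hash(void)
    {
        HASHCTL ctl;
        int     i;

        memset(&ctl, 0, sizeof(ctl));
        ctl.keysize = sizeof(RelFileNode);
        ctl.entrysize = sizeof(RecoveryBufHashEnt);
        recovery_buf_hash = hash_create("recovery buf lookup", 1024, &ctl,
                                        HASH_ELEM | HASH_BLOBS);

        /*
         * One full scan of shared buffers.  In crash recovery no other
         * backend runs, so reading buffer tags without per-buffer locks
         * is safe; a hot-standby replica would additionally need the
         * invalidation flag/counter mentioned above.
         */
        for (i = 0; i < NBuffers; i++)
        {
            BufferDesc *buf = GetBufferDescriptor(i);
            RecoveryBufHashEnt *ent;
            bool        found;

            ent = hash_search(recovery_buf_hash, &buf->tag.rnode,
                              HASH_ENTER, &found);
            if (!found)
                ent->first_buf_id = -1;
            /* ... link buffer i into ent's list ... */
        }
    }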
On Wednesday, July 29, 2020 4:55 PM, Konstantin Knizhnik wrote:
> On 17.06.2020 09:14, k.jamison@fujitsu.com wrote:
> > Hi,
> >
> > Since the last posted version of the patch fails, attached is a rebased version.
> > Written upthread were performance results and some benefits and challenges.
> > I'd appreciate your feedback/comments.
> >
> > Regards,
> > Kirk Jamison
>
> As far as I understand, this patch can provide a significant improvement of
> performance only in the case of recovery of truncates of a large number of
> tables. You have added a shared hash of relation buffers, and it certainly
> adds some extra overhead. According to your latest results this overhead is
> quite small. But it will be hard to prove that there will be no noticeable
> regression at some workloads.

Thank you for taking a look at this.

Yes, one of the aims is to speed up recovery of truncations, but at the
same time the patch also improves autovacuum, vacuum and relation truncate
index executions. I showed pgbench results above for different types of
workloads, but I am not sure if those are validating enough...

> I wonder if you have considered the case of a local hash (maintained only
> during recovery)?
> If there is after-crash recovery, then there will be no concurrent access
> to shared buffers and this hash will be up-to-date.
> In case of a hot-standby replica we can use some simple invalidation (just
> one flag or counter which indicates that the buffer cache was updated).
> This hash can also be constructed on demand when DropRelFileNodeBuffers is
> called the first time (so we have to scan all buffers once, but subsequent
> drop operations will be fast).
>
> I have not thought much about it, but it seems to me that as far as this
> problem only affects recovery, we do not need a shared hash for it.

The idea of the patch is to mark the relation buffers to be dropped after
scanning the whole of shared buffers, store them into shared memory
maintained in a dlist, and traverse the dlist on the next scan. But I
understand the point that it is expensive and may cause overhead; that is
why I tried to define a macro to limit the number of pages that we can
cache, for cases where the lookup cost can be problematic (i.e. too many
pages of a relation):

    #define BUF_ID_ARRAY_SIZE 100
    int     buf_id_array[BUF_ID_ARRAY_SIZE];
    int     forknum_indexes[BUF_ID_ARRAY_SIZE];

In DropRelFileNodeBuffers:

    do
    {
        nbufs = CachedBlockLookup(..., forknum_indexes, buf_id_array,
                                  lengthof(buf_id_array));

        for (i = 0; i < nbufs; i++)
        {
            ...
        }
    } while (nbufs == lengthof(buf_id_array));

Perhaps the patch adds complexity, so we may want to keep it simpler, or
commit it piece by piece? I will look further into your suggestion of
maintaining a local hash only during recovery. Thank you for the suggestion.

Regards,
Kirk Jamison
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            not tested

I have tested this patch at various workloads and hardware (including
Power2 server with 384 virtual cores) and didn't find performance
regression.

The new status of this patch is: Ready for Committer
On Friday, July 31, 2020 2:37 AM, Konstantin Knizhnik wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:       tested, passed
> Spec compliant:           not tested
> Documentation:            not tested
>
> I have tested this patch at various workloads and hardware (including
> Power2 server with 384 virtual cores) and didn't find performance
> regression.
>
> The new status of this patch is: Ready for Committer

Thank you very much, Konstantin, for testing the patch for different
workloads. I wonder if I need to modify some documentation. I'll leave the
final review to the committer/s as well.

Regards,
Kirk Jamison
Robert Haas <robertmhaas@gmail.com> writes:
> Unfortunately, I don't have time for detailed review of this. I am
> suspicious that there are substantial performance regressions that you
> just haven't found yet. I would not take the position that this is a
> completely hopeless approach, or anything like that, but neither would
> I conclude that the tests shown so far are anywhere near enough to be
> confident that there are no problems.

I took a quick look through the v8 patch, since it's marked RFC, and my
feeling is about the same as Robert's: it is just about impossible to
believe that doubling (or more) the amount of hashtable manipulation
involved in allocating a buffer won't hurt common workloads. The offered
pgbench results don't reassure me; we've so often found that pgbench fails
to expose performance problems, except maybe when it's used just so.

But aside from that, I noted a number of things I didn't like a bit:

* The amount of new shared memory this needs seems several orders of
magnitude higher than what I'd call acceptable: according to my
measurements it's over 10KB per shared buffer! Most of that is going into
the CachedBufTableLock data structure, which seems fundamentally
misdesigned --- how could we be needing a lock per map partition *per
buffer*? For comparison, the space used by buf_table.c is about 56 bytes
per shared buffer; I think this needs to stay at least within hailing
distance of there.

* It is fairly suspicious that the new data structure is manipulated while
holding per-partition locks for the existing buffer hashtable. At best
that seems bad for concurrency, and at worst it could result in deadlocks,
because I doubt we can assume that the new hash table has partition
boundaries identical to the old one.

* More generally, it seems like really poor design that this has been
written completely independently of the existing buffer hash table. Can't
we get any benefit by merging them somehow?

* I do not like much of anything in the code details. "CachedBuf" is as
unhelpful as could be as a data structure identifier --- what exactly is
not "cached" about shared buffers already? "CombinedLock" is not too
helpful either, nor could I find any documentation explaining why you need
to invent new locking technology in the first place. At best,
CombinedLockAcquireSpinLock seems like a brute-force approach to an
undocumented problem.

* The commentary overall is far too sparse to be of any value ---
basically, any reader will have to reverse-engineer your entire design.
That's not how we do things around here. There should be either a README,
or a long file header comment, explaining what's going on, how the data
structure is organized, and what the locking requirements are. See
src/backend/storage/buffer/README for the sort of documentation that I
think this needs.

Even if I were convinced that there's no performance gotchas, I wouldn't
commit this in anything like its current form.

Robert again:
> Also, systems with very large shared_buffers settings are becoming
> more common, and probably will continue to become more common, so I
> don't think we can dismiss that as an edge case any more. People don't
> want to run with an 8GB cache on a 1TB server.

I do agree that it'd be great to improve this area. Just not convinced
that this is how.

regards, tom lane
Hi,

On 2020-07-31 13:39:37 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > Unfortunately, I don't have time for detailed review of this. I am
> > suspicious that there are substantial performance regressions that you
> > just haven't found yet. I would not take the position that this is a
> > completely hopeless approach, or anything like that, but neither would
> > I conclude that the tests shown so far are anywhere near enough to be
> > confident that there are no problems.
>
> I took a quick look through the v8 patch, since it's marked RFC, and
> my feeling is about the same as Robert's: it is just about impossible
> to believe that doubling (or more) the amount of hashtable manipulation
> involved in allocating a buffer won't hurt common workloads. The
> offered pgbench results don't reassure me; we've so often found that
> pgbench fails to expose performance problems, except maybe when it's
> used just so.

Indeed. The buffer mapping hashtable already is visible as a major
bottleneck in a number of workloads. Even in readonly pgbench if s_b is
large enough (so the hashtable is larger than the cache). Not to speak of
things like a cached sequential scan with a cheap qual and wide rows.

> Robert again:
> > Also, systems with very large shared_buffers settings are becoming
> > more common, and probably will continue to become more common, so I
> > don't think we can dismiss that as an edge case any more. People don't
> > want to run with an 8GB cache on a 1TB server.
>
> I do agree that it'd be great to improve this area. Just not convinced
> that this is how.

Wonder if the temporary fix is just to do explicit hashtable probes for
all pages iff the size of the relation is < s_b / 500 or so. That'll
address the case where small tables are frequently dropped - and dropping
large relations is more expensive from the OS and data loading
perspective, so it's not gonna happen as often.

Greetings,

Andres Freund
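For concreteness, the suggested fast path could look roughly like this
inside DropRelFileNodeBuffers() (a sketch only, using existing bufmgr APIs
of that era; 'smgr_reln' and a reliable nblocks value are the assumptions
being debated below, not things the sketch provides):

    /*
     * Hypothetical fast path: if the relation is small relative to
     * shared_buffers, probe the buffer mapping hash once per block
     * instead of scanning all of NBuffers.
     */
    BlockNumber nblocks = smgrnblocks(smgr_reln, forkNum);
    BlockNumber blkno;

    if (nblocks < (BlockNumber) (NBuffers / 500))
    {
        for (blkno = 0; blkno < nblocks; blkno++)
        {
            BufferTag   tag;
            uint32      hash;
            LWLock     *partitionLock;
            int         buf_id;

            INIT_BUFFERTAG(tag, smgr_reln->smgr_rnode.node, forkNum, blkno);
            hash = BufTableHashCode(&tag);
            partitionLock = BufMappingPartitionLock(hash);

            LWLockAcquire(partitionLock, LW_SHARED);
            buf_id = BufTableLookup(&tag, hash);
            LWLockRelease(partitionLock);

            /*
             * A real implementation must re-check the buffer's tag under
             * the buffer header lock before invalidating it, since the
             * buffer may be recycled once the partition lock is released.
             */
            if (buf_id >= 0)
                InvalidateBuffer(GetBufferDescriptor(buf_id));
        }
        return;                 /* skip the full shared_buffers scan */
    }
    /* otherwise fall through to the existing full scan */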
Andres Freund <andres@anarazel.de> writes:
> Indeed. The buffer mapping hashtable already is visible as a major
> bottleneck in a number of workloads. Even in readonly pgbench if s_b is
> large enough (so the hashtable is larger than the cache). Not to speak of
> things like a cached sequential scan with a cheap qual and wide rows.

To be fair, the added overhead is in buffer allocation not buffer lookup.
So it shouldn't add cost to fully-cached cases. As Tomas noted upthread,
the potential trouble spot is where the working set is bigger than shared
buffers but still fits in RAM (so there's no actual I/O needed, but we do
still have to shuffle buffers a lot).

> Wonder if the temporary fix is just to do explicit hashtable probes for
> all pages iff the size of the relation is < s_b / 500 or so. That'll
> address the case where small tables are frequently dropped - and dropping
> large relations is more expensive from the OS and data loading
> perspective, so it's not gonna happen as often.

Oooh, interesting idea. We'd need a reliable idea of how long the relation
had been (preferably without adding an lseek call), but maybe that's
do-able.

regards, tom lane
Hi,

On 2020-07-31 15:50:04 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Indeed. The buffer mapping hashtable already is visible as a major
> > bottleneck in a number of workloads. Even in readonly pgbench if s_b is
> > large enough (so the hashtable is larger than the cache). Not to speak
> > of things like a cached sequential scan with a cheap qual and wide rows.
>
> To be fair, the added overhead is in buffer allocation not buffer lookup.
> So it shouldn't add cost to fully-cached cases. As Tomas noted upthread,
> the potential trouble spot is where the working set is bigger than shared
> buffers but still fits in RAM (so there's no actual I/O needed, but we do
> still have to shuffle buffers a lot).

Oh, right, not sure what I was thinking.

> > Wonder if the temporary fix is just to do explicit hashtable probes for
> > all pages iff the size of the relation is < s_b / 500 or so. That'll
> > address the case where small tables are frequently dropped - and
> > dropping large relations is more expensive from the OS and data loading
> > perspective, so it's not gonna happen as often.
>
> Oooh, interesting idea. We'd need a reliable idea of how long the
> relation had been (preferably without adding an lseek call), but maybe
> that's do-able.

IIRC we already do smgrnblocks nearby, when doing the truncation (to
figure out which segments we need to remove). Perhaps we can arrange to
combine the two? The layering probably makes that somewhat ugly :(

We could also just use pg_class.relpages. It'll probably mostly be
accurate enough?

Or we could just cache the result of the last smgrnblocks call...

One of the cases where this type of strategy is most interesting to me is
the partial truncations that autovacuum does... There we even know the
range of tables ahead of time.

Greetings,

Andres Freund
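The "cache the result of the last smgrnblocks call" idea might look roughly
like this (a simplified guess, not an actual implementation;
smgr_cached_nblocks is a hypothetical field here, and smgrsw/smgr_which are
the existing smgr.c function-table internals):

    BlockNumber
    smgrnblocks(SMgrRelation reln, ForkNumber forknum)
    {
        BlockNumber result;

        /*
         * Trust the cached value only while InRecovery, when no other
         * backend can extend the relation behind our back.
         */
        if (InRecovery &&
            reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
            return reln->smgr_cached_nblocks[forknum];

        result = smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum);
        reln->smgr_cached_nblocks[forknum] = result;  /* hypothetical field */
        return result;
    }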
On Saturday, August 1, 2020 5:24 AM, Andres Freund wrote:

Hi,
Thank you for your constructive review and comments. Sorry for the late reply.

> Hi,
>
> On 2020-07-31 15:50:04 -0400, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > Indeed. The buffer mapping hashtable already is visible as a major
> > > bottleneck in a number of workloads. Even in readonly pgbench if s_b
> > > is large enough (so the hashtable is larger than the cache). Not to
> > > speak of things like a cached sequential scan with a cheap qual and
> > > wide rows.
> >
> > To be fair, the added overhead is in buffer allocation not buffer lookup.
> > So it shouldn't add cost to fully-cached cases. As Tomas noted
> > upthread, the potential trouble spot is where the working set is
> > bigger than shared buffers but still fits in RAM (so there's no actual
> > I/O needed, but we do still have to shuffle buffers a lot).
>
> Oh, right, not sure what I was thinking.
>
> > > Wonder if the temporary fix is just to do explicit hashtable probes
> > > for all pages iff the size of the relation is < s_b / 500 or so.
> > > That'll address the case where small tables are frequently dropped -
> > > and dropping large relations is more expensive from the OS and data
> > > loading perspective, so it's not gonna happen as often.
> >
> > Oooh, interesting idea. We'd need a reliable idea of how long the
> > relation had been (preferably without adding an lseek call), but maybe
> > that's do-able.
>
> IIRC we already do smgrnblocks nearby, when doing the truncation (to
> figure out which segments we need to remove). Perhaps we can arrange to
> combine the two? The layering probably makes that somewhat ugly :(
>
> We could also just use pg_class.relpages. It'll probably mostly be
> accurate enough?
>
> Or we could just cache the result of the last smgrnblocks call...
>
> One of the cases where this type of strategy is most interesting to me is
> the partial truncations that autovacuum does... There we even know the
> range of tables ahead of time.

Konstantin tested it on various workloads and saw no regression.
But I understand the sentiment on the added overhead on BufferAlloc.
Regarding the case where the patch would potentially affect workloads that
fit into RAM but not into shared buffers, could one of Andres' suggested
ideas above address that, in addition to this patch's possible shared
invalidation fix? Could that settle the added overhead in BufferAlloc() as
a temporary fix?

Thomas Munro is also working on caching relation sizes [1]; maybe that way
we could get the latest known relation size. Currently, it's possible only
during recovery in smgrnblocks.

Tom Lane wrote:
> But aside from that, I noted a number of things I didn't like a bit:
>
> * The amount of new shared memory this needs seems several orders of
> magnitude higher than what I'd call acceptable: according to my
> measurements it's over 10KB per shared buffer! Most of that is going into
> the CachedBufTableLock data structure, which seems fundamentally
> misdesigned --- how could we be needing a lock per map partition *per
> buffer*? For comparison, the space used by buf_table.c is about 56 bytes
> per shared buffer; I think this needs to stay at least within hailing
> distance of there.
>
> * It is fairly suspicious that the new data structure is manipulated
> while holding per-partition locks for the existing buffer hashtable. At
> best that seems bad for concurrency, and at worst it could result in
> deadlocks, because I doubt we can assume that the new hash table has
> partition boundaries identical to the old one.
>
> * More generally, it seems like really poor design that this has been
> written completely independently of the existing buffer hash table.
> Can't we get any benefit by merging them somehow?

The original aim is to just shorten the recovery process, and the eventual
speedup of both the vacuum and truncate processes is just an added bonus.
Given that we don't have a shared invalidation mechanism in place yet, like
radix-tree buffer mapping, which is complex, I thought a patch like mine
could be an alternative approach to that. So I want to improve the patch
further. I hope you can help me clarify the direction, so that I can avoid
going farther away from what the community wants:
1. Both normal operations and the recovery process
2. Improve the recovery process only

For 1, the current patch aims to touch on that, but further design
improvement is needed. It would be ideal to modify the BufferDesc, but that
cannot be expanded anymore because it would exceed the CPU cache line size.
So I added new data structures (hash table, dlist, lock) instead of
modifying the existing ones. The new hash table ensures that it's identical
to the old one with the use of the same RelFileNode in the key and a lock
when inserting and deleting buffers from the buffer table, as well as
during lookups. As for the partition locking, I added it to reduce lock
contention.

Tomas Vondra reported regression, mainly due to buffer mapping locks, in v4
and previous patch versions. So from v5, I used a spinlock when
inserting/deleting buffers, to prevent modification while a concurrent
lookup is happening. An LWLock is acquired when we're doing the lookup
operation. If we want this direction, I hope to address Tom's comments in
the next patch version. I admit that this patch needs reworking of its
shmem resource consumption and more clarification of the design/approach,
i.e. how it affects the existing buffer allocation and invalidation
process, the lock mechanism, etc.

If we're going for 2, Konstantin suggested an idea in the previous email:

> I wonder if you have considered the case of a local hash (maintained only
> during recovery)?
> If there is after-crash recovery, then there will be no concurrent access
> to shared buffers and this hash will be up-to-date.
> In case of a hot-standby replica we can use some simple invalidation (just
> one flag or counter which indicates that the buffer cache was updated).
> This hash can also be constructed on demand when DropRelFileNodeBuffers is
> called the first time (so we have to scan all buffers once, but subsequent
> drop operations will be fast).

I'm examining this, but I am not sure if I got the correct understanding.
Please correct me if I'm wrong.

I think the above is a suggestion wherein the postgres startup process uses
a local hash table to keep track of the buffers of relations. Since there
may be other read-only sessions which read from disk, evict cached blocks,
and modify the shared_buffers, the flag would be updated. We could do it
during recovery, then release it as recovery completes.

I haven't looked deeply into the source code yet, but maybe we can modify
the REDO (main redo do-while loop) in StartupXLOG() once the read-only
connections are consistent. It would also be beneficial to construct this
local hash when DropRelFileNodeBuffers() is called for the first time, so
the whole of shared buffers is scanned initially, then as you mentioned
subsequent dropping will be fast (similar behavior to what the patch does).

Do you think this is feasible to implement? Or should we explore another
approach?

I'd really appreciate your ideas, feedback, suggestions, and advice.
Thank you again for the review.

Regards,
Kirk Jamison

[1] https://www.postgresql.org/message-id/CA%2BhUKGKEW7-9pq%2Bs2_4Q-Fcpr9cc7_0b3pkno5qzPKC3y2nOPA%40mail.gmail.com
On Thu, Aug 06, 2020 at 01:23:31AM +0000, k.jamison@fujitsu.com wrote:
>On Saturday, August 1, 2020 5:24 AM, Andres Freund wrote:
>
>Hi,
>Thank you for your constructive review and comments. Sorry for the late reply.
>
>> Hi,
>>
>> On 2020-07-31 15:50:04 -0400, Tom Lane wrote:
>> > Andres Freund <andres@anarazel.de> writes:
>> > > Indeed. The buffer mapping hashtable already is visible as a major
>> > > bottleneck in a number of workloads. Even in readonly pgbench if s_b
>> > > is large enough (so the hashtable is larger than the cache). Not to
>> > > speak of things like a cached sequential scan with a cheap qual and
>> > > wide rows.
>> >
>> > To be fair, the added overhead is in buffer allocation not buffer lookup.
>> > So it shouldn't add cost to fully-cached cases. As Tomas noted
>> > upthread, the potential trouble spot is where the working set is
>> > bigger than shared buffers but still fits in RAM (so there's no actual
>> > I/O needed, but we do still have to shuffle buffers a lot).
>>
>> Oh, right, not sure what I was thinking.
>>
>> > > Wonder if the temporary fix is just to do explicit hashtable probes
>> > > for all pages iff the size of the relation is < s_b / 500 or so.
>> > > That'll address the case where small tables are frequently dropped -
>> > > and dropping large relations is more expensive from the OS and data
>> > > loading perspective, so it's not gonna happen as often.
>> >
>> > Oooh, interesting idea. We'd need a reliable idea of how long the
>> > relation had been (preferably without adding an lseek call), but maybe
>> > that's do-able.
>>
>> IIRC we already do smgrnblocks nearby, when doing the truncation (to
>> figure out which segments we need to remove). Perhaps we can arrange to
>> combine the two? The layering probably makes that somewhat ugly :(
>>
>> We could also just use pg_class.relpages. It'll probably mostly be
>> accurate enough?
>>
>> Or we could just cache the result of the last smgrnblocks call...
>>
>> One of the cases where this type of strategy is most interesting to me
>> is the partial truncations that autovacuum does... There we even know
>> the range of tables ahead of time.
>
>Konstantin tested it on various workloads and saw no regression.

Unfortunately Konstantin did not share any details about what workloads he
tested, what config etc. But I find the "no regression" hypothesis rather
hard to believe, because we're adding a non-trivial amount of code to a
place that can be quite hot. And I can trivially reproduce measurable (and
significant) regression using a very simple pgbench read-only test, with
an amount of data that exceeds shared buffers but fits into RAM.

The following numbers are from an x86_64 machine with 16 cores (32 w HT),
64GB of RAM, and 8GB shared buffers, using pgbench scale 1000 (so 16GB,
i.e. twice the SB size).

With simple "pgbench -S" tests (warmup and then 15 x 1-minute runs with
1, 8 and 16 clients - see the attached script for details) I see this:

              1 client       8 clients      16 clients
   ----------------------------------------------
   master        38249          236336          368591
   patched       35853          217259          349248
                   -6%             -8%             -5%

This is the average of the runs, but the conclusions for medians are
almost exactly the same.

>But I understand the sentiment on the added overhead on BufferAlloc.
>Regarding the case where the patch would potentially affect workloads that
>fit into RAM but not into shared buffers, could one of Andres' suggested
>ideas above address that, in addition to this patch's possible shared
>invalidation fix? Could that settle the added overhead in BufferAlloc() as
>a temporary fix?

Not sure.

>Thomas Munro is also working on caching relation sizes [1]; maybe that way
>we could get the latest known relation size. Currently, it's possible only
>during recovery in smgrnblocks.

It's not clear to me how knowing the relation size would help reduce the
overhead of this patch?

Can't we somehow identify cases when this optimization might help and only
actually enable it in those cases? Like in a recovery, with a lot of
truncates, or something like that.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Aug 6, 2020 at 6:53 AM k.jamison@fujitsu.com <k.jamison@fujitsu.com> wrote: > > On Saturday, August 1, 2020 5:24 AM, Andres Freund wrote: > > Hi, > Thank you for your constructive review and comments. > Sorry for the late reply. > > > Hi, > > > > On 2020-07-31 15:50:04 -0400, Tom Lane wrote: > > > Andres Freund <andres@anarazel.de> writes: > > > > Indeed. The buffer mapping hashtable already is visible as a major > > > > bottleneck in a number of workloads. Even in readonly pgbench if s_b > > > > is large enough (so the hashtable is larger than the cache). Not to > > > > speak of things like a cached sequential scan with a cheap qual and wide > > rows. > > > > > > To be fair, the added overhead is in buffer allocation not buffer lookup. > > > So it shouldn't add cost to fully-cached cases. As Tomas noted > > > upthread, the potential trouble spot is where the working set is > > > bigger than shared buffers but still fits in RAM (so there's no actual > > > I/O needed, but we do still have to shuffle buffers a lot). > > > > Oh, right, not sure what I was thinking. > > > > > > > > Wonder if the temporary fix is just to do explicit hashtable probes > > > > for all pages iff the size of the relation is < s_b / 500 or so. > > > > That'll address the case where small tables are frequently dropped - > > > > and dropping large relations is more expensive from the OS and data > > > > loading perspective, so it's not gonna happen as often. > > > > > > Oooh, interesting idea. We'd need a reliable idea of how long the > > > relation had been (preferably without adding an lseek call), but maybe > > > that's do-able. > > > > IIRC we already do smgrnblocks nearby, when doing the truncation (to figure out > > which segments we need to remove). Perhaps we can arrange to combine the > > two? The layering probably makes that somewhat ugly :( > > > > We could also just use pg_class.relpages. It'll probably mostly be accurate > > enough? > > > > Or we could just cache the result of the last smgrnblocks call... > > > > > > One of the cases where this type of strategy is most intersting to me is the partial > > truncations that autovacuum does... There we even know the range of tables > > ahead of time. > > Konstantin tested it on various workloads and saw no regression. > But I understand the sentiment on the added overhead on BufferAlloc. > Regarding the case where the patch would potentially affect workloads that fit into > RAM but not into shared buffers, could one of Andres' suggested idea/s above address > that, in addition to this patch's possible shared invalidation fix? Could that settle > the added overhead in BufferAlloc() as temporary fix? > Yes, I think so. Because as far as I can understand he is suggesting to do changes only in the Truncate/Vacuum code path. Basically, I think you need to change DropRelFileNodeBuffers or similar functions. There shouldn't be any change in the BufferAlloc or the common code path, so there is no question of regression in such cases. I am not sure what you have in mind for this but feel free to clarify your understanding before implementation. > Thomas Munro is also working on caching relation sizes [1], maybe that way we > could get the latest known relation size. Currently, it's possible only during > recovery in smgrnblocks. 
> > Tom Lane wrote: > > But aside from that, I noted a number of things I didn't like a bit: > > > > * The amount of new shared memory this needs seems several orders of > > magnitude higher than what I'd call acceptable: according to my measurements > > it's over 10KB per shared buffer! Most of that is going into the > > CachedBufTableLock data structure, which seems fundamentally misdesigned --- > > how could we be needing a lock per map partition *per buffer*? For comparison, > > the space used by buf_table.c is about 56 bytes per shared buffer; I think this > > needs to stay at least within hailing distance of there. > > > > * It is fairly suspicious that the new data structure is manipulated while holding > > per-partition locks for the existing buffer hashtable. > > At best that seems bad for concurrency, and at worst it could result in deadlocks, > > because I doubt we can assume that the new hash table has partition boundaries > > identical to the old one. > > > > * More generally, it seems like really poor design that this has been written > > completely independently of the existing buffer hash table. > > Can't we get any benefit by merging them somehow? > > The original aim is to just shorten the recovery process, and eventually the speedup > on both vacuum and truncate process are just added bonus. > Given that we don't have a shared invalidation mechanism in place yet like radix tree > buffer mapping which is complex, I thought a patch like mine could be an alternative > approach to that. So I want to improve the patch further. > I hope you can help me clarify the direction, so that I can avoid going farther away > from what the community wants. > 1. Both normal operations and recovery process > 2. Improve recovery process only > I feel Andres's suggestion will help in both cases. > > I wonder if you have considered case of local hash (maintained only during recovery)? > > If there is after-crash recovery, then there will be no concurrent > > access to shared buffers and this hash will be up-to-date. > > in case of hot-standby replica we can use some simple invalidation (just > > one flag or counter which indicates that buffer cache was updated). > > This hash also can be constructed on demand when DropRelFileNodeBuffers > > is called first time (so w have to scan all buffers once, but subsequent > > drop operation will be fast. > > I'm examining this, but I am not sure if I got the correct understanding. Please correct > me if I'm wrong. > I think above is a suggestion wherein the postgres startup process uses local hash table > to keep track of the buffers of relations. Since there may be other read-only sessions which > read from disk, evict cached blocks, and modify the shared_buffers, the flag would be updated. > We could do it during recovery, then release it as recovery completes. > > I haven't looked deeply yet into the source code but we maybe can modify the REDO > (main redo do-while loop) in StartupXLOG() once the read-only connections are consistent. > It would also be beneficial to construct this local hash when DropRefFileNodeBuffers() > is called for the first time, so the whole share buffers is scanned initially, then as > you mentioned subsequent dropping will be fast. (similar behavior to what the patch does) > > Do you think this is feasible to be implemented? Or should we explore another approach? 
> I think we should try what Andres is suggesting, as that seems like a promising idea and can address most of the common problems in this area, but if you feel otherwise, then do let us know. -- With Regards, Amit Kapila.
On Fri, Aug 7, 2020 at 3:03 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > >But I understand the sentiment on the added overhead on BufferAlloc. > >Regarding the case where the patch would potentially affect workloads > >that fit into RAM but not into shared buffers, could one of Andres' > >suggested idea/s above address that, in addition to this patch's > >possible shared invalidation fix? Could that settle the added overhead > >in BufferAlloc() as a temporary fix? > > Not sure. > > >Thomas Munro is also working on caching relation sizes [1], maybe that > >way we could get the latest known relation size. Currently, it's > >possible only during recovery in smgrnblocks. > > It's not clear to me how knowing the relation size would help reduce > the overhead of this patch? > AFAICU the idea is to directly call BufTableLookup (similar to how we do in BufferAlloc) to find the buf_id in function DropRelFileNodeBuffers and then invalidate the required buffers. And, we need to do this when the size of the relation is less than some threshold. So, I think the crux would be to reliably get the number-of-blocks information, and the relation size cache work might help with that. -- With Regards, Amit Kapila.
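For concreteness, a minimal sketch of the per-block probing described above, using the existing buf_table.c/bufmgr.c entry points; the function name is hypothetical, and error handling is elided:

static void
DropRelBuffersByProbe(RelFileNode rnode, ForkNumber forkNum,
                      BlockNumber firstDelBlock, BlockNumber nblocks)
{
    BlockNumber blkno;

    for (blkno = firstDelBlock; blkno < nblocks; blkno++)
    {
        BufferTag   tag;
        uint32      hash;
        LWLock     *partitionLock;
        int         buf_id;
        BufferDesc *bufHdr;
        uint32      buf_state;

        /* probe the buffer mapping table instead of scanning NBuffers */
        INIT_BUFFERTAG(tag, rnode, forkNum, blkno);
        hash = BufTableHashCode(&tag);
        partitionLock = BufMappingPartitionLock(hash);

        LWLockAcquire(partitionLock, LW_SHARED);
        buf_id = BufTableLookup(&tag, hash);
        LWLockRelease(partitionLock);

        if (buf_id < 0)
            continue;           /* block not in shared buffers */

        bufHdr = GetBufferDescriptor(buf_id);

        /* recheck under the header spinlock; the buffer may have been
         * recycled for another page after the partition lock was released */
        buf_state = LockBufHdr(bufHdr);
        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
            bufHdr->tag.forkNum == forkNum &&
            bufHdr->tag.blockNum == blkno)
            InvalidateBuffer(bufHdr);   /* releases spinlock */
        else
            UnlockBufHdr(bufHdr, buf_state);
    }
}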
On Sat, Aug 1, 2020 at 1:53 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2020-07-31 15:50:04 -0400, Tom Lane wrote: > > Andres Freund <andres@anarazel.de> writes: > > > > Wonder if the temporary fix is just to do explicit hashtable probes for > > > all pages iff the size of the relation is < s_b / 500 or so. That'll > > > address the case where small tables are frequently dropped - and > > > dropping large relations is more expensive from the OS and data loading > > > perspective, so it's not gonna happen as often. > > > > Oooh, interesting idea. We'd need a reliable idea of how long the > > relation had been (preferably without adding an lseek call), but maybe > > that's do-able. > > IIRC we already do smgrnblocks nearby, when doing the truncation (to > figure out which segments we need to remove). Perhaps we can arrange to > combine the two? The layering probably makes that somewhat ugly :( > > We could also just use pg_class.relpages. It'll probably mostly be > accurate enough? > Don't we need the accurate 'number of blocks' if we want to invalidate all the buffers? Basically, I think we need to perform BufTableLookup for all the blocks in the relation and then Invalidate all buffers. -- With Regards, Amit Kapila.
Amit Kapila <amit.kapila16@gmail.com> writes: > On Sat, Aug 1, 2020 at 1:53 AM Andres Freund <andres@anarazel.de> wrote: >> We could also just use pg_class.relpages. It'll probably mostly be >> accurate enough? > Don't we need the accurate 'number of blocks' if we want to invalidate > all the buffers? Basically, I think we need to perform BufTableLookup > for all the blocks in the relation and then Invalidate all buffers. Yeah, there is no room for "good enough" here. If a dirty buffer remains in the system, the checkpointer will eventually try to flush it, and fail (because there's no file to write it to), and then checkpointing will be stuck. So we cannot afford to risk missing any buffers. regards, tom lane
On 07.08.2020 00:33, Tomas Vondra wrote: > > Unfortunately Konstantin did not share any details about what workloads > he tested, what config etc. But I find the "no regression" hypothesis > rather hard to believe, because we're adding non-trivial amount of code > to a place that can be quite hot. Sorry that I have not explained my test scenarios. Since Postgres is a pgbench-oriented database :) I have also used pgbench: the read-only case and skip-some updates. For this patch the most critical factor is the number of buffer allocations, so I used a small enough database (scale=100), but shared buffers was set to 1GB. As a result, all data is cached in memory (in the file system cache), but there is intensive swapping at the Postgres buffer manager level. I have tested it both with a relatively small (100) and a large (1000) number of clients. I repeated these tests on my notebook (quadcore, 16GB RAM, SSD) and an IBM Power2 server with about 380 virtual cores and about 1TB of memory. In the last case the results vary very much (I think because of the NUMA architecture), but I failed to find any noticeable regression of the patched version. But I have to agree that adding a parallel hash (in addition to the existing buffer manager hash) is not so good an idea. This cache really quite frequently becomes a bottleneck. My explanation of why I have not observed any noticeable regression is that this patch uses almost the same lock partitioning schema as is already used, so it adds not so many new conflicts. Maybe in the case of the Power2 server the overhead of NUMA is much higher than other factors (although a shared hash is one of the main things suffering from the NUMA architecture). But in principle I agree that having two independent caches may decrease speed up to two times (or even more). I hope that everybody will agree that this problem is really critical. It is certainly not the most common case that there are hundreds of relations which are frequently truncated. But having quadratic complexity in a drop function is not acceptable from my point of view. And it is not only a recovery-specific problem, which is why the solution with a local cache is not enough. I do not know a good solution to the problem. Just some thoughts. - We can somehow combine the locking used for the main buffer manager cache (by relid/blockno) and the cache for relid. It would eliminate the double locking overhead. - We can use something like a sorted tree (like std::map) instead of a hash - it would allow locating blocks both by relid/blockno and by relid only.
On Friday, August 7, 2020 12:38 PM, Amit Kapila wrote: Hi, > On Thu, Aug 6, 2020 at 6:53 AM k.jamison@fujitsu.com <k.jamison@fujitsu.com> > wrote: > > > > On Saturday, August 1, 2020 5:24 AM, Andres Freund wrote: > > > > Hi, > > Thank you for your constructive review and comments. > > Sorry for the late reply. > > > > > Hi, > > > > > > On 2020-07-31 15:50:04 -0400, Tom Lane wrote: > > > > Andres Freund <andres@anarazel.de> writes: > > > > > Indeed. The buffer mapping hashtable already is visible as a > > > > > major bottleneck in a number of workloads. Even in readonly > > > > > pgbench if s_b is large enough (so the hashtable is larger than > > > > > the cache). Not to speak of things like a cached sequential scan > > > > > with a cheap qual and wide > > > rows. > > > > > > > > To be fair, the added overhead is in buffer allocation not buffer lookup. > > > > So it shouldn't add cost to fully-cached cases. As Tomas noted > > > > upthread, the potential trouble spot is where the working set is > > > > bigger than shared buffers but still fits in RAM (so there's no > > > > actual I/O needed, but we do still have to shuffle buffers a lot). > > > > > > Oh, right, not sure what I was thinking. > > > > > > > > > > > Wonder if the temporary fix is just to do explicit hashtable > > > > > probes for all pages iff the size of the relation is < s_b / 500 or so. > > > > > That'll address the case where small tables are frequently > > > > > dropped - and dropping large relations is more expensive from > > > > > the OS and data loading perspective, so it's not gonna happen as often. > > > > > > > > Oooh, interesting idea. We'd need a reliable idea of how long the > > > > relation had been (preferably without adding an lseek call), but > > > > maybe that's do-able. > > > > > > IIRC we already do smgrnblocks nearby, when doing the truncation (to > > > figure out which segments we need to remove). Perhaps we can arrange > > > to combine the two? The layering probably makes that somewhat ugly > > > :( > > > > > > We could also just use pg_class.relpages. It'll probably mostly be > > > accurate enough? > > > > > > Or we could just cache the result of the last smgrnblocks call... > > > > > > > > > One of the cases where this type of strategy is most interesting to > > > me is the partial truncations that autovacuum does... There we even > > > know the range of tables ahead of time. > > > > Konstantin tested it on various workloads and saw no regression. > > But I understand the sentiment on the added overhead on BufferAlloc. > > Regarding the case where the patch would potentially affect workloads > > that fit into RAM but not into shared buffers, could one of Andres' > > suggested idea/s above address that, in addition to this patch's > > possible shared invalidation fix? Could that settle the added overhead in > BufferAlloc() as a temporary fix? > > > > Yes, I think so, because as far as I can understand he is suggesting to do changes > only in the Truncate/Vacuum code path. Basically, I think you need to change > DropRelFileNodeBuffers or similar functions. > There shouldn't be any change in BufferAlloc or the common code path, so > there is no question of regression in such cases. I am not sure what you have in > mind for this, but feel free to clarify your understanding before implementation. > > > Thomas Munro is also working on caching relation sizes [1], maybe that > > way we could get the latest known relation size. Currently, it's > > possible only during recovery in smgrnblocks.
> > > > Tom Lane wrote: > > > But aside from that, I noted a number of things I didn't like a bit: > > > > > > * The amount of new shared memory this needs seems several orders of > > > magnitude higher than what I'd call acceptable: according to my > > > measurements it's over 10KB per shared buffer! Most of that is > > > going into the CachedBufTableLock data structure, which seems > > > fundamentally misdesigned --- how could we be needing a lock per map > > > partition *per buffer*? For comparison, the space used by > > > buf_table.c is about 56 bytes per shared buffer; I think this needs to stay at > least within hailing distance of there. > > > > > > * It is fairly suspicious that the new data structure is manipulated > > > while holding per-partition locks for the existing buffer hashtable. > > > At best that seems bad for concurrency, and at worst it could result > > > in deadlocks, because I doubt we can assume that the new hash table > > > has partition boundaries identical to the old one. > > > > > > * More generally, it seems like really poor design that this has > > > been written completely independently of the existing buffer hash table. > > > Can't we get any benefit by merging them somehow? > > > > The original aim is to just shorten the recovery process, and > > eventually the speedup of both the vacuum and truncate processes is just an added > bonus. > > Given that we don't have a shared invalidation mechanism in place yet > > like radix tree buffer mapping which is complex, I thought a patch > > like mine could be an alternative approach to that. So I want to improve the > patch further. > > I hope you can help me clarify the direction, so that I can avoid > > going farther away from what the community wants. > > 1. Both normal operations and recovery process 2. Improve recovery > > process only > > > > I feel Andres's suggestion will help in both cases. > > > > I wonder if you have considered the case of a local hash (maintained only during > recovery)? > > > If there is after-crash recovery, then there will be no concurrent > > > access to shared buffers and this hash will be up-to-date. > > > in the case of a hot-standby replica we can use some simple invalidation > > > (just one flag or counter which indicates that buffer cache was updated). > > > This hash also can be constructed on demand when > > > DropRelFileNodeBuffers is called the first time (so we have to scan all > > > buffers once, but subsequent drop operations will be fast). > > > > I'm examining this, but I am not sure if I got the correct > > understanding. Please correct me if I'm wrong. > > I think above is a suggestion wherein the postgres startup process > > uses a local hash table to keep track of the buffers of relations. Since > > there may be other read-only sessions which read from disk, evict cached > blocks, and modify the shared_buffers, the flag would be updated. > > We could do it during recovery, then release it as recovery completes. > > > > I haven't looked deeply yet into the source code but maybe we can > > modify the REDO (main redo do-while loop) in StartupXLOG() once the > read-only connections are consistent. > > It would also be beneficial to construct this local hash when > > DropRelFileNodeBuffers() is called for the first time, so the whole > > shared buffers is scanned initially, then as you mentioned subsequent > > dropping will be fast. (similar behavior to what the patch does) > > > > Do you think this is feasible to be implemented? Or should we explore another > approach?
> > > > I think we should try what Andres is suggesting, as that seems like a promising > idea and can address most of the common problems in this area, but if you feel > otherwise, then do let us know. > > -- > With Regards, > Amit Kapila. Hi, thank you for the review. I just wanted to confirm so that I can hopefully cover it in the patch revision. Basically, we don't want the added overhead in BufferAlloc(), so I'll just make a way to get both the last known relation size and nblocks, and modify the operations for dropping of relation buffers, based on the comments and suggestions of the reviewers. Hopefully I can also provide performance test results by next CF. Regards, Kirk Jamison
On Fri, Aug 07, 2020 at 10:08:23AM +0300, Konstantin Knizhnik wrote: > > >On 07.08.2020 00:33, Tomas Vondra wrote: >> >>Unfortunately Konstantin did not share any details about what workloads >>he tested, what config etc. But I find the "no regression" hypothesis >>rather hard to believe, because we're adding non-trivial amount of code >>to a place that can be quite hot. > >Sorry that I have not explained my test scenarios. >Since Postgres is a pgbench-oriented database :) I have also used pgbench: >the read-only case and skip-some updates. >For this patch the most critical factor is the number of buffer allocations, >so I used a small enough database (scale=100), but shared buffers was set >to 1GB. >As a result, all data is cached in memory (in the file system cache), but >there is intensive swapping at the Postgres buffer manager level. >I have tested it both with a relatively small (100) and a large (1000) >number of clients. > >I repeated these tests on my notebook (quadcore, 16GB RAM, SSD) and an IBM >Power2 server with about 380 virtual cores and about 1TB of memory. >In the last case the results vary very much (I think because of the NUMA >architecture), but I failed to find any noticeable regression of the >patched version. > IMO using such high numbers of clients is pointless - it's perfectly fine to test just a single client, and the 'basic overhead' should be visible. It might have some impact on concurrency, but I think that's just a secondary effect. In fact, I wouldn't be surprised if high client counts made it harder to observe the overhead, due to concurrency problems (I doubt you have a laptop with this many cores). Another thing you might try doing is using taskset to attach processes to particular CPU cores, and also make sure there's no undesirable influence from CPU power management etc. Laptops are very problematic in this regard, but even servers can have that enabled in BIOS. > >But I have to agree that adding a parallel hash (in addition to the existing >buffer manager hash) is not so good an idea. >This cache really quite frequently becomes a bottleneck. >My explanation of why I have not observed any noticeable regression >is that this patch uses almost the same lock partitioning schema >as is already used, so it adds not so many new conflicts. Maybe in the case of >the Power2 server the overhead of NUMA is much higher than other factors >(although a shared hash is one of the main things suffering from the NUMA >architecture). >But in principle I agree that having two independent caches may >decrease speed up to two times (or even more). > >I hope that everybody will agree that this problem is really critical. >It is certainly not the most common case that there are hundreds of >relations which are frequently truncated. But having quadratic >complexity in a drop function is not acceptable from my point of view. >And it is not only a recovery-specific problem, which is why the solution >with a local cache is not enough. > Well, ultimately it's a balancing act - we need to consider the risk of regressions vs. how common the improved scenario is. I've seen multiple applications that e.g. drop many relations (after all, that's why I optimized that in 9.3) so it's not an entirely bogus case. >I do not know a good solution to the problem. Just some thoughts. >- We can somehow combine the locking used for the main buffer manager cache >(by relid/blockno) and the cache for relid. It would eliminate the double >locking overhead. >- We can use something like a sorted tree (like std::map) instead of >a hash - it would allow locating blocks both by relid/blockno and by >relid only.
> I don't know. I think the ultimate problem here is that we're adding code to a fairly hot codepath - it does not matter whether it's a hash, a list, a std::map or something else, I think. All of that has overhead. That's the beauty of Andres' proposal to just loop over the blocks of the relation and evict them one by one - that adds absolutely nothing to BufferAlloc. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Aug 7, 2020 at 12:03 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Yeah, there is no room for "good enough" here. If a dirty buffer remains > in the system, the checkpointer will eventually try to flush it, and fail > (because there's no file to write it to), and then checkpointing will be > stuck. So we cannot afford to risk missing any buffers. This comment suggests another possible approach to the problem, which is to just make a note someplace in shared memory when we drop a relation. If we later find any of its buffers, we drop them without writing them out. This is not altogether simple, because (1) we don't have infinite room in shared memory to accumulate such notes and (2) it's not impossible for the OID counter to wrap around and permit the creation of a new relation with the same OID, which would be a problem if the previous note is still around. But this might be solvable. Suppose we create a shared hash table keyed by <dboid, reloid> with room for 1 entry per 1000 shared buffers. When you drop a relation, you insert into the hash table. Periodically you "clean" the hash table by marking all the entries, scanning shared buffers to remove any matches, and then deleting all the marked entries. This should be done periodically in the background, but if you try to drop a relation and find the hash table full, or you try to create a relation and find the OID of your new relation in the hash table, then you have to clean synchronously. Right now, the cost of dropping K relations with N shared buffers is O(KN). But with this approach, you only have to incur the O(N) overhead of scanning shared_buffers when the hash table fills up, and the hash table size is proportional to N, so the amortized complexity is O(K); that is, dropping relations takes time proportional to the number of relations being dropped, but NOT proportional to the size of shared_buffers, because as shared_buffers grows the hash table gets proportionally bigger, so that scans don't need to be done as frequently. Andres's approach (retail hash table lookups just for blocks less than the relation size, rather than a full scan) is going to help most with small relations, whereas this approach helps with relations of any size, but if you're trying to drop a lot of relations, they're probably small, and if they are large, scanning shared buffers may not be the dominant cost, since unlinking the files also takes time. Also, this approach might turn out to slow down buffer eviction too much. That could maybe be mitigated by having some kind of cheap fast-path that gets used when the hash table is empty (like an atomic flag that indicates whether a hash table probe is needed), and then trying hard to keep it empty most of the time (e.g. by aggressive background cleaning, or by ruling that after some number of hash table lookups the next process to evict a buffer is forced to perform a cleanup). But you'd probably have to try it to see how well you can do. It's also possible to combine the two approaches. Small relations could use Andres's approach while larger ones could use this approach; or you could insert both large and small relations into this hash table but use different strategies for cleaning out shared_buffers depending on the relation size (which could also be preserved in the hash table). Just brainstorming here... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
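A rough sketch of the bookkeeping outlined above; every name here is hypothetical (nothing like it exists in the tree), and locking, the background cleaner, and OID-wraparound handling are elided:

/* One note per recently dropped relation (hypothetical structure). */
typedef struct DroppedRelEntry
{
    Oid         dbOid;          /* database containing the dropped relation */
    Oid         relOid;         /* OID of the dropped relation */
    bool        marked;         /* set when a cleaning cycle begins */
} DroppedRelEntry;

/* Sizing per the message above: one entry per 1000 shared buffers, so the
 * table grows with shared_buffers and synchronous cleans stay rare. */
#define NUM_DROPPED_REL_ENTRIES     (NBuffers / 1000)

/*
 * A cleaning cycle (background, or synchronous when the table is full or a
 * new relation reuses a noted OID) would:
 *   1. mark all current entries;
 *   2. scan shared_buffers once, dropping - without writing out - any
 *      buffer whose <dbOid, relOid> matches a marked entry;
 *   3. delete the marked entries.
 * One O(NBuffers) scan retires up to NUM_DROPPED_REL_ENTRIES drops, giving
 * the amortized O(K) cost for K drops described above.
 */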
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Aug 7, 2020 at 12:03 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Yeah, there is no room for "good enough" here. If a dirty buffer remains >> in the system, the checkpointer will eventually try to flush it, and fail >> (because there's no file to write it to), and then checkpointing will be >> stuck. So we cannot afford to risk missing any buffers. > This comment suggests another possible approach to the problem, which > is to just make a note someplace in shared memory when we drop a > relation. If we later find any of its buffers, we drop them without > writing them out. This is not altogether simple, because (1) we don't > have infinite room in shared memory to accumulate such notes and (2) > it's not impossible for the OID counter to wrap around and permit the > creation of a new relation with the same OID, which would be a problem > if the previous note is still around. Interesting idea indeed. As for (1), maybe we don't need to keep the info in shmem. I'll just point out that the checkpointer has *already got* a complete list of all recently-dropped relations, cf pendingUnlinks in sync.c. So you could imagine looking aside at that to discover that a dirty buffer belongs to a recently-dropped relation. pendingUnlinks would need to be converted to a hashtable to make searches cheap, and it's not very clear what to do in backends that haven't got access to that table, but maybe we could just accept that backends that are forced to flush dirty buffers might do some useless writes in such cases. As for (2), the reason why we have that list is that the physical unlink doesn't happen till after the next checkpoint. So with some messing around here, we could probably guarantee that every buffer belonging to the relation has been scanned and deleted before the file unlink happens --- and then, even if the OID counter has wrapped around, the OID won't be reassigned to a new relation before that happens. In short, it seems like maybe we could shove the responsibility for cleaning up dropped relations' buffers onto the checkpointer without too much added cost. A possible problem with this is that recycling of those buffers will happen much more slowly than it does today, but maybe that's okay? regards, tom lane
On Fri, Aug 7, 2020 at 12:09 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > As for (1), maybe we don't need to keep the info in shmem. I'll just > point out that the checkpointer has *already got* a complete list of all > recently-dropped relations, cf pendingUnlinks in sync.c. So you could > imagine looking aside at that to discover that a dirty buffer belongs to a > recently-dropped relation. pendingUnlinks would need to be converted to a > hashtable to make searches cheap, and it's not very clear what to do in > backends that haven't got access to that table, but maybe we could just > accept that backends that are forced to flush dirty buffers might do some > useless writes in such cases. I don't see how that can work. It's not that the writes are useless; it's that they will fail outright because the file doesn't exist. > As for (2), the reason why we have that list is that the physical unlink > doesn't happen till after the next checkpoint. So with some messing > around here, we could probably guarantee that every buffer belonging > to the relation has been scanned and deleted before the file unlink > happens --- and then, even if the OID counter has wrapped around, the > OID won't be reassigned to a new relation before that happens. This seems right to me, though. > In short, it seems like maybe we could shove the responsibility for > cleaning up dropped relations' buffers onto the checkpointer without > too much added cost. A possible problem with this is that recycling > of those buffers will happen much more slowly than it does today, > but maybe that's okay? I suspect it's going to be advantageous to try to make cleaning up dropped buffers quick in normal cases and allow it to fall behind only when someone is dropping a lot of relations in quick succession, so that buffer eviction remains cheap in normal cases. I hadn't thought about the possible negative performance consequences of failing to put buffers on the free list, but that's another reason to try to make it fast. My viewpoint on this is - I have yet to see anybody really get hosed because they drop one relation and that causes a full scan of shared_buffers. I mean, it's slightly expensive, but computers are fast. Whatever. What hoses people is dropping a lot of relations in quick succession, either by spamming DROP TABLE commands or by running something like DROP SCHEMA, and then suddenly they're scanning shared_buffers over and over again, and their standby is doing the same thing, and now it hurts. The problem on the standby is actually worse than the problem on the primary, because the primary can do other things while one process sits there and thinks about shared_buffers for a long time, but the standby can't, because the startup process is all there is. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Aug 7, 2020 at 12:09 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> ... it's not very clear what to do in >> backends that haven't got access to that table, but maybe we could just >> accept that backends that are forced to flush dirty buffers might do some >> useless writes in such cases. > I don't see how that can work. It's not that the writes are useless; > it's that they will fail outright because the file doesn't exist. At least in the case of segment zero, the file will still exist. It'll have been truncated to zero length, and if the filesystem is stupid about holes in files then maybe a write to a high block number would consume excessive disk space, but does anyone still care about such filesystems? I don't remember at the moment how we handle higher segments, but likely we could make them still exist too, postponing all the unlinks till after checkpoint. Or we could just have the backends give up on recycling a particular buffer if they can't write it (which is the response to an I/O failure already, I hope). > My viewpoint on this is - I have yet to see anybody really get hosed > because they drop one relation and that causes a full scan of > shared_buffers. I mean, it's slightly expensive, but computers are > fast. Whatever. What hoses people is dropping a lot of relations in > quick succession, either by spamming DROP TABLE commands or by running > something like DROP SCHEMA, and then suddenly they're scanning > shared_buffers over and over again, and their standby is doing the > same thing, and now it hurts. Yeah, trying to amortize the cost across multiple drops seems like what we really want here. I'm starting to think about a "relation dropper" background process, which would be somewhat like the checkpointer but it wouldn't have any interest in actually doing buffer I/O. We'd send relation drop commands to it, and it would scan all of shared buffers and flush related buffers, and then finally do the file truncates or unlinks. Amortization would happen by considering multiple target relations during any one scan over shared buffers. I'm not very clear on how this would relate to the checkpointer's handling of relation drops, but it could be worked out; if we were lucky maybe the checkpointer could stop worrying about that. regards, tom lane
On Fri, Aug 7, 2020 at 12:52 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > At least in the case of segment zero, the file will still exist. It'll > have been truncated to zero length, and if the filesystem is stupid about > holes in files then maybe a write to a high block number would consume > excessive disk space, but does anyone still care about such filesystems? > I don't remember at the moment how we handle higher segments, but likely > we could make them still exist too, postponing all the unlinks till after > checkpoint. Or we could just have the backends give up on recycling a > particular buffer if they can't write it (which is the response to an I/O > failure already, I hope). None of this sounds very appealing. Postponing the unlinks means postponing recovery of the space at the OS level, which I think will be noticeable and undesirable for users. The other notions all seem to involve treating as valid on-disk states we currently treat as invalid, and our sanity checks in this area are already far too weak. And all you're buying for it is putting a hash table that would otherwise be shared memory into backend-private memory, which seems like quite a minor gain. Having that information visible to everybody seems a lot cleaner. > Yeah, trying to amortize the cost across multiple drops seems like > what we really want here. I'm starting to think about a "relation > dropper" background process, which would be somewhat like the checkpointer > but it wouldn't have any interest in actually doing buffer I/O. > We'd send relation drop commands to it, and it would scan all of shared > buffers and flush related buffers, and then finally do the file truncates > or unlinks. Amortization would happen by considering multiple target > relations during any one scan over shared buffers. I'm not very clear > on how this would relate to the checkpointer's handling of relation > drops, but it could be worked out; if we were lucky maybe the checkpointer > could stop worrying about that. I considered that, too, but it might be overkill. I think that one scan of shared_buffers every now and then might be cheap enough that we could just not worry too much about which process gets stuck doing it. So for example if the number of buffers allocated since the hash table ended up non-empty reaches NBuffers, the process wanting to do the next eviction gets handed the job of cleaning it out. Or maybe the background writer could help; it's not like it does much anyway, zing. It's possible that a dedicated process is the right solution, but we might want to at least poke a bit at other alternatives. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
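One concrete reading of the fast-path idea above, as a sketch only: pg_atomic_read_u32() is the real atomics API, but the shared control struct and the probe function are invented names.

/* In the buffer eviction path: skip the dropped-relations probe entirely
 * while the (hypothetical) note table is empty, keeping eviction cheap in
 * the common case. */
if (pg_atomic_read_u32(&DroppedRelShmem->num_entries) != 0)
    DroppedRelDropIfMatch(bufHdr);      /* hypothetical probe-and-drop */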
On Fri, Aug 7, 2020 at 9:33 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Amit Kapila <amit.kapila16@gmail.com> writes: > > On Sat, Aug 1, 2020 at 1:53 AM Andres Freund <andres@anarazel.de> wrote: > >> We could also just use pg_class.relpages. It'll probably mostly be > >> accurate enough? > > > Don't we need the accurate 'number of blocks' if we want to invalidate > > all the buffers? Basically, I think we need to perform BufTableLookup > > for all the blocks in the relation and then Invalidate all buffers. > > Yeah, there is no room for "good enough" here. If a dirty buffer remains > in the system, the checkpointer will eventually try to flush it, and fail > (because there's no file to write it to), and then checkpointing will be > stuck. So we cannot afford to risk missing any buffers. > Right, this reminds me of the discussion we had last time on this topic where we decided that we can't even rely on using smgrnblocks to find the exact number of blocks because lseek might lie about the EOF position [1]. So, we anyway need some mechanism to push the information related to the "to be truncated or dropped relations" to the background worker (checkpointer and/or others) to avoid flush issues. But, maybe it is better to push the responsibility of invalidating the buffers for truncated/dropped relations to the background process. However, I feel that for cases where the relation size is greater than the number of shared buffers, there might not be much benefit in pushing this operation to the background, unless there are already a few other relation entries (for dropped relations) so that the cost of scanning the buffers can be amortized. [1] - https://www.postgresql.org/message-id/16664.1435414204%40sss.pgh.pa.us -- With Regards, Amit Kapila.
On Fri, Aug 7, 2020 at 11:03 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Aug 7, 2020 at 12:52 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > At least in the case of segment zero, the file will still exist. It'll > > have been truncated to zero length, and if the filesystem is stupid about > > holes in files then maybe a write to a high block number would consume > > excessive disk space, but does anyone still care about such filesystems? > > I don't remember at the moment how we handle higher segments, > > We do unlink them and register the request to forget the fsync requests for those. See mdunlinkfork. > > but likely > > we could make them still exist too, postponing all the unlinks till after > > checkpoint. Or we could just have the backends give up on recycling a > > particular buffer if they can't write it (which is the response to an I/O > > failure already, I hope). > > Note that we don't often try to flush the buffers from the backend. We first try to forward the request to the checkpointer's queue, and only if the queue is full does the backend try to flush it, so even if we decide to give up flushing such a buffer (where we get an error) via the backend, it shouldn't impact very many cases. I am not sure, but if we can somehow reliably distinguish this type of error from any other I/O failure, then we can probably give up on flushing this buffer and continue, or maybe just retry pushing this request to the checkpointer. > > None of this sounds very appealing. Postponing the unlinks means > postponing recovery of the space at the OS level, which I think will > be noticeable and undesirable for users. The other notions all seem to > involve treating as valid on-disk states we currently treat as > invalid, and our sanity checks in this area are already far too weak. > And all you're buying for it is putting a hash table that would > otherwise be shared memory into backend-private memory, which seems > like quite a minor gain. Having that information visible to everybody > seems a lot cleaner. > One more benefit of giving this responsibility to a single process like the checkpointer is that we can avoid unlinking the relation until we scan all the buffers corresponding to it. Now, surely keeping it in shared memory and allowing other processes to work on it has its own merit, namely that such buffers might get invalidated faster, but I am not sure we can then retain the benefit of the other approach, which is to perform all such buffer invalidation before unlinking the relation's first segment. -- With Regards, Amit Kapila.
On Fri, Aug 7, 2020 at 9:33 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Amit Kapila <amit.kapila16@gmail.com> writes: > > On Sat, Aug 1, 2020 at 1:53 AM Andres Freund <andres@anarazel.de> wrote: > >> We could also just use pg_class.relpages. It'll probably mostly be > >> accurate enough? > > > Don't we need the accurate 'number of blocks' if we want to invalidate > > all the buffers? Basically, I think we need to perform BufTableLookup > > for all the blocks in the relation and then Invalidate all buffers. > > Yeah, there is no room for "good enough" here. If a dirty buffer remains > in the system, the checkpointer will eventually try to flush it, and fail > (because there's no file to write it to), and then checkpointing will be > stuck. So we cannot afford to risk missing any buffers. > Today, again thinking about this point it occurred to me that during recovery we can reliably find the relation size and after Thomas's recent commit c5315f4f44 (Cache smgrnblocks() results in recovery), we might not need to even incur the cost of lseek. Why don't we fix this first for 'recovery' (by following something on the lines of what Andres suggested) and then later once we have a generic mechanism for "caching the relation size" [1], we can do it for non-recovery paths. I think that will at least address the reported use case with some minimal changes. [1] - https://www.postgresql.org/message-id/CAEepm%3D3SSw-Ty1DFcK%3D1rU-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com -- With Regards, Amit Kapila.
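To make the suggested recovery-only scoping concrete, a hedged sketch of the gating inside DropRelFileNodeBuffers, relying on the smgr_cached_nblocks field that commit c5315f4f44 maintains during recovery; the threshold macro anticipates a name used later in this thread:

/* Use the optimized path only when the cached size is trustworthy,
 * i.e. during recovery, and the fork is small enough. */
if (InRecovery &&
    reln->smgr_cached_nblocks[forkNum] != InvalidBlockNumber &&
    reln->smgr_cached_nblocks[forkNum] < BUF_DROP_FULLSCAN_THRESHOLD)
{
    /* probe the buffer mapping table block by block, as sketched earlier */
}
else
{
    /* fall back to today's full scan of shared buffers */
}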
On Tuesday, August 18, 2020 3:05 PM (GMT+9), Amit Kapila wrote: > On Fri, Aug 7, 2020 at 9:33 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > Amit Kapila <amit.kapila16@gmail.com> writes: > > > On Sat, Aug 1, 2020 at 1:53 AM Andres Freund <andres@anarazel.de> > wrote: > > >> We could also just use pg_class.relpages. It'll probably mostly be > > >> accurate enough? > > > > > Don't we need the accurate 'number of blocks' if we want to > > > invalidate all the buffers? Basically, I think we need to perform > > > BufTableLookup for all the blocks in the relation and then Invalidate all > buffers. > > > > Yeah, there is no room for "good enough" here. If a dirty buffer > > remains in the system, the checkpointer will eventually try to flush > > it, and fail (because there's no file to write it to), and then > > checkpointing will be stuck. So we cannot afford to risk missing any > buffers. > > > > Today, again thinking about this point it occurred to me that during recovery > we can reliably find the relation size and after Thomas's recent commit > c5315f4f44 (Cache smgrnblocks() results in recovery), we might not need to > even incur the cost of lseek. Why don't we fix this first for 'recovery' (by > following something on the lines of what Andres suggested) and then later > once we have a generic mechanism for "caching the relation size" [1], we can > do it for non-recovery paths. > I think that will at least address the reported use case with some minimal > changes. > > [1] - > https://www.postgresql.org/message-id/CAEepm%3D3SSw-Ty1DFcK%3D1r > U-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com > Attached is an updated V9 version with minimal code changes only, which avoids the previous overhead in BufferAlloc. This time, I only updated the recovery path as suggested by Amit, and followed Andres' suggestion of referring to the cached blocks in smgrnblocks. The layering is kinda tricky so the logic may be wrong. But as of now, it passes the regression tests. I'll follow up with the performance results. It seems there's a regression for smaller shared_buffers. I'll update if I find bugs. But I'd also appreciate your reviews in case I missed something. Regards, Kirk Jamison
Hello. At Tue, 1 Sep 2020 13:02:28 +0000, "k.jamison@fujitsu.com" <k.jamison@fujitsu.com> wrote in > On Tuesday, August 18, 2020 3:05 PM (GMT+9), Amit Kapila wrote: > > Today, again thinking about this point it occurred to me that during recovery > > we can reliably find the relation size and after Thomas's recent commit > > c5315f4f44 (Cache smgrnblocks() results in recovery), we might not need to > > even incur the cost of lseek. Why don't we fix this first for 'recovery' (by > > following something on the lines of what Andres suggested) and then later > > once we have a generic mechanism for "caching the relation size" [1], we can > > do it for non-recovery paths. > > I think that will at least address the reported use case with some minimal > > changes. > > > > [1] - > > https://www.postgresql.org/message-id/CAEepm%3D3SSw-Ty1DFcK%3D1r > > U-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com Isn't a relation always locked access-exclusively, at truncation time? If so, isn't even the result of lseek reliable enough? And if we don't care about the cost of lseek, we can do the same optimization also for non-recovery paths. Since we perform the actual file truncation just after anyway, I think the cost of lseek is negligible here. > Attached is an updated V9 version with minimal code changes only, which > avoids the previous overhead in BufferAlloc. This time, I only updated > the recovery path as suggested by Amit, and followed Andres' suggestion > of referring to the cached blocks in smgrnblocks. > The layering is kinda tricky so the logic may be wrong. But as of now, > it passes the regression tests. I'll follow up with the performance results. > It seems there's a regression for smaller shared_buffers. I'll update if I find bugs. > But I'd also appreciate your reviews in case I missed something. BUF_DROP_THRESHOLD seems to be misused. IIUC it defines the maximum number of file pages for which we do a relation-targeted search for buffers. Otherwise we scan through all buffers. On the other hand, the latest patch just leaves all buffers for relation forks longer than the threshold untouched. I think we should determine whether to do a targeted scan or a full scan based on the ratio of the (expected maximum) total number of pages for all (specified) forks in a relation against the total number of buffers. By the way > #define BUF_DROP_THRESHOLD 500 /* NBuffers divided by 2 */ NBuffers is not a constant. Even if we wanted to set the macro as described in the comment, we should have used (NBuffers/2) instead of "500". But I suppose you might have wanted to use (NBuffers / 500) as Tom suggested upthread. And the name of the macro seems too generic. I think more specific names like BUF_DROP_FULLSCAN_THRESHOLD would be better. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
I'd like to make a subtle correction. At Wed, 02 Sep 2020 10:31:22 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > By the way > > > #define BUF_DROP_THRESHOLD 500 /* NBuffers divided by 2 */ > > NBuffers is not a constant. Even if we wanted to set the macro as > described in the comment, we should have used (NBuffers/2) instead of > "500". But I suppose you might have wanted to use (NBuffers / 500) as Tom > suggested upthread. And the name of the macro seems too generic. I It was Andres, not Tom, who made the suggestion. Sorry for the mistake. > think more specific names like BUF_DROP_FULLSCAN_THRESHOLD would be > better. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
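Taken together, the two review points might look like the following sketch; both the macro's value and the decision rule are open questions at this point, not settled code:

/* A fraction of NBuffers rather than a constant (value to be tuned). */
#define BUF_DROP_FULLSCAN_THRESHOLD ((uint32) (NBuffers / 500))

/* Decide per relation, from the total pages of all specified forks. */
BlockNumber nTotalBlocks = 0;
for (i = 0; i < nforks; i++)
    nTotalBlocks += smgrnblocks(reln, forkNum[i]);
if (nTotalBlocks < BUF_DROP_FULLSCAN_THRESHOLD)
    /* targeted, per-block invalidation */ ;
else
    /* full scan of shared buffers */ ;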
On Wed, Sep 2, 2020 at 7:01 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > Hello. > > At Tue, 1 Sep 2020 13:02:28 +0000, "k.jamison@fujitsu.com" <k.jamison@fujitsu.com> wrote in > > On Tuesday, August 18, 2020 3:05 PM (GMT+9), Amit Kapila wrote: > > > Today, again thinking about this point it occurred to me that during recovery > > > we can reliably find the relation size and after Thomas's recent commit > > > c5315f4f44 (Cache smgrnblocks() results in recovery), we might not need to > > > even incur the cost of lseek. Why don't we fix this first for 'recovery' (by > > > following something on the lines of what Andres suggested) and then later > > > once we have a generic mechanism for "caching the relation size" [1], we can > > > do it for non-recovery paths. > > > I think that will at least address the reported use case with some minimal > > > changes. > > > > > > [1] - > > > https://www.postgresql.org/message-id/CAEepm%3D3SSw-Ty1DFcK%3D1r > > > U-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com > > Isn't a relation always locked access-exclusively, at truncation > time? If so, isn't even the result of lseek reliable enough? > Even if the relation is locked, background processes like checkpointer can still touch the relation which might cause problems. Consider a case where we extend the relation but didn't flush the newly added pages. Now during truncate operation, checkpointer can still flush those pages which can cause trouble for truncate. But, I think in the recovery path such cases won't cause a problem. -- With Regards, Amit Kapila.
Amit Kapila <amit.kapila16@gmail.com> writes: > Even if the relation is locked, background processes like checkpointer > can still touch the relation which might cause problems. Consider a > case where we extend the relation but didn't flush the newly added > pages. Now during truncate operation, checkpointer can still flush > those pages which can cause trouble for truncate. But, I think in the > recovery path such cases won't cause a problem. I wouldn't count on that staying true ... https://www.postgresql.org/message-id/CA+hUKGJ8NRsqgkZEnsnRc2MFROBV-jCnacbYvtpptK2A9YYp9Q@mail.gmail.com regards, tom lane
On Wednesday, September 2, 2020 10:31 AM, Kyotaro Horiguchi wrote: > Hello. > > At Tue, 1 Sep 2020 13:02:28 +0000, "k.jamison@fujitsu.com" > <k.jamison@fujitsu.com> wrote in > > On Tuesday, August 18, 2020 3:05 PM (GMT+9), Amit Kapila wrote: > > > Today, again thinking about this point it occurred to me that during > > > recovery we can reliably find the relation size and after Thomas's > > > recent commit > > > c5315f4f44 (Cache smgrnblocks() results in recovery), we might not > > > need to even incur the cost of lseek. Why don't we fix this first > > > for 'recovery' (by following something on the lines of what Andres > > > suggested) and then later once we have a generic mechanism for > > > "caching the relation size" [1], we can do it for non-recovery paths. > > > I think that will at least address the reported use case with some > > > minimal changes. > > > > > > [1] - > > > > https://www.postgresql.org/message-id/CAEepm%3D3SSw-Ty1DFcK%3D1r > > > U-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com > > Isn't a relation always locked access-exclusively, at truncation time? If so, > isn't even the result of lseek reliable enough? And if we don't care about the cost of > lseek, we can do the same optimization also for non-recovery paths. Since > we perform the actual file truncation just after anyway, I think the cost of lseek > is negligible here. The reason for that is that when I read the comment on smgrnblocks in smgr.c, I thought smgrnblocks can only be reliably used during recovery to ensure that we have the correct size. Please correct me if my understanding is wrong, and I'll fix the patch. * For now, we only use cached values in recovery due to lack of a shared * invalidation mechanism for changes in file size. */ if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber) return reln->smgr_cached_nblocks[forknum]; > > Attached is an updated V9 version with minimal code changes only, which > > avoids the previous overhead in BufferAlloc. This time, I only > > updated the recovery path as suggested by Amit, and followed Andres' > > suggestion of referring to the cached blocks in smgrnblocks. > > The layering is kinda tricky so the logic may be wrong. But as of now, > > it passes the regression tests. I'll follow up with the performance results. > > It seems there's a regression for smaller shared_buffers. I'll update if I find > bugs. > > But I'd also appreciate your reviews in case I missed something. > > BUF_DROP_THRESHOLD seems to be misused. IIUC it defines the maximum > number of file pages for which we do a relation-targeted search for buffers. > Otherwise we scan through all buffers. On the other hand, the latest patch just > leaves all buffers for relation forks longer than the threshold untouched. Right, I missed the condition for that part. Fixed in the latest one. > I think we should determine whether to do a targeted scan or a full scan based > on the ratio of the (expected maximum) total number of pages for all (specified) > forks in a relation against the total number of buffers. > By the way > > > #define BUF_DROP_THRESHOLD 500 /* NBuffers divided > by 2 */ > > NBuffers is not a constant. Even if we wanted to set the macro as described > in the comment, we should have used (NBuffers/2) instead of "500". But I > suppose you might have wanted to use (NBuffers / 500) as Tom suggested > upthread. And the name of the macro seems too generic. I think more > specific names like BUF_DROP_FULLSCAN_THRESHOLD would be better. Fixed. Thank you for the review!
Attached is v10 of the patch. Best regards, Kirk Jamison
On Wed, Sep 2, 2020 at 9:17 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Amit Kapila <amit.kapila16@gmail.com> writes: > > Even if the relation is locked, background processes like checkpointer > > can still touch the relation which might cause problems. Consider a > > case where we extend the relation but didn't flush the newly added > > pages. Now during truncate operation, checkpointer can still flush > > those pages which can cause trouble for truncate. But, I think in the > > recovery path such cases won't cause a problem. > > I wouldn't count on that staying true ... > > https://www.postgresql.org/message-id/CA+hUKGJ8NRsqgkZEnsnRc2MFROBV-jCnacbYvtpptK2A9YYp9Q@mail.gmail.com > I don't think that proposal will matter after commit c5315f4f44 because we are caching the size/blocks for recovery while doing extend (smgrextend). In the above scenario, we would have cached the blocks which will be used at a later point in time. -- With Regards, Amit Kapila.
On Wednesday, September 2, 2020 5:49 PM, Amit Kapila wrote: > On Wed, Sep 2, 2020 at 9:17 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > Amit Kapila <amit.kapila16@gmail.com> writes: > > > Even if the relation is locked, background processes like > > > checkpointer can still touch the relation which might cause > > > problems. Consider a case where we extend the relation but didn't > > > flush the newly added pages. Now during truncate operation, > > > checkpointer can still flush those pages which can cause trouble for > > > truncate. But, I think in the recovery path such cases won't cause a > problem. > > > > I wouldn't count on that staying true ... > > > > > https://www.postgresql.org/message-id/CA+hUKGJ8NRsqgkZEnsnRc2MFR > OBV-jC > > nacbYvtpptK2A9YYp9Q@mail.gmail.com > > > > I don't think that proposal will matter after commit c5315f4f44 because we are > caching the size/blocks for recovery while doing extend (smgrextend). In the > above scenario, we would have cached the blocks which will be used at a later > point in time. > Hi, I'm guessing we can still pursue this idea of improving the recovery path first. I'm working on an updated patch version, because the CFBot's telling that postgres fails to build (one of the recovery TAP tests fails). I'm still working on refactoring my patch, but have yet to find a proper solution at the moment. So I'm going to continue my investigation. Attached is an updated WIP patch. I'd appreciate if you could take a look at the patch as well. In addition, attached are the regression logs for the failure and the other logs accompanying it: wal_optimize_node_minimal and wal_optimize_node_replica. The failure in my session was: t/018_wal_optimize.pl ................ 18/34 Bailout called. Further testing stopped: pg_ctl start failed FAILED--Further testing stopped: pg_ctl start failed Best regards, Kirk Jamison
On Mon, Sep 7, 2020 at 1:33 PM k.jamison@fujitsu.com <k.jamison@fujitsu.com> wrote: > > On Wednesday, September 2, 2020 5:49 PM, Amit Kapila wrote: > > On Wed, Sep 2, 2020 at 9:17 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > > > Amit Kapila <amit.kapila16@gmail.com> writes: > > > > Even if the relation is locked, background processes like > > > > checkpointer can still touch the relation which might cause > > > > problems. Consider a case where we extend the relation but didn't > > > > flush the newly added pages. Now during truncate operation, > > > > checkpointer can still flush those pages which can cause trouble for > > > > truncate. But, I think in the recovery path such cases won't cause a > > problem. > > > > > > I wouldn't count on that staying true ... > > > > > > > > https://www.postgresql.org/message-id/CA+hUKGJ8NRsqgkZEnsnRc2MFR > > OBV-jC > > > nacbYvtpptK2A9YYp9Q@mail.gmail.com > > > > > > > I don't think that proposal will matter after commit c5315f4f44 because we are > > caching the size/blocks for recovery while doing extend (smgrextend). In the > > above scenario, we would have cached the blocks which will be used at later > > point of time. > > > > I'm guessing we can still pursue this idea of improving the recovery path first. > I think so. > I'm working on an updated patch version, because the CFBot's telling > that postgres fails to build (one of the recovery TAP tests fails). > I'm still working on refactoring my patch, but have yet to find a proper solution at the moment. > So I'm going to continue my investigation. > > Attached is an updated WIP patch. > I'd appreciate if you could take a look at the patch as well. > So, I see the below log as one of the problems: 2020-09-07 06:20:33.918 UTC [10914] LOG: redo starts at 0/15FFEC0 2020-09-07 06:20:33.919 UTC [10914] FATAL: unexpected data beyond EOF in block 1 of relation base/13743/24581 This indicates that we missed invalidating some buffer which should have been invalidated. If you are able to reproduce this locally then I suggest to first write a simple patch without the check of the threshold, basically in recovery always try to use the new way to invalidate the buffer. That will reduce the scope of the code that can create a problem. Let us know if the problem still exists and share the logs. BTW, I think I see one problem in the code: if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) && + bufHdr->tag.forkNum == forkNum[j] && + bufHdr->tag.blockNum >= firstDelBlock[j]) Here, I think you need to use 'i' not 'j' for forkNum and firstDelBlock as those are arrays w.r.t forks. That might fix the problem but I am not sure as I haven't tried to reproduce it. -- With Regards, Amit Kapila.
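To make the indexing fix concrete, a sketch of the corrected loop structure, with the per-fork arrays indexed by the fork loop variable and descriptive names in the spirit of the advice that follows (surrounding locking and invalidation elided):

for (fork_num = 0; fork_num < nforks; fork_num++)
{
    /* the size must be taken per fork as well */
    BlockNumber nblocks = smgrnblocks(smgr_reln, forkNum[fork_num]);

    for (block_num = firstDelBlock[fork_num];
         block_num < nblocks;
         block_num++)
    {
        /* build the tag from this fork and the loop's current block */
        INIT_BUFFERTAG(tag, rnode.node, forkNum[fork_num], block_num);
        /* ... hash, look up, and invalidate as in the earlier sketch ... */
    }
}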
From: Amit Kapila <amit.kapila16@gmail.com> > if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) && > + bufHdr->tag.forkNum == forkNum[j] && > + bufHdr->tag.blockNum >= firstDelBlock[j]) > > Here, I think you need to use 'i' not 'j' for forkNum and > firstDelBlock as those are arrays w.r.t forks. That might fix the > problem but I am not sure as I haven't tried to reproduce it. (1) + INIT_BUFFERTAG(newTag, rnode.node, forkNum[j], firstDelBlock[j]); And you need to use i here, too. I advise you to suspect any character, any word, and any sentence. I've found many bugs for others so far. I'm afraid you're just seeing the code flow. (2) + LWLockAcquire(newPartitionLock, LW_SHARED); + buf_id = BufTableLookup(&newTag, newHash); + LWLockRelease(newPartitionLock); + + bufHdr = GetBufferDescriptor(buf_id); Check the result of BufTableLookup() and do nothing if the block is not in the shared buffers. (3) + else + { + for (j = BUF_DROP_FULLSCAN_THRESHOLD; j < NBuffers; j++) + { What's the meaning of this loop? I don't understand the start condition. Should j be initialized to 0? (4) +#define BUF_DROP_FULLSCAN_THRESHOLD (NBuffers / 2) Wasn't it 500 instead of 2? Anyway, I think we need to discuss this threshold later. (5) + if (((int)nblocks) < BUF_DROP_FULLSCAN_THRESHOLD) It's better to define BUF_DROP_FULLSCAN_THRESHOLD as a uint32 value instead of casting the type here, as these values are blocks. Regards Takayuki Tsunakawa
From: tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> > (1) > + INIT_BUFFERTAG(newTag, > rnode.node, forkNum[j], firstDelBlock[j]); > > And you need to use i here, too. I remember the books "Code Complete" and/or "Readable Code" suggest using meaningful loop variable names like fork_num and block_count, to prevent this type of mistake. Regards Takayuki Tsunakawa
On Tuesday, September 8, 2020 1:02 PM, Amit Kapila wrote: Hello, > On Mon, Sep 7, 2020 at 1:33 PM k.jamison@fujitsu.com > <k.jamison@fujitsu.com> wrote: > > > > On Wednesday, September 2, 2020 5:49 PM, Amit Kapila wrote: > > > On Wed, Sep 2, 2020 at 9:17 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > > > > > Amit Kapila <amit.kapila16@gmail.com> writes: > > > > > Even if the relation is locked, background processes like > > > > > checkpointer can still touch the relation which might cause > > > > > problems. Consider a case where we extend the relation but > > > > > didn't flush the newly added pages. Now during truncate > > > > > operation, checkpointer can still flush those pages which can > > > > > cause trouble for truncate. But, I think in the recovery path > > > > > such cases won't cause a > > > problem. > > > > > > > > I wouldn't count on that staying true ... > > > > > > > > > > > > https://www.postgresql.org/message-id/CA+hUKGJ8NRsqgkZEnsnRc2MFR > > > OBV-jC > > > > nacbYvtpptK2A9YYp9Q@mail.gmail.com > > > > > > > > > > I don't think that proposal will matter after commit c5315f4f44 > > > because we are caching the size/blocks for recovery while doing > > > extend (smgrextend). In the above scenario, we would have cached the > > > blocks which will be used at a later point in time. > > > > > > > I'm guessing we can still pursue this idea of improving the recovery path > first. > > > > I think so. Alright, so I've updated the patch which passes the regression and TAP tests. It compiles and builds as intended. > > I'm working on an updated patch version, because the CFBot's telling > > that postgres fails to build (one of the recovery TAP tests fails). > > I'm still working on refactoring my patch, but have yet to find a proper > solution at the moment. > > So I'm going to continue my investigation. > > > > Attached is an updated WIP patch. > > I'd appreciate if you could take a look at the patch as well. > > > > So, I see the below log as one of the problems: > 2020-09-07 06:20:33.918 UTC [10914] LOG: redo starts at 0/15FFEC0 > 2020-09-07 06:20:33.919 UTC [10914] FATAL: unexpected data beyond EOF > in block 1 of relation base/13743/24581 > > This indicates that we missed invalidating some buffer which should have > been invalidated. If you are able to reproduce this locally then I suggest to first > write a simple patch without the check of the threshold, basically in recovery > always try to use the new way to invalidate the buffer. That will reduce the > scope of the code that can create a problem. Let us know if the problem still > exists and share the logs. BTW, I think I see one problem in the code: > > if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) && > + bufHdr->tag.forkNum == forkNum[j] && > + bufHdr->tag.blockNum >= firstDelBlock[j]) > > Here, I think you need to use 'i' not 'j' for forkNum and > firstDelBlock as those are arrays w.r.t forks. That might fix the > problem but I am not sure as I haven't tried to reproduce it. Thanks for the advice. Right, that seems to be the cause of the error, and fixing that (using the fork index) solved the case. I also followed Tsunakawa-san's advice of using more meaningful iterator names instead of "i" & "j", for readability. I also added a new function, DropRelFileNodeBuffersOfFork, for when the relation fork is bigger than the threshold (nblocks > BUF_DROP_FULLSCAN_THRESHOLD). Perhaps there's a better name for that function.
However, as expected in the previous discussions, this is a bit slower than the standard buffer invalidation process, because the whole of shared buffers is scanned nforks times. Currently, I set the threshold to (NBuffers / 500).

Feedback on the patch/testing is very much welcome.

Best regards,
Kirk Jamison
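For reference, the cutoff described above could be spelled as a macro along these lines; treat this as a sketch of the intent, not necessarily the patch's literal definition:

    /*
     * Sketch: invalidate buffers block by block only when the number of
     * blocks is small relative to shared_buffers (NBuffers); otherwise
     * fall back to the usual full scan.  The divisor 500 is the value
     * quoted in the mail above.
     */
    #define BUF_DROP_FULLSCAN_THRESHOLD ((BlockNumber) (NBuffers / 500))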
Hi,

> > BTW, I think I see one problem in the code:
> >
> > if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> > + bufHdr->tag.forkNum == forkNum[j] &&
> > + bufHdr->tag.blockNum >= firstDelBlock[j])
> >
> > Here, I think you need to use 'i' not 'j' for forkNum and
> > firstDelBlock as those are arrays w.r.t forks. That might fix the
> > problem but I am not sure as I haven't tried to reproduce it.
>
> Thanks for the advice. Right, that seems to be the cause of the error, and
> fixing it (using the fork index) solved the case.
> I also followed Tsunakawa-san's advice of using more meaningful iterator
> names instead of "i" & "j", for readability.
>
> I also added a new function, DropRelFileNodeBuffersOfFork, used when the
> relation fork is bigger than the threshold, i.e. if (nblocks >
> BUF_DROP_FULLSCAN_THRESHOLD). Perhaps there's a better name for that function.
> However, as expected in the previous discussions, this is a bit slower than
> the standard buffer invalidation process, because the whole of shared
> buffers is scanned nforks times.
> Currently, I set the threshold to (NBuffers / 500).

I made a mistake in v12. I replaced firstDelBlock[fork_num] with firstDelBlock[block_num] in the for-loop code block of block_num, because we want to process the current block of the per-block loop. OTOH, I used firstDelBlock[fork_num] when the relation fork is bigger than the threshold, or if the cached blocks of small relations were already invalidated.

The logic could be either correct or wrong, so I'd appreciate feedback and comments/advice.

Regards,
Kirk Jamison
At Wed, 2 Sep 2020 08:18:06 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> On Wed, Sep 2, 2020 at 7:01 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > Isn't a relation always locked access-exclusively at truncation
> > time? If so, isn't even the result of lseek reliable enough?
> >
>
> Even if the relation is locked, background processes like checkpointer
> can still touch the relation which might cause problems. Consider a
> case where we extend the relation but didn't flush the newly added
> pages. Now during truncate operation, checkpointer can still flush
> those pages which can cause trouble for truncate. But, I think in the
> recovery path such cases won't cause a problem.

I reconsidered this and still have a doubt.

Does this mean lseek(SEEK_END) doesn't count blocks that are write(2)'ed (by smgrextend) but not yet flushed? (I don't think so, for clarity.) The nblocks cache is added just to reduce the number of lseek()s and is expected to always have the same value as what lseek() is expected to return.

The reason it is reliable only during recovery is that the cache is not shared, but the startup process is the only process that changes the relation size during recovery.

If any other process can extend the relation while smgrtruncate is running, the current DropRelFileNodeBuffers has the chance that a new buffer for the extended area is allocated at a buffer location the function has already passed by, which is a disaster.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
The code doesn't seem to be working correctly.

(1)
+ for (block_num = 0; block_num <= nblocks; block_num++)

should be

+ for (block_num = firstDelBlock[fork_num]; block_num < nblocks; block_num++)

because:
* You only want to invalidate blocks >= firstDelBlock[fork_num], don't you?
* The relation's block number ranges from 0 to nblocks - 1.

(2)
+ INIT_BUFFERTAG(newTag, rnode.node, forkNum[fork_num],
+ firstDelBlock[block_num]);

Replace firstDelBlock[block_num] with block_num, because you want to process the current block of the per-block loop. Your code accesses memory out of the bounds of the array, and doesn't invalidate any buffer.

(3)
+ if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
+ bufHdr->tag.forkNum == forkNum[fork_num] &&
+ bufHdr->tag.blockNum >= firstDelBlock[block_num])
+ InvalidateBuffer(bufHdr); /* releases spinlock */
+ else
+ UnlockBufHdr(bufHdr, buf_state);

Replace bufHdr->tag.blockNum >= firstDelBlock[block_num] with bufHdr->tag.blockNum == block_num, because you want to check if the found buffer is for the current block of the loop.

(4)
+ /*
+ * We've invalidated the nblocks already. Scan the shared buffers
+ * for each fork.
+ */
+ if (block_num > nblocks)
+ {
+ DropRelFileNodeBuffersOfFork(rnode.node, forkNum[fork_num],
+ firstDelBlock[fork_num]);
+ }

This part is unnecessary. This invalidates all buffers that (2) failed to process, which is why the regression test succeeds.

Regards
Takayuki Tsunakawa
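Putting corrections (1) through (3) together, the per-block body might read as the following sketch. It assumes the patch's surrounding variables (rnode, forkNum[], firstDelBlock[], nblocks) and the bufmgr.c internals the patch already uses (in particular, InvalidateBuffer() releases the header spinlock); it is an illustration, not the patch's final code:

    for (block_num = firstDelBlock[fork_num]; block_num < nblocks; block_num++)
    {
        BufferTag   newTag;             /* identity of the block to drop */
        uint32      newHash;            /* hash value for newTag */
        LWLock     *newPartitionLock;   /* buffer-mapping partition lock */
        int         buf_id;
        BufferDesc *bufHdr;
        uint32      buf_state;

        /* (2): the tag is built from the loop's block number */
        INIT_BUFFERTAG(newTag, rnode.node, forkNum[fork_num], block_num);

        newHash = BufTableHashCode(&newTag);
        newPartitionLock = BufMappingPartitionLock(newHash);

        LWLockAcquire(newPartitionLock, LW_SHARED);
        buf_id = BufTableLookup(&newTag, newHash);
        LWLockRelease(newPartitionLock);

        if (buf_id < 0)
            continue;                   /* block is not cached, nothing to do */

        bufHdr = GetBufferDescriptor(buf_id);
        buf_state = LockBufHdr(bufHdr);

        /* (3): recheck under the header lock that this is still our block */
        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
            bufHdr->tag.forkNum == forkNum[fork_num] &&
            bufHdr->tag.blockNum == block_num)
            InvalidateBuffer(bufHdr);   /* releases spinlock */
        else
            UnlockBufHdr(bufHdr, buf_state);
    }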
Thanks for the new version, Jamison-san.

At Tue, 15 Sep 2020 11:11:26 +0000, "k.jamison@fujitsu.com" <k.jamison@fujitsu.com> wrote in
> Hi,
>
> > BTW, I think I see one problem in the code:
> >
> > if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> > + bufHdr->tag.forkNum == forkNum[j] &&
> > + bufHdr->tag.blockNum >= firstDelBlock[j])
> >
> > Here, I think you need to use 'i' not 'j' for forkNum and
> > firstDelBlock as those are arrays w.r.t forks. That might fix the
> > problem but I am not sure as I haven't tried to reproduce it.
>
> Thanks for the advice. Right, that seems to be the cause of the error, and
> fixing it (using the fork index) solved the case.
> I also followed Tsunakawa-san's advice of using more meaningful iterator
> names instead of "i" & "j", for readability.

(FWIW, I prefer short conventional names for short-term iterator variables.)

master> * XXX currently it sequentially searches the buffer pool, should be
master> * changed to more clever ways of searching. However, this routine
master> * is used only in code paths that aren't very performance-critical,
master> * and we shouldn't slow down the hot paths to make it faster ...

This comment needs a rewrite.

+ for (fork_num = 0; fork_num < nforks; fork_num++)
{
if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
- bufHdr->tag.forkNum == forkNum[j] &&
- bufHdr->tag.blockNum >= firstDelBlock[j])
+ bufHdr->tag.forkNum == forkNum[fork_num] &&
+ bufHdr->tag.blockNum >= firstDelBlock[fork_num])

fork_num is not actually a fork number, but the index of forkNum[]. It should be fork_idx (or just i, which I prefer..).

- for (j = 0; j < nforks; j++)
- DropRelFileNodeLocalBuffers(rnode.node, forkNum[j],
- firstDelBlock[j]);
+ for (fork_num = 0; fork_num < nforks; fork_num++)
+ DropRelFileNodeLocalBuffers(rnode.node, forkNum[fork_num],
+ firstDelBlock[fork_num]);

I think we don't need to include this irrelevant refactoring in the patch. (And I think j is better there.)

+ * We only speedup this path during recovery, because that's the only
+ * timing when we can get a valid cached value of blocks for relation.
+ * See comment in smgrnblocks() in smgr.c. Otherwise, proceed to usual
+ * buffer invalidation process (scanning of whole shared buffers).

We need an explanation of why we do this optimization only for the recovery case.

+ /* Get the number of blocks for the supplied relation's fork */
+ nblocks = smgrnblocks(smgr_reln, forkNum[fork_num]);
+ Assert(BlockNumberIsValid(nblocks));
+
+ if (nblocks < BUF_DROP_FULLSCAN_THRESHOLD)

As mentioned upthread, the criterion for whether we do full-scan or lookup-drop is how large a portion of NBuffers this relation-drop is going to invalidate. So the nblocks above should be the sum of the number of blocks to be truncated (not just the total number of blocks) of all designated forks. Then once we have decided to do the lookup-drop method, we do that for all forks.

+ for (block_num = 0; block_num <= nblocks; block_num++)
+ {

block_num is quite confusing with nblocks, at least for me (:p). Like fork_num, I prefer that it be just j or iblk or something else not confusable with nblocks. By the way, the loop runs nblocks + 1 times, which seems wrong. We can start the loop from firstDelBlock[fork_num] instead of 0, and that makes the later check against firstDelBlock[] useless.

+ /* create a tag with respect to the block so we can lookup the buffer */
+ INIT_BUFFERTAG(newTag, rnode.node, forkNum[fork_num],
+ firstDelBlock[block_num]);

Mmm.
It is wrong that the tag is initialized using firstDelBlock[block_num]. Why isn't it just block_num?

+ if (buf_id < 0)
+ {
+ LWLockRelease(newPartitionLock);
+ continue;
+ }
+ LWLockRelease(newPartitionLock);

We don't need two separate LWLockRelease()'s there.

+ /*
+ * We can make this a tad faster by prechecking the buffer tag before
+ * we attempt to lock the buffer; this saves a lot of lock ...
+ */
+ if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode.node))
+ continue;

In the original code, this is performed in order to avoid taking a lock on bufHdr for irrelevant buffers. We have identified the buffer by looking it up using the rnode, so I think we don't need this check. Note that we are doing the same check after lock acquisition.

+ else
+ UnlockBufHdr(bufHdr, buf_state);
+ }
+ /*
+ * We've invalidated the nblocks already. Scan the shared buffers
+ * for each fork.
+ */
+ if (block_num > nblocks)
+ {
+ DropRelFileNodeBuffersOfFork(rnode.node, forkNum[fork_num],
+ firstDelBlock[fork_num]);
+ }

Mmm? block_num is always larger than nblocks there, and the function call runs a whole NBuffers scan for the just-processed fork. What is the point of this code?

> > I also added a new function, DropRelFileNodeBuffersOfFork, used when the
> > relation fork is bigger than the threshold, i.e. if (nblocks >
> > BUF_DROP_FULLSCAN_THRESHOLD). Perhaps there's a better name for that function.
> > However, as expected in the previous discussions, this is a bit slower than
> > the standard buffer invalidation process, because the whole of shared
> > buffers is scanned nforks times.
> > Currently, I set the threshold to (NBuffers / 500).
>
> I made a mistake in v12. I replaced firstDelBlock[fork_num] with
> firstDelBlock[block_num] in the for-loop code block of block_num, because we
> want to process the current block of the per-block loop.
> OTOH, I used firstDelBlock[fork_num] when the relation fork is bigger than
> the threshold, or if the cached blocks of small relations were already
> invalidated.

Really? I believe that firstDelBlock is an array that has only nforks elements.

> The logic could be either correct or wrong, so I'd appreciate feedback and
> comments/advice.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
At Wed, 16 Sep 2020 11:56:29 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

(Oops! Some of my comments duplicate Tsunakawa-san's, sorry.)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Sep 16, 2020 at 7:46 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Wed, 2 Sep 2020 08:18:06 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> > On Wed, Sep 2, 2020 at 7:01 AM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> > > Isn't a relation always locked access-exclusively at truncation
> > > time? If so, isn't even the result of lseek reliable enough?
> > >
> >
> > Even if the relation is locked, background processes like checkpointer
> > can still touch the relation which might cause problems. Consider a
> > case where we extend the relation but didn't flush the newly added
> > pages. Now during truncate operation, checkpointer can still flush
> > those pages which can cause trouble for truncate. But, I think in the
> > recovery path such cases won't cause a problem.
>
> I reconsidered this and still have a doubt.
>
> Does this mean lseek(SEEK_END) doesn't count blocks that are
> write(2)'ed (by smgrextend) but not yet flushed? (I don't think so,
> for clarity.) The nblocks cache is added just to reduce the number of
> lseek()s and is expected to always have the same value as what lseek()
> is expected to return.
>

See the comments in ReadBuffer_common() which indicate such a possibility ("Unfortunately, we have also seen this case occurring because of buggy Linux kernels that sometimes return an lseek(SEEK_END) result that doesn't account for a recent write."). Also, refer to my previous email [1] on this and another email link in that email which has a discussion on this point.

> The reason it is reliable only during recovery
> is that the cache is not shared, but the startup process is the only
> process that changes the relation size during recovery.
>

Yes, that is why we are planning to do this optimization for the recovery path.

> If any other process can extend the relation while smgrtruncate is
> running, the current DropRelFileNodeBuffers has the chance
> that a new buffer for the extended area is allocated at a buffer location
> the function has already passed by, which is a disaster.
>

The relation might have been extended before smgrtruncate, but the newly added pages can be flushed by checkpointer during smgrtruncate.

[1] - https://www.postgresql.org/message-id/CAA4eK1LH2uQWznwtonD%2Bnch76kqzemdTQAnfB06z_LXa6NTFtQ%40mail.gmail.com

--
With Regards,
Amit Kapila.
At Wed, 16 Sep 2020 08:33:06 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> On Wed, Sep 16, 2020 at 7:46 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > Does this mean lseek(SEEK_END) doesn't count blocks that are
> > write(2)'ed (by smgrextend) but not yet flushed? (I don't think so,
> > for clarity.) The nblocks cache is added just to reduce the number of
> > lseek()s and is expected to always have the same value as what lseek()
> > is expected to return.
> >
>
> See the comments in ReadBuffer_common() which indicate such a possibility
> ("Unfortunately, we have also seen this case occurring because of
> buggy Linux kernels that sometimes return an lseek(SEEK_END) result
> that doesn't account for a recent write."). Also, refer to my previous
> email [1] on this and another email link in that email which has a
> discussion on this point.
>
> > The reason it is reliable only during recovery
> > is that the cache is not shared, but the startup process is the only
> > process that changes the relation size during recovery.
> >
>
> Yes, that is why we are planning to do this optimization for the recovery path.
>
> > If any other process can extend the relation while smgrtruncate is
> > running, the current DropRelFileNodeBuffers has the chance
> > that a new buffer for the extended area is allocated at a buffer location
> > the function has already passed by, which is a disaster.
> >
>
> The relation might have been extended before smgrtruncate, but the newly
> added pages can be flushed by checkpointer during smgrtruncate.
>
> [1] - https://www.postgresql.org/message-id/CAA4eK1LH2uQWznwtonD%2Bnch76kqzemdTQAnfB06z_LXa6NTFtQ%40mail.gmail.com

Ah! I understand now! The reason we can rely on the cache is that the cached value is *not* what lseek returned but how far we intended to extend. Thank you for the explanation.

By the way, I'm not sure that actually happens, but if one smgrextend call extended the relation by two or more blocks, the cache is invalidated and the succeeding smgrnblocks returns lseek()'s result. Don't we need to guarantee the cache to be valid during recovery?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Sep 16, 2020 at 9:02 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Wed, 16 Sep 2020 08:33:06 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> > On Wed, Sep 16, 2020 at 7:46 AM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> > > Does this mean lseek(SEEK_END) doesn't count blocks that are
> > > write(2)'ed (by smgrextend) but not yet flushed? (I don't think so,
> > > for clarity.) The nblocks cache is added just to reduce the number of
> > > lseek()s and is expected to always have the same value as what lseek()
> > > is expected to return.
> > >
> >
> > See the comments in ReadBuffer_common() which indicate such a possibility
> > ("Unfortunately, we have also seen this case occurring because of
> > buggy Linux kernels that sometimes return an lseek(SEEK_END) result
> > that doesn't account for a recent write."). Also, refer to my previous
> > email [1] on this and another email link in that email which has a
> > discussion on this point.
> >
> > > The reason it is reliable only during recovery
> > > is that the cache is not shared, but the startup process is the only
> > > process that changes the relation size during recovery.
> > >
> >
> > Yes, that is why we are planning to do this optimization for the recovery path.
> >
> > > If any other process can extend the relation while smgrtruncate is
> > > running, the current DropRelFileNodeBuffers has the chance
> > > that a new buffer for the extended area is allocated at a buffer location
> > > the function has already passed by, which is a disaster.
> > >
> >
> > The relation might have been extended before smgrtruncate, but the newly
> > added pages can be flushed by checkpointer during smgrtruncate.
> >
> > [1] - https://www.postgresql.org/message-id/CAA4eK1LH2uQWznwtonD%2Bnch76kqzemdTQAnfB06z_LXa6NTFtQ%40mail.gmail.com
>
> Ah! I understand now! The reason we can rely on the cache is that the
> cached value is *not* what lseek returned but how far we intended to
> extend. Thank you for the explanation.
>
> By the way, I'm not sure that actually happens, but if one smgrextend
> call extended the relation by two or more blocks, the cache is
> invalidated and the succeeding smgrnblocks returns lseek()'s result.
>

Can you think of any such case? I think in recovery we use XLogReadBufferExtended->ReadBufferWithoutRelcache for reading the page, which seems to be extending page-by-page, but there could be some case where that is not true. One idea is to run the regressions and add an Assert to see if we are extending more than a block during recovery.

> Don't
> we need to guarantee the cache to be valid during recovery?
>

One possibility could be that we somehow detect that the value we are using is the cached one, and if so, then only do this optimization.

--
With Regards,
Amit Kapila.
At Wed, 16 Sep 2020 10:05:32 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> On Wed, Sep 16, 2020 at 9:02 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
> > At Wed, 16 Sep 2020 08:33:06 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> > > On Wed, Sep 16, 2020 at 7:46 AM Kyotaro Horiguchi
> > > <horikyota.ntt@gmail.com> wrote:
> > By the way, I'm not sure that actually happens, but if one smgrextend
> > call extended the relation by two or more blocks, the cache is
> > invalidated and the succeeding smgrnblocks returns lseek()'s result.
> >
>
> Can you think of any such case? I think in recovery we use
> XLogReadBufferExtended->ReadBufferWithoutRelcache for reading the page,
> which seems to be extending page-by-page, but there could be some case
> where that is not true. One idea is to run the regressions and add an
> Assert to see if we are extending more than a block during recovery.

I agree with you. Actually, XLogReadBufferExtended is the only point that reads a page during recovery, and it seems to call ReadBufferWithoutRelcache page by page up to the target page. The only case I found where the cache is invalidated is ALTER TABLE SET TABLESPACE while wal_level=minimal, and that is not during recovery; smgrextend is called without smgrnblocks having been called at the time.

Considering that the behavior of lseek can be a problem only just after extending a file, an assertion in smgrextend seems to be enough, although I'm not confident in the diagnosis.

--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -474,7 +474,14 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	if (reln->smgr_cached_nblocks[forknum] == blocknum)
 		reln->smgr_cached_nblocks[forknum] = blocknum + 1;
 	else
+	{
+		/*
+		 * DropRelFileNodeBuffers relies on the behavior that the nblocks
+		 * cache won't be invalidated by file extension while recovering.
+		 */
+		Assert(!InRecovery);
 		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
+	}
 }

> > Don't
> > we need to guarantee the cache to be valid during recovery?
> >
>
> One possibility could be that we somehow detect that the value we are
> using is the cached one, and if so, then only do this optimization.

I basically like this direction. But I'm not sure the additional parameter for smgrnblocks is acceptable.

But on the contrary, it might be a better design that DropRelFileNodeBuffers gives up the optimization when smgrnblocks(,,must_accurate = true) returns InvalidBlockNumber.

@@ -544,9 +551,12 @@ smgrwriteback(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 /*
  *	smgrnblocks() -- Calculate the number of blocks in the
  *					 supplied relation.
+ *
+ * Returns InvalidBlockNumber if must_accurate is true and
+ * smgr_cached_nblocks is not available.
  */
 BlockNumber
-smgrnblocks(SMgrRelation reln, ForkNumber forknum)
+smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool must_accurate)
 {
 	BlockNumber result;
@@ -561,6 +571,17 @@ smgrnblocks(SMgrRelation reln, ForkNumber forknum)
 	reln->smgr_cached_nblocks[forknum] = result;
 
+	/*
+	 * We cannot believe the result from smgrnblocks is always accurate
+	 * because lseek of buggy Linux kernels doesn't account for a recent
+	 * write. However, we can rely on the result from lseek while recovering
+	 * because the first call to this function does not happen just after a
+	 * file extension. Subsequent calls return the cached nblocks, which
+	 * should be accurate during recovery.
+	 */
+	if (!InRecovery && must_accurate)
+		return InvalidBlockNumber;
+
 	return result;
 }

regards.
-- Kyotaro Horiguchi NTT Open Source Software Center
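As a caller-side shape for the proposal above, the gating might look like this sketch. The three-argument smgrnblocks() is the proposed signature from the preceding mail, not the current API, and the surrounding variables (smgr_reln, forkNum[], firstDelBlock[], j) are assumed from the patch:

    BlockNumber nblocks;

    /* proposed API: returns InvalidBlockNumber unless the value is cached */
    nblocks = smgrnblocks(smgr_reln, forkNum[j], true);

    if (BlockNumberIsValid(nblocks) &&
        nblocks - firstDelBlock[j] < BUF_DROP_FULLSCAN_THRESHOLD)
    {
        /* cached size is trusted: drop this fork's buffers block by block */
    }
    else
    {
        /* uncached or too many blocks: fall back to a full buffer scan */
    }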
On Wed, Sep 16, 2020 at 2:02 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Wed, 16 Sep 2020 10:05:32 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> > On Wed, Sep 16, 2020 at 9:02 AM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> > > By the way, I'm not sure that actually happens, but if one smgrextend
> > > call extended the relation by two or more blocks, the cache is
> > > invalidated and the succeeding smgrnblocks returns lseek()'s result.
> > >
> >
> > Can you think of any such case? I think in recovery we use
> > XLogReadBufferExtended->ReadBufferWithoutRelcache for reading the page,
> > which seems to be extending page-by-page, but there could be some case
> > where that is not true. One idea is to run the regressions and add an
> > Assert to see if we are extending more than a block during recovery.
>
> I agree with you. Actually, XLogReadBufferExtended is the only point that
> reads a page during recovery, and it seems to call ReadBufferWithoutRelcache
> page by page up to the target page. The only case I found where the
> cache is invalidated is ALTER TABLE SET TABLESPACE while
> wal_level=minimal, and that is not during recovery; smgrextend is called
> without smgrnblocks having been called at the time.
>
> Considering that the behavior of lseek can be a problem only just after
> extending a file, an assertion in smgrextend seems to be enough, although
> I'm not confident in the diagnosis.
>
> --- a/src/backend/storage/smgr/smgr.c
> +++ b/src/backend/storage/smgr/smgr.c
> @@ -474,7 +474,14 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
>  	if (reln->smgr_cached_nblocks[forknum] == blocknum)
>  		reln->smgr_cached_nblocks[forknum] = blocknum + 1;
>  	else
> +	{
> +		/*
> +		 * DropRelFileNodeBuffers relies on the behavior that the nblocks
> +		 * cache won't be invalidated by file extension while recovering.
> +		 */
> +		Assert(!InRecovery);
>  		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
> +	}
>  }
>

Yeah, I have something like this in mind. I am not very sure at this stage that we want to commit this, but for verification purposes it is a good idea to run the regressions with it.

> > > Don't
> > > we need to guarantee the cache to be valid during recovery?
> > >
> >
> > One possibility could be that we somehow detect that the value we are
> > using is the cached one, and if so, then only do this optimization.
>
> I basically like this direction. But I'm not sure the additional
> parameter for smgrnblocks is acceptable.
>
> But on the contrary, it might be a better design that
> DropRelFileNodeBuffers gives up the optimization when
> smgrnblocks(,,must_accurate = true) returns InvalidBlockNumber.
>

I haven't thought about what is the best way to achieve this. Let us see if Tsunakawa-san or Kirk-san has other ideas on this.

--
With Regards,
Amit Kapila.
On Wednesday, September 16, 2020 5:32 PM, Kyotaro Horiguchi wrote:
> At Wed, 16 Sep 2020 10:05:32 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> > On Wed, Sep 16, 2020 at 9:02 AM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> > > By the way, I'm not sure that actually happens, but if one smgrextend
> > > call extended the relation by two or more blocks, the cache is
> > > invalidated and the succeeding smgrnblocks returns lseek()'s result.
> > >
> >
> > Can you think of any such case? I think in recovery we use
> > XLogReadBufferExtended->ReadBufferWithoutRelcache for reading the page,
> > which seems to be extending page-by-page, but there could be some case
> > where that is not true. One idea is to run the regressions and add an
> > Assert to see if we are extending more than a block during recovery.
>
> I agree with you. Actually, XLogReadBufferExtended is the only point that
> reads a page during recovery, and it seems to call ReadBufferWithoutRelcache
> page by page up to the target page. The only case I found where the cache is
> invalidated is ALTER TABLE SET TABLESPACE while wal_level=minimal, and that
> is not during recovery; smgrextend is called without smgrnblocks having been
> called at the time.
>
> Considering that the behavior of lseek can be a problem only just after
> extending a file, an assertion in smgrextend seems to be enough, although
> I'm not confident in the diagnosis.
>
> --- a/src/backend/storage/smgr/smgr.c
> +++ b/src/backend/storage/smgr/smgr.c
> @@ -474,7 +474,14 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
>  	if (reln->smgr_cached_nblocks[forknum] == blocknum)
>  		reln->smgr_cached_nblocks[forknum] = blocknum + 1;
>  	else
> +	{
> +		/*
> +		 * DropRelFileNodeBuffers relies on the behavior that the nblocks
> +		 * cache won't be invalidated by file extension while recovering.
> +		 */
> +		Assert(!InRecovery);
>  		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
> +	}
>  }
>
> > > Don't
> > > we need to guarantee the cache to be valid during recovery?
> > >
> >
> > One possibility could be that we somehow detect that the value we are
> > using is the cached one, and if so, then only do this optimization.
>
> I basically like this direction. But I'm not sure the additional parameter for
> smgrnblocks is acceptable.
>
> But on the contrary, it might be a better design that DropRelFileNodeBuffers
> gives up the optimization when smgrnblocks(,,must_accurate = true) returns
> InvalidBlockNumber.
>

Thank you for your thoughtful reviews and discussions, Horiguchi-san, Tsunakawa-san and Amit-san. Apologies for my carelessness. I've addressed the bugs in the previous version:
1. Getting the total number of blocks for all the specified forks
2. Hashtable probing conditions

I added the suggested assert in smgrextend for the XLogReadBufferExtended case, and I thought that would be enough. I think modifying smgrnblocks by adding a new parameter would complicate the source code, because a number of functions call it. So I thought that maybe putting BlockNumberIsValid(nblocks) in the condition would suffice; otherwise, we do a full scan of the buffer pool.

+ if ((nblocks / (uint32) NBuffers) < BUF_DROP_FULLSCAN_THRESHOLD &&
+     BlockNumberIsValid(nblocks))
+ { ... }
+ else
+ { /* full scan */ }

Attached is v14 of the patch. It compiles and passes the tests.
Hoping for your continued reviews and feedback. Thank you very much.

Regards,
Kirk Jamison
I looked at v14.

(1)
+ /* Get the total number of blocks for the supplied relation's fork */
+ for (j = 0; j < nforks; j++)
+ {
+ BlockNumber block = smgrnblocks(smgr_reln, forkNum[j]);
+ nblocks += block;
+ }

Why do you sum all forks?

(2)
+ if ((nblocks / (uint32)NBuffers) < BUF_DROP_FULLSCAN_THRESHOLD &&
+ BlockNumberIsValid(nblocks))
+ {

The division by NBuffers is not necessary, because both sides of "<" are numbers of blocks. Why is the BlockNumberIsValid(nblocks) call needed?

(3)
if (reln->smgr_cached_nblocks[forknum] == blocknum)
reln->smgr_cached_nblocks[forknum] = blocknum + 1;
else
+ {
+ /*
+ * DropRelFileNodeBuffers relies on the behavior that cached nblocks
+ * won't be invalidated by file extension while recovering.
+ */
+ Assert(!InRecovery);
reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
+ }

I think this change is not directly related to this patch and can be a separate patch, but I want to leave the decision up to a committer.

Regards
Takayuki Tsunakawa
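Applying point (2), the check might simplify to something like this sketch, where BUF_DROP_FULLSCAN_THRESHOLD is already a block count and the validity test, if kept at all, short-circuits first:

    /* both operands are block counts; no scaling by NBuffers is needed */
    if (BlockNumberIsValid(nblocks) &&
        nblocks < BUF_DROP_FULLSCAN_THRESHOLD)
    {
        /* per-block lookup-drop path */
    }
    else
    {
        /* usual full scan of shared buffers */
    }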
From: Amit Kapila <amit.kapila16@gmail.com>
> > > > Don't
> > > > we need to guarantee the cache to be valid during recovery?
> > > >
> > >
> > > One possibility could be that we somehow detect that the value we
> > > are using is the cached one, and if so, then only do this optimization.
> >
> > I basically like this direction. But I'm not sure the additional
> > parameter for smgrnblocks is acceptable.
> >
> > But on the contrary, it might be a better design that
> > DropRelFileNodeBuffers gives up the optimization when
> > smgrnblocks(,,must_accurate = true) returns InvalidBlockNumber.
> >
>
> I haven't thought about what is the best way to achieve this. Let us see if
> Tsunakawa-san or Kirk-san has other ideas on this.

I see no need to add an argument to smgrnblocks(), as it returns the correct cached or measured value.

Regards
Takayuki Tsunakawa
On Wednesday, September 23, 2020 11:26 AM, Tsunakawa, Takayuki wrote:
> I looked at v14.

Thank you for checking it!

> (1)
> + /* Get the total number of blocks for the supplied relation's fork */
> + for (j = 0; j < nforks; j++)
> + {
> + BlockNumber block = smgrnblocks(smgr_reln, forkNum[j]);
> + nblocks += block;
> + }
>
> Why do you sum all forks?

I revised the patch based on my understanding of Horiguchi-san's comment, but I could be wrong. Quoting:

"
+ /* Get the number of blocks for the supplied relation's fork */
+ nblocks = smgrnblocks(smgr_reln, forkNum[fork_num]);
+ Assert(BlockNumberIsValid(nblocks));
+
+ if (nblocks < BUF_DROP_FULLSCAN_THRESHOLD)

As mentioned upthread, the criterion for whether we do full-scan or lookup-drop is how large a portion of NBuffers this relation-drop is going to invalidate. So the nblocks above should be the sum of the number of blocks to be truncated (not just the total number of blocks) of all designated forks. Then once we have decided to do the lookup-drop method, we do that for all forks."

> (2)
> + if ((nblocks / (uint32)NBuffers) < BUF_DROP_FULLSCAN_THRESHOLD &&
> + BlockNumberIsValid(nblocks))
> + {
>
> The division by NBuffers is not necessary, because both sides of "<" are
> numbers of blocks.

Again, I based it on my understanding of the comment above, so nblocks is the sum of all blocks to be truncated for all forks.

> Why is the BlockNumberIsValid(nblocks) call needed?

I thought we need to ensure that nblocks is not invalid, so I also added that check.

> (3)
> if (reln->smgr_cached_nblocks[forknum] == blocknum)
> reln->smgr_cached_nblocks[forknum] = blocknum + 1;
> else
> + {
> + /*
> + * DropRelFileNodeBuffers relies on the behavior that cached nblocks
> + * won't be invalidated by file extension while recovering.
> + */
> + Assert(!InRecovery);
> reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
> + }
>
> I think this change is not directly related to this patch and can be a
> separate patch, but I want to leave the decision up to a committer.

This is noted. Once we have clarified the above comments, I'll put it in a separate patch if necessary.

Thank you very much for the reviews.

Best regards,
Kirk Jamison
On Wed, Sep 23, 2020 at 7:56 AM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote:
>
> (3)
> if (reln->smgr_cached_nblocks[forknum] == blocknum)
> reln->smgr_cached_nblocks[forknum] = blocknum + 1;
> else
> + {
> + /*
> + * DropRelFileNodeBuffers relies on the behavior that cached nblocks
> + * won't be invalidated by file extension while recovering.
> + */
> + Assert(!InRecovery);
> reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
> + }
>
> I think this change is not directly related to this patch and can be a separate patch, but I want to leave the decision up to a committer.
>

We have added this mainly for testing purposes; basically, this assertion should not fail during the regression tests. We can keep it in a separate patch but need to ensure that. If this fails, then we can't rely on the caching behaviour during recovery, which is actually required for the correctness of the patch.

--
With Regards,
Amit Kapila.
On Wed, Sep 23, 2020 at 8:04 AM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote:
>
> From: Amit Kapila <amit.kapila16@gmail.com>
> > > > > Don't
> > > > > we need to guarantee the cache to be valid during recovery?
> > > > >
> > > >
> > > > One possibility could be that we somehow detect that the value we
> > > > are using is the cached one, and if so, then only do this optimization.
> > >
> > > I basically like this direction. But I'm not sure the additional
> > > parameter for smgrnblocks is acceptable.
> > >
> > > But on the contrary, it might be a better design that
> > > DropRelFileNodeBuffers gives up the optimization when
> > > smgrnblocks(,,must_accurate = true) returns InvalidBlockNumber.
> > >
> >
> > I haven't thought about what is the best way to achieve this. Let us see if
> > Tsunakawa-san or Kirk-san has other ideas on this.
>
> I see no need to add an argument to smgrnblocks(), as it returns the correct cached or measured value.
>

The idea is that we can't use this optimization if the value is not cached, because we can't rely on lseek behavior. See all the discussion between Horiguchi-san and me in the thread above. So, how would you ensure that if we don't use Kirk-san's proposal?

--
With Regards,
Amit Kapila.
From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com>
> I revised the patch based on my understanding of Horiguchi-san's comment,
> but I could be wrong.
> Quoting:
>
> "
> + /* Get the number of blocks for the supplied relation's fork */
> + nblocks = smgrnblocks(smgr_reln, forkNum[fork_num]);
> + Assert(BlockNumberIsValid(nblocks));
> +
> + if (nblocks < BUF_DROP_FULLSCAN_THRESHOLD)
>
> As mentioned upthread, the criterion for whether we do full-scan or
> lookup-drop is how large a portion of NBuffers this relation-drop is
> going to invalidate. So the nblocks above should be the sum of the number
> of blocks to be truncated (not just the total number of blocks) of all
> designated forks. Then once we have decided to do the lookup-drop method,
> we do that for all forks."

One takeaway from Horiguchi-san's comment is to use the number of blocks to invalidate for comparison, instead of all blocks in the fork. That is, use

nblocks = smgrnblocks(fork) - firstDelBlock[fork];

Does this make sense?

What do you think is the reason for summing up all forks? I didn't understand why. Typically, the FSM and VM forks are very small. If the main fork is larger than NBuffers / 500, then v14 scans the entire shared buffers for the FSM and VM forks as well as the main fork, resulting in three scans in total.

Also, if you want to judge the criterion based on the total blocks of all forks, the following if should be placed outside the for loop, right? Because this if condition doesn't change inside the for loop.

+ if ((nblocks / (uint32)NBuffers) < BUF_DROP_FULLSCAN_THRESHOLD &&
+ BlockNumberIsValid(nblocks))
+ {

> > (2)
> > + if ((nblocks / (uint32)NBuffers) < BUF_DROP_FULLSCAN_THRESHOLD &&
> > + BlockNumberIsValid(nblocks))
> > + {
> >
> > The division by NBuffers is not necessary, because both sides of "<" are
> > numbers of blocks.
>
> Again, I based it on my understanding of the comment above, so
> nblocks is the sum of all blocks to be truncated for all forks.

But the left expression of "<" is a percentage, while the right one is a block count. Two different units are compared.

> > Why is the BlockNumberIsValid(nblocks) call needed?
>
> I thought we need to ensure that nblocks is not invalid, so I also added that check.

When is it invalid? smgrnblocks() seems to always return a valid block number. Am I seeing different source code (I saw HEAD)?

Regards
Takayuki Tsunakawa
From: Amit Kapila <amit.kapila16@gmail.com>
> The idea is that we can't use this optimization if the value is not
> cached, because we can't rely on lseek behavior. See all the discussion
> between Horiguchi-san and me in the thread above. So, how would you
> ensure that if we don't use Kirk-san's proposal?

Hmm, buggy Linux kernel... (Until when should we be worried about the bug?)

According to the following suggestion of Horiguchi-san's, it's during normal operation, not during recovery, when we should be careful, right? Then, can we use the current smgrnblocks() as is?

+	/*
+	 * We cannot believe the result from smgrnblocks is always accurate
+	 * because lseek of buggy Linux kernels doesn't account for a recent
+	 * write. However, we can rely on the result from lseek while recovering
+	 * because the first call to this function does not happen just after a
+	 * file extension. Subsequent calls return the cached nblocks, which
+	 * should be accurate during recovery.
+	 */
+	if (!InRecovery && must_accurate)
+		return InvalidBlockNumber;
+
 	return result;
 }

If smgrnblocks() could return a value smaller than the actual file size by one block even during recovery, how about always adding one to the return value of smgrnblocks() in DropRelFileNodeBuffers()? When smgrnblocks() actually returned the correct value, the extra block is simply not found in shared buffers, so DropRelFileNodeBuffers() does no harm.

Or, add a new function like smgrnblocks_precise() to avoid adding an argument to smgrnblocks()?

Regards
Takayuki Tsunakawa
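The "+1 margin" idea above, sketched; FindAndDropBuffer() is a hypothetical helper standing in for the per-block lookup-and-invalidate step, not a real function, and the surrounding variables are assumed from the patch:

    BlockNumber nblocks;
    BlockNumber blk;

    /*
     * If a buggy lseek(SEEK_END) can be one block short, widening the
     * invalidation range by one is harmless: a lookup of a block that was
     * never cached simply misses in the buffer mapping table.
     */
    nblocks = smgrnblocks(smgr_reln, forkNum[j]) + 1;

    for (blk = firstDelBlock[j]; blk < nblocks; blk++)
        FindAndDropBuffer(rnode.node, forkNum[j], blk);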
On Wednesday, September 23, 2020 2:37 PM, Tsunakawa, Takayuki wrote:
> > I revised the patch based on my understanding of Horiguchi-san's
> > comment, but I could be wrong.
> > Quoting:
> >
> > "
> > + /* Get the number of blocks for the supplied relation's fork */
> > + nblocks = smgrnblocks(smgr_reln, forkNum[fork_num]);
> > + Assert(BlockNumberIsValid(nblocks));
> > +
> > + if (nblocks < BUF_DROP_FULLSCAN_THRESHOLD)
> >
> > As mentioned upthread, the criterion for whether we do full-scan or
> > lookup-drop is how large a portion of NBuffers this relation-drop is
> > going to invalidate. So the nblocks above should be the sum of the number
> > of blocks to be truncated (not just the total number of blocks) of all
> > designated forks. Then once we have decided to do the lookup-drop method,
> > we do that for all forks."
>
> One takeaway from Horiguchi-san's comment is to use the number of blocks
> to invalidate for comparison, instead of all blocks in the fork. That is, use
>
> nblocks = smgrnblocks(fork) - firstDelBlock[fork];
>
> Does this make sense?

Hmm. OK, I think I got too far into my own head and misunderstood what it meant. I'll debug again using ereport just to check that the values and behavior are correct. Your comment about the v14 patch made me realize that it reverted to the previous, slower version where we scan NBuffers for each fork. Thank you for explaining it.

> What do you think is the reason for summing up all forks? I didn't
> understand why. Typically, the FSM and VM forks are very small. If the main
> fork is larger than NBuffers / 500, then v14 scans the entire shared buffers
> for the FSM and VM forks as well as the main fork, resulting in three scans
> in total.
>
> Also, if you want to judge the criterion based on the total blocks of all
> forks, the following if should be placed outside the for loop, right?
> Because this if condition doesn't change inside the for loop.
>
> + if ((nblocks / (uint32)NBuffers) < BUF_DROP_FULLSCAN_THRESHOLD &&
> + BlockNumberIsValid(nblocks))
> + {
>
> > (2)
> > + if ((nblocks / (uint32)NBuffers) < BUF_DROP_FULLSCAN_THRESHOLD &&
> > + BlockNumberIsValid(nblocks))
> > + {
> >
> > The division by NBuffers is not necessary, because both sides of "<" are
> > numbers of blocks.
>
> But the left expression of "<" is a percentage, while the right one is a
> block count. Two different units are compared.

Right. Makes sense. Fixed.

> > Why is the BlockNumberIsValid(nblocks) call needed?
>
> When is it invalid? smgrnblocks() seems to always return a valid block
> number. Am I seeing different source code (I saw HEAD)?

It's based on the discussion upthread about guaranteeing that the cache is valid during recovery, and that we don't want to proceed with the optimization in case nblocks is invalid. It may not be needed, so I have removed it, because the correct direction is ensuring that smgrnblocks returns the precise value.

Considering the test case that Horiguchi-san suggested (attached as a separate patch), maybe there's no need to indicate it in the loop condition. For now, I haven't modified the design (or created a new function) of smgrnblocks, and I just updated the patches based on the recent comments.

Thank you very much again for the reviews.

Best regards,
Kirk Jamison
On Wed, Sep 23, 2020 at 12:00 PM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote:
>
> From: Amit Kapila <amit.kapila16@gmail.com>
> > The idea is that we can't use this optimization if the value is not
> > cached, because we can't rely on lseek behavior. See all the discussion
> > between Horiguchi-san and me in the thread above. So, how would you
> > ensure that if we don't use Kirk-san's proposal?
>
> Hmm, buggy Linux kernel... (Until when should we be worried about the bug?)
>
> According to the following suggestion of Horiguchi-san's, it's during normal
> operation, not during recovery, when we should be careful, right?
>

No, during recovery also we need to be careful. We need to ensure that we use the cached value during recovery and that the cached value is always up-to-date. We can't rely on lseek, and I have provided some scenario upthread [1] where such behavior can cause a problem; then see the response from Tom Lane about why the same can be true for recovery as well.

The basic approach we are trying to pursue here is to rely on the cached value of 'number of blocks' (as that always gives the correct value, and even if there is a problem it will be our bug; we don't need to rely on the OS for the correct value, and it will be better w.r.t. performance as well). It is currently only possible during recovery, so we are using it in the recovery path; later, once Thomas's patch to cache it for non-recovery cases is also done, we can use it for non-recovery cases as well.

[1] - https://www.postgresql.org/message-id/CAA4eK1LqaJvT%3DbFOpc4i5Haq4oaVQ6wPbAcg64-Kt1qzp_MZYA%40mail.gmail.com

--
With Regards,
Amit Kapila.
In v15:

(1)
+ for (cur_blk = firstDelBlock[j]; cur_blk < nblocks; cur_blk++)

The right side of "cur_blk <" should not be nblocks, because nblocks is no longer the total number of blocks in the relation fork.

(2)
+ BlockNumber nblocks;
+ nblocks = smgrnblocks(smgr_reln, forkNum[j]) - firstDelBlock[j];

You should either:

* Combine the two lines into one: BlockNumber nblocks = ...;

or

* Put an empty line between the two lines to separate the declaration from the execution statements.

After correcting these, I think you can check the recovery performance.

Regards
Takayuki Tsunakawa
On Thursday, September 24, 2020 1:27 PM, Tsunakawa-san wrote:
> (1)
> + for (cur_blk = firstDelBlock[j]; cur_blk < nblocks; cur_blk++)
>
> The right side of "cur_blk <" should not be nblocks, because nblocks is no
> longer the total number of blocks in the relation fork.

Right. Fixed. It should be the total number of blocks in the relation fork.

> (2)
> + BlockNumber nblocks;
> + nblocks = smgrnblocks(smgr_reln, forkNum[j]) - firstDelBlock[j];
>
> You should either:
>
> * Combine the two lines into one: BlockNumber nblocks = ...;
>
> or
>
> * Put an empty line between the two lines to separate the declaration from
> the execution statements.

Right. I separated them in the updated patch, and to prevent confusion, instead of nblocks, nTotalBlocks and nBlocksToInvalidate are used:

/* Get the total number of blocks for the supplied relation's fork */
nTotalBlocks = smgrnblocks(smgr_reln, forkNum[j]);

/* Get the total number of blocks to be invalidated for the specified fork */
nBlocksToInvalidate = nTotalBlocks - firstDelBlock[j];

> After correcting these, I think you can check the recovery performance.

I'll send performance measurement results in the next email. Thanks a lot for the reviews!

Regards,
Kirk Jamison
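Assembled, the reshaped per-fork decision described above might look like the following sketch. The variable names come from the mail; FindAndDropBuffer() is a hypothetical stand-in for the per-block lookup, and DropRelFileNodeBuffersOfFork() is the patch's full-scan fallback:

    for (j = 0; j < nforks; j++)
    {
        BlockNumber nTotalBlocks;
        BlockNumber nBlocksToInvalidate;
        BlockNumber cur_blk;

        /* Get the total number of blocks for the supplied relation's fork */
        nTotalBlocks = smgrnblocks(smgr_reln, forkNum[j]);

        /* Number of buffers this truncation can invalidate for this fork */
        nBlocksToInvalidate = nTotalBlocks - firstDelBlock[j];

        if (nBlocksToInvalidate < BUF_DROP_FULLSCAN_THRESHOLD)
        {
            /* few enough blocks: chase each one in the mapping table */
            for (cur_blk = firstDelBlock[j]; cur_blk < nTotalBlocks; cur_blk++)
                FindAndDropBuffer(rnode.node, forkNum[j], cur_blk);
        }
        else
        {
            /* too many to chase: one full scan of shared buffers */
            DropRelFileNodeBuffersOfFork(rnode.node, forkNum[j],
                                         firstDelBlock[j]);
        }
    }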
Hello.

At Wed, 23 Sep 2020 05:37:24 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in
> From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com>

# Wow. I'm surprised to read it..

> > I revised the patch based on my understanding of Horiguchi-san's comment,
> > but I could be wrong.
> > Quoting:
> >
> > "
> > + /* Get the number of blocks for the supplied relation's fork */
> > + nblocks = smgrnblocks(smgr_reln, forkNum[fork_num]);
> > + Assert(BlockNumberIsValid(nblocks));
> > +
> > + if (nblocks < BUF_DROP_FULLSCAN_THRESHOLD)
> >
> > As mentioned upthread, the criterion for whether we do full-scan or
> > lookup-drop is how large a portion of NBuffers this relation-drop is
> > going to invalidate. So the nblocks above should be the sum of the number
> > of blocks to be truncated (not just the total number of blocks) of all
> > designated forks. Then once we have decided to do the lookup-drop method,
> > we do that for all forks."
>
> One takeaway from Horiguchi-san's comment is to use the number of blocks to
> invalidate for comparison, instead of all blocks in the fork. That is, use
>
> nblocks = smgrnblocks(fork) - firstDelBlock[fork];
>
> Does this make sense?
>
> What do you think is the reason for summing up all forks? I didn't
> understand why. Typically, the FSM and VM forks are very small. If the main
> fork is larger than NBuffers / 500, then v14 scans the entire shared buffers
> for the FSM and VM forks as well as the main fork, resulting in three scans
> in total.

I thought of summing up smgrnblocks(fork) - firstDelBlock[fork] of all forks. I don't mind omitting the non-main forks, but a comment to explain the reason or reasoning would be needed.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi.

> I'll send performance measurement results in the next email. Thanks a lot for
> the reviews!

Below are the performance measurement results. I was only able to use a low-spec machine: CPU 4v, Memory 8GB, RHEL, xfs filesystem.

[Failover/Recovery Test]
1. (Master) Create table (ex. 10,000 tables). Insert data to tables.
2. (M) DELETE FROM TABLE (ex. all rows of 10,000 tables)
3. (Standby) To test with failover, pause the WAL replay on the standby server.
(SELECT pg_wal_replay_pause();)
4. (M) psql -c "\timing on" (measures total execution of SQL queries)
5. (M) VACUUM (whole db)
6. (M) After vacuum finishes, stop the primary server: pg_ctl stop -w -mi
7. (S) Resume WAL replay and promote the standby.

Because it's difficult to measure recovery time, I used the attached script (resume.sh) that prints a timestamp before and after promotion. It basically does the following:
- "SELECT pg_wal_replay_resume();" is executed and the WAL application is resumed.
- "pg_ctl promote" to promote the standby.
- The time for "select pg_is_in_recovery();" to change from "t" to "f" is measured.

[Results]
Recovery/Failover performance (in seconds). 3 trial runs.

| shared_buffers | master | patch  | %reg    |
|----------------|--------|--------|---------|
| 128MB          | 32.406 | 33.785 | 4.08%   |
| 1GB            | 36.188 | 32.747 | -10.51% |
| 2GB            | 41.996 | 32.88  | -27.73% |

There's a small regression with the default shared_buffers (128MB), but as for the recovery time when we have a large NBuffers, it's now at least almost constant, so there's boosted performance. IOW, we enter the optimization most of the time during recovery.

I also did a benchmark similar to what Tomas did [1]: simple "pgbench -S" tests (warmup and then 15 x 1-minute runs with 1, 8 and 16 clients), but I'm not sure if my machine is reliable enough to produce reliable results for 8 clients and more.

| #          | master      | patch       | %reg   |
|------------|-------------|-------------|--------|
| 1 client   | 1676.937825 | 1707.018029 | -1.79% |
| 8 clients  | 7706.835401 | 7529.089044 | 2.31%  |
| 16 clients | 9823.65254  | 9991.184206 | -1.71% |

If there's additional/necessary performance measurement, kindly advise me too. Thank you in advance.

[1] https://www.postgresql.org/message-id/flat/20200806213334.3bzadeirly3mdtzl%40development#473168a61e229de40eaf36326232f86c

Best regards,
Kirk Jamison
From: Amit Kapila <amit.kapila16@gmail.com>
> No, during recovery also we need to be careful. We need to ensure that
> we use the cached value during recovery and that the cached value is always
> up-to-date. We can't rely on lseek, and I have provided some scenario
> upthread [1] where such behavior can cause a problem; then see the
> response from Tom Lane about why the same can be true for recovery as well.
>
> The basic approach we are trying to pursue here is to rely on the
> cached value of 'number of blocks' (as that always gives the correct value,
> and even if there is a problem it will be our bug; we don't need to
> rely on the OS for the correct value, and it will be better w.r.t.
> performance as well). It is currently only possible during recovery, so we
> are using it in the recovery path; later, once Thomas's patch to cache it
> for non-recovery cases is also done, we can use it for non-recovery
> cases as well.

Although I may still be confused, I understood that Kirk-san's patch should:

* Still focus on speeding up the replay of TRUNCATE during recovery.

* During recovery, DropRelFileNodeBuffers() gets the cached size of the relation fork. If it is cached, trust it and optimize the buffer invalidation. If it's not cached, we can't trust the return value of smgrnblocks() because it's the lseek(END) return value, so we avoid the optimization.

* Then, add a new function, say, smgrnblocks_cached(), that simply returns the cached block count, and DropRelFileNodeBuffers() uses it instead of smgrnblocks().

Regards
Takayuki Tsunakawa
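A minimal sketch of such a getter, assuming the smgr_cached_nblocks field added by commit c5315f4f44; the name and exact contract are still under discussion in this thread:

    /*
     * smgrnblocks_cached() -- return the cached block count for a fork, or
     * InvalidBlockNumber when no trustworthy cached value exists and the
     * caller must fall back to the full-scan invalidation path.
     */
    BlockNumber
    smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum)
    {
        /*
         * For now the cache can only be trusted during recovery, where the
         * startup process is the sole changer of relation sizes.
         */
        if (InRecovery &&
            reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
            return reln->smgr_cached_nblocks[forknum];

        return InvalidBlockNumber;
    }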
From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com>
> [Results]
> Recovery/Failover performance (in seconds). 3 trial runs.
>
> | shared_buffers | master | patch  | %reg    |
> |----------------|--------|--------|---------|
> | 128MB          | 32.406 | 33.785 | 4.08%   |
> | 1GB            | 36.188 | 32.747 | -10.51% |
> | 2GB            | 41.996 | 32.88  | -27.73% |

Thanks for sharing good results. We want to know if we can get results as significant as you gained before with hundreds of GBs of shared buffers, don't we?

> I also did a benchmark similar to what Tomas did [1]: simple
> "pgbench -S" tests (warmup and then 15 x 1-minute runs with 1, 8 and 16
> clients), but I'm not sure if my machine is reliable enough to produce
> reliable results for 8 clients and more.

Let me confirm just in case. Your patch should not affect pgbench performance, but you measured it. Is there anything you're concerned about?

Regards
Takayuki Tsunakawa
On Friday, September 25, 2020 6:02 PM, Tsunakawa-san wrote:
> From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com>
> > [Results]
> > Recovery/Failover performance (in seconds). 3 trial runs.
> >
> > | shared_buffers | master | patch  | %reg    |
> > |----------------|--------|--------|---------|
> > | 128MB          | 32.406 | 33.785 | 4.08%   |
> > | 1GB            | 36.188 | 32.747 | -10.51% |
> > | 2GB            | 41.996 | 32.88  | -27.73% |
>
> Thanks for sharing good results. We want to know if we can get results as
> significant as you gained before with hundreds of GBs of shared buffers,
> don't we?

Yes. But I don't have a high-spec machine I could use at the moment. I'll see if I can get one by next week. Or if someone would like to reproduce the test with an available higher-spec machine, it would be much appreciated. The test case is upthread [1].

> > I also did a benchmark similar to what Tomas did [1]: simple
> > "pgbench -S" tests (warmup and then 15 x 1-minute runs with 1, 8 and
> > 16 clients), but I'm not sure if my machine is reliable enough to
> > produce reliable results for 8 clients and more.
>
> Let me confirm just in case. Your patch should not affect pgbench
> performance, but you measured it. Is there anything you're concerned
> about?

Not really. In the previous emails, the argument was the BufferAlloc overhead, but we don't have it in the latest patch. Just in case somebody asks about benchmark performance, I also posted the results.

[1] https://www.postgresql.org/message-id/OSBPR01MB2341683DEDE0E7A8D045036FEF360%40OSBPR01MB2341.jpnprd01.prod.outlook.com

Regards,
Kirk Jamison
On Fri, Sep 25, 2020 at 2:25 PM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote:
>
> From: Amit Kapila <amit.kapila16@gmail.com>
> > No, during recovery also we need to be careful. We need to ensure that
> > we use the cached value during recovery and that the cached value is always
> > up-to-date. We can't rely on lseek, and I have provided some scenario
> > upthread [1] where such behavior can cause a problem; then see the
> > response from Tom Lane about why the same can be true for recovery as well.
> >
> > The basic approach we are trying to pursue here is to rely on the
> > cached value of 'number of blocks' (as that always gives the correct value,
> > and even if there is a problem it will be our bug; we don't need to
> > rely on the OS for the correct value, and it will be better w.r.t.
> > performance as well). It is currently only possible during recovery, so we
> > are using it in the recovery path; later, once Thomas's patch to cache it
> > for non-recovery cases is also done, we can use it for non-recovery
> > cases as well.
>
> Although I may still be confused, I understood that Kirk-san's patch should:
>
> * Still focus on speeding up the replay of TRUNCATE during recovery.
>
> * During recovery, DropRelFileNodeBuffers() gets the cached size of the relation fork. If it is cached, trust it and optimize the buffer invalidation. If it's not cached, we can't trust the return value of smgrnblocks() because it's the lseek(END) return value, so we avoid the optimization.
>

I agree with the above two points.

> * Then, add a new function, say, smgrnblocks_cached(), that simply returns the cached block count, and DropRelFileNodeBuffers() uses it instead of smgrnblocks().
>

I am not sure if it is worth adding a new function for this. Why not simply add a boolean variable in smgrnblocks for this? BTW, AFAICS, the latest patch doesn't have code to address this point.

--
With Regards,
Amit Kapila.
On Fri, Sep 25, 2020 at 1:49 PM k.jamison@fujitsu.com <k.jamison@fujitsu.com> wrote:
>
> Hi.
>
> > I'll send performance measurement results in the next email. Thanks a lot
> > for the reviews!
>
> Below are the performance measurement results.
> I was only able to use a low-spec machine:
> CPU 4v, Memory 8GB, RHEL, xfs filesystem.
>
> [Failover/Recovery Test]
> 1. (Master) Create table (ex. 10,000 tables). Insert data to tables.
> 2. (M) DELETE FROM TABLE (ex. all rows of 10,000 tables)
> 3. (Standby) To test with failover, pause the WAL replay on the standby server.
> (SELECT pg_wal_replay_pause();)
> 4. (M) psql -c "\timing on" (measures total execution of SQL queries)
> 5. (M) VACUUM (whole db)
> 6. (M) After vacuum finishes, stop the primary server: pg_ctl stop -w -mi
> 7. (S) Resume WAL replay and promote the standby.
> Because it's difficult to measure recovery time, I used the attached script
> (resume.sh) that prints a timestamp before and after promotion. It basically
> does the following:
> - "SELECT pg_wal_replay_resume();" is executed and the WAL application is resumed.
> - "pg_ctl promote" to promote the standby.
> - The time for "select pg_is_in_recovery();" to change from "t" to "f" is measured.
>
> [Results]
> Recovery/Failover performance (in seconds). 3 trial runs.
>
> | shared_buffers | master | patch  | %reg    |
> |----------------|--------|--------|---------|
> | 128MB          | 32.406 | 33.785 | 4.08%   |
> | 1GB            | 36.188 | 32.747 | -10.51% |
> | 2GB            | 41.996 | 32.88  | -27.73% |
>
> There's a small regression with the default shared_buffers (128MB),
>

I feel we should try to address this. Basically, we can find the smallest value of shared buffers above which the new algorithm is beneficial and try to use that as the threshold for doing this optimization. I don't think it is beneficial to use this optimization for a small value of shared_buffers.

> but as for the recovery time when we have a large NBuffers, it's now at least
> almost constant, so there's boosted performance. IOW, we enter the
> optimization most of the time during recovery.
>

Yeah, that is good to see. We can probably try to check with a much larger value of shared buffers.

--
With Regards,
Amit Kapila.
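One way to capture that suggestion is a predicate like the following sketch; both constants are placeholders to be settled by benchmarking, not values from the patch:

    /*
     * Hypothetical guard: take the per-block lookup-drop path only when
     * shared_buffers is above some floor and the number of buffers to
     * chase is a small fraction of it.
     */
    static inline bool
    UseLookupDrop(BlockNumber nBlocksToInvalidate)
    {
        const int   min_nbuffers = 65536;   /* 512MB of 8kB buffers, a guess */

        return NBuffers >= min_nbuffers &&
               nBlocksToInvalidate < (BlockNumber) (NBuffers / 500);
    }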
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Amit Kapila <amit.kapila16@gmail.com>
> I agree with the above two points.

Thank you. I'm relieved to know I didn't misunderstand.

> > * Then, add a new function, say, smgrnblocks_cached() that simply returns
> > the cached block count, and DropRelFileNodeBuffers() uses it instead of
> > smgrnblocks().
>
> I am not sure if it is worth adding a new function for this. Why not simply
> add a boolean variable in smgrnblocks for this?

One reason is that adding an argument requires modification of existing call
sites (10 + a few). Another is that, although this may be different for each
person's taste, it's sometimes not easy to understand a function call when a
bare true/false appears in it. One such example is find_XXX(some_args,
true/false), where the true/false represents missing_ok. Another example is
as follows. I often wonder "what's the meaning of this false, and that true?"

	if (!InstallXLogFileSegment(&destsegno, tmppath, false, 0, false))
		elog(ERROR, "InstallXLogFileSegment should not have failed");

Fortunately, the new function is very short and doesn't duplicate much code.
The function is a simple getter, and the function name can convey the meaning
directly (if the name is good.)

> BTW, AFAICS, the latest patch
> doesn't have code to address this point.

Kirk-san, can you address this? I don't mind much whether you add an argument
or a new function.

Regards
Takayuki Tsunakawa
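[To make the two API shapes being weighed here concrete, a hypothetical
side-by-side; neither call exists in this form yet, and smgrnblocks_cached()
is only a name proposed in this thread:]

	/* Option A: boolean argument; the reader must know what "true" means */
	nblocks = smgrnblocks(reln, MAIN_FORKNUM, true);

	/* Option B: dedicated getter; the name itself carries the intent */
	nblocks = smgrnblocks_cached(reln, MAIN_FORKNUM);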
On Monday, September 28, 2020 11:50 AM, Tsunakawa-san wrote:

> From: Amit Kapila <amit.kapila16@gmail.com>
> > I agree with the above two points.
>
> Thank you. I'm relieved to know I didn't misunderstand.
>
> > > * Then, add a new function, say, smgrnblocks_cached() that simply
> > > returns the cached block count, and DropRelFileNodeBuffers() uses it
> > > instead of smgrnblocks().
> >
> > I am not sure if it is worth adding a new function for this. Why not
> > simply add a boolean variable in smgrnblocks for this?
>
> One reason is that adding an argument requires modification of existing call
> sites (10 + a few). Another is that, although this may be different for each
> person's taste, it's sometimes not easy to understand a function call when a
> bare true/false appears in it. One such example is find_XXX(some_args,
> true/false), where the true/false represents missing_ok. Another example is
> as follows. I often wonder "what's the meaning of this false, and that true?"
>
> 	if (!InstallXLogFileSegment(&destsegno, tmppath, false, 0, false))
> 		elog(ERROR, "InstallXLogFileSegment should not have failed");
>
> Fortunately, the new function is very short and doesn't duplicate much code.
> The function is a simple getter, and the function name can convey the
> meaning directly (if the name is good.)
>
> > BTW, AFAICS, the latest patch
> > doesn't have code to address this point.
>
> Kirk-san, can you address this? I don't mind much whether you add an argument
> or a new function.

I may be missing something, so I'd like to check whether my understanding is
correct, as I'm confused about what exactly we mean by "cached value of
nblocks".

As discussed upthread, smgrnblocks() does not always guarantee that it
returns a "cached" nblocks, even in recovery. When we enter this path in the
recovery path of DropRelFileNodeBuffers, according to Tsunakawa-san:

>> * During recovery, DropRelFileNodeBuffers() gets the cached size of the
>> relation fork. If it is cached, trust it and optimize the buffer
>> invalidation. If it's not cached, we can't trust the return value of
>> smgrnblocks() because it's the lseek(END) return value, so we avoid the
>> optimization.

+	nTotalBlocks = smgrnblocks(smgr_reln, forkNum[j]);

But there is this comment in the smgrnblocks source code:

	 * For now, we only use cached values in recovery due to lack of a shared
	 * invalidation mechanism for changes in file size.
	 */
	if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
		return reln->smgr_cached_nblocks[forknum];

So the nblocks returned in DropRelFileNodeBuffers are still not guaranteed to
be "cached values"? And we want to add a new function (I think it's less
complicated than modifying smgrnblocks):

/*
 *	smgrnblocksvalid() -- Calculate the number of blocks that are cached in
 *						  the supplied relation.
 *
 * It is equivalent to calling smgrnblocks, but only used in recovery for now
 * when DropRelFileNodeBuffers() is called, to ensure that only the cached
 * value is used, which is always valid.
 *
 * This returns InvalidBlockNumber when smgr_cached_nblocks is not available
 * and when isCached is false.
 */
BlockNumber
smgrnblocksvalid(SMgrRelation reln, ForkNumber forknum, bool isCached)
{
	BlockNumber result;

	/*
	 * For now, we only use cached values in recovery due to lack of a shared
	 * invalidation mechanism for changes in file size.
	 */
	if (InRecovery &&
		reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber &&
		isCached)
		return reln->smgr_cached_nblocks[forknum];

	result = smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum);

	reln->smgr_cached_nblocks[forknum] = result;

	if (!InRecovery && !isCached)
		return InvalidBlockNumber;

	return result;
}

Then in DropRelFileNodeBuffers:

+	nTotalBlocks = smgrnblocksvalid(smgr_reln, forkNum[j], true);

Is my understanding above correct?

Regards,
Kirk Jamison
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com>
> Is my understanding above correct?

No. I simply meant DropRelFileNodeBuffers() calls the following function,
and avoids the optimization if it returns InvalidBlockNumber.

BlockNumber
smgrcachednblocks(SMgrRelation reln, ForkNumber forknum)
{
	return reln->smgr_cached_nblocks[forknum];
}

Regards
Takayuki Tsunakawa
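[A minimal caller-side sketch of the call pattern Tsunakawa-san describes;
the surrounding variable names are illustrative, not from the actual patch:]

	/* In DropRelFileNodeBuffers(): trust only a cached size */
	nTotalBlocks = smgrcachednblocks(smgr_reln, forkNum[j]);

	if (nTotalBlocks == InvalidBlockNumber)
	{
		/* size is not cached; fall back to the full buffer pool scan */
	}
	else
	{
		/* cached and trustworthy; do the optimized per-block invalidation */
	}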
On Monday, September 28, 2020 5:08 PM, Tsunakawa-san wrote:

> From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com>
> > Is my understanding above correct?
>
> No. I simply meant DropRelFileNodeBuffers() calls the following function,
> and avoids the optimization if it returns InvalidBlockNumber.
>
> BlockNumber
> smgrcachednblocks(SMgrRelation reln, ForkNumber forknum)
> {
> 	return reln->smgr_cached_nblocks[forknum];
> }

Thank you for clarifying. So in the new function, it goes something like:

	if (InRecovery)
	{
		if (reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
			return reln->smgr_cached_nblocks[forknum];
		else
			return InvalidBlockNumber;
	}

I've revised the patch and added the new function accordingly in the attached
file. I also did not remove the duplicate code from smgrnblocks, because
Amit-san mentioned that when the caching for non-recovery cases is
implemented, we can use it for non-recovery cases as well.

Although I am not sure whether the way it's written in DropRelFileNodeBuffers
is okay:

	nTotalBlocks = smgrcachednblocks(smgr_reln, forkNum[j]);
	nBlocksToInvalidate = nTotalBlocks - firstDelBlock[j];

	if (BlockNumberIsValid(nTotalBlocks) &&
		nBlocksToInvalidate < BUF_DROP_FULLSCAN_THRESHOLD)
	{
		//enter optimization loop
	}
	else
	{
		//full scan for each fork
	}

Regards,
Kirk Jamison
Attachment
At Mon, 28 Sep 2020 08:57:36 +0000, "k.jamison@fujitsu.com" <k.jamison@fujitsu.com> wrote in
> On Monday, September 28, 2020 5:08 PM, Tsunakawa-san wrote:
>
> > From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com>
> > > Is my understanding above correct?
> >
> > No. I simply meant DropRelFileNodeBuffers() calls the following function,
> > and avoids the optimization if it returns InvalidBlockNumber.
> >
> > BlockNumber
> > smgrcachednblocks(SMgrRelation reln, ForkNumber forknum)
> > {
> > 	return reln->smgr_cached_nblocks[forknum];
> > }
>
> Thank you for clarifying.

FWIW, I (and maybe Amit) am thinking that the property we need here is not
whether it is cached or not but the accuracy of the returned file length, and
that the "cached" property should be hidden behind the API.

Another reason for not adding this function is that the cached value is not
really reliable in a non-recovery environment.

> So in the new function, it goes something like:
> 	if (InRecovery)
> 	{
> 		if (reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
> 			return reln->smgr_cached_nblocks[forknum];
> 		else
> 			return InvalidBlockNumber;
> 	}

If we add the new function, it should return InvalidBlockNumber without
consulting smgr_nblocks().

> I've revised the patch and added the new function accordingly in the attached
> file. I also did not remove the duplicate code from smgrnblocks, because
> Amit-san mentioned that when the caching for non-recovery cases is
> implemented, we can use it for non-recovery cases as well.
>
> Although I am not sure whether the way it's written in DropRelFileNodeBuffers
> is okay:
>
> 	nTotalBlocks = smgrcachednblocks(smgr_reln, forkNum[j]);
> 	nBlocksToInvalidate = nTotalBlocks - firstDelBlock[j];
>
> 	if (BlockNumberIsValid(nTotalBlocks) &&
> 		nBlocksToInvalidate < BUF_DROP_FULLSCAN_THRESHOLD)
> 	{
> 		//enter optimization loop
> 	}
> 	else
> 	{
> 		//full scan for each fork
> 	}

Hmm. The current loop in DropRelFileNodeBuffers looks like this:

	if (InRecovery)
		for (for each fork)
			if (the fork meets the criteria)
				<optimized dropping>
			else
				<full scan>

I think this is somewhat different from the current discussion. Whether we
sum up the number of blocks for all forks or just use that of the main fork,
we should take a full scan if we failed to learn the accurate size of any one
of the forks. (In other words, it is pointless to run a full scan for more
than one fork at a drop.)

Come to think of it, we can naturally sum up all forks' blocks, since anyway
we need to call smgrnblocks for all forks to know whether the optimization is
usable. So that block would be something like this:

	for (forks of the rel)
		/* the function returns InvalidBlockNumber if !InRecovery */
		if (smgrnblocks returned InvalidBlockNumber)
			total_blocks = InvalidBlockNumber;
			break;
		total_blocks += nblocks of this fork

	/* <we could rely on the fact that InvalidBlockNumber is zero> */
	if (total_blocks != InvalidBlockNumber && total_blocks < threshold)
		for (forks of the rel)
			for (blocks of the fork)
				<try dropping the buffer for the block>
	else
		<full scan dropping>

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
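[Rendered as C, Horiguchi-san's outline might look like the following sketch;
the names total_blocks and threshold are illustrative only, and pinning and
locking are elided:]

	BlockNumber total_blocks = 0;

	/* Sum up the cached sizes of all forks; give up if any is unknown */
	for (i = 0; i < nforks; i++)
	{
		BlockNumber nblocks = smgrcachednblocks(smgr_reln, forkNum[i]);

		if (nblocks == InvalidBlockNumber)
		{
			total_blocks = InvalidBlockNumber;
			break;
		}
		total_blocks += nblocks;
	}

	if (total_blocks != InvalidBlockNumber && total_blocks < threshold)
	{
		/* optimized path: probe each (fork, block) in the buffer table */
	}
	else
	{
		/* fall back to a single sequential scan of the whole buffer pool */
	}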
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com>
> I also did not remove the duplicate code from smgrnblocks, because Amit-san
> mentioned that when the caching for non-recovery cases is implemented, we
> can use it for non-recovery cases as well.

But the extra code is not used now. Code for future usage should be added
when it becomes necessary. Duplicate code may make people think that you
should add an argument to smgrnblocks() instead of adding a new function.

+	if (reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
+		return reln->smgr_cached_nblocks[forknum];
+	else
+		return InvalidBlockNumber;

Anyway, the else block is redundant, as the variable already contains
InvalidBlockNumber.

Also, as Amit-san mentioned, the cause of the slight performance regression
when shared_buffers is small needs to be investigated and addressed. I think
you can do it after sharing the performance results with a large
shared_buffers.

I found no other problems.

Regards
Takayuki Tsunakawa
On Tue, Sep 29, 2020 at 7:21 AM tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com>
> Also, as Amit-san mentioned, the cause of the slight performance regression
> when shared_buffers is small needs to be investigated and addressed.
>

Yes, I think it is mainly because the extra instructions added in the
optimized code don't make up for the loss when the size of shared
buffers is small.

--
With Regards,
Amit Kapila.
On Tuesday, September 29, 2020 10:35 AM, Horiguchi-san wrote:

> FWIW, I (and maybe Amit) am thinking that the property we need here is not
> whether it is cached or not but the accuracy of the returned file length,
> and that the "cached" property should be hidden behind the API.
>
> Another reason for not adding this function is that the cached value is not
> really reliable in a non-recovery environment.
>
> > So in the new function, it goes something like:
> > 	if (InRecovery)
> > 	{
> > 		if (reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
> > 			return reln->smgr_cached_nblocks[forknum];
> > 		else
> > 			return InvalidBlockNumber;
> > 	}
>
> If we add the new function, it should return InvalidBlockNumber without
> consulting smgr_nblocks().

So here's how I revised it:

	smgrcachednblocks(SMgrRelation reln, ForkNumber forknum)
	{
		if (InRecovery)
		{
			if (reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
				return reln->smgr_cached_nblocks[forknum];
		}
		return InvalidBlockNumber;
	}

> Hmm. The current loop in DropRelFileNodeBuffers looks like this:
>
> 	if (InRecovery)
> 		for (for each fork)
> 			if (the fork meets the criteria)
> 				<optimized dropping>
> 			else
> 				<full scan>
>
> I think this is somewhat different from the current discussion. Whether we
> sum up the number of blocks for all forks or just use that of the main fork,
> we should take a full scan if we failed to learn the accurate size of any
> one of the forks. (In other words, it is pointless to run a full scan for
> more than one fork at a drop.)
>
> Come to think of it, we can naturally sum up all forks' blocks, since anyway
> we need to call smgrnblocks for all forks to know whether the optimization
> is usable.

I understand. We really don't have to enter the optimization when we know the
file size is inaccurate. That also makes the patch simpler.

> So that block would be something like this:
>
> 	for (forks of the rel)
> 		/* the function returns InvalidBlockNumber if !InRecovery */
> 		if (smgrnblocks returned InvalidBlockNumber)
> 			total_blocks = InvalidBlockNumber;
> 			break;
> 		total_blocks += nblocks of this fork
>
> 	/* <we could rely on the fact that InvalidBlockNumber is zero> */
> 	if (total_blocks != InvalidBlockNumber && total_blocks < threshold)
> 		for (forks of the rel)
> 			for (blocks of the fork)
> 				<try dropping the buffer for the block>
> 	else
> 		<full scan dropping>

I followed this logic in the attached patch. Thank you very much for the
thoughtful reviews. Performance measurements for large shared buffers to
follow.

Best regards,
Kirk Jamison
Attachment
Hi,

I revised the patch again. Attached is V19.
The previous patch's algorithm missed entering the optimization loop, so I
corrected that and removed the extra function I added in the previous
versions. The revised patch goes something like this:

	for (forks of rel)
	{
		if (smgrcachednblocks() == InvalidBlockNumber)
			break;	//go to full scan
		if (nBlocksToInvalidate < buf_full_scan_threshold)
			for (blocks of the fork)
		else
			break;	//go to full scan
	}
	<execute full scan>

Recovery performance measurement results below.
But it seems there is overhead even with large shared buffers.

| s_b   | master | patched | %reg  |
|-------|--------|---------|-------|
| 128MB | 36.052 | 39.451  | 8.62% |
| 1GB   | 21.731 | 21.73   | 0.00% |
| 20GB  | 24.534 | 25.137  | 2.40% |
| 100GB | 30.54  | 31.541  | 3.17% |

I'll investigate further. Or if you have any feedback or advice, I'd
appreciate it.

Machine specs used for testing:
RHEL7, 8 cores, 256 GB RAM, xfs

Configuration:
wal_level = replica
autovacuum = off
full_page_writes = off

# For streaming replication from primary.
synchronous_commit = remote_write
synchronous_standby_names = ''

# For standby.
#hot_standby = on
#primary_conninfo

shared_buffers = 128MB	# 1GB, 20GB, 100GB

Just in case it helps understanding, I also attached the recovery log
018_wal_optimize_node_replica.log with some ereport calls that print whether
we enter the optimization loop or do a full scan.

Regards,
Kirk Jamison
Attachment
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com>
> Recovery performance measurement results below.
> But it seems there is overhead even with large shared buffers.
>
> | s_b   | master | patched | %reg  |
> |-------|--------|---------|-------|
> | 128MB | 36.052 | 39.451  | 8.62% |
> | 1GB   | 21.731 | 21.73   | 0.00% |
> | 20GB  | 24.534 | 25.137  | 2.40% |
> | 100GB | 30.54  | 31.541  | 3.17% |

Did you really check that the optimization path is entered and the
traditional path is never entered? With the following code, when the main
fork does not meet the optimization criteria, other forks are not optimized
as well. You want to determine each fork's optimization separately, don't
you?

+		/* If blocks are invalid, exit the optimization and execute full scan */
+		if (nTotalBlocks == InvalidBlockNumber)
+			break;
+		else
+			break;
+	}

	for (i = 0; i < NBuffers; i++)

Regards
Takayuki Tsunakawa
On Thu, Oct 1, 2020 at 8:11 AM tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:
>
> From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com>
> > Recovery performance measurement results below.
> > But it seems there is overhead even with large shared buffers.
> >
> > | s_b   | master | patched | %reg  |
> > |-------|--------|---------|-------|
> > | 128MB | 36.052 | 39.451  | 8.62% |
> > | 1GB   | 21.731 | 21.73   | 0.00% |
> > | 20GB  | 24.534 | 25.137  | 2.40% |
> > | 100GB | 30.54  | 31.541  | 3.17% |
>
> Did you really check that the optimization path is entered and the
> traditional path is never entered?
>

I have one idea for performance testing. We can even test this for
non-recovery paths by removing the recovery-related check, i.e. only
use it when there are cached blocks. You can do this if testing via
the recovery path is difficult, because in the end performance should
be the same for recovery and non-recovery paths.

--
With Regards,
Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Amit Kapila <amit.kapila16@gmail.com>
> I have one idea for performance testing. We can even test this for
> non-recovery paths by removing the recovery-related check, i.e. only
> use it when there are cached blocks. You can do this if testing via
> the recovery path is difficult, because in the end performance should
> be the same for recovery and non-recovery paths.

That's a good idea.

Regards
Takayuki Tsunakawa
At Thu, 1 Oct 2020 02:40:52 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in
> With the following code, when the main fork does not meet the
> optimization criteria, other forks are not optimized as well. You
> want to determine each fork's optimization separately, don't you?

In more detail, if smgrcachednblocks() returned InvalidBlockNumber for
any of the forks, we should give up the optimization altogether, since
we need to run a full scan anyway. On the other hand, if any of the
forks is smaller than the threshold, we can still use the optimization
when we know the accurate block number of all the forks.

Still, I prefer to use the total block number of all forks, since we
anyway visit all the forks. Is there any reason to exclude forks other
than the main fork while we visit all of them already?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> In more detail, if smgrcachednblocks() returned InvalidBlockNumber for
> any of the forks, we should give up the optimization altogether, since
> we need to run a full scan anyway. On the other hand, if any of the
> forks is smaller than the threshold, we can still use the optimization
> when we know the accurate block number of all the forks.

Ah, I got your point (many eyes in open source development are nice.)
Still, I feel it's better to treat each fork separately, because the inner
loop in the traditional path may be able to skip forks that have already been
processed in the optimization path. For example, if the forks[] array
contains {fsm, vm, main} in this order (I know main is usually put at the
beginning), fsm and vm are processed in the optimization path, and the inner
loop in the traditional path can skip fsm and vm.

> Still, I prefer to use the total block number of all forks, since we
> anyway visit all the forks. Is there any reason to exclude forks other
> than the main fork while we visit all of them already?

When the number of cached blocks for the main fork is below the threshold but
the total cached blocks of all forks exceeds the threshold, the optimization
is skipped. I think that's mottainai (a waste).

Regards
Takayuki Tsunakawa
At Thu, 1 Oct 2020 04:20:27 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in
> From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> > In more detail, if smgrcachednblocks() returned InvalidBlockNumber for
> > any of the forks, we should give up the optimization altogether, since
> > we need to run a full scan anyway. On the other hand, if any of the
> > forks is smaller than the threshold, we can still use the optimization
> > when we know the accurate block number of all the forks.
>
> Ah, I got your point (many eyes in open source development are nice.)
> Still, I feel it's better to treat each fork separately, because the inner
> loop in the traditional path may be able to skip forks that have already
> been processed in the optimization path. For example, if the forks[] array
> contains {fsm, vm, main} in this order (I know main is usually put at the
> beginning), fsm and vm are processed in the optimization path, and the
> inner loop in the traditional path can skip fsm and vm.

I thought that the advantage of this optimization is that we don't
need to visit all buffers? If we need to run a full scan for any
reason, there's no point in looking up already-visited buffers
again. That's just wasteful cycles. Am I missing something?

> > Still, I prefer to use the total block number of all forks, since we
> > anyway visit all the forks. Is there any reason to exclude forks other
> > than the main fork while we visit all of them already?
>
> When the number of cached blocks for the main fork is below the threshold
> but the total cached blocks of all forks exceeds the threshold, the
> optimization is skipped. I think that's mottainai (a waste).

I don't understand. If we chose the optimized dropping, the reason is
that the number of buffer lookups is fewer than a certain threshold.
Why do you think that the fork kind a buffer belongs to is relevant to
the criteria?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thursday, October 1, 2020 11:49 AM, Amit Kapila wrote:
> On Thu, Oct 1, 2020 at 8:11 AM tsunakawa.takay@fujitsu.com
> <tsunakawa.takay@fujitsu.com> wrote:
> >
> > From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com>
> > > Recovery performance measurement results below.
> > > But it seems there is overhead even with large shared buffers.
> > >
> > > | s_b   | master | patched | %reg  |
> > > |-------|--------|---------|-------|
> > > | 128MB | 36.052 | 39.451  | 8.62% |
> > > | 1GB   | 21.731 | 21.73   | 0.00% |
> > > | 20GB  | 24.534 | 25.137  | 2.40% |
> > > | 100GB | 30.54  | 31.541  | 3.17% |
> >
> > Did you really check that the optimization path is entered and the
> > traditional path is never entered?

Oops. Thanks, Tsunakawa-san, for catching that. I will fix it in the next
patch, replacing break with continue.

> I have one idea for performance testing. We can even test this for
> non-recovery paths by removing the recovery-related check, i.e. only use
> it when there are cached blocks. You can do this if testing via the
> recovery path is difficult, because in the end performance should be the
> same for recovery and non-recovery paths.

For the non-recovery path, did you mean by any chance measuring the cache hit
rate for varying shared_buffers?

	SELECT
		sum(heap_blks_read) as heap_read,
		sum(heap_blks_hit) as heap_hit,
		sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) as ratio
	FROM
		pg_statio_user_tables;

Regards,
Kirk Jamison
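[A rough illustration of the break-to-continue fix Kirk-san mentions; a
sketch only, with illustrative names (use_full_scan) and locking elided. The
point is that a successful per-fork drop should proceed to the next fork
rather than unconditionally falling through to the full scan:]

	bool		use_full_scan = false;

	for (i = 0; i < nforks; i++)
	{
		/* Give up entirely if this fork's size cannot be trusted */
		if (nForkBlocks[i] == InvalidBlockNumber)
		{
			use_full_scan = true;
			break;
		}

		/* ... optimized per-block invalidation for this fork ... */

		continue;		/* v19 had "break" here, skipping the later forks */
	}

	if (use_full_scan)
	{
		/* ... single sequential scan over all NBuffers ... */
	}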
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com>
> For the non-recovery path, did you mean by any chance measuring the cache
> hit rate for varying shared_buffers?

No. You can test the speed of DropRelFileNodeBuffers() during normal
operation, i.e. by running TRUNCATE in psql, instead of performing recovery.
To enable that, you can just remove the checks for recovery, i.e. remove the
check for InRecovery and for whether the value is cached or not.

Regards
Takayuki Tsunakawa
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> I thought that the advantage of this optimization is that we don't
> need to visit all buffers? If we need to run a full scan for any
> reason, there's no point in looking up already-visited buffers
> again. That's just wasteful cycles. Am I missing something?
>
> I don't understand. If we chose the optimized dropping, the reason is
> that the number of buffer lookups is fewer than a certain threshold.
> Why do you think that the fork kind a buffer belongs to is relevant to
> the criteria?

I rethought this, and you certainly have a point, but... OK, I think I
understood. I should not have thought of it in a complicated way. In other
words, you're suggesting "Let's simply treat all forks as one relation to
determine whether to optimize," right? That is, the code simply becomes:

	Sum up the number of buffers to invalidate in all forks;
	if (the cached sizes of all forks are valid &&
		# of buffers to invalidate < THRESHOLD)
	{
		do the optimized way;
		return;
	}
	do the traditional way;

This will be simple, and I'm +1.

Regards
Takayuki Tsunakawa
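[A minimal C sketch of this agreed structure, borrowing the names
nForkBlocks[], nBlocksToInvalidate, and BUF_DROP_FULL_SCAN_THRESHOLD from
elsewhere in the thread; the per-block lookup, pinning, and locking are
elided, and "accurate" is an illustrative flag name:]

	BlockNumber	nForkBlocks[MAX_FORKNUM + 1];
	BlockNumber	nBlocksToInvalidate = 0;
	bool		accurate = true;

	/* Treat all forks as one relation when deciding whether to optimize */
	for (i = 0; i < nforks; i++)
	{
		nForkBlocks[i] = smgrcachednblocks(smgr_reln, forkNum[i]);
		if (nForkBlocks[i] == InvalidBlockNumber)
		{
			accurate = false;
			break;
		}
		nBlocksToInvalidate += (nForkBlocks[i] - firstDelBlock[i]);
	}

	if (accurate && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
	{
		/* optimized way: probe the buffer table per block, then return */
		return;
	}

	/* traditional way: sequential scan of the whole buffer pool */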
On Thursday, October 1, 2020 4:52 PM, Tsunakawa-san wrote:

> From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> > I thought that the advantage of this optimization is that we don't
> > need to visit all buffers? If we need to run a full scan for any
> > reason, there's no point in looking up already-visited buffers again.
> > That's just wasteful cycles. Am I missing something?
> >
> > I don't understand. If we chose the optimized dropping, the reason is
> > that the number of buffer lookups is fewer than a certain threshold.
> > Why do you think that the fork kind a buffer belongs to is relevant to
> > the criteria?
>
> I rethought this, and you certainly have a point, but... OK, I think I
> understood. I should not have thought of it in a complicated way. In other
> words, you're suggesting "Let's simply treat all forks as one relation to
> determine whether to optimize," right? That is, the code simply becomes:
>
> 	Sum up the number of buffers to invalidate in all forks;
> 	if (the cached sizes of all forks are valid &&
> 		# of buffers to invalidate < THRESHOLD)
> 	{
> 		do the optimized way;
> 		return;
> 	}
> 	do the traditional way;
>
> This will be simple, and I'm +1.

This is actually close to the v18 I posted trying Horiguchi-san's approach,
but that patch had a bug. So attached is an updated version (v20) trying this
approach again. I hope it's bug-free this time.

Regards,
Kirk Jamison
Attachment
At Thu, 1 Oct 2020 12:55:34 +0000, "k.jamison@fujitsu.com" <k.jamison@fujitsu.com> wrote in
> On Thursday, October 1, 2020 4:52 PM, Tsunakawa-san wrote:
>
> > From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> > > I thought that the advantage of this optimization is that we don't
> > > need to visit all buffers? If we need to run a full scan for any
> > > reason, there's no point in looking up already-visited buffers again.
> > > That's just wasteful cycles. Am I missing something?
> > >
> > > I don't understand. If we chose the optimized dropping, the reason is
> > > that the number of buffer lookups is fewer than a certain threshold.
> > > Why do you think that the fork kind a buffer belongs to is relevant to
> > > the criteria?
> >
> > I rethought this, and you certainly have a point, but... OK, I think I
> > understood. I should not have thought of it in a complicated way. In
> > other words, you're suggesting "Let's simply treat all forks as one
> > relation to determine whether to optimize," right? That is, the code
> > simply becomes:

Exactly. The concept of the threshold is that if we are expected to repeat
buffer lookups more often than that, we consider a one-time full scan more
efficient. Since we know we are going to drop the buffers of all (or the
specified) forks of the relation at once, the number of lookups is naturally
the sum of the expected number of buffers of all forks.

> > whether to optimize," right? That is, the code simply becomes:
> >
> > 	Sum up the number of buffers to invalidate in all forks;
> > 	if (the cached sizes of all forks are valid &&
> > 		# of buffers to invalidate < THRESHOLD)
> > 	{
> > 		do the optimized way;
> > 		return;
> > 	}
> > 	do the traditional way;
> >
> > This will be simple, and I'm +1.

Thanks!

> This is actually close to the v18 I posted trying Horiguchi-san's approach,
> but that patch had a bug. So attached is an updated version (v20) trying
> this approach again. I hope it's bug-free this time.

Thanks for the new version.

- * XXX currently it sequentially searches the buffer pool, should be
- * changed to more clever ways of searching.  However, this routine
- * is used only in code paths that aren't very performance-critical,
- * and we shouldn't slow down the hot paths to make it faster ...
+ * XXX The relation might have extended before this, so this path is

The following description is found in the comment for FlushRelationBuffers.

> * XXX currently it sequentially searches the buffer pool, should be
> * changed to more clever ways of searching.  This routine is not
> * used in any performance-critical code paths, so it's not worth
> * adding additional overhead to normal paths to make it go faster;
> * but see also DropRelFileNodeBuffers.

This looks to me like "We won't do that kind of optimization for
FlushRelationBuffers, but DropRelFileNodeBuffers would need it". If so, don't
we need to revise that comment together with this change?

- * XXX currently it sequentially searches the buffer pool, should be
- * changed to more clever ways of searching.  However, this routine
- * is used only in code paths that aren't very performance-critical,
- * and we shouldn't slow down the hot paths to make it faster ...
+ * XXX The relation might have extended before this, so this path is
+ * only optimized during recovery when we can get a reliable cached
+ * value of blocks for specified relation.  In addition, it is safe to
+ * do this since there are no other processes but the startup process
+ * that changes the relation size during recovery.  Otherwise, or if
+ * not in recovery, proceed to usual invalidation process, where it
+ * sequentially searches the buffer pool.

This should no longer be an XXX comment. It also seems to me to describe
things in too much detail for this function's level. How about something like
the following? (except its syntax, or phrasing:p)

===
If the expected maximum number of buffers to drop is small enough
compared to NBuffers, individual buffers are located by
BufTableLookup. Otherwise we scan through all buffers. Since we
mustn't leave a buffer behind, we take the latter way unless the
number is not reliably identified. See smgrcachednblocks() for
details.
===

(I'm still mildly opposed to the function name, which seems to expose detail
too much.)

+		 * Get the total number of cached blocks and to-be-invalidated blocks
+		 * of the relation.  If a fork's nblocks is not valid, break the loop.

The number of file blocks is not usually equal to the number of existing
buffers for the file. We might need to explain that limitation here.

+		for (j = 0; j < nforks; j++)

Though I understand that j is considered to be in connection with fork
number, I'm a bit uncomfortable that j is used for the outermost loop.

+			for (curBlock = firstDelBlock[j]; curBlock < nTotalBlocks; curBlock++)

Mmm. We should compare curBlock with the number of blocks of the fork, not
the total of all forks.

+				uint32		newHash;		/* hash value for newTag */
+				BufferTag	newTag;			/* identity of requested block */
+				LWLock	   *newPartitionLock;	/* buffer partition lock for it */

It seems to be copied from somewhere, but the buffer is not new at all.

+				if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
+					bufHdr->tag.forkNum == forkNum[j] &&
+					bufHdr->tag.blockNum == curBlock)
+					InvalidateBuffer(bufHdr);	/* releases spinlock */

I think it cannot happen that the block is used for a different block of the
same relation-fork, but it could be safer to check
bufHdr->tag.blockNum >= firstDelBlock[j] instead.

+/*
+ *	smgrcachednblocks() -- Calculate the number of blocks that are cached in
+ *						   the supplied relation.
+ *
+ * It is equivalent to calling smgrnblocks, but only used in recovery for now
+ * when DropRelFileNodeBuffers() is called.  This ensures that only cached
+ * value is used which is always valid in recovery, since there is no shared
+ * invalidation mechanism that is implemented yet for changes in file size.
+ *
+ * This returns an InvalidBlockNumber when smgr_cached_nblocks is not
+ * available and when not in recovery.

Isn't this too concrete? We need to mention the buggy-kernel issue here
rather than that of the callers.

And if the comment is correct, we should Assert(InRecovery) at the beginning
of this function.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
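[For readers following the review, a sketch of the per-block buffer-table
probe that BufTableLookup enables, assuming the existing INIT_BUFFERTAG,
BufTableHashCode, and BufTableLookup internals; partition locking and the
buffer-header re-checks are elided:]

	for (curBlock = firstDelBlock[j]; curBlock < nForkBlocks[j]; curBlock++)
	{
		BufferTag	tag;		/* identity of the block we want to drop */
		uint32		hash;		/* hash value for the tag */
		int			buf_id;

		INIT_BUFFERTAG(tag, rnode.node, forkNum[j], curBlock);
		hash = BufTableHashCode(&tag);

		/* (the matching buffer partition lock must be held here) */
		buf_id = BufTableLookup(&tag, hash);
		if (buf_id >= 0)
		{
			/* re-check the header tag, then InvalidateBuffer() */
		}
	}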
At Fri, 02 Oct 2020 11:44:46 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> At Thu, 1 Oct 2020 12:55:34 +0000, "k.jamison@fujitsu.com" <k.jamison@fujitsu.com> wrote in
> - * XXX currently it sequentially searches the buffer pool, should be
> - * changed to more clever ways of searching.  However, this routine
> - * is used only in code paths that aren't very performance-critical,
> - * and we shouldn't slow down the hot paths to make it faster ...
> + * XXX The relation might have extended before this, so this path is
> + * only optimized during recovery when we can get a reliable cached
> + * value of blocks for specified relation.  In addition, it is safe to
> + * do this since there are no other processes but the startup process
> + * that changes the relation size during recovery.  Otherwise, or if
> + * not in recovery, proceed to usual invalidation process, where it
> + * sequentially searches the buffer pool.
>
> This should no longer be an XXX comment. It also seems to me to describe
> things in too much detail for this function's level. How about something
> like the following? (except its syntax, or phrasing:p)
>
> ===
> If the expected maximum number of buffers to drop is small enough
> compared to NBuffers, individual buffers are located by
> BufTableLookup. Otherwise we scan through all buffers. Since we
> mustn't leave a buffer behind, we take the latter way unless the
> number is not reliably identified. See smgrcachednblocks() for
> details.
> ===

The second-to-last phrase is inverted, and there were some typos. FWIW this
is the revised version.

====
If we are expected to drop few enough buffers, we locate individual
buffers using BufTableLookup. Otherwise we scan through all
buffers. Since we mustn't leave a buffer behind, we take the latter
way unless the sizes of all the involved forks are known to be
accurate. See smgrcachednblocks() for details.
====

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Friday, October 2, 2020 11:45 AM, Horiguchi-san wrote:

> Thanks for the new version.

Thank you for your thoughtful reviews!
I've attached an updated patch addressing the comments below.

1.
> The following description is found in the comment for FlushRelationBuffers.
>
> > * XXX currently it sequentially searches the buffer pool, should be
> > * changed to more clever ways of searching.  This routine is not
> > * used in any performance-critical code paths, so it's not worth
> > * adding additional overhead to normal paths to make it go faster;
> > * but see also DropRelFileNodeBuffers.
>
> This looks to me like "We won't do that kind of optimization for
> FlushRelationBuffers, but DropRelFileNodeBuffers would need it". If so,
> don't we need to revise that comment together with this change?

Yes, but instead of combining them, I just removed the sentence in
FlushRelationBuffers that refers to DropRelFileNodeBuffers. I think it meant
the same thing about using more clever ways of searching, but that remark is
no longer applicable to DropRelFileNodeBuffers because of the optimization.

-	 * adding additional overhead to normal paths to make it go faster;
-	 * but see also DropRelFileNodeBuffers.
+	 * adding additional overhead to normal paths to make it go faster.

2.
> - * XXX currently it sequentially searches the buffer pool, should be
> - * changed to more clever ways of searching.  However, this routine
> - * is used only in code paths that aren't very performance-critical,
> - * and we shouldn't slow down the hot paths to make it faster ...

I revised and removed most of this code comment in DropRelFileNodeBuffers,
because isn't that the point of the optimization: to make the path faster for
the performance cases we've tackled in this thread?

3.
> This should no longer be an XXX comment.

Alright. I've fixed it.

4.
> It also seems to me to describe things in too much detail for this
> function's level. How about something like the following? (except its
> syntax, or phrasing:p)
> ====
> If we are expected to drop few enough buffers, we locate individual
> buffers using BufTableLookup. Otherwise we scan through all buffers. Since
> we mustn't leave a buffer behind, we take the latter way unless the sizes
> of all the involved forks are known to be accurate. See
> smgrcachednblocks() for details.
> ====

Sure. I paraphrased it like below.

	If the expected maximum number of buffers to be dropped is small
	enough, individual buffers are located by BufTableLookup().  Otherwise,
	the buffer pool is sequentially scanned.  Since buffers must not be
	left behind, the latter way is executed unless the sizes of all the
	involved forks are known to be accurate.  See smgrcachednblocks() for
	more details.

5.
> (I'm still mildly opposed to the function name, which seems to expose
> detail too much.)

I can't think of a better name, though smgrcachednblocks seems
straightforward to me. I understand that it may be confused with the relation
property smgr_cached_nblocks, but isn't that what we're getting in the
function?

6.
> +		 * Get the total number of cached blocks and to-be-invalidated blocks
> +		 * of the relation.  If a fork's nblocks is not valid, break the loop.
>
> The number of file blocks is not usually equal to the number of existing
> buffers for the file. We might need to explain that limitation here.

I revised that comment like below:

	Get the total number of cached blocks and to-be-invalidated blocks of
	the relation.  The cached value returned by smgrcachednblocks could be
	smaller than the actual number of existing buffers of the file.  This
	is caused by buggy Linux kernels that might not have accounted for the
	recent write.  If a fork's nblocks is invalid, exit the loop.

7.
> +		for (j = 0; j < nforks; j++)
>
> Though I understand that j is considered to be in connection with fork
> number, I'm a bit uncomfortable that j is used for the outermost loop.

I agree. We should use i for the outer loop for consistency.

8.
> +			for (curBlock = firstDelBlock[j]; curBlock < nTotalBlocks; curBlock++)
>
> Mmm. We should compare curBlock with the number of blocks of the fork, not
> the total of all forks.

Oops. Yes. That should be nForkBlocks, so we have to call smgrcachednblocks()
again in the optimization loop for each fork.

9.
> +				uint32		newHash;		/* hash value for newTag */
> +				BufferTag	newTag;			/* identity of requested block */
> +				LWLock	   *newPartitionLock;	/* buffer partition lock for it */
>
> It seems to be copied from somewhere, but the buffer is not new at all.

Thanks for catching that. Yeah. Fixed.

10.
> +				if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> +					bufHdr->tag.forkNum == forkNum[j] &&
> +					bufHdr->tag.blockNum == curBlock)
> +					InvalidateBuffer(bufHdr);	/* releases spinlock */
>
> I think it cannot happen that the block is used for a different block of
> the same relation-fork, but it could be safer to check
> bufHdr->tag.blockNum >= firstDelBlock[j] instead.

Understood, and that's fine with me. Updated.

11.
> + *	smgrcachednblocks() -- Calculate the number of blocks that are cached in
> + *						   the supplied relation.
> + *
> + * It is equivalent to calling smgrnblocks, but only used in recovery for
> + * now when DropRelFileNodeBuffers() is called.  This ensures that only
> + * cached value is used which is always valid in recovery, since there is
> + * no shared invalidation mechanism that is implemented yet for changes in
> + * file size.
> + *
> + * This returns an InvalidBlockNumber when smgr_cached_nblocks is not
> + * available and when not in recovery.
>
> Isn't this too concrete? We need to mention the buggy-kernel issue here
> rather than that of the callers.
>
> And if the comment is correct, we should Assert(InRecovery) at the
> beginning of this function.

I did not add the assert because it causes the recovery TAP test to fail.
However, I updated the function description like below:

	It is equivalent to calling smgrnblocks, but only used in recovery for
	now.  The returned value of the file size could be inaccurate because
	the lseek of buggy Linux kernels might not have accounted for the
	recent file extension or write.  However, this function ensures that
	cached values are only used in recovery, since there is no shared
	invalidation mechanism implemented yet for changes in file size.

	This returns InvalidBlockNumber when smgr_cached_nblocks is not
	available and when not in recovery.

Thanks a lot for the reviews. If there are any more comments, feedback, or
points I might have missed, please feel free to reply.

Best regards,
Kirk Jamison
Attachment
On Fri, Oct 2, 2020 at 8:14 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Thu, 1 Oct 2020 12:55:34 +0000, "k.jamison@fujitsu.com" <k.jamison@fujitsu.com> wrote in
> > On Thursday, October 1, 2020 4:52 PM, Tsunakawa-san wrote:
>
> (I'm still mildly opposed to the function name, which seems to expose
> detail too much.)
>

Do you have any better proposal? BTW, I am still not sure whether it is
a good idea to expose a new API for this, especially because we do
exactly the same thing in the existing function smgrnblocks. Why not
just add a new bool *cached parameter to smgrnblocks which will be set
if we return the cached value? I understand that we need to change the
code wherever we call smgrnblocks, and maybe even extensions if they
call this function, but it is not clear to me that that is a big deal.
What do you think? I am not opposed to introducing the new API, but I
feel that adding a new parameter to the existing API to handle this
case is a better option.

--
With Regards,
Amit Kapila.
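[A minimal sketch of the signature change Amit suggests, built from the smgr
internals quoted earlier in the thread; treating a NULL pointer as "caller
not interested" is an assumption here (it is in fact suggested later in the
thread):]

	BlockNumber
	smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool *cached)
	{
		BlockNumber result;

		/* Use the cached value if it is trustworthy (recovery only, for now) */
		if (InRecovery &&
			reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
		{
			if (cached)
				*cached = true;
			return reln->smgr_cached_nblocks[forknum];
		}

		result = smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum);
		reln->smgr_cached_nblocks[forknum] = result;

		if (cached)
			*cached = false;
		return result;
	}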
On Mon, Oct 5, 2020 at 6:59 AM k.jamison@fujitsu.com
<k.jamison@fujitsu.com> wrote:
>
> On Friday, October 2, 2020 11:45 AM, Horiguchi-san wrote:
>
> > Thanks for the new version.
>
> Thank you for your thoughtful reviews!
> I've attached an updated patch addressing the comments below.
>

Few comments:
===============
1.
@@ -2990,10 +3002,80 @@ DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
 		return;
 	}

+	/*
+	 * Get the total number of cached blocks and to-be-invalidated blocks
+	 * of the relation.  The cached value returned by smgrcachednblocks
+	 * could be smaller than the actual number of existing buffers of the
+	 * file.  This is caused by buggy Linux kernels that might not have
+	 * accounted the recent write.  If a fork's nblocks is invalid, exit loop.
+	 */
+	for (i = 0; i < nforks; i++)
+	{
+		/* Get the total nblocks for a relation's fork */
+		nForkBlocks = smgrcachednblocks(smgr_reln, forkNum[i]);
+
+		if (nForkBlocks == InvalidBlockNumber)
+		{
+			nTotalBlocks = InvalidBlockNumber;
+			break;
+		}
+		nTotalBlocks += nForkBlocks;
+		nBlocksToInvalidate = nTotalBlocks - firstDelBlock[i];
+	}
+
+	/*
+	 * Do explicit hashtable probe if the total of nblocks of relation's forks
+	 * is not invalid and the nblocks to be invalidated is less than the
+	 * full-scan threshold of buffer pool.  Otherwise, full scan is executed.
+	 */
+	if (nTotalBlocks != InvalidBlockNumber &&
+		nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
+	{
+		for (j = 0; j < nforks; j++)
+		{
+			BlockNumber	curBlock;
+
+			nForkBlocks = smgrcachednblocks(smgr_reln, forkNum[j]);
+
+			for (curBlock = firstDelBlock[j]; curBlock < nForkBlocks; curBlock++)

What if one or more of the forks doesn't have a cached value? I think
the patch will skip such forks and will invalidate/unpin buffers for the
others. You probably need a local array of nForkBlocks which will be
formed the first time and then used in the second loop. You also in some
way need to handle the case where that local array doesn't have cached
blocks.

2. Also, the other thing is I have asked for some testing to avoid the
small regression we have for a smaller number of shared buffers, which I
don't see the results for, nor any change in the code. I think it is
better if you post the pending/open items each time you post a new
version of the patch.

--
With Regards,
Amit Kapila.
On Monday, October 5, 2020 3:30 PM, Amit Kapila wrote:

> +	for (i = 0; i < nforks; i++)
> +	{
> +		/* Get the total nblocks for a relation's fork */
> +		nForkBlocks = smgrcachednblocks(smgr_reln, forkNum[i]);
> +
> +		if (nForkBlocks == InvalidBlockNumber)
> +		{
> +			nTotalBlocks = InvalidBlockNumber;
> +			break;
> +		}
> +		nTotalBlocks += nForkBlocks;
> +		nBlocksToInvalidate = nTotalBlocks - firstDelBlock[i];
> +	}
> +
> +	/*
> +	 * Do explicit hashtable probe if the total of nblocks of relation's forks
> +	 * is not invalid and the nblocks to be invalidated is less than the
> +	 * full-scan threshold of buffer pool.  Otherwise, full scan is executed.
> +	 */
> +	if (nTotalBlocks != InvalidBlockNumber &&
> +		nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
> +	{
> +		for (j = 0; j < nforks; j++)
> +		{
> +			BlockNumber	curBlock;
> +
> +			nForkBlocks = smgrcachednblocks(smgr_reln, forkNum[j]);
> +
> +			for (curBlock = firstDelBlock[j]; curBlock < nForkBlocks; curBlock++)
>
> What if one or more of the forks doesn't have a cached value? I think the
> patch will skip such forks and will invalidate/unpin buffers for the others.

Not having a cached value is equivalent to InvalidBlockNumber, right?
Maybe I'm missing something? But in the first loop we are already doing the
pre-check of whether or not one of the forks doesn't have a cached value. If
it's not cached, then nTotalBlocks is set to InvalidBlockNumber, so we won't
enter the optimization loop and will just execute the full-scan buffer
invalidation process.

> You probably
> need a local array of nForkBlocks which will be formed the first time and
> then used in the second loop. You also in some way need to handle the case
> where that local array doesn't have cached blocks.

Understood. That would be cleaner.

	BlockNumber	nForkBlocks[MAX_FORKNUM];

As for handling whether the local array is empty, I think the first loop
covers it, and there's no need to pre-check whether the array is empty again
in the second loop.

	for (i = 0; i < nforks; i++)
	{
		nForkBlocks[i] = smgrcachednblocks(smgr_reln, forkNum[i]);

		if (nForkBlocks[i] == InvalidBlockNumber)
		{
			nTotalBlocks = InvalidBlockNumber;
			break;
		}
		nTotalBlocks += nForkBlocks[i];
		nBlocksToInvalidate = nTotalBlocks - firstDelBlock[i];
	}

> 2. Also, the other thing is I have asked for some testing to avoid the
> small regression we have for a smaller number of shared buffers, which I
> don't see the results for, nor any change in the code. I think it is better
> if you post the pending/open items each time you post a new version of the
> patch.

Ah. Apologies for forgetting to include updates about that. Since I keep
updating the patch, I've decided not to post results yet, as performance may
vary per patch update due to possible bugs. But for the performance case of
not using the recovery check, I just removed it as below. Does that meet the
intention?

	BlockNumber
	smgrcachednblocks(SMgrRelation reln, ForkNumber forknum)
	{
-		if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
+		if (reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
			return reln->smgr_cached_nblocks[forknum];

Regards,
Kirk Jamison
On Mon, Oct 5, 2020 at 3:04 PM k.jamison@fujitsu.com
<k.jamison@fujitsu.com> wrote:
>
> On Monday, October 5, 2020 3:30 PM, Amit Kapila wrote:
>
> > +	for (i = 0; i < nforks; i++)
> > +	{
> > +		/* Get the total nblocks for a relation's fork */
> > +		nForkBlocks = smgrcachednblocks(smgr_reln, forkNum[i]);
> > +
> > +		if (nForkBlocks == InvalidBlockNumber)
> > +		{
> > +			nTotalBlocks = InvalidBlockNumber;
> > +			break;
> > +		}
> > +		nTotalBlocks += nForkBlocks;
> > +		nBlocksToInvalidate = nTotalBlocks - firstDelBlock[i];
> > +	}
> > +
> > +	if (nTotalBlocks != InvalidBlockNumber &&
> > +		nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
> > +	{
> > +		for (j = 0; j < nforks; j++)
> > +		{
> > +			BlockNumber	curBlock;
> > +
> > +			nForkBlocks = smgrcachednblocks(smgr_reln, forkNum[j]);
> > +
> > +			for (curBlock = firstDelBlock[j]; curBlock < nForkBlocks; curBlock++)
> >
> > What if one or more of the forks doesn't have a cached value? I think the
> > patch will skip such forks and will invalidate/unpin buffers for the others.
>
> Not having a cached value is equivalent to InvalidBlockNumber, right?
> Maybe I'm missing something? But in the first loop we are already doing the
> pre-check of whether or not one of the forks doesn't have a cached value.
> If it's not cached, then nTotalBlocks is set to InvalidBlockNumber, so we
> won't enter the optimization loop and will just execute the full-scan buffer
> invalidation process.
>

Oh, I had missed that, so the existing code will work fine for that case.

> > You probably
> > need a local array of nForkBlocks which will be formed the first time and
> > then used in the second loop. You also in some way need to handle the case
> > where that local array doesn't have cached blocks.
>
> Understood. That would be cleaner.
> 	BlockNumber	nForkBlocks[MAX_FORKNUM];
>
> As for handling whether the local array is empty, I think the first loop
> covers it, and there's no need to pre-check whether the array is empty again
> in the second loop.
> 	for (i = 0; i < nforks; i++)
> 	{
> 		nForkBlocks[i] = smgrcachednblocks(smgr_reln, forkNum[i]);
>
> 		if (nForkBlocks[i] == InvalidBlockNumber)
> 		{
> 			nTotalBlocks = InvalidBlockNumber;
> 			break;
> 		}
> 		nTotalBlocks += nForkBlocks[i];
> 		nBlocksToInvalidate = nTotalBlocks - firstDelBlock[i];
> 	}
>

This appears okay.

> > 2. Also, the other thing is I have asked for some testing to avoid the
> > small regression we have for a smaller number of shared buffers, which I
> > don't see the results for, nor any change in the code. I think it is
> > better if you post the pending/open items each time you post a new
> > version of the patch.
>
> Ah. Apologies for forgetting to include updates about that. Since I keep
> updating the patch, I've decided not to post results yet, as performance may
> vary per patch update due to possible bugs.
> But for the performance case of not using the recovery check, I just removed
> it as below. Does that meet the intention?
>
> 	BlockNumber
> 	smgrcachednblocks(SMgrRelation reln, ForkNumber forknum)
> 	{
> -		if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
> +		if (reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
> 			return reln->smgr_cached_nblocks[forknum];
>

Yes, we can do that for the purpose of testing.

--
With Regards,
Amit Kapila.
On Monday, October 5, 2020 8:50 PM, Amit Kapila wrote:
> On Mon, Oct 5, 2020 at 3:04 PM k.jamison@fujitsu.com
> > > 2. Also, the other thing is I have asked for some testing to avoid
> > > the small regression we have for a smaller number of shared buffers,
> > > which I don't see the results for, nor any change in the code. I think
> > > it is better if you post the pending/open items each time you post a
> > > new version of the patch.
> >
> > Ah. Apologies for forgetting to include updates about that. Since I keep
> > updating the patch, I've decided not to post results yet, as performance
> > may vary per patch update due to possible bugs.
> > But for the performance case of not using the recovery check, I just
> > removed it as below. Does that meet the intention?
> >
> > 	BlockNumber
> > 	smgrcachednblocks(SMgrRelation reln, ForkNumber forknum)
> > 	{
> > -		if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
> > +		if (reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
> > 			return reln->smgr_cached_nblocks[forknum];
> >
>
> Yes, we can do that for the purpose of testing.

With the latest patches attached, and with the recovery check removed in
smgrnblocks, I tested the performance of VACUUM.
(3 trial runs, 3.5 GB db populated with 1000 tables)

Execution Time (seconds)
| s_b   | master | patched | %reg     |
|-------|--------|---------|----------|
| 128MB | 15.265 | 15.260  | -0.03%   |
| 1GB   | 14.808 | 15.009  | 1.34%    |
| 20GB  | 24.673 | 11.681  | -111.22% |
| 100GB | 74.298 | 11.724  | -533.73% |

These are good results, and we can see the improvements for large shared
buffers. For small s_b, the performance is almost the same.

I repeated the recovery performance test from the previous mail and ran three
trials for each shared_buffers setting. We can also clearly see the
improvement here.

Recovery Time (seconds)
| s_b   | master | patched | %reg   |
|-------|--------|---------|--------|
| 128MB | 3.043  | 3.010   | -1.10% |
| 1GB   | 3.417  | 3.477   | 1.73%  |
| 20GB  | 20.597 | 2.409   | -755%  |
| 100GB | 66.862 | 2.409   | -2676% |

For default and small shared_buffers, the recovery performance is almost the
same. But for bigger shared_buffers, we can see the benefit and improvement.
For 20GB, from 20.597 s to 2.409 s. For 100GB s_b, from 66.862 s to 2.409 s.

I have updated the latest patches, with 0002 being the new one. Instead of
introducing a new API, I just added the bool parameter to smgrnblocks and
modified its callers. Comments and feedback are highly appreciated.

Regards,
Kirk Jamison
Attachment
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com>
> With the latest patches attached, and with the recovery check removed in
> smgrnblocks, I tested the performance of VACUUM.
> (3 trial runs, 3.5 GB db populated with 1000 tables)
>
> Execution Time (seconds)
> | s_b   | master | patched | %reg     |
> |-------|--------|---------|----------|
> | 128MB | 15.265 | 15.260  | -0.03%   |
> | 1GB   | 14.808 | 15.009  | 1.34%    |
> | 20GB  | 24.673 | 11.681  | -111.22% |
> | 100GB | 74.298 | 11.724  | -533.73% |
>
> These are good results, and we can see the improvements for large shared
> buffers. For small s_b, the performance is almost the same.

Very nice! I'll try to review the patch again soon.

Regards
Takayuki Tsunakawa
RE: [Patch] Optimize dropping of relation buffers using dlist
Hi Kirk san,

(1)
+ * This returns an InvalidBlockNumber when smgr_cached_nblocks is not
+ * available and when not in recovery path.

+	/*
+	 * We cannot believe the result from smgr_nblocks is always accurate
+	 * because lseek of buggy Linux kernels doesn't account for a recent
+	 * write.
+	 */
+	if (!InRecovery && result == InvalidBlockNumber)
+		return InvalidBlockNumber;
+

These are unnecessary, because mdnblocks() never returns InvalidBlockNumber,
and consequently smgrnblocks() doesn't return InvalidBlockNumber.

(2)
+smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool *isCached)

I think it's better to make the argument name iscached, so that its lack of
camel case aligns with forknum, which is not forkNum.

(3)
+	 * This is caused by buggy Linux kernels that might not have accounted
+	 * the recent write.  If a fork's nblocks is invalid, exit loop.

Is "accounted for" the right English?
I think the second sentence should be described in terms of its meaning, not
the program logic. For example, something like "Give up the optimization if
the block count of any fork cannot be trusted."

Likewise, express the following part in terms of semantics:

+	 * Do explicit hashtable lookup if the total of nblocks of relation's forks
+	 * is not invalid and the nblocks to be invalidated is less than the

(4)
+		if (nForkBlocks[i] == InvalidBlockNumber)
+		{
+			nTotalBlocks = InvalidBlockNumber;
+			break;
+		}

Use isCached in the if condition, because smgrnblocks() doesn't return
InvalidBlockNumber.

(5)
+			nBlocksToInvalidate = nTotalBlocks - firstDelBlock[i];

should be

+			nBlocksToInvalidate += (nForkBlocks[i] - firstDelBlock[i]);

(6)
+					bufHdr->tag.blockNum >= firstDelBlock[j])
+					InvalidateBuffer(bufHdr);	/* releases spinlock */

The right side of >= should be cur_block.

Regards
Takayuki Tsunakawa
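[To see why point (5) matters, a small worked example with made-up figures
for two forks; the numbers are illustrative only:]

	/*
	 * Suppose fork 0 has nForkBlocks = 100 and firstDelBlock = 90
	 * (10 buffers to invalidate), and fork 1 has nForkBlocks = 20 and
	 * firstDelBlock = 0 (20 buffers to invalidate).
	 *
	 * The corrected form accumulates per-fork counts:
	 *     nBlocksToInvalidate += (nForkBlocks[i] - firstDelBlock[i]);
	 * giving 10 + 20 = 30.
	 *
	 * The original form overwrites the count with a cross-fork mixture:
	 *     nBlocksToInvalidate = nTotalBlocks - firstDelBlock[i];
	 * ending at 120 - 0 = 120 on the last iteration, which would wrongly
	 * push the relation over the full-scan threshold.
	 */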
On Thursday, October 8, 2020 3:38 PM, Tsunakawa-san wrote:

> Hi Kirk san,

Thank you for looking into my patches!

> (1)
> + * This returns an InvalidBlockNumber when smgr_cached_nblocks is not
> + * available and when not in recovery path.
>
> +	/*
> +	 * We cannot believe the result from smgr_nblocks is always accurate
> +	 * because lseek of buggy Linux kernels doesn't account for a recent
> +	 * write.
> +	 */
> +	if (!InRecovery && result == InvalidBlockNumber)
> +		return InvalidBlockNumber;
> +
>
> These are unnecessary, because mdnblocks() never returns
> InvalidBlockNumber, and consequently smgrnblocks() doesn't return
> InvalidBlockNumber.

Yes. Thanks for looking into that so carefully. Removed.

> (2)
> +smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool *isCached)
>
> I think it's better to make the argument name iscached, so that its lack of
> camel case aligns with forknum, which is not forkNum.

This is somewhat tricky, because the surrounding code follows an inconsistent
coding style too. So I just followed the same style as below and retained the
change.

	extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
	extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
						   BlockNumber blocknum, char *buffer, bool skipFsync);

> (3)
> +	 * This is caused by buggy Linux kernels that might not have accounted
> +	 * the recent write.  If a fork's nblocks is invalid, exit loop.
>
> Is "accounted for" the right English?
> I think the second sentence should be described in terms of its meaning,
> not the program logic. For example, something like "Give up the
> optimization if the block count of any fork cannot be trusted."

Fixed.

> Likewise, express the following part in terms of semantics:
>
> +	 * Do explicit hashtable lookup if the total of nblocks of relation's forks
> +	 * is not invalid and the nblocks to be invalidated is less than the

I revised it like below:

	"Look up the buffer in the hashtable if the block size is known to be
	accurate and the total blocks to be invalidated is below the full scan
	threshold.  Otherwise, give up the optimization."

> (4)
> +		if (nForkBlocks[i] == InvalidBlockNumber)
> +		{
> +			nTotalBlocks = InvalidBlockNumber;
> +			break;
> +		}
>
> Use isCached in the if condition, because smgrnblocks() doesn't return
> InvalidBlockNumber.

Fixed: if (!isCached)

> (5)
> +			nBlocksToInvalidate = nTotalBlocks - firstDelBlock[i];
>
> should be
>
> +			nBlocksToInvalidate += (nForkBlocks[i] - firstDelBlock[i]);

Fixed.

> (6)
> +					bufHdr->tag.blockNum >= firstDelBlock[j])
> +					InvalidateBuffer(bufHdr);	/* releases spinlock */
>
> The right side of >= should be cur_block.

Fixed.

Attached are the updated patches. Thank you again for the reviews.

Regards,
Kirk Jamison
Hi,

> Attached are the updated patches.

Sorry, there was an error in the 3rd patch, so attached is a rebased one.

Regards,
Kirk Jamison
From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com>
> > (6)
> > +				bufHdr->tag.blockNum >= firstDelBlock[j])
> > +				InvalidateBuffer(bufHdr);	/* releases spinlock */
> >
> > The right side of >= should be cur_block.
>
> Fixed.

>= should be =, shouldn't it?

Please measure and let us see just the recovery performance again, because the critical part of the patch was modified. If the performance is as good as the previous one, and there's no review interaction with others in progress, I'll mark the patch as ready for committer in a few days.

Regards
Takayuki Tsunakawa
At Fri, 9 Oct 2020 00:41:24 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in
> From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com>
> > > (6)
> > > +				bufHdr->tag.blockNum >= firstDelBlock[j])
> > > +				InvalidateBuffer(bufHdr);	/* releases spinlock */
> > >
> > > The right side of >= should be cur_block.
> >
> > Fixed.
>
> >= should be =, shouldn't it?
>
> Please measure and let us see just the recovery performance again, because the critical part of the patch was modified. If the performance is as good as the previous one, and there's no review interaction with others in progress, I'll mark the patch as ready for committer in a few days.

The performance is expected to hold, since smgrnblocks() is called in a non-hot code path and is actually called at most four times per buffer drop in this patch. But it's better to make sure.

I have some comments on the latest patch.

@@ -445,6 +445,7 @@ BlockNumber
 visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
 {
 	BlockNumber newnblocks;
+	bool		cached;

All the variables added by 0002 are useless because none of the caller sites are interested in the value. smgrnblocks() should accept NULL for isCached. (I agree with Tsunakawa-san that the camel-case name is not common there.)

+		nForkBlocks[i] = smgrnblocks(smgr_reln, forkNum[i], &isCached);
+
+		if (!isCached)

"Is cached" is not the property the code is interested in. No other callers to smgrnblocks() are interested in that property. The need for caching is purely internal to smgrnblocks().

On the other hand, we are going to utilize the property of "accuracy" that is a byproduct of reducing fseek calls, and, again, we are not interested in how it is achieved.

So I suggest that the name should be "accurate" or something that does not suggest the mechanism used under the hood.

+	if (nTotalBlocks != InvalidBlockNumber &&
+		nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)

I don't think nTotalBlocks is useful. What we need here is only the total blocks for every fork (nForkBlocks[]) and the total number of buffers to be invalidated for all forks (nBlocksToInvalidate).

> > > The right side of >= should be cur_block.
> >
> > Fixed.
>
> >= should be =, shouldn't it?

It's just out of paranoia. What we are going to invalidate is blocks whose blockNum is >= curBlock. Although there's actually no chance of any other process having replaced the buffer with another page (with a lower block id) of the same relation after BufTableLookup(), that condition makes sure we don't leave behind any blocks that should be invalidated.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Oops! Sorry for the mistake.

At Fri, 09 Oct 2020 11:12:01 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> At Fri, 9 Oct 2020 00:41:24 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in
> > From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com>
> > > > (6)
> > > > +				bufHdr->tag.blockNum >= firstDelBlock[j])
> > > > +				InvalidateBuffer(bufHdr);	/* releases spinlock */
> > > >
> > > > The right side of >= should be cur_block.
> > >
> > > Fixed.
> >
> > >= should be =, shouldn't it?
> >
> > Please measure and let us see just the recovery performance again, because the critical part of the patch was modified. If the performance is as good as the previous one, and there's no review interaction with others in progress, I'll mark the patch as ready for committer in a few days.
>
> The performance is expected to hold, since smgrnblocks() is called
> in a non-hot code path and is actually called at most four times
> per buffer drop in this patch. But it's better to make sure.
>
> I have some comments on the latest patch.
>
> @@ -445,6 +445,7 @@ BlockNumber
>  visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
>  {
>  	BlockNumber newnblocks;
> +	bool		cached;
>
> All the variables added by 0002 are useless because none of the
> caller sites are interested in the value. smgrnblocks() should
> accept NULL for isCached. (I agree with Tsunakawa-san that the
> camel-case name is not common there.)
>
> +		nForkBlocks[i] = smgrnblocks(smgr_reln, forkNum[i], &isCached);
> +
> +		if (!isCached)
>
> "Is cached" is not the property the code is interested in. No other
> callers to smgrnblocks() are interested in that property. The need
> for caching is purely internal to smgrnblocks().
>
> On the other hand, we are going to utilize the property of "accuracy"
> that is a byproduct of reducing fseek calls, and, again, we are not
> interested in how it is achieved.
>
> So I suggest that the name should be "accurate" or something that
> does not suggest the mechanism used under the hood.
>
> +	if (nTotalBlocks != InvalidBlockNumber &&
> +		nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
>
> I don't think nTotalBlocks is useful. What we need here is only the
> total blocks for every fork (nForkBlocks[]) and the total number of
> buffers to be invalidated for all forks (nBlocksToInvalidate).
>
> > > > The right side of >= should be cur_block.
> > >
> > > Fixed.
> >
> > >= should be =, shouldn't it?
>
> It's just out of paranoia. What we are going to invalidate is blocks
> whose blockNum is >= curBlock. Although there's actually no chance of

Sorry. What we are going to invalidate is blocks whose blockNum is >= firstDelBlock[i]. So what I wanted to suggest was that the condition should be

+			if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
+				bufHdr->tag.forkNum == forkNum[j] &&
+				bufHdr->tag.blockNum >= firstDelBlock[j])

> any other process having replaced the buffer with another page (with
> a lower block id) of the same relation after BufTableLookup(), that
> condition makes sure we don't leave behind any blocks that should be
> invalidated.

And I forgot to mention the patch names. I think many of us name patches using the -v option of git-format-patch and assign the version to a patch set, so the version number of all files posted at once is the same.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Friday, October 9, 2020 11:12 AM, Horiguchi-san wrote:
> I have some comments on the latest patch.

Thank you for the feedback! I've attached the latest patches.

> visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)
> {
> 	BlockNumber newnblocks;
> +	bool		cached;
>
> All the variables added by 0002 are useless because none of the caller sites
> are interested in the value. smgrnblocks() should accept NULL for isCached.
> (I agree with Tsunakawa-san that the camel-case name is not common there.)
>
> +		nForkBlocks[i] = smgrnblocks(smgr_reln, forkNum[i], &isCached);
> +
> +		if (!isCached)
>
> "Is cached" is not the property the code is interested in. No other callers to
> smgrnblocks() are interested in that property. The need for caching is purely
> internal to smgrnblocks().
> On the other hand, we are going to utilize the property of "accuracy"
> that is a byproduct of reducing fseek calls, and, again, we are not interested
> in how it is achieved.
> So I suggest that the name should be "accurate" or something that does not
> suggest the mechanism used under the hood.

I changed the bool param to "accurate" per your suggestion. And I also removed the additional "bool cached" variables from the modified functions. Now NULL values are accepted for the new boolean parameter.

> +	if (nTotalBlocks != InvalidBlockNumber &&
> +		nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
>
> I don't think nTotalBlocks is useful. What we need here is only the total
> blocks for every fork (nForkBlocks[]) and the total number of buffers to be
> invalidated for all forks (nBlocksToInvalidate).

Alright. I also removed nTotalBlocks in the v24-0003 patch.

	for (i = 0; i < nforks; i++)
	{
		if (nForkBlocks[i] != InvalidBlockNumber &&
			nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
		{
			/* Optimization loop */
		}
		else
			break;
	}
	if (i >= nforks)
		return;
	/* usual buffer invalidation process */

> > > > The right side of >= should be cur_block.
> > > Fixed.
> > >= should be =, shouldn't it?
>
> It's just out of paranoia. What we are going to invalidate is blocks whose
> blockNum is >= curBlock. Although there's actually no chance of any other
> process having replaced the buffer with another page (with a lower block id)
> of the same relation after BufTableLookup(), that condition makes sure we
> don't leave behind any blocks that should be invalidated.
> Sorry. What we are going to invalidate is blocks whose blockNum is >=
> firstDelBlock[i]. So what I wanted to suggest was that the condition should be
>
> +			if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> +				bufHdr->tag.forkNum == forkNum[j] &&
> +				bufHdr->tag.blockNum >= firstDelBlock[j])

I used bufHdr->tag.blockNum >= firstDelBlock[i] in the latest patch.

> > Please measure and let us see just the recovery performance again, because
> > the critical part of the patch was modified. If the performance is as good as
> > the previous one, and there's no review interaction with others in progress,
> > I'll mark the patch as ready for committer in a few days.
>
> The performance is expected to hold, since smgrnblocks() is called in a
> non-hot code path and is actually called at most four times per buffer
> drop in this patch. But it's better to make sure.

Hmm. When I repeated the performance measurement for non-recovery, I got almost similar execution results for both master and patched.
Execution Time (in seconds)

| s_b   | master | patched | %reg   |
|-------|--------|---------|--------|
| 128MB | 15.265 | 14.769  | -3.36% |
| 1GB   | 14.808 | 14.618  | -1.30% |
| 20GB  | 24.673 | 24.425  | -1.02% |
| 100GB | 74.298 | 74.813  | 0.69%  |

That is considering that I removed the recovery-related checks in the patch and just executed the commands on a standalone server.

- if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
+ if (reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)

OTOH, I also measured the recovery performance by having a hot standby and executing failover. The results were good and almost the same as the previously reported recovery performance.

Recovery Time (in seconds)

| s_b   | master | patched | %reg   |
|-------|--------|---------|--------|
| 128MB | 3.043  | 2.977   | -2.22% |
| 1GB   | 3.417  | 3.41    | -0.21% |
| 20GB  | 20.597 | 2.409   | -755%  |
| 100GB | 66.862 | 2.409   | -2676% |

For 20GB s_b, from 20.597 s (master) to 2.409 s (patched).
For 100GB s_b, from 66.862 s (master) to 2.409 s (patched).

This mainly benefits large shared_buffers settings, without compromising performance when shared_buffers is set to the default or a lower value.

If you could take a look again and if you have additional feedback or comments, I'd appreciate it. Thank you for your time.

Regards,
Kirk Jamison
On Mon, Oct 12, 2020 at 3:08 PM k.jamison@fujitsu.com <k.jamison@fujitsu.com> wrote:
>
> Hmm. When I repeated the performance measurement for non-recovery,
> I got almost similar execution results for both master and patched.
>
> Execution Time (in seconds)
> | s_b   | master | patched | %reg   |
> |-------|--------|---------|--------|
> | 128MB | 15.265 | 14.769  | -3.36% |
> | 1GB   | 14.808 | 14.618  | -1.30% |
> | 20GB  | 24.673 | 24.425  | -1.02% |
> | 100GB | 74.298 | 74.813  | 0.69%  |
>
> That is considering that I removed the recovery-related checks in the patch and just
> executed the commands on a standalone server.
> - if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
> + if (reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
>

Why so? Have you tried to investigate? Check whether it actually takes the optimized path in the non-recovery case.

--
With Regards,
Amit Kapila.
From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com>

(1)
> Alright. I also removed nTotalBlocks in the v24-0003 patch.
>
> 	for (i = 0; i < nforks; i++)
> 	{
> 		if (nForkBlocks[i] != InvalidBlockNumber &&
> 			nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
> 		{
> 			/* Optimization loop */
> 		}
> 		else
> 			break;
> 	}
> 	if (i >= nforks)
> 		return;
> 	/* usual buffer invalidation process */

Why do you do it this way? I think the previous patch was more correct (while agreeing with Horiguchi-san that nTotalBlocks may be unnecessary). What you want to do is "if the size of any fork could be inaccurate, do the traditional full buffer scan without performing any optimization for any fork," right? But the above code performs the optimization for forks until it finds a fork with an inaccurate size.

(2)
+	 * Get the total number of cached blocks and to-be-invalidated blocks
+	 * of the relation. The cached value returned by smgrnblocks could be
+	 * smaller than the actual number of existing buffers of the file.

As you changed the meaning of the smgrnblocks() argument from cached to accurate, and you no longer calculate the total blocks, the comment should reflect that.

(3)
In smgrnblocks(), accurate is not set to false when mdnblocks() is called. The caller doesn't initialize the value either, so it can see a garbage value.

(4)
+		if (nForkBlocks[i] != InvalidBlockNumber &&
+			nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
+		{
...
+		}
+		else
+			break;
+	}

In cases like this, it's better to reverse the if and else. That way, you can reduce the nesting depth.

Regards
Takayuki Tsunakawa
On Tuesday, October 13, 2020 10:09 AM, Tsunakawa-san wrote:
> Why do you do it this way? I think the previous patch was more correct (while
> agreeing with Horiguchi-san that nTotalBlocks may be unnecessary). What
> you want to do is "if the size of any fork could be inaccurate, do the traditional
> full buffer scan without performing any optimization for any fork," right? But
> the above code performs the optimization for forks until it finds a fork with
> an inaccurate size.
>
> (2)
> +	 * Get the total number of cached blocks and to-be-invalidated blocks
> +	 * of the relation. The cached value returned by smgrnblocks could be
> +	 * smaller than the actual number of existing buffers of the file.
>
> As you changed the meaning of the smgrnblocks() argument from cached to
> accurate, and you no longer calculate the total blocks, the comment should
> reflect that.
>
> (3)
> In smgrnblocks(), accurate is not set to false when mdnblocks() is called.
> The caller doesn't initialize the value either, so it can see a garbage value.
>
> (4)
> +		if (nForkBlocks[i] != InvalidBlockNumber &&
> +			nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
> +		{
> ...
> +		}
> +		else
> +			break;
> +	}
>
> In cases like this, it's better to reverse the if and else. That way, you can
> reduce the nesting depth.

Thank you for the review!

1. I have revised the patch addressing your comments/feedback. Attached are the latest set of patches.

2. Non-recovery performance
I also included a debug version of the patch (0004) where I removed the recovery-related checks to measure non-recovery performance. However, I still can't seem to find the cause of why the non-recovery performance does not change compared to master (1 min 15 s for the test case below).

> - if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
> + if (reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)

Here's how I measured it:

0. postgresql.conf settings
shared_buffers = 100GB
autovacuum = off
full_page_writes = off
checkpoint_timeout = 30min
max_locks_per_transaction = 100
wal_log_hints = on
wal_keep_size = 100
max_wal_size = 20GB

1. createdb test

2. Create tables: SELECT create_tables(1000);

create or replace function create_tables(numtabs int)
returns void as $$
declare query_string text;
begin
  for i in 1..numtabs loop
    query_string := 'create table tab_' || i::text || ' (a int);';
    execute query_string;
  end loop;
end;
$$ language plpgsql;

3. Insert rows into the tables (3.5 GB database): SELECT insert_tables(1000);

create or replace function insert_tables(numtabs int)
returns void as $$
declare query_string text;
begin
  for i in 1..numtabs loop
    query_string := 'insert into tab_' || i::text || ' SELECT generate_series(1, 100000);';
    execute query_string;
  end loop;
end;
$$ language plpgsql;

4. DELETE FROM the tables: SELECT delfrom_tables(1000);

create or replace function delfrom_tables(numtabs int)
returns void as $$
declare query_string text;
begin
  for i in 1..numtabs loop
    query_string := 'delete from tab_' || i::text;
    execute query_string;
  end loop;
end;
$$ language plpgsql;

5. Measure VACUUM timing
\timing
VACUUM;

Using the debug version of the patch, I have confirmed that it enters the optimization path when it meets the conditions.
Here are some printed logs from 018_wal_optimize_node_replica.log:
> make world -j4 -s && make -C src/test/recovery/ check PROVE_TESTS=t/018_wal_optimize.pl

WARNING:  current fork 0, nForkBlocks[i] 1, accurate: 1
CONTEXT:  WAL redo at 0/162B4E0 for Storage/TRUNCATE: base/13751/24577 to 0 blocks flags 7
WARNING:  Optimization Loop. buf_id = 41. nforks = 1. current fork = 0. forkNum: 0 == tag's forkNum: 0. curBlock: 0 < nForkBlocks[i] = 1. tag blockNum: 0 >= firstDelBlock[i]: 0. nBlocksToInvalidate = 1 < threshold = 32.
--

3. Recovery performance (hot standby, failover)
OTOH, when executing the recovery performance test (using the 0003 patch), the results were great.

| s_b   | master | patched | %reg   |
|-------|--------|---------|--------|
| 128MB | 3.043  | 2.977   | -2.22% |
| 1GB   | 3.417  | 3.41    | -0.21% |
| 20GB  | 20.597 | 2.409   | -755%  |
| 100GB | 66.862 | 2.409   | -2676% |

To execute this on a hot standby setup (after inserting rows into the tables):

1. [Standby] Pause WAL replay.
   SELECT pg_wal_replay_pause();

2. [Master] Measure VACUUM timing. Then stop the server.
   \timing
   VACUUM;
   \q
   pg_ctl stop -mi -w

3. [Standby] Use the attached script to promote the standby and measure the performance.
   # test.sh recovery

So the current issue I'm still investigating is why the non-recovery performance is bad, while OTOH it's good when InRecovery.

Regards,
Kirk Jamison
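(For readers following the log output above, a minimal sketch of the per-block probe that produces such lines; the names curBlock, nForkBlocks, and firstDelBlock follow the patch hunks quoted earlier in the thread, and the locking calls are the existing buffer-table API:)

	for (curBlock = firstDelBlock[i]; curBlock < nForkBlocks[i]; curBlock++)
	{
		BufferTag	bufTag;
		uint32		bufHash;
		LWLock	   *bufPartitionLock;
		int			buf_id;

		/* Compute the tag, hash, and partition lock for this block. */
		INIT_BUFFERTAG(bufTag, rnode.node, forkNum[i], curBlock);
		bufHash = BufTableHashCode(&bufTag);
		bufPartitionLock = BufMappingPartitionLock(bufHash);

		/* Check whether the block is in the buffer pool; if not, skip it. */
		LWLockAcquire(bufPartitionLock, LW_SHARED);
		buf_id = BufTableLookup(&bufTag, bufHash);
		LWLockRelease(bufPartitionLock);

		if (buf_id < 0)
			continue;

		/* Lock the header, re-check the tag, then InvalidateBuffer(). */
	}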
From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com>
> 2. Non-recovery performance
> However, I still can't seem to find the cause of why the non-recovery
> performance does not change compared to master (1 min 15 s for the
> test case below).
...
> 5. Measure VACUUM timing
> \timing
> VACUUM;

Oops, why are you using VACUUM? Aren't you trying to speed up TRUNCATE? Even if you wanted to utilize the truncation at the end of VACUUM for measuring truncation speed, your method measures the whole VACUUM processing, which includes the garbage collection phase. The garbage collection should dominate the time.

> 3. Recovery performance (hot standby, failover)
> 2. [Master] Measure VACUUM timing. Then stop the server.
> \timing
> VACUUM;
> \q
> pg_ctl stop -mi -w
>
> 3. [Standby] Use the attached script to promote the standby and measure the
> performance.
> # test.sh recovery

You didn't DELETE the table data, as opposed to the non-recovery case. Then the replay of VACUUM should do nothing. That's why you got a good performance number.

TRUNCATE goes down this path:

[non-recovery]
CommitTransaction
  smgrdopendingdeletes
    smgrdounlinkall
      DropRelFileNodesAllBuffers

[recovery]
xact_redo_commit
  DropRelationFiles
    smgrdounlinkall
      DropRelFileNodesAllBuffers

So, you need to modify DropRelFileNodesAllBuffers(). OTOH, DropRelFileNodeBuffers(), which you modified, is used in VACUUM's truncation and another case. The modification itself is useful because it can shorten the occasional hiccup during autovacuum, so don't remove the change.

(The existence of these two paths is tricky; no one on this thread noticed, and I forgot about it. It would be good to refactor this, but that's a separate undertaking, I think.)

Below are my comments on the code:

(1)
@@ -572,6 +572,9 @@ smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool *accurate)

+	if (accurate != NULL)
+		*accurate = false;
+

The above change should be in 0002, right?

(2)
+	/* Get the total nblocks for a relation's fork */

total nblocks -> number of blocks

(3)
+		if (nForkBlocks[i] == InvalidBlockNumber ||
+			nBlocksToInvalidate >= BUF_DROP_FULL_SCAN_THRESHOLD)
+			break;

With this code, you haven't addressed what I commented on previously. If the size of the first fork is accurate but that of the second one is not, the first fork is processed in an optimized way while the second fork is done in the traditional way. What you want to do here is to only use the traditional way for all forks, right? So, remove the above change and replace

+		if (!accurate)
+		{
+			nForkBlocks[i] = InvalidBlockNumber;
+			break;
+		}

with

+		if (!accurate)
+			break;

And after the first for loop, put

	if (!accurate || nBlocksToInvalidate >= BUF_DROP_FULL_SCAN_THRESHOLD)
		goto full_scan;

And remove the following code and instead put the "full_scan:" label there.

+	if (i >= nforks)
+		return;
+

Or, instead of using goto, you can write it like this:

	for (...)
		calculate # of invalidated blocks

	if (accurate && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
	{
		do the optimized way;
		return;
	}

	do the traditional way;

I prefer using goto here because the loop nesting stays shallow. But that's a matter of taste, and you can choose either.

Regards
Takayuki Tsunakawa
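(Putting the pieces of the goto variant together, a minimal sketch; the loop bodies are elided, and the variable names come from the patch hunks above:)

	for (i = 0; i < nforks; i++)
	{
		/* Give up the optimization if any fork's size cannot be trusted. */
		nForkBlocks[i] = smgrnblocks(smgr_reln, forkNum[i], &accurate);
		if (!accurate)
			break;

		nBlocksToInvalidate += (nForkBlocks[i] - firstDelBlock[i]);
	}

	if (!accurate || nBlocksToInvalidate >= BUF_DROP_FULL_SCAN_THRESHOLD)
		goto full_scan;

	/* ... optimized per-block BufTableLookup() path for each fork ... */
	return;

full_scan:
	/* ... traditional scan of the whole buffer pool ... */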
From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com>
> However, I still can't seem to find the cause of why the non-recovery
> performance does not change compared to master (1 min 15 s for the
> test case below).

Can you check and/or try the following?

1. Isn't the vacuum cost delay working? The VACUUM command should run without sleeping under the default settings. Just in case, can you try with these settings?

vacuum_cost_delay = 0
vacuum_cost_limit = 10000

2. Buffer strategy
Non-recovery VACUUM can differ from recovery in its use of shared buffers. The VACUUM command uses only 256 KB of shared buffers. To make the VACUUM command use the whole shared buffers, can you modify src/backend/commands/vacuum.c so that GetAccessStrategy()'s argument is changed from BAS_VACUUM to BAS_NORMAL? (I don't have much hope for this, though, because all blocks of the relations are already cached in shared buffers when VACUUM is run.)

Can you measure the time spent in DropRelFileNodeBuffers()? You can call GetTimestamp() at the beginning and end of the function, and use TimestampDifference() to calculate the difference. Then, for instance, elog(WARNING, "time is | %u.%u", sec, usec) at the end of the function. You can use any elog() print format for your convenience to write shell commands that filter the lines and sum up the total.

Regards
Takayuki Tsunakawa
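(A minimal sketch of what such instrumentation could look like; the timestamp routines are the existing facilities from utils/timestamp.h, where the core function is named GetCurrentTimestamp(), and the elog format here is just one possibility:)

void
DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
					   int nforks, BlockNumber *firstDelBlock)
{
	TimestampTz	start_ts = GetCurrentTimestamp();
	long		secs;
	int			usecs;

	/* ... existing function body ... */

	/* Report the elapsed time so it can be filtered and summed from logs. */
	TimestampDifference(start_ts, GetCurrentTimestamp(), &secs, &usecs);
	elog(WARNING, "time is | %ld.%06d", secs, usecs);
}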
From: tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com>
> Can you measure the time spent in DropRelFileNodeBuffers()? You can call
> GetTimestamp() at the beginning and end of the function, and use
> TimestampDifference() to calculate the difference. Then, for instance,
> elog(WARNING, "time is | %u.%u", sec, usec) at the end of the function. You
> can use any elog() print format for your convenience to write shell commands
> that filter the lines and sum up the total.

Before doing this, you can also try "VACUUM (truncate off)" to see which of the garbage collection or the relation truncation takes a long time. The relation truncation processing includes not only DropRelFileNodeBuffers() but also file truncation and other work, but it's an easy filter.

Regards
Takayuki Tsunakawa
RelationTruncate() invalidates the cached fork sizes as follows. This causes smgrnblocks() to return accurate=false, resulting in the optimization not running. Try commenting these lines out for the non-recovery case.

	/*
	 * Make sure smgr_targblock etc aren't pointing somewhere past new end
	 */
	rel->rd_smgr->smgr_targblock = InvalidBlockNumber;
	for (int i = 0; i <= MAX_FORKNUM; ++i)
		rel->rd_smgr->smgr_cached_nblocks[i] = InvalidBlockNumber;

Regards
Takayuki Tsunakawa
On Wednesday, October 21, 2020 4:37 PM, Tsunakawa-san wrote:
> RelationTruncate() invalidates the cached fork sizes as follows. This causes
> smgrnblocks() to return accurate=false, resulting in the optimization not
> running. Try commenting these lines out for the non-recovery case.
>
> 	/*
> 	 * Make sure smgr_targblock etc aren't pointing somewhere past new end
> 	 */
> 	rel->rd_smgr->smgr_targblock = InvalidBlockNumber;
> 	for (int i = 0; i <= MAX_FORKNUM; ++i)
> 		rel->rd_smgr->smgr_cached_nblocks[i] = InvalidBlockNumber;

Hello,

I have updated the set of patches, which incorporate all your feedback from the previous emails. Thank you for also looking into this. The patch 0003 (DropRelFileNodeBuffers improvement) is indeed for vacuum optimization and not for truncate. I'll post a separate patch for the truncate optimization in the coming days.

1. Vacuum optimization
I have confirmed that the above comment (commenting out the lines in RelationTruncate) solves the issue for the non-recovery case. The attached 0004 patch is just for non-recovery testing and is not included in the final set of patches to be committed for vacuum optimization.

The table below shows the vacuum execution time for the non-recovery case. I've also subtracted the execution time when VACUUM (truncate off) is set.

[NON-RECOVERY CASE - VACUUM execution time in seconds]

| s_b   | master | patched | %reg      |
|-------|--------|---------|-----------|
| 128MB | 0.22   | 0.181   | -21.55%   |
| 1GB   | 0.701  | 0.712   | 1.54%     |
| 20GB  | 15.027 | 1.920   | -682.66%  |
| 100GB | 65.456 | 1.795   | -3546.57% |

[RECOVERY CASE, VACUUM execution + failover]
I made a mistake in my previous email [1]: DELETE FROM was executed before pausing the WAL replay on the standby. In short, the procedure and results were correct, but I repeated the performance measurement just in case. The results are still great and almost the same as the previous measurement.

| s_b   | master | patched | %reg   |
|-------|--------|---------|--------|
| 128MB | 3.043  | 3.009   | -1.13% |
| 1GB   | 3.417  | 3.410   | -0.21% |
| 20GB  | 20.597 | 2.410   | -755%  |
| 100GB | 65.734 | 2.409   | -2629% |

Based on the results above, with the patches applied, the performance for both recovery and non-recovery was relatively close. For default and small shared_buffers (128MB, 1GB), the performance is about the same as master, but we see the benefit with large shared_buffers settings.

I tested using the same test case I described in the previous email, including the following additional settings:

vacuum_cost_delay = 0
vacuum_cost_limit = 10000

That's it for the vacuum optimization. Feedback and comments would be highly appreciated.

2. Truncate optimization
I'll post a separate patch in the future for the truncate optimization, which modifies DropRelFileNodesAllBuffers and related functions along the truncate path.

Thank you.

Regards,
Kirk Jamison

[1] https://www.postgresql.org/message-id/OSBPR01MB2341672E9A95E5EC6D2E79B5EF020%40OSBPR01MB2341.jpnprd01.prod.outlook.com
From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com>
> I have confirmed that the above comment (commenting out the lines in
> RelationTruncate) solves the issue for the non-recovery case.
> The attached 0004 patch is just for non-recovery testing and is not included in
> the final set of patches to be committed for vacuum optimization.

I'm relieved to hear that.

As for 0004:
When testing TRUNCATE, remove the change to storage.c because it was intended to troubleshoot the VACUUM test.
What's the change in bufmgr.c for? Is it to be included in 0001 or 0002?

> The table below shows the vacuum execution time for the non-recovery case.
> I've also subtracted the execution time when VACUUM (truncate off) is set.
>
> [NON-RECOVERY CASE - VACUUM execution time in seconds]
(snip)
> | 100GB | 65.456 | 1.795 | -3546.57% |

So, the full shared buffer scan for 10,000 relations took about 63 seconds (= 6.3 ms per relation). It's nice to shorten this long time.

I'll review the patch soon.

Regards
Takayuki Tsunakawa
> As for 0004:
> When testing TRUNCATE, remove the change to storage.c because it was
> intended to troubleshoot the VACUUM test.

I meant vacuum.c. Sorry.

Regards
Takayuki Tsunakawa
The patch looks good except for one minor point:

(1)
+	 * as the total nblocks for a given fork. The cached value returned by

nblocks -> blocks

Regards
Takayuki Tsunakawa
On Thursday, October 22, 2020 10:34 AM, Tsunakawa-san wrote:
> > I have confirmed that the above comment (commenting out the lines in
> > RelationTruncate) solves the issue for the non-recovery case.
> > The attached 0004 patch is just for non-recovery testing and is not
> > included in the final set of patches to be committed for vacuum
> > optimization.
>
> I'm relieved to hear that.
>
> As for 0004:
> When testing TRUNCATE, remove the change to storage.c because it was
> intended to troubleshoot the VACUUM test.

I've removed it now.

> What's the change in bufmgr.c for? Is it to be included in 0001 or 0002?

Right. But that should be in 0003. Fixed.

I also fixed the feedback from the previous email:

> (1)
> +	 * as the total nblocks for a given fork. The cached value returned by
>
> nblocks -> blocks

> > The table below shows the vacuum execution time for the non-recovery case.
> > I've also subtracted the execution time when VACUUM (truncate off) is set.
> >
> > [NON-RECOVERY CASE - VACUUM execution time in seconds]
> (snip)
> > | 100GB | 65.456 | 1.795 | -3546.57% |
>
> So, the full shared buffer scan for 10,000 relations took about 63
> seconds (= 6.3 ms per relation). It's nice to shorten this long time.
>
> I'll review the patch soon.

Thank you very much for the reviews. Attached are the latest set of patches.

Regards,
Kirk Jamison
On Thu, Oct 22, 2020 at 3:07 PM k.jamison@fujitsu.com <k.jamison@fujitsu.com> wrote:

+	/*
+	 * Get the total number of to-be-invalidated blocks of a relation as well
+	 * as the total blocks for a given fork. The cached value returned by
+	 * smgrnblocks could be smaller than the actual number of existing buffers
+	 * of the file. This is caused by buggy Linux kernels that might not have
+	 * accounted for the recent write. Give up the optimization if the block
+	 * count of any fork cannot be trusted.
+	 */
+	for (i = 0; i < nforks; i++)
+	{
+		/* Get the number of blocks for a relation's fork */
+		nForkBlocks[i] = smgrnblocks(smgr_reln, forkNum[i], &accurate);
+
+		if (!accurate)
+			break;

Hmmm. The Linux comment led me to commit ffae5cc and a 2006 thread[1] showing a buggy sequence of system calls. AFAICS it was not even an SMP/race problem of the type you might half expect; it was a single process not seeing its own write. I didn't find details on the version, filesystem, etc.

Searching for our message "This has been seen to occur with buggy kernels; consider updating your system" turns up recent-ish results too. The reports I read involved GlusterFS, which I don't personally know anything about, but it claims full POSIX compliance, and POSIX is strict about that sort of thing, so I'd guess that is/was a fairly serious bug or misconfiguration. Surely there must be other symptoms for PostgreSQL on such systems too, like sequential scans that don't see recently added pages.

But... does the proposed caching behaviour and "accurate" flag really help with any of that? Cached values come from lseek() anyway. If we just trusted unmodified smgrnblocks(), someone running on such a forgetful file system might eventually see nasty errors because we left buffers in the buffer pool that prevent a checkpoint from completing (and panic?), but they might also see other really strange errors, and that applies with or without that "accurate" flag, no?

[1] https://www.postgresql.org/message-id/flat/26202.1159032931%40sss.pgh.pa.us
Thomas Munro <thomas.munro@gmail.com> writes:
> Hmmm. The Linux comment led me to commit ffae5cc and a 2006 thread[1]
> showing a buggy sequence of system calls.

Hah, blast from the past ...

> AFAICS it was not even an
> SMP/race problem of the type you might half expect; it was a single
> process not seeing its own write. I didn't find details on the
> version, filesystem, etc.

Per the referenced bug-reporting thread, it was ReiserFS and/or NFS on SLES 9.3; so, dubious storage choices on an ancient-even-then Linux kernel.

I think the takeaway point is not so much that that particular bug might recur as that storage infrastructure does sometimes have bugs. If you're wanting to introduce new assumptions about what the filesystem will do, it's prudent to think about how badly we will break if the assumptions fail.

regards, tom lane
At Thu, 22 Oct 2020 16:35:27 +1300, Thomas Munro <thomas.munro@gmail.com> wrote in
> On Thu, Oct 22, 2020 at 3:07 PM k.jamison@fujitsu.com
> <k.jamison@fujitsu.com> wrote:
> +	/*
> +	 * Get the total number of to-be-invalidated blocks of a relation as well
> +	 * as the total blocks for a given fork. The cached value returned by
> +	 * smgrnblocks could be smaller than the actual number of existing buffers
> +	 * of the file. This is caused by buggy Linux kernels that might not have
> +	 * accounted for the recent write. Give up the optimization if the block
> +	 * count of any fork cannot be trusted.
> +	 */
> +	for (i = 0; i < nforks; i++)
> +	{
> +		/* Get the number of blocks for a relation's fork */
> +		nForkBlocks[i] = smgrnblocks(smgr_reln, forkNum[i], &accurate);
> +
> +		if (!accurate)
> +			break;
>
> Hmmm. The Linux comment led me to commit ffae5cc and a 2006 thread[1]
> showing a buggy sequence of system calls. AFAICS it was not even an
> SMP/race problem of the type you might half expect; it was a single
> process not seeing its own write. I didn't find details on the
> version, filesystem, etc.

Anyway, that comment is irrelevant to the added code. The point here is that the returned value may not be reliable, due not only to kernel bugs but also to the file being extended or truncated by other processes. But I suppose we may have a synchronized file-size cache in the future?

> Searching for our message "This has been seen to occur with buggy
> kernels; consider updating your system" turns up recent-ish results
> too. The reports I read involved GlusterFS, which I don't personally
> know anything about, but it claims full POSIX compliance, and POSIX is
> strict about that sort of thing, so I'd guess that is/was a fairly
> serious bug or misconfiguration. Surely there must be other symptoms
> for PostgreSQL on such systems too, like sequential scans that don't
> see recently added pages.
>
> But... does the proposed caching behaviour and "accurate" flag really
> help with any of that? Cached values come from lseek() anyway. If we

That "accurate" (good name wanted) flag suggests that it is guaranteed that we don't have a buffer for blocks after that block number.

> just trusted unmodified smgrnblocks(), someone running on such a
> forgetful file system might eventually see nasty errors because we
> left buffers in the buffer pool that prevent a checkpoint from
> completing (and panic?), but they might also see other really strange
> errors, and that applies with or without that "accurate" flag, no?
>
> [1] https://www.postgresql.org/message-id/flat/26202.1159032931%40sss.pgh.pa.us

smgrtruncate and smgrextend modify that cache from their parameters, not from lseek(). At the very first, the value in the cache comes from lseek(), but if nothing other than postgres has changed the file size, I believe we can rely on the cache even with such buggy kernels, if any still exist.

If there's no longer such a buggy kernel, we can rely on lseek() only when InRecovery. If we had a synchronized file-size cache we could rely on the cache even while !InRecovery. (I'm not sure how vacuum affects this, though.)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Oct 22, 2020 at 5:52 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Per the referenced bug-reporting thread, it was ReiserFS and/or NFS on
> SLES 9.3; so, dubious storage choices on an ancient-even-then Linux
> kernel.

Ohhhh. I can reproduce that on a modern Linux box by forcing writeback to a full NFS filesystem[1], approximately as the kernel does asynchronously when it feels like it, causing the size reported by SEEK_END to go down.

$ cat magic_shrinking_file.c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main()
{
	int fd;
	char buffer[8192] = {0};

	fd = open("/mnt/test_loopback_remote/dir/file", O_RDWR | O_APPEND);
	if (fd < 0) {
		perror("open");
		return EXIT_FAILURE;
	}
	printf("lseek(..., SEEK_END) = %jd\n", lseek(fd, 0, SEEK_END));
	printf("write(...) = %zd\n", write(fd, buffer, sizeof(buffer)));
	printf("lseek(..., SEEK_END) = %jd\n", lseek(fd, 0, SEEK_END));
	printf("fsync(...) = %d\n", fsync(fd));
	printf("lseek(..., SEEK_END) = %jd\n", lseek(fd, 0, SEEK_END));

	return EXIT_SUCCESS;
}
$ cc magic_shrinking_file.c
$ ./a.out
lseek(..., SEEK_END) = 9670656
write(...) = 8192
lseek(..., SEEK_END) = 9678848
fsync(...) = -1
lseek(..., SEEK_END) = 9670656

> I think the takeaway point is not so much that that particular bug
> might recur as that storage infrastructure does sometimes have bugs.
> If you're wanting to introduce new assumptions about what the filesystem
> will do, it's prudent to think about how badly we will break if the
> assumptions fail.

Yeah. My point was just that the caching trick doesn't seem to improve matters on this particular front; it can just cache a bogus value.

[1] https://www.postgresql.org/message-id/CAEepm=1FGo=ACPKRmAxvb53mBwyVC=TDwTE0DMzkWjdbAYw7sw@mail.gmail.com
At Thu, 22 Oct 2020 01:33:31 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in
> From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com>
> > The table below shows the vacuum execution time for the non-recovery case.
> > I've also subtracted the execution time when VACUUM (truncate off) is set.
> >
> > [NON-RECOVERY CASE - VACUUM execution time in seconds]
> (snip)
> > | 100GB | 65.456 | 1.795 | -3546.57% |
>
> So, the full shared buffer scan for 10,000 relations took about 63 seconds (= 6.3 ms per relation). It's nice to shorten this long time.

I'm not sure about the exact steps of the test, but that can be expected if we have many small relations to truncate.

Currently BUF_DROP_FULL_SCAN_THRESHOLD is set to NBuffers / 512, which is quite arbitrary and comes from a wild guess.

Perhaps we need to run benchmarks that drop one relation with several different ratios between the number of buffers to be dropped and NBuffers, preferably both on spinning rust and on SSD.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
At Thu, 22 Oct 2020 14:16:37 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> At Thu, 22 Oct 2020 16:35:27 +1300, Thomas Munro <thomas.munro@gmail.com> wrote in
> > On Thu, Oct 22, 2020 at 3:07 PM k.jamison@fujitsu.com
> > <k.jamison@fujitsu.com> wrote:
> > But... does the proposed caching behaviour and "accurate" flag really
> > help with any of that? Cached values come from lseek() anyway. If we
>
> That "accurate" (good name wanted) flag suggests that it is guaranteed
> that we don't have a buffer for blocks after that block number.
>
> > just trusted unmodified smgrnblocks(), someone running on such a
> > forgetful file system might eventually see nasty errors because we
> > left buffers in the buffer pool that prevent a checkpoint from
> > completing (and panic?), but they might also see other really strange
> > errors, and that applies with or without that "accurate" flag, no?
> >
> > [1] https://www.postgresql.org/message-id/flat/26202.1159032931%40sss.pgh.pa.us
>
> smgrtruncate and smgrextend modify that cache from their parameters,
> not from lseek(). At the very first, the value in the cache comes from
> lseek(), but if nothing other than postgres has changed the file size,
> I believe we can rely on the cache even with such buggy kernels, if
> any still exist.

Mmm. Not exactly. The requirement here is that we must be certain that we don't have a buffer for blocks after the file size known to the process. While recovering, if the first lseek() returned a smaller size than the actual one, we cannot have a buffer for the blocks after that size. After we truncate or extend the file, we are certain that we don't have a buffer for unknown blocks.

> If there's no longer such a buggy kernel, we can rely on lseek() only
> when InRecovery. If we had a synchronized file-size cache we could rely
> on the cache even while !InRecovery. (I'm not sure how vacuum
> affects this, though.)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Oct 22, 2020 at 7:33 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> At Thu, 22 Oct 2020 14:16:37 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> > smgrtruncate and smgrextend modify that cache from their parameters,
> > not from lseek(). At the very first, the value in the cache comes from
> > lseek(), but if nothing other than postgres has changed the file size,
> > I believe we can rely on the cache even with such buggy kernels, if
> > any still exist.
>
> Mmm. Not exactly. The requirement here is that we must be certain that
> we don't have a buffer for blocks after the file size known to
> the process. While recovering, if the first lseek() returned a smaller
> size than the actual one, we cannot have a buffer for the blocks after
> that size. After we truncate or extend the file, we are certain that we
> don't have a buffer for unknown blocks.

Thanks, I understand now. Something feels fragile about it, perhaps because it's not really acting as a "cache" anymore despite its name, but I see the logic now. It becomes the authoritative source of information, even if the kernel decides to make our file smaller asynchronously.

> > If there's no longer such a buggy kernel, we can rely on lseek() only
> > when InRecovery. If we had a synchronized file-size cache we could rely
> > on the cache even while !InRecovery. (I'm not sure how vacuum
> > affects this, though.)

Perhaps the buggy kernel of 2006 was actually Linux working as designed, according to its philosophy of ejecting dirty buffers on writeback failure (and apparently adjusting the size at the same time). At least in 2020 it'll tell us about the problem that caused that when we next perform an operation that reads the error counter, but in the case of a relation we're dropping -- the use case in this thread -- that won't happen! (I mean, something else will probably tell you your system is toast pretty soon, but this particular condition may go undetected.)

I think a synchronised file size cache wouldn't be enough to use this trick outside the recovery process, because the initial value would come from a call to lseek(), but unlike recovery, that wouldn't happen *before* we start putting pages in the buffer pool. Also, if we one day have a size-limited relcache, even recovery could get into trouble, if it evicts the RelationData that holds the authoritative nblocks value.
At Thu, 22 Oct 2020 18:54:43 +1300, Thomas Munro <thomas.munro@gmail.com> wrote in
> On Thu, Oct 22, 2020 at 5:52 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Per the referenced bug-reporting thread, it was ReiserFS and/or NFS on
> > SLES 9.3; so, dubious storage choices on an ancient-even-then Linux
> > kernel.
>
> Ohhhh. I can reproduce that on a modern Linux box by forcing
> writeback to a full NFS filesystem[1], approximately as the kernel
> does asynchronously when it feels like it, causing the size reported
> by SEEK_END to go down.
<test code>
> $ cc magic_shrinking_file.c
> $ ./a.out
> lseek(..., SEEK_END) = 9670656
> write(...) = 8192
> lseek(..., SEEK_END) = 9678848
> fsync(...) = -1
> lseek(..., SEEK_END) = 9670656

Interesting..

> > I think the takeaway point is not so much that that particular bug
> > might recur as that storage infrastructure does sometimes have bugs.
> > If you're wanting to introduce new assumptions about what the filesystem
> > will do, it's prudent to think about how badly we will break if the
> > assumptions fail.
>
> Yeah. My point was just that the caching trick doesn't seem to
> improve matters on this particular front; it can just cache a bogus
> value.
>
> [1] https://www.postgresql.org/message-id/CAEepm=1FGo=ACPKRmAxvb53mBwyVC=TDwTE0DMzkWjdbAYw7sw@mail.gmail.com

As I wrote in another branch of this thread, the requirement here is making sure that we don't have a buffer for blocks after the file size known to the process. Even if the cache gets a bogus value at the first load, it's still true that we don't have buffers for blocks after that size. There's no problem as long as DropRelFileNodeBuffers() doesn't get a smaller value from smgrnblocks() than the size the server thinks the file is.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
From: Thomas Munro <thomas.munro@gmail.com>
> On Thu, Oct 22, 2020 at 7:33 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > Mmm. Not exactly. The requirement here is that we must be certain that
> > we don't have a buffer for blocks after the file size known to
> > the process. While recovering, if the first lseek() returned a smaller
> > size than the actual one, we cannot have a buffer for the blocks after
> > that size. After we truncate or extend the file, we are certain that we
> > don't have a buffer for unknown blocks.
>
> Thanks, I understand now. Something feels fragile about it, perhaps
> because it's not really acting as a "cache" anymore despite its name,
> but I see the logic now. It becomes the authoritative source of
> information, even if the kernel decides to make our file smaller
> asynchronously.

Thank you, Horiguchi-san, you are a savior! I was worried the end of the world had come.

> I think a synchronised file size cache wouldn't be enough to use this
> trick outside the recovery process, because the initial value would
> come from a call to lseek(), but unlike recovery, that wouldn't happen
> *before* we start putting pages in the buffer pool. Also, if we one
> day have a size-limited relcache, even recovery could get into
> trouble, if it evicts the RelationData that holds the authoritative
> nblocks value.

That's too bad, because we hoped we might be able to optimize various operations during normal operation (TRUNCATE, DROP TABLE/INDEX, DROP DATABASE, etc.). When an honest man can't trust the system calls, that's hell.

I'm probably being silly, but can't we avoid the problem by using fstat() instead of lseek(SEEK_END)? Would they return the same value from the i-node?

Or, can't we just try to do BufTableLookup() one block after what smgrnblocks() returns?

Regards
Takayuki Tsunakawa
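(As an aside, for anyone wanting to check this locally, a small self-contained probe in the spirit of Thomas's earlier test program that prints both size sources for a file; the path argument is whatever file you want to inspect:)

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct stat st;
	int fd = open(argv[1], O_RDONLY);

	if (fd < 0 || fstat(fd, &st) < 0)
	{
		perror("open/fstat");
		return 1;
	}
	/* Print both answers so they can be compared directly. */
	printf("lseek(..., SEEK_END) = %jd\n", (intmax_t) lseek(fd, 0, SEEK_END));
	printf("fstat  st_size       = %jd\n", (intmax_t) st.st_size);
	return 0;
}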
At Thu, 22 Oct 2020 07:31:55 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in
> From: Thomas Munro <thomas.munro@gmail.com>
> > Thanks, I understand now. Something feels fragile about it, perhaps
> > because it's not really acting as a "cache" anymore despite its name,
> > but I see the logic now. It becomes the authoritative source of
> > information, even if the kernel decides to make our file smaller
> > asynchronously.
>
> Thank you, Horiguchi-san, you are a savior! I was worried the end of the world had come.
>
> > I think a synchronised file size cache wouldn't be enough to use this
> > trick outside the recovery process, because the initial value would
> > come from a call to lseek(), but unlike recovery, that wouldn't happen
> > *before* we start putting pages in the buffer pool. Also, if we one
> > day have a size-limited relcache, even recovery could get into
> > trouble, if it evicts the RelationData that holds the authoritative
> > nblocks value.
>
> That's too bad, because we hoped we might be able to optimize various operations during normal operation (TRUNCATE, DROP TABLE/INDEX, DROP DATABASE, etc.). When an honest man can't trust the system calls, that's hell.
>
> I'm probably being silly, but can't we avoid the problem by using fstat() instead of lseek(SEEK_END)? Would they return the same value from the i-node?
>
> Or, can't we just try to do BufTableLookup() one block after what smgrnblocks() returns?

A lossy smgr relation cache or relcache is not a hard obstacle. As in the !accurate case, we just give up the optimized dropping if the relcache doesn't give the authoritative size.

By the way, a heap scan finds the size of the target relation using smgrnblocks(). I'm not sure why we don't miss recently extended pages in a heap scan. It seems possible that a concurrent checkpoint fsyncs relation files in between the extension and the scan, and the scan gets a smaller size than the real one.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Oct 22, 2020 at 9:50 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> By the way, a heap scan finds the size of the target relation using
> smgrnblocks(). I'm not sure why we don't miss recently extended pages
> in a heap scan. It seems possible that a concurrent checkpoint
> fsyncs relation files in between the extension and the scan, and the
> scan gets a smaller size than the real one.

Yeah. That's a narrow window: fsync() returns an error after the file shrinks and we immediately panic. A version with a wider window: the kernel tries to write in the background, gets an I/O error, shrinks the file, but we don't know this, and we continue running until the next checkpoint calls fsync(), sees the error, and panics. Seq scans between those two events fail to see recently committed data at the end of the table.
On Thu, Oct 22, 2020 at 2:20 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Thu, 22 Oct 2020 07:31:55 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in
> > From: Thomas Munro <thomas.munro@gmail.com>
> > > On Thu, Oct 22, 2020 at 7:33 PM Kyotaro Horiguchi
> > > <horikyota.ntt@gmail.com> wrote:
> > > > Mmm. Not exactly. The requirement here is that we must be certain that
> > > > we don't have a buffer for blocks after the file size known to
> > > > the process. While recovering, if the first lseek() returned a smaller
> > > > size than the actual one, we cannot have a buffer for the blocks after
> > > > that size. After we truncate or extend the file, we are certain that we
> > > > don't have a buffer for unknown blocks.
> > >
> > > Thanks, I understand now. Something feels fragile about it, perhaps
> > > because it's not really acting as a "cache" anymore despite its name,
> > > but I see the logic now. It becomes the authoritative source of
> > > information, even if the kernel decides to make our file smaller
> > > asynchronously.

I understand your hesitation, but I guess if we can't rely on this cache in recovery, then we probably have a problem even without this patch, because the current relation extension (in ReadBuffer_common) relies on smgrnblocks. So, if the cache lies to us it will overwrite some existing block.

> > Thank you, Horiguchi-san, you are a savior! I was worried the end of the world had come.
> >
> > > I think a synchronised file size cache wouldn't be enough to use this
> > > trick outside the recovery process, because the initial value would
> > > come from a call to lseek(), but unlike recovery, that wouldn't happen
> > > *before* we start putting pages in the buffer pool.

This is true because the other sessions might have pulled the page into the buffer pool, but I think if we have invalidations for extension/truncation of a relation, then probably before relying on this value we can process the invalidations to update this cache value.

> > > Also, if we one
> > > day have a size-limited relcache, even recovery could get into
> > > trouble, if it evicts the RelationData that holds the authoritative
> > > nblocks value.
> >
> > That's too bad, because we hoped we might be able to optimize various operations during normal operation (TRUNCATE, DROP TABLE/INDEX, DROP DATABASE, etc.). When an honest man can't trust the system calls, that's hell.
> >
> > I'm probably being silly, but can't we avoid the problem by using fstat() instead of lseek(SEEK_END)? Would they return the same value from the i-node?
> >
> > Or, can't we just try to do BufTableLookup() one block after what smgrnblocks() returns?
>
> A lossy smgr relation cache or relcache is not a hard obstacle. As in
> the !accurate case, we just give up the optimized dropping if
> the relcache doesn't give the authoritative size.

I think detecting a lossy cache is the key thing; it probably isn't as straightforward as it is in the recovery path.

> By the way, a heap scan finds the size of the target relation using
> smgrnblocks(). I'm not sure why we don't miss recently extended pages
> in a heap scan. It seems possible that a concurrent checkpoint
> fsyncs relation files in between the extension and the scan, and the
> scan gets a smaller size than the real one.

Yeah, I think that would be a problem, but not as serious as the case we are trying to deal with here.

--
With Regards,
Amit Kapila.
On Thu, Oct 22, 2020 at 8:32 PM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote:
> I'm probably being silly, but can't we avoid the problem by using fstat() instead of lseek(SEEK_END)? Would they return the same value from the i-node?

Amazingly, st_size can disagree with SEEK_END when using the Linux NFS client, but its behaviour is worse. Here's a sequence from a Linux NFS client talking to a Linux NFS server with no free space. This time, I also replaced the fsync() with sleep(60), just to make it clear that the SEEK_END offset can move at any time due to asynchronous activity in kernel threads:

lseek(..., SEEK_END) = 9670656
fstat(...) = 0, st_size = 9670656
write(...) = 8192
lseek(..., SEEK_END) = 9678848
fstat(...) = 0, st_size = 9670656   (*1)
sleep(...) = 0
lseek(..., SEEK_END) = 9670656      (*2)
fstat(...) = 0, st_size = 9670656
fsync(...) = -1
lseek(..., SEEK_END) = 9670656
fstat(...) = 0, st_size = 9670656
fsync(...) = 0

However, I'm not entirely sure which phenomena visible here to blame on which subsystems, and therefore which things to expect on local filesystems, or on other operating systems. I can say that with a FreeBSD NFS client and the same Linux NFS server, I don't see phenomenon *1 (unsurprising) but I do see phenomenon *2 (surprising to me).

> Or, can't we just try to do BufTableLookup() one block after what smgrnblocks() returns?

Unfortunately the problem isn't limited to one block.
From: Thomas Munro <thomas.munro@gmail.com>
> > I'm probably being silly, but can't we avoid the problem by using fstat()
> > instead of lseek(SEEK_END)? Would they return the same value from the
> > i-node?
>
> Amazingly, st_size can disagree with SEEK_END when using the Linux NFS
> client, but its behaviour is worse. Here's a sequence from a Linux
> NFS client talking to a Linux NFS server with no free space. This
> time, I also replaced the fsync() with sleep(60), just to make it
> clear that the SEEK_END offset can move at any time due to asynchronous
> activity in kernel threads:

Thank you for experimenting. That's surely amazing. So, it makes sense for commercial DBMSs and MySQL to preallocate data files... (But IIRC, MySQL has provided an option to allocate a file per table, like Postgres, relatively recently.)

FWIW, it seems safe to use the nodelalloc mount option with ext4 to disable delayed allocation, while xfs doesn't have such an option.

> > Or, can't we just try to do BufTableLookup() one block after what
> > smgrnblocks() returns?
>
> Unfortunately the problem isn't limited to one block.

You're right. The data file can be extended by multiple blocks between disk writes.

Regards
Takayuki Tsunakawa
Hi everyone,

Attached is the updated set of patches (V28). 0004 (the truncate optimization) is a new patch, while the rest are similar to V27. This passes the build, regression, and TAP tests.

Apologies for the delay. I'll post the benchmark test results on SSD soon, considering the benchmark suggested by Horiguchi-san:

> Currently BUF_DROP_FULL_SCAN_THRESHOLD is set to Nbuffers / 512, which is quite arbitrary and comes from a wild guess.
>
> Perhaps we need to run benchmarks that drop one relation with several different ratios between the number of buffers to-be-dropped and Nbuffers, and preferably both on spinning rust and SSD.

Regards,
Kirk Jamison
The patch looks almost good except for the minor ones:

(1)
+ for (i = 0; i < nnodes; i++)
+ {
+     RelFileNodeBackend rnode = smgr_reln[i]->smgr_rnode;
+
+     rnodes[i] = rnode;
+ }

You can write:

+ for (i = 0; i < nnodes; i++)
+     rnodes[i] = smgr_reln[i]->smgr_rnode;

(2)
+ if (!accurate || j >= MAX_FORKNUM ||

The correct condition would be:

+ if (j <= MAX_FORKNUM ||

because j becomes MAX_FORKNUM + 1 if accurate sizes for all forks could be obtained. If any fork's size is inaccurate, j is <= MAX_FORKNUM when exiting the loop, so you don't need to test the accurate flag.

(3)
+ {
+     goto buffer_full_scan;
+     return;
+ }

The return after goto cannot be reached, so this should just be:

+ goto buffer_full_scan;

Regards
Takayuki Tsunakawa
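Review comment (2) hinges on a loop-exit invariant that is easier to see in skeletal form. The fragment below is illustrative only (variable names are assumptions, and the real patch also collects fork numbers and sizes); it shows why testing j alone distinguishes the two exit paths:

/*
 * Skeleton of the fork-size loop behind comment (2): j only reaches
 * MAX_FORKNUM + 1 when every fork's size was obtained accurately; any
 * early break leaves j <= MAX_FORKNUM, so the accurate flag does not
 * need to be re-tested after the loop.
 */
for (j = 0; j <= MAX_FORKNUM; j++)
{
	if (!smgrexists(reln, j))
		continue;

	nForkBlocks[j] = smgrnblocks(reln, j, &accurate);
	if (!accurate)
		break;					/* exits with j <= MAX_FORKNUM */
}

if (j <= MAX_FORKNUM)
	goto buffer_full_scan;		/* some fork size was unreliable: give up */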
Hi,

I've updated patch 0004 (the truncate optimization) with the previous comments of Tsunakawa-san already addressed. (Thank you very much for the review.) The change compared to the previous version is that in DropRelFileNodesAllBuffers() we don't check the accurate flag anymore when deciding whether to optimize or not. For relations with blocks that do not exceed the threshold for a full scan, we call DropRelFileNodeBuffers, where the flag will be checked anyway. Otherwise, we proceed to the traditional buffer scan. Thoughts?

I've measured recovery performance for TRUNCATE. Test case: 1 parent table with 100 child partitions; TRUNCATE each child partition (1 transaction per table). Currently, it takes a while to recover when we have a large shared_buffers setting, but with the patch applied the recovery time is almost constant (0.206 s below).

| s_b   | master | patched |
|-------|--------|---------|
| 128MB | 0.105  | 0.105   |
| 1GB   | 0.205  | 0.205   |
| 20GB  | 2.008  | 0.206   |
| 100GB | 9.315  | 0.206   |

Method of testing (assuming streaming replication is configured):
1. Create 1 parent table and 100 child partitions.
2. Insert data into each table.
3. Pause WAL replay on the standby. ( SELECT pg_wal_replay_pause(); )
4. TRUNCATE each child partition on the primary (1 transaction per table). Stop the primary.
5. Resume the WAL replay and promote the standby. ( SELECT pg_wal_replay_resume(); pg_ctl promote )

I have confirmed that the relations became empty on the standby. Your thoughts and feedback are very much appreciated.

Regards,
Kirk Jamison
On Wed, Nov 4, 2020 at 8:28 AM k.jamison@fujitsu.com <k.jamison@fujitsu.com> wrote:
>
> Hi,
>
> I've updated patch 0004 (the truncate optimization) with the previous comments of Tsunakawa-san already addressed. (Thank you very much for the review.) The change compared to the previous version is that in DropRelFileNodesAllBuffers() we don't check the accurate flag anymore when deciding whether to optimize or not. For relations with blocks that do not exceed the threshold for a full scan, we call DropRelFileNodeBuffers, where the flag will be checked anyway. Otherwise, we proceed to the traditional buffer scan. Thoughts?
>

Can we do the truncate optimization once we decide about your other patch, as I see a few problems with it? If we can get the first patch (vacuum optimization) committed it might be a bit easier for us to get the truncate optimization. If possible, let's focus on (auto)vacuum optimization first.

Few comments on patches:
======================
v29-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks
-----------------------------------------------------------------------------------
1.
-smgrnblocks(SMgrRelation reln, ForkNumber forknum)
+smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool *accurate)
 {
 	BlockNumber result;

 	/*
 	 * For now, we only use cached values in recovery due to lack of a shared
-	 * invalidation mechanism for changes in file size.
+	 * invalidation mechanism for changes in file size. The cached values
+	 * could be smaller than the actual number of existing buffers of the file.
+	 * This is caused by lseek of buggy Linux kernels that might not have
+	 * accounted for the recent write.
 	 */
 	if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
+	{
+		if (accurate != NULL)
+			*accurate = true;
+

I don't understand this comment. A few emails back, I think we discussed that the cached value can't be less than the number of buffers during recovery. If that happens to be true then we have some problem. If you want to explain the 'accurate' variable then you can do the same atop the function. Would it be better to name this variable 'cached'?

v29-0003-Optimize-DropRelFileNodeBuffers-during-recovery
----------------------------------------------------------------------------------
2.
+	/* Check that it is in the buffer pool. If not, do nothing. */
+	LWLockAcquire(bufPartitionLock, LW_SHARED);
+	buf_id = BufTableLookup(&bufTag, bufHash);
+	LWLockRelease(bufPartitionLock);
+
+	if (buf_id < 0)
+		continue;
+
+	bufHdr = GetBufferDescriptor(buf_id);
+
+	buf_state = LockBufHdr(bufHdr);
+
+	if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&

I think a pre-check for RelFileNode might be better before LockBufHdr for the reasons mentioned in this function a few lines down.

3.
-DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
+DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
 					   int nforks, BlockNumber *firstDelBlock)
 {
 	int			i;
 	int			j;
+	RelFileNodeBackend rnode;
+	bool		accurate;

It is better to initialize accurate with false. Again, is it better to change this variable name to 'cached'?

4.
+	/*
+	 * Look up the buffer in the hashtable if the block size is known to
+	 * be accurate and the total blocks to be invalidated is below the
+	 * full scan threshold. Otherwise, give up the optimization.
+	 */
+	if (accurate && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
+	{
+		for (j = 0; j < nforks; j++)
+		{
+			BlockNumber curBlock;
+
+			for (curBlock = firstDelBlock[j]; curBlock < nForkBlocks[j]; curBlock++)
+			{
+				uint32		bufHash;	/* hash value for tag */
+				BufferTag	bufTag;		/* identity of requested block */
+				LWLock	   *bufPartitionLock;	/* buffer partition lock for it */
+				int			buf_id;
+
+				/* create a tag so we can lookup the buffer */
+				INIT_BUFFERTAG(bufTag, rnode.node, forkNum[j], curBlock);
+
+				/* determine its hash code and partition lock ID */
+				bufHash = BufTableHashCode(&bufTag);
+				bufPartitionLock = BufMappingPartitionLock(bufHash);
+
+				/* Check that it is in the buffer pool. If not, do nothing. */
+				LWLockAcquire(bufPartitionLock, LW_SHARED);
+				buf_id = BufTableLookup(&bufTag, bufHash);
+				LWLockRelease(bufPartitionLock);
+
+				if (buf_id < 0)
+					continue;
+
+				bufHdr = GetBufferDescriptor(buf_id);
+
+				buf_state = LockBufHdr(bufHdr);
+
+				if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
+					bufHdr->tag.forkNum == forkNum[j] &&
+					bufHdr->tag.blockNum >= firstDelBlock[j])
+					InvalidateBuffer(bufHdr);	/* releases spinlock */
+				else
+					UnlockBufHdr(bufHdr, buf_state);
+			}
+		}
+		return;
+	}

Can we move the code under this 'if' condition to a separate function, say FindAndDropRelFileNodeBuffers or something like that?

v29-0004-TRUNCATE-optimization
------------------------------------------------
5.
+	for (i = 0; i < n; i++)
+	{
+		nforks = 0;
+		nBlocksToInvalidate = 0;
+
+		for (j = 0; j <= MAX_FORKNUM; j++)
+		{
+			if (!smgrexists(rels[i], j))
+				continue;
+
+			/* Get the number of blocks for a relation's fork */
+			nblocks = smgrnblocks(rels[i], j, NULL);
+
+			nBlocksToInvalidate += nblocks;
+
+			forks[nforks++] = j;
+		}
+		if (nBlocksToInvalidate >= BUF_DROP_FULL_SCAN_THRESHOLD)
+			goto buffer_full_scan;
+
+		DropRelFileNodeBuffers(rels[i], forks, nforks, firstDelBlocks);
+	}
+	pfree(nodes);
+	pfree(rels);
+	pfree(rnodes);
+	return;

I think this can be slower than the current Truncate. Say there are three relations and for one of them the size is greater than BUF_DROP_FULL_SCAN_THRESHOLD; then you would anyway have to scan the entire shared buffers, so the work done in the optimized path for the other two relations will add some overhead. Also, as written, I think you need to remove the nodes for which you have invalidated the buffers via the optimized path, no?

--
With Regards,
Amit Kapila.
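For what comment 4 is asking, the extracted helper could look roughly like the sketch below. It is lifted from the quoted patch hunk and reshaped, so it is a reading aid rather than the final committed form, and the signature is an assumption:

/*
 * Sketch of the helper suggested in comment 4: look up each block of one
 * fork in the buffer mapping table and invalidate it if it still belongs
 * to the relation.  Based on the quoted patch hunk; not the final code.
 */
static void
FindAndDropRelFileNodeBuffers(RelFileNode rnode, ForkNumber forkNum,
							  BlockNumber nForkBlocks, BlockNumber firstDelBlock)
{
	BlockNumber curBlock;

	for (curBlock = firstDelBlock; curBlock < nForkBlocks; curBlock++)
	{
		uint32		bufHash;
		BufferTag	bufTag;
		LWLock	   *bufPartitionLock;
		int			buf_id;
		BufferDesc *bufHdr;
		uint32		buf_state;

		/* create a tag so we can look up the buffer */
		INIT_BUFFERTAG(bufTag, rnode, forkNum, curBlock);
		bufHash = BufTableHashCode(&bufTag);
		bufPartitionLock = BufMappingPartitionLock(bufHash);

		LWLockAcquire(bufPartitionLock, LW_SHARED);
		buf_id = BufTableLookup(&bufTag, bufHash);
		LWLockRelease(bufPartitionLock);

		if (buf_id < 0)
			continue;			/* not in the buffer pool */

		bufHdr = GetBufferDescriptor(buf_id);
		buf_state = LockBufHdr(bufHdr);

		/* Re-check under the header spinlock; the buffer may have moved. */
		if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
			bufHdr->tag.forkNum == forkNum &&
			bufHdr->tag.blockNum >= firstDelBlock)
			InvalidateBuffer(bufHdr);	/* releases spinlock */
		else
			UnlockBufHdr(bufHdr, buf_state);
	}
}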
Hello. Many of the questions are on code following my past suggestions.

At Wed, 4 Nov 2020 15:59:17 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> On Wed, Nov 4, 2020 at 8:28 AM k.jamison@fujitsu.com <k.jamison@fujitsu.com> wrote:
> >
> > Hi,
> >
> > I've updated patch 0004 (the truncate optimization) with the previous comments of Tsunakawa-san already addressed. (Thank you very much for the review.) The change compared to the previous version is that in DropRelFileNodesAllBuffers() we don't check the accurate flag anymore when deciding whether to optimize or not. For relations with blocks that do not exceed the threshold for a full scan, we call DropRelFileNodeBuffers, where the flag will be checked anyway. Otherwise, we proceed to the traditional buffer scan. Thoughts?
> >
>
> Can we do the truncate optimization once we decide about your other patch, as I see a few problems with it? If we can get the first patch (vacuum optimization) committed it might be a bit easier for us to get the truncate optimization. If possible, let's focus on (auto)vacuum optimization first.
>
> Few comments on patches:
> ======================
> v29-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks
> -----------------------------------------------------------------------------------
> 1.
> -smgrnblocks(SMgrRelation reln, ForkNumber forknum)
> +smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool *accurate)
> {
> 	BlockNumber result;
>
> 	/*
> 	 * For now, we only use cached values in recovery due to lack of a shared
> -	 * invalidation mechanism for changes in file size.
> +	 * invalidation mechanism for changes in file size. The cached values
> +	 * could be smaller than the actual number of existing buffers of the file.
> +	 * This is caused by lseek of buggy Linux kernels that might not have
> +	 * accounted for the recent write.
> 	 */
> 	if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
> +	{
> +		if (accurate != NULL)
> +			*accurate = true;
> +
>
> I don't understand this comment. A few emails back, I think we discussed that the cached value can't be less than the number of buffers during recovery. If that happens to be true then we have some problem. If you want to explain the 'accurate' variable then you can do the same atop the function. Would it be better to name this variable 'cached'?

(I agree that the comment needs to be fixed.)

FWIW I don't think 'cached' suggests the characteristics of the returned value on its interface. It was introduced to reduce fseek() calls, and after that we found that it can be regarded as the authoritative source of the file size. The "accurate" means that it is guaranteed that we don't have a buffer for the file blocks further than that number. I don't come up with a more proper word than "accurate", but I also don't think "cached" is proper here.

By the way, if there's a case where we extend a file by more than one block, the cached value becomes invalid. I'm not sure if it actually happens, but the following sequence may lead to a problem. We need a protection for that case.

smgrnblocks()   : cached n
truncate to n-5 : cached n-5
extend to m + 2 : cached invalid
(fsync failed)
smgrnblocks()   : returns and caches n-5

> v29-0003-Optimize-DropRelFileNodeBuffers-during-recovery
> ----------------------------------------------------------------------------------
> 2.
> +	/* Check that it is in the buffer pool. If not, do nothing. */
> +	LWLockAcquire(bufPartitionLock, LW_SHARED);
> +	buf_id = BufTableLookup(&bufTag, bufHash);
> +	LWLockRelease(bufPartitionLock);
> +
> +	if (buf_id < 0)
> +		continue;
> +
> +	bufHdr = GetBufferDescriptor(buf_id);
> +
> +	buf_state = LockBufHdr(bufHdr);
> +
> +	if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
>
> I think a pre-check for RelFileNode might be better before LockBufHdr for the reasons mentioned in this function a few lines down.

The equivalent check is already done by BufTableLookup(). The last line in the above is not a precheck but the final check.

> 3.
> -DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
> +DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
> 					   int nforks, BlockNumber *firstDelBlock)
> {
> 	int			i;
> 	int			j;
> +	RelFileNodeBackend rnode;
> +	bool		accurate;
>
> It is better to initialize accurate with false. Again, is it better to change this variable name to 'cached'?

*I* agree to the initialization.

> 4.
> +	/*
> +	 * Look up the buffer in the hashtable if the block size is known to
> +	 * be accurate and the total blocks to be invalidated is below the
> +	 * full scan threshold. Otherwise, give up the optimization.
> +	 */
> +	if (accurate && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
> +	{
> +		for (j = 0; j < nforks; j++)
> +		{
> +			BlockNumber curBlock;
> +
> +			for (curBlock = firstDelBlock[j]; curBlock < nForkBlocks[j]; curBlock++)
> +			{
> +				uint32		bufHash;	/* hash value for tag */
> +				BufferTag	bufTag;		/* identity of requested block */
> +				LWLock	   *bufPartitionLock;	/* buffer partition lock for it */
> +				int			buf_id;
> +
> +				/* create a tag so we can lookup the buffer */
> +				INIT_BUFFERTAG(bufTag, rnode.node, forkNum[j], curBlock);
> +
> +				/* determine its hash code and partition lock ID */
> +				bufHash = BufTableHashCode(&bufTag);
> +				bufPartitionLock = BufMappingPartitionLock(bufHash);
> +
> +				/* Check that it is in the buffer pool. If not, do nothing. */
> +				LWLockAcquire(bufPartitionLock, LW_SHARED);
> +				buf_id = BufTableLookup(&bufTag, bufHash);
> +				LWLockRelease(bufPartitionLock);
> +
> +				if (buf_id < 0)
> +					continue;
> +
> +				bufHdr = GetBufferDescriptor(buf_id);
> +
> +				buf_state = LockBufHdr(bufHdr);
> +
> +				if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> +					bufHdr->tag.forkNum == forkNum[j] &&
> +					bufHdr->tag.blockNum >= firstDelBlock[j])
> +					InvalidateBuffer(bufHdr);	/* releases spinlock */
> +				else
> +					UnlockBufHdr(bufHdr, buf_state);
> +			}
> +		}
> +		return;
> +	}
>
> Can we move the code under this 'if' condition to a separate function, say FindAndDropRelFileNodeBuffers or something like that?

Thinking about the TRUNCATE optimization, it sounds reasonable to have a separate function, which runs the optimized dropping unconditionally.

> v29-0004-TRUNCATE-optimization
> ------------------------------------------------
> 5.
> +	for (i = 0; i < n; i++)
> +	{
> +		nforks = 0;
> +		nBlocksToInvalidate = 0;
> +
> +		for (j = 0; j <= MAX_FORKNUM; j++)
> +		{
> +			if (!smgrexists(rels[i], j))
> +				continue;
> +
> +			/* Get the number of blocks for a relation's fork */
> +			nblocks = smgrnblocks(rels[i], j, NULL);
> +
> +			nBlocksToInvalidate += nblocks;
> +
> +			forks[nforks++] = j;
> +		}
> +		if (nBlocksToInvalidate >= BUF_DROP_FULL_SCAN_THRESHOLD)
> +			goto buffer_full_scan;
> +
> +		DropRelFileNodeBuffers(rels[i], forks, nforks, firstDelBlocks);
> +	}
> +	pfree(nodes);
> +	pfree(rels);
> +	pfree(rnodes);
> +	return;
>
> I think this can be slower than the current Truncate. Say there are three relations and for one of them the size is greater than BUF_DROP_FULL_SCAN_THRESHOLD; then you would anyway have to scan the entire shared buffers, so the work done in the optimized path for the other two relations will add some overhead.

That's true. The criteria here is the number of blocks of all relations. And even if all of the relations are smaller than the threshold, we should go to the full-scan dropping if the total size exceeds the threshold. So we cannot reuse DropRelFileNodeBuffers() as is here.

> Also, as written, I think you need to remove the nodes for which you have invalidated the buffers via the optimized path, no?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
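Horiguchi-san's objection above is essentially about where the threshold is applied. A shape like the following sketch applies it to the combined total before committing to the per-block path; this is illustrative pseudocode with assumed variable names, not the final patch:

/*
 * Sketch of applying the full-scan threshold to the total across all
 * relations in DropRelFileNodesAllBuffers(), rather than per relation:
 * sum the sizes first, and only take the per-block lookup path when the
 * grand total is small and every size is reliable.
 */
nBlocksToInvalidate = 0;
for (i = 0; i < n; i++)
{
	for (j = 0; j <= MAX_FORKNUM; j++)
	{
		if (!smgrexists(rels[i], j))
			continue;

		nblocks = smgrnblocks(rels[i], j, &cached);
		if (!cached)
			goto buffer_full_scan;	/* size unreliable: give up early */

		nBlocksToInvalidate += nblocks;
	}
}

if (nBlocksToInvalidate >= BUF_DROP_FULL_SCAN_THRESHOLD)
	goto buffer_full_scan;			/* total too large for per-block lookups */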
On Thursday, November 5, 2020 10:22 AM, Horiguchi-san wrote:
> Hello.
>
> Many of the questions are on code following my past suggestions.

Yeah, I was also about to answer with the feedback you have given. Thank you for replying and taking a look too.

> At Wed, 4 Nov 2020 15:59:17 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> > On Wed, Nov 4, 2020 at 8:28 AM k.jamison@fujitsu.com <k.jamison@fujitsu.com> wrote:
> > >
> > > Hi,
> > >
> > > I've updated patch 0004 (the truncate optimization) with the previous comments of Tsunakawa-san already addressed. (Thank you very much for the review.) The change compared to the previous version is that in DropRelFileNodesAllBuffers() we don't check the accurate flag anymore when deciding whether to optimize or not. For relations with blocks that do not exceed the threshold for a full scan, we call DropRelFileNodeBuffers, where the flag will be checked anyway. Otherwise, we proceed to the traditional buffer scan. Thoughts?
> > >
> >
> > Can we do the truncate optimization once we decide about your other patch, as I see a few problems with it? If we can get the first patch (vacuum optimization) committed it might be a bit easier for us to get the truncate optimization. If possible, let's focus on (auto)vacuum optimization first.

Sure. That'd be better.

> > Few comments on patches:
> > ======================
> > v29-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks
> > ----------------------------------------------------------------------
> > 1.
> > -smgrnblocks(SMgrRelation reln, ForkNumber forknum)
> > +smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool *accurate)
> > {
> > BlockNumber result;
> >
> > /*
> > * For now, we only use cached values in recovery due to lack of a shared
> > - * invalidation mechanism for changes in file size.
> > + * invalidation mechanism for changes in file size. The cached values
> > + * could be smaller than the actual number of existing buffers of the file.
> > + * This is caused by lseek of buggy Linux kernels that might not have
> > + * accounted for the recent write.
> > */
> > if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
> > + {
> > + if (accurate != NULL)
> > + *accurate = true;
> > +
> >
> > I don't understand this comment. A few emails back, I think we discussed that the cached value can't be less than the number of buffers during recovery. If that happens to be true then we have some problem. If you want to explain the 'accurate' variable then you can do the same atop the function. Would it be better to name this variable 'cached'?
>
> (I agree that the comment needs to be fixed.)
>
> FWIW I don't think 'cached' suggests the characteristics of the returned value on its interface. It was introduced to reduce fseek() calls, and after that we found that it can be regarded as the authoritative source of the file size. The "accurate" means that it is guaranteed that we don't have a buffer for the file blocks further than that number. I don't come up with a more proper word than "accurate", but I also don't think "cached" is proper here.

I also couldn't think of a better parameter name. Accurate seems to be a better fit, as it describes a measurement close to an accepted value. How about fixing the comment like below; would this suffice?

/*
 * smgrnblocks() -- Calculate the number of blocks in the
 * supplied relation.
 *
 * accurate flag acts as an authoritative source of the file size and
 * ensures that no buffers exist for blocks after the file size is known
 * to the process.
 */
BlockNumber
smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool *accurate)
{
	BlockNumber result;

	/*
	 * For now, we only use cached values in recovery due to lack of a shared
	 * invalidation mechanism for changes in file size. In recovery, the cached
	 * value returned by the first lseek could be smaller than the actual number
	 * of existing buffers of the file, which is caused by buggy Linux kernels
	 * that might not have accounted for the recent write. However, we can
	 * still rely on the cached value even if we get a bogus value from the first
	 * lseek since it is impossible to have a buffer for blocks after the file size.
	 */

> By the way, if there's a case where we extend a file by more than one block, the cached value becomes invalid. I'm not sure if it actually happens, but the following sequence may lead to a problem. We need a protection for that case.
>
> smgrnblocks()   : cached n
> truncate to n-5 : cached n-5
> extend to m + 2 : cached invalid
> (fsync failed)
> smgrnblocks()   : returns and caches n-5

I am not sure if the patch should cover this or whether it should be a separate thread altogether, since a number of functions also rely on smgrnblocks(). But I'll take it into consideration.

> > v29-0003-Optimize-DropRelFileNodeBuffers-during-recovery
> > ----------------------------------------------------------------------
> > 2.
> > + /* Check that it is in the buffer pool. If not, do nothing. */
> > + LWLockAcquire(bufPartitionLock, LW_SHARED); buf_id =
> > + BufTableLookup(&bufTag, bufHash); LWLockRelease(bufPartitionLock);
> > +
> > + if (buf_id < 0)
> > + continue;
> > +
> > + bufHdr = GetBufferDescriptor(buf_id);
> > +
> > + buf_state = LockBufHdr(bufHdr);
> > +
> > + if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> >
> > I think a pre-check for RelFileNode might be better before LockBufHdr for the reasons mentioned in this function a few lines down.
>
> The equivalent check is already done by BufTableLookup(). The last line in the above is not a precheck but the final check.

Right. So I'll retain the current code.

> > 3.
> > -DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
> > +DropRelFileNodeBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
> > int nforks, BlockNumber *firstDelBlock) {
> > int i;
> > int j;
> > + RelFileNodeBackend rnode;
> > + bool accurate;
> >
> > It is better to initialize accurate with false. Again, is it better to change this variable name to 'cached'?
>
> *I* agree to the initialization.

Understood. I'll include only the initialization in the next updated patch.

> > 4.
> > + /*
> > + * Look up the buffer in the hashtable if the block size is known to
> > + * be accurate and the total blocks to be invalidated is below the
> > + * full scan threshold. Otherwise, give up the optimization.
> > + */
> > + if (accurate && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
> > + { for (j = 0; j < nforks; j++) { BlockNumber curBlock;
> > +
> > + for (curBlock = firstDelBlock[j]; curBlock < nForkBlocks[j]; curBlock++) {
> > + uint32 bufHash; /* hash value for tag */
> > + BufferTag bufTag; /* identity of requested block */
> > + LWLock *bufPartitionLock; /* buffer partition lock for it */
> > + int buf_id;
> > +
> > + /* create a tag so we can lookup the buffer */
> > + INIT_BUFFERTAG(bufTag, rnode.node, forkNum[j], curBlock);
> > +
> > + /* determine its hash code and partition lock ID */
> > + bufHash = BufTableHashCode(&bufTag);
> > + bufPartitionLock = BufMappingPartitionLock(bufHash);
> > +
> > + /* Check that it is in the buffer pool. If not, do nothing. */
> > + LWLockAcquire(bufPartitionLock, LW_SHARED); buf_id =
> > + BufTableLookup(&bufTag, bufHash); LWLockRelease(bufPartitionLock);
> > +
> > + if (buf_id < 0)
> > + continue;
> > +
> > + bufHdr = GetBufferDescriptor(buf_id);
> > +
> > + buf_state = LockBufHdr(bufHdr);
> > +
> > + if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> > + bufHdr->tag.forkNum == forkNum[j] &&
> > + bufHdr->tag.blockNum >= firstDelBlock[j])
> > + InvalidateBuffer(bufHdr); /* releases spinlock */ else
> > + UnlockBufHdr(bufHdr, buf_state); } } return; }
> >
> > Can we move the code under this 'if' condition to a separate function, say FindAndDropRelFileNodeBuffers or something like that?
>
> Thinking about the TRUNCATE optimization, it sounds reasonable to have a separate function, which runs the optimized dropping unconditionally.

Hmm, sure, although only DropRelFileNodeBuffers() would call the new function. I guess it won't be a problem.

> > v29-0004-TRUNCATE-optimization
> > ------------------------------------------------
> > 5.
> > + for (i = 0; i < n; i++)
> > + {
> > + nforks = 0;
> > + nBlocksToInvalidate = 0;
> > +
> > + for (j = 0; j <= MAX_FORKNUM; j++)
> > + {
> > + if (!smgrexists(rels[i], j))
> > + continue;
> > +
> > + /* Get the number of blocks for a relation's fork */
> > + nblocks = smgrnblocks(rels[i], j, NULL);
> > +
> > + nBlocksToInvalidate += nblocks;
> > +
> > + forks[nforks++] = j;
> > + }
> > + if (nBlocksToInvalidate >= BUF_DROP_FULL_SCAN_THRESHOLD) goto
> > + buffer_full_scan;
> > +
> > + DropRelFileNodeBuffers(rels[i], forks, nforks, firstDelBlocks); }
> > + pfree(nodes); pfree(rels); pfree(rnodes); return;
> >
> > I think this can be slower than the current Truncate. Say there are three relations and for one of them the size is greater than BUF_DROP_FULL_SCAN_THRESHOLD; then you would anyway have to scan the entire shared buffers, so the work done in the optimized path for the other two relations will add some overhead.
>
> That's true. The criteria here is the number of blocks of all relations. And even if all of the relations are smaller than the threshold, we should go to the full-scan dropping if the total size exceeds the threshold. So we cannot reuse DropRelFileNodeBuffers() as is here.
>
> > Also, as written, I think you need to remove the nodes for which you have invalidated the buffers via the optimized path, no?

Right, in the current patch it is indeed slower. But the decision whether to optimize or not is made per relation, not for all relations. So there is a possibility that we have already invalidated the buffers of the first relation, but the next relation's buffers exceed the threshold so that we need to do the full scan. So yes, that should be fixed: remove the nodes that we have already invalidated so that we don't match them anymore when scanning NBuffers. I will fix it in the next version.

Thank you for the helpful feedback. I'll upload the updated set of patches soon, once we reach a consensus on the boolean parameter name too.

Regards,
Kirk Jamison
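The fix Kirk describes at the end, removing already-invalidated nodes before the full scan, could take a shape like this sketch; the function name, array names, and the dropped[] bookkeeping are assumptions for illustration:

/*
 * Sketch of compacting the node array so the subsequent full scan of
 * shared buffers only matches relations that were not already handled
 * by the per-block lookup path.
 */
static int
compact_remaining_nodes(RelFileNodeBackend *nodes, const bool *dropped, int n)
{
	int			remaining = 0;

	for (int i = 0; i < n; i++)
	{
		if (dropped[i])
			continue;			/* buffers already invalidated */
		nodes[remaining++] = nodes[i];
	}

	return remaining;			/* full scan now visits only these nodes */
}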
On Thu, Nov 5, 2020 at 8:26 AM k.jamison@fujitsu.com <k.jamison@fujitsu.com> wrote:
>
> On Thursday, November 5, 2020 10:22 AM, Horiguchi-san wrote:
> > >
> > > Can we do the truncate optimization once we decide about your other patch, as I see a few problems with it? If we can get the first patch (vacuum optimization) committed it might be a bit easier for us to get the truncate optimization. If possible, let's focus on (auto)vacuum optimization first.
>
> Sure. That'd be better.
>

Okay, thanks.

> > > Few comments on patches:
> > > ======================
> > > v29-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks
> > > ----------------------------------------------------------------------
> > > 1.
> > > -smgrnblocks(SMgrRelation reln, ForkNumber forknum)
> > > +smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool *accurate)
> > > {
> > > BlockNumber result;
> > >
> > > /*
> > > * For now, we only use cached values in recovery due to lack of a shared
> > > - * invalidation mechanism for changes in file size.
> > > + * invalidation mechanism for changes in file size. The cached values
> > > + * could be smaller than the actual number of existing buffers of the file.
> > > + * This is caused by lseek of buggy Linux kernels that might not have
> > > + * accounted for the recent write.
> > > */
> > > if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
> > > + {
> > > + if (accurate != NULL)
> > > + *accurate = true;
> > > +
> > >
> > > I don't understand this comment. A few emails back, I think we discussed that the cached value can't be less than the number of buffers during recovery. If that happens to be true then we have some problem. If you want to explain the 'accurate' variable then you can do the same atop the function. Would it be better to name this variable 'cached'?
> >
> > (I agree that the comment needs to be fixed.)
> >
> > FWIW I don't think 'cached' suggests the characteristics of the returned value on its interface. It was introduced to reduce fseek() calls, and after that we found that it can be regarded as the authoritative source of the file size. The "accurate" means that it is guaranteed that we don't have a buffer for the file blocks further than that number. I don't come up with a more proper word than "accurate", but I also don't think "cached" is proper here.
>

Sure, but that is not the guarantee this API gives. It has to be guaranteed by the logic elsewhere, so I am not sure if it is a good idea to try to reflect the same here. The comments in the caller where we use this should explain why it is safe to use this value.

> I also couldn't think of a better parameter name. Accurate seems to be a better fit, as it describes a measurement close to an accepted value. How about fixing the comment like below; would this suffice?
>
> /*
>  * smgrnblocks() -- Calculate the number of blocks in the
>  * supplied relation.
>  *
>  * accurate flag acts as an authoritative source of the file size and
>  * ensures that no buffers exist for blocks after the file size is known
>  * to the process.
>  */
> BlockNumber
> smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool *accurate)
> {
> 	BlockNumber result;
>
> 	/*
> 	 * For now, we only use cached values in recovery due to lack of a shared
> 	 * invalidation mechanism for changes in file size. In recovery, the cached
> 	 * value returned by the first lseek could be smaller than the actual number
> 	 * of existing buffers of the file, which is caused by buggy Linux kernels
> 	 * that might not have accounted for the recent write. However, we can
> 	 * still rely on the cached value even if we get a bogus value from the first
> 	 * lseek since it is impossible to have a buffer for blocks after the file size.
> 	 */
>
> > By the way, if there's a case where we extend a file by more than one block, the cached value becomes invalid. I'm not sure if it actually happens, but the following sequence may lead to a problem. We need a protection for that case.
> >
> > smgrnblocks()   : cached n
> > truncate to n-5 : cached n-5
> > extend to m + 2 : cached invalid
> > (fsync failed)
> > smgrnblocks()   : returns and caches n-5
>

I think one possible idea is to actually commit the Assert patch (v29-0001-Prevent-invalidating-blocks-in-smgrextend-during) to ensure that it can't happen during recovery. And even if it happens, why would there be any buffer with the block in it left when the fsync failed? And if there is no buffer with a block that isn't accounted for due to lseek lies, then there shouldn't be any problem. Do you have any other ideas on what better can be done here?

> I am not sure if the patch should cover this or whether it should be a separate thread altogether, since a number of functions also rely on smgrnblocks(). But I'll take it into consideration.
>
> > > v29-0003-Optimize-DropRelFileNodeBuffers-during-recovery
> > > ----------------------------------------------------------------------
> > > 2.
> > > + /* Check that it is in the buffer pool. If not, do nothing. */
> > > + LWLockAcquire(bufPartitionLock, LW_SHARED); buf_id =
> > > + BufTableLookup(&bufTag, bufHash); LWLockRelease(bufPartitionLock);
> > > +
> > > + if (buf_id < 0)
> > > + continue;
> > > +
> > > + bufHdr = GetBufferDescriptor(buf_id);
> > > +
> > > + buf_state = LockBufHdr(bufHdr);
> > > +
> > > + if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> > >
> > > I think a pre-check for RelFileNode might be better before LockBufHdr for the reasons mentioned in this function a few lines down.
> >
> > The equivalent check is already done by BufTableLookup(). The last line in the above is not a precheck but the final check.
>

Which check in that API are you talking about? Are you saying so because we are trying to use a hash value corresponding to rnode.node to find the block? Then I don't think it is equivalent, because there is a difference in the actual values. But even if we want to rely on that, a comment is required; though I guess we can do the check as well, because it shouldn't be a costly pre-check.

> > > 4.
> > > + /*
> > > + * Look up the buffer in the hashtable if the block size is known to
> > > + * be accurate and the total blocks to be invalidated is below the
> > > + * full scan threshold. Otherwise, give up the optimization.
> > > + */
> > > + if (accurate && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
> > > + { for (j = 0; j < nforks; j++) { BlockNumber curBlock;
> > > +
> > > + for (curBlock = firstDelBlock[j]; curBlock < nForkBlocks[j]; curBlock++) {
> > > + uint32 bufHash; /* hash value for tag */
> > > + BufferTag bufTag; /* identity of requested block */
> > > + LWLock *bufPartitionLock; /* buffer partition lock for it */
> > > + int buf_id;
> > > +
> > > + /* create a tag so we can lookup the buffer */
> > > + INIT_BUFFERTAG(bufTag, rnode.node, forkNum[j], curBlock);
> > > +
> > > + /* determine its hash code and partition lock ID */
> > > + bufHash = BufTableHashCode(&bufTag);
> > > + bufPartitionLock = BufMappingPartitionLock(bufHash);
> > > +
> > > + /* Check that it is in the buffer pool. If not, do nothing. */
> > > + LWLockAcquire(bufPartitionLock, LW_SHARED); buf_id =
> > > + BufTableLookup(&bufTag, bufHash); LWLockRelease(bufPartitionLock);
> > > +
> > > + if (buf_id < 0)
> > > + continue;
> > > +
> > > + bufHdr = GetBufferDescriptor(buf_id);
> > > +
> > > + buf_state = LockBufHdr(bufHdr);
> > > +
> > > + if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> > > + bufHdr->tag.forkNum == forkNum[j] &&
> > > + bufHdr->tag.blockNum >= firstDelBlock[j])
> > > + InvalidateBuffer(bufHdr); /* releases spinlock */ else
> > > + UnlockBufHdr(bufHdr, buf_state); } } return; }
> > >
> > > Can we move the code under this 'if' condition to a separate function, say FindAndDropRelFileNodeBuffers or something like that?
> >
> > Thinking about the TRUNCATE optimization, it sounds reasonable to have a separate function, which runs the optimized dropping unconditionally.
>
> Hmm, sure, although only DropRelFileNodeBuffers() would call the new function. I guess it won't be a problem.
>

That shouldn't be a problem; you can make it a static function. It is more from the code-readability perspective.

> > > v29-0004-TRUNCATE-optimization
> > > ------------------------------------------------
> > > 5.
> > > + for (i = 0; i < n; i++)
> > > + {
> > > + nforks = 0;
> > > + nBlocksToInvalidate = 0;
> > > +
> > > + for (j = 0; j <= MAX_FORKNUM; j++)
> > > + {
> > > + if (!smgrexists(rels[i], j))
> > > + continue;
> > > +
> > > + /* Get the number of blocks for a relation's fork */
> > > + nblocks = smgrnblocks(rels[i], j, NULL);
> > > +
> > > + nBlocksToInvalidate += nblocks;
> > > +
> > > + forks[nforks++] = j;
> > > + }
> > > + if (nBlocksToInvalidate >= BUF_DROP_FULL_SCAN_THRESHOLD) goto
> > > + buffer_full_scan;
> > > +
> > > + DropRelFileNodeBuffers(rels[i], forks, nforks, firstDelBlocks); }
> > > + pfree(nodes); pfree(rels); pfree(rnodes); return;
> > >
> > > I think this can be slower than the current Truncate. Say there are three relations and for one of them the size is greater than BUF_DROP_FULL_SCAN_THRESHOLD; then you would anyway have to scan the entire shared buffers, so the work done in the optimized path for the other two relations will add some overhead.
> >
> > That's true. The criteria here is the number of blocks of all relations. And even if all of the relations are smaller than the threshold, we should go to the full-scan dropping if the total size exceeds the threshold. So we cannot reuse DropRelFileNodeBuffers() as is here.
> > > Also, as written, I think you need to remove the nodes for which you have invalidated the buffers via the optimized path, no?
>
> Right, in the current patch it is indeed slower. But the decision whether to optimize or not is made per relation, not for all relations. So there is a possibility that we have already invalidated the buffers of the first relation, but the next relation's buffers exceed the threshold so that we need to do the full scan. So yes, that should be fixed: remove the nodes that we have already invalidated so that we don't match them anymore when scanning NBuffers. I will fix it in the next version.
>
> Thank you for the helpful feedback. I'll upload the updated set of patches soon, once we reach a consensus on the boolean parameter name too.
>

Sure, but feel free to leave the truncate optimization patch for now; we can do that as a follow-up patch once the vacuum-optimization patch is committed. Horiguchi-San, are you fine with this approach?

--
With Regards,
Amit Kapila.
At Thu, 5 Nov 2020 11:07:21 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> On Thu, Nov 5, 2020 at 8:26 AM k.jamison@fujitsu.com <k.jamison@fujitsu.com> wrote:
> > > > Few comments on patches:
> > > > ======================
> > > > v29-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks
> > > > ----------------------------------------------------------------------
> > > > 1.
> > > > -smgrnblocks(SMgrRelation reln, ForkNumber forknum)
> > > > +smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool *accurate)
> > > > {
> > > > BlockNumber result;
> > > >
> > > > /*
> > > > * For now, we only use cached values in recovery due to lack of a shared
> > > > - * invalidation mechanism for changes in file size.
> > > > + * invalidation mechanism for changes in file size. The cached values
> > > > + * could be smaller than the actual number of existing buffers of the file.
> > > > + * This is caused by lseek of buggy Linux kernels that might not have
> > > > + * accounted for the recent write.
> > > > */
> > > > if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
> > > > + {
> > > > + if (accurate != NULL)
> > > > + *accurate = true;
> > > > +
> > > >
> > > > I don't understand this comment. A few emails back, I think we discussed that the cached value can't be less than the number of buffers during recovery. If that happens to be true then we have some problem. If you want to explain the 'accurate' variable then you can do the same atop the function. Would it be better to name this variable 'cached'?
> > >
> > > (I agree that the comment needs to be fixed.)
> > >
> > > FWIW I don't think 'cached' suggests the characteristics of the returned value on its interface. It was introduced to reduce fseek() calls, and after that we found that it can be regarded as the authoritative source of the file size. The "accurate" means that it is guaranteed that we don't have a buffer for the file blocks further than that number. I don't come up with a more proper word than "accurate", but I also don't think "cached" is proper here.
> >
>
> Sure, but that is not the guarantee this API gives. It has to be guaranteed by the logic elsewhere, so I am not sure if it is a good idea to try to reflect the same here. The comments in the caller where we use this should explain why it is safe to use this value.

Isn't it already guaranteed by the bufmgr code that we don't have buffers for nonexistent file blocks? What is needed here is, yeah, that the returned value from smgrnblocks is "reliable". If "reliable" is still not proper, I give up and agree to "cached".

> > I also couldn't think of a better parameter name. Accurate seems to be a better fit, as it describes a measurement close to an accepted value. How about fixing the comment like below; would this suffice?
> >
> > /*
> >  * smgrnblocks() -- Calculate the number of blocks in the
> >  * supplied relation.
> >  *
> >  * accurate flag acts as an authoritative source of the file size and
> >  * ensures that no buffers exist for blocks after the file size is known
> >  * to the process.
> >  */
> > BlockNumber
> > smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool *accurate)
> > {
> > 	BlockNumber result;
> >
> > 	/*
> > 	 * For now, we only use cached values in recovery due to lack of a shared
> > 	 * invalidation mechanism for changes in file size. In recovery, the cached
> > 	 * value returned by the first lseek could be smaller than the actual number
> > 	 * of existing buffers of the file, which is caused by buggy Linux kernels
> > 	 * that might not have accounted for the recent write. However, we can
> > 	 * still rely on the cached value even if we get a bogus value from the first
> > 	 * lseek since it is impossible to have a buffer for blocks after the file size.
> > 	 */
> >
> > > By the way, if there's a case where we extend a file by more than one block, the cached value becomes invalid. I'm not sure if it actually happens, but the following sequence may lead to a problem. We need a protection for that case.
> > >
> > > smgrnblocks()   : cached n
> > > truncate to n-5 : cached n-5
> > > extend to m + 2 : cached invalid
> > > (fsync failed)
> > > smgrnblocks()   : returns and caches n-5
> >
>
> I think one possible idea is to actually commit the Assert patch (v29-0001-Prevent-invalidating-blocks-in-smgrextend-during) to ensure that it can't happen during recovery. And even if it happens, why would there be any buffer with the block in it left when the fsync failed? And if there is no buffer with a block that isn't accounted for due to lseek lies, then there shouldn't be any problem. Do you have any other ideas on what better can be done here?

Ouch! Sorry for the confusion. I was confused about that patch touching the truncation side. Yes, the 0001 patch does that.

> > I am not sure if the patch should cover this or whether it should be a separate thread altogether, since a number of functions also rely on smgrnblocks(). But I'll take it into consideration.
> >
> > > > v29-0003-Optimize-DropRelFileNodeBuffers-during-recovery
> > > > ----------------------------------------------------------------------
> > > > 2.
> > > > + /* Check that it is in the buffer pool. If not, do nothing. */
> > > > + LWLockAcquire(bufPartitionLock, LW_SHARED); buf_id =
> > > > + BufTableLookup(&bufTag, bufHash); LWLockRelease(bufPartitionLock);
> > > > +
> > > > + if (buf_id < 0)
> > > > + continue;
> > > > +
> > > > + bufHdr = GetBufferDescriptor(buf_id);
> > > > +
> > > > + buf_state = LockBufHdr(bufHdr);
> > > > +
> > > > + if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> > > >
> > > > I think a pre-check for RelFileNode might be better before LockBufHdr for the reasons mentioned in this function a few lines down.
> > >
> > > The equivalent check is already done by BufTableLookup(). The last line in the above is not a precheck but the final check.
> >
>
> Which check in that API are you talking about? Are you saying so because we are trying to use a hash value corresponding to rnode.node to find the block? Then I don't think it is equivalent, because there is a difference in the actual values. But even if we want to rely on that, a comment is required; though I guess we can do the check as well, because it shouldn't be a costly pre-check.

I think the only problematic case is that BufTableLookup wrongly misses buffers actually to be dropped. (And the case of too many false positives, not critical though.) If omission is the case, we cannot adopt this optimization at all. And if the false positive is the case, maybe we need to adopt more restrictive prechecking, but RelFileNodeEquals is *not* more restrictive than BufTableLookup in the first place.

What case do you think is problematic when considering BufTableLookup() as the precheck?

> > > > 4.
> > > > + /*
> > > > + * Look up the buffer in the hashtable if the block size is known to
> > > > + * be accurate and the total blocks to be invalidated is below the
> > > > + * full scan threshold. Otherwise, give up the optimization.
> > > > + */
> > > > + if (accurate && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
> > > > + { for (j = 0; j < nforks; j++) { BlockNumber curBlock;
> > > > +
> > > > + for (curBlock = firstDelBlock[j]; curBlock < nForkBlocks[j]; curBlock++) {
> > > > + uint32 bufHash; /* hash value for tag */
> > > > + BufferTag bufTag; /* identity of requested block */
> > > > + LWLock *bufPartitionLock; /* buffer partition lock for it */
> > > > + int buf_id;
> > > > +
> > > > + /* create a tag so we can lookup the buffer */
> > > > + INIT_BUFFERTAG(bufTag, rnode.node, forkNum[j], curBlock);
> > > > +
> > > > + /* determine its hash code and partition lock ID */
> > > > + bufHash = BufTableHashCode(&bufTag);
> > > > + bufPartitionLock = BufMappingPartitionLock(bufHash);
> > > > +
> > > > + /* Check that it is in the buffer pool. If not, do nothing. */
> > > > + LWLockAcquire(bufPartitionLock, LW_SHARED); buf_id =
> > > > + BufTableLookup(&bufTag, bufHash); LWLockRelease(bufPartitionLock);
> > > > +
> > > > + if (buf_id < 0)
> > > > + continue;
> > > > +
> > > > + bufHdr = GetBufferDescriptor(buf_id);
> > > > +
> > > > + buf_state = LockBufHdr(bufHdr);
> > > > +
> > > > + if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> > > > + bufHdr->tag.forkNum == forkNum[j] &&
> > > > + bufHdr->tag.blockNum >= firstDelBlock[j])
> > > > + InvalidateBuffer(bufHdr); /* releases spinlock */ else
> > > > + UnlockBufHdr(bufHdr, buf_state); } } return; }
> > > >
> > > > Can we move the code under this 'if' condition to a separate function, say FindAndDropRelFileNodeBuffers or something like that?
> > >
> > > Thinking about the TRUNCATE optimization, it sounds reasonable to have a separate function, which runs the optimized dropping unconditionally.
> >
> > Hmm, sure, although only DropRelFileNodeBuffers() would call the new function. I guess it won't be a problem.
> >
>
> That shouldn't be a problem; you can make it a static function. It is more from the code-readability perspective.

> Sure, but feel free to leave the truncate optimization patch for now; we can do that as a follow-up patch once the vacuum-optimization patch is committed. Horiguchi-San, are you fine with this approach?

Of course. I don't think we have to commit the two at once at all.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Nov 5, 2020 at 1:59 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Thu, 5 Nov 2020 11:07:21 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> > On Thu, Nov 5, 2020 at 8:26 AM k.jamison@fujitsu.com <k.jamison@fujitsu.com> wrote:
> > > > > Few comments on patches:
> > > > > ======================
> > > > > v29-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks
> > > > > ----------------------------------------------------------------------
> > > > > 1.
> > > > > -smgrnblocks(SMgrRelation reln, ForkNumber forknum)
> > > > > +smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool *accurate)
> > > > > {
> > > > > BlockNumber result;
> > > > >
> > > > > /*
> > > > > * For now, we only use cached values in recovery due to lack of a shared
> > > > > - * invalidation mechanism for changes in file size.
> > > > > + * invalidation mechanism for changes in file size. The cached values
> > > > > + * could be smaller than the actual number of existing buffers of the file.
> > > > > + * This is caused by lseek of buggy Linux kernels that might not have
> > > > > + * accounted for the recent write.
> > > > > */
> > > > > if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
> > > > > + {
> > > > > + if (accurate != NULL)
> > > > > + *accurate = true;
> > > > > +
> > > > >
> > > > > I don't understand this comment. A few emails back, I think we discussed that the cached value can't be less than the number of buffers during recovery. If that happens to be true then we have some problem. If you want to explain the 'accurate' variable then you can do the same atop the function. Would it be better to name this variable 'cached'?
> > > >
> > > > (I agree that the comment needs to be fixed.)
> > > >
> > > > FWIW I don't think 'cached' suggests the characteristics of the returned value on its interface. It was introduced to reduce fseek() calls, and after that we found that it can be regarded as the authoritative source of the file size. The "accurate" means that it is guaranteed that we don't have a buffer for the file blocks further than that number. I don't come up with a more proper word than "accurate", but I also don't think "cached" is proper here.
> > >
> >
> > Sure, but that is not the guarantee this API gives. It has to be guaranteed by the logic elsewhere, so I am not sure if it is a good idea to try to reflect the same here. The comments in the caller where we use this should explain why it is safe to use this value.
>
> Isn't it already guaranteed by the bufmgr code that we don't have buffers for nonexistent file blocks? What is needed here is, yeah, that the returned value from smgrnblocks is "reliable". If "reliable" is still not proper, I give up and agree to "cached".
>

I still feel 'cached' is a better name.

> > > I am not sure if the patch should cover this or whether it should be a separate thread altogether, since a number of functions also rely on smgrnblocks(). But I'll take it into consideration.
> > >
> > > > > v29-0003-Optimize-DropRelFileNodeBuffers-during-recovery
> > > > > ----------------------------------------------------------------------
> > > > > 2.
> > > > > + /* Check that it is in the buffer pool. If not, do nothing. */
> > > > > + LWLockAcquire(bufPartitionLock, LW_SHARED); buf_id =
> > > > > + BufTableLookup(&bufTag, bufHash); LWLockRelease(bufPartitionLock);
> > > > > +
> > > > > + if (buf_id < 0)
> > > > > + continue;
> > > > > +
> > > > > + bufHdr = GetBufferDescriptor(buf_id);
> > > > > +
> > > > > + buf_state = LockBufHdr(bufHdr);
> > > > > +
> > > > > + if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> > > > >
> > > > > I think a pre-check for RelFileNode might be better before LockBufHdr for the reasons mentioned in this function a few lines down.
> > > >
> > > > The equivalent check is already done by BufTableLookup(). The last line in the above is not a precheck but the final check.
> > >
> >
> > Which check in that API are you talking about? Are you saying so because we are trying to use a hash value corresponding to rnode.node to find the block? Then I don't think it is equivalent, because there is a difference in the actual values. But even if we want to rely on that, a comment is required; though I guess we can do the check as well, because it shouldn't be a costly pre-check.
>
> I think the only problematic case is that BufTableLookup wrongly misses buffers actually to be dropped. (And the case of too many false positives, not critical though.) If omission is the case, we cannot adopt this optimization at all. And if the false positive is the case, maybe we need to adopt more restrictive prechecking, but RelFileNodeEquals is *not* more restrictive than BufTableLookup in the first place.
>
> What case do you think is problematic when considering BufTableLookup() as the precheck?
>

I was slightly worried about false positives, but thinking about it again, I don't think we need any additional pre-check here.

--
With Regards,
Amit Kapila.
On Thu, Nov 5, 2020 at 10:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> I still feel 'cached' is a better name.

Amusingly, this thread is hitting all the hardest problems in computer science, according to the well-known aphorism...

Here's a devil's advocate position I thought about: It's OK to leave stray buffers (clean or dirty) in the buffer pool if files are truncated underneath us by gremlins, as long as your system eventually crashes before completing a checkpoint. The OID can't be recycled until after a successful checkpoint, so the stray blocks can't be confused with the blocks of another relation, and weird errors are expected on a system that is in serious trouble. It's actually much worse that we can give incorrect answers to queries when files are truncated by gremlins (in the window of time before we presumably crash because of EIO), because we're violating basic ACID principles in user-visible ways. In this thread, discussion has focused on availability (ie avoiding failures when trying to write back stray buffers to a file that has been unlinked), but really a system that can't see arbitrary committed transactions *shouldn't be available*. This argument applies whether you think SEEK_END can only give weird answers in the specific scenario I demonstrated with NFS, or whether you think it's arbitrarily b0rked and reports random numbers: we fundamentally can't tolerate that, so why are we trying to?
On Thursday, October 22, 2020 3:15 PM, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> I'm not sure about the exact steps of the test, but it can be expected if we have many small relations to truncate.
>
> Currently BUF_DROP_FULL_SCAN_THRESHOLD is set to Nbuffers / 512, which is quite arbitrary and comes from a wild guess.
>
> Perhaps we need to run benchmarks that drop one relation with several different ratios between the number of buffers to-be-dropped and Nbuffers, and preferably both on spinning rust and SSD.

Sorry to get back to you on this just now. Since we're prioritizing the vacuum patch, we also need to finalize which threshold value to use. I proceeded with testing using my latest set of patches, because Amit-san's comments on the code (the ones we addressed) don't really affect the performance. I'll post the updated patches for 0002 & 0003 after we come up with the proper boolean parameter name for smgrnblocks and the buffer full-scan threshold value.

I tested the VACUUM performance with the following thresholds (NBuffers/512, NBuffers/256, NBuffers/128) to determine which ratio performs best in terms of speed. I tested this on my machine (CPU 4v, 8GB memory, ext4) running on SSD, with a streaming replication environment configured.

shared_buffers = 100GB
autovacuum = off
full_page_writes = off
checkpoint_timeout = 30min

[Steps]
1. CREATE TABLE
2. INSERT data
3. DELETE FROM table
4. Pause WAL replay on the standby
5. VACUUM. Stop the primary.
6. Resume WAL replay and promote the standby.

With 1 relation, there were no significant changes that we can observe (in seconds):

| s_b   | Master | NBuffers/512 | NBuffers/256 | NBuffers/128 |
|-------|--------|--------------|--------------|--------------|
| 128MB | 0.106  | 0.105        | 0.105        | 0.105        |
| 100GB | 0.106  | 0.105        | 0.105        | 0.105        |

So I tested with 100 tables and got more convincing measurements:

| s_b   | Master | NBuffers/512 | NBuffers/256 | NBuffers/128 |
|-------|--------|--------------|--------------|--------------|
| 128MB | 1.006  | 1.007        | 1.006        | 0.107        |
| 1GB   | 0.706  | 0.606        | 0.606        | 0.605        |
| 20GB  | 1.907  | 0.606        | 0.606        | 0.605        |
| 100GB | 7.013  | 0.706        | 0.606        | 0.607        |

The threshold NBuffers/128 has the best performance for the default shared_buffers (128MB) at 0.107 s, and performs equally well with large shared_buffers up to 100GB. We can use NBuffers/128 for the threshold, although I don't have measurements for HDD yet.

However, I wonder if the above method suffices to determine the final threshold that we can use. If anyone has suggestions on how we can come up with the final value, for example if I need to modify some steps above, I'd appreciate it.

Regarding the parameter name: instead of accurate, we can use "cached", as originally intended in the early versions of the patch, since it is the smgr that handles smgrnblocks to get the block size from smgr_cached_nblocks. "accurate" may confuse us because the cached value may not actually be accurate.

Regards,
Kirk Jamison
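As a reading aid, the decision being benchmarked above boils down to a guard like the sketch below; the function name is an assumption, and the divisor was still an open question at this point (128 is shown only because it won in the measurements):

#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of the guard being tuned above: drop buffers block-by-block only
 * when the cached size is reliable and the total to invalidate is below
 * NBuffers divided by the chosen ratio (512, 256, or 128 in the tests).
 */
static bool
use_optimized_drop(uint64_t nblocks_to_invalidate, int nbuffers,
				   int divisor, bool cached)
{
	uint64_t	threshold = (uint64_t) nbuffers / divisor;

	return cached && nblocks_to_invalidate < threshold;
}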
On Fri, Nov 6, 2020 at 5:02 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Thu, Nov 5, 2020 at 10:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I still feel 'cached' is a better name.
>
> Amusingly, this thread is hitting all the hardest problems in computer science, according to the well-known aphorism...
>
> Here's a devil's advocate position I thought about: It's OK to leave stray buffers (clean or dirty) in the buffer pool if files are truncated underneath us by gremlins, as long as your system eventually crashes before completing a checkpoint. The OID can't be recycled until after a successful checkpoint, so the stray blocks can't be confused with the blocks of another relation, and weird errors are expected on a system that is in serious trouble. It's actually much worse that we can give incorrect answers to queries when files are truncated by gremlins (in the window of time before we presumably crash because of EIO), because we're violating basic ACID principles in user-visible ways. In this thread, discussion has focused on availability (ie avoiding failures when trying to write back stray buffers to a file that has been unlinked), but really a system that can't see arbitrary committed transactions *shouldn't be available*. This argument applies whether you think SEEK_END can only give weird answers in the specific scenario I demonstrated with NFS, or whether you think it's arbitrarily b0rked and reports random numbers: we fundamentally can't tolerate that, so why are we trying to?
>

It is not very clear to me how this argument applies to the patch under discussion, where we are relying on the cached value of blocks during recovery. I understand your point that we might skip scanning the pages and thus might not show some recently added data, but that point is not linked with what we are trying to do with this patch. AFAIU, the theory we discussed above is that there shouldn't be any stray blocks in the buffers with this patch, because even if smgrnblocks (SEEK_END) didn't give us the right answer, we shouldn't have any buffers for the blocks after the size returned by smgrnblocks during recovery. I think the problem could happen if we extend the relation by multiple blocks, which would invalidate the cached value during recovery, and then the future calls to smgrnblocks could lead to problems if it lies to us; but as far as I know we don't do that. Can you please be more specific about how this patch can lead to a problem?

--
With Regards,
Amit Kapila.
On Fri, Nov 6, 2020 at 5:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Nov 6, 2020 at 5:02 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Here's a devil's advocate position I thought about: It's OK to leave
> > stray buffers (clean or dirty) in the buffer pool if files are
> > truncated underneath us by gremlins, as long as your system eventually
> > crashes before completing a checkpoint. The OID can't be recycled
> > until after a successful checkpoint, so the stray blocks can't be
> > confused with the blocks of another relation, and weird errors are
> > expected on a system that is in serious trouble. It's actually much
> > worse that we can give incorrect answers to queries when files are
> > truncated by gremlins (in the window of time before we presumably
> > crash because of EIO), because we're violating basic ACID principles
> > in user-visible ways. In this thread, discussion has focused on
> > availability (ie avoiding failures when trying to write back stray
> > buffers to a file that has been unlinked), but really a system that
> > can't see arbitrary committed transactions *shouldn't be available*.
> > This argument applies whether you think SEEK_END can only give weird
> > answers in the specific scenario I demonstrated with NFS, or whether
> > you think it's arbitrarily b0rked and reports random numbers: we
> > fundamentally can't tolerate that, so why are we trying to?
>
> It is not very clear to me how this argument applies to the patch
> in-discussion where we are relying on the cached value of blocks
> during recovery. I understand your point that we might skip scanning
> the pages and thus might not show some recently added data but that
> point is not linked with what we are trying to do with this patch.

It's an argument for giving up the hard-to-name cache trick completely
and going back to using unmodified smgrnblocks(), both in recovery and
online. If the only mechanism for unexpected file shrinkage is
writeback failure, then your system will be panicking soon enough
anyway -- so is it really that bad if there are potentially some other
weird errors logged some time before that? Maybe those errors will
even take the system down sooner, and maybe that's appropriate? If
there are other mechanisms for random file shrinkage that don't imply
a panic in your near future, then we have bigger problems that can't
be solved by any number of bandaids, at least not without
understanding the details of this hypothetical unknown failure mode.

The main argument I can think of against the idea of using plain old
smgrnblocks() is that the current error messages on smgrwrite()
failure for stray blocks would be indistinguishable from cases where
an external actor unlinked the file. I don't mind getting an error
that prevents checkpointing -- your system is in big trouble! -- but
it'd be nice to be able to detect that *we* unlinked the file,
implying the filesystem and buffer pool are out of sync, and spit out
a special diagnostic message. I suppose if it's the checkpointer doing
the writing, it could check whether the relfilenode is on the
queued-up-for-delete-after-the-checkpoint list, and if so, it could
produce a different error message just for this edge case.
Unfortunately that's not a general solution, because any backend might
try to write a buffer out, and backends aren't synchronised with
checkpoints.

I'm not sure what the best approach is. It'd certainly be nice to be
able to drop small tables quickly online too, as a benefit of this
approach.
On Fri, Nov 6, 2020 at 11:10 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Fri, Nov 6, 2020 at 5:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > It is not very clear to me how this argument applies to the patch
> > in-discussion where we are relying on the cached value of blocks
> > during recovery. I understand your point that we might skip scanning
> > the pages and thus might not show some recently added data but that
> > point is not linked with what we are trying to do with this patch.
>
> It's an argument for giving up the hard-to-name cache trick completely
> and going back to using unmodified smgrnblocks(), both in recovery and
> online. If the only mechanism for unexpected file shrinkage is
> writeback failure, then your system will be panicking soon enough
> anyway

How else (except for writeback failure due to unexpected shrinkage)
would the system panic? Are you saying that if users don't get some
data because lseek lied to us, that is equivalent to a panic, or are
you indicating the scenario where ReadBuffer_common gives the error
"unexpected data beyond EOF ...."?

> -- so is it really that bad if there are potentially some other
> weird errors logged some time before that? Maybe those errors will
> even take the system down sooner, and maybe that's appropriate?

Yeah, it might be appropriate to panic in such situations, but
ReadBuffer_common gives an error and asks the user to update the
system.

> If
> there are other mechanisms for random file shrinkage that don't imply
> a panic in your near future, then we have bigger problems that can't
> be solved by any number of bandaids, at least not without
> understanding the details of this hypothetical unknown failure mode.

I think one of the problems is returning fewer rows, and that too
without any warning or error, so maybe that is a bigger problem; but we
seem to be okay with it as it is already a known thing, though I think
it is not documented anywhere.

> The main argument I can think of against the idea of using plain old
> smgrnblocks() is that the current error messages on smgrwrite()
> failure for stray blocks would be indistinguishable from cases where
> an external actor unlinked the file. I don't mind getting an error
> that prevents checkpointing -- your system is in big trouble! -- but
> it'd be nice to be able to detect that *we* unlinked the file,
> implying the filesystem and buffer pool are out of sync, and spit out
> a special diagnostic message. I suppose if it's the checkpointer doing
> the writing, it could check whether the relfilenode is on the
> queued-up-for-delete-after-the-checkpoint list, and if so, it could
> produce a different error message just for this edge case.
> Unfortunately that's not a general solution, because any backend might
> try to write a buffer out, and backends aren't synchronised with
> checkpoints.

Yeah, but I am not sure we can consider manual (external actor)
tinkering with the files the same as something that happened because
the database server relied on the wrong information.

> I'm not sure what the best approach is. It'd certainly be nice to be
> able to drop small tables quickly online too, as a benefit of this
> approach.

Right, that is why I was thinking of doing it only for recovery, where
it is safe from the database server's perspective. OTOH, we could
broadly accept that any time the filesystem lies to us the behavior is
unpredictable: the system might return fewer rows than expected, or it
might panic.
I think there is an argument that it might be better to error out (even
with a panic) rather than silently returning fewer rows, but
unfortunately detecting this in each and every case doesn't seem
feasible. One vague idea could be to develop a pg_test_seek tool that
can detect such problems, but I am not sure whether we can rely on such
a tool to always give us the right answer. Were you able to reproduce
the lseek problem consistently on the system where you tried?

--
With Regards,
Amit Kapila.
> From: k.jamison@fujitsu.com <k.jamison@fujitsu.com> > On Thursday, October 22, 2020 3:15 PM, Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > I'm not sure about the exact steps of the test, but it can be expected > > if we have many small relations to truncate. > > > > Currently BUF_DROP_FULL_SCAN_THRESHOLD is set to Nbuffers / 512, > which > > is quite arbitrary that comes from a wild guess. > > > > Perhaps we need to run benchmarks that drops one relation with several > > different ratios between the number of buffers to-be-dropped and > > Nbuffers, and preferably both on spinning rust and SSD. > > Sorry to get back to you on this just now. > Since we're prioritizing the vacuum patch, we also need to finalize which > threshold value to use. > I proceeded testing with my latest set of patches because Amit-san's > comments on the code, the ones we addressed, don't really affect the > performance. I'll post the updated patches for 0002 & 0003 after we come up > with the proper boolean parameter name for smgrnblocks and the buffer full > scan threshold value. > > Test the VACUUM performance with the following thresholds: > NBuffers/512, NBuffers/256, NBuffers/128, and determine which of the > ratio has the best performance in terms of speed. > > I tested this on my machine (CPU 4v, 8GB memory, ext4) running on SSD. > Configure streaming replication environment. > shared_buffers = 100GB > autovacuum = off > full_page_writes = off > checkpoint_timeout = 30min > > [Steps] > 1. Create TABLE > 2. INSERT data > 3. DELETE from TABLE > 4. Pause WAL replay on standby > 5. VACUUM. Stop the primary. > 6. Resume WAL replay and promote standby. > > With 1 relation, there were no significant changes that we can observe: > (In seconds) > | s_b | Master | NBuffers/512 | NBuffers/256 | NBuffers/128 | > |-------|--------|--------------|--------------|--------------| > | 128MB | 0.106 | 0.105 | 0.105 | 0.105 | > | 100GB | 0.106 | 0.105 | 0.105 | 0.105 | > > So I tested with 100 tables and got more convincing measurements: > > | s_b | Master | NBuffers/512 | NBuffers/256 | NBuffers/128 | > |-------|--------|--------------|--------------|--------------| > | 128MB | 1.006 | 1.007 | 1.006 | 0.107 | > | 1GB | 0.706 | 0.606 | 0.606 | 0.605 | > | 20GB | 1.907 | 0.606 | 0.606 | 0.605 | > | 100GB | 7.013 | 0.706 | 0.606 | 0.607 | > > The threshold NBuffers/128 has the best performance for default > shared_buffers (128MB) with 0.107 s, and equally performing with large > shared_buffers up to 100GB. > > We can use NBuffers/128 for the threshold, although I don't have a > measurement for HDD yet. > However, I wonder if the above method would suffice to determine the final > threshold that we can use. If anyone has suggestions on how we can come > up with the final value, like if I need to modify some steps above, I'd > appreciate it. > > Regarding the parameter name. Instead of accurate, we can use "cached" as > originally intended from the early versions of the patch since it is the smgr > that handles smgrnblocks to get the the block size of smgr_cached_nblocks.. > "accurate" may confuse us because the cached value may not be actually > accurate.. Hi, So I proceeded to update the patches using the "cached" parameter and updated the corresponding comments to it in 0002. I've addressed the suggestions and comments of Amit-san on 0003: 1. For readability, I moved the code block to a new static function FindAndDropRelFileNodeBuffers() 2. Initialize the bool cached with false. 3. 
It's also decided that we don't need the extra pre-checking of
RelFileNode when locking the bufhdr in FindAndDropRelFileNodeBuffers.

I repeated the recovery performance test for vacuum. (I made a mistake
previously in the NBuffers/128 column.) The 3 kinds of thresholds are
almost equally performant. I chose NBuffers/256 for this patch.

| s_b   | Master | NBuffers/512 | NBuffers/256 | NBuffers/128 |
|-------|--------|--------------|--------------|--------------|
| 128MB | 1.006  | 1.007        | 1.007        | 1.007        |
| 1GB   | 0.706  | 0.606        | 0.606        | 0.606        |
| 20GB  | 1.907  | 0.606        | 0.606        | 0.606        |
| 100GB | 7.013  | 0.706        | 0.606        | 0.606        |

Although we said that we'll prioritize the vacuum optimization first,
I've also updated the 0004 patch (truncate optimization), which
addresses the previous concern of slower truncates due to redundant
lookups of already-dropped buffers. In the new patch, we initially drop
relation buffers using the optimized DropRelFileNodeBuffers() if the
buffers do not exceed the full-scan threshold, then drop any remaining
buffers using a full scan.

Regards,
Kirk Jamison
On Tue, Nov 10, 2020 at 8:19 AM k.jamison@fujitsu.com <k.jamison@fujitsu.com> wrote: > > > From: k.jamison@fujitsu.com <k.jamison@fujitsu.com> > > On Thursday, October 22, 2020 3:15 PM, Kyotaro Horiguchi > > <horikyota.ntt@gmail.com> wrote: > > > I'm not sure about the exact steps of the test, but it can be expected > > > if we have many small relations to truncate. > > > > > > Currently BUF_DROP_FULL_SCAN_THRESHOLD is set to Nbuffers / 512, > > which > > > is quite arbitrary that comes from a wild guess. > > > > > > Perhaps we need to run benchmarks that drops one relation with several > > > different ratios between the number of buffers to-be-dropped and > > > Nbuffers, and preferably both on spinning rust and SSD. > > > > Sorry to get back to you on this just now. > > Since we're prioritizing the vacuum patch, we also need to finalize which > > threshold value to use. > > I proceeded testing with my latest set of patches because Amit-san's > > comments on the code, the ones we addressed, don't really affect the > > performance. I'll post the updated patches for 0002 & 0003 after we come up > > with the proper boolean parameter name for smgrnblocks and the buffer full > > scan threshold value. > > > > Test the VACUUM performance with the following thresholds: > > NBuffers/512, NBuffers/256, NBuffers/128, and determine which of the > > ratio has the best performance in terms of speed. > > > > I tested this on my machine (CPU 4v, 8GB memory, ext4) running on SSD. > > Configure streaming replication environment. > > shared_buffers = 100GB > > autovacuum = off > > full_page_writes = off > > checkpoint_timeout = 30min > > > > [Steps] > > 1. Create TABLE > > 2. INSERT data > > 3. DELETE from TABLE > > 4. Pause WAL replay on standby > > 5. VACUUM. Stop the primary. > > 6. Resume WAL replay and promote standby. > > > > With 1 relation, there were no significant changes that we can observe: > > (In seconds) > > | s_b | Master | NBuffers/512 | NBuffers/256 | NBuffers/128 | > > |-------|--------|--------------|--------------|--------------| > > | 128MB | 0.106 | 0.105 | 0.105 | 0.105 | > > | 100GB | 0.106 | 0.105 | 0.105 | 0.105 | > > > > So I tested with 100 tables and got more convincing measurements: > > > > | s_b | Master | NBuffers/512 | NBuffers/256 | NBuffers/128 | > > |-------|--------|--------------|--------------|--------------| > > | 128MB | 1.006 | 1.007 | 1.006 | 0.107 | > > | 1GB | 0.706 | 0.606 | 0.606 | 0.605 | > > | 20GB | 1.907 | 0.606 | 0.606 | 0.605 | > > | 100GB | 7.013 | 0.706 | 0.606 | 0.607 | > > > > The threshold NBuffers/128 has the best performance for default > > shared_buffers (128MB) with 0.107 s, and equally performing with large > > shared_buffers up to 100GB. > > > > We can use NBuffers/128 for the threshold, although I don't have a > > measurement for HDD yet. > > However, I wonder if the above method would suffice to determine the final > > threshold that we can use. If anyone has suggestions on how we can come > > up with the final value, like if I need to modify some steps above, I'd > > appreciate it. > > > > Regarding the parameter name. Instead of accurate, we can use "cached" as > > originally intended from the early versions of the patch since it is the smgr > > that handles smgrnblocks to get the the block size of smgr_cached_nblocks.. > > "accurate" may confuse us because the cached value may not be actually > > accurate.. > > Hi, > > So I proceeded to update the patches using the "cached" parameter and updated > the corresponding comments to it in 0002. 
>
> I've addressed the suggestions and comments of Amit-san on 0003:
> 1. For readability, I moved the code block to a new static function FindAndDropRelFileNodeBuffers()
> 2. Initialize the bool cached with false.
> 3. It's also decided that we don't need the extra pre-checking of RelFileNode
> when locking the bufhdr in FindAndDropRelFileNodeBuffers.
>
> I repeated the recovery performance test for vacuum. (I made a mistake previously in NBuffers/128.)
> The 3 kinds of thresholds are almost equally performant. I chose NBuffers/256 for this patch.
>
> | s_b   | Master | NBuffers/512 | NBuffers/256 | NBuffers/128 |
> |-------|--------|--------------|--------------|--------------|
> | 128MB | 1.006  | 1.007        | 1.007        | 1.007        |
> | 1GB   | 0.706  | 0.606        | 0.606        | 0.606        |
> | 20GB  | 1.907  | 0.606        | 0.606        | 0.606        |
> | 100GB | 7.013  | 0.706        | 0.606        | 0.606        |
>

I think this data is not very clear. What is the unit of time? What is
the size of the relation used for the test? Did the test use the
optimized path in all cases? If there is no performance gain at 128MB,
can we also consider a shared_buffers size of 256MB for the threshold?

--
With Regards,
Amit Kapila.
At Tue, 10 Nov 2020 08:33:26 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> On Tue, Nov 10, 2020 at 8:19 AM k.jamison@fujitsu.com
> <k.jamison@fujitsu.com> wrote:
> >
> > I repeated the recovery performance test for vacuum. (I made a mistake previously in NBuffers/128.)
> > The 3 kinds of thresholds are almost equally performant. I chose NBuffers/256 for this patch.
> >
> > | s_b   | Master | NBuffers/512 | NBuffers/256 | NBuffers/128 |
> > |-------|--------|--------------|--------------|--------------|
> > | 128MB | 1.006  | 1.007        | 1.007        | 1.007        |
> > | 1GB   | 0.706  | 0.606        | 0.606        | 0.606        |
> > | 20GB  | 1.907  | 0.606        | 0.606        | 0.606        |
> > | 100GB | 7.013  | 0.706        | 0.606        | 0.606        |
> >
>
> I think this data is not very clear. What is the unit of time? What is
> the size of the relation used for the test? Did the test use an
> optimized path for all cases? If at 128MB, there is no performance
> gain, can we consider the size of shared buffers as 256MB as well for
> the threshold?

In the previous testing, it was shown as:

Recovery Time (in seconds)
| s_b   | master | patched | %reg   |
|-------|--------|---------|--------|
| 128MB | 3.043  | 2.977   | -2.22% |
| 1GB   | 3.417  | 3.41    | -0.21% |
| 20GB  | 20.597 | 2.409   | -755%  |
| 100GB | 66.862 | 2.409   | -2676% |

So... The numbers seem to be in seconds, but in the new results master
is about 10 times faster than here, for uncertain reasons. It seems the
larger part of the old result reflects something other than the
difference made by this patch.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Sat, Nov 7, 2020 at 12:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think one of the problems is returning fewer rows and that too
> without any warning or error, so maybe that is a bigger problem but we
> seem to be okay with it as that is already a known thing though I
> think that is not documented anywhere.

I'm not OK with it, and I'm not sure it's widely known or understood,
though I think we've made some progress in this thread. Perhaps, as a
separate project, we need to solve several related problems with a
shmem table of relation sizes for not-yet-synced files, so that
smgrnblocks() is fast and always sees all preceding smgrextend() calls.
If we're going to need something like that anyway, and if we can come
up with a simple way to detect and report this type of failure in the
meantime, maybe this fast DROP project should just go ahead and use the
existing smgrnblocks() function without the weird caching bandaid that
only works in recovery?

> > The main argument I can think of against the idea of using plain old
> > smgrnblocks() is that the current error messages on smgrwrite()
> > failure for stray blocks would be indistinguishable from cases where
> > an external actor unlinked the file. I don't mind getting an error
> > that prevents checkpointing -- your system is in big trouble! -- but
> > it'd be nice to be able to detect that *we* unlinked the file,
> > implying the filesystem and buffer pool are out of sync, and spit out
> > a special diagnostic message. I suppose if it's the checkpointer doing
> > the writing, it could check whether the relfilenode is on the
> > queued-up-for-delete-after-the-checkpoint list, and if so, it could
> > produce a different error message just for this edge case.
> > Unfortunately that's not a general solution, because any backend might
> > try to write a buffer out, and backends aren't synchronised with
> > checkpoints.
>
> Yeah, but I am not sure if we can consider manual (external actor)
> tinkering with the files the same as something that happened due to
> the database server relying on the wrong information.

Here's a rough idea I thought of to detect this case; I'm not sure if
it has holes. When unlinking a relation, currently we truncate
segment 0 and unlink all the rest of the segments, and tell the
checkpointer to unlink segment 0 after the next checkpoint. What if we
also renamed segment 0 to "$X.dropped" (to be unlinked by the
checkpointer), and taught GetNewRelFileNode() to also skip anything for
which "$X.dropped" exists? Then mdwrite() could use
_mdfd_getseg(EXTENSION_RETURN_NULL), and if it gets NULL (= no file),
it checks whether "$X.dropped" exists; if so, it knows that it is
trying to write a stray block from a dropped relation in the buffer
pool. Then we panic, or warn but drop the write. The point of the
renaming is that (1) mdwrite() for segment 0 will detect the missing
file (not just higher segments), and (2) every backend can see that a
relation has been recently dropped, while also interlocking with the
checkpointer through buffer locks.

> One vague idea could be to develop pg_test_seek which can detect such
> problems but not sure if we can rely on such a tool to always give us
> the right answer. Were you able to consistently reproduce the lseek
> problem on the system where you have tried?

Yeah, I can reproduce it reliably, but it requires quite a bit of
set-up as root, so it might be tricky to package up in an easy-to-run
form.
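[A rough sketch of how that marker-file check could look in mdwrite() --
entirely hypothetical: the helper name, marker suffix, and message texts
are made up here just to make the idea concrete, and this is not from
any posted patch:]

    /*
     * Hypothetical check for the "$X.dropped" marker described above.
     * If the segment file is gone but the marker exists, *we* unlinked
     * the relation, i.e. the buffer pool holds a stray block for a
     * dropped relation.
     */
    static bool
    relation_dropped_by_us(SMgrRelation reln, ForkNumber forknum)
    {
        char       *path = relpath(reln->smgr_rnode, forknum);
        char        marker[MAXPGPATH];
        bool        found;

        snprintf(marker, sizeof(marker), "%s.dropped", path);
        found = (access(marker, F_OK) == 0);
        pfree(path);
        return found;
    }

    /* ... in mdwrite(), after _mdfd_getseg(..., EXTENSION_RETURN_NULL): */
    if (v == NULL)
    {
        if (relation_dropped_by_us(reln, forknum))
            elog(PANIC, "stray write to dropped relation");  /* or WARNING + skip */
        else
            elog(ERROR, "relation file is unexpectedly missing");
    }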
It might be quite nice to prepare an easy-to-use "gallery of weird
buffered I/O effects" project, including some of the
local-filesystem-with-fault-injection stuff that Craig Ringer and
others were testing a couple of years ago, but maybe not in the pg
repo.
On Tuesday, November 10, 2020 12:27 PM, Horiguchi-san wrote:
> At Tue, 10 Nov 2020 08:33:26 +0530, Amit Kapila <amit.kapila16@gmail.com>
> wrote in
> > On Tue, Nov 10, 2020 at 8:19 AM k.jamison@fujitsu.com
> > <k.jamison@fujitsu.com> wrote:
> > >
> > > I repeated the recovery performance test for vacuum. (I made a
> > > mistake previously in NBuffers/128.) The 3 kinds of thresholds are
> > > almost equally performant. I chose NBuffers/256 for this patch.
> > >
> > > | s_b   | Master | NBuffers/512 | NBuffers/256 | NBuffers/128 |
> > > |-------|--------|--------------|--------------|--------------|
> > > | 128MB | 1.006  | 1.007        | 1.007        | 1.007        |
> > > | 1GB   | 0.706  | 0.606        | 0.606        | 0.606        |
> > > | 20GB  | 1.907  | 0.606        | 0.606        | 0.606        |
> > > | 100GB | 7.013  | 0.706        | 0.606        | 0.606        |
> > >
> >
> > I think this data is not very clear. What is the unit of time? What is
> > the size of the relation used for the test? Did the test use an
> > optimized path for all cases? If at 128MB, there is no performance
> > gain, can we consider the size of shared buffers as 256MB as well for
> > the threshold?
>
> In the previous testing, it was shown as:
>
> Recovery Time (in seconds)
> | s_b   | master | patched | %reg   |
> |-------|--------|---------|--------|
> | 128MB | 3.043  | 2.977   | -2.22% |
> | 1GB   | 3.417  | 3.41    | -0.21% |
> | 20GB  | 20.597 | 2.409   | -755%  |
> | 100GB | 66.862 | 2.409   | -2676% |
>
> So... The numbers seem to be in seconds, but in the new results master
> is about 10 times faster than here, for uncertain reasons. It seems the
> larger part of the old result reflects something other than the
> difference made by this patch.

The unit is in seconds. The results Horiguchi-san mentioned were from
the old test case, in which I vacuumed a database with 1000 relations
whose rows had been deleted. In my last results I used a new test case,
which is why the numbers are smaller: VACUUM of 1 parent table (350 MB)
and 100 child partition tables (6 MB each) in separate transactions
after deleting their rows. After vacuum, the parent table became 16 kB
and each child table 2224 kB.

I added a test for 256MB shared_buffers, and its performance is also
almost the same. We gain performance benefits for the larger
shared_buffers.

| s_b   | Master | NBuffers/512 | NBuffers/256 | NBuffers/128 |
|-------|--------|--------------|--------------|--------------|
| 128MB | 1.006  | 1.007        | 1.007        | 1.007        |
| 256MB | 1.006  | 1.006        | 0.906        | 0.906        |
| 1GB   | 0.706  | 0.606        | 0.606        | 0.606        |
| 20GB  | 1.907  | 0.606        | 0.606        | 0.606        |
| 100GB | 7.013  | 0.706        | 0.606        | 0.606        |

Regards,
Kirk Jamison
On Tue, Nov 10, 2020 at 10:00 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Sat, Nov 7, 2020 at 12:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I think one of the problems is returning fewer rows and that too
> > without any warning or error, so maybe that is a bigger problem but we
> > seem to be okay with it as that is already a known thing though I
> > think that is not documented anywhere.
>
> I'm not OK with it, and I'm not sure it's widely known or understood,

Yeah, it is quite possible; maybe because we don't see many field
reports, nobody thought of doing anything about it.

> though I think we've made some progress in this thread. Perhaps, as a
> separate project, we need to solve several related problems with a
> shmem table of relation sizes for not-yet-synced files, so that
> smgrnblocks() is fast and always sees all preceding smgrextend()
> calls. If we're going to need something like that anyway, and if we
> can come up with a simple way to detect and report this type of
> failure in the meantime, maybe this fast DROP project should just go
> ahead and use the existing smgrnblocks() function without the weird
> caching bandaid that only works in recovery?

I am not sure it would be easy to detect all such failures, and we
might end up opening another can of worms, but if there is some simpler
way then sure, we can consider it. OTOH, till we have a shared cache of
relation sizes (which I think is good for multiple things), relying on
the cache during recovery seems the safe way to proceed. And it is not
that we can't change this once we have a shared relation size solution.

> > > The main argument I can think of against the idea of using plain old
> > > smgrnblocks() is that the current error messages on smgrwrite()
> > > failure for stray blocks would be indistinguishable from cases where
> > > an external actor unlinked the file. I don't mind getting an error
> > > that prevents checkpointing -- your system is in big trouble! -- but
> > > it'd be nice to be able to detect that *we* unlinked the file,
> > > implying the filesystem and buffer pool are out of sync, and spit out
> > > a special diagnostic message. I suppose if it's the checkpointer doing
> > > the writing, it could check whether the relfilenode is on the
> > > queued-up-for-delete-after-the-checkpoint list, and if so, it could
> > > produce a different error message just for this edge case.
> > > Unfortunately that's not a general solution, because any backend might
> > > try to write a buffer out, and backends aren't synchronised with
> > > checkpoints.
> >
> > Yeah, but I am not sure if we can consider manual (external actor)
> > tinkering with the files the same as something that happened due to
> > the database server relying on the wrong information.
>
> Here's a rough idea I thought of to detect this case; I'm not sure if
> it has holes. When unlinking a relation, currently we truncate
> segment 0 and unlink all the rest of the segments, and tell the
> checkpointer to unlink segment 0 after the next checkpoint.

Do we always truncate all the blocks? What if vacuum has cleaned the
last N (say 100) blocks -- how do we handle that?

--
With Regards,
Amit Kapila.
On Tue, Nov 10, 2020 at 6:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Do we always truncate all the blocks? What if vacuum has cleaned the
> last N (say 100) blocks -- how do we handle that?

Oh, hmm. Yeah, that idea only works for DROP, not for truncating the
last N blocks.
From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com>
> So I proceeded to update the patches using the "cached" parameter and
> updated the corresponding comments to it in 0002.

OK, I'm in favor of the name "cached" now, although I first agreed with
Horiguchi-san that it's better to use a name that represents the nature
(accurate) of the information rather than the implementation (cached).
On second thought, since smgr is a component that manages relation files
on storage (the file system), lseek(SEEK_END) is the accurate value for
smgr. The cached value holds a possibly stale size up to which the
relation has extended.

The patch looks almost good except for the minor ones:

(1)
+extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum,
+                               bool *accurate);

It's still "accurate" here.

(2)
+ * the buffer pool is sequentially scanned. Since buffers must not be
+ * left behind, the latter way is executed unless the sizes of all the
+ * involved forks are already cached. See smgrnblocks() for more details.
+ * This is only called in recovery when the block count of any fork is
+ * cached and the total number of to-be-invalidated blocks per relation

count of any fork is
-> counts of all forks are

(3)
In 0004, I thought you would add up the invalidated block counts of all
relations to determine whether the optimization is done, as
Horiguchi-san suggested. But I find the current patch okay too.

Regards
Takayuki Tsunakawa
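[For reference, after the rename requested in (1), the declaration would
presumably read as follows -- assumed final form, following the "cached"
naming:]

    extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum,
                                   bool *cached);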
On Tuesday, November 10, 2020 3:10 PM, Tsunakawa-san wrote:
> From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com>
> > So I proceeded to update the patches using the "cached" parameter and
> > updated the corresponding comments to it in 0002.
>
> OK, I'm in favor of the name "cached" now, although I first agreed with
> Horiguchi-san that it's better to use a name that represents the nature
> (accurate) of the information rather than the implementation (cached).
> On second thought, since smgr is a component that manages relation files
> on storage (the file system), lseek(SEEK_END) is the accurate value for
> smgr. The cached value holds a possibly stale size up to which the
> relation has extended.
>
> The patch looks almost good except for the minor ones:

Thank you for the review!

> (1)
> +extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum,
> +                               bool *accurate);
>
> It's still "accurate" here.

Already fixed in 0002.

> (2)
> + * This is only called in recovery when the block count of any fork is
> + * cached and the total number of to-be-invalidated blocks per relation
>
> count of any fork is
> -> counts of all forks are

Fixed in 0003.

> (3)
> In 0004, I thought you would add up the invalidated block counts of all
> relations to determine whether the optimization is done, as
> Horiguchi-san suggested. But I find the current patch okay too.

Yeah, I found my approach easier to implement. The new change in 0004 is
that when entering the optimized path we now call
FindAndDropRelFileNodeBuffers() instead of DropRelFileNodeBuffers().

I have attached all the updated patches. I'd appreciate your feedback.

Regards,
Kirk Jamison
The patch looks OK. I think, as Thomas-san suggested, we could remove
the modification to smgrnblocks() and not care whether the size is
cached or not. But I think the current patch is good too, so I'd like
to leave it up to a committer to decide which to choose.

I measured performance from a different angle -- the time
DropRelFileNodeBuffers() and DropRelFileNodeAllBuffers() took. That
reveals the direct improvement and any degradation.

I used 1,000 tables, each of which is 1 MB. I used shared_buffers =
128 MB for the case where the traditional full buffer scan is done, and
shared_buffers = 100 GB for the case where the optimized path takes
effect.

The results are mostly good, as follows:

A. UNPATCHED

128 MB shared_buffers
1. VACUUM = 0.04 seconds
2. TRUNCATE = 0.04 seconds

100 GB shared_buffers
3. VACUUM = 69.4 seconds
4. TRUNCATE = 69.1 seconds

B. PATCHED

128 MB shared_buffers (full scan)
5. VACUUM = 0.04 seconds
6. TRUNCATE = 0.07 seconds

100 GB shared_buffers (optimized path)
7. VACUUM = 0.02 seconds
8. TRUNCATE = 0.08 seconds

So, I'd like to mark this as ready for committer.

Regards
Takayuki Tsunakawa
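[As a note on methodology, timings like those above could be captured
with instrumentation along these lines -- hypothetical code; the call
site and the exact argument list are illustrative only and depend on the
patched DropRelFileNodeBuffers() signature:]

    #include "portability/instr_time.h"

    /* Hypothetical timing wrapper around the buffer-drop call. */
    instr_time  start,
                duration;

    INSTR_TIME_SET_CURRENT(start);
    DropRelFileNodeBuffers(rel->rd_smgr, forkNum, nforks, firstDelBlock);
    INSTR_TIME_SET_CURRENT(duration);
    INSTR_TIME_SUBTRACT(duration, start);
    elog(LOG, "DropRelFileNodeBuffers took %0.3f ms",
         INSTR_TIME_GET_MILLISEC(duration));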
On Thursday, November 12, 2020 1:14 PM, Tsunakawa-san wrote:
> The patch looks OK. I think, as Thomas-san suggested, we could remove
> the modification to smgrnblocks() and not care whether the size is
> cached or not. But I think the current patch is good too, so I'd like
> to leave it up to a committer to decide which to choose.
>
> I measured performance from a different angle -- the time
> DropRelFileNodeBuffers() and DropRelFileNodeAllBuffers() took. That
> reveals the direct improvement and any degradation.
>
> I used 1,000 tables, each of which is 1 MB. I used shared_buffers =
> 128 MB for the case where the traditional full buffer scan is done, and
> shared_buffers = 100 GB for the case where the optimized path takes
> effect.
>
> The results are mostly good, as follows:
>
> A. UNPATCHED
>
> 128 MB shared_buffers
> 1. VACUUM = 0.04 seconds
> 2. TRUNCATE = 0.04 seconds
>
> 100 GB shared_buffers
> 3. VACUUM = 69.4 seconds
> 4. TRUNCATE = 69.1 seconds
>
> B. PATCHED
>
> 128 MB shared_buffers (full scan)
> 5. VACUUM = 0.04 seconds
> 6. TRUNCATE = 0.07 seconds
>
> 100 GB shared_buffers (optimized path)
> 7. VACUUM = 0.02 seconds
> 8. TRUNCATE = 0.08 seconds
>
> So, I'd like to mark this as ready for committer.

I forgot to reply. Thank you very much, Tsunakawa-san, for testing, and
to everyone who has provided their reviews and insights as well.

Now, about smgrnblocks(): Thomas Munro is currently also working on
implementing a shared SmgrRelation [1] to store sizes. However, since
that is still under development and the discussion is still ongoing, I
hope we can first commit this set of patches, as they are already in
committable form. I think it's alright to accept the early improvements
implemented in this thread into the source code.

[1] https://www.postgresql.org/message-id/CA%2BhUKG%2B7Ok26MHiFWVEiAy2UMgHkrCieycQ1eFdA%3Dt2JTfUgwA%40mail.gmail.com

Regards,
Kirk Jamison
On Wed, Nov 18, 2020 at 2:34 PM k.jamison@fujitsu.com
<k.jamison@fujitsu.com> wrote:
>
> On Thursday, November 12, 2020 1:14 PM, Tsunakawa-san wrote:
> I forgot to reply. Thank you very much, Tsunakawa-san, for testing, and
> to everyone who has provided their reviews and insights as well.
>
> Now, about smgrnblocks(): Thomas Munro is currently also working on
> implementing a shared SmgrRelation [1] to store sizes. However, since
> that is still under development and the discussion is still ongoing, I
> hope we can first commit this set of patches, as they are already in
> committable form. I think it's alright to accept the early improvements
> implemented in this thread into the source code.
>

Yeah, that won't be a bad idea, especially because the patch being
discussed in the thread you referred to is still in an exploratory
phase. I haven't tested or done a detailed review, but I feel there
shouldn't be many problems if we agree on the approach.

Thomas/others, do you have objections to proceeding here? It shouldn't
be a big problem to change the code in this area even if we get the
shared relation size stuff in.

--
With Regards,
Amit Kapila.
Hi,

On 2020-11-18 17:34:31 +0530, Amit Kapila wrote:
> Yeah, that won't be a bad idea, especially because the patch being
> discussed in the thread you referred to is still in an exploratory
> phase. I haven't tested or done a detailed review, but I feel there
> shouldn't be many problems if we agree on the approach.
>
> Thomas/others, do you have objections to proceeding here? It shouldn't
> be a big problem to change the code in this area even if we get the
> shared relation size stuff in.

I'm doubtful the patches as-is are a good idea / address the correctness
concerns to a sufficient degree.

One important part of that is that the patch includes pretty much zero
explanation of why what it is doing is safe. Something having been
discussed deep in this thread won't help us in a few months, not to
speak of a few years.

The commit message says:
> While recovery, we can get a reliable cached value of nblocks for
> supplied relation's fork, and it's safe because there are no other
> processes but the startup process that changes the relation size
> during recovery.

and the code applies the optimized scan only when cached:

+	/*
+	 * Look up the buffers in the hashtable and drop them if the block size
+	 * is already cached and the total blocks to be invalidated is below the
+	 * full scan threshold. Otherwise, give up the optimization.
+	 */
+	if (cached && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)

This seems quite narrow to me. There are plenty of cases where there's
no cached relation size in the startup process, restricting the
availability of this optimization as written. Where do we even use
DropRelFileNodeBuffers() in recovery? The most common path is
DropRelationFiles()->smgrdounlinkall()->DropRelFileNodesAllBuffers(),
which 3/4 doesn't address and 4/4 doesn't mention.

4/4 seems to address DropRelationFiles(), but only talks about TRUNCATE?

I'm also worried about the cases where this could leave buffers behind
in the buffer pool, without a crosscheck like Thomas' patch would allow
us to add. Obviously other processes can dirty buffers in hot_standby,
so any leftover buffer could have bad consequences.

I also don't get why 4/4 would be a good idea on its own. It uses
BUF_DROP_FULL_SCAN_THRESHOLD to guard FindAndDropRelFileNodeBuffers()
on a per-relation basis. But since DropRelFileNodesAllBuffers() can be
used for many relations at once, this could end up doing
BUF_DROP_FULL_SCAN_THRESHOLD - 1 lookups a lot of times, once for each
of nnodes relations?

Also, how is 4/4 safe -- this is outside of recovery too?

Smaller comment:

+static void
+FindAndDropRelFileNodeBuffers(RelFileNode rnode, ForkNumber *forkNum, int nforks,
+							  BlockNumber *nForkBlocks, BlockNumber *firstDelBlock)
...
+		/* Check that it is in the buffer pool. If not, do nothing. */
+		LWLockAcquire(bufPartitionLock, LW_SHARED);
+		buf_id = BufTableLookup(&bufTag, bufHash);
...
+		bufHdr = GetBufferDescriptor(buf_id);
+
+		buf_state = LockBufHdr(bufHdr);
+
+		if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
+			bufHdr->tag.forkNum == forkNum[i] &&
+			bufHdr->tag.blockNum >= firstDelBlock[i])
+			InvalidateBuffer(bufHdr);	/* releases spinlock */
+		else
+			UnlockBufHdr(bufHdr, buf_state);

I'm a bit confused about the check here. We hold a buffer partition
lock, and have done a lookup in the mapping table. Why are we then
rechecking the relfilenode/fork/blocknum? And why are we doing so
holding the buffer header lock, which is essentially a spinlock, so
should only ever be held for very short portions?
This looks like it's copying logic from DropRelFileNodeBuffers() etc, but there the situation is different: We haven't done a buffer mapping lookup, and we don't hold a partition lock! Greetings, Andres Freund
On Wed, Nov 18, 2020 at 11:43 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-11-18 17:34:31 +0530, Amit Kapila wrote:
> > Yeah, that won't be a bad idea, especially because the patch being
> > discussed in the thread you referred to is still in an exploratory
> > phase. I haven't tested or done a detailed review, but I feel there
> > shouldn't be many problems if we agree on the approach.
> >
> > Thomas/others, do you have objections to proceeding here? It shouldn't
> > be a big problem to change the code in this area even if we get the
> > shared relation size stuff in.
>
> I'm doubtful the patches as-is are a good idea / address the correctness
> concerns to a sufficient degree.
>
> One important part of that is that the patch includes pretty much zero
> explanation of why what it is doing is safe. Something having been
> discussed deep in this thread won't help us in a few months, not to
> speak of a few years.
>
> The commit message says:
> > While recovery, we can get a reliable cached value of nblocks for
> > supplied relation's fork, and it's safe because there are no other
> > processes but the startup process that changes the relation size
> > during recovery.
>
> and the code applies the optimized scan only when cached:
> +	/*
> +	 * Look up the buffers in the hashtable and drop them if the block size
> +	 * is already cached and the total blocks to be invalidated is below the
> +	 * full scan threshold. Otherwise, give up the optimization.
> +	 */
> +	if (cached && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
>
> This seems quite narrow to me. There are plenty of cases where there's
> no cached relation size in the startup process, restricting the
> availability of this optimization as written. Where do we even use
> DropRelFileNodeBuffers() in recovery?

This will be used in the recovery of the truncate done by vacuum (via
replay of XLOG_SMGR_TRUNCATE -> smgrtruncate ->
DropRelFileNodeBuffers). And Kirk-san has done some testing [1][2] to
show the performance benefits of the same.

> The most common path is
> DropRelationFiles()->smgrdounlinkall()->DropRelFileNodesAllBuffers(),
> which 3/4 doesn't address and 4/4 doesn't mention.
>
> 4/4 seems to address DropRelationFiles(), but only talks about TRUNCATE?
>
> I'm also worried about the cases where this could leave buffers behind
> in the buffer pool, without a crosscheck like Thomas' patch would allow
> us to add. Obviously other processes can dirty buffers in hot_standby,
> so any leftover buffer could have bad consequences.

The problem can only arise if other processes extend the relation. The
idea was that during recovery the relation is extended by only one
process, which keeps the cache valid. Kirk seems to have done testing
to cross-verify it by using his first patch
(Prevent-invalidating-blocks-in-smgrextend-during). Which other
crosscheck are you referring to here?

I agree that we can do a better job by expanding the comments to
clearly state why it is safe.

[1] - https://www.postgresql.org/message-id/OSBPR01MB23413F14ED6B2D0D007698F4EFED0%40OSBPR01MB2341.jpnprd01.prod.outlook.com
[2] - https://www.postgresql.org/message-id/OSBPR01MB234176B1829AECFE9FDDFCC2EFE90%40OSBPR01MB2341.jpnprd01.prod.outlook.com

--
With Regards,
Amit Kapila.
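[For reference, the replay path just mentioned looks roughly like this
-- a condensed sketch of smgr_redo() in storage.c, with flag handling
and the non-main forks elided:]

    /*
     * Condensed sketch of XLOG_SMGR_TRUNCATE replay, which is where
     * DropRelFileNodeBuffers() runs during recovery (via smgrtruncate()).
     */
    void
    smgr_redo(XLogReaderState *record)
    {
        uint8       info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;

        if (info == XLOG_SMGR_TRUNCATE)
        {
            xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
            SMgrRelation reln = smgropen(xlrec->rnode, InvalidBackendId);
            ForkNumber  forks[1] = {MAIN_FORKNUM};
            BlockNumber blocks[1] = {xlrec->blkno};

            /* smgrtruncate() first drops buffers, then truncates the file. */
            smgrtruncate(reln, forks, 1, blocks);
        }
    }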
From: Andres Freund <andres@anarazel.de>
> DropRelFileNodeBuffers() in recovery? The most common path is
> DropRelationFiles()->smgrdounlinkall()->DropRelFileNodesAllBuffers(),
> which 3/4 doesn't address and 4/4 doesn't mention.
>
> 4/4 seems to address DropRelationFiles(), but only talks about TRUNCATE?

Yes. DropRelationFiles() is used in the following two paths:

[Replay of TRUNCATE during recovery]
xact_redo_commit/abort() -> DropRelationFiles() -> smgrdounlinkall()
-> DropRelFileNodesAllBuffers()

[COMMIT/ROLLBACK PREPARED]
FinishPreparedTransaction() -> DropRelationFiles() -> smgrdounlinkall()
-> DropRelFileNodesAllBuffers()

> I also don't get why 4/4 would be a good idea on its own. It uses
> BUF_DROP_FULL_SCAN_THRESHOLD to guard FindAndDropRelFileNodeBuffers()
> on a per-relation basis. But since DropRelFileNodesAllBuffers() can be
> used for many relations at once, this could end up doing
> BUF_DROP_FULL_SCAN_THRESHOLD - 1 lookups a lot of times, once for each
> of nnodes relations?

So, the threshold value should be compared with the total number of
blocks of all target relations, not each relation. You seem to be
right; got it.

> Also, how is 4/4 safe -- this is outside of recovery too?

It seems that DropRelFileNodesAllBuffers() should trigger the new
optimization path only when InRecovery == true, because it
intentionally doesn't check the "accurate" value returned from
smgrnblocks().

> Smaller comment:
>
> +static void
> +FindAndDropRelFileNodeBuffers(RelFileNode rnode, ForkNumber *forkNum, int nforks,
> +							  BlockNumber *nForkBlocks, BlockNumber *firstDelBlock)
> ...
> +		/* Check that it is in the buffer pool. If not, do nothing. */
> +		LWLockAcquire(bufPartitionLock, LW_SHARED);
> +		buf_id = BufTableLookup(&bufTag, bufHash);
> ...
> +		bufHdr = GetBufferDescriptor(buf_id);
> +
> +		buf_state = LockBufHdr(bufHdr);
> +
> +		if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
> +			bufHdr->tag.forkNum == forkNum[i] &&
> +			bufHdr->tag.blockNum >= firstDelBlock[i])
> +			InvalidateBuffer(bufHdr);	/* releases spinlock */
> +		else
> +			UnlockBufHdr(bufHdr, buf_state);
>
> I'm a bit confused about the check here. We hold a buffer partition
> lock, and have done a lookup in the mapping table. Why are we then
> rechecking the relfilenode/fork/blocknum? And why are we doing so
> holding the buffer header lock, which is essentially a spinlock, so
> should only ever be held for very short portions?
>
> This looks like it's copying logic from DropRelFileNodeBuffers() etc,
> but there the situation is different: We haven't done a buffer mapping
> lookup, and we don't hold a partition lock!

That's because the buffer partition lock is released immediately after
the hash table has been looked up. As an aside, InvalidateBuffer()
requires the caller to hold the buffer header spinlock and doesn't hold
the buffer partition lock.

Regards
Takayuki Tsunakawa
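[To illustrate the ordering being described -- partition lock held only
around the lookup, then the header spinlock with a recheck -- here is a
condensed sketch of the loop in question, following the fragment quoted
above; simplified, not the exact patch:]

    /*
     * Condensed sketch of FindAndDropRelFileNodeBuffers()'s per-block
     * loop. The mapping partition lock protects only BufTableLookup();
     * once it is released, the buffer could be evicted and reused for
     * another page, so the tag must be rechecked under the buffer
     * header spinlock before InvalidateBuffer() is called.
     */
    for (curBlock = firstDelBlock[i]; curBlock < nForkBlocks[i]; curBlock++)
    {
        BufferTag   bufTag;
        uint32      bufHash;
        LWLock     *bufPartitionLock;
        int         buf_id;
        BufferDesc *bufHdr;
        uint32      buf_state;

        INIT_BUFFERTAG(bufTag, rnode, forkNum[i], curBlock);
        bufHash = BufTableHashCode(&bufTag);
        bufPartitionLock = BufMappingPartitionLock(bufHash);

        /* Check that it is in the buffer pool. If not, do nothing. */
        LWLockAcquire(bufPartitionLock, LW_SHARED);
        buf_id = BufTableLookup(&bufTag, bufHash);
        LWLockRelease(bufPartitionLock);

        if (buf_id < 0)
            continue;

        bufHdr = GetBufferDescriptor(buf_id);

        buf_state = LockBufHdr(bufHdr);
        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
            bufHdr->tag.forkNum == forkNum[i] &&
            bufHdr->tag.blockNum >= firstDelBlock[i])
            InvalidateBuffer(bufHdr);   /* releases spinlock */
        else
            UnlockBufHdr(bufHdr, buf_state);
    }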
On Thursday, November 19, 2020 4:08 PM, Tsunakawa, Takayuki wrote:
> From: Andres Freund <andres@anarazel.de>
> > DropRelFileNodeBuffers() in recovery? The most common path is
> > DropRelationFiles()->smgrdounlinkall()->DropRelFileNodesAllBuffers(),
> > which 3/4 doesn't address and 4/4 doesn't mention.
> >
> > 4/4 seems to address DropRelationFiles(), but only talks about TRUNCATE?
>
> Yes. DropRelationFiles() is used in the following two paths:
>
> [Replay of TRUNCATE during recovery]
> xact_redo_commit/abort() -> DropRelationFiles() -> smgrdounlinkall()
> -> DropRelFileNodesAllBuffers()
>
> [COMMIT/ROLLBACK PREPARED]
> FinishPreparedTransaction() -> DropRelationFiles() -> smgrdounlinkall()
> -> DropRelFileNodesAllBuffers()

Yes. The concern was that it was not clear in the function descriptions
and commit logs what the optimizations in DropRelFileNodeBuffers() and
DropRelFileNodesAllBuffers() are for. So I revised the function
description of DropRelFileNodeBuffers() and the commit logs of the
0003-0004 patches. Please check whether the brief descriptions suffice.

> > I also don't get why 4/4 would be a good idea on its own. It uses
> > BUF_DROP_FULL_SCAN_THRESHOLD to guard FindAndDropRelFileNodeBuffers()
> > on a per-relation basis. But since DropRelFileNodesAllBuffers() can be
> > used for many relations at once, this could end up doing
> > BUF_DROP_FULL_SCAN_THRESHOLD - 1 lookups a lot of times, once for each
> > of nnodes relations?
>
> So, the threshold value should be compared with the total number of
> blocks of all target relations, not each relation. You seem to be
> right; got it.

Fixed this in the 0004 patch. Now we compare the total number of
buffers to be invalidated for ALL relations to
BUF_DROP_FULL_SCAN_THRESHOLD.

> > Also, how is 4/4 safe -- this is outside of recovery too?
>
> It seems that DropRelFileNodesAllBuffers() should trigger the new
> optimization path only when InRecovery == true, because it
> intentionally doesn't check the "accurate" value returned from
> smgrnblocks().

Fixed it in the 0004 patch. Now we ensure that we enter the
optimization path only during recovery.

> From: Amit Kapila <amit.kapila16@gmail.com>
> On Wed, Nov 18, 2020 at 11:43 PM Andres Freund <andres@anarazel.de>
> > I'm also worried about the cases where this could leave buffers behind
> > in the buffer pool, without a crosscheck like Thomas' patch would allow
> > us to add. Obviously other processes can dirty buffers in hot_standby,
> > so any leftover buffer could have bad consequences.
>
> The problem can only arise if other processes extend the relation. The
> idea was that during recovery the relation is extended by only one
> process, which keeps the cache valid. Kirk seems to have done testing
> to cross-verify it by using his first patch
> (Prevent-invalidating-blocks-in-smgrextend-during). Which other
> crosscheck are you referring to here?
>
> I agree that we can do a better job by expanding the comments to
> clearly state why it is safe.

Yes, basically what Amit-san also mentioned above. The first patch
prevents that. And in the description of DropRelFileNodeBuffers in the
0003 patch, please check if that would suffice.

> > Smaller comment:
> >
> > +static void
> > +FindAndDropRelFileNodeBuffers(RelFileNode rnode, ForkNumber *forkNum, int nforks,
> > +							  BlockNumber *nForkBlocks, BlockNumber *firstDelBlock)
> > ...
> > +		/* Check that it is in the buffer pool. If not, do nothing. */
> > +		LWLockAcquire(bufPartitionLock, LW_SHARED);
> > +		buf_id = BufTableLookup(&bufTag, bufHash);
> > ...
> > +		bufHdr = GetBufferDescriptor(buf_id);
> > +
> > +		buf_state = LockBufHdr(bufHdr);
> > +
> > +		if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
> > +			bufHdr->tag.forkNum == forkNum[i] &&
> > +			bufHdr->tag.blockNum >= firstDelBlock[i])
> > +			InvalidateBuffer(bufHdr);	/* releases spinlock */
> > +		else
> > +			UnlockBufHdr(bufHdr, buf_state);
> >
> > I'm a bit confused about the check here. We hold a buffer partition
> > lock, and have done a lookup in the mapping table. Why are we then
> > rechecking the relfilenode/fork/blocknum? And why are we doing so
> > holding the buffer header lock, which is essentially a spinlock, so
> > should only ever be held for very short portions?
> >
> > This looks like it's copying logic from DropRelFileNodeBuffers() etc,
> > but there the situation is different: We haven't done a buffer mapping
> > lookup, and we don't hold a partition lock!
>
> That's because the buffer partition lock is released immediately after
> the hash table has been looked up. As an aside, InvalidateBuffer()
> requires the caller to hold the buffer header spinlock and doesn't hold
> the buffer partition lock.

Yes. Holding the buffer header spinlock is necessary to invalidate the
buffers. As for the buffer mapping partition lock, as mentioned by
Tsunakawa-san, it is released immediately after BufTableLookup, which is
similar to the lookup done in PrefetchSharedBuffer. So I retained these
changes.

I have attached the updated patches. Aside from the descriptions, there
are no other major changes in the patch set except 0004. Feedback is
welcome.

Regards,
Kirk Jamison
> From: k.jamison@fujitsu.com <k.jamison@fujitsu.com> > On Thursday, November 19, 2020 4:08 PM, Tsunakawa, Takayuki wrote: > > From: Andres Freund <andres@anarazel.de> > > > DropRelFileNodeBuffers() in recovery? The most common path is > > > DropRelationFiles()->smgrdounlinkall()->DropRelFileNodesAllBuffers() > > > , which 3/4 doesn't address and 4/4 doesn't mention. > > > > > > 4/4 seems to address DropRelationFiles(), but only talks about > > TRUNCATE? > > > > Yes. DropRelationFiles() is used in the following two paths: > > > > [Replay of TRUNCATE during recovery] > > xact_redo_commit/abort() -> DropRelationFiles() -> smgrdounlinkall() > > -> > > DropRelFileNodesAllBuffers() > > > > [COMMIT/ROLLBACK PREPARED] > > FinishPreparedTransaction() -> DropRelationFiles() -> > > smgrdounlinkall() > > -> DropRelFileNodesAllBuffers() > > Yes. The concern is that it was not clear in the function descriptions and > commit logs what the optimizations for the DropRelFileNodeBuffers() and > DropRelFileNodesAllBuffers() are for. So I revised only the function > description of DropRelFileNodeBuffers() and the commit logs of the > 0003-0004 patches. Please check if the brief descriptions would suffice. > > > > > I also don't get why 4/4 would be a good idea on its own. It uses > > > BUF_DROP_FULL_SCAN_THRESHOLD to guard > > > FindAndDropRelFileNodeBuffers() on a per relation basis. But since > > > DropRelFileNodesAllBuffers() can be used for many relations at once, > > > this could end up doing BUF_DROP_FULL_SCAN_THRESHOLD - 1 > > lookups a lot > > > of times, once for each of nnodes relations? > > > > So, the threshold value should be compared with the total number of > > blocks of all target relations, not each relation. You seem to be right, got it. > > Fixed this in 0004 patch. Now we compare the total number of > buffers-to-be-invalidated For ALL relations to the > BUF_DROP_FULL_SCAN_THRESHOLD. > > > > Also, how is 4/4 safe - this is outside of recovery too? > > > > It seems that DropRelFileNodesAllBuffers() should trigger the new > > optimization path only when InRecovery == true, because it > > intentionally doesn't check the "accurate" value returned from > smgrnblocks(). > > Fixed it in 0004 patch. Now we ensure that we only enter the optimization path > Iff during recovery. > > > > From: Amit Kapila <amit.kapila16@gmail.com> On Wed, Nov 18, 2020 at > > 11:43 PM Andres Freund <andres@anarazel.de> > > > I'm also worried about the cases where this could cause buffers left > > > in the buffer pool, without a crosscheck like Thomas' patch would > > > allow to add. Obviously other processes can dirty buffers in > > > hot_standby, so any leftover buffer could have bad consequences. > > > > > > > The problem can only arise if other processes extend the relation. The > > idea was that in recovery it extends relation by one process which > > helps to maintain the cache. Kirk seems to have done testing to > > cross-verify it by using his first patch > > (Prevent-invalidating-blocks-in-smgrextend-during). Which other > crosscheck you are referring here? > > > > I agree that we can do a better job by expanding comments to clearly > > state why it is safe. > > Yes, basically what Amit-san also mentioned above. The first patch prevents > that. > And in the description of DropRelFileNodeBuffers in the 0003 patch, please > check If that would suffice. 
> > > Smaller comment:
> > >
> > > +static void
> > > +FindAndDropRelFileNodeBuffers(RelFileNode rnode, ForkNumber *forkNum, int nforks,
> > > +							  BlockNumber *nForkBlocks, BlockNumber *firstDelBlock)
> > > ...
> > > +		/* Check that it is in the buffer pool. If not, do nothing. */
> > > +		LWLockAcquire(bufPartitionLock, LW_SHARED);
> > > +		buf_id = BufTableLookup(&bufTag, bufHash);
> > > ...
> > > +		bufHdr = GetBufferDescriptor(buf_id);
> > > +
> > > +		buf_state = LockBufHdr(bufHdr);
> > > +
> > > +		if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
> > > +			bufHdr->tag.forkNum == forkNum[i] &&
> > > +			bufHdr->tag.blockNum >= firstDelBlock[i])
> > > +			InvalidateBuffer(bufHdr);	/* releases spinlock */
> > > +		else
> > > +			UnlockBufHdr(bufHdr, buf_state);
> > >
> > > I'm a bit confused about the check here. We hold a buffer partition
> > > lock, and have done a lookup in the mapping table. Why are we then
> > > rechecking the relfilenode/fork/blocknum? And why are we doing so
> > > holding the buffer header lock, which is essentially a spinlock, so
> > > should only ever be held for very short portions?
> > >
> > > This looks like it's copying logic from DropRelFileNodeBuffers() etc,
> > > but there the situation is different: We haven't done a buffer mapping
> > > lookup, and we don't hold a partition lock!
> >
> > That's because the buffer partition lock is released immediately after
> > the hash table has been looked up. As an aside, InvalidateBuffer()
> > requires the caller to hold the buffer header spinlock and doesn't hold
> > the buffer partition lock.
>
> Yes. Holding the buffer header spinlock is necessary to invalidate the
> buffers. As for the buffer mapping partition lock, as mentioned by
> Tsunakawa-san, it is released immediately after BufTableLookup, which is
> similar to the lookup done in PrefetchSharedBuffer. So I retained these
> changes.
>
> I have attached the updated patches. Aside from the descriptions, there
> are no other major changes in the patch set except 0004. Feedback is
> welcome.

Hi,

Given that I modified the 0004 patch, I repeated the recovery
performance tests I did in [1], but this time using 1000 relations
(1 MB per relation). With this relation size, the sequential full
buffer scan is expected for 128MB shared_buffers, while the optimized
path is taken for the larger shared_buffers. Below are the results:

[TRUNCATE]
| s_b    | MASTER (sec) | PATCHED (sec) |
|--------|--------------|---------------|
| 128 MB | 0.506        | 0.506         |
| 1 GB   | 0.906        | 0.506         |
| 20 GB  | 19.33        | 0.506         |
| 100 GB | 74.941       | 0.506         |

[VACUUM]
| s_b    | MASTER (sec) | PATCHED (sec) |
|--------|--------------|---------------|
| 128 MB | 1.207        | 0.737         |
| 1 GB   | 1.707        | 0.806         |
| 20 GB  | 14.325       | 0.806         |
| 100 GB | 64.728       | 1.307         |

Looking at the results for both VACUUM and TRUNCATE, we can see the
performance improvement from the optimizations. In addition, there was
no regression for the full scan of the whole buffer pool (as seen at
128MB s_b).

Regards,
Kirk Jamison

[1] https://www.postgresql.org/message-id/OSBPR01MB234176B1829AECFE9FDDFCC2EFE90%40OSBPR01MB2341.jpnprd01.prod.outlook.com
Hello, Kirk. Thank you for the new version.

At Thu, 26 Nov 2020 03:04:10 +0000, "k.jamison@fujitsu.com" <k.jamison@fujitsu.com> wrote in
> On Thursday, November 19, 2020 4:08 PM, Tsunakawa, Takayuki wrote:
> > From: Andres Freund <andres@anarazel.de>
> > > DropRelFileNodeBuffers() in recovery? The most common path is
> > > DropRelationFiles()->smgrdounlinkall()->DropRelFileNodesAllBuffers(),
> > > which 3/4 doesn't address and 4/4 doesn't mention.
> > >
> > > 4/4 seems to address DropRelationFiles(), but only talks about
> > > TRUNCATE?
> >
> > Yes. DropRelationFiles() is used in the following two paths:
> >
> > [Replay of TRUNCATE during recovery]
> > xact_redo_commit/abort() -> DropRelationFiles() -> smgrdounlinkall() ->
> > DropRelFileNodesAllBuffers()
> >
> > [COMMIT/ROLLBACK PREPARED]
> > FinishPreparedTransaction() -> DropRelationFiles() -> smgrdounlinkall() ->
> > DropRelFileNodesAllBuffers()
>
> Yes. The concern is that it was not clear in the function descriptions and
> commit logs what the optimizations for DropRelFileNodeBuffers() and
> DropRelFileNodesAllBuffers() are for. So I revised only the function
> description of DropRelFileNodeBuffers() and the commit logs of the
> 0003-0004 patches. Please check if the brief descriptions would suffice.

I read the commit message of 3/4. (Though this is not involved
literally in the final commit.)

> While recovery, when WAL files of XLOG_SMGR_TRUNCATE from vacuum
> or autovacuum are replayed, the buffers are dropped when the sizes
> of all involved forks of a relation are already "cached". We can get

This sentence seems to be missing "dropped by (or using) what".

> a reliable size of nblocks for supplied relation's fork at that time,
> and it's safe because DropRelFileNodeBuffers() relies on the behavior
> that cached nblocks will not be invalidated by file extension during
> recovery. Otherwise, or if not in recovery, proceed to sequential
> search of the whole buffer pool.

This sentence seems to involve confusion. It reads as if "we can rely
on it because we're relying on it". And "the cached value won't be
invalidated" doesn't explain the reason precisely. The reason, I think,
is that the cached value is guaranteed to be the maximum page we have
in shared buffers at least while in recovery, and that guarantee is
held by not asking fseek once we have cached the value.

> > > I also don't get why 4/4 would be a good idea on its own. It uses
> > > BUF_DROP_FULL_SCAN_THRESHOLD to guard
> > > FindAndDropRelFileNodeBuffers() on a per relation basis. But since
> > > DropRelFileNodesAllBuffers() can be used for many relations at once,
> > > this could end up doing BUF_DROP_FULL_SCAN_THRESHOLD - 1 lookups a lot
> > > of times, once for each of nnodes relations?
> >
> > So, the threshold value should be compared with the total number of blocks
> > of all target relations, not each relation. You seem to be right, got it.
>
> Fixed this in the 0004 patch. Now we compare the total number of
> buffers-to-be-invalidated for ALL relations to the
> BUF_DROP_FULL_SCAN_THRESHOLD.

I didn't see the previous version, but the series of additional
palloc/pfree calls in this version looks worrisome.

	int			i,
+				j,
+			   *nforks,
				n = 0;

I don't think we should define variables of different types in one
declaration like this. (I'm not sure about defining multiple variables
at once.)
@@ -3110,7 +3125,10 @@ DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes)
 			DropRelFileNodeAllLocalBuffers(rnodes[i].node);
 		}
 		else
+		{
+			rels[n] = smgr_reln[i];
 			nodes[n++] = rnodes[i].node;
+		}
 	}

We don't need to remember nodes and rnodes here since rnodes[n] is
rels[n]->smgr_rnode here. Or we don't even need to store rels since we
can scan smgr_reln again later. nodes is needed in the full-scan path,
but it is enough to collect it after finding that we do a full scan.

 	/*
@@ -3120,6 +3138,68 @@ DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, int nnodes)
 	if (n == 0)
 	{
 		pfree(nodes);
+		pfree(rels);
+		pfree(rnodes);
+		return;
+	}
+
+	nforks = palloc(sizeof(int) * n);
+	forks = palloc(sizeof(ForkNumber *) * n);
+	blocks = palloc(sizeof(BlockNumber *) * n);
+	firstDelBlocks = palloc(sizeof(BlockNumber) * n * (MAX_FORKNUM + 1));
+	for (i = 0; i < n; i++)
+	{
+		forks[i] = palloc(sizeof(ForkNumber) * (MAX_FORKNUM + 1));
+		blocks[i] = palloc(sizeof(BlockNumber) * (MAX_FORKNUM + 1));
+	}

We can allocate the whole array at once like this:

	BlockNumber (*blocks)[MAX_FORKNUM+1] =
		(BlockNumber (*)[MAX_FORKNUM+1])
		palloc(sizeof(BlockNumber) * n * (MAX_FORKNUM + 1));

The elements of forks[][] and blocks[][] are not initialized because
some of the elements may be skipped due to the absence of the
corresponding fork.

+		if (!smgrexists(rels[i], j))
+			continue;
+
+		/* Get the number of blocks for a relation's fork */
+		blocks[i][numForks] = smgrnblocks(rels[i], j, NULL);

If we see a fork whose size is not cached, we must give up this
optimization for all target relations.

+		nBlocksToInvalidate += blocks[i][numForks];
+
+		forks[i][numForks++] = j;

We can signal the absence of a fork to the later code by setting
InvalidBlockNumber in blocks. Thus forks[], nforks and numForks can be
removed.

+	/* Zero the array of blocks because these will all be dropped anyway */
+	MemSet(firstDelBlocks, 0, sizeof(BlockNumber) * n * (MAX_FORKNUM + 1));

We don't need to prepare nforks, forks and firstDelBlocks for all
relations before looping over relations. In other words, we can fill
in the arrays for a relation at every iteration over the relations.

+	 * We enter the optimization iff we are in recovery and the number of blocks to

This comment sticks out beyond 80 columns. (I'm not sure whether that
convention is still valid..)

+	if (InRecovery && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)

We don't need to check InRecovery here. DropRelFileNodeBuffers doesn't
do that.

+		for (j = 0; j < n; j++)
+		{
+			FindAndDropRelFileNodeBuffers(nodes[j], forks[j], nforks[j],

i is not used at this nesting level, so we can use i here.

> > > Also, how is 4/4 safe - this is outside of recovery too?
> >
> > It seems that DropRelFileNodesAllBuffers() should trigger the new
> > optimization path only when InRecovery == true, because it intentionally
> > doesn't check the "accurate" value returned from smgrnblocks().
>
> Fixed it in the 0004 patch. Now we ensure that we enter the optimization
> path only during recovery.

If the size of any of the target relations is not cached, we give up
the optimization entirely, even while in recovery. Or am I missing
something?

> > From: Amit Kapila <amit.kapila16@gmail.com>
> > On Wed, Nov 18, 2020 at 11:43 PM Andres Freund <andres@anarazel.de>
> > > I'm also worried about the cases where this could cause buffers left
> > > in the buffer pool, without a crosscheck like Thomas' patch would
> > > allow to add.
> > > Obviously other processes can dirty buffers in
> > > hot_standby, so any leftover buffer could have bad consequences.
> >
> > The problem can only arise if other processes extend the relation. The idea
> > was that in recovery the relation is extended by one process, which helps to
> > maintain the cache. Kirk seems to have done testing to cross-verify it by using
> > his first patch (Prevent-invalidating-blocks-in-smgrextend-during). Which
> > other crosscheck are you referring to here?
> >
> > I agree that we can do a better job by expanding comments to clearly state
> > why it is safe.
>
> Yes, basically what Amit-san also mentioned above. The first patch prevents that.
> And in the description of DropRelFileNodeBuffers in the 0003 patch, please check
> if that would suffice.

+	 * While in recovery, if the expected maximum number of buffers to be
+	 * dropped is small enough and the sizes of all involved forks are
+	 * already cached, individual buffer is located by BufTableLookup().
+	 * It is safe because cached blocks will not be invalidated by file
+	 * extension during recovery. See smgrnblocks() and smgrextend() for
+	 * more details. Otherwise, if the conditions for optimization are not
+	 * met, the buffer pool is sequentially scanned so that no buffers are
+	 * left behind.

I'm not confident about it, but it seems somewhat obscure. How about
something like this?

	We mustn't leave any buffer behind for the relations to be dropped.
	We invalidate buffer blocks by locating them with BufTableLookup()
	when we are sure we know up to what page of every fork we possibly
	have a buffer for. We can know that by the "cached" flag returned
	by smgrnblocks. It currently is true only while in recovery. See
	smgrnblocks() and smgrextend(). Otherwise we scan the whole buffer
	pool to find buffers for the relation, which is slower when only a
	small part of the buffers are to be dropped.

> > > Smaller comment:
> > >
> > > +static void
> > > +FindAndDropRelFileNodeBuffers(RelFileNode rnode, ForkNumber *forkNum,
> > >                               int nforks,
> > > +                             BlockNumber *nForkBlocks, BlockNumber *firstDelBlock)
> > > ...
> > > +		/* Check that it is in the buffer pool. If not, do nothing. */
> > > +		LWLockAcquire(bufPartitionLock, LW_SHARED);
> > > +		buf_id = BufTableLookup(&bufTag, bufHash);
> > > ...
> > > +		bufHdr = GetBufferDescriptor(buf_id);
> > > +
> > > +		buf_state = LockBufHdr(bufHdr);
> > > +
> > > +		if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
> > > +			bufHdr->tag.forkNum == forkNum[i] &&
> > > +			bufHdr->tag.blockNum >= firstDelBlock[i])
> > > +			InvalidateBuffer(bufHdr);	/* releases spinlock */
> > > +		else
> > > +			UnlockBufHdr(bufHdr, buf_state);
> > >
> > > I'm a bit confused about the check here. We hold a buffer partition
> > > lock, and have done a lookup in the mapping table. Why are we then
> > > rechecking the relfilenode/fork/blocknum? And why are we doing so
> > > holding the buffer header lock, which is essentially a spinlock, so
> > > should only ever be held for very short portions?
> > >
> > > This looks like it's copying logic from DropRelFileNodeBuffers() etc,
> > > but there the situation is different: We haven't done a buffer mapping
> > > lookup, and we don't hold a partition lock!
> >
> > That's because the buffer partition lock is released immediately after the hash
> > table has been looked up. As an aside, InvalidateBuffer() requires the caller
> > to hold the buffer header spinlock and doesn't hold the buffer partition lock.
>
> Yes.
> Holding the buffer header spinlock is necessary to invalidate the buffers.
> As for the buffer mapping partition lock, as mentioned by Tsunakawa-san, it is
> released immediately after BufTableLookup, which is similar to the lookup done
> in PrefetchSharedBuffer. So I retained these changes.
>
> I have attached the updated patches. Aside from descriptions, no other major
> changes in the patch set except 0004. Feedbacks are welcome.

FWIW, as Tsunakawa-san mentioned, the partition lock is released
immediately after the look-up. The reason we may release the partition
lock immediately is that it is OK if the buffer has been evicted by
someone in order to reuse it for another relation. We can detect that
case by rechecking the buffer tag after taking the header lock.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
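To make the locking order in this exchange concrete, here is a minimal sketch of the per-block drop, assuming it lives in bufmgr.c with rnode, forkNum, blockNum and firstDelBlock in scope; it paraphrases the quoted FindAndDropRelFileNodeBuffers() hunk and is illustrative rather than the patch itself:

	BufferTag	bufTag;
	uint32		bufHash;
	LWLock	   *bufPartitionLock;
	int			buf_id;
	BufferDesc *bufHdr;
	uint32		buf_state;

	INIT_BUFFERTAG(bufTag, rnode, forkNum, blockNum);
	bufHash = BufTableHashCode(&bufTag);
	bufPartitionLock = BufMappingPartitionLock(bufHash);

	/* Do the mapping-table lookup under the partition lock only. */
	LWLockAcquire(bufPartitionLock, LW_SHARED);
	buf_id = BufTableLookup(&bufTag, bufHash);
	LWLockRelease(bufPartitionLock);	/* buffer may be evicted from here on */

	if (buf_id >= 0)
	{
		bufHdr = GetBufferDescriptor(buf_id);
		buf_state = LockBufHdr(bufHdr);

		/* Recheck the tag: the slot may now hold a different page. */
		if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
			bufHdr->tag.forkNum == forkNum &&
			bufHdr->tag.blockNum >= firstDelBlock)
			InvalidateBuffer(bufHdr);	/* releases spinlock */
		else
			UnlockBufHdr(bufHdr, buf_state);
	}

In other words, precisely because the partition lock is dropped before the header spinlock is taken, the recheck under the spinlock is what makes the invalidation safe against concurrent eviction and reuse.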
At Thu, 26 Nov 2020 16:18:55 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> + /* Zero the array of blocks because these will all be dropped anyway */
> + MemSet(firstDelBlocks, 0, sizeof(BlockNumber) * n * (MAX_FORKNUM + 1));
>
> We don't need to prepare nforks, forks and firstDelBlocks for all
> relations before looping over relations. In other words, we can fill
> in the arrays for a relation at every iteration over the relations.

Or we could even call FindAndDropRelFileNodeBuffers() for each fork. It
doesn't matter from the performance perspective whether the function
loops over forks or the function is called for each fork.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
> From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> Hello, Kirk. Thank you for the new version.

Hi, Horiguchi-san. Thank you for your very helpful feedback.
I'm updating the patches to address those points.

> +	if (!smgrexists(rels[i], j))
> +		continue;
> +
> +	/* Get the number of blocks for a relation's fork */
> +	blocks[i][numForks] = smgrnblocks(rels[i], j, NULL);
>
> If we see a fork whose size is not cached, we must give up this optimization
> for all target relations.

I did not use the "cached" flag in DropRelFileNodesAllBuffers and used
InRecovery when deciding for optimization, for the following reasons:
XLogReadBufferExtended() calls smgrnblocks() to apply changes to relation
page contents. So in DropRelFileNodeBuffers(), XLogReadBufferExtended() is
called during VACUUM replay because VACUUM changes the page content.
OTOH, TRUNCATE doesn't change the relation content, it just truncates
relation pages without changing the page contents. So
XLogReadBufferExtended() is not called, and the "cached" flag will always
return false. I tested with the "cached" flag before, and it always returns
false, at least in DropRelFileNodesAllBuffers. Due to this, we cannot use
the cached flag in DropRelFileNodesAllBuffers(). However, I think we can
still rely on smgrnblocks to get the file size as long as we're InRecovery.
That cached nblocks is still guaranteed to be the maximum in the shared
buffer.
Thoughts?

Regards,
Kirk Jamison
At Fri, 27 Nov 2020 02:19:57 +0000, "k.jamison@fujitsu.com" <k.jamison@fujitsu.com> wrote in
> > From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> > Hello, Kirk. Thank you for the new version.
>
> Hi, Horiguchi-san. Thank you for your very helpful feedback.
> I'm updating the patches to address those points.
>
> > If we see a fork whose size is not cached, we must give up this optimization
> > for all target relations.
>
> I did not use the "cached" flag in DropRelFileNodesAllBuffers and used
> InRecovery when deciding for optimization, for the following reasons:
> XLogReadBufferExtended() calls smgrnblocks() to apply changes to relation
> page contents. So in DropRelFileNodeBuffers(), XLogReadBufferExtended() is
> called during VACUUM replay because VACUUM changes the page content.
> OTOH, TRUNCATE doesn't change the relation content, it just truncates
> relation pages without changing the page contents. So
> XLogReadBufferExtended() is not called, and the "cached" flag will always
> return false. I tested with the "cached" flag before, and it

A bit different from the point, but if some tuples have been inserted
to the truncated table, XLogReadBufferExtended() is called for the
table and the length is cached.

> always returns false, at least in DropRelFileNodesAllBuffers. Due to this,
> we cannot use the cached flag in DropRelFileNodesAllBuffers(). However, I
> think we can still rely on smgrnblocks to get the file size as long as
> we're InRecovery. That cached nblocks is still guaranteed to be the
> maximum in the shared buffer.
> Thoughts?

That means that we always think as if smgrnblocks returns a "cached"
(or "safe") value during recovery, which is out of our current
consensus. If we go to that side, we don't need to consult the "cached"
flag returned from smgrnblocks at all and it's enough to see only
InRecovery.

I got confused..

We are relying on the "fact" that the first lseek() call of a
(startup) process tells the truth. We added an assertion so that we
make sure that the cached value won't be cleared during recovery. A
possible remaining danger would be the closing of an smgr object of a
live relation just after a file extension failure. I think we are
assuming that that doesn't happen during recovery. Although it seems
true to me, I'm not confident.

If that's true, we don't even need to look at the "cached" flag at all
and can always rely on the value returned from smgrnblocks() during
recovery. Otherwise, we need to avoid that danger situation.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
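To make the two candidate guards concrete, here is a minimal sketch, assuming the patch set's smgrnblocks(reln, forknum, bool *cached) signature and the FindAndDropRelFileNodeBuffers() function from the quoted hunks (neither exists in released PostgreSQL), with reln, forknum and firstDelBlock in scope:

	/* Option A: trust only a size the smgr layer reports as "cached". */
	bool		cached = false;
	BlockNumber nblocks = smgrnblocks(reln, forknum, &cached);

	if (cached && nblocks < BUF_DROP_FULL_SCAN_THRESHOLD)
		FindAndDropRelFileNodeBuffers(reln->smgr_rnode.node, forknum,
									  nblocks, firstDelBlock);
	else
	{
		/* fall back to the sequential scan of the whole buffer pool */
	}

	/* Option B: trust any size obtained while InRecovery. */
	if (InRecovery && nblocks < BUF_DROP_FULL_SCAN_THRESHOLD)
	{
		/* same individual-buffer path as above */
	}

Option A is the conservative reading; Option B is only safe if the startup process can never lose a cached size mid-recovery, which is exactly the smgrclose() question raised next.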
Hi!

I've found this patch marked RFC in the commitfest application, so I've
quickly checked whether it's really ready for committer. It seems there are
still unaddressed review notes. I'm going to switch it to WFA.

------
Regards,
Alexander Korotkov
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> We are relying on the "fact" that the first lseek() call of a
> (startup) process tells the truth. We added an assertion so that we
> make sure that the cached value won't be cleared during recovery. A
> possible remaining danger would be the closing of an smgr object of a
> live relation just after a file extension failure. I think we are
> assuming that that doesn't happen during recovery. Although it seems
> true to me, I'm not confident.
>
> If that's true, we don't even need to look at the "cached" flag at all
> and can always rely on the value returned from smgrnblocks() during
> recovery. Otherwise, we need to avoid that danger situation.

Hmm, I've come to think that smgrnblocks() doesn't need the cached
parameter, too. DropRel*Buffers() can just check InRecovery. Regarding
the only remaining concern, smgrclose() by the startup process, I was
afraid of the cache invalidation by CacheInvalidateSmgr(), but the
startup process doesn't receive shared inval messages. So it doesn't
call smgrclose*() due to shared cache invalidation.

[InitRecoveryTransactionEnvironment()]
	/*
	 * Initialize shared invalidation management for Startup process, being
	 * careful to register ourselves as a sendOnly process so we don't need to
	 * read messages, nor will we get signaled when the queue starts filling
	 * up.
	 */
	SharedInvalBackendInit(true);

Kirk-san, can you check to see that smgrclose() and its friends are not
called during recovery, using the regression tests?

Regards
Takayuki Tsunakawa
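One hypothetical way to run that check, sketched only to illustrate the idea (the elog() call is temporary instrumentation I am assuming, not anything in the patch set): patch smgrclose() to report calls made during recovery, then run the recovery TAP tests and grep the server log.

	void
	smgrclose(SMgrRelation reln)
	{
		/*
		 * Temporary instrumentation: InRecovery is true only in the
		 * startup process, which is the process of interest here.
		 */
		if (InRecovery)
			elog(LOG, "smgrclose during recovery for rel %u/%u/%u",
				 reln->smgr_rnode.node.spcNode,
				 reln->smgr_rnode.node.dbNode,
				 reln->smgr_rnode.node.relNode);

		/* ... existing body of smgrclose() unchanged ... */
	}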
On Thursday, November 26, 2020 4:19 PM, Horiguchi-san wrote: > Hello, Kirk. Thank you for the new version. Apologies for the delay, but attached are the updated versions to simplify the patches. The changes reflected most of your comments/suggestions. Summary of changes in the latest versions. 1. Updated the function description of DropRelFileNodeBuffers in 0003. 2. Updated the commit logs of 0003 and 0004. 3, FindAndDropRelFileNodeBuffers is now called for each relation fork, instead of for all involved forks. 4. Removed the unnecessary palloc() and subscripts like forks[][], firstDelBlock[], nforks, as advised by Horiguchi-san. The memory allocation for block[][] was also simplified. So 0004 became simpler and more readable. > At Thu, 26 Nov 2020 03:04:10 +0000, "k.jamison@fujitsu.com" > <k.jamison@fujitsu.com> wrote in > > On Thursday, November 19, 2020 4:08 PM, Tsunakawa, Takayuki wrote: > > > From: Andres Freund <andres@anarazel.de> > > > > DropRelFileNodeBuffers() in recovery? The most common path is > > > > DropRelationFiles()->smgrdounlinkall()->DropRelFileNodesAllBuffers > > > > (), which 3/4 doesn't address and 4/4 doesn't mention. > > > > > > > > 4/4 seems to address DropRelationFiles(), but only talks about > > > TRUNCATE? > > > > > > Yes. DropRelationFiles() is used in the following two paths: > > > > > > [Replay of TRUNCATE during recovery] > > > xact_redo_commit/abort() -> DropRelationFiles() -> > > > smgrdounlinkall() -> > > > DropRelFileNodesAllBuffers() > > > > > > [COMMIT/ROLLBACK PREPARED] > > > FinishPreparedTransaction() -> DropRelationFiles() -> > > > smgrdounlinkall() > > > -> DropRelFileNodesAllBuffers() > > > > Yes. The concern is that it was not clear in the function descriptions > > and commit logs what the optimizations for the > > DropRelFileNodeBuffers() and DropRelFileNodesAllBuffers() are for. So > > I revised only the function description of DropRelFileNodeBuffers() and the > commit logs of the 0003-0004 patches. Please check if the brief descriptions > would suffice. > > I read the commit message of 3/4. (Though this is not involved literally in the > final commit.) > > > While recovery, when WAL files of XLOG_SMGR_TRUNCATE from vacuum > or > > autovacuum are replayed, the buffers are dropped when the sizes of all > > involved forks of a relation are already "cached". We can get > > This sentence seems missing "dropped by (or using) what". > > > a reliable size of nblocks for supplied relation's fork at that time, > > and it's safe because DropRelFileNodeBuffers() relies on the behavior > > that cached nblocks will not be invalidated by file extension during > > recovery. Otherwise, or if not in recovery, proceed to sequential > > search of the whole buffer pool. > > This sentence seems involving confusion. It reads as if "we can rely on it > because we're relying on it". And "the cached value won't be invalidated" > doesn't explain the reason precisely. The reason I think is that the cached > value is guaranteed to be the maximum page we have in shared buffer at least > while recovery, and that guarantee is holded by not asking fseek once we > cached the value. Fixed the commit log of 0003. > > > > I also don't get why 4/4 would be a good idea on its own. It uses > > > > BUF_DROP_FULL_SCAN_THRESHOLD to guard > > > > FindAndDropRelFileNodeBuffers() on a per relation basis. 
But since > > > > DropRelFileNodesAllBuffers() can be used for many relations at > > > > once, this could end up doing BUF_DROP_FULL_SCAN_THRESHOLD > - 1 > > > lookups a lot > > > > of times, once for each of nnodes relations? > > > > > > So, the threshold value should be compared with the total number of > > > blocks of all target relations, not each relation. You seem to be right, got > it. > > > > Fixed this in 0004 patch. Now we compare the total number of > > buffers-to-be-invalidated For ALL relations to the > BUF_DROP_FULL_SCAN_THRESHOLD. > > I didn't see the previous version, but the row of additional palloc/pfree's in > this version looks uneasy. Fixed this too. > int i, > + j, > + *nforks, > n = 0; > > Perhaps I think we don't define variable in different types at once. > (I'm not sure about defining multple variables at once.) Fixed this too. > @@ -3110,7 +3125,10 @@ DropRelFileNodesAllBuffers(RelFileNodeBackend > *rnodes, int nnodes) > > DropRelFileNodeAllLocalBuffers(rnodes[i].node); > } > else > + { > + rels[n] = smgr_reln[i]; > nodes[n++] = rnodes[i].node; > + } > } > > We don't need to remember nodes and rnodes here since rnodes[n] is > rels[n]->smgr_rnode here. Or we don't even need to store rels since we can > scan the smgr_reln later again. > > nodes is needed in the full-scan path but it is enough to collect it after finding > that we do full-scan. I followed your advice and removed the rnodes[] and rels[]. nodes[] is allocated later at full scan path. > + nforks = palloc(sizeof(int) * n); > + forks = palloc(sizeof(ForkNumber *) * n); > + blocks = palloc(sizeof(BlockNumber *) * n); > + firstDelBlocks = palloc(sizeof(BlockNumber) * n * (MAX_FORKNUM > + 1)); > + for (i = 0; i < n; i++) > + { > + forks[i] = palloc(sizeof(ForkNumber) * (MAX_FORKNUM + > 1)); > + blocks[i] = palloc(sizeof(BlockNumber) * (MAX_FORKNUM > + 1)); > + } > > We can allocate the whole array at once like this. > > BlockNumber (*blocks)[MAX_FORKNUM+1] = > (BlockNumber (*)[MAX_FORKNUM+1]) > palloc(sizeof(BlockNumber) * n * (MAX_FORKNUM + 1)) Thank you for suggesting to reduce the lines for the 2d dynamic memory alloc. I followed this way in 0004, but it's my first time to see it written this way. I am very glad it works, though is it okay to write it this way since I cannot find a similar code of declaring and allocating 2D arrays like this in Postgres source code? > + nBlocksToInvalidate += blocks[i][numForks]; > + > + forks[i][numForks++] = j; > > We can signal to the later code the absense of a fork by setting > InvalidBlockNumber to blocks. Thus forks[], nforks and numForks can be > removed. Followed it in 0004. > + /* Zero the array of blocks because these will all be dropped anyway > */ > + MemSet(firstDelBlocks, 0, sizeof(BlockNumber) * n * > (MAX_FORKNUM + > +1)); > > We don't need to prepare nforks, forks and firstDelBlocks for all relations > before looping over relations. In other words, we can fill in the arrays for a > relation at every iteration of relations. Followed your advice. Although I now drop the buffers per fork, which now removes forks[][], nforks, firstDelBlocks[]. > + * We enter the optimization iff we are in recovery and the number of > +blocks to > > This comment ticks out of 80 columns. (I'm not sure whether that convention > is still valid..) Fixed. > + if (InRecovery && nBlocksToInvalidate < > BUF_DROP_FULL_SCAN_THRESHOLD) > > We don't need to check InRecovery here. DropRelFileNodeBuffers doesn't do > that. 
As for the DropRelFileNodesAllBuffers use case, I used InRecovery so that
the optimization still works.

Horiguchi-san also wrote in another mail:
> A bit different from the point, but if some tuples have been inserted to the
> truncated table, XLogReadBufferExtended() is called for the table and the
> length is cached.

I was wrong in my previous claim that the "cached" value always returns
false. When I checked the recovery log from the recovery TAP tests, there
was only one example where "cached" became true (script below) and entered
the optimization path. However, in all other cases, including the TRUNCATE
test case in my patch, the "cached" flag returns "false".

"cached" flag became true:
# in different subtransaction patterns
$node->safe_psql(
	'postgres', "
	BEGIN;
	CREATE TABLE spc_commit (id serial PRIMARY KEY, id2 int);
	INSERT INTO spc_commit VALUES (DEFAULT, generate_series(1,3000));
	TRUNCATE spc_commit;
	SAVEPOINT s; ALTER TABLE spc_commit SET TABLESPACE other; RELEASE s;
	COPY spc_commit FROM '$copy_file' DELIMITER ',';
	COMMIT;");
$node->stop('immediate');
$node->start;

So I used InRecovery for the optimization case of
DropRelFileNodesAllBuffers. I retained smgrnblocks' "cached" parameter as
it is useful in DropRelFileNodeBuffers. A sketch of the resulting shape
follows this message.

> > > I agree that we can do a better job by expanding comments to clearly
> > > state why it is safe.
> >
> > Yes, basically what Amit-san also mentioned above. The first patch
> > prevents that. And in the description of DropRelFileNodeBuffers in the
> > 0003 patch, please check if that would suffice.
>
> + * While in recovery, if the expected maximum number of buffers to be
> + * dropped is small enough and the sizes of all involved forks are
> + * already cached, individual buffer is located by BufTableLookup().
> + * It is safe because cached blocks will not be invalidated by file
> + * extension during recovery. See smgrnblocks() and smgrextend() for
> + * more details. Otherwise, if the conditions for optimization are not
> + * met, the buffer pool is sequentially scanned so that no buffers are
> + * left behind.
>
> I'm not confident about it, but it seems somewhat obscure. How about
> something like this?
>
>	We mustn't leave any buffer behind for the relations to be dropped.
>	We invalidate buffer blocks by locating them with BufTableLookup()
>	when we are sure we know up to what page of every fork we possibly
>	have a buffer for. We can know that by the "cached" flag returned
>	by smgrnblocks. It currently is true only while in recovery. See
>	smgrnblocks() and smgrextend(). Otherwise we scan the whole buffer
>	pool to find buffers for the relation, which is slower when only a
>	small part of the buffers are to be dropped.

Followed your advice and modified it a bit.

I have changed the status to "Needs Review". Feedbacks are always welcome.

Regards,
Kirk Jamison
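For readers skimming the thread, the 0004 shape described above looks roughly like this; an illustrative sketch only, assuming the patch-local block[][] array and the per-fork FindAndDropRelFileNodeBuffers() signature from the hunks quoted later in the thread, not the actual patch text:

	/* Pass 1: collect fork sizes and the total number of blocks to drop. */
	nBlocksToInvalidate = 0;
	for (i = 0; i < n; i++)
	{
		for (j = 0; j <= MAX_FORKNUM; j++)
		{
			block[i][j] = InvalidBlockNumber;	/* marks an absent fork */

			if (!smgrexists(smgr_reln[i], j))
				continue;

			block[i][j] = smgrnblocks(smgr_reln[i], j, NULL);
			nBlocksToInvalidate += block[i][j];
		}
	}

	/* Pass 2: in recovery, drop buffers individually if few enough. */
	if (InRecovery && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
	{
		for (i = 0; i < n; i++)
			for (j = 0; j <= MAX_FORKNUM; j++)
				if (BlockNumberIsValid(block[i][j]))
					FindAndDropRelFileNodeBuffers(smgr_reln[i]->smgr_rnode.node,
												  j, block[i][j], 0);
		return;
	}

	/* Otherwise fall through to the sequential scan of the buffer pool. */

The firstDelBlock argument is 0 because the relations are being dropped outright, so every buffered page of every fork is to be invalidated.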
From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com>
> Apologies for the delay, but attached are the updated versions to simplify the
> patches.

Looks good to me. Thanks to Horiguchi-san and Andres-san, the code became
further compact and easier to read. I've marked this ready for committer.

To the committer:
I don't think it's necessary to refer to COMMIT/ROLLBACK PREPARED in the
following part of the 0003 commit message. They surely call
DropRelFileNodesAllBuffers(), but COMMIT/ROLLBACK also call it.

	the full scan threshold. This improves the DropRelationFiles()
	performance when the TRUNCATE command truncated off any of the empty
	pages at the end of relation, and when dropping relation buffers if a
	commit/rollback transaction has been prepared in FinishPreparedTransaction().

Regards
Takayuki Tsunakawa
Hello, Kirk

Thanks for providing the new patches.
I did the recovery performance test on them, and the results look good. I'd
like to share them with you and everyone else.
(I also recorded VACUUM and TRUNCATE execution time on master/primary in
case you want to have a look.)

1. VACUUM and Failover test results (average of 15 times)

[VACUUM] ---execution time on master/primary
shared_buffers  master(sec)  patched(sec)  %reg=((patched-master)/master)
--------------------------------------------------------------------------
128M                  9.440         9.483      0%
10G                  74.689        76.219      2%
20G                 152.538       138.292     -9%

[Failover] ---execution time on standby
shared_buffers  master(sec)  patched(sec)  %reg=((patched-master)/master)
--------------------------------------------------------------------------
128M                  3.629         2.961    -18%
10G                  82.443         2.627    -97%
20G                 171.388         2.607    -98%

2. TRUNCATE and Failover test results (average of 15 times)

[TRUNCATE] ---execution time on master/primary
shared_buffers  master(sec)  patched(sec)  %reg=((patched-master)/master)
--------------------------------------------------------------------------
128M                 49.271        49.867      1%
10G                 172.437       175.197      2%
20G                 279.658       278.752      0%

[Failover] ---execution time on standby
shared_buffers  master(sec)  patched(sec)  %reg=((patched-master)/master)
--------------------------------------------------------------------------
128M                  4.877         3.989    -18%
10G                  92.680         3.975    -96%
20G                 182.035         3.962    -98%

[Machine spec]
CPU: 40 processors (Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz)
Memory: 64G
OS: CentOS 8

[Failover test data]
Total table size: 700M
Tables: 10000 tables (1000 rows per table)

If you have questions on my test, please let me know.

Regards,
Tang
Thanks for the new version. This contains only replies. I'll send some further comments in another mail later. At Thu, 3 Dec 2020 03:49:27 +0000, "k.jamison@fujitsu.com" <k.jamison@fujitsu.com> wrote in > On Thursday, November 26, 2020 4:19 PM, Horiguchi-san wrote: > > Hello, Kirk. Thank you for the new version. > > Apologies for the delay, but attached are the updated versions to simplify the patches. > The changes reflected most of your comments/suggestions. > > Summary of changes in the latest versions. > 1. Updated the function description of DropRelFileNodeBuffers in 0003. > 2. Updated the commit logs of 0003 and 0004. > 3, FindAndDropRelFileNodeBuffers is now called for each relation fork, > instead of for all involved forks. > 4. Removed the unnecessary palloc() and subscripts like forks[][], > firstDelBlock[], nforks, as advised by Horiguchi-san. The memory > allocation for block[][] was also simplified. > So 0004 became simpler and more readable. ... > > > a reliable size of nblocks for supplied relation's fork at that time, > > > and it's safe because DropRelFileNodeBuffers() relies on the behavior > > > that cached nblocks will not be invalidated by file extension during > > > recovery. Otherwise, or if not in recovery, proceed to sequential > > > search of the whole buffer pool. > > > > This sentence seems involving confusion. It reads as if "we can rely on it > > because we're relying on it". And "the cached value won't be invalidated" > > doesn't explain the reason precisely. The reason I think is that the cached > > value is guaranteed to be the maximum page we have in shared buffer at least > > while recovery, and that guarantee is holded by not asking fseek once we > > cached the value. > > Fixed the commit log of 0003. Thanks! ... > > + nforks = palloc(sizeof(int) * n); > > + forks = palloc(sizeof(ForkNumber *) * n); > > + blocks = palloc(sizeof(BlockNumber *) * n); > > + firstDelBlocks = palloc(sizeof(BlockNumber) * n * (MAX_FORKNUM > > + 1)); > > + for (i = 0; i < n; i++) > > + { > > + forks[i] = palloc(sizeof(ForkNumber) * (MAX_FORKNUM + > > 1)); > > + blocks[i] = palloc(sizeof(BlockNumber) * (MAX_FORKNUM > > + 1)); > > + } > > > > We can allocate the whole array at once like this. > > > > BlockNumber (*blocks)[MAX_FORKNUM+1] = > > (BlockNumber (*)[MAX_FORKNUM+1]) > > palloc(sizeof(BlockNumber) * n * (MAX_FORKNUM + 1)) > > Thank you for suggesting to reduce the lines for the 2d dynamic memory alloc. > I followed this way in 0004, but it's my first time to see it written this way. > I am very glad it works, though is it okay to write it this way since I cannot find > a similar code of declaring and allocating 2D arrays like this in Postgres source code? Actually it would be somewhat novel for a certain portion of people, but it is fundamentally the same with function pointers. Hard to make it from scratch, but I suppose not so hard to read:) int (*func_char_to_int)(char x) = some_func; FWIW isn.c has the following part: > static bool > check_table(const char *(*TABLE)[2], const unsigned TABLE_index[10][2]) > > + nBlocksToInvalidate += blocks[i][numForks]; > > + > > + forks[i][numForks++] = j; > > > > We can signal to the later code the absense of a fork by setting > > InvalidBlockNumber to blocks. Thus forks[], nforks and numForks can be > > removed. > > Followed it in 0004. Looks fine to me, thanks. 
> > + /* Zero the array of blocks because these will all be dropped anyway > > */ > > + MemSet(firstDelBlocks, 0, sizeof(BlockNumber) * n * > > (MAX_FORKNUM + > > +1)); > > > > We don't need to prepare nforks, forks and firstDelBlocks for all relations > > before looping over relations. In other words, we can fill in the arrays for a > > relation at every iteration of relations. > > Followed your advice. Although I now drop the buffers per fork, which now > removes forks[][], nforks, firstDelBlocks[]. That's fine for me. > > + * We enter the optimization iff we are in recovery and the number of > > +blocks to > > > > This comment ticks out of 80 columns. (I'm not sure whether that convention > > is still valid..) > > Fixed. > > > + if (InRecovery && nBlocksToInvalidate < > > BUF_DROP_FULL_SCAN_THRESHOLD) > > > > We don't need to check InRecovery here. DropRelFileNodeBuffers doesn't do > > that. > > > As for DropRelFileNodesAllBuffers use case, I used InRecovery > so that the optimization still works. > Horiguchi-san also wrote in another mail: > > A bit different from the point, but if some tuples have been inserted to the > > truncated table, XLogReadBufferExtended() is called for the table and the > > length is cached. > I was wrong in my previous claim that the "cached" value always return false. > When I checked the recovery test log from recovery tap test, there was only > one example when "cached" became true (script below) and entered the > optimization path. However, in all other cases including the TRUNCATE test case > in my patch, the "cached" flag returns "false". Yeah, I agree that smgrnblocks returns false in the targetted cases, so we should want some amendment. We need to disucssion on this point. > "cached" flag became true: > # in different subtransaction patterns > $node->safe_psql( > 'postgres', " > BEGIN; > CREATE TABLE spc_commit (id serial PRIMARY KEY, id2 int); > INSERT INTO spc_commit VALUES (DEFAULT, generate_series(1,3000)); > TRUNCATE spc_commit; > SAVEPOINT s; ALTER TABLE spc_commit SET TABLESPACE other; RELEASE s; > COPY spc_commit FROM '$copy_file' DELIMITER ','; > COMMIT;"); > $node->stop('immediate'); > $node->start; > > So I used the InRecovery for the optimization case of DropRelFileNodesAllBuffers. > I retained the smgrnblocks' "cached" parameter as it is useful in > DropRelFileNodeBuffers. I think that's ok as this version of the patch. > > > > I agree that we can do a better job by expanding comments to clearly > > > > state why it is safe. > > > > > > Yes, basically what Amit-san also mentioned above. The first patch > > prevents that. > > > And in the description of DropRelFileNodeBuffers in the 0003 patch, > > > please check If that would suffice. > > > > + * While in recovery, if the expected maximum number of > > buffers to be > > + * dropped is small enough and the sizes of all involved forks > > are > > + * already cached, individual buffer is located by > > BufTableLookup(). > > + * It is safe because cached blocks will not be invalidated by file > > + * extension during recovery. See smgrnblocks() and > > smgrextend() for > > + * more details. Otherwise, if the conditions for optimization are > > not > > + * met, the buffer pool is sequentially scanned so that no > > buffers are > > + * left behind. > > > > I'm not confident on it, but it seems somewhat obscure. How about > > something like this? > > > > We mustn't leave a buffer for the relations to be dropped. 
We invalidate > > buffer blocks by locating using BufTableLookup() when we assure that we > > know up to what page of every fork we possiblly have a buffer for. We can > > know that by the "cached" flag returned by smgrblocks. It currently gets true > > only while recovery. See > > smgrnblocks() and smgrextend(). Otherwise we scan the whole buffer pool to > > find buffers for the relation, which is slower when a small part of buffers are > > to be dropped. > > Followed your advice and modified it a bit. > > I have changed the status to "Needs Review". > Feedbacks are always welcome. > > Regards, > Kirk Jamison regards. -- Kyotaro Horiguchi NTT Open Source Software Center
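Since the pointer-to-array declaration discussed a couple of messages up is unusual enough to trip readers up, here is a small standalone program demonstrating the idiom; MAX_FORKNUM, BlockNumber and InvalidBlockNumber are stand-ins for the PostgreSQL definitions, and malloc stands in for palloc:

	#include <stdio.h>
	#include <stdlib.h>

	#define MAX_FORKNUM 3
	typedef unsigned int BlockNumber;
	#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

	int
	main(void)
	{
		int			n = 4;		/* number of relations */

		/* A single allocation usable as a blocks[n][MAX_FORKNUM + 1] matrix. */
		BlockNumber (*blocks)[MAX_FORKNUM + 1] =
			(BlockNumber (*)[MAX_FORKNUM + 1])
			malloc(sizeof(BlockNumber) * n * (MAX_FORKNUM + 1));

		for (int i = 0; i < n; i++)
			for (int j = 0; j <= MAX_FORKNUM; j++)
				blocks[i][j] = InvalidBlockNumber;	/* "fork absent" marker */

		/* Indexing works exactly like a true two-dimensional array. */
		blocks[2][1] = 128;
		printf("blocks[2][1] = %u\n", blocks[2][1]);

		free(blocks);
		return 0;
	}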
At Thu, 3 Dec 2020 07:18:16 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in
> From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com>
> > Apologies for the delay, but attached are the updated versions to simplify the
> > patches.
>
> Looks good to me. Thanks to Horiguchi-san and Andres-san, the code became
> further compact and easier to read. I've marked this ready for committer.
>
> To the committer:
> I don't think it's necessary to refer to COMMIT/ROLLBACK PREPARED in the
> following part of the 0003 commit message. They surely call
> DropRelFileNodesAllBuffers(), but COMMIT/ROLLBACK also call it.
>
>	the full scan threshold. This improves the DropRelationFiles()
>	performance when the TRUNCATE command truncated off any of the empty
>	pages at the end of relation, and when dropping relation buffers if a
>	commit/rollback transaction has been prepared in FinishPreparedTransaction().

I think whether we can use this optimization only by looking at
InRecovery is still in doubt. Or, if we can decide that on that
criterion, 0003 also can be simplified using the same assumption.

Separate from the maybe-remaining discussion, I have a comment on the
revised code in 0004.

+	 * equal to the full scan threshold.
+	 */
+	if (nBlocksToInvalidate >= BUF_DROP_FULL_SCAN_THRESHOLD)
+	{
+		pfree(block);
+		goto buffer_full_scan;
+	}

I don't particularly hate the goto statement, but we can easily avoid
it by reversing the condition here. You might be concerned about the
length of the line calling "FindAndDropRelFileNodeBuffers", but the
indentation can be lowered by inverting the condition on
BlockNumberIsValid.

!|	if (nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
 |	{
 |		for (i = 0; i < n; i++)
 |		{
 |			/*
 |			 * If block to drop is valid, drop the buffers of the fork.
 |			 * Zero the firstDelBlock because all buffers will be
 |			 * dropped anyway.
 |			 */
 |			for (j = 0; j <= MAX_FORKNUM; j++)
 |			{
!|				if (!BlockNumberIsValid(block[i][j]))
!|					continue;
 |
 |				FindAndDropRelFileNodeBuffers(smgr_reln[i]->smgr_rnode.node,
 |											  j, block[i][j], 0);
 |			}
 |		}
 |		pfree(block);
 |		return;
 |	}
 |
 |	pfree(block);

Or we can separate the calculation part and the execution part by
introducing a flag "do_fullscan".

 |	/*
 |	 * We enter the optimization iff we are in recovery. Otherwise,
 |	 * we proceed to full scan of the whole buffer pool.
 |	 */
 |	if (InRecovery)
 |	{
...
!|		if (nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
!|			do_fullscan = false;
!|	}
!|
!|	if (!do_fullscan)
!|	{
 |		for (i = 0; i < n; i++)
 |		{
 |			/*
 |			 * If block to drop is valid, drop the buffers of the fork.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Friday, December 4, 2020 12:42 PM, Tang, Haiying wrote: > Hello, Kirk > > Thanks for providing the new patches. > I did the recovery performance test on them, the results look good. I'd like to > share them with you and everyone else. > (I also record VACUUM and TRUNCATE execution time on master/primary in > case you want to have a look.) Hi, Tang. Thank you very much for verifying the performance using the latest set of patches. Although it's not supposed to affect the non-recovery path (execution on primary), It's good to see those results too. > 1. VACUUM and Failover test results(average of 15 times) [VACUUM] > ---execution time on master/primary > shared_buffers master(sec) > patched(sec) %reg=((patched-master)/master) > ------------------------------------------------------------------------------------- > - > 128M 9.440 9.483 0% > 10G 74.689 76.219 2% > 20G 152.538 138.292 -9% > > [Failover] ---execution time on standby > shared_buffers master(sec) > patched(sec) %reg=((patched-master)/master) > ------------------------------------------------------------------------------------- > - > 128M 3.629 2.961 -18% > 10G 82.443 2.627 -97% > 20G 171.388 2.607 -98% > > 2. TRUNCATE and Failover test results(average of 15 times) [TRUNCATE] > ---execution time on master/primary > shared_buffers master(sec) > patched(sec) %reg=((patched-master)/master) > ------------------------------------------------------------------------------------- > - > 128M 49.271 49.867 1% > 10G 172.437 175.197 2% > 20G 279.658 278.752 0% > > [Failover] ---execution time on standby > shared_buffers master(sec) > patched(sec) %reg=((patched-master)/master) > ------------------------------------------------------------------------------------- > - > 128M 4.877 3.989 -18% > 10G 92.680 3.975 -96% > 20G 182.035 3.962 -98% > > [Machine spec] > CPU : 40 processors (Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz) > Memory: 64G > OS: CentOS 8 > > [Failover test data] > Total table Size: 700M > Table: 10000 tables (1000 rows per table) > > If you have question on my test, please let me know. Looks great. That was helpful to see if there were any performance differences than the previous versions' results. But I am glad it turned out great too. Regards, Kirk Jamison
On Fri, Nov 27, 2020 at 11:36 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Fri, 27 Nov 2020 02:19:57 +0000, "k.jamison@fujitsu.com" <k.jamison@fujitsu.com> wrote in > > > From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> > > > Hello, Kirk. Thank you for the new version. > > > > Hi, Horiguchi-san. Thank you for your very helpful feedback. > > I'm updating the patches addressing those. > > > > > + if (!smgrexists(rels[i], j)) > > > + continue; > > > + > > > + /* Get the number of blocks for a relation's fork */ > > > + blocks[i][numForks] = smgrnblocks(rels[i], j, > > > NULL); > > > > > > If we see a fork which its size is not cached we must give up this optimization > > > for all target relations. > > > > I did not use the "cached" flag in DropRelFileNodesAllBuffers and use InRecovery > > when deciding for optimization because of the following reasons: > > XLogReadBufferExtended() calls smgrnblocks() to apply changes to relation page > > contents. So in DropRelFileNodeBuffers(), XLogReadBufferExtended() is called > > during VACUUM replay because VACUUM changes the page content. > > OTOH, TRUNCATE doesn't change the relation content, it just truncates relation pages > > without changing the page contents. So XLogReadBufferExtended() is not called, and > > the "cached" flag will always return false. I tested with "cached" flags before, and this > > A bit different from the point, but if some tuples have been inserted > to the truncated table, XLogReadBufferExtended() is called for the > table and the length is cached. > > > always return false, at least in DropRelFileNodesAllBuffers. Due to this, we cannot use > > the cached flag in DropRelFileNodesAllBuffers(). However, I think we can still rely on > > smgrnblocks to get the file size as long as we're InRecovery. That cached nblocks is still > > guaranteed to be the maximum in the shared buffer. > > Thoughts? > > That means that we always think as if smgrnblocks returns "cached" (or > "safe") value during recovery, which is out of our current > consensus. If we go on that side, we don't need to consult the > "cached" returned from smgrnblocks at all and it's enough to see only > InRecovery. > > I got confused.. > > We are relying on the "fact" that the first lseek() call of a > (startup) process tells the truth. We added an assertion so that we > make sure that the cached value won't be cleared during recovery. A > possible remaining danger would be closing of an smgr object of a live > relation just after a file extension failure. I think we are thinking > that that doesn't happen during recovery. Although it seems to me > true, I'm not confident. > Yeah, I also think it might not be worth depending upon whether smgr close has been done before or not. I feel the current idea of using 'cached' parameter is relatively solid and we should rely on that. Also, which means that in DropRelFileNodesAllBuffers() we should rely on the same, I think doing things differently in this regard will lead to confusion. I agree in some cases we might not get benefits but it is more important to be correct and keep the code consistent to avoid introducing bugs now or in the future. -- With Regards, Amit Kapila.
On Friday, December 4, 2020 8:27 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Nov 27, 2020 at 11:36 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > > At Fri, 27 Nov 2020 02:19:57 +0000, "k.jamison@fujitsu.com" > > <k.jamison@fujitsu.com> wrote in > > > > From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Hello, Kirk. > > > > Thank you for the new version. > > > > > > Hi, Horiguchi-san. Thank you for your very helpful feedback. > > > I'm updating the patches addressing those. > > > > > > > + if (!smgrexists(rels[i], j)) > > > > + continue; > > > > + > > > > + /* Get the number of blocks for a relation's fork */ > > > > + blocks[i][numForks] = smgrnblocks(rels[i], j, > > > > NULL); > > > > > > > > If we see a fork which its size is not cached we must give up this > > > > optimization for all target relations. > > > > > > I did not use the "cached" flag in DropRelFileNodesAllBuffers and > > > use InRecovery when deciding for optimization because of the following > reasons: > > > XLogReadBufferExtended() calls smgrnblocks() to apply changes to > > > relation page contents. So in DropRelFileNodeBuffers(), > > > XLogReadBufferExtended() is called during VACUUM replay because > VACUUM changes the page content. > > > OTOH, TRUNCATE doesn't change the relation content, it just > > > truncates relation pages without changing the page contents. So > > > XLogReadBufferExtended() is not called, and the "cached" flag will > > > always return false. I tested with "cached" flags before, and this > > > > A bit different from the point, but if some tuples have been inserted > > to the truncated table, XLogReadBufferExtended() is called for the > > table and the length is cached. > > > > > always return false, at least in DropRelFileNodesAllBuffers. Due to > > > this, we cannot use the cached flag in DropRelFileNodesAllBuffers(). > > > However, I think we can still rely on smgrnblocks to get the file > > > size as long as we're InRecovery. That cached nblocks is still guaranteed > to be the maximum in the shared buffer. > > > Thoughts? > > > > That means that we always think as if smgrnblocks returns "cached" (or > > "safe") value during recovery, which is out of our current consensus. > > If we go on that side, we don't need to consult the "cached" returned > > from smgrnblocks at all and it's enough to see only InRecovery. > > > > I got confused.. > > > > We are relying on the "fact" that the first lseek() call of a > > (startup) process tells the truth. We added an assertion so that we > > make sure that the cached value won't be cleared during recovery. A > > possible remaining danger would be closing of an smgr object of a live > > relation just after a file extension failure. I think we are thinking > > that that doesn't happen during recovery. Although it seems to me > > true, I'm not confident. > > > > Yeah, I also think it might not be worth depending upon whether smgr close > has been done before or not. I feel the current idea of using 'cached' > parameter is relatively solid and we should rely on that. > Also, which means that in DropRelFileNodesAllBuffers() we should rely on > the same, I think doing things differently in this regard will lead to confusion. I > agree in some cases we might not get benefits but it is more important to be > correct and keep the code consistent to avoid introducing bugs now or in the > future. > Hi, I have reported before that it is not always the case that the "cached" flag of srnblocks() return true. 
So when I checked the truncate test case used in my patch, it does not enter
the optimization path despite doing INSERT before truncation of the table.
The reason for that is that in TRUNCATE, a new RelFileNode is assigned to
the relation when creating a new file. In recovery, XLogReadBufferExtended()
always opens the RelFileNode and calls smgrnblocks() for that RelFileNode
for the first time. And for recovery processing, different RelFileNodes are
used for the INSERTs to the table and the TRUNCATE of the same table.

As we cannot use the "cached" flag for both DropRelFileNodeBuffers() and
DropRelFileNodesAllBuffers() based on the above, I am thinking that if we
want consistency, correctness, and to still make use of the optimization,
we can completely drop the "cached" flag parameter in smgrnblocks and use
InRecovery. Tsunakawa-san mentioned in [1] that it is safe because smgrclose
is not called by the startup process in recovery; shared-inval messages are
not sent to the startup process. Otherwise, we use the current patch form
as it is: "cached" in DropRelFileNodeBuffers() and InRecovery for
DropRelFileNodesAllBuffers(). However, that does not seem to be what is
wanted in this thread.
Thoughts?

Regards,
Kirk Jamison

[1] https://www.postgresql.org/message-id/TYAPR01MB2990B42570A5FAC349EE983AFEF40%40TYAPR01MB2990.jpnprd01.prod.outlook.com
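To make the cache lifecycle concrete, here is a rough sketch of the smgrnblocks() shape the patch set implements, paraphrased from the discussion; smgr_cached_nblocks is the per-fork field the patches add to SMgrRelationData (an assumption on my part, and details may differ from the actual 0002/0003 patches). A relation that receives a fresh RelFileNode at TRUNCATE starts with no cached size, which is why the flag comes back false in the case described above:

	BlockNumber
	smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool *cached)
	{
		BlockNumber result;

		if (cached)
			*cached = false;

		/*
		 * In recovery, a remembered size is the high-water mark of pages we
		 * can possibly have buffered, since extensions update the cache.
		 */
		if (InRecovery &&
			reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
		{
			if (cached)
				*cached = true;
			return reln->smgr_cached_nblocks[forknum];
		}

		/* First call for this fork: ask the storage manager (lseek). */
		result = smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum);
		reln->smgr_cached_nblocks[forknum] = result;

		return result;
	}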
On Mon, Dec 7, 2020 at 12:32 PM k.jamison@fujitsu.com <k.jamison@fujitsu.com> wrote: > > On Friday, December 4, 2020 8:27 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Nov 27, 2020 at 11:36 AM Kyotaro Horiguchi > > <horikyota.ntt@gmail.com> wrote: > > > > > > At Fri, 27 Nov 2020 02:19:57 +0000, "k.jamison@fujitsu.com" > > > <k.jamison@fujitsu.com> wrote in > > > > > From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Hello, Kirk. > > > > > Thank you for the new version. > > > > > > > > Hi, Horiguchi-san. Thank you for your very helpful feedback. > > > > I'm updating the patches addressing those. > > > > > > > > > + if (!smgrexists(rels[i], j)) > > > > > + continue; > > > > > + > > > > > + /* Get the number of blocks for a relation's fork */ > > > > > + blocks[i][numForks] = smgrnblocks(rels[i], j, > > > > > NULL); > > > > > > > > > > If we see a fork which its size is not cached we must give up this > > > > > optimization for all target relations. > > > > > > > > I did not use the "cached" flag in DropRelFileNodesAllBuffers and > > > > use InRecovery when deciding for optimization because of the following > > reasons: > > > > XLogReadBufferExtended() calls smgrnblocks() to apply changes to > > > > relation page contents. So in DropRelFileNodeBuffers(), > > > > XLogReadBufferExtended() is called during VACUUM replay because > > VACUUM changes the page content. > > > > OTOH, TRUNCATE doesn't change the relation content, it just > > > > truncates relation pages without changing the page contents. So > > > > XLogReadBufferExtended() is not called, and the "cached" flag will > > > > always return false. I tested with "cached" flags before, and this > > > > > > A bit different from the point, but if some tuples have been inserted > > > to the truncated table, XLogReadBufferExtended() is called for the > > > table and the length is cached. > > > > > > > always return false, at least in DropRelFileNodesAllBuffers. Due to > > > > this, we cannot use the cached flag in DropRelFileNodesAllBuffers(). > > > > However, I think we can still rely on smgrnblocks to get the file > > > > size as long as we're InRecovery. That cached nblocks is still guaranteed > > to be the maximum in the shared buffer. > > > > Thoughts? > > > > > > That means that we always think as if smgrnblocks returns "cached" (or > > > "safe") value during recovery, which is out of our current consensus. > > > If we go on that side, we don't need to consult the "cached" returned > > > from smgrnblocks at all and it's enough to see only InRecovery. > > > > > > I got confused.. > > > > > > We are relying on the "fact" that the first lseek() call of a > > > (startup) process tells the truth. We added an assertion so that we > > > make sure that the cached value won't be cleared during recovery. A > > > possible remaining danger would be closing of an smgr object of a live > > > relation just after a file extension failure. I think we are thinking > > > that that doesn't happen during recovery. Although it seems to me > > > true, I'm not confident. > > > > > > > Yeah, I also think it might not be worth depending upon whether smgr close > > has been done before or not. I feel the current idea of using 'cached' > > parameter is relatively solid and we should rely on that. > > Also, which means that in DropRelFileNodesAllBuffers() we should rely on > > the same, I think doing things differently in this regard will lead to confusion. 
I > > agree in some cases we might not get benefits but it is more important to be > > correct and keep the code consistent to avoid introducing bugs now or in the > > future. > > > Hi, > I have reported before that it is not always the case that the "cached" flag of > srnblocks() return true. So when I checked the truncate test case used in my > patch, it does not enter the optimization path despite doing INSERT before > truncation of table. > The reason for that is because in TRUNCATE, a new RelFileNode is assigned > to the relation when creating a new file. In recovery, XLogReadBufferExtended() > always opens the RelFileNode and calls smgrnblocks() for that RelFileNode for the > first time. And for recovery processing, different RelFileNodes are used for the > INSERTs to the table and TRUNCATE to the same table. > Hmm, how is it possible if Insert is done before Truncate? The insert should happen in old RelFileNode only. I have verified by adding a break-in (while (1), so that it stops there) heap_xlog_insert and DropRelFileNodesAllBuffers(), and both get the same (old) RelFileNode. How have you verified what you are saying? -- With Regards, Amit Kapila.
At Mon, 7 Dec 2020 17:18:31 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> On Mon, Dec 7, 2020 at 12:32 PM k.jamison@fujitsu.com
> <k.jamison@fujitsu.com> wrote:
> >
> > On Friday, December 4, 2020 8:27 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Hi,
> > I have reported before that it is not always the case that the "cached"
> > flag of smgrnblocks() returns true. So when I checked the truncate test
> > case used in my patch, it does not enter the optimization path despite
> > doing INSERT before truncation of the table.
> > The reason for that is that in TRUNCATE, a new RelFileNode is assigned
> > to the relation when creating a new file. In recovery,
> > XLogReadBufferExtended() always opens the RelFileNode and calls
> > smgrnblocks() for that RelFileNode for the first time. And for recovery
> > processing, different RelFileNodes are used for the INSERTs to the table
> > and the TRUNCATE of the same table.
> >
>
> Hmm, how is it possible if Insert is done before Truncate? The insert
> should happen in the old RelFileNode only. I have verified by adding a
> break-in (while (1), so that it stops there) heap_xlog_insert and
> DropRelFileNodesAllBuffers(), and both get the same (old) RelFileNode.
> How have you verified what you are saying?

You might be thinking of an in-transaction sequence of
Insert-Truncate. What *I* mentioned before is truncation of a relation
that smgrnblocks() has already been called for. The most common way to
make that happen is INSERTs *before* the truncating transaction starts.
It may also be a SELECT on a hot standby. Sorry for the confusing
expression.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
At Tue, 08 Dec 2020 09:45:53 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > At Mon, 7 Dec 2020 17:18:31 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > Hmm, how is it possible if Insert is done before Truncate? The insert > > should happen in the old RelFileNode only. I have verified by adding a > > break-in (a while (1) loop, so that it stops there) in heap_xlog_insert and > > DropRelFileNodesAllBuffers(), and both get the same (old) RelFileNode. > > How have you verified what you are saying? > > You might be thinking of an in-transaction sequence of > Insert-Truncate. What *I* mentioned before is truncation of a relation > that smgrnblocks() has already been called for. The most common way > to make that happen is INSERTs *before* the truncating transaction > starts. It may be a SELECT on a hot standby. Sorry for the confusing > expression. And, to make sure: it is a bit off from the point of the discussion, as I noted. I just meant that the proposition that "smgrnblocks() always returns false for "cached" when it is called in DropRelFileNodesAllBuffers()" doesn't always hold. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Dec 8, 2020 at 6:23 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Tue, 08 Dec 2020 09:45:53 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > > At Mon, 7 Dec 2020 17:18:31 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > > Hmm, how is it possible if Insert is done before Truncate? The insert > > > should happen in the old RelFileNode only. I have verified by adding a > > > break-in (a while (1) loop, so that it stops there) in heap_xlog_insert and > > > DropRelFileNodesAllBuffers(), and both get the same (old) RelFileNode. > > > How have you verified what you are saying? > > > > You might be thinking of an in-transaction sequence of > > Insert-Truncate. What *I* mentioned before is truncation of a relation > > that smgrnblocks() has already been called for. The most common way > > to make that happen is INSERTs *before* the truncating transaction > > starts. What I have tried is Insert and Truncate in separate transactions like below:

postgres=# insert into mytbl values(1);
INSERT 0 1
postgres=# truncate mytbl;
TRUNCATE TABLE

After the above, I manually killed the server, and then during recovery we have called heap_xlog_insert() and DropRelFileNodesAllBuffers(), and at both places the RelFileNode is the same and I don't see any reason for it to be different. > > It may be a SELECT on a hot standby. Sorry for the confusing > > expression. > > And, to make sure: it is a bit off from the point of the discussion, as > I noted. I just meant that the proposition that "smgrnblocks() always > returns false for "cached" when it is called in > DropRelFileNodesAllBuffers()" doesn't always hold. > Right, I feel in some cases 'cached' won't be true, e.g., if we had done a Checkpoint after the Insert in the above case (say when the only WAL to replay during recovery is of the Truncate), but I think that should be fine. What do you think? -- With Regards, Amit Kapila.
I'm out of it more than usual.. At Tue, 08 Dec 2020 09:45:53 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > At Mon, 7 Dec 2020 17:18:31 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > On Mon, Dec 7, 2020 at 12:32 PM k.jamison@fujitsu.com > > <k.jamison@fujitsu.com> wrote: > > > > > > On Friday, December 4, 2020 8:27 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Hi, > > > I have reported before that it is not always the case that the "cached" flag of > > > smgrnblocks() returns true. So when I checked the truncate test case used in my > > > patch, it does not enter the optimization path despite doing INSERT before > > > truncation of the table. > > > The reason for that is that in TRUNCATE, a new RelFileNode is assigned > > > to the relation when creating a new file. In recovery, XLogReadBufferExtended() > > > always opens the RelFileNode and calls smgrnblocks() for that RelFileNode for the > > > first time. And for recovery processing, different RelFileNodes are used for the > > > INSERTs to the table and the TRUNCATE of the same table. > > > > > > > Hmm, how is it possible if Insert is done before Truncate? The insert > > should happen in the old RelFileNode only. I have verified by adding a > > break-in (a while (1) loop, so that it stops there) in heap_xlog_insert and > > DropRelFileNodesAllBuffers(), and both get the same (old) RelFileNode. > > How have you verified what you are saying? It's irrelevant that the insert happens on the old relfilenode. We drop buffers for the old relfilenode on truncation anyway. What I did is:

a: Create a physical replication pair.
b: On the master, create a table. (without explicitly starting a tx)
c: On the master, insert a tuple into the table.
d: On the master, truncate the table.

On the standby, smgrnblocks is called for the old relfilenode of the table at c, then the same function is called for the same relfilenode at d and the function takes the cached path. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Dec 8, 2020 at 7:24 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > I'm out of it more than usual.. > > At Tue, 08 Dec 2020 09:45:53 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > > At Mon, 7 Dec 2020 17:18:31 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > > On Mon, Dec 7, 2020 at 12:32 PM k.jamison@fujitsu.com > > > <k.jamison@fujitsu.com> wrote: > > > > > > > > On Friday, December 4, 2020 8:27 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > Hi, > > > > I have reported before that it is not always the case that the "cached" flag of > > > > smgrnblocks() returns true. So when I checked the truncate test case used in my > > > > patch, it does not enter the optimization path despite doing INSERT before > > > > truncation of the table. > > > > The reason for that is that in TRUNCATE, a new RelFileNode is assigned > > > > to the relation when creating a new file. In recovery, XLogReadBufferExtended() > > > > always opens the RelFileNode and calls smgrnblocks() for that RelFileNode for the > > > > first time. And for recovery processing, different RelFileNodes are used for the > > > > INSERTs to the table and the TRUNCATE of the same table. > > > > > > > > > > Hmm, how is it possible if Insert is done before Truncate? The insert > > > should happen in the old RelFileNode only. I have verified by adding a > > > break-in (a while (1) loop, so that it stops there) in heap_xlog_insert and > > > DropRelFileNodesAllBuffers(), and both get the same (old) RelFileNode. > > > How have you verified what you are saying? > > It's irrelevant that the insert happens on the old relfilenode. > I think it is relevant because it will allow the 'blocks' value to be cached. > We drop > buffers for the old relfilenode on truncation anyway. > > What I did is: > > a: Create a physical replication pair. > b: On the master, create a table. (without explicitly starting a tx) > c: On the master, insert a tuple into the table. > d: On the master, truncate the table. > > On the standby, smgrnblocks is called for the old relfilenode of the > table at c, then the same function is called for the same relfilenode > at d and the function takes the cached path. > This is along the lines of what I have tried for recovery. So, it seems we are in agreement that we can use the 'cached' flag in DropRelFileNodesAllBuffers and it will take the optimized path in many such cases, right? -- With Regards, Amit Kapila.
At Tue, 8 Dec 2020 08:08:25 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > On Tue, Dec 8, 2020 at 7:24 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > We drop > > buffers for the old relfilenode on truncation anyway. > > > > What I did is: > > > > a: Create a physical replication pair. > > b: On the master, create a table. (without explicitly starting a tx) > > c: On the master, insert a tuple into the table. > > d: On the master, truncate the table. > > > > On the standby, smgrnblocks is called for the old relfilenode of the > > table at c, then the same function is called for the same relfilenode > > at d and the function takes the cached path. > > > > This is along the lines of what I have tried for recovery. So, it seems we are in > agreement that we can use the 'cached' flag in > DropRelFileNodesAllBuffers and it will take the optimized path in many > such cases, right? Mmm. There seems to be a misunderstanding.. What I opposed is referring only to InRecovery and ignoring the value of "cached". The remaining issue is that we don't get to the optimized path when a standby makes the first call to smgrnblocks() when truncating a relation. Still, we can get to the optimized path as long as any update (+insert) or select is performed earlier on the relation, so I think it doesn't matter so much. But I'm not sure what others think. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Dec 8, 2020 at 10:41 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Tue, 8 Dec 2020 08:08:25 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > On Tue, Dec 8, 2020 at 7:24 AM Kyotaro Horiguchi > > <horikyota.ntt@gmail.com> wrote: > > > We drop > > > buffers for the old relfilenode on truncation anyway. > > > > > > What I did is: > > > > > > a: Create a physical replication pair. > > > b: On the master, create a table. (without explicitly starting a tx) > > > c: On the master, insert a tuple into the table. > > > d: On the master, truncate the table. > > > > > > On the standby, smgrnblocks is called for the old relfilenode of the > > > table at c, then the same function is called for the same relfilenode > > > at d and the function takes the cached path. > > > > > > > This is along the lines of what I have tried for recovery. So, it seems we are in > > agreement that we can use the 'cached' flag in > > DropRelFileNodesAllBuffers and it will take the optimized path in many > > such cases, right? > > > Mmm. There seems to be a misunderstanding.. What I opposed is > referring only to InRecovery and ignoring the value of "cached". > Okay, I think it was Kirk-San who proposed to use InRecovery and ignore the value of "cached", based on the theory that even if Inserts (or other DMLs) are done before Truncate, it won't use the optimized path, and I don't agree with the same. So, I did a small test to check the same and found that it should use the optimized path, and the same is true for the experiment done by you. I am not sure why Kirk-San is seeing something different. > The remaining issue is that we don't get to the optimized path when a > standby makes the first call to smgrnblocks() when truncating a > relation. Still, we can get to the optimized path as long as any > update (+insert) or select is performed earlier on the relation, so I > think it doesn't matter so much. > +1. With Regards, Amit Kapila.
On Tuesday, December 8, 2020 2:35 PM, Amit Kapila wrote: > On Tue, Dec 8, 2020 at 10:41 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > > At Tue, 8 Dec 2020 08:08:25 +0530, Amit Kapila > > <amit.kapila16@gmail.com> wrote in > > > On Tue, Dec 8, 2020 at 7:24 AM Kyotaro Horiguchi > > > <horikyota.ntt@gmail.com> wrote: > > > > We drop > > > > buffers for the old relfilenode on truncation anyway. > > > > > > > > What I did is: > > > > > > > > a: Create a physical replication pair. > > > > b: On the master, create a table. (without explicitly starting a > > > > tx) > > > > c: On the master, insert a tuple into the table. > > > > d: On the master, truncate the table. > > > > > > > > On the standby, smgrnblocks is called for the old relfilenode of > > > > the table at c, then the same function is called for the same > > > > relfilenode at d and the function takes the cached path. > > > > > > > > > > This is along the lines of what I have tried for recovery. So, it seems we are > > > in agreement that we can use the 'cached' flag in > > > DropRelFileNodesAllBuffers and it will take the optimized path in > > > many such cases, right? > > > > > > Mmm. There seems to be a misunderstanding.. What I opposed is > > referring only to InRecovery and ignoring the value of "cached". > > > > Okay, I think it was Kirk-San who proposed to use InRecovery and ignore > the value of "cached", based on the theory that even if Inserts (or other DMLs) > are done before Truncate, it won't use the optimized path, and I don't agree > with the same. So, I did a small test to check the same and found that it > should use the optimized path, and the same is true for the experiment done > by you. I am not sure why Kirk-San is seeing something different. > > > The remaining issue is that we don't get to the optimized path when a > > standby makes the first call to smgrnblocks() when truncating a > > relation. Still, we can get to the optimized path as long as any > > update (+insert) or select is performed earlier on the relation, so I > > think it doesn't matter so much. > > > > +1. My question/proposal before was to either use InRecovery, or completely drop smgrnblocks' "cached" flag. But that came from the results of my investigation below, when I used "cached" in DropRelFileNodesAllBuffers(). The optimization path was skipped because one of the rels' "cached" values was "false".

Test case (shared_buffers = 1GB):
0. Set up physical replication between master and standby.
1. Create 1 table.
2. Insert data (1MB) into the table. 16385 is the relnode for the insert (both Master and Standby).
3. Pause WAL replay on the Standby.
4. TRUNCATE the table on the Primary. nrels = 3. relNodes 16389, 16388, 16385.
5. Stop the Primary.
6. Promote the standby and resume WAL recovery. nrels = 3.
1st rel's check for optimization: "cached" is TRUE. relNode = 16389.
2nd rel's check for optimization: "cached" was returned FALSE by smgrnblocks(). relNode = 16388.
Since one of the rels' "cached" values is FALSE, the optimization check for the 3rd relation and the whole optimization itself are skipped. We go to the full-scan path in DropRelFileNodesAllBuffers(). Then smgrclose for relNodes 16389, 16388, 16385.

Because one of the rels' cached values was false, it forced the full-scan path for TRUNCATE. Is there a possible workaround for this? Regards, Kirk Jamison
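To make the observed behavior concrete, here is a minimal sketch of the all-or-nothing check described above. It assumes the patch's smgrnblocks() signature with a bool "cached" out-parameter, as quoted earlier in the thread; the loop bounds and the use_full_scan variable are illustrative only, not the patch's actual names.

/* Illustrative sketch, not the patch itself: give up the optimization
 * for the whole relation set as soon as any fork's size is not cached. */
bool        use_full_scan = false;
int         i, j;

for (i = 0; i < nrels && !use_full_scan; i++)
{
    for (j = 0; j <= MAX_FORKNUM; j++)
    {
        bool    cached;

        if (!smgrexists(rels[i], j))
            continue;

        (void) smgrnblocks(rels[i], j, &cached);
        if (!cached)
        {
            /* e.g. the TOAST relfilenode 16388 above, never read in recovery */
            use_full_scan = true;
            break;
        }
    }
}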
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com> > Because one of the rels' cached values was false, it forced the > full-scan path for TRUNCATE. > Is there a possible workaround for this? Hmm, the other two relfilenodes are for the TOAST table and index of the target table. I think the INSERT didn't access those TOAST relfilenodes because the inserted data was stored in the main storage. But TRUNCATE always truncates all three relfilenodes. So, the standby had not opened the relfilenode for the TOAST stuff or cached its size when replaying the TRUNCATE. I'm afraid this is too common to ignore and just accept the slow traditional path, but I can't think of a good idea to use the cached flag. Regards Takayuki Tsunakawa
On Tue, Dec 8, 2020 at 12:13 PM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com> > > Because one of the rels' cached values was false, it forced the > > full-scan path for TRUNCATE. > > Is there a possible workaround for this? > > Hmm, the other two relfilenodes are for the TOAST table and index of the target table. I think the INSERT didn't access those TOAST relfilenodes because the inserted data was stored in the main storage. But TRUNCATE always truncates all three relfilenodes. So, the standby had not opened the relfilenode for the TOAST stuff or cached its size when replaying the TRUNCATE. > > I'm afraid this is too common to ignore and just accept the slow traditional path, but I can't think of a good idea to use the cached flag. > I also can't think of a way to use an optimized path for such cases, but I don't agree with your comment that it is common enough that we should leave this optimization out entirely for the truncate path. -- With Regards, Amit Kapila.
At Tue, 8 Dec 2020 16:28:41 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > On Tue, Dec 8, 2020 at 12:13 PM tsunakawa.takay@fujitsu.com > <tsunakawa.takay@fujitsu.com> wrote: > > > > From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com> > > > Because one of the rels' cached values was false, it forced the > > > full-scan path for TRUNCATE. > > > Is there a possible workaround for this? > > > > Hmm, the other two relfilenodes are for the TOAST table and index of the target table. I think the INSERT didn't access those TOAST relfilenodes because the inserted data was stored in the main storage. But TRUNCATE always truncates all three relfilenodes. So, the standby had not opened the relfilenode for the TOAST stuff or cached its size when replaying the TRUNCATE. > > > > I'm afraid this is too common to ignore and just accept the slow traditional path, but I can't think of a good idea to use the cached flag. > > > > I also can't think of a way to use an optimized path for such cases, > but I don't agree with your comment that it is common enough that we > should leave this optimization out entirely for the truncate path. Mmm. At least btree doesn't need to call smgrnblocks except at expansion, so we cannot get to the optimized path in major cases of truncation involving btree (and/or maybe other indexes). TOAST relations are not accessed until we insert/update/retrieve the values in them. An ugly way to cope with it would be to let other smgr functions manage the cached value, for example, by calling smgrnblocks while InRecovery. Or letting smgr remember the maximum block number ever accessed. But we cannot fully rely on that since smgr can be closed in the midst of a session and smgr doesn't offer such persistence. In the first place, smgr doesn't seem to be the place to store such persistent information. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> > At Tue, 8 Dec 2020 16:28:41 +0530, Amit Kapila <amit.kapila16@gmail.com> > wrote in > I also can't think of a way to use an optimized path for such cases > > but I don't agree with your comment that it is common enough that we > > should leave this optimization out entirely for the truncate path. > > An ugly way to cope with it would be to let other smgr functions > manage the cached value, for example, by calling smgrnblocks while > InRecovery. Or letting smgr remember the maximum block number ever > accessed. But we cannot fully rely on that since smgr can be closed > in the midst of a session and smgr doesn't offer such persistence. In the > first place, smgr doesn't seem to be the place to store such persistent > information. Yeah, considering the future evolution of this patch to operations during normal running, I don't think that would be a good fit, either. Then, as we're currently targeting just recovery, the options we can take are below. Which would you vote for? My choice would be (3) > (2) > (1).

(1) Use the cached flag in both VACUUM (0003) and TRUNCATE (0004).
This brings the most safety and code consistency. But this would not benefit from the optimization for TRUNCATE in unexpectedly many cases -- when TOAST storage exists but it's not written, or the FSM/VM is not updated after a checkpoint.

(2) Use the cached flag in VACUUM (0003), but use InRecovery instead of the cached flag in TRUNCATE (0004).
This benefits from the optimization in all cases. But this lacks code consistency. You may be afraid of safety if the startup process smgrclose()s the relation after the shared buffer flushing hits disk full. However, the startup process doesn't smgrclose(), so it should be safe. Just in case the startup process smgrclose()s, the worst consequence is a PANIC shutdown after repeated failure of checkpoints due to lingering orphaned dirty shared buffers. Accept it as Thomas-san's devil's suggestion.

(3) Do not use the cached flag in either VACUUM (0003) or TRUNCATE (0004).
This benefits from the optimization in all cases. The code is consistent and smaller. As for the safety, this is the same as (2), but it applies to VACUUM as well.

Regards Takayuki Tsunakawa
On Wednesday, December 9, 2020 10:58 AM, Tsunakawa, Takayuki wrote: > From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> > > At Tue, 8 Dec 2020 16:28:41 +0530, Amit Kapila > > <amit.kapila16@gmail.com> wrote in > > I also can't think of a way to use an optimized path for such cases > > > but I don't agree with your comment that it is common enough that > > > we should leave this optimization out entirely for the truncate path. > > > > An ugly way to cope with it would be to let other smgr functions > > manage the cached value, for example, by calling smgrnblocks while > > InRecovery. Or letting smgr remember the maximum block number ever > > accessed. But we cannot fully rely on that since smgr can be closed > > in the midst of a session and smgr doesn't offer such persistence. In the > > first place, smgr doesn't seem to be the place to store such persistent > > information. > > Yeah, considering the future evolution of this patch to operations during > normal running, I don't think that would be a good fit, either. > > Then, as we're currently targeting just recovery, the options we can take > are below. Which would you vote for? My choice would be (3) > (2) > (1). > > > (1) > Use the cached flag in both VACUUM (0003) and TRUNCATE (0004). > This brings the most safety and code consistency. > But this would not benefit from the optimization for TRUNCATE in unexpectedly > many cases -- when TOAST storage exists but it's not written, or the FSM/VM is > not updated after a checkpoint. > > > (2) > Use the cached flag in VACUUM (0003), but use InRecovery instead of the > cached flag in TRUNCATE (0004). > This benefits from the optimization in all cases. > But this lacks code consistency. > You may be afraid of safety if the startup process smgrclose()s the relation > after the shared buffer flushing hits disk full. However, the startup process > doesn't smgrclose(), so it should be safe. Just in case the startup process > smgrclose()s, the worst consequence is a PANIC shutdown after repeated > failure of checkpoints due to lingering orphaned dirty shared buffers. Accept > it as Thomas-san's devil's suggestion. > > > (3) > Do not use the cached flag in either VACUUM (0003) or TRUNCATE (0004). > This benefits from the optimization in all cases. > The code is consistent and smaller. > As for the safety, this is the same as (2), but it applies to VACUUM as well. If we want code consistency, then we'd go with either (1) or (3). And if we want to take the benefits of the optimization for both DropRelFileNodeBuffers and DropRelFileNodesAllBuffers, then I'd choose (3). However, if the reviewers and committer want to make use of the "cached" flag, then we can live with the "cached" value in place there, even if it's not common to get the optimization for the TRUNCATE path. So only VACUUM would take the most benefit. My vote is also (3), then (2), then (1). Regards, Kirk Jamison
On Wed, Dec 9, 2020 at 6:32 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Tue, 8 Dec 2020 16:28:41 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > On Tue, Dec 8, 2020 at 12:13 PM tsunakawa.takay@fujitsu.com > > <tsunakawa.takay@fujitsu.com> wrote: > > > > > > From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com> > > > > Because one of the rels' cached values was false, it forced the > > > > full-scan path for TRUNCATE. > > > > Is there a possible workaround for this? > > > > > > Hmm, the other two relfilenodes are for the TOAST table and index of the target table. I think the INSERT didn't access those TOAST relfilenodes because the inserted data was stored in the main storage. But TRUNCATE always truncates all three relfilenodes. So, the standby had not opened the relfilenode for the TOAST stuff or cached its size when replaying the TRUNCATE. > > > > > > I'm afraid this is too common to ignore and just accept the slow traditional path, but I can't think of a good idea to use the cached flag. > > > > > > > I also can't think of a way to use an optimized path for such cases, > > > but I don't agree with your comment that it is common enough that we > > > should leave this optimization out entirely for the truncate path. > > > > Mmm. At least btree doesn't need to call smgrnblocks except at > > expansion, so we cannot get to the optimized path in major cases of > > truncation involving btree (and/or maybe other indexes). > > AFAICS, btree insert should call smgrnblocks via btree_xlog_insert->XLogReadBufferForRedo->XLogReadBufferForRedoExtended->XLogReadBufferExtended->smgrnblocks. Similarly, delete should also call smgrnblocks. Can you be a bit more specific about the btree case you have in mind? -- With Regards, Amit Kapila.
At Wed, 9 Dec 2020 16:27:30 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > On Wed, Dec 9, 2020 at 6:32 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > > At Tue, 8 Dec 2020 16:28:41 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > > On Tue, Dec 8, 2020 at 12:13 PM tsunakawa.takay@fujitsu.com > > > <tsunakawa.takay@fujitsu.com> wrote: > > > > > > > > From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com> > > > > > Because one of the rels' cached values was false, it forced the > > > > > full-scan path for TRUNCATE. > > > > > Is there a possible workaround for this? > > > > > > > > Hmm, the other two relfilenodes are for the TOAST table and index of the target table. I think the INSERT didn't access those TOAST relfilenodes because the inserted data was stored in the main storage. But TRUNCATE always truncates all three relfilenodes. So, the standby had not opened the relfilenode for the TOAST stuff or cached its size when replaying the TRUNCATE. > > > > > > > > I'm afraid this is too common to ignore and just accept the slow traditional path, but I can't think of a good idea to use the cached flag. > > > > > > > > > > I also can't think of a way to use an optimized path for such cases, > > > but I don't agree with your comment that it is common enough that we > > > should leave this optimization out entirely for the truncate path. > > > > Mmm. At least btree doesn't need to call smgrnblocks except at > > expansion, so we cannot get to the optimized path in major cases of > > truncation involving btree (and/or maybe other indexes). > > > > AFAICS, btree insert should call smgrnblocks via > btree_xlog_insert->XLogReadBufferForRedo->XLogReadBufferForRedoExtended->XLogReadBufferExtended->smgrnblocks. > Similarly, delete should also call smgrnblocks. Can you be a bit more > specific about the btree case you have in mind? Oh, sorry. I wrongly looked at the non-recovery path. smgrnblocks is called during buffer loading while in recovery. So, smgrnblocks is called for indexes if any update happens on the heap relation. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> > Oh, sorry. I wrongly looked at the non-recovery path. smgrnblocks is > called during buffer loading while in recovery. So, smgrnblocks is called > for indexes if any update happens on the heap relation. I misunderstood that you said there's no problem with the TOAST index, because TRUNCATE creates the meta page, resulting in the caching of the page and size of the relation. Anyway, I'm relieved the concern disappeared. Then, I'd like to hear your vote on my previous mail... Regards Takayuki Tsunakawa
On Thu, Dec 10, 2020 at 7:11 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Wed, 9 Dec 2020 16:27:30 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > On Wed, Dec 9, 2020 at 6:32 AM Kyotaro Horiguchi > > > Mmm. At least btree doesn't need to call smgrnblocks except at > > > expansion, so we cannot get to the optimized path in major cases of > > > truncation involving btree (and/or maybe other indexes). > > > > > > > AFAICS, btree insert should call smgrnblocks via > > btree_xlog_insert->XLogReadBufferForRedo->XLogReadBufferForRedoExtended->XLogReadBufferExtended->smgrnblocks. > > Similarly, delete should also call smgrnblocks. Can you be a bit more > > specific about the btree case you have in mind? > > Oh, sorry. I wrongly looked at the non-recovery path. smgrnblocks is > called during buffer loading while in recovery. So, smgrnblocks is called > for indexes if any update happens on the heap relation. > Okay, so this means that we can get the benefit of the optimization in many cases in the Truncate code path as well, even if we use the 'cached' flag? If so, then I would prefer to keep the code consistent for both the vacuum and truncate recovery code paths. -- With Regards, Amit Kapila.
On Thursday, December 10, 2020 12:27 PM, Amit Kapila wrote: > On Thu, Dec 10, 2020 at 7:11 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > > At Wed, 9 Dec 2020 16:27:30 +0530, Amit Kapila > > <amit.kapila16@gmail.com> wrote in > > > On Wed, Dec 9, 2020 at 6:32 AM Kyotaro Horiguchi > > > > Mmm. At least btree doesn't need to call smgrnblocks except at > > > > expansion, so we cannot get to the optimized path in major cases > > > > of truncation involving btree (and/or maybe other indexes). > > > > > > > > > > AFAICS, btree insert should call smgrnblocks via > > > btree_xlog_insert->XLogReadBufferForRedo->XLogReadBufferForRedoExtended->XLogReadBufferExtended->smgrnblocks. > > > Similarly, delete should also call smgrnblocks. Can you be a bit more > > > specific about the btree case you have in mind? > > > > Oh, sorry. I wrongly looked at the non-recovery path. smgrnblocks is > > called during buffer loading while in recovery. So, smgrnblocks is called > > for indexes if any update happens on the heap relation. > > > > Okay, so this means that we can get the benefit of the optimization in many cases > in the Truncate code path as well, even if we use the 'cached' > flag? If so, then I would prefer to keep the code consistent for both the vacuum > and truncate recovery code paths. Yes, I have tested that the optimization works for index relations. I have attached V34, following the condition that we use the "cached" flag in both DropRelFileNodeBuffers() and DropRelFileNodesAllBuffers() for consistency. I added a comment in 0004 about the limitation of the optimization when there are TOAST relations that use a NON-PLAIN strategy, i.e., the optimization works if the data types used are integers, OID, bytea, etc. But for TOAST-able data types like text, the optimization will be skipped, forcing a full scan during recovery. Regards, Kirk Jamison
Attachment
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com> > I added a comment in 0004 about the limitation of the optimization when there are TOAST > relations that use a NON-PLAIN strategy, i.e., the optimization works if the data > types used are integers, OID, bytea, etc. But for TOAST-able data types like text, > the optimization will be skipped, forcing a full scan during recovery. bytea is a TOAST-able type.

+ /*
+  * Enter the optimization if the total number of blocks to be
+  * invalidated for all relations is below the full scan threshold.
+  */
+ if (cached && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)

Checking cached here doesn't seem to be necessary, because if cached is false, the control goes to the full scan path as below:

+ if (!cached)
+     goto buffer_full_scan;
+

Regards Takayuki Tsunakawa
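In other words, assuming the "goto buffer_full_scan" hunk shown above executes first, the condition could be reduced as follows (a sketch of the suggested change, not the final patch):

if (!cached)
    goto buffer_full_scan;      /* some fork size unknown: take the full scan */

/*
 * Enter the optimization if the total number of blocks to be
 * invalidated for all relations is below the full scan threshold.
 * ("cached" is known to be true here, so testing it again is redundant.)
 */
if (nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)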
On Thu, Dec 10, 2020 at 1:40 PM k.jamison@fujitsu.com <k.jamison@fujitsu.com> wrote: > > Yes, I have tested that the optimization works for index relations. > > I have attached V34, following the condition that we use the "cached" flag > in both DropRelFileNodeBuffers() and DropRelFileNodesAllBuffers() for > consistency. > I added a comment in 0004 about the limitation of the optimization when there are TOAST > relations that use a NON-PLAIN strategy, i.e., the optimization works if the data > types used are integers, OID, bytea, etc. But for TOAST-able data types like text, > the optimization will be skipped, forcing a full scan during recovery. > AFAIU, it won't take the optimized path only when we have a TOAST relation but there is no insertion corresponding to it. If so, then we don't need to mention it specifically, because there are other similar cases where the optimization won't work, like when during recovery we have to perform just a TRUNCATE. -- With Regards, Amit Kapila.
On Thursday, December 10, 2020 8:12 PM, Amit Kapila wrote: > On Thu, Dec 10, 2020 at 1:40 PM k.jamison@fujitsu.com > <k.jamison@fujitsu.com> wrote: > > > > Yes, I have tested that the optimization works for index relations. > > > > I have attached V34, following the condition that we use the "cached" > > flag in both DropRelFileNodeBuffers() and DropRelFileNodesAllBuffers() > > for consistency. > > I added a comment in 0004 about the limitation of the optimization when there are > > TOAST relations that use a NON-PLAIN strategy, i.e., the optimization > > works if the data types used are integers, OID, bytea, etc. But for > > TOAST-able data types like text, the optimization will be skipped, forcing a > full scan during recovery. > > > > AFAIU, it won't take the optimized path only when we have a TOAST relation but > there is no insertion corresponding to it. If so, then we don't need to mention > it specifically, because there are other similar cases where the optimization > won't work, like when during recovery we have to perform just a TRUNCATE. > Right, I forgot to add that there should be an update, like an insert, to the TOAST relation for the truncate optimization to work. However, that is only limited to TOAST relations with a PLAIN strategy. I have tested with the text data type, with inserts before truncate, and it did not enter the optimization path. OTOH, it worked for a data type like integer. So should I still not include that information? Also, I will remove the unnecessary "cached" from the line that Tsunakawa-san mentioned. I will wait for a few more comments before re-uploading, hopefully, the final version, including the test for truncate. Regards, Kirk Jamison
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com> > On Thursday, December 10, 2020 8:12 PM, Amit Kapila wrote: > > AFAIU, it won't take the optimized path only when we have a TOAST relation but > > there is no insertion corresponding to it. If so, then we don't need to mention > > it specifically, because there are other similar cases where the optimization > > won't work, like when during recovery we have to perform just a TRUNCATE. > > > > Right, I forgot to add that there should be an update, like an insert, to the TOAST > relation for the truncate optimization to work. However, that is only limited to > TOAST relations with a PLAIN strategy. I have tested with the text data type, with > inserts before truncate, and it did not enter the optimization path. OTOH, > it worked for a data type like integer. So should I still not include that information? What's valuable as a code comment to describe the remaining issue is that the reader can find clues to whether this is related to the problem he/she has hit, and/or how to solve the issue. I don't think the current comment is so bad in that regard, but it seems better to add:

* The condition of the issue: the table's ancillary storage (index, TOAST table, FSM, VM, etc.) was not updated during recovery. (As an aside, "during recovery" here does not mean "after the last checkpoint" but "from the start of recovery", because the standby experiences many checkpoints (the correct term is restartpoints in the case of a standby).)

* The cause as a hint to solve the issue: the startup process does not find page modification WAL records. As a result, it won't call XLogReadBufferExtended() and the smgrnblocks() called therein, so the relation/fork size is not cached.

Regards Takayuki Tsunakawa
RE: [Patch] Optimize dropping of relation buffers using dlist
From: tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> > What's valuable as a code comment to describe the remaining issue is that the You can attach XXX or FIXME in front of the issue description for easier search. (XXX appears to be used much more often in Postgres.) Regards Takayuki Tsunakawa
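Putting the two suggestions together, one way the source comment could read is sketched below; this is a draft only, and the final wording is for the patch author to settle:

/*
 * XXX The optimization is skipped if the size of any fork is not cached.
 * That happens when a relation's ancillary storage (TOAST table and
 * index, FSM, VM) was not touched by any page-modification WAL record
 * since the start of recovery: the startup process then never reaches
 * XLogReadBufferExtended() and the smgrnblocks() call therein for that
 * relfilenode, so nothing is cached and we fall back to a full scan of
 * shared buffers.
 */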
On Fri, Dec 11, 2020 at 5:54 AM k.jamison@fujitsu.com <k.jamison@fujitsu.com> wrote: > > On Thursday, December 10, 2020 8:12 PM, Amit Kapila wrote: > > On Thu, Dec 10, 2020 at 1:40 PM k.jamison@fujitsu.com > > <k.jamison@fujitsu.com> wrote: > > > > > > Yes, I have tested that the optimization works for index relations. > > > > > > I have attached V34, following the condition that we use the "cached" > > > flag in both DropRelFileNodeBuffers() and DropRelFileNodesAllBuffers() > > > for consistency. > > > I added a comment in 0004 about the limitation of the optimization when there are > > > TOAST relations that use a NON-PLAIN strategy, i.e., the optimization > > > works if the data types used are integers, OID, bytea, etc. But for > > > TOAST-able data types like text, the optimization will be skipped, forcing a > > full scan during recovery. > > > > > > > AFAIU, it won't take the optimized path only when we have a TOAST relation but > > there is no insertion corresponding to it. If so, then we don't need to mention > > it specifically, because there are other similar cases where the optimization > > won't work, like when during recovery we have to perform just a TRUNCATE. > > > > Right, I forgot to add that there should be an update, like an insert, to the TOAST > relation for the truncate optimization to work. However, that is only limited to > TOAST relations with a PLAIN strategy. I have tested with the text data type, with > inserts before truncate, and it did not enter the optimization path. > I think you are seeing this because the text datatype allows creating TOAST storage, and your data is small enough that it is not actually toasted. > OTOH, > it worked for a data type like integer. > It is not related to any datatype; it can happen whenever we don't have any operation on any of the forks during recovery. > So should I still not include that information? > I think we can extend your existing comment like: "Otherwise, if the size of a relation fork is not cached, we proceed to a full scan of the whole buffer pool. This can happen if there is no update to a particular fork during recovery." -- With Regards, Amit Kapila.
On Friday, December 11, 2020 10:27 AM, Amit Kapila wrote: > On Fri, Dec 11, 2020 at 5:54 AM k.jamison@fujitsu.com > <k.jamison@fujitsu.com> wrote: > > So should I still not include that information? > > > > I think we can extend your existing comment like: "Otherwise, if the size of a > relation fork is not cached, we proceed to a full scan of the whole buffer pool. > This can happen if there is no update to a particular fork during recovery." Attached are the final updated patches. I followed this advice and updated the source code comment a little bit. There are no changes from the previous version except that and the removal of the unnecessary "cached" condition which Tsunakawa-san mentioned. Below are also the updated recovery performance test results for TRUNCATE. (1000 tables, 1MB per table, results measured in seconds)

| s_b   | Master | Patched | % Reg   |
|-------|--------|---------|---------|
| 128MB | 0.406  | 0.406   | 0%      |
| 512MB | 0.506  | 0.406   | -25%    |
| 1GB   | 0.806  | 0.406   | -99%    |
| 20GB  | 15.224 | 0.406   | -3650%  |
| 100GB | 81.506 | 0.406   | -19975% |

Because of the relation size, it is expected to take the full-scan path for the 128MB shared_buffers setting, and there was no regression there. Similar to previous test results, the recovery time was constant for all shared_buffers settings with the patches applied. Regards, Kirk Jamison
Attachment
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com> > Attached are the final updated patches. Looks good, and the patch remains ready for committer. (Personally, I wanted the code comment to touch upon the TOAST and FSM/VM for the reader, because we couldn't think of those possibilities and took some time to find out why the optimization path wasn't taken.) Regards Takayuki Tsunakawa
Hello Kirk, I noticed you have posted a new version of your patch which has some changes for TRUNCATE on TOAST relations. Although you've done a performance test for the changed part, I'd like to double-check your patch (hope you don't mind). Below are the updated recovery performance test results for your new patch. All seems good.

*TOAST relation with PLAIN strategy like integer:

1. Recovery after VACUUM test results (average of 15 times)

shared_buffers   master(sec)   patched(sec)   %reg=((patched-master)/master)
------------------------------------------------------------------------------
128M             2.111         1.604          -24%
10G              57.135        1.878          -97%
20G              167.122       1.932          -99%

2. Recovery after TRUNCATE test results (average of 15 times)

shared_buffers   master(sec)   patched(sec)   %reg=((patched-master)/master)
------------------------------------------------------------------------------
128M             2.326         1.718          -26%
10G              82.397        1.738          -98%
20G              169.275       1.718          -99%

*TOAST relation with NON-PLAIN strategy like text/varchar:

1. Recovery after VACUUM test results (average of 15 times)

shared_buffers   master(sec)   patched(sec)   %reg=((patched-master)/master)
------------------------------------------------------------------------------
128M             3.174         2.493          -21%
10G              72.716        2.246          -97%
20G              163.660       2.474          -98%

2. Recovery after TRUNCATE test results (average of 15 times): although it looks like there are some improvements with the patch applied, I think that's because of the average calculation. TRUNCATE results should be similar between master and patched because they both take the full-scan path.

shared_buffers   master(sec)   patched(sec)   %reg=((patched-master)/master)
------------------------------------------------------------------------------
128M             4.978         4.958          0%
10G              97.048        88.751         -9%
20G              183.230       173.226        -5%

[Machine spec]
CPU: 40 processors (Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz)
Memory: 128G
OS: CentOS 8

[Failover test data]
Total table size: 600M
Tables: 10000 tables (1000 rows per table)

[Configuration in postgresql.conf]
autovacuum = off
wal_level = replica
max_wal_senders = 5
max_locks_per_transaction = 10000

If you have any questions on my test results, please let me know. Regards Tang
On Thu, Nov 19, 2020 at 12:37 PM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > From: Andres Freund <andres@anarazel.de> > > > Smaller comment: > >
> > +static void
> > +FindAndDropRelFileNodeBuffers(RelFileNode rnode, ForkNumber *forkNum, int nforks,
> > +                              BlockNumber *nForkBlocks, BlockNumber *firstDelBlock)
> > ...
> > +    /* Check that it is in the buffer pool. If not, do nothing. */
> > +    LWLockAcquire(bufPartitionLock, LW_SHARED);
> > +    buf_id = BufTableLookup(&bufTag, bufHash);
> > ...
> > +    bufHdr = GetBufferDescriptor(buf_id);
> > +
> > +    buf_state = LockBufHdr(bufHdr);
> > +
> > +    if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
> > +        bufHdr->tag.forkNum == forkNum[i] &&
> > +        bufHdr->tag.blockNum >= firstDelBlock[i])
> > +        InvalidateBuffer(bufHdr);    /* releases spinlock */
> > +    else
> > +        UnlockBufHdr(bufHdr, buf_state);
> >
> > I'm a bit confused about the check here. We hold a buffer partition lock, and > > have done a lookup in the mapping table. Why are we then rechecking the > > relfilenode/fork/blocknum? And why are we doing so holding the buffer header > > lock, which is essentially a spinlock, so should only ever be held for very short > > portions? > > > > This looks like it's copying logic from DropRelFileNodeBuffers() etc, but there > > the situation is different: We haven't done a buffer mapping lookup, and we > > don't hold a partition lock! > > That's because the buffer partition lock is released immediately after the hash table has been looked up. As an aside, InvalidateBuffer() requires the caller to hold the buffer header spinlock and doesn't hold the buffer partition lock. > This answers the second part of the question, but what about the first part (We hold a buffer partition lock, and have done a lookup in the mapping table. Why are we then rechecking the relfilenode/fork/blocknum?)? I think we don't need such a check; rather, we can have an Assert corresponding to that if-condition in the patch. I understand it is safe to compare relfilenode/fork/blocknum, but it might confuse readers of the code. I have started doing minor edits to the patch, especially planning to write a theory of why this optimization is safe, and here is what I can come up with: "To remove all the pages of the specified relation forks from the buffer pool, we need to scan the entire buffer pool, but we can optimize it by finding the buffers from the BufMapping table provided we know the exact size of each fork of the relation. The exact size is required to ensure that we don't leave any buffer for the relation being dropped, as otherwise the background writer or checkpointer can lead to a PANIC error while flushing buffers corresponding to files that don't exist. To know the exact size, we rely on the size cached for each fork by us during recovery, which limits the optimization to recovery and standbys, but we can easily extend it once we have a shared cache for relation size. In recovery, we cache the value returned by the first lseek(SEEK_END), and future writes keep the cached value up-to-date. See smgrextend. It is possible that the value of the first lseek is smaller than the actual number of existing blocks in the file due to buggy Linux kernels that might not have accounted for the recent write. But that should be fine because there must not be any buffers after that file size.
XXX We would make the extra lseek call for the unoptimized paths, but that is okay because we do it just for the first fork, and we anyway have to scan the entire buffer pool, the cost of which is so high that the extra lseek call won't make any visible difference. However, we could use the InRecovery flag to avoid the additional cost, but that doesn't seem worth it." Thoughts? -- With Regards, Amit Kapila.
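For readers following along, a rough sketch of the caching behaviour the above theory relies on is below. The field and variable names (smgr_cached_nblocks, smgrsw, smgr_which) are approximations based on the description in this thread, not necessarily the committed code:

BlockNumber
smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool *cached)
{
    BlockNumber result;

    /* If a value was cached during recovery, trust it: writes since the
     * first lseek(SEEK_END) keep it up to date (see smgrextend). */
    if (reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
    {
        if (cached)
            *cached = true;
        return reln->smgr_cached_nblocks[forknum];
    }

    result = smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum);

    /* Cache only in recovery, where no other process extends the file. */
    if (InRecovery)
        reln->smgr_cached_nblocks[forknum] = result;

    if (cached)
        *cached = false;
    return result;
}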
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Amit Kapila <amit.kapila16@gmail.com> > This answers the second part of the question, but what about the first > part (We hold a buffer partition lock, and have done a lookup in the > mapping table. Why are we then rechecking the > relfilenode/fork/blocknum?)? > > I think we don't need such a check; rather, we can have an Assert > corresponding to that if-condition in the patch. I understand it is > safe to compare relfilenode/fork/blocknum, but it might confuse readers > of the code. Hmm, you're right. I thought someone else could steal the found buffer and use it for another block, because the buffer mapping lwlock is released without pinning the buffer or acquiring the buffer header spinlock. However, in this case (replay of TRUNCATE during recovery), nobody steals the buffer: the bgwriter or checkpointer doesn't use a buffer for a new block, and the client backend waits for the AccessExclusive lock. > I have started doing minor edits to the patch, especially planning to > write a theory of why this optimization is safe, and here is what I can > come up with: Thank you, that's fluent and easier to understand. Regards Takayuki Tsunakawa
At Tue, 22 Dec 2020 01:42:55 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in > From: Amit Kapila <amit.kapila16@gmail.com> > > This answers the second part of the question, but what about the first > > part (We hold a buffer partition lock, and have done a lookup in the > > mapping table. Why are we then rechecking the > > relfilenode/fork/blocknum?)? > > > > I think we don't need such a check; rather, we can have an Assert > > corresponding to that if-condition in the patch. I understand it is > > safe to compare relfilenode/fork/blocknum, but it might confuse readers > > of the code. > > Hmm, you're right. I thought someone else could steal the found > buffer and use it for another block, because the buffer mapping > lwlock is released without pinning the buffer or acquiring the > buffer header spinlock. However, in this case (replay of TRUNCATE > during recovery), nobody steals the buffer: the bgwriter or checkpointer > doesn't use a buffer for a new block, and the client backend waits > for the AccessExclusive lock. Mmm. If that is true, doesn't the unoptimized path also need the rechecking? The AEL doesn't protect individual buffer blocks: no new block can be allocated for the relation, but BufferAlloc can still steal the block for other relations, since the AEL doesn't work for each buffer block. Am I still missing something? > > I have started doing minor edits to the patch, especially planning to > > write a theory of why this optimization is safe, and here is what I can > > come up with: > > Thank you, that's fluent and easier to understand. +1 regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Dec 22, 2020 at 7:13 AM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > From: Amit Kapila <amit.kapila16@gmail.com> > > This answers the second part of the question, but what about the first > > part (We hold a buffer partition lock, and have done a lookup in the > > mapping table. Why are we then rechecking the > > relfilenode/fork/blocknum?)? > > > > I think we don't need such a check; rather, we can have an Assert > > corresponding to that if-condition in the patch. I understand it is > > safe to compare relfilenode/fork/blocknum, but it might confuse readers > > of the code. > > Hmm, you're right. I thought someone else could steal the found buffer and use it for another block, because the buffer mapping lwlock is released without pinning the buffer or acquiring the buffer header spinlock. > Okay, I see your point. > However, in this case (replay of TRUNCATE during recovery), nobody steals the buffer: the bgwriter or checkpointer doesn't use a buffer for a new block, and the client backend waits for the AccessExclusive lock. > > Why would all client backends wait for the AccessExclusive lock on this relation? Say, a client needs a buffer for some other relation, and that might evict this buffer after we release the lock on the partition. In StrategyGetBuffer, it is important either to have a pin on the buffer or to have the buffer header itself locked, to avoid getting picked as a victim buffer. Am I missing something? -- With Regards, Amit Kapila.
At Tue, 22 Dec 2020 08:08:10 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > On Tue, Dec 22, 2020 at 7:13 AM tsunakawa.takay@fujitsu.com > <tsunakawa.takay@fujitsu.com> wrote: > > > > From: Amit Kapila <amit.kapila16@gmail.com> > > > This answers the second part of the question, but what about the first > > > part (We hold a buffer partition lock, and have done a lookup in the > > > mapping table. Why are we then rechecking the > > > relfilenode/fork/blocknum?)? > > > > > > I think we don't need such a check; rather, we can have an Assert > > > corresponding to that if-condition in the patch. I understand it is > > > safe to compare relfilenode/fork/blocknum, but it might confuse readers > > > of the code. > > > > Hmm, you're right. I thought someone else could steal the found buffer and use it for another block, because the buffer mapping lwlock is released without pinning the buffer or acquiring the buffer header spinlock. > > > > Okay, I see your point. > > > > However, in this case (replay of TRUNCATE during recovery), nobody steals the buffer: the bgwriter or checkpointer doesn't use a buffer for a new block, and the client backend waits for the AccessExclusive lock. > > > > I understood that you are thinking that the rechecking is useless. > Why would all client backends wait for the AccessExclusive lock on this > relation? Say, a client needs a buffer for some other relation, and > that might evict this buffer after we release the lock on the > partition. In StrategyGetBuffer, it is important either to have a pin > on the buffer or to have the buffer header itself locked, to avoid > getting picked as a victim buffer. Am I missing something? I think exactly like that. If we acquire the bufHdr lock before releasing the partition lock, that steal doesn't happen, but it doesn't seem good as a locking protocol. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Amit Kapila <amit.kapila16@gmail.com> > Why would all client backends wait for the AccessExclusive lock on this > relation? Say, a client needs a buffer for some other relation, and > that might evict this buffer after we release the lock on the > partition. In StrategyGetBuffer, it is important either to have a pin > on the buffer or to have the buffer header itself locked, to avoid > getting picked as a victim buffer. Am I missing something? Ouch, right. (The year-end business must be making me crazy...) So, there are two choices here:

1) The current patch.
2) Acquire the buffer header spinlock before releasing the buffer mapping lwlock, and eliminate the buffer tag comparison, as follows:

BufTableLookup();
LockBufHdr();
LWLockRelease();
InvalidateBuffer();

I think both are okay. If I must choose either, I kind of prefer 1), because LWLockRelease() could take a longer time to wake up other processes waiting on the lwlock, which is not very good to do while holding a spinlock. Regards Takayuki Tsunakawa
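For reference, option 1) as it appears in the patch hunk quoted earlier in the thread, slightly abridged: the mapping lock is dropped right after the lookup, and the tag recheck under the header spinlock catches the case where the buffer was evicted and reused for another page in between.

LWLockAcquire(bufPartitionLock, LW_SHARED);
buf_id = BufTableLookup(&bufTag, bufHash);
LWLockRelease(bufPartitionLock);

if (buf_id < 0)
    continue;                   /* block not in the buffer pool */

bufHdr = GetBufferDescriptor(buf_id);
buf_state = LockBufHdr(bufHdr);

/* Recheck: the buffer may have been reused for another page. */
if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
    bufHdr->tag.forkNum == forkNum[i] &&
    bufHdr->tag.blockNum >= firstDelBlock[i])
    InvalidateBuffer(bufHdr);   /* releases spinlock */
else
    UnlockBufHdr(bufHdr, buf_state);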
On Tue, Dec 22, 2020 at 8:18 AM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > > From: Amit Kapila <amit.kapila16@gmail.com> > > Why would all client backends wait for the AccessExclusive lock on this > > relation? Say, a client needs a buffer for some other relation, and > > that might evict this buffer after we release the lock on the > > partition. In StrategyGetBuffer, it is important either to have a pin > > on the buffer or to have the buffer header itself locked, to avoid > > getting picked as a victim buffer. Am I missing something? > > Ouch, right. (The year-end business must be making me crazy...) > > So, there are two choices here: > > 1) The current patch. > 2) Acquire the buffer header spinlock before releasing the buffer mapping lwlock, and eliminate the buffer tag comparison, as follows: > > BufTableLookup(); > LockBufHdr(); > LWLockRelease(); > InvalidateBuffer(); > > I think both are okay. If I must choose either, I kind of prefer 1), because LWLockRelease() could take a longer time to wake up other processes waiting on the lwlock, which is not very good to do while holding a spinlock. > > I also prefer (1). I will add some comments about the locking protocol in the next version of the patch. -- With Regards, Amit Kapila.
On Tue, Dec 22, 2020 at 8:12 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Tue, 22 Dec 2020 08:08:10 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in > > > Why would all client backends wait for AccessExclusive lock on this > > relation? Say, a client needs a buffer for some other relation and > > that might evict this buffer after we release the lock on the > > partition. In StrategyGetBuffer, it is important to either have a pin > > on the buffer or the buffer header itself must be locked to avoid > > getting picked as victim buffer. Am I missing something? > > I think exactly like that. If we acquire the bufHdr lock before > releasing the partition lock, that steal doesn't happen but it doesn't > seem good as a locking protocol. > Right, so let's keep the code as it is but I feel it is better to add some comments explaining the rationale behind this code. -- With Regards, Amit Kapila.
At Tue, 22 Dec 2020 02:48:22 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in > From: Amit Kapila <amit.kapila16@gmail.com> > > Why would all client backends wait for the AccessExclusive lock on this > > relation? Say, a client needs a buffer for some other relation, and > > that might evict this buffer after we release the lock on the > > partition. In StrategyGetBuffer, it is important either to have a pin > > on the buffer or to have the buffer header itself locked, to avoid > > getting picked as a victim buffer. Am I missing something? > > Ouch, right. (The year-end business must be making me crazy...) > > So, there are two choices here: > > 1) The current patch. > 2) Acquire the buffer header spinlock before releasing the buffer mapping lwlock, and eliminate the buffer tag comparison, as follows: > > BufTableLookup(); > LockBufHdr(); > LWLockRelease(); > InvalidateBuffer(); > > I think both are okay. If I must choose either, I kind of prefer 1), because LWLockRelease() could take a longer time to wake up other processes waiting on the lwlock, which is not very good to do while holding a spinlock. I like, as said before, the current patch. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> Mmm. If that is true, doesn't the unoptimized path also need the
> rechecking?

Yes, the traditional processing does the recheck after acquiring the buffer header spinlock.

Regards
Takayuki Tsunakawa
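To make the recheck concrete, here is a sketch of the traditional (full-scan) path's behavior, modeled on the existing DropRelFileNodeBuffers() loop; the variable names are assumptions for illustration:

/*
 * Full scan: for every buffer in the pool, do a cheap unlocked
 * pre-check, then recheck the tag under the header spinlock before
 * invalidating.  The recheck is what makes a stale pre-check harmless.
 */
for (i = 0; i < NBuffers; i++)
{
	BufferDesc *bufHdr = GetBufferDescriptor(i);
	uint32		buf_state;

	if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
		continue;				/* unlocked pre-check, may be stale */

	buf_state = LockBufHdr(bufHdr);
	if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
		bufHdr->tag.forkNum == forkNum &&
		bufHdr->tag.blockNum >= firstDelBlock)
		InvalidateBuffer(bufHdr);	/* releases the header spinlock */
	else
		UnlockBufHdr(bufHdr, buf_state);
}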
On Monday, December 21, 2020 10:25 PM, Amit Kapila wrote:
> I have started doing minor edits to the patch, especially planning to write a
> theory of why this optimization is safe, and here is what I can come up with:
>
> "To remove all the pages of the specified relation forks from the buffer pool, we
> need to scan the entire buffer pool, but we can optimize it by finding the
> buffers from the BufMapping table, provided we know the exact size of each fork
> of the relation. The exact size is required to ensure that we don't leave any
> buffer for the relation being dropped, as otherwise the background writer or
> checkpointer can lead to a PANIC error while flushing buffers corresponding
> to files that don't exist.
>
> To know the exact size, we rely on the size cached for each fork by us during
> recovery, which limits the optimization to recovery and standbys, but we
> can easily extend it once we have a shared cache for relation size.
>
> In recovery, we cache the value returned by the first lseek(SEEK_END), and
> future writes keep the cached value up-to-date. See smgrextend. It is
> possible that the value of the first lseek is smaller than the actual number of
> existing blocks in the file due to buggy Linux kernels that might not have
> accounted for the recent write. But that should be fine because there must
> not be any buffers after that file size.
>
> XXX We would make the extra lseek call for the unoptimized paths, but that is
> okay because we do it just for the first fork, and we anyway have to scan the
> entire buffer pool, the cost of which is so high that the extra lseek call won't
> make any visible difference. However, we could use the InRecovery flag to avoid the
> additional cost, but that doesn't seem worth it."
>
> Thoughts?

+1
Thank you very much for expanding the comments to carefully explain why the optimization is safe. I was also struggling to explain it completely, but your description also covers the possibility of extending the optimization in the future once we have a shared cache for rel size, so I like this addition.

(Also, it seems that we have concluded to retain the locking mechanism of the existing patch, based on the recent email exchanges. Both the traditional path and the optimized path do the rechecking, so there seems to be no problem; I'm definitely fine with it.)

Regards,
Kirk Jamison
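As a hedged sketch of the mechanism the draft comment above relies on, the patched smgrnblocks() could look roughly like this; the smgr_cached_nblocks field name and the smgrsw dispatch are assumptions modeled on smgr.c, not the final patch text:

/*
 * Sketch: during recovery, serve and maintain a cached relation size so
 * DropRelFileNodeBuffers() can trust it; otherwise fall back to the
 * lseek(SEEK_END)-based callback.
 */
BlockNumber
smgrnblocks(SMgrRelation reln, ForkNumber forknum, bool *cached)
{
	BlockNumber result;

	if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
	{
		if (cached)
			*cached = true;
		return reln->smgr_cached_nblocks[forknum];
	}

	result = smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum);

	/* remember the first lseek; smgrextend() keeps it up-to-date */
	if (InRecovery)
		reln->smgr_cached_nblocks[forknum] = result;

	if (cached)
		*cached = false;
	return result;
}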
On Tue, Dec 22, 2020 at 8:30 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> > I think both are okay. If I must choose either, I kind of prefer 1), because LWLockRelease() could take a longer time to wake up other processes waiting on the lwlock, which is not very good to do while holding a spinlock.
>
> I like, as said before, the current patch.
>

Attached, please find the updated patch with the following modifications: (a) updated comments at various places, especially to explain why this is a safe optimization, (b) merged the patch for extending smgrnblocks and the vacuum optimization patch, (c) made minor cosmetic changes and ran pgindent, and (d) updated the commit message. BTW, this optimization will help not only vacuum but also truncate when it is done in the same transaction in which the relation is created.

I would like to see certain tests to ensure that the value we choose for BUF_DROP_FULL_SCAN_THRESHOLD is correct. I see that some testing has been done earlier [1] for this threshold, but I am still not able to conclude. The criterion for finding the right threshold should be the maximum size of a relation to be truncated above which we don't get a benefit from this optimization.

One idea could be to remove the "nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD" part of the check "if (cached && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)" so that it always uses the optimized path for the tests. Then use relation sizes of NBuffers/128, NBuffers/256, NBuffers/512 for different values of shared buffers such as 128MB, 1GB, 20GB, 100GB.

Apart from tests, do let me know if you are happy with the changes in the patch? Next, I'll look into the DropRelFileNodesAllBuffers() optimization patch.

[1] - https://www.postgresql.org/message-id/OSBPR01MB234176B1829AECFE9FDDFCC2EFE90%40OSBPR01MB2341.jpnprd01.prod.outlook.com

--
With Regards,
Amit Kapila.
Attachment
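For context, the gate being tuned here has roughly this shape; this is a sketch, where FindAndDropRelFileNodeBuffers is a hypothetical name for the lookup-based invalidation and the define's value is exactly what these tests are meant to settle:

/* candidate threshold; the fraction of NBuffers is under test */
#define BUF_DROP_FULL_SCAN_THRESHOLD	(uint64) (NBuffers / 32)

	if (cached && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
	{
		/* optimized path: probe the buffer mapping table per block */
		for (j = 0; j < nforks; j++)
			FindAndDropRelFileNodeBuffers(rnode.node, forkNum[j],
										  nForkBlocks[j], firstDelBlock[j]);
		return;
	}
	/* otherwise fall through to the full scan of shared buffers */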
On Tue, Dec 22, 2020 at 2:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Apart from tests, do let me know if you are happy with the changes in
> the patch? Next, I'll look into the DropRelFileNodesAllBuffers()
> optimization patch.
>

Review of v35-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery [1]
========================================================

1.
DropRelFileNodesAllBuffers()
{
..
+buffer_full_scan:
+	pfree(block);
+	nodes = palloc(sizeof(RelFileNode) * n);	/* non-local relations */
+	for (i = 0; i < n; i++)
+		nodes[i] = smgr_reln[i]->smgr_rnode.node;
+
..
}

How is it correct to assign the nodes array directly from smgr_reln? There is no one-to-one correspondence. If you see the code before the patch, the passed array can have a mix of temp and non-temp relation information.

2.
+	for (i = 0; i < n; i++)
	{
-		pfree(nodes);
+		for (j = 0; j <= MAX_FORKNUM; j++)
+		{
+			/*
+			 * Assign InvalidBlockNumber to a block if a relation
+			 * fork does not exist, so that we can skip it later
+			 * when dropping the relation buffers.
+			 */
+			if (!smgrexists(smgr_reln[i], j))
+			{
+				block[i][j] = InvalidBlockNumber;
+				continue;
+			}
+
+			/* Get the number of blocks for a relation's fork */
+			block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);

Similar to the above, how can we assume the smgr_reln array has all non-local relations? Have we tried the case with a mix of temp and non-temp relations?

In this code, I am slightly worried about the additional cost of checking smgrexists each time. Consider a case where there are many relations and only one or a few of them have not cached the information; in such a case we will pay the cost of smgrexists for many relations without even going to the optimized path. Can we avoid that in some way, or at least reduce its usage to only when it is required? One idea could be that we first check if the nblocks information is cached, and if so then we don't need to call smgrexists; otherwise, check if it exists. For this, we need an API like smgrnblocks_cached, something we discussed earlier but preferred the current API. Do you have any better ideas?

[1] - https://www.postgresql.org/message-id/OSBPR01MB2341882F416A282C3F7D769DEFC70%40OSBPR01MB2341.jpnprd01.prod.outlook.com

--
With Regards,
Amit Kapila.
On Tuesday, December 22, 2020 6:25 PM, Amit Kapila wrote:
> Attached, please find the updated patch with the following modifications ...
> One idea could be to remove the "nBlocksToInvalidate <
> BUF_DROP_FULL_SCAN_THRESHOLD" part of the check so that it
> always uses the optimized path for the tests. Then use relation sizes of
> NBuffers/128, NBuffers/256, NBuffers/512 for different values of shared
> buffers such as 128MB, 1GB, 20GB, 100GB.

Alright. I will also repeat the tests with the different threshold settings; thank you for the tip.

> Apart from tests, do let me know if you are happy with the changes in the
> patch? Next, I'll look into the DropRelFileNodesAllBuffers() optimization patch.

Thank you, Amit. That looks neater, combining the previous patches 0002-0003, so I am +1 on the changes because of the clearer explanations for the threshold and optimization path in DropRelFileNodeBuffers. Thanks for cleaning up my patch sets. I hope we don't forget the 0001 patch's assertion in smgrextend(), to ensure that we do it safely too and that we are not InRecovery.

Regards,
Kirk Jamison
On Wed, Dec 23, 2020 at 6:30 AM k.jamison@fujitsu.com <k.jamison@fujitsu.com> wrote: > > On Tuesday, December 22, 2020 6:25 PM, Amit Kapila wrote: > > > Apart from tests, do let me know if you are happy with the changes in the > > patch? Next, I'll look into DropRelFileNodesAllBuffers() optimization patch. > > Thank you, Amit. > That looks more neat, combining the previous patches 0002-0003, so I am +1 > with the changes because of the clearer explanations for the threshold and > optimization path in DropRelFileNodeBuffers. Thanks for cleaning my patch sets. > Hope we don't forget the 0001 patch's assertion in smgrextend() to ensure that we > do it safely too and that we are not InRecovery. > I think the 0001 is mostly for test purposes but we will see once the main patches are ready. -- With Regards, Amit Kapila.
On Tue, Dec 22, 2020 at 5:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Review of v35-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery [1]
> ========================================================
>
> In this code, I am slightly worried about the additional cost of checking
> smgrexists each time. ... For this, we need an API like
> smgrnblocks_cached, something we discussed earlier but preferred the
> current API. Do you have any better ideas?
>

One more idea, which is not better than what I mentioned above, is that we completely avoid calling smgrexists and rely on smgrnblocks. It will throw an error in case the particular fork doesn't exist, and we can use try..catch to handle it. I just mention it as it came across my mind, but I don't think it is better than the previous one.

One more thing about the patch:

+	/* Get the number of blocks for a relation's fork */
+	block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);
+
+	if (!cached)
+		goto buffer_full_scan;

Why do we need to use goto here? We can simply break from the loop and then check if (cached && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD). I think we should try to avoid goto if possible without much complexity.

--
With Regards,
Amit Kapila.
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Amit Kapila <amit.kapila16@gmail.com>
> +	/* Get the number of blocks for a relation's fork */
> +	block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);
> +
> +	if (!cached)
> +		goto buffer_full_scan;
>
> Why do we need to use goto here? We can simply break from the loop and
> then check if (cached && nBlocksToInvalidate <
> BUF_DROP_FULL_SCAN_THRESHOLD). I think we should try to avoid goto if
> possible without much complexity.

That's because two for loops are nested -- breaking there only exits the inner loop. (I thought the same as you at first... And while I understand some people have an allergy to goto, I think modest use of goto makes the code readable.)

Regards
Takayuki Tsunakawa
At Wed, 23 Dec 2020 04:22:19 +0000, "tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> wrote in
> That's because two for loops are nested -- breaking there only exits the inner loop. (I thought the same as you at first... And while I understand some people have an allergy to goto, I think modest use of goto makes the code readable.)

I don't strongly oppose goto's, but in this case the outer loop can break on the same condition as the inner loop, since cached is true whenever the inner loop runs to the end. The variable cached needs to be initialized with true, instead of false, though:

+	 */
+	for (i = 0; i < n && cached; i++)

The same pattern is seen in the tree.

Regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
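Horiguchi-san's suggestion, spelled out as a sketch (variable names follow the quoted patch hunk): initialize cached to true and let the outer loop's condition carry the inner loop's break, so the goto label disappears:

	bool		cached = true;	/* initialized to true, per above */
	BlockNumber nBlocksToInvalidate = 0;

	for (i = 0; i < n && cached; i++)
	{
		for (j = 0; j <= MAX_FORKNUM; j++)
		{
			/* skip forks that do not exist on disk */
			if (!smgrexists(smgr_reln[i], j))
			{
				block[i][j] = InvalidBlockNumber;
				continue;
			}

			block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);
			if (!cached)
				break;			/* outer loop stops via its condition */

			nBlocksToInvalidate += block[i][j];
		}
	}

	if (cached && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
	{
		/* optimized lookup path */
	}
	else
	{
		/* full buffer-pool scan, formerly the goto target */
	}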
On Tuesday, December 22, 2020 9:11 PM, Amit Kapila wrote:
> Review of v35-0004-Optimize-DropRelFileNodesAllBuffers-in-recovery [1]
> ========================================================
> 1.
> DropRelFileNodesAllBuffers()
> {
> ..
> +buffer_full_scan:
> +	pfree(block);
> +	nodes = palloc(sizeof(RelFileNode) * n);	/* non-local relations */
> +	for (i = 0; i < n; i++)
> +		nodes[i] = smgr_reln[i]->smgr_rnode.node;
> +
> ..
> }
>
> How is it correct to assign the nodes array directly from smgr_reln? There is no
> one-to-one correspondence. If you see the code before the patch, the passed
> array can have a mix of temp and non-temp relation information.

You are right. I mistakenly removed the array of nodes that should have been allocated for non-local relations. So I fixed that by doing:

SMgrRelation *rels;

rels = palloc(sizeof(SMgrRelation) * nnodes);	/* non-local relations */

/* If it's a local relation, it's localbuf.c's problem. */
for (i = 0; i < nnodes; i++)
{
	if (RelFileNodeBackendIsTemp(smgr_reln[i]->smgr_rnode))
	{
		if (smgr_reln[i]->smgr_rnode.backend == MyBackendId)
			DropRelFileNodeAllLocalBuffers(smgr_reln[i]->smgr_rnode.node);
	}
	else
		rels[n++] = smgr_reln[i];
}
...
if (n == 0)
{
	pfree(rels);
	return;
}
...
//traditional path:

pfree(block);
nodes = palloc(sizeof(RelFileNode) * n);	/* non-local relations */
for (i = 0; i < n; i++)
	nodes[i] = rels[i]->smgr_rnode.node;

> 2.
> ...
> Similar to the above, how can we assume the smgr_reln array has all non-local
> relations? Have we tried the case with a mix of temp and non-temp relations?

Same as the reply above.

> In this code, I am slightly worried about the additional cost of checking
> smgrexists each time. Consider a case where there are many relations and only
> one or a few of them have not cached the information; in such a case we will
> pay the cost of smgrexists for many relations without even going to the
> optimized path. Can we avoid that in some way, or at least reduce its usage to
> only when it is required? One idea could be that we first check if the nblocks
> information is cached, and if so then we don't need to call smgrexists;
> otherwise, check if it exists. For this, we need an API like
> smgrnblocks_cached, something we discussed earlier but preferred the
> current API. Do you have any better ideas?

Right. I understand the point: let's say there are 100 relations, and the first 99 non-local relations happen to enter the optimization path, but the last one does not; calling smgrexists() for every relation would be too costly and a waste of time in that case. The only solution I could think of, as you mentioned, is to reintroduce the new API which we discussed before: smgrnblocks_cached(). It's possible to call smgrexists() only if smgrnblocks_cached() returns InvalidBlockNumber.
So if everyone agrees, we can add that API smgrnblocks_cached(), which will include the "cached" flag parameter, and remove the additional flag modifications from smgrnblocks(). In this case, both DropRelFileNodeBuffers() and DropRelFileNodesAllBuffers() will use the new API.

Thoughts?

Regards,
Kirk Jamison
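A sketch of what the proposed API could look like, anticipating the shape it takes later in the thread (InvalidBlockNumber signals "not cached", so no separate boolean is needed; the smgr_cached_nblocks field name is an assumption):

/*
 * Return the cached size of the fork, or InvalidBlockNumber if nothing
 * is cached.  Never touches the file system, so callers can use it to
 * decide whether paying for smgrexists()/smgrnblocks() is worthwhile.
 */
BlockNumber
smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum)
{
	if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
		return reln->smgr_cached_nblocks[forknum];

	return InvalidBlockNumber;
}

With this, DropRelFileNodesAllBuffers() can probe the cache first and fall back to the full scan as soon as any fork returns InvalidBlockNumber, calling smgrexists() only in the cases that still need it.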
On Wed, Dec 23, 2020 at 1:07 PM k.jamison@fujitsu.com <k.jamison@fujitsu.com> wrote:
>
> Right. I understand the point: let's say there are 100 relations, and the first 99
> non-local relations happen to enter the optimization path, but the last one does not;
> calling smgrexists() for every relation would be too costly and a waste of time in that case.
> The only solution I could think of, as you mentioned, is to reintroduce the new API
> which we discussed before: smgrnblocks_cached().
> It's possible to call smgrexists() only if smgrnblocks_cached() returns
> InvalidBlockNumber.
> So if everyone agrees, we can add that API smgrnblocks_cached(), which will
> include the "cached" flag parameter, and remove the additional flag modifications
> from smgrnblocks(). In this case, both DropRelFileNodeBuffers() and
> DropRelFileNodesAllBuffers() will use the new API.
>

Yeah, let's do it that way unless anyone has a better idea to suggest.

--
With Regards,
Amit Kapila.
On Wed, Dec 23, 2020 at 10:42 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> I don't strongly oppose goto's, but in this case the outer loop can
> break on the same condition as the inner loop, since cached is true
> whenever the inner loop runs to the end. The variable cached needs to be
> initialized with true, instead of false, though.
>

+1. I think it is better to avoid goto here as it can be done without introducing any complexity or making the code any less readable.

--
With Regards,
Amit Kapila.
On Wed, December 23, 2020 5:57 PM (GMT+9), Amit Kapila wrote:
> > I don't strongly oppose goto's, but in this case the outer loop can
> > break on the same condition as the inner loop, since cached is true
> > whenever the inner loop runs to the end. The variable cached needs to be
> > initialized with true, instead of false, though.
>
> +1. I think it is better to avoid goto here as it can be done without
> introducing any complexity or making the code any less readable.

I also do not mind, so I have removed the goto and followed the advice of all reviewers. It works fine in the latest attached patch 0003.

Attached herewith are the sets of patches. 0002 & 0003 have the following changes:

1. I have removed the modifications in smgrnblocks(). So the modifications of other functions that use smgrnblocks() in the previous patch versions were also reverted.
2. Introduced a new API smgrnblocks_cached() instead, which returns either a cached size for the specified fork or InvalidBlockNumber. Since InvalidBlockNumber is used, I think it is logical not to add the boolean "cached" parameter to the function, as it would be redundant. (In 0003, I only used "cached" as a local boolean variable for the trick of not using goto.) This function is called both in DropRelFileNodeBuffers() and DropRelFileNodesAllBuffers().
3. Modified some minor comments from the patch and commit logs.

It compiles. Passes the regression tests too. Your feedback is definitely welcome.

Regards,
Kirk Jamison
Attachment
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Jamison, Kirk/ジャミソン カーク <k.jamison@fujitsu.com>
> It compiles. Passes the regression tests too.
> Your feedback is definitely welcome.

The code looks correct and has become even more compact. Remains ready for committer.

Regards
Takayuki Tsunakawa
Hi Amit, Kirk

> One idea could be to remove "nBlocksToInvalidate <
> BUF_DROP_FULL_SCAN_THRESHOLD" part of check "if (cached &&
> nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)" so that it always
> uses the optimized path for the tests. Then use relation sizes of
> NBuffers/128, NBuffers/256, NBuffers/512 for different values of
> shared buffers such as 128MB, 1GB, 20GB, 100GB.

I followed your idea to remove the check and used different relation sizes for different shared buffers: 128M, 1G, 20G, 50G (my environment can't support 100G, so I chose 50G).

According to the results, all three thresholds get optimized, even NBuffers/128 when shared_buffers > 128M. IMHO, NBuffers/128 is the maximum relation size among the three thresholds for which we get an optimization. Please let me know if I made something wrong.

Recovery-after-vacuum test results are below; 'Optimized percentage' and 'Optimization details (unit: second)' show:
(512), (256), (128): means the relation size is NBuffers/512, NBuffers/256, NBuffers/128
%reg: means (patched(512) - master(512)) / master(512)

Optimized percentage:
shared_buffers  %reg(512)  %reg(256)  %reg(128)
-----------------------------------------------
128M                  0%        -1%        -1%
1G                  -65%       -49%       -62%
20G                 -98%       -98%       -98%
50G                 -99%       -99%       -99%

Optimization details (unit: second):
shared_buffers  master(512)  patched(512)  master(256)  patched(256)  master(128)  patched(128)
------------------------------------------------------------------------------------------------
128M                  0.108         0.108        0.109         0.108        0.109         0.108
1G                    0.310         0.107        0.410         0.208        0.811         0.309
20G                  94.493         1.511      188.777         3.014      380.633         6.020
50G                 537.978         3.815      867.453         7.524     1559.076        15.541

Test preparation:
Below are the table counts for the different shared buffers. Each table's size is 8kB, so I use table count = NBuffers/(512 or 256 or 128):
shared_buffers   NBuffers   NBuffers/512   NBuffers/256   NBuffers/128
-----------------------------------------------------------------------
128M                16384             32             64            128
1G                 131072            256            512           1024
20G               2621440           5120          10240          20480
50G               6553600          12800          25600          51200

Besides, I also did a single-table performance test. Still, NBuffers/128 is the max relation size for which we get an optimization.

Optimized percentage:
shared_buffers  %reg(512)  %reg(256)  %reg(128)
-----------------------------------------------
128M                  0%         0%        -1%
1G                    0%         1%         0%
20G                   0%       -24%       -25%
50G                   0%       -24%       -20%

Optimization details (unit: second):
shared_buffers  master(512)  patched(512)  master(256)  patched(256)  master(128)  patched(128)
------------------------------------------------------------------------------------------------
128M                  0.107         0.107        0.108         0.108        0.108         0.107
1G                    0.108         0.108        0.107         0.108        0.108         0.108
20G                   0.208         0.208        0.409         0.309        0.409         0.308
50G                   0.309         0.308        0.408         0.309        0.509         0.408

Any question on my test results is welcome.

Regards,
Tang
On Thu, Dec 24, 2020 at 2:31 PM Tang, Haiying <tanghy.fnst@cn.fujitsu.com> wrote:
>
> I followed your idea to remove the check and used different relation sizes for different shared buffers: 128M, 1G, 20G, 50G (my environment can't support 100G, so I chose 50G).
> According to the results, all three thresholds get optimized, even NBuffers/128 when shared_buffers > 128M.
> IMHO, NBuffers/128 is the maximum relation size among the three thresholds for which we get an optimization. Please let me know if I made something wrong.
>

But how can we conclude NBuffers/128 is the maximum relation size? Because the maximum size would be where the performance is worse than the master, no? I guess we need to try NBuffers/64, NBuffers/32, .... till we get the threshold where master performs better.

> Recovery-after-vacuum test results are below ...
> 50G                 537.978         3.815      867.453         7.524     1559.076        15.541
>

I think we should find a better way to display these numbers, because in cases like where master takes 537.978s and patch takes 3.815s it is clear that the patch has reduced the time by more than 100 times, whereas your table shows 99%.

> Test preparation:
> Below are the table counts for the different shared buffers. Each table's size is 8kB,
>

Table size should be more than 8k to get all this data because 8k means just one block. I guess either it is a typo or some other mistake.

--
With Regards,
Amit Kapila.
On Thu, December 24, 2020 6:02 PM JST, Tang, Haiying wrote:
> Hi Amit, Kirk
>
> I followed your idea to remove the check and used different relation sizes for
> different shared buffers: 128M, 1G, 20G, 50G (my environment can't support
> 100G, so I chose 50G).
> According to the results, all three thresholds get optimized, even
> NBuffers/128 when shared_buffers > 128M.
> IMHO, NBuffers/128 is the maximum relation size among the three thresholds for
> which we get an optimization. Please let me know if I made something wrong.

Hello Tang,
Thank you very much again for testing.
Perhaps there is a confusing part in the presented table where you indicated master(512), master(256), master(128), because the master is not supposed to use the BUF_DROP_FULL_SCAN_THRESHOLD and just executes the existing default full scan of NBuffers. Or may I have misunderstood something?

> Recovery-after-vacuum test results ... [tables as quoted above]

I will also post results from my machine in the next email, adding what Amit mentioned: that we should also test for NBuffers/64, etc. until we determine which threshold performs worse than master.

Regards,
Kirk Jamison
On Wed, Dec 23, 2020 at 6:27 PM k.jamison@fujitsu.com <k.jamison@fujitsu.com> wrote: > > > It compiles. Passes the regression tests too. > Your feedbacks are definitely welcome. > Thanks, the patches look good to me now. I have slightly edited the patches for comments, commit messages, and removed the duplicate code/check in smgrnblocks. I have changed the order of patches (moved Assert related patch to last because as mentioned earlier, I am not sure if we want to commit it.). We might still have to change the scan threshold value based on your and Tang-San's results. -- With Regards, Amit Kapila.
Attachment
Hi Kirk,

> Perhaps there is a confusing part in the presented table where you indicated master(512), master(256), master(128),
> because the master is not supposed to use the BUF_DROP_FULL_SCAN_THRESHOLD and just executes the existing default full scan of NBuffers.
> Or may I have misunderstood something?

Sorry for the confusion, I didn't make it clear. I didn't use BUF_DROP_FULL_SCAN_THRESHOLD for master. Master(512) means the table count in master is the same as in patched(512), and likewise for master(256) and master(128). I meant to mark 512/256/128 to distinguish the results in master for the three thresholds (applied in the patches).

Regards
Tang
Hi Amit,

Sorry for my late reply. Here are my answers to your earlier questions.

> But how can we conclude NBuffers/128 is the maximum relation size?
> Because the maximum size would be where the performance is worse than
> the master, no? I guess we need to try NBuffers/64, NBuffers/32,
> .... till we get the threshold where master performs better.

You are right, we should keep on testing until there is no optimization.

> I think we should find a better way to display these numbers, because in
> cases like where master takes 537.978s and patch takes 3.815s

Yeah, I think we can change the %reg formula from (patched - master)/master to (patched - master)/patched.

> Table size should be more than 8k to get all this data because 8k means
> just one block. I guess either it is a typo or some other mistake.

8k here is the relation size, not the data size.
For example, when I tested recovery performance of a 400MB relation size, I used 51200 tables (8kB per table).
Please let me know if you think this is not appropriate.

Regards
Tang
On Fri, Dec 25, 2020 at 9:28 AM Tang, Haiying <tanghy.fnst@cn.fujitsu.com> wrote:
>
> > But how can we conclude NBuffers/128 is the maximum relation size?
> > Because the maximum size would be where the performance is worse than
> > the master, no? I guess we need to try NBuffers/64, NBuffers/32,
> > .... till we get the threshold where master performs better.
>
> You are right, we should keep on testing until there is no optimization.
>
> > I think we should find a better way to display these numbers, because in
> > cases like where master takes 537.978s and patch takes 3.815s
>
> Yeah, I think we can change the %reg formula from (patched - master)/master to (patched - master)/patched.
>
> > Table size should be more than 8k to get all this data because 8k means
> > just one block. I guess either it is a typo or some other mistake.
>
> 8k here is the relation size, not the data size.
> For example, when I tested recovery performance of a 400MB relation size, I used 51200 tables (8kB per table).
> Please let me know if you think this is not appropriate.
>

I think one table with a varying amount of data is sufficient for the vacuum test. With a larger number of tables there is a greater chance of variation. We previously used multiple tables in one of the tests because of the Truncate operation (which uses DropRelFileNodesAllBuffers, which takes multiple relations as input), and that is not true for the Vacuum operation which I suppose you are testing here.

--
With Regards,
Amit Kapila.
Hi Amit,

> I think one table with a varying amount of data is sufficient for the vacuum test.
> With a larger number of tables there is a greater chance of variation.
> We previously used multiple tables in one of the tests because of the
> Truncate operation (which uses DropRelFileNodesAllBuffers, which takes
> multiple relations as input), and that is not true for the Vacuum operation
> which I suppose you are testing here.

Thanks for your advice and kind explanation. I'll continue the threshold test with one single table.

Regards,
Tang
Hi Amit,

> I think one table with a varying amount of data is sufficient for the vacuum test.
> With a larger number of tables there is a greater chance of variation.
> We previously used multiple tables in one of the tests because of the
> Truncate operation (which uses DropRelFileNodesAllBuffers, which takes
> multiple relations as input), and that is not true for the Vacuum operation
> which I suppose you are testing here.

I retested performance on a single table several times; the table size varies with the BUF_DROP_FULL_SCAN_THRESHOLD for the different shared buffers.

When shared_buffers is below 20G, there were no significant changes between master (HEAD) and patched. And according to the results compared between 20G and 100G, we get an optimization up to NBuffers/128, but there is no benefit from NBuffers/256. I've tested many times, and most times the same results came out; I don't know why. But if I used 5 tables (each table's size set to the BUF_DROP_FULL_SCAN_THRESHOLD), then we do get a benefit from NBuffers/256.

Here are my test results for a single table. If you have any question or suggestion, kindly let me know.

%reg = (patched - master(HEAD)) / patched

Optimized percentage:
shared_buffers  %reg(NBuffers/512)  %reg(NBuffers/256)  %reg(NBuffers/128)  %reg(NBuffers/64)  %reg(NBuffers/32)  %reg(NBuffers/16)  %reg(NBuffers/8)  %reg(NBuffers/4)
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
128M                            0%                  0%                 -1%                 0%                 1%                 0%                0%                0%
1G                             -1%                  0%                 -1%                 0%                 0%                 0%                0%                0%
20G                             0%                  0%                -33%                 0%                 0%               -13%                0%                0%
100G                          -32%                  0%                -49%                 0%                10%                30%                0%                6%

Result details (unit: second):

patched (sec)
shared_buffers  NBuffers/512  NBuffers/256  NBuffers/128  NBuffers/64  NBuffers/32  NBuffers/16  NBuffers/8  NBuffers/4
------------------------------------------------------------------------------------------------------------------------
128M                   0.107         0.107         0.107        0.107        0.108        0.107       0.108       0.208
1G                     0.107         0.107         0.107        0.108        0.208        0.208       0.308       0.409
20G                    0.208         0.308         0.308        0.409        0.609        0.808       1.511       2.713
100G                   0.309         0.408         0.609        1.010        2.011        5.017       6.620      13.931

master (HEAD) (sec)
shared_buffers  NBuffers/512  NBuffers/256  NBuffers/128  NBuffers/64  NBuffers/32  NBuffers/16  NBuffers/8  NBuffers/4
------------------------------------------------------------------------------------------------------------------------
128M                   0.107         0.107         0.108        0.107        0.107        0.107       0.108       0.208
1G                     0.108         0.107         0.108        0.108        0.208        0.207       0.308       0.409
20G                    0.208         0.309         0.409        0.409        0.609        0.910       1.511       2.712
100G                   0.408         0.408         0.909        1.010        1.811        3.515       6.619      13.032

Regards
Tang
Hi Amit,

In my last mail (https://www.postgresql.org/message-id/66851e198f6b41eda59e6257182564b6%40G08CNEXMBPEKD05.g08.fujitsu.local), I sent you the performance test results (run only 1 time) on a single table. Here are my retested results (averaged over 15 runs), which I think are more accurate.

In terms of 20G and 100G, the optimization on 100G is linear, but 20G is nonlinear (the results also include shared buffers of 50G/60G), so it's a little difficult for me to decide the threshold from the two. If we consider just 100G, I think NBuffers/32 is the optimized max relation size. But I don't know how to judge for 20G. If you have any suggestion, kindly let me know.

%reg                  128M     1G     20G    100G
---------------------------------------------------------------
%reg(NBuffers/512)      0%    -1%     -5%    -26%
%reg(NBuffers/256)      0%     0%      5%    -20%
%reg(NBuffers/128)     -1%    -1%    -10%    -16%
%reg(NBuffers/64)      -1%     0%      0%     -8%
%reg(NBuffers/32)       0%     0%     -2%     -4%
%reg(NBuffers/16)       0%     0%     -6%      4%
%reg(NBuffers/8)        1%     0%      2%     -2%
%reg(NBuffers/4)        0%     0%      2%      2%

Optimization details (unit: second):

patched (sec)
shared_buffers  NBuffers/512  NBuffers/256  NBuffers/128  NBuffers/64  NBuffers/32  NBuffers/16  NBuffers/8  NBuffers/4
------------------------------------------------------------------------------------------------------------------------
128M                   0.107         0.107         0.107        0.107        0.107        0.107       0.108       0.208
1G                     0.107         0.108         0.107        0.108        0.208        0.208       0.308       0.409
20G                    0.199         0.299         0.317        0.408        0.591        0.900       1.561       2.866
100G                   0.318         0.381         0.645        0.992        1.913        3.640       6.615      13.389

master (HEAD) (sec)
shared_buffers  NBuffers/512  NBuffers/256  NBuffers/128  NBuffers/64  NBuffers/32  NBuffers/16  NBuffers/8  NBuffers/4
------------------------------------------------------------------------------------------------------------------------
128M                   0.107         0.107         0.108        0.108        0.107        0.107       0.107       0.208
1G                     0.108         0.108         0.108        0.108        0.208        0.207       0.308       0.409
20G                    0.208         0.283         0.350        0.408        0.601        0.955       1.529       2.806
100G                   0.400         0.459         0.751        1.068        1.984        3.506       6.735      13.101

Regards
Tang
On Wed, Dec 30, 2020 at 11:28 AM Tang, Haiying <tanghy.fnst@cn.fujitsu.com> wrote:
>
> In terms of 20G and 100G, the optimization on 100G is linear, but 20G is nonlinear (the results also include shared buffers of 50G/60G), so it's a little difficult for me to decide the threshold from the two.
> If we consider just 100G, I think NBuffers/32 is the optimized max relation size. But I don't know how to judge for 20G. If you have any suggestion, kindly let me know.
>

Considering these results, NBuffers/64 seems a good threshold, as beyond that there is no big advantage. BTW, it is not clear why the advantage for a single table is not as big as for multiple tables with the Truncate command. Can you share your exact test steps for any one of the tests? Also, did you set autovacuum = off for these tests? If not, the results might not be reliable, because before you run the test via the Vacuum command, autovacuum would have done that work already.

--
With Regards,
Amit Kapila.
On Wednesday, December 30, 2020 8:58 PM, Amit Kapila wrote:
> Considering these results, NBuffers/64 seems a good threshold, as beyond
> that there is no big advantage. BTW, it is not clear why the advantage for a
> single table is not as big as for multiple tables with the Truncate command.
> Can you share your exact test steps for any one of the tests?
> Also, did you set autovacuum = off for these tests? If not, the results
> might not be reliable, because before you run the test via the Vacuum command,
> autovacuum would have done that work already.

Happy new year. The V38 LGTM.
Apologies for a bit of delay in posting the test results, but since it's the start of the commitfest, here they are, and the results were interesting.

I executed a VACUUM test using the same approach that Tsunakawa-san did in [1], but this time measuring the total time that DropRelFileNodeBuffers() took. I used only a single relation, and tried various sizes using the threshold values NBuffers/512 .. NBuffers/1, as advised by Amit.

Example of relation sizes for NBuffers/512:
100GB shared_buffers: 200 MB
20GB shared_buffers: 40 MB
1GB shared_buffers: 2 MB
128MB shared_buffers: 0.25 MB

The regression, which means the patch performs worse than master, only happens for relation size NBuffers/2 and below for all shared_buffers. The fastest performance on a single relation was with the relation size NBuffers/512.
[VACUUM Recovery Performance on Single Relation]
Legend: P_XXX (Patch, NBuffers/XXX relation size), M_XXX (Master, NBuffers/XXX relation size)
Unit: seconds

| Rel Size | 100 GB s_b | 20 GB s_b  | 1 GB s_b   | 128 MB s_b |
|----------|------------|------------|------------|------------|
| P_512    | 0.012594   | 0.001989   | 0.000081   | 0.000012   |
| M_512    | 0.208757   | 0.046212   | 0.002013   | 0.000295   |
| P_256    | 0.026311   | 0.004416   | 0.000129   | 0.000021   |
| M_256    | 0.241017   | 0.047234   | 0.002363   | 0.000298   |
| P_128    | 0.044684   | 0.009784   | 0.000290   | 0.000042   |
| M_128    | 0.253588   | 0.047952   | 0.002454   | 0.000319   |
| P_64     | 0.065806   | 0.017444   | 0.000521   | 0.000075   |
| M_64     | 0.268311   | 0.050361   | 0.002730   | 0.000339   |
| P_32     | 0.121441   | 0.033431   | 0.001646   | 0.000112   |
| M_32     | 0.285254   | 0.061486   | 0.003640   | 0.000364   |
| P_16     | 0.255503   | 0.065492   | 0.001663   | 0.000144   |
| M_16     | 0.377013   | 0.081613   | 0.003731   | 0.000372   |
| P_8      | 0.560616   | 0.109509   | 0.005954   | 0.000465   |
| M_8      | 0.692596   | 0.112178   | 0.006667   | 0.000553   |
| P_4      | 1.109437   | 0.162924   | 0.011229   | 0.000861   |
| M_4      | 1.162125   | 0.178764   | 0.011635   | 0.000935   |
| P_2      | 2.202231   | 0.317832   | 0.020783   | 0.002646   |
| M_2      | 1.583959   | 0.306269   | 0.015705   | 0.002021   |
| P_1      | 3.080032   | 0.632747   | 0.032183   | 0.002660   |
| M_1      | 2.705485   | 0.543970   | 0.030658   | 0.001941   |

%reg = (Patched - Master) / Patched

| %reg_relsize | 100 GB s_b | 20 GB s_b  | 1 GB s_b   | 128 MB s_b |
|--------------|------------|------------|------------|------------|
| %reg_512     | -1557.587% | -2223.006% | -2385.185% | -2354.167% |
| %reg_256     | -816.041%  | -969.691%  | -1731.783% | -1319.048% |
| %reg_128     | -467.514%  | -390.123%  | -747.008%  | -658.333%  |
| %reg_64      | -307.727%  | -188.704%  | -423.992%  | -352.000%  |
| %reg_32      | -134.891%  | -83.920%   | -121.097%  | -225.970%  |
| %reg_16      | -47.557%   | -24.614%   | -124.279%  | -157.390%  |
| %reg_8       | -23.542%   | -2.437%    | -11.967%   | -19.010%   |
| %reg_4       | -4.749%    | -9.722%    | -3.608%    | -8.595%    |
| %reg_2       | 28.075%    | 3.638%     | 24.436%    | 23.615%    |
| %reg_1       | 12.160%    | 14.030%    | 4.739%     | 27.010%    |

Since our goal is to get the approximate threshold where the cost of finding the buffers to be invalidated becomes higher on the optimized path than on the traditional path:

A. Traditional Path
1. For each shared buffer, compare the relfilenode.
2. LockBufHdr()
3. Compare the block number; InvalidateBuffer() if it's a target.

B. Optimized Path
1. For each block in the relation, LWLockAcquire(), BufTableLookup(), and LWLockRelease().
2-3. Same as the traditional path.

So the difference lies in #1, where the number of buffers checked differs: all of shared buffers on the traditional path versus only the relation's blocks on the optimized path. The cost of the optimized path exceeds that of the traditional path at some threshold:

NBuffers * traditional_cost_per_buffer_check < InvalidatedBuffers * optimized_cost_per_buffer_check

So what we want to know as the threshold value is the InvalidatedBuffers at the break-even point:

NBuffers * traditional / optimized < InvalidatedBuffers

Example for 100GB shared_buffers with rel_size NBuffers/512:
100000 (MB) * 0.208757 (s) / 0.012594 (s) = 1,657,587 MB, which is still above the value of 100,000 MB.
| s_b          | 100000    | 20000   | 1000   | 128   |
|--------------|-----------|---------|--------|-------|
| NBuffers/512 | 1,657,587 | 464,601 | 24,852 | 3,141 |
| NBuffers/256 | 916,041   | 213,938 | 18,318 | 1,816 |
| NBuffers/128 | 567,514   | 98,025  | 8,470  | 971   |
| NBuffers/64  | 407,727   | 57,741  | 5,240  | 579   |
| NBuffers/32  | 234,891   | 36,784  | 2,211  | 417   |
| NBuffers/16  | 147,557   | 24,923  | 2,243  | 329   |
| NBuffers/8   | 123,542   | 20,487  | 1,120  | 152   |
| NBuffers/4   | 104,749   | 21,944  | 1,036  | 139   |
| NBuffers/2   | 71,925    | 19,272  | 756    | 98    |
| NBuffers/1   | 87,840    | 17,194  | 953    | 93    |

Although the above table shows that NBuffers/2 would be the threshold, I know that the cost would vary depending on the machine specs. I think I can suggest the threshold and pick one from among NBuffers/2, NBuffers/4 or NBuffers/8, because their values are closest to the InvalidatedBuffers.

[postgresql.conf]
shared_buffers = 100GB #20GB,1GB,128MB
autovacuum = off
full_page_writes = off
checkpoint_timeout = 30min
max_locks_per_transaction = 10000

[Machine Specs Used]
Intel(R) Xeon(R) CPU E5-2637 v4 @ 3.50GHz
8 CPUs, 256GB Memory
XFS, RHEL7.2

Kindly let me know if you have comments regarding the results.

Regards,
Kirk Jamison

[1] https://www.postgresql.org/message-id/TYAPR01MB2990C4EFE63F066F83D2A603FEE70%40TYAPR01MB2990.jpnprd01.prod.outlook.com
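As a self-contained check of the break-even arithmetic above (just the mail's formula restated in C, with the 100GB shared_buffers, NBuffers/512 numbers copied from the tables):

#include <stdio.h>

int
main(void)
{
	double		shared_buffers_mb = 100000.0;	/* 100GB in MB */
	double		master_time = 0.208757;			/* full-scan path, seconds */
	double		patched_time = 0.012594;		/* lookup path, seconds */

	/* break-even InvalidatedBuffers: NBuffers * traditional / optimized */
	printf("break-even: %.0f MB\n",
		   shared_buffers_mb * master_time / patched_time);
	/* prints ~1657587 MB, far above the 100,000 MB of shared buffers,
	 * so the lookup path wins at this relation size */
	return 0;
}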
On Sat, Jan 2, 2021 at 7:47 PM k.jamison@fujitsu.com <k.jamison@fujitsu.com> wrote:
>
> Happy new year. The V38 LGTM.
> Apologies for a bit of delay in posting the test results, but since it's the
> start of the commitfest, here they are, and the results were interesting.
>
> I executed a VACUUM test using the same approach that Tsunakawa-san did in [1],
> but this time measuring the total time that DropRelFileNodeBuffers() took.
>

Please specify the exact steps, like did you delete all the rows from a table, or some of them, or none, before performing the Vacuum? How did you measure this time; did you remove the cached check? It would be better if you share the scripts and/or the exact steps so that the same can be used by others to reproduce.

> I used only a single relation, and tried various sizes using the threshold values
> NBuffers/512 .. NBuffers/1, as advised by Amit.
>
> Example of relation sizes for NBuffers/512:
> 100GB shared_buffers: 200 MB
> 20GB shared_buffers: 40 MB
> 1GB shared_buffers: 2 MB
> 128MB shared_buffers: 0.25 MB
> ..
>
> Although the above table shows that NBuffers/2 would be the
> threshold, I know that the cost would vary depending on the machine
> specs. I think I can suggest the threshold and pick one from among
> NBuffers/2, NBuffers/4 or NBuffers/8, because their values are closest
> to the InvalidatedBuffers.
>

Hmm, in the tests done by Tang, the results indicate that in some cases the patched version is slower even at NBuffers/32, so I am not sure we can go to the values shown by you unless she is doing something wrong. I think the difference in results could be because both of you are using different techniques to measure the timings, so it might be better if both of you can share the scripts or exact steps used to measure the time, and the other can use the same technique and see if we are getting consistent results.

--
With Regards,
Amit Kapila.
Hi Amit,

Sorry for my late reply. Here are my answers to your earlier questions.

> BTW, it is not clear why the advantage for a single table is not as big as for multiple tables with the Truncate command.

I guess it's the number of table blocks that causes this difference. For the single table, I tested with the block count at the threshold. For the multiple tables, I tested with block counts (like one, or dozens, or hundreds) far below the threshold. The closer the table's blocks are to the threshold, the smaller the advantage gets.

I tested the 3 situations below with 50 tables when shared buffers = 20G / 100G.
1. For multiple tables which have one or dozens or hundreds of blocks (far below the threshold) per table, we got a significant improvement, like [1].
2. For multiple tables which have half the threshold's blocks per table, the advantage becomes less, like [2].
3. For multiple tables which have the threshold's blocks per table, the advantage becomes even less, like [3].

[1] 247 blocks per table
s_b     master   patched  %reg ((patched - master)/patched)
----------------------------------------------------
20GB     1.109    0.108    -927%
100GB    3.113    0.108   -2782%

[2] NBuffers/256/2 blocks per table
s_b     master   patched  %reg
----------------------------------------------------
20GB     2.012    1.210    -66%
100GB   10.226    6.4      -60%

[3] NBuffers/256 blocks per table
s_b     master   patched  %reg
----------------------------------------------------
20GB     3.868    2.412    -60%
100GB   14.977   10.591    -41%

> Can you share your exact test steps for any one of the tests? Also, did you set autovacuum = off for these tests?

Yes, I configured a streaming replication environment as Kirk did before:
autovacuum = off
full_page_writes = off
checkpoint_timeout = 30min

Test steps (e.g. shared_buffers = 20G, NBuffers/512; table blocks = 20*1024*1024/8/512 = 5120; table size (kB) = 20*1024*1024/512 = 40960 kB):
1. (Master) create table test(id int, v_ch varchar, v_ch1 varchar);
2. (Master) insert about 40MB of data into the table.
3. (Master) delete from table (all rows of the table).
4. (Standby) To test with failover, pause the WAL replay on the standby server. SELECT pg_wal_replay_pause();
5. (Master) VACUUM;
6. (Master) Stop the primary server. pg_ctl stop -D $PGDATA -w
7. (Standby) Resume WAL replay and promote the standby. (Get the recovery time from this step.)

Regards
Tang
On Sunday, January 3, 2021 10:35 PM (JST), Amit Kapila wrote:
> On Sat, Jan 2, 2021 at 7:47 PM k.jamison@fujitsu.com
> <k.jamison@fujitsu.com> wrote:
> >
> > Happy new year. The V38 LGTM.
> > Apologies for a bit of delay in posting the test results, but since
> > it's the start of the commitfest, here they are, and the results were
> > interesting.
> >
> > I executed a VACUUM test using the same approach that Tsunakawa-san
> > did in [1], but this time I measured the total time that
> > DropRelFileNodeBuffers() took.
>
> Please specify the exact steps: did you delete all the rows from a table,
> some of them, or none before performing VACUUM? How did you measure this
> time? Did you remove the cached check? It would be better if you share
> the scripts and/or the exact steps so that others can use the same
> procedure to reproduce the results.

Basically, I used the TimestampDifference function in
DropRelFileNodeBuffers(). I also executed DELETE before VACUUM.
I also removed the "nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD"
check and used the threshold as the relation size.

> Hmm, in the tests done by Tang, the results indicate that in some cases
> the patched version is slower even at NBuffers/32, so I am not sure we
> can go to the values shown by you unless she is doing something wrong. I
> think the difference in results could be because the two of you are using
> different techniques to measure the timings, so it might be better if
> both of you share the scripts or exact steps used to measure the time,
> and the other can use the same technique and see if we get consistent
> results.

Right, since we want consistent results, please disregard the approach
that I used. I will redo the test in the same way as Tang, because she
also executed the original failover test that I have been running.
To avoid confusion and to check whether the results from me and Tang are
consistent, I also did the recovery/failover test for VACUUM on a single
relation, which I will send in a separate email after this.

Regards,
Kirk Jamison
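For anyone who wants to reproduce this kind of measurement, here is a
minimal sketch of the instrumentation technique described above. The
placement and log wording are assumptions for illustration, not the exact
diff that was used; GetCurrentTimestamp() and TimestampDifference() are
the standard helpers from utils/timestamp.h:

#include "postgres.h"
#include "utils/timestamp.h"

/*
 * Hypothetical example: time a section of code (e.g. the body of
 * DropRelFileNodeBuffers()) and emit the elapsed time to the server log.
 */
static void
timed_section_example(void)
{
    TimestampTz start_ts = GetCurrentTimestamp();
    long        secs;
    int         microsecs;

    /* ... the code being measured would run here ... */

    TimestampDifference(start_ts, GetCurrentTimestamp(),
                        &secs, &microsecs);
    elog(LOG, "measured section took %ld.%06d s", secs, microsecs);
}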
On Wed, January 6, 2021 7:04 PM (JST), I wrote:
> I will redo the test in the same way as Tang, because she also executed
> the original failover test that I have been running.
> To avoid confusion and to check whether the results from me and Tang are
> consistent, I also did the recovery/failover test for VACUUM on a single
> relation, which I will send in a separate email after this.

A. Test to find the right THRESHOLD

Below are the procedures and results of the VACUUM recovery performance
test on a single relation.
I followed the advice below and applied the supplementary patch on top of
V39: Test-for-threshold.patch
This ensures that we always enter the optimized path; the threshold is
then used as the relation size.

> > One idea could be to remove the "nBlocksToInvalidate <
> > BUF_DROP_FULL_SCAN_THRESHOLD" part of the check "if (cached &&
> > nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)" so that it
> > always uses the optimized path for the tests. Then use the relation
> > size as NBuffers/128, NBuffers/256, NBuffers/512 for different values
> > of shared buffers as 128MB, 1GB, 20GB, 100GB.

Each relation size is NBuffers/XXX, so I used the attached "rel.sh" script
to test from NBuffers/512 up to NBuffers/8 relation size per
shared_buffers. I did not go beyond NBuffers/8 because it took too much
time, and the results up to that point were already conclusive.

[Vacuum Recovery Performance on Single Relation]
1. Set up synchronous streaming replication. I used the configuration
   written at the bottom of this email.
2. [Primary] Create 1 table. (rel.sh create)
3. [Primary] Insert data of NBuffers/XXX size. Make sure to use the
   correct size for the set shared_buffers by commenting out the right
   size in "insert" of the rel.sh script. (rel.sh insert)
4. [Primary] Delete from the table. (rel.sh delete)
5. [Standby] Optional: To double-check that the DELETE is reflected on
   the standby: SELECT count(*) FROM tableXXX; Make sure it returns 0.
6. [Standby] Pause WAL replay. (rel.sh pause)
   (This script executes SELECT pg_wal_replay_pause(); .)
7. [Primary] VACUUM the single relation. (rel.sh vacuum)
8. [Primary] After the vacuum finishes, stop the server. (rel.sh stop)
   (The script executes pg_ctl stop -D $PGDATA -w -mi)
9. [Standby] Resume WAL replay and promote the standby. (rel.sh resume)
   It prints a timestamp when resuming WAL replay and another timestamp
   when the promotion is done. Compute the time difference.

[Results for VACUUM on single relation]
Average of 5 runs.

1. % REGRESSION
% Regression: (patched - master)/master

| rel_size | 128MB  | 1GB    | 20GB   | 100GB    |
|----------|--------|--------|--------|----------|
| NB/512   | 0.000% | 0.000% | 0.000% | -32.680% |
| NB/256   | 0.000% | 0.000% | 0.000% | 0.000%   |
| NB/128   | 0.000% | 0.000% | 0.000% | -16.502% |
| NB/64    | 0.000% | 0.000% | 0.000% | -9.841%  |
| NB/32    | 0.000% | 0.000% | 0.000% | -6.219%  |
| NB/16    | 0.000% | 0.000% | 0.000% | 3.323%   |
| NB/8     | 0.000% | 0.000% | 0.000% | 8.178%   |

For 100GB shared_buffers, we can observe regression beyond NBuffers/32.
So with this, we can conclude that NBuffers/32 is the right threshold.
For NBuffers/16 and beyond, the patched version performs worse than
master. In other words, the cost of finding the buffers to be invalidated
gets higher in the optimized path than in the traditional path.

So in the attached V39 patches, I have updated the threshold
BUF_DROP_FULL_SCAN_THRESHOLD to NBuffers/32.
2. [PATCHED]
Units: Seconds

| rel_size | 128MB | 1GB   | 20GB  | 100GB |
|----------|-------|-------|-------|-------|
| NB/512   | 0.106 | 0.106 | 0.106 | 0.206 |
| NB/256   | 0.106 | 0.106 | 0.106 | 0.306 |
| NB/128   | 0.106 | 0.106 | 0.206 | 0.506 |
| NB/64    | 0.106 | 0.106 | 0.306 | 0.907 |
| NB/32    | 0.106 | 0.106 | 0.406 | 1.508 |
| NB/16    | 0.106 | 0.106 | 0.706 | 3.109 |
| NB/8     | 0.106 | 0.106 | 1.307 | 6.614 |

3. MASTER
Units: Seconds

| rel_size | 128MB | 1GB   | 20GB  | 100GB |
|----------|-------|-------|-------|-------|
| NB/512   | 0.106 | 0.106 | 0.106 | 0.306 |
| NB/256   | 0.106 | 0.106 | 0.106 | 0.306 |
| NB/128   | 0.106 | 0.106 | 0.206 | 0.606 |
| NB/64    | 0.106 | 0.106 | 0.306 | 1.006 |
| NB/32    | 0.106 | 0.106 | 0.406 | 1.608 |
| NB/16    | 0.106 | 0.106 | 0.706 | 3.009 |
| NB/8     | 0.106 | 0.106 | 1.307 | 6.114 |

I used the following configuration:

[postgresql.conf]
shared_buffers = 100GB #20GB,1GB,128MB
autovacuum = off
full_page_writes = off
checkpoint_timeout = 30min
max_locks_per_transaction = 10000
max_wal_size = 20GB

# For streaming replication from primary. Don't uncomment on Standby.
synchronous_commit = remote_write
synchronous_standby_names = 'walreceiver'

# For Standby. Don't uncomment on Primary.
# hot_standby = on
#primary_conninfo = 'host=... user=... port=... application_name=walreceiver'

----------

B. Regression Test using the NBuffers/32 Threshold (V39 Patches)

For this one, we do NOT need the supplementary Test-for-threshold.patch.
Apply only the V39 patches. But instead of using the "rel.sh" test script,
please use the attached "test.sh".

Similar to the tests I did before for 1000 relations, I executed the
recovery performance test, now with the NBuffers/32 threshold. The
postgresql.conf settings are the same as in the test above. Each relation
has 1 block (8 kB size); 1000 relations in total.

The test procedure is almost the same as in A, so I'll just summarize it:
1. Set up synchronous streaming replication and the config settings.
2. [Primary] test.sh create (The test.sh script will create 1000 tables.)
3. [Primary] test.sh insert
4. [Primary] test.sh delete (Skip steps 4-5 for the TRUNCATE test.)
5. [Standby] Optional for the VACUUM test: To double-check that the DELETE
   is reflected on the standby: SELECT count(*) FROM tableXXX; Make sure
   it returns 0.
6. [Standby] test.sh pause
7. [Primary] "test.sh vacuum" for the VACUUM test,
   "test.sh truncate" for the TRUNCATE test
8. [Primary] When #7 is done, test.sh stop
9. [Standby] When the primary is fully stopped, run "test.sh resume".
   Compute the time difference.

[Results for VACUUM Recovery Performance for 1000 relations]
Unit is in seconds. Average of 5 executions.
% regression = (patched-master)/master

| s_b    | Master | Patched | %reg    |
|--------|--------|---------|---------|
| 128 MB | 0.306  | 0.306   | 0.00%   |
| 1 GB   | 0.506  | 0.306   | -39.53% |
| 20 GB  | 14.522 | 0.306   | -97.89% |
| 100 GB | 66.564 | 0.306   | -99.54% |

[Results for TRUNCATE Recovery Performance for 1000 relations]
Unit is in seconds. Average of 5 executions.
% regression = (patched-master)/master

| s_b    | Master | Patched | %reg    |
|--------|--------|---------|---------|
| 128 MB | 0.206  | 0.206   | 0.00%   |
| 1 GB   | 0.506  | 0.206   | -59.29% |
| 20 GB  | 16.476 | 0.206   | -98.75% |
| 100 GB | 88.261 | 0.206   | -99.77% |

The results for the patched version were constant across all
shared_buffers settings for both TRUNCATE and VACUUM. That means we can
gain huge performance benefits with the patch. The performance benefits
have been tested extensively, so there's no question about that.
So I think the final decision on the threshold value should come once the
results prove to be consistent across testers. For now, based on my test
results, NBuffers/32 is the threshold I concluded on. It is already set in
the attached V39 patch set.

[Specs Used]
Intel(R) Xeon(R) CPU E5-2637 v4 @ 3.50GHz
8 CPUs, 256GB Memory
XFS, RHEL7.2, latest PostgreSQL (head version)

Feedback is definitely welcome. And if you want to test, I have already
described the detailed steps, including the scripts I used. Have fun
testing!

Regards,
Kirk Jamison
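For readers following the threshold discussion, here is a condensed sketch
of the guard that Test-for-threshold.patch relaxes. It illustrates the
shape of the recovery-time entry to the optimized path in
DropRelFileNodeBuffers(); the helper name drop_buffers_optimized and its
signature are assumptions for illustration, not the verbatim V39 patch:

/* A full scan of shared_buffers is avoided below this many blocks. */
#define BUF_DROP_FULL_SCAN_THRESHOLD   ((uint64) (NBuffers / 32))

/*
 * Hypothetical helper: returns true if the optimized path handled the
 * drop, false if the caller must fall back to scanning all of shared
 * buffers.  Only during recovery can the cached relation size (and hence
 * nBlocksToInvalidate) be trusted.
 */
static bool
drop_buffers_optimized(RelFileNodeBackend rnode, ForkNumber *forkNum,
                       int nforks, BlockNumber *nForkBlock,
                       BlockNumber *firstDelBlock,
                       BlockNumber nBlocksToInvalidate)
{
    if (!InRecovery || !BlockNumberIsValid(nBlocksToInvalidate) ||
        nBlocksToInvalidate >= BUF_DROP_FULL_SCAN_THRESHOLD)
        return false;           /* caller scans all of shared buffers */

    /* Look up each to-be-truncated block in the buffer mapping table. */
    for (int j = 0; j < nforks; j++)
        FindAndDropRelFileNodeBuffers(rnode.node, forkNum[j],
                                      nForkBlock[j], firstDelBlock[j]);
    return true;
}

Removing the "nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD" part of
the condition, as the supplementary patch does, forces every recovery-time
drop through FindAndDropRelFileNodeBuffers(), so the relation size itself
acts as the effective threshold under test.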
Hi Kirk,

> And if you want to test, I have already described the detailed steps,
> including the scripts I used. Have fun testing!

Thank you for sharing the test steps and scripts. I'd like to take a look
at them and redo some of the tests using my machine. I'll send my test
results in a separate email after this.

Regards,
Tang
> I'd like to take a look at them and redo some of the tests using my
> machine. I'll send my test results in a separate email after this.

I did the same tests with Kirk's scripts using the latest patch on my own
machine. The results look pretty good and are similar to Kirk's.

Average of 5 runs.

[VACUUM failover test for 1000 relations]
Unit is second, %reg=(patched-master)/master

| s_b    | Master | Patched | %reg    |
|--------|--------|---------|---------|
| 128 MB | 0.408  | 0.308   | -24.44% |
| 1 GB   | 0.809  | 0.308   | -61.94% |
| 20 GB  | 12.529 | 0.308   | -97.54% |
| 100 GB | 59.310 | 0.369   | -99.38% |

[TRUNCATE failover test for 1000 relations]
Unit is second, %reg=(patched-master)/master

| s_b    | Master | Patched | %reg    |
|--------|--------|---------|---------|
| 128 MB | 0.287  | 0.207   | -27.91% |
| 1 GB   | 0.688  | 0.208   | -69.84% |
| 20 GB  | 12.449 | 0.208   | -98.33% |
| 100 GB | 61.800 | 0.207   | -99.66% |

Besides, I did the test for the threshold value again. (I rechecked my
test process and found out that I had forgotten to check the data
synchronization state on the standby, which may have introduced some
NOISE into my earlier results.)
The following results show we can't get an improvement beyond NBuffers/32,
just like Kirk's test results, so I agree with Kirk on the threshold
value.

%regression:
| rel_size | 128MB | 1GB | 20GB | 100GB |
|----------|-------|-----|------|-------|
| NB/512   | 0%    | 0%  | 0%   | -48%  |
| NB/256   | 0%    | 0%  | 0%   | -33%  |
| NB/128   | 0%    | 0%  | 0%   | -9%   |
| NB/64    | 0%    | 0%  | 0%   | -5%   |
| NB/32    | 0%    | 0%  | -4%  | -3%   |
| NB/16    | 0%    | 0%  | -4%  | 1%    |
| NB/8     | 1%    | 0%  | 1%   | 3%    |

Optimization details (unit: second):

patched:
shared_buffers  NB/512  NB/256  NB/128  NB/64  NB/32  NB/16  NB/8
------------------------------------------------------------------
128M            0.107   0.107   0.107   0.106  0.107  0.107  0.107
1G              0.107   0.107   0.107   0.107  0.107  0.107  0.107
20G             0.107   0.108   0.207   0.307  0.442  0.876  1.577
100G            0.208   0.308   0.559   1.060  1.961  4.567  7.922

master:
shared_buffers  NB/512  NB/256  NB/128  NB/64  NB/32  NB/16  NB/8
------------------------------------------------------------------
128M            0.107   0.107   0.107   0.107  0.107  0.107  0.106
1G              0.107   0.107   0.107   0.107  0.107  0.107  0.107
20G             0.107   0.107   0.208   0.308  0.457  0.910  1.560
100G            0.308   0.409   0.608   1.110  2.011  4.516  7.721

[Specs]
CPU: 40 processors (Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz)
Memory: 128G
OS: CentOS 8

Any questions about my test are welcome.

Regards,
Tang
On Wed, Jan 6, 2021 at 6:43 PM k.jamison@fujitsu.com
<k.jamison@fujitsu.com> wrote:
>
> [Results for VACUUM on single relation]
> Average of 5 runs.
>
> 1. % REGRESSION
> % Regression: (patched - master)/master
>
> | rel_size | 128MB  | 1GB    | 20GB   | 100GB    |
> |----------|--------|--------|--------|----------|
> | NB/512   | 0.000% | 0.000% | 0.000% | -32.680% |
> | NB/256   | 0.000% | 0.000% | 0.000% | 0.000%   |
> | NB/128   | 0.000% | 0.000% | 0.000% | -16.502% |
> | NB/64    | 0.000% | 0.000% | 0.000% | -9.841%  |
> | NB/32    | 0.000% | 0.000% | 0.000% | -6.219%  |
> | NB/16    | 0.000% | 0.000% | 0.000% | 3.323%   |
> | NB/8     | 0.000% | 0.000% | 0.000% | 8.178%   |
>
> For 100GB shared_buffers, we can observe regression
> beyond NBuffers/32. So with this, we can conclude
> that NBuffers/32 is the right threshold.
> For NBuffers/16 and beyond, the patched version performs
> worse than master. In other words, the cost of finding
> the buffers to be invalidated gets higher in the optimized path
> than in the traditional path.
>
> So in the attached V39 patches, I have updated the threshold
> BUF_DROP_FULL_SCAN_THRESHOLD to NBuffers/32.
>

Thanks for the detailed tests. NBuffers/32 seems like an appropriate
value for the threshold based on these results. I would like to
slightly modify part of the commit message in the first patch as below
[1]; otherwise, I am fine with the changes. Unless you or anyone else
has any more comments, I am planning to push the 0001 and 0002
sometime next week.

[1]
"The recovery path of DropRelFileNodeBuffers() is optimized so that
scanning of the whole buffer pool can be avoided when the number of
blocks to be truncated in a relation is below a certain threshold. For
such cases, we find the buffers by doing lookups in the BufMapping
table. This improves the performance by more than 100 times in many
cases when several small tables (tested with 1000 relations) are
truncated and where the server is configured with a large value of
shared buffers (greater than 100GB)."

--
With Regards,
Amit Kapila.
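For readers who want to see what "lookups in the BufMapping table" means
concretely, here is a hedged sketch of the per-block lookup loop in the
spirit of the new FindAndDropRelFileNodeBuffers(); it is condensed for
illustration, so consult the committed code in bufmgr.c for the
authoritative handling:

static void
FindAndDropRelFileNodeBuffers(RelFileNode rnode, ForkNumber forkNum,
                              BlockNumber nForkBlock,
                              BlockNumber firstDelBlock)
{
    BlockNumber curBlock;

    for (curBlock = firstDelBlock; curBlock < nForkBlock; curBlock++)
    {
        uint32      bufHash;            /* hash value for the tag */
        BufferTag   bufTag;             /* identity of requested block */
        LWLock     *bufPartitionLock;   /* buffer partition lock for it */
        int         buf_id;
        BufferDesc *bufHdr;
        uint32      buf_state;

        /* create a tag so we can look up the buffer */
        INIT_BUFFERTAG(bufTag, rnode, forkNum, curBlock);

        /* determine its hash code and partition lock ID */
        bufHash = BufTableHashCode(&bufTag);
        bufPartitionLock = BufMappingPartitionLock(bufHash);

        /* Check whether the block is in the buffer pool; if not, skip it */
        LWLockAcquire(bufPartitionLock, LW_SHARED);
        buf_id = BufTableLookup(&bufTag, bufHash);
        LWLockRelease(bufPartitionLock);

        if (buf_id < 0)
            continue;

        bufHdr = GetBufferDescriptor(buf_id);

        /* Re-check the tag under the header spinlock before invalidating */
        buf_state = LockBufHdr(bufHdr);

        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
            bufHdr->tag.forkNum == forkNum &&
            bufHdr->tag.blockNum >= firstDelBlock)
            InvalidateBuffer(bufHdr);   /* releases the spinlock */
        else
            UnlockBufHdr(bufHdr, buf_state);
    }
}

The cost of this path grows with the number of blocks to invalidate (one
hash lookup per block), which is exactly why it only wins below the
NBuffers/32 threshold measured above.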
On Thu, January 7, 2021 5:36 PM (JST), Amit Kapila wrote:
> On Wed, Jan 6, 2021 at 6:43 PM k.jamison@fujitsu.com
> <k.jamison@fujitsu.com> wrote:
> >
> > [Results for VACUUM on single relation]
> > [...]
> > So in the attached V39 patches, I have updated the threshold
> > BUF_DROP_FULL_SCAN_THRESHOLD to NBuffers/32.
>
> Thanks for the detailed tests. NBuffers/32 seems like an appropriate
> value for the threshold based on these results. I would like to
> slightly modify part of the commit message in the first patch as below
> [1]; otherwise, I am fine with the changes. Unless you or anyone else
> has any more comments, I am planning to push the 0001 and 0002
> sometime next week.
>
> [1]
> "The recovery path of DropRelFileNodeBuffers() is optimized so that
> scanning of the whole buffer pool can be avoided when the number of
> blocks to be truncated in a relation is below a certain threshold. For
> such cases, we find the buffers by doing lookups in the BufMapping
> table. This improves the performance by more than 100 times in many
> cases when several small tables (tested with 1000 relations) are
> truncated and where the server is configured with a large value of
> shared buffers (greater than 100GB)."

Thank you for taking a look at the results of the tests. They are also
consistent with the results from Tang.
The commit message LGTM.

Regards,
Kirk Jamison
At Thu, 7 Jan 2021 09:25:22 +0000, "k.jamison@fujitsu.com"
<k.jamison@fujitsu.com> wrote in:
> On Thu, January 7, 2021 5:36 PM (JST), Amit Kapila wrote:
> > Thanks for the detailed tests. NBuffers/32 seems like an appropriate
> > value for the threshold based on these results. [...] Unless you or
> > anyone else has any more comments, I am planning to push the 0001 and
> > 0002 sometime next week.
>
> Thank you for taking a look at the results of the tests. They are also
> consistent with the results from Tang.
> The commit message LGTM.

+1.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Jan 8, 2021 at 7:03 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Thu, 7 Jan 2021 09:25:22 +0000, "k.jamison@fujitsu.com"
> <k.jamison@fujitsu.com> wrote in:
> > > Thanks for the detailed tests. NBuffers/32 seems like an appropriate
> > > value for the threshold based on these results. [...] I am planning
> > > to push the 0001 and 0002 sometime next week.
> >
> > Thank you for taking a look at the results of the tests. They are also
> > consistent with the results from Tang.
> > The commit message LGTM.
>
> +1.

I have pushed the 0001.

--
With Regards,
Amit Kapila.
At Tue, 12 Jan 2021 08:49:53 +0530, Amit Kapila <amit.kapila16@gmail.com>
wrote in:
> On Fri, Jan 8, 2021 at 7:03 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > [...]
> > +1.
>
> I have pushed the 0001.

Thank you for committing this.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Jan 13, 2021 at 7:39 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Tue, 12 Jan 2021 08:49:53 +0530, Amit Kapila
> <amit.kapila16@gmail.com> wrote in:
> > [...]
> > I have pushed the 0001.
>
> Thank you for committing this.

Pushed 0002 as well.

--
With Regards,
Amit Kapila.
On Wed, January 13, 2021 2:15 PM (JST), Amit Kapila wrote:
> On Wed, Jan 13, 2021 at 7:39 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > [...]
> > Thank you for committing this.
>
> Pushed 0002 as well.

Thank you very much for committing those two patches, and thanks to
everyone here who contributed to simplifying the approaches, code
reviews, testing, etc.

I compiled with --enable-coverage and checked whether the newly-added
code and updated parts are covered by tests. Yes, the lines are hit,
including the updated lines of DropRelFileNodeBuffers(),
DropRelFileNodesAllBuffers(), smgrdounlinkall(), and smgrnblocks().
The newly added APIs are covered too: FindAndDropRelFileNodeBuffers()
and smgrnblocks_cached(). However, the parts where
UnlockBufHdr(bufHdr, buf_state); is called are not hit. But I noticed
the same is true of pre-existing functions in bufmgr.c.

Thank you very much again.

Regards,
Kirk Jamison
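For reference, here is a minimal sketch of what the new cached-size helper
does, based on the behavior described in this thread (simplified; see
smgr.c in the committed tree for the authoritative version):

/*
 * smgrnblocks_cached() -- Sketch: return the cached number of blocks in
 * the supplied relation fork, or InvalidBlockNumber if the cached value
 * cannot be trusted.
 *
 * The cached value is reliable only during recovery, where there is no
 * concurrent file extension; callers that get InvalidBlockNumber fall
 * back to the full shared_buffers scan in DropRelFileNodeBuffers().
 */
BlockNumber
smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum)
{
    if (InRecovery &&
        reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
        return reln->smgr_cached_nblocks[forknum];

    return InvalidBlockNumber;
}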