RE: [Patch] Optimize dropping of relation buffers using dlist - Mailing list pgsql-hackers

From k.jamison@fujitsu.com
Subject RE: [Patch] Optimize dropping of relation buffers using dlist
Msg-id OSBPR01MB2341DDECAFA4858A161AF887EF480@OSBPR01MB2341.jpnprd01.prod.outlook.com
In response to Re: [Patch] Optimize dropping of relation buffers using dlist  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Saturday, August 1, 2020 5:24 AM, Andres Freund wrote:

Hi,
Thank you for your constructive review and comments.
Sorry for the late reply.

> Hi,
>
> On 2020-07-31 15:50:04 -0400, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > Indeed. The buffer mapping hashtable already is visible as a major
> > > bottleneck in a number of workloads. Even in readonly pgbench if s_b
> > > is large enough (so the hashtable is larger than the cache). Not to
> > > speak of things like a cached sequential scan with a cheap qual and wide
> rows.
> >
> > To be fair, the added overhead is in buffer allocation not buffer lookup.
> > So it shouldn't add cost to fully-cached cases.  As Tomas noted
> > upthread, the potential trouble spot is where the working set is
> > bigger than shared buffers but still fits in RAM (so there's no actual
> > I/O needed, but we do still have to shuffle buffers a lot).
>
> Oh, right, not sure what I was thinking.
>
>
> > > Wonder if the temporary fix is just to do explicit hashtable probes
> > > for all pages iff the size of the relation is < s_b / 500 or so.
> > > That'll address the case where small tables are frequently dropped -
> > > and dropping large relations is more expensive from the OS and data
> > > loading perspective, so it's not gonna happen as often.
> >
> > Oooh, interesting idea.  We'd need a reliable idea of how long the
> > relation is (preferably without adding an lseek call), but maybe
> > that's do-able.
>
> IIRC we already do smgrnblocks nearby, when doing the truncation (to figure out
> which segments we need to remove). Perhaps we can arrange to combine the
> two? The layering probably makes that somewhat ugly :(
>
> We could also just use pg_class.relpages. It'll probably mostly be accurate
> enough?
>
> Or we could just cache the result of the last smgrnblocks call...
>
>
> One of the cases where this type of strategy is most interesting to me is the partial
> truncations that autovacuum does... There we even know the range of tables
> ahead of time.

Konstantin tested it on various workloads and saw no regression.
But I understand the sentiment about the added overhead in BufferAlloc.
Regarding the case where the patch could affect workloads that fit into
RAM but not into shared buffers, could one of Andres' suggestions above address
that, in addition to this patch's possible shared-invalidation fix? Could that settle
the added overhead in BufferAlloc() as a temporary fix?
Thomas Munro is also working on caching relation sizes [1]; maybe that way we
could get the latest known relation size. Currently that's possible only during
recovery, in smgrnblocks.
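To make sure I understand the size-threshold idea, here is a minimal standalone
sketch of how I read it (the names and the exact divisor are hypothetical
illustrations, not actual PostgreSQL code): probe the buffer-mapping table once
per block only when the relation is small relative to shared buffers, and fall
back to the full buffer scan otherwise.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical sketch of the strategy Andres suggests: for a relation
 * smaller than shared_buffers / 500, doing one hash-table probe per
 * block is cheaper than scanning every buffer header, so drop its
 * buffers by explicit probes; otherwise keep the full scan.
 */
#define PROBE_THRESHOLD_DIVISOR 500

static bool
use_per_block_probes(size_t rel_nblocks, size_t shared_buffers)
{
    /* O(rel_nblocks) probes vs. an O(shared_buffers) scan */
    return rel_nblocks < shared_buffers / PROBE_THRESHOLD_DIVISOR;
}
```

The open question would then be where rel_nblocks comes from cheaply
(cached smgrnblocks, or pg_class.relpages as you note above).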

Tom Lane wrote:
> But aside from that, I noted a number of things I didn't like a bit:
>
> * The amount of new shared memory this needs seems several orders of
> magnitude higher than what I'd call acceptable: according to my measurements
> it's over 10KB per shared buffer!  Most of that is going into the
> CachedBufTableLock data structure, which seems fundamentally misdesigned ---
> how could we be needing a lock per map partition *per buffer*?  For comparison,
> the space used by buf_table.c is about 56 bytes per shared buffer; I think this
> needs to stay at least within hailing distance of there.
>
> * It is fairly suspicious that the new data structure is manipulated while holding
> per-partition locks for the existing buffer hashtable.
> At best that seems bad for concurrency, and at worst it could result in deadlocks,
> because I doubt we can assume that the new hash table has partition boundaries
> identical to the old one.
>
> * More generally, it seems like really poor design that this has been written
> completely independently of the existing buffer hash table.
> Can't we get any benefit by merging them somehow?

The original aim was simply to shorten the recovery process; the speedups in the
vacuum and truncate paths are an added bonus.
Given that we don't yet have a shared invalidation mechanism in place (such as a
radix-tree buffer mapping, which is complex), I thought a patch like mine could be an
alternative approach to that. So I want to improve the patch further.
I hope you can help me clarify the direction, so that I can avoid drifting away
from what the community wants:
 1. Improve both normal operation and the recovery process
 2. Improve the recovery process only

For 1, the current patch aims at that, but further design improvement is needed.
It would be ideal to modify BufferDesc, but it cannot be expanded any further because
that would exceed the CPU cache-line size. So I added new data structures (hash table,
dlist, lock) instead of modifying the existing ones.
The new hash table is kept consistent with the existing one by using the same
RelFileNode in the key and by taking a lock when inserting and deleting buffers from
the buffer table, as well as during lookups. As for the partition locking, I added it
to reduce lock contention.
Tomas Vondra reported a regression, mainly due to the buffer-mapping locks, in V4 and
earlier patch versions. So from V5 on, a spinlock is taken when inserting/deleting
buffers, to prevent modification while a concurrent lookup is happening; an LWLock is
acquired for lookup operations.
If we go in this direction, I hope to address Tom's comments in the next patch version.
I admit that this patch needs rework on shmem resource consumption and needs its
design/approach clarified further, i.e., how it affects the existing buffer allocation
and invalidation processes, the locking mechanism, etc.
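To illustrate what I mean by keeping the two tables keyed consistently, here is a
simplified, hypothetical sketch (the real code uses the existing BufferTag and
hash_any(); the struct name and the FNV-1a hash here are just stand-ins for
illustration):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical simplified key: the new table keys on the same
 * RelFileNode fields (plus fork number) as the existing
 * buffer-mapping table, so entries for one relation line up
 * across both tables.
 */
typedef struct CachedBufKey
{
    uint32_t spcNode;   /* tablespace OID */
    uint32_t dbNode;    /* database OID */
    uint32_t relNode;   /* relation file node */
    int32_t  forkNum;   /* fork number */
} CachedBufKey;

/* FNV-1a over the key bytes; a stand-in for PostgreSQL's hash_any() */
static uint32_t
cached_buf_key_hash(const CachedBufKey *key)
{
    const unsigned char *p = (const unsigned char *) key;
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < sizeof(CachedBufKey); i++)
    {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}
```

Since both tables derive their keys from the same fields, dropping a relation can
locate its entries in the new table without guessing at partition boundaries.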

If we're going for 2, Konstantin suggested an idea in the previous email:

> I wonder if you have considered case of local hash (maintained only during recovery)?
> If there is after-crash recovery, then there will be no concurrent
> access to shared buffers and this hash will be up-to-date.
> In case of a hot-standby replica we can use some simple invalidation (just
> one flag or counter which indicates that the buffer cache was updated).
> This hash also can be constructed on demand when DropRelFileNodeBuffers
> is called the first time (so we have to scan all buffers once, but subsequent
> drop operations will be fast).

I'm examining this, but I am not sure I've understood it correctly, so please correct
me if I'm wrong.
I think the suggestion above is that the postgres startup process use a local hash table
to keep track of the buffers of relations. Since there may be other read-only sessions
that read from disk, evict cached blocks, and modify shared_buffers, the flag/counter
would be updated accordingly. We could build the hash during recovery, then release it
when recovery completes.

I haven't looked deeply into the source code yet, but maybe we can modify the REDO loop
(the main redo do-while loop) in StartupXLOG() once read-only connections are consistent.
It would also be beneficial to construct this local hash when DropRelFileNodeBuffers()
is called for the first time, so the whole of shared buffers is scanned initially; then,
as you mentioned, subsequent drops will be fast (similar to what the patch does).
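My reading of the idea, as a hypothetical standalone sketch (the names and the
counter are invented for illustration, not actual PostgreSQL code): build the local
hash on the first drop call, and treat a bumped shared update counter as wholesale
invalidation for the hot-standby case.

```c
#include <stdbool.h>
#include <stdint.h>

/* Counter a real implementation would bump whenever shared buffers change */
static uint64_t shared_update_counter = 0;

typedef struct RecoveryBufHash
{
    bool     built;             /* constructed on the first drop call */
    uint64_t built_at_counter;  /* counter value when last built */
    /* ... per-relation buffer lists would live here ... */
} RecoveryBufHash;

static void
build_hash(RecoveryBufHash *h)
{
    /* The real code would scan all of shared_buffers once here */
    h->built = true;
    h->built_at_counter = shared_update_counter;
}

/*
 * Returns true when the cached hash was still valid (fast path for
 * subsequent drops); false when it had to be (re)built by a full scan.
 */
static bool
ensure_hash_fresh(RecoveryBufHash *h)
{
    if (h->built && h->built_at_counter == shared_update_counter)
        return true;
    build_hash(h);
    return false;
}
```

So only the first drop (or a drop after a concurrent buffer change) pays for a full
scan; every other drop is a cheap local lookup.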

Do you think this is feasible to implement? Or should we explore another approach?

I'd really appreciate your ideas, feedback, suggestions, and advice.
Thank you again for the review.

Regards
Kirk Jamison

[1] https://www.postgresql.org/message-id/CA%2BhUKGKEW7-9pq%2Bs2_4Q-Fcpr9cc7_0b3pkno5qzPKC3y2nOPA%40mail.gmail.com


