Re: Custom explain options - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Custom explain options
Date
Msg-id c791e16b-1658-4369-bc10-d65cc1cf23c8@enterprisedb.com
In response to Re: Custom explain options  (Konstantin Knizhnik <knizhnik@garret.ru>)
Responses Re: Custom explain options
List pgsql-hackers
On 1/13/24 17:13, Konstantin Knizhnik wrote:
> 
> On 13/01/2024 4:51 pm, Tomas Vondra wrote:
>>
>> On 1/12/24 20:30, Konstantin Knizhnik wrote:
>>> On 12/01/2024 7:03 pm, Tomas Vondra wrote:
>>>> On 10/21/23 14:16, Konstantin Knizhnik wrote:
>>>>> Hi hackers,
>>>>>
>>>>> The EXPLAIN statement has a list of options (e.g. ANALYZE, BUFFERS,
>>>>> COSTS, ...) which help to provide useful details of query execution.
>>>>> In Neon we have added a PREFETCH option which shows information about
>>>>> page prefetching during query execution (prefetching is more critical
>>>>> for the Neon architecture because of the separation of compute and
>>>>> storage, so it is implemented not only for bitmap heap scan as in
>>>>> vanilla Postgres, but also for seqscan, index scan and index-only
>>>>> scan). Another possible candidate for EXPLAIN options is the local
>>>>> file cache (an extra caching layer above shared buffers which is used
>>>>> to somewhat replace the file system cache of standalone Postgres).
>>>> Not quite related to this patch about EXPLAIN options, but can you
>>>> share some details about how you implemented prefetching for the other
>>>> nodes?
>>>>
>>>> I'm asking because I've been working on prefetching for index scans, so
>>>> I'm wondering if there's a better way to do this, or how to do it in a
>>>> way that would allow neon to maybe leverage that too.
>>>>
>>>> regards
>>>>
>>> Yes, I am looking at your PR. What we have implemented in Neon is more
>>> specific to the Neon architecture, where storage is separated from
>>> compute. So each page not found in shared buffers has to be downloaded
>>> from the page server, which adds quite noticeable latency because of
>>> the network round trip. While vanilla Postgres can rely on the OS file
>>> system cache when a page is not found in shared buffers (access to the
>>> OS file cache is certainly slower than to shared buffers because of the
>>> syscall and copying of the page, but the performance penalty is not
>>> very large - less than 15%), Neon has no local files and so has to send
>>> a request over the socket.
>>>
>>> This is why we have to perform aggressive prefetching whenever possible
>>> (i.e. whenever it is possible to predict the order of subsequent
>>> pages). Unlike vanilla Postgres, which implements prefetch only for
>>> bitmap heap scan, we have implemented it for seqscan, index scan,
>>> index-only scan, bitmap heap scan, vacuum and pg_prewarm.
>>> The main difference between the Neon prefetch and the vanilla Postgres
>>> prefetch is that the former is backend-specific, so each backend
>>> prefetches only the pages which it needs.
>>> This is why we had to rewrite prefetch for bitmap heap scan, which uses
>>> `fadvise` and assumes that pages prefetched into the file cache by one
>>> backend can be used by any other backend.
>>>
>> I do understand why prefetching is important in neon (likely more than
>> for core postgres). I'm interested in how it's actually implemented,
>> whether it's somehow similar to how my patch does things or in some
>> different (perhaps neon-specific) way, and if the approaches are
>> different then what are the pros/cons. And so on.
>>
>> So is it implemented in the neon-specific storage, somehow, or where/how
>> does neon issue the prefetch requests?
> 
> Neon mostly preserves the Postgres prefetch mechanism, so we are using
> PrefetchBuffer, which checks if the page is present in shared buffers
> and, if not, calls smgrprefetch. We are using our own storage manager
> implementation which, instead of reading pages from local disk,
> downloads them from the page server.
> And the prefetch implementation in the Neon storage manager is obviously
> also different from the one in vanilla Postgres, which uses
> posix_fadvise.
> The Neon prefetch implementation inserts a prefetch request into a ring
> buffer and sends it to the page server. When a read operation is
> performed, we check if there is a corresponding prefetch request in the
> ring buffer and, if so, wait for its completion.
> 

Thanks. Sure, neon has to use some custom prefetch implementation and not
posix_fadvise, considering there's no local page cache in the
architecture.
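
Just to check I understand the mechanism you're describing, my mental
model of the per-backend ring is roughly the following (a toy standalone
sketch, every name in it is invented - it's not meant to be the neon
code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNo;           /* toy stand-in for BlockNumber */

#define RING_SIZE 128               /* invented limit, just for the sketch */

typedef struct PrefetchSlot
{
    BlockNo     blkno;              /* block requested from the page server */
    bool        in_flight;          /* request sent, reply not consumed yet */
} PrefetchSlot;

static PrefetchSlot ring[RING_SIZE];
static uint64_t ring_head = 0;      /* number of requests issued so far */

/* prefetch: remember the request in the ring, send it to the page server */
static void
prefetch_issue(BlockNo blkno)
{
    PrefetchSlot *slot = &ring[ring_head++ % RING_SIZE];

    slot->blkno = blkno;
    slot->in_flight = true;
    printf("send getpage request for block %u\n", (unsigned) blkno);
}

/* read: if the block was prefetched, wait for that reply, else ask now */
static void
read_block(BlockNo blkno)
{
    for (uint64_t i = 0; i < RING_SIZE && i < ring_head; i++)
    {
        if (ring[i].in_flight && ring[i].blkno == blkno)
        {
            printf("wait for prefetched reply for block %u\n",
                   (unsigned) blkno);
            ring[i].in_flight = false;
            return;
        }
    }
    printf("no prefetch in flight, synchronous getpage for block %u\n",
           (unsigned) blkno);
}

int
main(void)
{
    for (BlockNo b = 0; b < 4; b++)
        prefetch_issue(b);
    for (BlockNo b = 0; b < 6; b++)
        read_block(b);              /* blocks 4 and 5 fall back to sync reads */
    return 0;
}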

The thing that was not clear to me is who decides what to prefetch,
which code issues the prefetch requests etc. In the github links you
shared I see it happens in the index AM code (in nbtsearch.c).

That's interesting, because that's what my first prefetching patch did
too - not the same way, ofc, but in the same layer. Simply because it
seemed like the simplest way to do that. But the feedback was that it's the
wrong layer, and that it should happen in the executor. And I agree with
that - the reasons are somewhere in the other thread.

Based on what I saw in the neon code, I think it should be possible for
neon to use "my" approach too, but that only works for the index scans,
ofc. Not sure what to do about the other places.

> As I already wrote - prefetch is done locally for each backend, and each
> backend has its own connection to the page server. This can be changed
> in the future when we implement multiplexing of page server connections,
> but right now prefetch is local. And certainly prefetch can improve
> performance only if we correctly predict subsequent page requests.
> If not - then the page server does useless work and the backend has to
> wait for and consume all the issued prefetch requests. This is why in
> the prefetch implementation for most of the nodes we start with a
> minimal prefetch distance and then increase it. It allows prefetch to be
> performed only for such queries where it is really efficient (OLAP) and
> doesn't degrade performance of simple OLTP queries.
> 

Not sure I understand what's so important about prefetches being "local"
for each backend. I mean, even in postgres each backend prefetches its
own buffers, no matter what the other backends do. Although, neon
probably doesn't have the cross-backend sharing through shared buffers
etc., right?

FWIW I certainly agree with the goal to not harm queries that can't
benefit from prefetching. Ramping up the prefetch distance is something
my patch does too, for exactly this reason.
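
FWIW the ramp-up in my patch is conceptually the same thing. Something
like this simplified standalone sketch (not the actual patch code,
issue_prefetch() is just a stand-in for the real prefetch call):

#include <stdint.h>
#include <stdio.h>

/* hypothetical helper standing in for the real prefetch call */
static void
issue_prefetch(uint32_t blkno)
{
    printf("prefetch block %u\n", (unsigned) blkno);
}

static int      prefetch_distance = 0;   /* current lookahead, starts at 0 */
static int64_t  last_prefetched = -1;    /* highest block already requested */

/* called once per block the scan actually reads */
static void
prefetch_ahead(uint32_t next_block, int max_distance)
{
    /* prefetch blocks within the current distance not requested yet */
    for (int64_t b = (int64_t) next_block + 1;
         b <= (int64_t) next_block + prefetch_distance;
         b++)
    {
        if (b > last_prefetched)
        {
            issue_prefetch((uint32_t) b);
            last_prefetched = b;
        }
    }

    /*
     * Ramp up: one more block of lookahead per step, capped at the limit,
     * so a single-row lookup issues no prefetches at all.
     */
    if (prefetch_distance < max_distance)
        prefetch_distance++;
}

int
main(void)
{
    /* e.g. max_distance would be effective_io_concurrency for the relation */
    for (uint32_t blk = 0; blk < 6; blk++)
        prefetch_ahead(blk, 4);
    return 0;
}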

> Our prefetch implementation is also compatible with parallel plans, but
> here we need to reserve some range of pages for each parallel worker
> instead of picking pages from some shared queue on demand. It is one of
> the major differences from the Postgres prefetch using posix_fadvise:
> each backend should prefetch only those pages which it is going to read.
> 

Understood. I have no opinion on this, though.

>>> Concerning index scans we have implemented two different approaches:
>>> for index-only scan we try to prefetch leaf pages, and for index scan
>>> we prefetch the referenced heap pages.
>> In my experience the IOS handling (only prefetching leaf pages) is very
>> limiting, and may easily lead to index-only scans being way slower than
>> regular index scans. Which is super surprising for users. It's why I
>> ended up improving the prefetcher to optionally look at the VM etc.
> 
> Well, my assumption was the following: prefetch is most efficient for
> OLAP queries.
> Although HTAP (hybrid transactional/analytical processing) is a popular
> trend now, the classical model is that analytic queries are performed on
> "historical" data, which has already been processed by vacuum and has
> all-visible bits set in the VM.
> Maybe this assumption is wrong, but it seems to me that if most heap
> pages are not marked as all-visible, then the optimizer should prefer a
> bitmap scan to an index-only scan.

I think this assumption is generally reasonable, but it hinges on the
assumption that OLAP queries have most indexes recently vacuumed and
all-visible. I'm not sure it's wise to rely on that.

Without prefetching it's not that important - the worst thing that would
happen is that the IOS degrades into regular index-scan. But with
prefetching these plans can "invert" with respect to cost.

I'm not saying it's terrible or that IOS must have prefetching, but I
think it's something users may run into fairly often. And it led me to
rework the prefetching so that IOS can prefetch too ...
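
To be concrete, by looking at the VM I mean essentially this kind of
check before prefetching the heap page (just a sketch of the idea, not
the exact code from the patch - heapRel, blkno and vmbuffer would come
from the scan state):

#include "postgres.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"

/*
 * Sketch: during an index-only scan, only heap pages that are not
 * all-visible will actually be fetched, so those are the only ones
 * worth prefetching.
 */
static void
ios_maybe_prefetch_heap_page(Relation heapRel, BlockNumber blkno,
                             Buffer *vmbuffer)
{
    if (!VM_ALL_VISIBLE(heapRel, blkno, vmbuffer))
        PrefetchBuffer(heapRel, MAIN_FORKNUM, blkno);
}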

> And for the combination of index and heap bitmap scans we can
> efficiently prefetch both index and heap pages.
> 
>>> In both cases we start from prefetch distance 0 and increase it until
>>> it reaches `effective_io_concurrency` for this relation. Doing so we
>>> try to avoid prefetching useless pages and slowing down "point"
>>> lookups returning one or a few records.
>>>
>> Right, the regular prefetch ramp-up. My patch does the same thing.
>>
>>> If you are interested, you can look at our implementation in the neon
>>> repo: all sources are available. But briefly speaking, each backend
>>> has its own prefetch ring (prefetch requests which are waiting for a
>>> response). The key idea is that we can send several prefetch requests
>>> to the page server and then receive multiple replies. It allows us to
>>> increase the speed of OLAP queries up to 10 times.
>>>
>> Can you point me to the actual code / branch where it happens? I did
>> check the github repo, but I don't see anything relevant in the default
>> branch (REL_15_STABLE_neon). There are some "prefetch" branches, but
>> those seem abandoned.
> 
> Implementation of the prefetch mechanism is in the Neon extension:
> https://github.com/neondatabase/neon/blob/60ced06586a6811470c16c6386daba79ffaeda13/pgxn/neon/pagestore_smgr.c#L205
> 
> But the concrete implementation of prefetch for particular nodes is
> certainly inside Postgres.
> For example, if you are interested in how it is implemented for index
> scan, then please look at:
> https://github.com/neondatabase/postgres/blob/c1c2272f436ed9231f6172f49de219fe71a9280d/src/backend/access/nbtree/nbtsearch.c#L844
> https://github.com/neondatabase/postgres/blob/c1c2272f436ed9231f6172f49de219fe71a9280d/src/backend/access/nbtree/nbtsearch.c#L1166
> https://github.com/neondatabase/postgres/blob/c1c2272f436ed9231f6172f49de219fe71a9280d/src/backend/access/nbtree/nbtsearch.c#L1467
> https://github.com/neondatabase/postgres/blob/c1c2272f436ed9231f6172f49de219fe71a9280d/src/backend/access/nbtree/nbtsearch.c#L1625
> https://github.com/neondatabase/postgres/blob/c1c2272f436ed9231f6172f49de219fe71a9280d/src/backend/access/nbtree/nbtsearch.c#L2629
> 

Thanks! Very helpful. As I said, I ended up moving the prefetching to
the executor. For indexscans I think it should be possible for neon to
benefit from that (in a way, it doesn't need to do anything except for
overriding what PrefetchBuffer does). Not sure about the other places
where neon needs to prefetch; I don't have the ambition to rework those.

> 
>>
>>> Heikki thinks that prefetch can somehow be combined with the async I/O
>>> proposal (based on io_uring). But right now they have nothing in
>>> common.
>>>
>> I can imagine async I/O being useful here, but I find the flow of I/O
>> requests is quite complex / goes through multiple layers. Or maybe I
>> just don't understand how it should work.
> I also do not think that it will be possible to marry these two approaches.

I didn't actually say it would be impossible - I think it seems like a
use case where async I/O should be a natural fit. But I'm not sure how to
do that in a way that would not be super confusing and/or fragile when
something unexpected happens (like a rescan, or maybe some change to the
index structure - page split, etc.).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


