Thread: Custom explain options
Hi hackers,

The EXPLAIN statement has a list of options (e.g. ANALYZE, BUFFERS, COSTS, ...) which help to provide useful details of query execution.
In Neon we have added a PREFETCH option which shows information about page prefetching during query execution (prefetching is more critical for the Neon architecture because of the separation of compute and storage, so it is implemented not only for bitmap heap scan as in vanilla Postgres, but also for seqscan, indexscan and indexonly scan). Another possible candidate for an explain option is the local file cache (an extra caching layer above shared buffers which is used to somehow replace the file system cache of standalone Postgres).

I think it would be nice to have a generic mechanism which allows extensions to add their own options to EXPLAIN.
I have attached a patch implementing such a mechanism (also available as PR: https://github.com/knizhnik/postgres/pull/1 )

I have demonstrated this mechanism using the Bloom extension, just to report the number of Bloom matches.
I am not sure that it is really useful information, but it serves mostly as an example:

    explain (analyze,bloom) select * from t where pk=2000;
                                                           QUERY PLAN
    -------------------------------------------------------------------------------------------------------------------------
     Bitmap Heap Scan on t  (cost=15348.00..15352.01 rows=1 width=4) (actual time=25.244..25.939 rows=1 loops=1)
       Recheck Cond: (pk = 2000)
       Rows Removed by Index Recheck: 292
       Heap Blocks: exact=283
       Bloom: matches=293
       ->  Bitmap Index Scan on t_pk_idx  (cost=0.00..15348.00 rows=1 width=0) (actual time=25.147..25.147 rows=293 loops=1)
             Index Cond: (pk = 2000)
             Bloom: matches=293
     Planning:
       Bloom: matches=0
     Planning Time: 0.387 ms
     Execution Time: 26.053 ms
    (12 rows)

There are two known issues with this proposal:

1. I have to limit the total size of all custom metrics; right now it is limited to 128 bytes. This is done to keep Instrumentation and some other data structures fixed size. Otherwise maintaining the varying parts of these structures is ugly, especially in shared memory.

2. A custom extension is added by means of the RegisterCustomInsrumentation function, which is called from _PG_init. But _PG_init is called when the extension is loaded, and it is loaded on demand when one of the extension's functions is called (except when the extension is included in the shared_preload_libraries list); the Bloom extension doesn't require it. So if the first statement executed in your session is:

    explain (analyze,bloom) select * from t where pk=2000;

...you will get an error:

    ERROR:  unrecognized EXPLAIN option "bloom"
    LINE 1: explain (analyze,bloom) select * from t where pk=2000;

It happens because at the moment when the explain statement parses its options, the Bloom index is not yet selected, so the bloom extension is not loaded and RegisterCustomInsrumentation has not yet been called. If we repeat the query, the proper result will be displayed (see above).
+ foreach (lc, pgCustInstr)
+ {
+ CustomInstrumentation *ci = (CustomInstrumentation*) lfirst(lc);
+
+ ci->selected = false;
+ }
Attachment
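The two known issues can be illustrated with a small standalone sketch. It is not the patch's code: the names `register_custom_instr`, `select_explain_option`, `CustomInstr` and the limit of 8 extensions are hypothetical, and only the 128-byte budget comes from the message above. Each extension reserves a slice of one fixed-size buffer at load time, and an EXPLAIN option is recognized only after the owning extension has registered it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; this is not the patch's actual API. */
#define CUSTOM_USAGE_MAX 128	/* total bytes shared by all extensions */
#define MAX_CUSTOM_INSTR 8		/* assumed cap on registered extensions */

typedef struct CustomInstr
{
	const char *option;		/* EXPLAIN option name, e.g. "bloom" */
	size_t		size;		/* bytes of counters reserved (8-byte aligned) */
	size_t		offset;		/* start of this slice in the shared buffer */
	bool		selected;	/* set when the option appears in EXPLAIN (...) */
} CustomInstr;

static CustomInstr registry[MAX_CUSTOM_INSTR];
static int	n_registered = 0;
static size_t used_bytes = 0;

/*
 * Would be called from the extension's load-time init function.
 * Returns false when the fixed budget or the registry is exhausted.
 */
bool
register_custom_instr(const char *option, size_t size)
{
	assert(option != NULL);

	/* round up so every slice starts on an 8-byte boundary */
	size = (size + 7) & ~(size_t) 7;

	if (n_registered == MAX_CUSTOM_INSTR ||
		used_bytes + size > CUSTOM_USAGE_MAX)
		return false;

	registry[n_registered] = (CustomInstr) {option, size, used_bytes, false};
	used_bytes += size;
	n_registered++;
	return true;
}

/*
 * Would be called while EXPLAIN parses its option list.  A false result
 * corresponds to the "unrecognized EXPLAIN option" error.
 */
bool
select_explain_option(const char *name)
{
	for (int i = 0; i < n_registered; i++)
	{
		if (strcmp(registry[i].option, name) == 0)
		{
			registry[i].selected = true;
			return true;
		}
	}
	return false;
}
```

In this sketch, calling `select_explain_option("bloom")` before `register_custom_instr("bloom", ...)` fails, which mirrors the first-statement-in-session error described above; the same call succeeds once the extension has been loaded.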
On 21/10/2023 19:16, Konstantin Knizhnik wrote:
> EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS,
> COST,...) which help to provide useful details of query execution.
> In Neon we have added PREFETCH option which shows information about page
> prefetching during query execution (prefetching is more critical for Neon
> architecture because of separation of compute and storage, so it is
> implemented not only for bitmap heap scan as in Vanilla Postgres, but
> also for seqscan, indexscan and indexonly scan). Another possible
> candidate for explain options is local file cache (extra caching layer
> above shared buffers which is used to somehow replace file system cache
> in standalone Postgres).
>
> I think that it will be nice to have a generic mechanism which allows
> extensions to add its own options to EXPLAIN.

Generally, I welcome this idea: Extensions can already do a lot of work, and they should have a tool to report their state, not only into the log.
But I think your approach needs to be elaborated. At first, it would be better to allow registering extended instruments for specific node types to avoid unneeded calls.
Secondly, looking into the Instrumentation usage, I don't see the reason to limit the size: as I see everywhere it exists locally or in the DSA where its size is calculated on the fly. So, by registering an extended instrument, we can reserve a slot for the extension. The actual size of underlying data can be provided by the extension routine.

--
regards,
Andrei Lepikhov
Postgres Professional
On 30/11/2023 5:59 am, Andrei Lepikhov wrote:
> On 21/10/2023 19:16, Konstantin Knizhnik wrote:
>> EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS,
>> COST,...) which help to provide useful details of query execution.
>> In Neon we have added PREFETCH option which shows information about
>> page prefetching during query execution (prefetching is more critical
>> for Neon
>> architecture because of separation of compute and storage, so it is
>> implemented not only for bitmap heap scan as in Vanilla Postgres, but
>> also for seqscan, indexscan and indexonly scan). Another possible
>> candidate for explain options is local file cache (extra caching
>> layer above shared buffers which is used to somehow replace file
>> system cache in standalone Postgres).
>>
>> I think that it will be nice to have a generic mechanism which allows
>> extensions to add its own options to EXPLAIN.
>
> Generally, I welcome this idea: Extensions can already do a lot of
> work, and they should have a tool to report their state, not only into
> the log.
> But I think your approach needs to be elaborated. At first, it would
> be better to allow registering extended instruments for specific node
> types to avoid unneeded calls.
> Secondly, looking into the Instrumentation usage, I don't see the
> reason to limit the size: as I see everywhere it exists locally or in
> the DSA where its size is calculated on the fly. So, by registering an
> extended instrument, we can reserve a slot for the extension. The
> actual size of underlying data can be provided by the extension routine.
>

Thank you for the review.

I agree that support for extended instruments is desirable. I just tried to minimize the number of changes to keep this patch small.

Concerning limiting the instrumentation size: maybe I missed something, but I do not see any good way to handle this:

```
./src/backend/executor/nodeMemoize.c:1106: si = &node->shared_info->sinstrument[ParallelWorkerNumber];
./src/backend/executor/nodeAgg.c:4322: si = &node->shared_info->sinstrument[ParallelWorkerNumber];
./src/backend/executor/nodeIncrementalSort.c:107: instrumentSortedGroup(&(node)->shared_info->sinfo[ParallelWorkerNumber].groupName##GroupInfo, \
./src/backend/executor/execParallel.c:808: InstrInit(&instrument[i], estate->es_instrument);
./src/backend/executor/execParallel.c:1052: InstrAggNode(planstate->instrument, &instrument[n]);
./src/backend/executor/execParallel.c:1306: InstrAggNode(&instrument[ParallelWorkerNumber], planstate->instrument);
./src/backend/commands/explain.c:1763: Instrumentation *instrument = &w->instrument[n];
./src/backend/commands/explain.c:2168: Instrumentation *instrument = &w->instrument[n];
```

In all these cases we are using an array of `Instrumentation`, and if it contains a varying part, it is not clear where to place it.
Yes, there is also code which serializes and sends instrumentation between worker processes, and I have updated this code in my PR to send the actual amount of custom instrumentation data. But that cannot help with the cases above.
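To make the array concern concrete: an expression such as `&instrument[n]` relies on `sizeof(Instrumentation)` being a compile-time constant. The sketch below shows what each such site would need instead if the element size were only known at run time. It is a standalone illustration with hypothetical names (`InstrHeader`, `InstrArray`, `instr_array_get`), not code from Postgres or from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for Instrumentation followed by a variable tail. */
typedef struct InstrHeader
{
	double		ntuples;
	/* custom counters follow; their size is known only at run time */
} InstrHeader;

typedef struct InstrArray
{
	size_t		stride;		/* header + custom bytes, rounded up to 8 */
	int			nworkers;
	char	   *base;		/* nworkers * stride bytes, zero-initialized */
} InstrArray;

static size_t
round_up8(size_t n)
{
	return (n + 7) & ~(size_t) 7;
}

/* Allocate one slot per worker; base is NULL if allocation fails. */
InstrArray
instr_array_create(int nworkers, size_t custom_bytes)
{
	InstrArray	a;

	a.stride = round_up8(sizeof(InstrHeader) + custom_bytes);
	a.nworkers = nworkers;
	a.base = calloc((size_t) nworkers, a.stride);
	return a;
}

/* Replacement for &instrument[n]: explicit pointer arithmetic by stride. */
InstrHeader *
instr_array_get(InstrArray *a, int n)
{
	assert(n >= 0 && n < a->nworkers);
	return (InstrHeader *) (a->base + (size_t) n * a->stride);
}

/* Start of the custom counters that follow a given header. */
char *
instr_custom(InstrHeader *h)
{
	return (char *) h + sizeof(InstrHeader);
}
```

Every indexing site in the listing above would have to switch to an accessor of this kind, which is the amount of churn the fixed 128-byte limit avoids.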
On 30/11/2023 22:40, Konstantin Knizhnik wrote:
>
> On 30/11/2023 5:59 am, Andrei Lepikhov wrote:
>> On 21/10/2023 19:16, Konstantin Knizhnik wrote:
>>> EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS,
>>> COST,...) which help to provide useful details of query execution.
>>> In Neon we have added PREFETCH option which shows information about
>>> page prefetching during query execution (prefetching is more critical
>>> for Neon
>>> architecture because of separation of compute and storage, so it is
>>> implemented not only for bitmap heap scan as in Vanilla Postgres, but
>>> also for seqscan, indexscan and indexonly scan). Another possible
>>> candidate for explain options is local file cache (extra caching
>>> layer above shared buffers which is used to somehow replace file
>>> system cache in standalone Postgres).
>>>
>>> I think that it will be nice to have a generic mechanism which allows
>>> extensions to add its own options to EXPLAIN.
>>
>> Generally, I welcome this idea: Extensions can already do a lot of
>> work, and they should have a tool to report their state, not only into
>> the log.
>> But I think your approach needs to be elaborated. At first, it would
>> be better to allow registering extended instruments for specific node
>> types to avoid unneeded calls.
>> Secondly, looking into the Instrumentation usage, I don't see the
>> reason to limit the size: as I see everywhere it exists locally or in
>> the DSA where its size is calculated on the fly. So, by registering an
>> extended instrument, we can reserve a slot for the extension. The
>> actual size of underlying data can be provided by the extension routine.
>>
> Thank you for review.
>
> I agree that support of extended instruments is desired. I just tried to
> minimize number of changes to make this patch smaller.

I got it. But given the substantial number of extensions out there, I think the extension part of instrumentation could have advantages and be worth elaborating on.

> In all this cases we are using array of `Instrumentation` and if it
> contains varying part, then it is not clear where to place it.
> Yes, there is also code which serialize and sends instrumentations
> between worker processes and I have updated this code in my PR to send
> actual amount of custom instrumentation data. But it can not help with
> the cases above.

I see the following basic instruments in the code:
- Instrumentation (which should be named NodeInstrumentation)
- MemoizeInstrumentation
- JitInstrumentation
- AggregateInstrumentation
- HashInstrumentation
- TuplesortInstrumentation

As a variant, extensibility can be designed with a parent 'AbstractInstrumentation' node, containing the node type and a link to the extensible part. sizeof(Instr) calls should be replaced with a getInstrSize() call (there are not that many places in the code); memcpy() can also be replaced with a copy_instr() routine.

--
regards,
Andrei Lepikhov
Postgres Professional
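One possible reading of this suggestion, as a standalone sketch: every instrument starts with a common header carrying its type, and the size and copy helpers dispatch on that type. Only the names `AbstractInstrumentation`, `getInstrSize()` and `copy_instr()` come from the message; the struct layouts, the `InstrType` enum, and the choice to store the extensible part as bytes appended after the struct (rather than behind a pointer, which is awkward in shared memory) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum InstrType
{
	INSTR_NODE,
	INSTR_HASH
} InstrType;

/* Assumed common header placed at the start of every instrument. */
typedef struct AbstractInstrumentation
{
	InstrType	type;
	size_t		ext_size;	/* bytes of extension data appended after the
							 * concrete struct (an assumption) */
} AbstractInstrumentation;

/* Simplified stand-ins for two concrete instrument kinds. */
typedef struct NodeInstr
{
	AbstractInstrumentation base;
	double		ntuples;
} NodeInstr;

typedef struct HashInstr
{
	AbstractInstrumentation base;
	int			nbuckets;
	int			nbatch;
} HashInstr;

/* Run-time replacement for sizeof(Instr): fixed part plus extension part. */
size_t
getInstrSize(const AbstractInstrumentation *instr)
{
	size_t		fixed = 0;

	switch (instr->type)
	{
		case INSTR_NODE:
			fixed = sizeof(NodeInstr);
			break;
		case INSTR_HASH:
			fixed = sizeof(HashInstr);
			break;
	}
	return fixed + instr->ext_size;
}

/* Replacement for memcpy(dst, src, sizeof(Instr)); dst must be large enough. */
void
copy_instr(void *dst, const AbstractInstrumentation *src)
{
	assert(dst != NULL && src != NULL);
	memcpy(dst, src, getInstrSize(src));
}
```

With this layout a caller sizes its buffers through `getInstrSize()` instead of `sizeof`, so the extension bytes travel together with the fixed fields when an instrument is copied between workers.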
On Sat, 21 Oct 2023 at 18:34, Konstantin Knizhnik <knizhnik@garret.ru> wrote:
>
> Hi hackers,
>
> EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS, COST,...) which help to provide useful details of query execution.
> In Neon we have added PREFETCH option which shows information about page prefetching during query execution (prefetching is more critical for Neon
> architecture because of separation of compute and storage, so it is implemented not only for bitmap heap scan as in Vanilla Postgres, but also for seqscan, indexscan and indexonly scan). Another possible candidate for explain options is local file cache (extra caching layer above shared buffers which is used to somehow replace file system cache in standalone Postgres).
>
> I think that it will be nice to have a generic mechanism which allows extensions to add its own options to EXPLAIN.
> I have attached the patch with implementation of such mechanism (also available as PR: https://github.com/knizhnik/postgres/pull/1)
>
> I have demonstrated this mechanism using Bloom extension - just to report number of Bloom matches.
> Not sure that it is really useful information but it is used mostly as example:
>
> explain (analyze,bloom) select * from t where pk=2000;
>                                                        QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on t  (cost=15348.00..15352.01 rows=1 width=4) (actual time=25.244..25.939 rows=1 loops=1)
>    Recheck Cond: (pk = 2000)
>    Rows Removed by Index Recheck: 292
>    Heap Blocks: exact=283
>    Bloom: matches=293
>    ->  Bitmap Index Scan on t_pk_idx  (cost=0.00..15348.00 rows=1 width=0) (actual time=25.147..25.147 rows=293 loops=1)
>          Index Cond: (pk = 2000)
>          Bloom: matches=293
>  Planning:
>    Bloom: matches=0
>  Planning Time: 0.387 ms
>  Execution Time: 26.053 ms
> (12 rows)
>
> There are two known issues with this proposal:

There are a few compilation errors reported by CFBot at [1]:

[05:00:40.452] ../src/backend/access/brin/brin.c: In function ‘_brin_end_parallel’:
[05:00:40.452] ../src/backend/access/brin/brin.c:2675:3: error: too few arguments to function ‘InstrAccumParallelQuery’
[05:00:40.452] 2675 | InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
[05:00:40.452] | ^~~~~~~~~~~~~~~~~~~~~~~
[05:00:40.452] In file included from ../src/include/nodes/execnodes.h:33,
[05:00:40.452] from ../src/include/access/brin.h:13,
[05:00:40.452] from ../src/backend/access/brin/brin.c:18:
[05:00:40.452] ../src/include/executor/instrument.h:151:13: note: declared here
[05:00:40.452] 151 | extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage, char* custusage);
[05:00:40.452] | ^~~~~~~~~~~~~~~~~~~~~~~
[05:00:40.452] ../src/backend/access/brin/brin.c: In function ‘_brin_parallel_build_main’:
[05:00:40.452] ../src/backend/access/brin/brin.c:2873:2: error: too few arguments to function ‘InstrEndParallelQuery’
[05:00:40.452] 2873 | InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
[05:00:40.452] | ^~~~~~~~~~~~~~~~~~~~~
[05:00:40.452] In file included from ../src/include/nodes/execnodes.h:33,
[05:00:40.452] from ../src/include/access/brin.h:13,
[05:00:40.452] from ../src/backend/access/brin/brin.c:18:
[05:00:40.452] ../src/include/executor/instrument.h:150:13: note: declared here
[05:00:40.452] 150 | extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage, char* custusage);

[1] - https://cirrus-ci.com/task/5452124486631424?logs=build#L374

Regards,
Vignesh
On 30/11/2023 22:40, Konstantin Knizhnik wrote:
> In all this cases we are using array of `Instrumentation` and if it
> contains varying part, then it is not clear where to place it.
> Yes, there is also code which serialize and sends instrumentations
> between worker processes and I have updated this code in my PR to send
> actual amount of custom instrumentation data. But it can not help with
> the cases above.

What do you think about this really useful feature? Do you wish to develop it further?

--
regards,
Andrei Lepikhov
Postgres Professional
On Wed, Jan 10, 2024 at 01:29:30PM +0700, Andrei Lepikhov wrote:
> What do you think about this really useful feature? Do you wish to develop
> it further?

I am biased here. This seems like a lot of code for something we've been delegating to the explain hook for ages. Even if I can see the appeal of pushing that more into explain.c to get more data on a per-node basis depending on the custom options given by the caller of an EXPLAIN entry point, I cannot get really excited about the extra maintenance this facility would involve compared to the potential gains, knowing that there's a hook.
--
Michael
Attachment
On 10/01/2024 8:46 am, Michael Paquier wrote:
> On Wed, Jan 10, 2024 at 01:29:30PM +0700, Andrei Lepikhov wrote:
>> What do you think about this really useful feature? Do you wish to develop
>> it further?
> I am biased here. This seems like a lot of code for something we've
> been delegating to the explain hook for ages. Even if I can see the
> appeal of pushing that more into explain.c to get more data on a
> per-node basis depending on the custom options given by the caller of
> an EXPLAIN entry point, I cannot get really excited about the extra
> maintenance this facility would involve compared to the potential
> gains, knowing that there's a hook.
> --
> Michael

Well, I am not sure that the proposed patch is flexible enough to handle all possible scenarios.
I just wanted to make it as simple as possible to leave some chance for it to be merged.
But it is easy to answer the question of why the existing explain hook is not enough:

1. It doesn't allow adding extra options to EXPLAIN. My intention was to be able to do something like "explain (analyze,buffers,prefetch) ...". That is simply not possible with the explain hook.
2. Maybe I am wrong, but it is currently impossible to collect and combine instrumentation from all parallel workers without changing Postgres core.

The explain hook can be useful if you add a custom node to the query execution plan and want to provide information about this node. But if you are implementing an alternative storage mechanism or an optimization for existing plan nodes, it is very difficult to do so using the existing explain hook.
On 10/01/2024 8:29 am, Andrei Lepikhov wrote:
> On 30/11/2023 22:40, Konstantin Knizhnik wrote:
>> In all this cases we are using array of `Instrumentation` and if it
>> contains varying part, then it is not clear where to place it.
>> Yes, there is also code which serialize and sends instrumentations
>> between worker processes and I have updated this code in my PR to
>> send actual amount of custom instrumentation data. But it can not
>> help with the cases above.
> What do you think about this really useful feature? Do you wish to
> develop it further?
>

In Neon (cloud Postgres) we have changed Postgres core to include information about prefetch and the local file cache in explain output.
EXPLAIN seems to be the most convenient way for users to get this information, which can be very useful for investigating query execution speed.
So my intention was to make it possible to add extra information to explain without patching Postgres core.
The existing explain hook is not enough for that.

I am not sure that the suggested approach is flexible enough.
First of all, I tried to make it as simple as possible, minimizing changes in Postgres core.
On 09/01/2024 10:33 am, vignesh C wrote:
> On Sat, 21 Oct 2023 at 18:34, Konstantin Knizhnik <knizhnik@garret.ru> wrote:
>> Hi hackers,
>>
>> EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS, COST,...) which help to provide useful details of query execution.
>> In Neon we have added PREFETCH option which shows information about page prefetching during query execution (prefetching is more critical for Neon
>> architecture because of separation of compute and storage, so it is implemented not only for bitmap heap scan as in Vanilla Postgres, but also for seqscan, indexscan and indexonly scan). Another possible candidate for explain options is local file cache (extra caching layer above shared buffers which is used to somehow replace file system cache in standalone Postgres).
>>
>> I think that it will be nice to have a generic mechanism which allows extensions to add its own options to EXPLAIN.
>> I have attached the patch with implementation of such mechanism (also available as PR: https://github.com/knizhnik/postgres/pull/1)
>>
>> I have demonstrated this mechanism using Bloom extension - just to report number of Bloom matches.
>> Not sure that it is really useful information but it is used mostly as example:
>>
>> explain (analyze,bloom) select * from t where pk=2000;
>>                                                        QUERY PLAN
>> -------------------------------------------------------------------------------------------------------------------------
>>  Bitmap Heap Scan on t  (cost=15348.00..15352.01 rows=1 width=4) (actual time=25.244..25.939 rows=1 loops=1)
>>    Recheck Cond: (pk = 2000)
>>    Rows Removed by Index Recheck: 292
>>    Heap Blocks: exact=283
>>    Bloom: matches=293
>>    ->  Bitmap Index Scan on t_pk_idx  (cost=0.00..15348.00 rows=1 width=0) (actual time=25.147..25.147 rows=293 loops=1)
>>          Index Cond: (pk = 2000)
>>          Bloom: matches=293
>>  Planning:
>>    Bloom: matches=0
>>  Planning Time: 0.387 ms
>>  Execution Time: 26.053 ms
>> (12 rows)
>>
>> There are two known issues with this proposal:
> There are few compilation errors reported by CFBot at [1] with:
> [05:00:40.452] ../src/backend/access/brin/brin.c: In function
> ‘_brin_end_parallel’:
> [05:00:40.452] ../src/backend/access/brin/brin.c:2675:3: error: too
> few arguments to function ‘InstrAccumParallelQuery’
> [05:00:40.452] 2675 |
> InstrAccumParallelQuery(&brinleader->bufferusage[i],
> &brinleader->walusage[i]);
> [05:00:40.452] | ^~~~~~~~~~~~~~~~~~~~~~~
> [05:00:40.452] In file included from ../src/include/nodes/execnodes.h:33,
> [05:00:40.452] from ../src/include/access/brin.h:13,
> [05:00:40.452] from ../src/backend/access/brin/brin.c:18:
> [05:00:40.452] ../src/include/executor/instrument.h:151:13: note: declared here
> [05:00:40.452] 151 | extern void InstrAccumParallelQuery(BufferUsage
> *bufusage, WalUsage *walusage, char* custusage);
> [05:00:40.452] | ^~~~~~~~~~~~~~~~~~~~~~~
> [05:00:40.452] ../src/backend/access/brin/brin.c: In function
> ‘_brin_parallel_build_main’:
> [05:00:40.452] ../src/backend/access/brin/brin.c:2873:2: error: too
> few arguments to function ‘InstrEndParallelQuery’
> [05:00:40.452] 2873 |
> InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
> [05:00:40.452] | ^~~~~~~~~~~~~~~~~~~~~
> [05:00:40.452] In file included from ../src/include/nodes/execnodes.h:33,
> [05:00:40.452] from ../src/include/access/brin.h:13,
> [05:00:40.452] from ../src/backend/access/brin/brin.c:18:
> [05:00:40.452] ../src/include/executor/instrument.h:150:13: note: declared here
> [05:00:40.452] 150 | extern void InstrEndParallelQuery(BufferUsage
> *bufusage, WalUsage *walusage, char* custusage);
>
> [1] - https://cirrus-ci.com/task/5452124486631424?logs=build#L374
>
> Regards,
> Vignesh

Thank you for reporting the problem.
Rebased version of the patch is attached.
Attachment
On 10/1/2024 20:27, Konstantin Knizhnik wrote:
>
> On 10/01/2024 8:46 am, Michael Paquier wrote:
>> On Wed, Jan 10, 2024 at 01:29:30PM +0700, Andrei Lepikhov wrote:
>>> What do you think about this really useful feature? Do you wish to
>>> develop
>>> it further?
>> I am biased here. This seems like a lot of code for something we've
>> been delegating to the explain hook for ages. Even if I can see the
>> appeal of pushing that more into explain.c to get more data on a
>> per-node basis depending on the custom options given by the caller of
>> an EXPLAIN entry point, I cannot get really excited about the extra
>> maintenance this facility would involve compared to the potential
>> gains, knowing that there's a hook.
>> --
>> Michael
>
>
> Well, I am not sure that proposed patch is flexible enough to handle all
> possible scenarios.
> I just wanted to make it as simple as possible to leave some chances for
> it to me merged.
> But it is easy to answer the question why existed explain hook is not
> enough:
>
> 1. It doesn't allow to add some extra options to EXPLAIN. My intention
> was to be able to do something like this "explain
> (analyze,buffers,prefetch) ...". It is completely not possible with
> explain hook.

I agree. Designing mostly planner-related extensions, I also wanted to add some information to the explain of nodes. For example, pg_query_state could add the status of the node at the time of interruption of execution: started, stopped, or loop closed.
Maybe we should gather some statistics on how developers of extensions deal with that issue ...

--
regards,
Andrei Lepikhov
Postgres Professional
On 10/21/23 14:16, Konstantin Knizhnik wrote:
> Hi hackers,
>
> EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS,
> COST,...) which help to provide useful details of query execution.
> In Neon we have added PREFETCH option which shows information about page
> prefetching during query execution (prefetching is more critical for Neon
> architecture because of separation of compute and storage, so it is
> implemented not only for bitmap heap scan as in Vanilla Postgres, but
> also for seqscan, indexscan and indexonly scan). Another possible
> candidate for explain options is local file cache (extra caching layer
> above shared buffers which is used to somehow replace file system cache
> in standalone Postgres).

Not quite related to this patch about EXPLAIN options, but can you share some details how you implemented prefetching for the other nodes?

I'm asking because I've been working on prefetching for index scans, so I'm wondering if there's a better way to do this, or how to do it in a way that would allow neon to maybe leverage that too.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 12/01/2024 7:03 pm, Tomas Vondra wrote:
> On 10/21/23 14:16, Konstantin Knizhnik wrote:
>> Hi hackers,
>>
>> EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS,
>> COST,...) which help to provide useful details of query execution.
>> In Neon we have added PREFETCH option which shows information about page
>> prefetching during query execution (prefetching is more critical for Neon
>> architecture because of separation of compute and storage, so it is
>> implemented not only for bitmap heap scan as in Vanilla Postgres, but
>> also for seqscan, indexscan and indexonly scan). Another possible
>> candidate for explain options is local file cache (extra caching layer
>> above shared buffers which is used to somehow replace file system cache
>> in standalone Postgres).
> Not quite related to this patch about EXPLAIN options, but can you share
> some details how you implemented prefetching for the other nodes?
>
> I'm asking because I've been working on prefetching for index scans, so
> I'm wondering if there's a better way to do this, or how to do it in a
> way that would allow neon to maybe leverage that too.
>
> regards
>

Yes, I am looking at your PR. What we have implemented in Neon is more specific to the Neon architecture, where storage is separated from compute. So each page not found in shared buffers has to be downloaded from the page server. This adds quite noticeable latency because of the network roundtrip. While vanilla Postgres can rely on the OS file system cache when a page is not found in shared buffers (access to the OS file cache is certainly slower than access to shared buffers because of the syscall and the copying of the page, but the performance penalty is not very large, less than 15%), Neon has no local files and so has to send a request to the socket.

This is why we have to perform aggressive prefetching whenever it is possible (that is, when it is possible to predict the order of subsequent pages).
Unlike vanilla Postgres, which implements prefetch only for bitmap heap scan, we have implemented it for seqscan, index scan, indexonly scan, bitmap heap scan, vacuum and pg_prewarm.
The main difference between Neon prefetch and vanilla Postgres prefetch is that the former is backend specific: each backend prefetches only the pages which it needs.
This is why we had to rewrite prefetch for bitmap heap scan, which uses `fadvise` and assumes that pages prefetched into the file cache by one backend can be used by any other backend.

Concerning index scans, we have implemented two different approaches: for index-only scan we try to prefetch leaf pages, and for index scan we prefetch the referenced heap pages.
In both cases we start from prefetch distance 0 and increase it until it reaches `effective_io_concurrency` for this relation. In doing so we try to avoid prefetching useless pages and slowing down "point" lookups returning one or a few records.

If you are interested, you can look at our implementation in the neon repo: all sources are available. But briefly speaking, each backend has its own prefetch ring (prefetch requests which are waiting for a response). The key idea is that we can send several prefetch requests to the page server and then receive multiple replies. It allows us to increase the speed of OLAP queries up to 10 times.

Heikki thinks that prefetch can be somehow combined with the async-io proposal (based on io_uring). But right now they have nothing in common.
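The ramp-up described above can be sketched in a few lines. This is a standalone illustration, not Neon's code: the names `PrefetchState`, `prefetch_init` and `prefetch_advance` are hypothetical, the cap stands in for `effective_io_concurrency`, and growing the distance by one per page read is an assumption, since the message only says the distance starts at 0 and increases up to the limit.

```c
#include <assert.h>

/* Hypothetical per-scan prefetch state. */
typedef struct PrefetchState
{
	int			distance;		/* current prefetch distance, starts at 0 */
	int			max_distance;	/* cap, e.g. effective_io_concurrency */
	int			next_to_issue;	/* next scan position not yet requested */
} PrefetchState;

void
prefetch_init(PrefetchState *ps, int max_distance)
{
	assert(max_distance >= 0);
	ps->distance = 0;
	ps->max_distance = max_distance;
	ps->next_to_issue = 0;
}

/*
 * Called when the scan is about to read position `current` out of `npages`
 * pages in its (predictable) page order.  Issues requests for positions up
 * to current + distance and returns how many new requests were issued.
 */
int
prefetch_advance(PrefetchState *ps, int current, int npages)
{
	int			issued = 0;
	int			target = current + ps->distance;

	/* never request the page being read right now, or earlier ones */
	if (ps->next_to_issue <= current)
		ps->next_to_issue = current + 1;

	if (target >= npages)
		target = npages - 1;

	while (ps->next_to_issue <= target)
	{
		/* a real implementation would send a prefetch request here */
		ps->next_to_issue++;
		issued++;
	}

	/* ramp up after issuing, so the very first page triggers no prefetch */
	if (ps->distance < ps->max_distance)
		ps->distance++;

	return issued;
}
```

Because the distance is still 0 when the first page is read, a point lookup that touches a single page issues no prefetch requests at all, while a longer scan reaches the full distance after a few pages.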
On 1/12/24 20:30, Konstantin Knizhnik wrote:
>
> On 12/01/2024 7:03 pm, Tomas Vondra wrote:
>> On 10/21/23 14:16, Konstantin Knizhnik wrote:
>>> Hi hackers,
>>>
>>> EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS,
>>> COST,...) which help to provide useful details of query execution.
>>> In Neon we have added PREFETCH option which shows information about page
>>> prefetching during query execution (prefetching is more critical for
>>> Neon
>>> architecture because of separation of compute and storage, so it is
>>> implemented not only for bitmap heap scan as in Vanilla Postgres, but
>>> also for seqscan, indexscan and indexonly scan). Another possible
>>> candidate for explain options is local file cache (extra caching layer
>>> above shared buffers which is used to somehow replace file system cache
>>> in standalone Postgres).
>> Not quite related to this patch about EXPLAIN options, but can you share
>> some details how you implemented prefetching for the other nodes?
>>
>> I'm asking because I've been working on prefetching for index scans, so
>> I'm wondering if there's a better way to do this, or how to do it in a
>> way that would allow neon to maybe leverage that too.
>>
>> regards
>>
> Yes, I am looking at your PR. What we have implemented in Neon is more
> specific to Neon architecture where storage is separated from compute.
> So each page not found in shared buffers has to be downloaded from page
> server. It adds quite noticeable latency, because of network roundtrip.
> While vanilla Postgres can rely on OS file system cache when page is not
> found in shared buffer (access to OS file cache is certainly slower than
> to shared buffers
> because of syscall and copying of page, but performance penaly is not
> very large - less than 15%), Neon has no local files and so has to send
> request to the socket.
>
> This is why we have to perform aggressive prefetching whenever it is
> possible (when it it is possible to predict order of subsequent pages).
> Unlike vanilla Postgres which implements prefetch only for bitmap heap
> scan, we have implemented it for seqscan, index scan, indexonly scan,
> bitmap heap scan, vacuum, pg_prewarm.
> The main difference between Neon prefetch and vanilla Postgres prefetch
> is that first one is backend specific. So each backend prefetches only
> pages which it needs.
> This is why we have to rewrite prefetch for bitmap heap scan, which is
> using `fadvise` and assumes that pages prefetched by one backend in file
> cache, can be used by any other backend.
>

I do understand why prefetching is important in neon (likely more than for core postgres). I'm interested in how it's actually implemented, whether it's somehow similar to how my patch does things or in some different (perhaps neon-specific way), and if the approaches are different then what are the pros/cons. And so on.

So is it implemented in the neon-specific storage, somehow, or where/how does neon issue the prefetch requests?

>
> Concerning index scan we have implemented two different approaches: for
> index only scan we try to prefetch leave pages and for index scan we
> prefetch referenced heap pages.

In my experience the IOS handling (only prefetching leaf pages) is very limiting, and may easily lead to index-only scans being way slower than regular index scans. Which is super surprising for users. It's why I ended up improving the prefetcher to optionally look at the VM etc.

> In both cases we start from prefetch distance 0 and increase it until it
> reaches `effective_io_concurrency` for this relation. Doing so we try to
> avoid prefetching of useless pages and slowdown of "point" lookups
> returning one or few records.
>

Right, the regular prefetch ramp-up. My patch does the same thing.

> If you are interested, you can look at our implementation in neon repo:
> all source are available. But briefly speaking, each backend has its own
> prefetch ring (prefetch requests which are waiting for response). The
> key idea is that we can send several prefetch requests to page server
> and then receive multiple replies. It allows to increased speed of OLAP
> queries up to 10 times.
>

Can you point me to the actual code / branch where it happens? I did check the github repo, but I don't see anything relevant in the default branch (REL_15_STABLE_neon). There are some "prefetch" branches, but those seem abandoned.

> Heikki thinks that prefetch can be somehow combined with async-io
> proposal (based on io_uring). But right now they have nothing in common.
>

I can imagine async I/O being useful here, but I find the flow of I/O requests is quite complex / goes through multiple layers. Or maybe I just don't understand how it should work.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> On 1/12/24 20:30, Konstantin Knizhnik wrote:
>> On 12/01/2024 7:03 pm, Tomas Vondra wrote:
>>> On 10/21/23 14:16, Konstantin Knizhnik wrote:
>>>> Hi hackers,
>>>>
>>>> EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS, COST,...) which help to provide useful details of query execution. In Neon we have added PREFETCH option which shows information about page prefetching during query execution (prefetching is more critical for Neon architecture because of separation of compute and storage, so it is implemented not only for bitmap heap scan as in Vanilla Postgres, but also for seqscan, indexscan and indexonly scan). Another possible candidate for explain options is local file cache (extra caching layer above shared buffers which is used to somehow replace file system cache in standalone Postgres).
>>> Not quite related to this patch about EXPLAIN options, but can you share some details how you implemented prefetching for the other nodes?
>>>
>>> I'm asking because I've been working on prefetching for index scans, so I'm wondering if there's a better way to do this, or how to do it in a way that would allow neon to maybe leverage that too.
>>>
>>> regards
>> Yes, I am looking at your PR. What we have implemented in Neon is more specific to Neon architecture where storage is separated from compute. So each page not found in shared buffers has to be downloaded from page server. It adds quite noticeable latency, because of network roundtrip. While vanilla Postgres can rely on OS file system cache when page is not found in shared buffer (access to OS file cache is certainly slower than to shared buffers because of syscall and copying of page, but performance penaly is not very large - less than 15%), Neon has no local files and so has to send request to the socket.
>>
>> This is why we have to perform aggressive prefetching whenever it is possible (when it it is possible to predict order of subsequent pages).
>> Unlike vanilla Postgres which implements prefetch only for bitmap heap scan, we have implemented it for seqscan, index scan, indexonly scan, bitmap heap scan, vacuum, pg_prewarm. The main difference between Neon prefetch and vanilla Postgres prefetch is that first one is backend specific. So each backend prefetches only pages which it needs. This is why we have to rewrite prefetch for bitmap heap scan, which is using `fadvise` and assumes that pages prefetched by one backend in file cache, can be used by any other backend.
> I do understand why prefetching is important in neon (likely more than for core postgres). I'm interested in how it's actually implemented, whether it's somehow similar to how my patch does things or in some different (perhaps neon-specific way), and if the approaches are different then what are the pros/cons. And so on.
>
> So is it implemented in the neon-specific storage, somehow, or where/how does neon issue the prefetch requests?
Neon mostly preserves the Postgres prefetch mechanism, so we are using PrefetchBuffer, which checks whether the page is present in shared buffers
and, if not, calls smgrprefetch. We use our own storage manager implementation which, instead of reading pages from local disk, downloads them from the page server.
And the prefetch implementation in the Neon storage manager is obviously also different from the one in vanilla Postgres, which uses posix_fadvise.
The Neon prefetch implementation inserts a prefetch request into a ring buffer and sends it to the server. When a read operation is performed, we check whether there is a corresponding prefetch request in the ring buffer and, if so, wait for its completion.
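The flow described above can be sketched as a small single-process simulation. This is Python rather than the C of the actual extension, and it is not Neon's code: `FakePageServer`, `PrefetchRing` and all of their methods are hypothetical names, the server is assumed to answer strictly in request order, and responses drained after a misprediction are assumed to be thrown away.

```python
from collections import deque


class FakePageServer:
    """Hypothetical stand-in for the page server; replies in request order."""

    def __init__(self):
        self.pending = deque()
        self.served = 0  # total number of requests answered

    def send(self, page_no):
        self.pending.append(page_no)

    def receive(self):
        page_no = self.pending.popleft()
        self.served += 1
        return page_no, f"contents of page {page_no}"


class PrefetchRing:
    """Per-backend ring of prefetch requests that are waiting for a response."""

    def __init__(self, server, capacity=8):
        self.server = server
        self.capacity = capacity
        self.in_flight = deque()  # requested pages, in send order
        self.ready = {}           # responses received early, kept for later reads
        self.wasted = 0           # responses consumed but never used

    def prefetch(self, page_no):
        """Insert the request into the ring and send it; False if skipped."""
        if page_no in self.ready or page_no in self.in_flight:
            return False
        if len(self.in_flight) >= self.capacity:
            return False
        self.in_flight.append(page_no)
        self.server.send(page_no)
        return True

    def read(self, page_no):
        if page_no in self.ready:
            return self.ready.pop(page_no)
        if page_no not in self.in_flight:
            # Misprediction: consume every pending response, then ask again.
            while self.in_flight:
                self.in_flight.popleft()
                self.server.receive()
                self.wasted += 1
            self.server.send(page_no)
            return self.server.receive()[1]
        # Matching request found: wait for its completion, keeping the
        # responses that arrive before it.
        while True:
            got, data = self.server.receive()
            self.in_flight.popleft()
            if got == page_no:
                return data
            self.ready[got] = data
```

With this model, reading pages that were prefetched costs no extra round trips, while reading a page that was not prefetched first pays for every request still in the ring.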
As I already wrote, prefetch is done locally by each backend, and each backend has its own connection to the page server. This may change in the future when we implement multiplexing of page server connections, but right now prefetch is local. And certainly prefetch can improve performance only if we correctly predict subsequent page requests.
If not, the page server does useless work and the backend has to wait for and consume all issued prefetch requests. This is why the prefetch implementation for most nodes starts with a minimal prefetch distance and then increases it. This lets us prefetch only for queries where it is really efficient (OLAP) without degrading the performance of simple OLTP queries.
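A minimal sketch of such a ramp-up, in Python for brevity. The function name `scan_with_ramp_up` is hypothetical, and growing the distance by exactly one page per read is an assumption: the thread only says that the distance starts minimal and grows up to `effective_io_concurrency`.

```python
def scan_with_ramp_up(pages, effective_io_concurrency):
    """Yield (page_read, pages_prefetched_at_this_step) for a scan.

    The prefetch distance starts at 0 and grows by one per page read
    (an assumed growth rule), capped by effective_io_concurrency.
    """
    distance = 0
    next_prefetch = 0  # index of the first page that was not prefetched yet
    for i, page in enumerate(pages):
        # Never prefetch the page being read right now or anything before it.
        next_prefetch = max(next_prefetch, i + 1)
        upper = min(i + 1 + distance, len(pages))
        issued = pages[next_prefetch:upper]
        next_prefetch = max(next_prefetch, upper)
        yield page, issued
        distance = min(distance + 1, effective_io_concurrency)
```

A point lookup that touches a single page issues no prefetch at all, while a longer scan reaches the full window after a few pages.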
Our prefetch implementation is also compatible with parallel plans, but here we need to reserve a range of pages for each parallel worker instead of picking pages from a shared queue on demand. This is one of the major differences from Postgres prefetch using posix_fadvise: each backend should prefetch only those pages which it is going to read.
>> Concerning index scan we have implemented two different approaches: for index only scan we try to prefetch leave pages and for index scan we prefetch referenced heap pages.
> In my experience the IOS handling (only prefetching leaf pages) is very limiting, and may easily lead to index-only scans being way slower than regular index scans. Which is super surprising for users. It's why I ended up improving the prefetcher to optionally look at the VM etc.
Well, my assumption was the following: prefetch is most efficient for OLAP queries.
Although HTAP (hybrid transactional/analytical processing) is a popular trend now,
the classical model is that analytic queries are performed on "historical" data, which has already been processed by vacuum, so the all-visible bits are set in the VM.
Maybe this assumption is wrong, but it seems to me that if most heap pages are not marked as all-visible, then the optimizer should prefer bitmap scan to index-only scan.
And for the combination of index and heap bitmap scans we can efficiently prefetch both index and heap pages.
>> In both cases we start from prefetch distance 0 and increase it until it reaches `effective_io_concurrency` for this relation. Doing so we try to avoid prefetching of useless pages and slowdown of "point" lookups returning one or few records.
> Right, the regular prefetch ramp-up. My patch does the same thing.
>> If you are interested, you can look at our implementation in neon repo: all source are available. But briefly speaking, each backend has its own prefetch ring (prefetch requests which are waiting for response). The key idea is that we can send several prefetch requests to page server and then receive multiple replies. It allows to increased speed of OLAP queries up to 10 times.
> Can you point me to the actual code / branch where it happens? I did check the github repo, but I don't see anything relevant in the default branch (REL_15_STABLE_neon). There are some "prefetch" branches, but those seem abandoned.
The implementation of the prefetch mechanism is in the Neon extension:
https://github.com/neondatabase/neon/blob/60ced06586a6811470c16c6386daba79ffaeda13/pgxn/neon/pagestore_smgr.c#L205
But the concrete implementation of prefetch for particular nodes is certainly inside Postgres.
For example, if you are interested in how it is implemented for index scan, then please look at:
https://github.com/neondatabase/postgres/blob/c1c2272f436ed9231f6172f49de219fe71a9280d/src/backend/access/nbtree/nbtsearch.c#L844
https://github.com/neondatabase/postgres/blob/c1c2272f436ed9231f6172f49de219fe71a9280d/src/backend/access/nbtree/nbtsearch.c#L1166
https://github.com/neondatabase/postgres/blob/c1c2272f436ed9231f6172f49de219fe71a9280d/src/backend/access/nbtree/nbtsearch.c#L1467
https://github.com/neondatabase/postgres/blob/c1c2272f436ed9231f6172f49de219fe71a9280d/src/backend/access/nbtree/nbtsearch.c#L1625
https://github.com/neondatabase/postgres/blob/c1c2272f436ed9231f6172f49de219fe71a9280d/src/backend/access/nbtree/nbtsearch.c#L2629
>> Heikki thinks that prefetch can be somehow combined with async-io proposal (based on io_uring). But right now they have nothing in common.
> I can imagine async I/O being useful here, but I find the flow of I/O requests is quite complex / goes through multiple layers. Or maybe I just don't understand how it should work.
I also do not think that it will be possible to marry these two approaches.
On 1/13/24 17:13, Konstantin Knizhnik wrote: > > On 13/01/2024 4:51 pm, Tomas Vondra wrote: >> >> On 1/12/24 20:30, Konstantin Knizhnik wrote: >>> On 12/01/2024 7:03 pm, Tomas Vondra wrote: >>>> On 10/21/23 14:16, Konstantin Knizhnik wrote: >>>>> Hi hackers, >>>>> >>>>> EXPLAIN statement has a list of options (i.e. ANALYZE, BUFFERS, >>>>> COST,...) which help to provide useful details of query execution. >>>>> In Neon we have added PREFETCH option which shows information about >>>>> page >>>>> prefetching during query execution (prefetching is more critical for >>>>> Neon >>>>> architecture because of separation of compute and storage, so it is >>>>> implemented not only for bitmap heap scan as in Vanilla Postgres, but >>>>> also for seqscan, indexscan and indexonly scan). Another possible >>>>> candidate for explain options is local file cache (extra caching >>>>> layer >>>>> above shared buffers which is used to somehow replace file system >>>>> cache >>>>> in standalone Postgres). >>>> Not quite related to this patch about EXPLAIN options, but can you >>>> share >>>> some details how you implemented prefetching for the other nodes? >>>> >>>> I'm asking because I've been working on prefetching for index scans, so >>>> I'm wondering if there's a better way to do this, or how to do it in a >>>> way that would allow neon to maybe leverage that too. >>>> >>>> regards >>>> >>> Yes, I am looking at your PR. What we have implemented in Neon is more >>> specific to Neon architecture where storage is separated from compute. >>> So each page not found in shared buffers has to be downloaded from page >>> server. It adds quite noticeable latency, because of network roundtrip. 
>>> While vanilla Postgres can rely on OS file system cache when page is not >>> found in shared buffer (access to OS file cache is certainly slower than >>> to shared buffers >>> because of syscall and copying of page, but performance penaly is not >>> very large - less than 15%), Neon has no local files and so has to send >>> request to the socket. >>> >>> This is why we have to perform aggressive prefetching whenever it is >>> possible (when it it is possible to predict order of subsequent pages). >>> Unlike vanilla Postgres which implements prefetch only for bitmap heap >>> scan, we have implemented it for seqscan, index scan, indexonly scan, >>> bitmap heap scan, vacuum, pg_prewarm. >>> The main difference between Neon prefetch and vanilla Postgres prefetch >>> is that first one is backend specific. So each backend prefetches only >>> pages which it needs. >>> This is why we have to rewrite prefetch for bitmap heap scan, which is >>> using `fadvise` and assumes that pages prefetched by one backend in file >>> cache, can be used by any other backend. >>> >> I do understand why prefetching is important in neon (likely more than >> for core postgres). I'm interested in how it's actually implemented, >> whether it's somehow similar to how my patch does things or in some >> different (perhaps neon-specific way), and if the approaches are >> different then what are the pros/cons. And so on. >> >> So is it implemented in the neon-specific storage, somehow, or where/how >> does neon issue the prefetch requests? > > Neon mostly preservers Postgres prefetch mechanism, so we are using > PrefetchBuffer which checks if page is present in shared buffers > and if not - calls smgrprefetch. We are using own storage manager > implementation which instead of reading pages from local disk, download > them from page server. > And prefetch implementation in Neon storager manager is obviously also > different from one in vanilla Postgres which uses posix_fadvise. 
> Neon prefetch implementation inserts prefetch request in ring buffer and > sends it to the server. When read operation is performed we check if > there is correspondent prefetch request in ring buffer and if so - waits > its completion. > Thanks. Sure, neon has to use some custom prefetch implementation, not posix_fadvise, considering there's no local page cache in the architecture. The thing that was not clear to me is who decides what to prefetch, which code issues the prefetch requests etc. In the github links you shared I see it happens in the index AM code (in nbtsearch.c). That's interesting, because that's what my first prefetching patch did too - not the same way, ofc, but in the same layer. Simply because it seemed like the simplest way to do that. But the feedback was that's the wrong layer, and that it should happen in the executor. And I agree with that - the reasons are somewhere in the other thread. Based on what I saw in the neon code, I think it should be possible for neon to use "my" approach too, but that only works for the index scans, ofc. Not sure what to do about the other places. > As I already wrote - prefetch is done locally for each backend. And each > backend has its own connection with page server. It can be changed in > future when we implement multiplexing of page server connections. But > right now prefetch is local. And certainly prefetch can improve > performance only if we correctly predict subsequent page requests. > If not - then page server does useless jobs and backend has to waity and > consume all issues prefetch requests. This is why in prefetch > implementation for most of nodes we start with minimal prefetch > distance and then increase it. It allows to perform prefetch only for > such queries where it is really efficient (OLAP) and doesn't degrade > performance of simple OLTP queries. > Not sure I understand what's so important about prefetches being "local" for each backend. 
I mean even in postgres each backend prefetches it's own buffers, no matter what the other backends do. Although, neon probably doesn't have the cross-backend sharing through shared buffers etc. right? FWIW I certainly agree with the goal to not harm queries that can't benefit from prefetching. Ramping-up the prefetch distance is something my patch does too, for exactly this reason. > Out prefetch implementation is also compatible with parallel plans, but > here we need to preserve some range of pages for each parallel workers > instead of picking page from some shared queue on demand. It is one of > the major difference with Postgres prefetch using posix_fadvise: each > backend shoudl prefetch only those pages which it will going to read. > Understood. I have no opinion on this, though. >>> Concerning index scan we have implemented two different approaches: for >>> index only scan we try to prefetch leave pages and for index scan we >>> prefetch referenced heap pages. >> In my experience the IOS handling (only prefetching leaf pages) is very >> limiting, and may easily lead to index-only scans being way slower than >> regular index scans. Which is super surprising for users. It's why I >> ended up improving the prefetcher to optionally look at the VM etc. > > Well, my assumption was the following: prefetch is most efficient for > OLAP queries. > Although HTAP (hybrid transactional/analytical processing) is popular > trend now, > classical model is that analytic queries are performed on "historical" > data, which was already proceeded by vacuum and all-visible bits were > set in VM. > May be this assumption is wrong but it seems to me that if most heap > pages are not marked as all-visible, then optimizer should prefetch > bitmap scan to index-only scan. I think this assumption is generally reasonable, but it hinges on the assumption that OLAP queries have most indexes recently vacuumed and all-visible. I'm not sure it's wise to rely on that. 
Without prefetching it's not that important - the worst thing that would happen is that the IOS degrades into regular index-scan. But with prefetching these plans can "invert" with respect to cost. I'm not saying it's terrible or that IOS must have prefetching, but I think it's something users may run into fairly often. And it led me to rework the prefetching so that IOS can prefetch too ... > And for combination of index and heap bitmap scans we can efficiently > prefetch both index and heap pages. > >>> In both cases we start from prefetch distance 0 and increase it until it >>> reaches `effective_io_concurrency` for this relation. Doing so we try to >>> avoid prefetching of useless pages and slowdown of "point" lookups >>> returning one or few records. >>> >> Right, the regular prefetch ramp-up. My patch does the same thing. >> >>> If you are interested, you can look at our implementation in neon repo: >>> all source are available. But briefly speaking, each backend has its own >>> prefetch ring (prefetch requests which are waiting for response). The >>> key idea is that we can send several prefetch requests to page server >>> and then receive multiple replies. It allows to increased speed of OLAP >>> queries up to 10 times. >>> >> Can you point me to the actual code / branch where it happens? I did >> check the github repo, but I don't see anything relevant in the default >> branch (REL_15_STABLE_neon). There are some "prefetch" branches, but >> those seem abandoned. > > Implementation of prefetch mecnahism is in Neon extension: > https://github.com/neondatabase/neon/blob/60ced06586a6811470c16c6386daba79ffaeda13/pgxn/neon/pagestore_smgr.c#L205 > > But concrete implementation of prefetch for particular nodes is > certainly inside Postgres. 
> For example, if you are interested how it is implemented for index scan, > then please look at: > https://github.com/neondatabase/postgres/blob/c1c2272f436ed9231f6172f49de219fe71a9280d/src/backend/access/nbtree/nbtsearch.c#L844 > https://github.com/neondatabase/postgres/blob/c1c2272f436ed9231f6172f49de219fe71a9280d/src/backend/access/nbtree/nbtsearch.c#L1166 > https://github.com/neondatabase/postgres/blob/c1c2272f436ed9231f6172f49de219fe71a9280d/src/backend/access/nbtree/nbtsearch.c#L1467 > https://github.com/neondatabase/postgres/blob/c1c2272f436ed9231f6172f49de219fe71a9280d/src/backend/access/nbtree/nbtsearch.c#L1625 > https://github.com/neondatabase/postgres/blob/c1c2272f436ed9231f6172f49de219fe71a9280d/src/backend/access/nbtree/nbtsearch.c#L2629 > Thanks! Very helpful. As I said, I ended up moving the prefetching to the executor. For indexscans I think it should be possible for neon to benefit from that (in a way, it doesn't need to do anything except for overriding what PrefetchBuffer does). Not sure about the other places where neon needs to prefetch, I don't have ambition to rework those. > >> >>> Heikki thinks that prefetch can be somehow combined with async-io >>> proposal (based on io_uring). But right now they have nothing in common. >>> >> I can imagine async I/O being useful here, but I find the flow of I/O >> requests is quite complex / goes through multiple layers. Or maybe I >> just don't understand how it should work. > I also do not think that it will be possible to marry this two approaches. I didn't actually say it would be impossible - I think it seems like a use case where async I/O should be a natural fit. But I'm not sure how to do that in a way that would not be super confusing and/or fragile when something unexpected happens (like a rescan, or maybe some change to the index structure - page split, etc.) regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> The thing that was not clear to me is who decides what to prefetch, which code issues the prefetch requests etc. In the github links you shared I see it happens in the index AM code (in nbtsearch.c).
It is up to the particular plan node (seqscan, indexscan, ...) to decide which pages to prefetch.
> That's interesting, because that's what my first prefetching patch did too - not the same way, ofc, but in the same layer. Simply because it seemed like the simplest way to do that. But the feedback was that's the wrong layer, and that it should happen in the executor. And I agree with that - the reasons are somewhere in the other thread.
I read the arguments in
https://www.postgresql.org/message-id/flat/8c86c3a6-074e-6c88-3e7e-9452b6a37b9b%40enterprisedb.com#fc792f8d013215ace7971535a5744c83
Separating prefetch info in the index scan descriptor is a really good idea. It would be amazing to have a generic prefetch mechanism for all indexes.
But unfortunately I do not understand how it is possible. The logic of index traversal is implemented inside the AM; the executor doesn't know it.
For example, for a B-Tree scan we can prefetch:
- intermediate pages
- leaf pages
- heap pages referenced by TIDs
Before we load the next intermediate page, we do not know the next leaf pages.
And before we load the next leaf page, we cannot find out the TIDs from this page.
Another challenge is how far we should prefetch (as far as I understand, both your approach and ours use a dynamically extended prefetch window).
> Based on what I saw in the neon code, I think it should be possible for neon to use "my" approach too, but that only works for the index scans, ofc. Not sure what to do about the other places.
We definitely need prefetch for heap scan (it gives the biggest performance gain), for vacuum and also for pg_prewarm. I also tried to implement it for custom indexes such as pg_vector. I am still not sure whether it is possible to create a generic solution which will work for all indexes.
I have also tried to implement an alternative approach to prefetch based on access statistics.
It comes from the use case of a seqscan of a table with large toasted records, where for each record we have to extract its TOAST data.
This is done using a standard index scan, but unfortunately index prefetch doesn't help much here: there is usually just one TOAST segment, so prefetch has no chance to do something useful. But as long as heap records are accessed sequentially, there is a good chance that the TOAST table will also be accessed mostly sequentially. So we can just count the number of sequential requests to each relation, and if the ratio of seq/rand accesses is above some threshold, we can prefetch the next pages of this relation. This is a really universal approach but ... it works mostly for TOAST tables.
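The statistics-based idea can be sketched roughly as follows. Everything here is an assumption made for illustration: the class name `SeqAccessPrefetcher`, the threshold of 2.0, the window of two pages, and the rule that a request counts as sequential only when it targets the block right after the previous one.

```python
from collections import defaultdict


class SeqAccessPrefetcher:
    """Counts sequential vs. random block requests per relation (hypothetical)."""

    def __init__(self, ratio_threshold=2.0, window=2):
        self.ratio_threshold = ratio_threshold  # assumed seq/rand threshold
        self.window = window                    # pages to prefetch ahead
        self.stats = defaultdict(lambda: {"last": None, "seq": 0, "rand": 0})

    def on_read(self, rel, block):
        """Record one block request; return the blocks worth prefetching."""
        s = self.stats[rel]
        if s["last"] is not None and block == s["last"] + 1:
            s["seq"] += 1
        else:
            # The first request to a relation has no predecessor and is
            # counted as random, so the divisor below is never zero.
            s["rand"] += 1
        s["last"] = block
        if s["seq"] / s["rand"] > self.ratio_threshold:
            return [block + i for i in range(1, self.window + 1)]
        return []
```

A relation that is read block after block starts getting prefetch suggestions after a few requests, while a relation that is read in random order never does.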
>> As I already wrote - prefetch is done locally for each backend. And each backend has its own connection with page server. It can be changed in future when we implement multiplexing of page server connections. But right now prefetch is local. And certainly prefetch can improve performance only if we correctly predict subsequent page requests. If not - then page server does useless jobs and backend has to waity and consume all issues prefetch requests. This is why in prefetch implementation for most of nodes we start with minimal prefetch distance and then increase it. It allows to perform prefetch only for such queries where it is really efficient (OLAP) and doesn't degrade performance of simple OLTP queries.
> Not sure I understand what's so important about prefetches being "local" for each backend. I mean even in postgres each backend prefetches it's own buffers, no matter what the other backends do. Although, neon probably doesn't have the cross-backend sharing through shared buffers etc. right?
Sorry if my explanation was not clear :(
> I mean even in postgres each backend prefetches it's own buffers, no matter what the other backends do.
This is exactly the difference. In Neon such an approach doesn't work. Each backend maintains its own prefetch ring, and if a prefetched page turns out not to be the one that is needed, the whole pipeline is lost. For example, a backend has prefetched pages 1, 5 and 10, and then needs to read page 2. It has to consume the responses for 1, 5 and 10 and issue another request for page 2. Instead of improving speed we are just doing extra work. So each backend should prefetch only those pages which it is actually going to read. This is why the prefetch approach used in Postgres, for example for parallel bitmap heap scan, doesn't work for Neon. If you do `posix_fadvise`, the prefetched page is placed in the OS cache and can be used by any parallel worker. But in Neon each parallel worker should be given its own range of pages to scan and should prefetch only these pages.
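One way to picture "its own range of pages" is a static split of the relation into contiguous chunks. This is only an illustration: the helper names `split_block_ranges` and `prefetch_targets` are hypothetical, and equal-sized static chunks are an assumption, not necessarily how Neon assigns ranges.

```python
def split_block_ranges(nblocks, nworkers):
    """Split blocks 0..nblocks-1 into one contiguous range per worker."""
    base, extra = divmod(nblocks, nworkers)
    ranges, start = [], 0
    for worker in range(nworkers):
        size = base + (1 if worker < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def prefetch_targets(block_range, current, distance):
    """Blocks after `current` that a worker may prefetch.

    The window is clipped to the worker's own range, so a worker never
    requests a page that another worker is going to read.
    """
    stop = min(current + 1 + distance, block_range.stop)
    return list(range(current + 1, stop))
```

Each worker then runs its usual prefetch window, but the window stops at the end of the worker's chunk.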
I think that it is also a problem without prefetch. There are cases where seqscan or bitmap heap scan is really much faster than IOS because the latter has to perform a lot of visibility checks. Yes, certainly the optimizer takes into account the percentage of all-visible pages. But it is not trivial to adjust optimizer parameters so that it really chooses the fastest plan.
>> Well, my assumption was the following: prefetch is most efficient for OLAP queries. Although HTAP (hybrid transactional/analytical processing) is popular trend now, classical model is that analytic queries are performed on "historical" data, which was already proceeded by vacuum and all-visible bits were set in VM. May be this assumption is wrong but it seems to me that if most heap pages are not marked as all-visible, then optimizer should prefetch bitmap scan to index-only scan.
> I think this assumption is generally reasonable, but it hinges on the assumption that OLAP queries have most indexes recently vacuumed and all-visible. I'm not sure it's wise to rely on that. Without prefetching it's not that important - the worst thing that would happen is that the IOS degrades into regular index-scan.
> But with prefetching these plans can "invert" with respect to cost.
> I'm not saying it's terrible or that IOS must have prefetching, but I
> think it's something users may run into fairly often. And it led me to
> rework the prefetching so that IOS can prefetch too ...
I think that inspecting the VM for prefetch is a really good idea.
> Thanks! Very helpful. As I said, I ended up moving the prefetching to the executor. For indexscans I think it should be possible for neon to benefit from that (in a way, it doesn't need to do anything except for overriding what PrefetchBuffer does). Not sure about the other places where neon needs to prefetch, I don't have ambition to rework those.
Once your PR is merged, I will rewrite the Neon prefetch implementation for indexes using your approach.
On 1/15/24 15:22, Konstantin Knizhnik wrote: > > On 14/01/2024 11:47 pm, Tomas Vondra wrote: >> The thing that was not clear to me is who decides what to prefetch, >> which code issues the prefetch requests etc. In the github links you >> shared I see it happens in the index AM code (in nbtsearch.c). > > > It is up to the particular plan node (seqscan, indexscan,...) which > pages to prefetch. > > >> >> That's interesting, because that's what my first prefetching patch did >> too - not the same way, ofc, but in the same layer. Simply because it >> seemed like the simplest way to do that. But the feedback was that's the >> wrong layer, and that it should happen in the executor. And I agree with >> that - the reasons are somewhere in the other thread. >> > I read the arguments in > > https://www.postgresql.org/message-id/flat/8c86c3a6-074e-6c88-3e7e-9452b6a37b9b%40enterprisedb.com#fc792f8d013215ace7971535a5744c83 > > Separating prefetch info in index scan descriptor is really good idea. > It will be amazing to have generic prefetch mechanism for all indexes. > But unfortunately I do not understand how it is possible. The logic of > index traversal is implemented inside AM. Executor doesn't know it. > For example for B-Tree scan we can prefetch: > > - intermediate pages > - leave pages > - referenced by TID heap pages > My patch does not care about prefetching internal index pages. Yes, it's a limitation, but my assumption is the internal pages are maybe 0.1% of the index, and typically very hot / cached. Yes, if the index is not used very often, this may be untrue. But I consider it a possible future improvement, for some other patch. FWIW there's a prefetching patch for inserts into indexes (which only prefetches just the index leaf pages). > Before we load next intermediate page, we do not know next leave pages. > And before we load next leave page, we can not find out TIDs from this > page. > Not sure I understand what this is about. 
The patch simply calls the index AM function index_getnext_tid() enough times to fill the prefetch queue. It does not prefetch the next index leaf page, it however does prefetch the heap pages. It does not "stall" at the boundary of the index leaf page, or something. > Another challenge - is how far we should prefetch (as far as I > understand both your and our approach using dynamically extended > prefetch window) > By dynamic extension of prefetch window you mean the incremental growth of the prefetch distance from 0 to effective_io_concurrency? I don't think there's a better solution. There might be additional information that we could consider (e.g. expected number of rows for the plan, earlier executions of the scan, ...) but each of these has a failure mode. >> Based on what I saw in the neon code, I think it should be possible for >> neon to use "my" approach too, but that only works for the index scans, >> ofc. Not sure what to do about the other places. > We definitely need prefetch for heap scan (it gives the most advantages > in performance), for vacuum and also for pg_prewarm. Also I tried to > implement it for custom indexes such as pg_vector. I still not sure > whether it is possible to create some generic solution which will work > for all indexes. > I haven't tried with pgvector, but I don't see why my patch would not work for all index AMs that can return TID. > I have also tried to implement alternative approach for prefetch based > on access statistic. > It comes from use case of seqscan of table with larger toasted records. > So for each record we have to extract its TOAST data. > It is done using standard index scan, but unfortunately index prefetch > doesn't help much here: there is usually just one TOAST segment and so > prefetch just have no chance to do something useful. But as far as heap > records are accessed sequentially, there is good chance that toast table > will also be accessed mostly sequentially. 
So we just can count number > of sequential requests to each relation and if ratio or seq/rand > accesses is above some threshold we can prefetch next pages of this > relation. This is really universal approach but ... working mostly for > TOAST table. > Are you're talking about what works / doesn't work in neon, or about postgres in general? I'm not sure what you mean by "one TOAST segment" and I'd also guess that if both tables are accessed mostly sequentially, the read-ahead will do most of the work (in postgres). It's probably true that as we do a separate index scan for each TOAST-ed value, that can't really ramp-up the prefetch distance fast enough. Maybe we could have a mode where we start with the full distance? > >>> As I already wrote - prefetch is done locally for each backend. And each >>> backend has its own connection with page server. It can be changed in >>> future when we implement multiplexing of page server connections. But >>> right now prefetch is local. And certainly prefetch can improve >>> performance only if we correctly predict subsequent page requests. >>> If not - then page server does useless jobs and backend has to waity and >>> consume all issues prefetch requests. This is why in prefetch >>> implementation for most of nodes we start with minimal prefetch >>> distance and then increase it. It allows to perform prefetch only for >>> such queries where it is really efficient (OLAP) and doesn't degrade >>> performance of simple OLTP queries. >>> >> Not sure I understand what's so important about prefetches being "local" >> for each backend. I mean even in postgres each backend prefetches it's >> own buffers, no matter what the other backends do. Although, neon >> probably doesn't have the cross-backend sharing through shared buffers >> etc. right? > > > Sorry if my explanation was not clear:( > >> I mean even in postgres each backend prefetches it's own buffers, no >> matter what the other backends do. > > This is exactly the difference. 
In Neon such approach doesn't work. > Each backend maintains it's own prefetch ring. And if prefetched page > was not actually received, then the whole pipe is lost. > I.e. backend prefetched pages 1,5,10. Then it need to read page 2. So it > has to consume responses for 1,5,10 and issue another request for page 2. > Instead of improving speed we are just doing extra job. > So each backend should prefetch only those pages which it is actually > going to read. > This is why prefetch approach used in Postgres for example for parallel > bitmap heap scan doesn't work for Neon. > If you do `posic_fadvise` then prefetched page is placed in OS cache and > can be used by any parallel worker. > But in Neon each parallel worker should be given its own range of pages > to scan and prefetch only this pages. > I still don't quite see/understand the difference. I mean, even in postgres each backend does it's own prefetches, using it's own prefetch ring. But I'm not entirely sure about the neon architecture differences. Does this mean neon can do prefetching from the executor in principle? Could you perhaps describe a situation where the bitmap can prefetching (as implemented in Postgres) does not work for neon? >> >>> Well, my assumption was the following: prefetch is most efficient >>> forOLAP queries. >>> Although HTAP (hybrid transactional/analytical processing) is popular >>> trend now, >>> classical model is that analytic queries are performed on "historical" >>> data, which was already proceeded by vacuum and all-visible bits were >>> set in VM. >>> May be this assumption is wrong but it seems to me that if most heap >>> pages are not marked as all-visible, then optimizer should prefetch >>> bitmap scan to index-only scan. >> I think this assumption is generally reasonable, but it hinges on the >> assumption that OLAP queries have most indexes recently vacuumed and >> all-visible. I'm not sure it's wise to rely on that. 
>> >> Without prefetching it's not that important - the worst thing that would >> happen is that the IOS degrades into regular index-scan. >> > I think that it is also problem without prefetch. There are cases where > seqscan or bitmap heap scan are really much faster then IOS because last > one has to perform a lot of visibility checks. Yes, certainly optimizer > takes in account percent of all-visible pages.But with it is not tricial > to adjust optimizer parameters so that it can really choose fastest plan. True. There's more cases where it can happen, no doubt about it. But I think those cases are somewhat less likely. >> But withprefetching these plans can "invert" with respect to cost. >> >> I'm not saying it's terrible or that IOS must have prefetching, but I >> think it's something users may run into fairly often. And it led me to >> rework the prefetching so that IOS can prefetch too ... >> >> > > I think that inspecting VM for prefetch is really good idea. > >> Thanks! Very helpful. As I said, I ended up moving the prefetching to >> the executor. For indexscans I think it should be possible for neon to >> benefit from that (in a way, it doesn't need to do anything except for >> overriding what PrefetchBuffer does). Not sure about the other places >> where neon needs to prefetch, I don't have ambition to rework those. >> > Once your PR will be merged, I will rewrite Neon prefetch implementation > fopr indexces using your approach. > Well, maybe you could try doing rewriting it now, so that you can give some feedback to the patch. I'd appreciate that. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 15/01/2024 5:08 pm, Tomas Vondra wrote:
> My patch does not care about prefetching internal index pages. Yes, it's a limitation, but my assumption is the internal pages are maybe 0.1% of the index, and typically very hot / cached. Yes, if the index is not used very often, this may be untrue. But I consider it a possible future improvement, for some other patch. FWIW there's a prefetching patch for inserts into indexes (which only prefetches just the index leaf pages).
We have to prefetch pages at the height-1 level (parents of leaf pages) for IOS, because otherwise the prefetch pipeline is broken at each transition to the next leaf page.
When we start a new leaf page we have to fill the prefetch ring from scratch, which certainly has a negative impact on performance.
> Not sure I understand what this is about. The patch simply calls the index AM function index_getnext_tid() enough times to fill the prefetch queue. It does not prefetch the next index leaf page, it however does prefetch the heap pages. It does not "stall" at the boundary of the index leaf page, or something.
OK, now I fully understand your approach. It looks really elegant and works for all indexes.
There is still an issue with IOS and seqscan.
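As an illustration of the executor-level scheme described in the quoted text, here is a minimal Python sketch. It is not the patch itself (which is C code in the executor): the iterator stands in for repeated index_getnext_tid() calls, and `scan_with_prefetch`, the `prefetch` callback and the rule of skipping a repeated heap block are all assumptions made for the example.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, Optional, Tuple

TID = Tuple[int, int]  # (heap block number, offset within the block)


def scan_with_prefetch(tids: Iterable[TID], distance: int,
                       prefetch: Callable[[int], None]) -> Iterator[TID]:
    """Yield TIDs in index order, requesting heap blocks ahead of use.

    Hypothetical model: `tids` stands in for repeated index_getnext_tid()
    calls and `prefetch` for whatever issues the I/O hint.
    """
    source = iter(tids)
    queue: Deque[TID] = deque()
    last_block: Optional[int] = None
    exhausted = False
    capacity = max(distance, 1)
    while True:
        # Top up the queue. The TID stream crosses index leaf pages
        # transparently, so the refill does not restart at a leaf boundary.
        while not exhausted and len(queue) < capacity:
            try:
                tid = next(source)
            except StopIteration:
                exhausted = True
                break
            if distance > 0 and tid[0] != last_block:
                prefetch(tid[0])  # assumption: skip repeats of the same block
                last_block = tid[0]
            queue.append(tid)
        if not queue:
            return
        yield queue.popleft()
```

The point of the sketch is that the consumer only sees a stream of TIDs, so the same loop works for any index AM that can return TIDs; what it cannot do is prefetch the index pages themselves.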
>> Another challenge is how far we should prefetch (as far as I understand, both your approach and ours use a dynamically extended prefetch window).
> By dynamic extension of prefetch window you mean the incremental growth of the prefetch distance from 0 to effective_io_concurrency?
Yes.
> I don't think there's a better solution.
I tried one more solution: propagate information about the expected number of fetched rows to the AM. Based on this information it is possible to choose a proper prefetch distance.
Certainly it is not quite precise: we can scan a large number of rows while only a few of them pass the filter. This is why this approach was not committed in Neon.
But I still think that using statistics to determine the prefetch window is not such a bad idea. Maybe it needs more thought.
> There might be additional information that we could consider (e.g. expected number of rows for the plan, earlier executions of the scan, ...) but each of these has a failure mode.
I wrote the reply above before reading this fragment :)
So I have already tried it.
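The incremental growth of the prefetch distance discussed above can be sketched as follows. The doubling rule and the function names are assumptions for illustration only; neither patch is claimed to use exactly this policy. `maximum` plays the role of effective_io_concurrency.

```python
from typing import List


def next_distance(current: int, maximum: int) -> int:
    """Grow the prefetch distance one step, never exceeding `maximum`."""
    if current >= maximum:
        return maximum
    if current == 0:
        return 1                      # first step: start prefetching at all
    return min(current * 2, maximum)  # assumed policy: double each step


def ramp(maximum: int, steps: int) -> List[int]:
    """Return the distance used at each of the first `steps` fetches."""
    distance = 0
    used = []
    for _ in range(steps):
        used.append(distance)
        distance = next_distance(distance, maximum)
    return used
```

With `maximum=16` the first seven fetches use distances 0, 1, 2, 4, 8, 16, 16, so a query that reads only a handful of rows issues almost no prefetch requests, while a long scan reaches the full distance quickly.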
> I haven't tried with pgvector, but I don't see why my patch would not work for all index AMs that can return TIDs.
Yes, I agree. But it will be efficient only if getting the next TID is cheap, i.e. it is located on the same leaf page.
>> I have also tried to implement an alternative approach for prefetch based on access statistics. It comes from the use case of a seqscan of a table with large toasted records. For each record we have to extract its TOAST data. It is done using a standard index scan, but unfortunately index prefetch doesn't help much here: there is usually just one TOAST segment, so prefetch has no chance to do something useful. But as long as heap records are accessed sequentially, there is a good chance that the toast table will also be accessed mostly sequentially. So we can just count the number of sequential requests to each relation, and if the ratio of seq/rand accesses is above some threshold we can prefetch the next pages of this relation. This is a really universal approach but ... it works mostly for TOAST tables.
> Are you talking about what works / doesn't work in neon, or about postgres in general? I'm not sure what you mean by "one TOAST segment" and I'd also guess that if both tables are accessed mostly sequentially, the read-ahead will do most of the work (in postgres).
Yes, I agree: in the case of vanilla Postgres the OS will do read-ahead. But not in Neon.
By one TOAST segment I mean one TOAST record (2 kB).
> It's probably true that as we do a separate index scan for each TOAST-ed value, that can't really ramp-up the prefetch distance fast enough. Maybe we could have a mode where we start with the full distance?
Sorry, I do not understand. Especially in this case a large prefetch window is undesirable.
Most records fit in 2 kB, so we need to fetch only one heap (TOAST) record per TOAST index search.
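The access-statistics idea quoted above (count sequential versus random requests per relation and read ahead once the ratio passes a threshold) could look roughly like this. It is a sketch under stated assumptions, not Neon's implementation: the class name, the threshold, the minimum sample count and the read-ahead window are all hypothetical.

```python
from typing import Dict, List, Optional


class SequentialDetector:
    """Per-relation counter of sequential vs. random block accesses."""

    def __init__(self, threshold: float = 0.8, min_samples: int = 4,
                 window: int = 2) -> None:
        self.threshold = threshold      # required share of sequential reads
        self.min_samples = min_samples  # do nothing until this many reads
        self.window = window            # how many next blocks to suggest
        self.last: Dict[str, Optional[int]] = {}
        self.seq: Dict[str, int] = {}
        self.rand: Dict[str, int] = {}

    def record(self, rel: str, block: int) -> List[int]:
        """Record a read and return the blocks worth prefetching next."""
        prev = self.last.get(rel)
        self.last[rel] = block
        if prev is None:
            return []                   # first read: nothing to compare with
        if block == prev + 1:
            self.seq[rel] = self.seq.get(rel, 0) + 1
        else:
            self.rand[rel] = self.rand.get(rel, 0) + 1
        seq, rand = self.seq.get(rel, 0), self.rand.get(rel, 0)
        total = seq + rand
        if total < self.min_samples or seq / total < self.threshold:
            return []
        if block != prev + 1:
            return []                   # read ahead only after a sequential step
        return [block + i for i in range(1, self.window + 1)]
```

Because the detector looks only at block numbers per relation, it does not care which plan node caused the reads, which is why it can help for TOAST lookups where each index scan is too short for a ramp-up to matter.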
>> This is exactly the difference. In Neon such an approach doesn't work. Each backend maintains its own prefetch ring. And if a prefetched page was not actually received, then the whole pipe is lost. I.e. the backend prefetched pages 1,5,10. Then it needs to read page 2. So it has to consume the responses for 1,5,10 and issue another request for page 2. Instead of improving speed we are just doing extra work. So each backend should prefetch only those pages which it is actually going to read. This is why the prefetch approach used in Postgres, for example for parallel bitmap heap scan, doesn't work for Neon. If you do `posix_fadvise` then the prefetched page is placed in the OS cache and can be used by any parallel worker. But in Neon each parallel worker should be given its own range of pages to scan and prefetch only these pages.
> I still don't quite see/understand the difference. I mean, even in postgres each backend does its own prefetches, using its own prefetch ring. But I'm not entirely sure about the neon architecture differences
I am not speaking about your approach. It will work with Neon as well.
I am describing why the implementation of prefetch for bitmap heap scan doesn't work for Neon:
it issues prefetch requests for pages which are never accessed by this parallel worker.
> Does this mean neon can do prefetching from the executor in principle? Could you perhaps describe a situation where the bitmap scan prefetching (as implemented in Postgres) does not work for neon?
I am speaking about the prefetch implementation in nodeBitmapHeapscan. The prefetch iterator is not synced with the normal iterator, i.e. they can return different pages.
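The "prefetched 1,5,10, then need page 2" example can be turned into a toy model. This is only a sketch built from the description in this message, not Neon's code: `PrefetchRing`, its counters, and the rule that responses consumed on the way to another page are discarded are assumptions.

```python
from collections import deque
from typing import Deque


class PrefetchRing:
    """Toy per-backend ring: responses arrive in request order."""

    def __init__(self) -> None:
        self.in_flight: Deque[int] = deque()
        self.sync_reads = 0   # reads that had to wait for a full roundtrip
        self.discarded = 0    # prefetch responses consumed but never used

    def prefetch(self, page: int) -> None:
        self.in_flight.append(page)

    def read(self, page: int) -> str:
        if page in self.in_flight:
            # Consume responses in order until the wanted page shows up.
            while self.in_flight:
                head = self.in_flight.popleft()
                if head == page:
                    return "prefetched"
                self.discarded += 1   # assumption: skipped pages are dropped
        # Miss: every outstanding response must be consumed before the
        # new request can be answered, and none of them is what we need.
        self.discarded += len(self.in_flight)
        self.in_flight.clear()
        self.sync_reads += 1
        return "sync"
```

In this model a worker that prefetches pages 1, 5, 10 and then reads page 2 pays for three useless responses plus one synchronous read, whereas a worker that reads exactly what it prefetched pays for neither.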
> Well, maybe you could try rewriting it now, so that you can give some feedback to the patch. I'd appreciate that.
I will try.
Best regards,
Konstantin
On 1/15/24 21:42, Konstantin Knizhnik wrote:
> On 15/01/2024 5:08 pm, Tomas Vondra wrote:
>> My patch does not care about prefetching internal index pages. Yes, it's a limitation, but my assumption is the internal pages are maybe 0.1% of the index, and typically very hot / cached. Yes, if the index is not used very often, this may be untrue. But I consider it a possible future improvement, for some other patch. FWIW there's a prefetching patch for inserts into indexes (which only prefetches just the index leaf pages).
> We have to prefetch pages at the height-1 level (parents of leaf pages) for IOS, because otherwise the prefetch pipeline is broken at each transition to the next leaf page.
> When we start a new leaf page we have to fill the prefetch ring from scratch, which certainly has a negative impact on performance.
By "broken" you mean that you prefetch items only from a single leaf page, so immediately after reading the next one nothing is prefetched. Correct? Yeah, I had this problem initially too, when I did the prefetching in the index AM code. One of the reasons why it got moved to the executor.
>> Not sure I understand what this is about. The patch simply calls the index AM function index_getnext_tid() enough times to fill the prefetch queue. It does not prefetch the next index leaf page, it however does prefetch the heap pages. It does not "stall" at the boundary of the index leaf page, or something.
> OK, now I fully understand your approach. It looks really elegant and works for all indexes.
> There is still an issue with IOS and seqscan.
Not sure. For seqscan, I think this has nothing to do with it. Postgres relies on read-ahead to do the work - of course, if that doesn't work (e.g. for async/direct I/O that'd be the case), an improvement will be needed. But it's unrelated to this patch, and I'm certainly not saying this patch does that. I think Thomas/Andres did some work on that.
For IOS, I think the limitation that this does not prefetch any index pages (especially the leafs) is there, and it'd be nice to do something about it. But I see it as a separate thing, which I think does need to happen in the index AM layer (not in the executor).
>>> Another challenge is how far we should prefetch (as far as I understand, both your approach and ours use a dynamically extended prefetch window).
>> By dynamic extension of prefetch window you mean the incremental growth of the prefetch distance from 0 to effective_io_concurrency?
> Yes.
>> I don't think there's a better solution.
> I tried one more solution: propagate information about the expected number of fetched rows to the AM. Based on this information it is possible to choose a proper prefetch distance.
> Certainly it is not quite precise: we can scan a large number of rows while only a few of them pass the filter. This is why this approach was not committed in Neon.
> But I still think that using statistics to determine the prefetch window is not such a bad idea. Maybe it needs more thought.
I don't think we should rely on this information too much. It's far too unreliable - especially the planner estimates. The run-time data may be more accurate, but I'm worried it may be quite variable (e.g. for different runs of the scan).
My position is to keep this as simple as possible, and prefer to be more conservative when possible - that is, shorter prefetch distances. In my experience the benefit of prefetching is subject to diminishing returns, i.e. going from 0 => 16 is way bigger difference than 16 => 32. So better to stick with lower value instead of wasting resources.
>> There might be additional information that we could consider (e.g. expected number of rows for the plan, earlier executions of the scan, ...) but each of these has a failure mode.
> I wrote the reply above before reading this fragment :)
> So I have already tried it.
>> I haven't tried with pgvector, but I don't see why my patch would not work for all index AMs that can return TIDs.
> Yes, I agree. But it will be efficient only if getting the next TID is cheap, i.e. it is located on the same leaf page.
Maybe. I haven't tried/thought about it, but yes - if it requires doing a lot of work in between the prefetches, the benefits of prefetching will diminish naturally. Might be worth doing some experiments.
>>> I have also tried to implement an alternative approach for prefetch based on access statistics. It comes from the use case of a seqscan of a table with large toasted records. For each record we have to extract its TOAST data. It is done using a standard index scan, but unfortunately index prefetch doesn't help much here: there is usually just one TOAST segment, so prefetch has no chance to do something useful. But as long as heap records are accessed sequentially, there is a good chance that the toast table will also be accessed mostly sequentially. So we can just count the number of sequential requests to each relation, and if the ratio of seq/rand accesses is above some threshold we can prefetch the next pages of this relation. This is a really universal approach but ... it works mostly for TOAST tables.
>> Are you talking about what works / doesn't work in neon, or about postgres in general?
>> I'm not sure what you mean by "one TOAST segment" and I'd also guess that if both tables are accessed mostly sequentially, the read-ahead will do most of the work (in postgres).
> Yes, I agree: in the case of vanilla Postgres the OS will do read-ahead. But not in Neon.
> By one TOAST segment I mean one TOAST record (2 kB).
Ah, you mean "TOAST chunk". Yes, if a record fits into a single TOAST chunk, my prefetch won't work. Not sure what to do for neon ...
>> It's probably true that as we do a separate index scan for each TOAST-ed value, that can't really ramp-up the prefetch distance fast enough. Maybe we could have a mode where we start with the full distance?
> Sorry, I do not understand. Especially in this case a large prefetch window is undesirable.
> Most records fit in 2 kB, so we need to fetch only one heap (TOAST) record per TOAST index search.
Yeah, I was confused what you mean by "segment". My point was that if a value is TOAST-ed into multiple chunks, maybe we should allow more aggressive prefetching instead of the slow ramp-up ... But yeah, if there's just one TOAST chunk, that does not help.
>>> This is exactly the difference. In Neon such an approach doesn't work. Each backend maintains its own prefetch ring. And if a prefetched page was not actually received, then the whole pipe is lost. I.e. the backend prefetched pages 1,5,10. Then it needs to read page 2. So it has to consume the responses for 1,5,10 and issue another request for page 2. Instead of improving speed we are just doing extra work. So each backend should prefetch only those pages which it is actually going to read. This is why the prefetch approach used in Postgres, for example for parallel bitmap heap scan, doesn't work for Neon. If you do `posix_fadvise` then the prefetched page is placed in the OS cache and can be used by any parallel worker. But in Neon each parallel worker should be given its own range of pages to scan and prefetch only these pages.
>> I still don't quite see/understand the difference. I mean, even in postgres each backend does its own prefetches, using its own prefetch ring. But I'm not entirely sure about the neon architecture differences
> I am not speaking about your approach. It will work with Neon as well.
> I am describing why the implementation of prefetch for bitmap heap scan doesn't work for Neon: it issues prefetch requests for pages which are never accessed by this parallel worker.
>> Does this mean neon can do prefetching from the executor in principle?
>> Could you perhaps describe a situation where the bitmap scan prefetching (as implemented in Postgres) does not work for neon?
> I am speaking about the prefetch implementation in nodeBitmapHeapscan. The prefetch iterator is not synced with the normal iterator, i.e. they can return different pages.
Ah, now I think I understand. The workers don't share memory, so the pages prefetched by one worker are wasted if some other worker ends up processing them.
>> Well, maybe you could try rewriting it now, so that you can give some feedback to the patch. I'd appreciate that.
> I will try.
Thanks!
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 16/01/2024 5:38 pm, Tomas Vondra wrote:
> By "broken" you mean that you prefetch items only from a single leaf page, so immediately after reading the next one nothing is prefetched. Correct?
Yes, exactly. It means that reading the first heap page from the next leaf page will be done without prefetch, which in the case of Neon means a roundtrip to the page server (~0.2 msec within one data center).
> Yeah, I had this problem initially too, when I did the prefetching in the index AM code. One of the reasons why it got moved to the executor.
Yeah, it works nicely for vanilla Postgres. You call index_getnext_tid() and when it reaches the end of a leaf page it reads the next leaf page. Because of OS read-ahead this read is expected to be fast even without prefetch. But not in the Neon case: we have to download this page from the page server (see above). So the ideal solution for Neon would be to prefetch both the leaf pages and the referenced heap pages. And prefetch of the latter should be initiated as soon as the leaf page is loaded. Unfortunately it is non-trivial to implement, and the current index scan prefetch implementation for Neon does not do it.
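To get a feeling for what one unprefetched read per leaf-page transition costs, here is a back-of-the-envelope model. It assumes exactly one synchronous roundtrip per transition and uses the ~0.2 msec figure quoted above; the function name and the "one stall per transition" simplification are assumptions, not measurements.

```python
def leaf_transition_stall_us(leaf_pages: int, roundtrip_us: int = 200) -> int:
    """Total time (in microseconds) spent waiting on unprefetched reads,
    assuming one synchronous roundtrip per leaf-page transition."""
    transitions = max(leaf_pages - 1, 0)  # no transition before the first leaf
    return transitions * roundtrip_us
```

Under these assumptions a scan that walks 10,001 leaf pages waits 10,000 times, i.e. 2,000,000 microseconds or about 2 seconds, regardless of how well the heap pages within each leaf are prefetched.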