Thread: Memory consumed by paths during partitionwise join planning
Hi All,

If add_path() and add_partial_path() do not find a new path to be superior to any existing path, they free the new path. They free an existing path if it is found to be inferior to the new path. But not all paths surviving in a RelOptInfo are used to create paths for the relations which use it as input. Further, add_path() and add_partial_path() do not free the whole path tree, only the topmost path node; the subpath nodes are not freed because they may be referenced by other paths. Such subpaths continue to occupy memory even when they are not used anywhere. As we build the relation tree upwards (base, join, upper relations), more and more such paths accumulate and keep consuming memory till the statement ends. With partitionwise join involving partitioned tables with thousands of partitions, this memory consumption grows in proportion to the number of partitions.

The attached WIP patches address this issue by adding infrastructure to track references to paths and free unreferenced paths once they can be deemed useless. A new member, Path::ref_count, tracks how many other objects (pathlists in RelOptInfo, other paths, etc.) reference a given path node. As path nodes are referenced, they are "linked" to the referencing objects using link_path(); when they are no longer referenced, they are "unlinked" using unlink_path(). Path nodes are freed using free_path(), which also unlinks the subpath nodes so that they can be freed when their reference counts drop to 0; paths whose reference counts reach 0 during unlinking are freed automatically using free_path(). Patch 0002 adds this infrastructure; 0003 and 0004 use these functions in example cases.

With these patches the memory consumption numbers look like below.

Experiment
----------
Memory consumption is measured using a method similar to the one described in [1]. The changes to EXPLAIN ANALYZE to report planning memory are in 0001.

Memory consumed when planning a self-join query is measured. The queries involve partitioned and unpartitioned tables respectively. The partitioned table has 1000 partitions. The table definitions and helper function can be found in setup.sql and the queries in queries.sql. This is the simplest setup; higher savings may be seen with more complex setups involving indexes, SQL operators and clauses.

Table 1: Join between unpartitioned tables
Number of tables | without patch | with patch | % reduction |
being joined     |               |            |             |
--------------------------------------------------------------
       2         |   29.0 KiB    |  28.9 KiB  |    0.66%    |
       3         |   79.1 KiB    |  76.7 KiB  |    3.03%    |
       4         |  208.6 KiB    | 198.2 KiB  |    4.97%    |
       5         |  561.6 KiB    | 526.3 KiB  |    6.28%    |

Table 2: Join between partitioned tables with partitionwise join enabled (enable_partitionwise_join = true)
Number of tables | without patch | with patch | % reduction |
being joined     |               |            |             |
----------------------------------------------------------------
       2         |   40.3 MiB    |  40.0 MiB  |    0.70%    |
       3         |  146.9 MiB    | 143.1 MiB  |    2.55%    |
       4         |  445.4 MiB    | 430.4 MiB  |    3.38%    |
       5         | 1563.3 MiB    | 1517.6 MiB |    2.92%    |

The patch set is not complete because of the following issues:

a. The 0003 and 0004 patches do not cover all the path nodes. I have covered only those which I encountered in the queries I ran. If we accept this proposal, I will work on covering all the path nodes.

b. It does not cover the entire RelOptInfo tree that the planner creates.
The paths in a lower-level RelOptInfo can be deemed useless only when path creation for all the immediately upper RelOptInfos is complete, so we need access to both the upper and lower level RelOptInfos at the same time. The RelOptInfos created while planning joins are available in standard_join_search(), so it is possible to free the unused paths listed in all the RelOptInfos except the topmost one in that function, as done in the patch. But the topmost RelOptInfo acts as an input to the upper RelOptInfos in grouping_planner(), where we don't have access to the RelOptInfos at lower levels in the join tree. Hence we can't free the unused paths from the topmost RelOptInfo and cascade that effect down the RelOptInfo tree. This is why we don't see a memory reduction in the 2-way join case, and also why the numbers above are lower than the memory that could actually be saved. If we decide to move ahead with this approach, I will work on this too.

c. The patch does not call link_path() and unlink_path() in all the required places. Some work is needed to identify such places now, and to build infrastructure and methods to identify such places in the future.

Another approach would be to use a separate memory context for creating path nodes and delete the entire memory context at the end of planning, once the plan is created. We would also need to reset the path lists and the cheapest_*_path members in the RelOptInfos. This would possibly free up more memory and might be faster, but I have not tried it. The approach taken in the patch has an advantage over this one: the paths can be freed at any stage of planning using free_unused_paths_from_rel() implemented in the patch, so we can monitor memory consumption and trigger garbage collection when it crosses a certain threshold. Or we may implement both approaches: clean up every bit of paths at the end of planning, while garbage-collecting path nodes whenever memory consumption goes beyond a threshold. The reference mechanism may have other usages as well.

Suggestions/comments welcome.

References
1. https://www.postgresql.org/message-id/CAExHW5stmOUobE55pMt83r8UxvfCph+Pvo5dNpdrVCsBgXEzDQ@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
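To make the linking scheme described above concrete, here is a minimal sketch of what the primitives could look like. The names ref_count, link_path(), unlink_path() and free_path() come from the description above; the signatures and bodies are assumptions for illustration, not the actual patch code.

#include "postgres.h"

#include "nodes/pathnodes.h"

static void free_path(Path *path);

/* Record a new reference to 'path' at the location pointed to by 'link'. */
static void
link_path(Path **link, Path *path)
{
	path->ref_count++;			/* ref_count is the new Path member added by the patch */
	*link = path;
}

/* Drop the reference held at '*link'; free the path once nothing references it. */
static void
unlink_path(Path **link)
{
	Path	   *path = *link;

	*link = NULL;
	Assert(path->ref_count > 0);
	if (--path->ref_count == 0)
		free_path(path);
}

/*
 * Free a path node that nothing references any more.  Only the topmost
 * node is freed here; its direct subpaths are merely unlinked, so they are
 * freed in turn only when their own reference counts drop to zero.
 */
static void
free_path(Path *path)
{
	Assert(path->ref_count == 0);

	switch (nodeTag(path))
	{
		case T_MaterialPath:
			unlink_path(&((MaterialPath *) path)->subpath);
			break;

			/* ... one case per Path subclass that has subpaths ... */

		default:
			break;
	}

	pfree(path);
}

Note how free_path() only unlinks direct subpaths, so a subpath shared by several surviving paths stays alive until the last path referencing it is freed.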
On Fri, 28 Jul 2023 at 02:06, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> Table 1: Join between unpartitioned tables
> Number of tables | without patch | with patch | % reduction |
> being joined     |               |            |             |
> --------------------------------------------------------------
>        2         |   29.0 KiB    |  28.9 KiB  |    0.66%    |
>        3         |   79.1 KiB    |  76.7 KiB  |    3.03%    |
>        4         |  208.6 KiB    | 198.2 KiB  |    4.97%    |
>        5         |  561.6 KiB    | 526.3 KiB  |    6.28%    |

I have mostly just random thoughts and some questions...

Have you done any analysis of the node types that are consuming all that memory? Are Path and subclasses of Path really that dominant in this? The memory savings aren't that impressive and it makes me wonder if this is worth the trouble.

What's the goal of the memory reduction work? If it's for performance, does this increase performance? Tracking refcounts of Paths isn't free.

Why do your refcounts start at 1? This seems weird:

+ /* Free the path if none references it. */
+ if (path->ref_count == 1)

Does ref_count really need to be an int? There's a 16-bit hole you could use between param_info and parallel_aware. I doubt you'd need 32 bits of space for this.

I agree that it would be nice to get rid of the "if (!IsA(old_path, IndexPath))" hack in add_path() and it seems like this idea could allow that. However, I think if we were to have something like this then you'd want all the logic to be neatly contained inside pathnode.c without adding any burden in other areas to track or check Path refcounts.

I imagine the patch would be neater if you added some kind of Path traversal function akin to expression_tree_walker() that allows you to visit each subpath recursively and run a callback function on each.

David
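For reference, here is roughly what such a traversal function could look like. This is only a sketch: path_walker() is not an existing PostgreSQL function, and it handles just a few path types to show the shape; a complete version would need a case for every Path subclass that carries subpaths.

#include "postgres.h"

#include "nodes/pathnodes.h"

typedef bool (*path_walker_fn) (Path *path, void *context);

/*
 * Visit 'path' and then its subpaths, stopping early if the callback
 * returns true.
 */
static bool
path_walker(Path *path, path_walker_fn walker, void *context)
{
	if (path == NULL)
		return false;

	/* Visit this node first. */
	if (walker(path, context))
		return true;

	switch (nodeTag(path))
	{
		case T_MaterialPath:
			return path_walker(((MaterialPath *) path)->subpath,
							   walker, context);
		case T_SortPath:
			return path_walker(((SortPath *) path)->subpath,
							   walker, context);
		case T_NestPath:
		case T_MergePath:
		case T_HashPath:
			{
				JoinPath   *jpath = (JoinPath *) path;

				if (path_walker(jpath->outerjoinpath, walker, context))
					return true;
				return path_walker(jpath->innerjoinpath, walker, context);
			}
		default:
			/* scan paths and the types not handled above have no subpaths here */
			return false;
	}
}

With something like this, bumping or dropping reference counts over a whole path tree becomes a small callback rather than per-call-site code.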
On Wed, Nov 29, 2023 at 1:10 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Fri, 28 Jul 2023 at 02:06, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> > Table 1: Join between unpartitioned tables
> > Number of tables | without patch | with patch | % reduction |
> > being joined     |               |            |             |
> > --------------------------------------------------------------
> >        2         |   29.0 KiB    |  28.9 KiB  |    0.66%    |
> >        3         |   79.1 KiB    |  76.7 KiB  |    3.03%    |
> >        4         |  208.6 KiB    | 198.2 KiB  |    4.97%    |
> >        5         |  561.6 KiB    | 526.3 KiB  |    6.28%    |
>
> I have mostly just random thoughts and some questions...
>
> Have you done any analysis of the node types that are consuming all that memory? Are Path and subclasses of Path really that dominant in this? The memory savings aren't that impressive and it makes me wonder if this is worth the trouble.

This thread and patch target saving the memory consumed by Path nodes, i.e. Path and its subclasses. A breakdown of the memory consumed by various node types can be found in [1] for partitioned table queries.

> What's the goal of the memory reduction work? If it's for performance, does this increase performance? Tracking refcounts of Paths isn't free.

This memory reduction work is part of the work to reduce the planner's memory consumption when planning queries involving partitioned tables with a large number (thousands) of partitions [1]. The second table (copied below) in my first email in this thread gives the memory saved by freeing unused path nodes when planning queries involving partitioned tables with a thousand partitions each.

Table 2: Join between partitioned tables with partitionwise join enabled (enable_partitionwise_join = true)
Number of tables | without patch | with patch | % reduction |
being joined     |               |            |             |
----------------------------------------------------------------
       2         |   40.3 MiB    |  40.0 MiB  |    0.70%    |
       3         |  146.9 MiB    | 143.1 MiB  |    2.55%    |
       4         |  445.4 MiB    | 430.4 MiB  |    3.38%    |
       5         | 1563.3 MiB    | 1517.6 MiB |    2.92%    |

Percentage-wise the memory savings are not great, but because of the sheer amount of memory used, the savings are in MBs. Also, I don't think I managed to free all the unused paths, for the reasons listed in my original mail. But I was doubtful whether the work is worth it, so I halted it. I see you think it's not worth it, so that's one vote against.

> Why do your refcounts start at 1? This seems weird:
>
> + /* Free the path if none references it. */
> + if (path->ref_count == 1)

Refcounts start from 0. This code is from free_unused_paths_from_rel(), which frees paths in rel->pathlist. At this stage a path is referenced from rel->pathlist, hence the count will be >= 1, and we should not free paths with refcount > 1, i.e. those referenced elsewhere.

> Does ref_count really need to be an int? There's a 16-bit hole you could use between param_info and parallel_aware. I doubt you'd need 32 bits of space for this.

A 16-bit field allows the refcount to go up to 65535 (or twice that), which is somewhere between 18C9 and 19C9. If a (the cheapest) path for a given relation in a 19-relation query is referenced in all the joins that that relation is part of, this refcount will be exhausted. If we aren't dealing with queries involving that many relations already, we will be soon. Maybe we should add code to protect against overflows.

> I agree that it would be nice to get rid of the "if (!IsA(old_path, IndexPath))" hack in add_path() and it seems like this idea could allow that.
> However, I think if we were to have something like this then you'd want all the logic to be neatly contained inside pathnode.c without adding any burden in other areas to track or check Path refcounts.

The code is in the following files:

src/backend/optimizer/path/allpaths.c: the definitions of free_unused_paths_from_rel() and free_unused_paths() can be moved to pathnode.c.

src/backend/optimizer/util/pathnode.c, src/include/nodes/pathnodes.h, src/include/optimizer/pathnode.h: these three together I consider as pathnode.c, but let me know if you feel otherwise.

src/backend/optimizer/path/joinpath.c: this is the only place other than pathnode.c where we use link_path(). That's required to stop add_path() from freeing a materialized path that is tried multiple times. We can't add_path() that path to the rel since it would get kicked out by the path it is materialising. But I think it's still path creation code, so it should be ok. Let me know if you feel otherwise.

There may be other places, but I can find those out and consider them on a case-by-case basis.

> I imagine the patch would be neater if you added some kind of Path traversal function akin to expression_tree_walker() that allows you to visit each subpath recursively and run a callback function on each.

Yes, agreed.

I am fine to work on this further if the community thinks it is acceptable. It seems from your point of view the worth of this work is as follows:
a. memory savings - not worth it
b. getting rid of !IsA(old_path, IndexPath) - worth it
c. problem of the same path being added to multiple relations - not worth it since you have another solution

I can't judge whether that means it's acceptable or not.

[1] https://www.postgresql.org/message-id/CAExHW5stmOUobE55pMt83r8UxvfCph+Pvo5dNpdrVCsBgXEzDQ@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
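As an illustration of the rel->pathlist cleanup being discussed, here is a sketch of what free_unused_paths_from_rel() could look like. The function name is from the patch, but this body is an assumption, and it relies on free_path() from the earlier sketch; while a path sits in the pathlist it holds exactly one reference from the list itself, so a count of 1 means only the pathlist knows about it.

#include "postgres.h"

#include "nodes/pathnodes.h"
#include "nodes/pg_list.h"

/* free_path() as in the earlier sketch */
extern void free_path(Path *path);

/* Drop the paths in rel->pathlist that nothing else uses. */
static void
free_unused_paths_from_rel(RelOptInfo *rel)
{
	List	   *keep = NIL;
	ListCell   *lc;

	foreach(lc, rel->pathlist)
	{
		Path	   *path = (Path *) lfirst(lc);

		if (path->ref_count == 1)
		{
			/* Only the pathlist references it; release that reference. */
			path->ref_count = 0;
			free_path(path);
		}
		else
		{
			/* Also referenced by cheapest_*_path or another path; keep it. */
			keep = lappend(keep, path);
		}
	}

	list_free(rel->pathlist);
	rel->pathlist = keep;
}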
On Fri, 1 Dec 2023 at 19:59, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> I am fine to work on this further if the community thinks it is acceptable. It seems from your point of view the worth of this work is as follows:
> a. memory savings - not worth it
> b. getting rid of !IsA(old_path, IndexPath) - worth it
> c. problem of the same path being added to multiple relations - not worth it since you have another solution
>
> I can't judge whether that means it's acceptable or not.

I think it's worth trying to get this working and we can run some performance tests to see if it's cheap enough to use.

I've not had enough time to fully process your patches, but on a quick look, I don't quite understand why link_path() does not have to recursively increment ref_counts in each of its subpaths. It seems you're doing the opposite in free_path(), so it seems like this would break for cases like a SortPath on say, a LimitPath over a SeqScan. When you call create_sort_path you'll only increment the refcount on the LimitPath. The SeqScan path then has a refcount 1 lower than it should. If something calls unlink_path on the SeqScan path that could put the refcount to 0 and break the SortPath by freeing a Path indirectly referenced by the sort.

Recursively incrementing the refcounts would slow the patch down a bit, so we'd need to test the performance of this. I think at the rate we create and free join paths in complex join problems we could end up bumping refcounts quite often.

Another way to fix the pfree issues was to add infrastructure to shallow copy paths. We could shallow copy paths whenever we borrow a Path from another RelOptInfo and then just reset the parent to the new RelOptInfo. That unfortunately makes your memory issue a bit worse. We discussed this a bit in [1]. It also seems there was some discussion about wrapping a Path up in a ProjectionPath in there. That unfortunately also takes us in the opposite direction in terms of your memory reduction goals.

Maybe we can try to move forward with your refcount idea and see how the performance looks. If that's intolerable then that might help us decide on the next best alternative solution.

David

[1] https://postgr.es/m/CAApHDvo8tiB5xbJa94f4Mo8Of2fPJ2zG+zQQQTGr-ThjSsymQQ@mail.gmail.com
On Thu, Dec 7, 2023 at 6:19 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Fri, 1 Dec 2023 at 19:59, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> > I am fine to work on this further if the community thinks it is acceptable. It seems from your point of view the worth of this work is as follows:
> > a. memory savings - not worth it
> > b. getting rid of !IsA(old_path, IndexPath) - worth it
> > c. problem of the same path being added to multiple relations - not worth it since you have another solution
> >
> > I can't judge whether that means it's acceptable or not.
>
> I think it's worth trying to get this working and we can run some performance tests to see if it's cheap enough to use.
>
> I've not had enough time to fully process your patches, but on a quick look, I don't quite understand why link_path() does not have to recursively increment ref_counts in each of its subpaths. It seems you're doing the opposite in free_path(), so it seems like this would break for cases like a SortPath on say, a LimitPath over a SeqScan. When you call create_sort_path you'll only increment the refcount on the LimitPath. The SeqScan path then has a refcount 1 lower than it should. If something calls unlink_path on the SeqScan path that could put the refcount to 0 and break the SortPath by freeing a Path indirectly referenced by the sort.

Let me explain how linking and unlinking work. You may skip over this if you are already comfortable with the logic.

Refcounts count how many other paths or objects are referencing a given path. E.g. say we have three path chains as follows:
1. joinpath_1->joinpath_2->seqpath_1
2. joinpath_3->joinpath_4->seqpath_1
3. joinpath_5->joinpath_2->seqpath_1

Please note that this is not the full path tree/forest; it is difficult to describe the whole path forest in text format, but this should be sufficient to get the gist of how linking and unlinking work.

Let's say all these paths are listed in the pathlists of the respective RelOptInfos, and assume that joinpath_1, joinpath_3 and joinpath_5 are all part of the topmost RelOptInfo. Then the refcounts will be as follows:
joinpath_1, joinpath_3 and joinpath_5 - 1 each (RelOptInfo)
joinpath_2 - 3 (joinpath_1, joinpath_5 and RelOptInfo)
joinpath_4 - 2 (joinpath_3, RelOptInfo)
seqpath_1 - 3 (joinpath_2, joinpath_4, RelOptInfo)

If joinpath_3 is finally chosen to be converted into a plan, joinpath_1 and joinpath_5 will be unlinked, their refcounts drop to 0 and they will be freed. Unlinking and freeing them reduces the refcount of joinpath_2 to 1. When we cleanse the pathlist of the corresponding RelOptInfo, joinpath_2's refcount will drop to 0 and it will be freed. Freeing joinpath_2 will reduce the refcount of seqpath_1 to 2 (joinpath_4 and RelOptInfo). joinpath_3 and joinpath_4 will have their refcounts unchanged. This leaves joinpath_3, joinpath_4 and seqpath_1 to be converted to plans; the rest of the paths will be freed.

If we incremented the refcount all the way down the path forest, the counts wouldn't really be refcounts anymore, since they would count indirect references as well, and unlink_path() would also have to traverse the entire tree resetting refcounts. Both of those are unnecessary. Please note that it's only free_path() which calls unlink_path() on the referenced paths, not unlink_path() itself.

In your example, the path chain looks like this:

sort_path->limit_path->seqpath

with refcounts 1, 1 and 2 respectively. I am assuming that seqpath is referenced by the pathlist of its RelOptInfo and by limitpath.
sortpath is referenced by its RelOptInfo or is chosen as the final path. limitpath is not referenced in a RelOptInfo but is referenced by sortpath. Now the only things that can call unlink_path() on seqpath are free_path() on limitpath OR the pathlist cleaning of its RelOptInfo. In both cases the refcount will correctly report the number of objects referencing it, so I don't see a problem here. If you can provide me a query that does this, I will test it.

> Recursively incrementing the refcounts would slow the patch down a bit, so we'd need to test the performance of this. I think at the rate we create and free join paths in complex join problems we could end up bumping refcounts quite often.

This won't be necessary, as explained above.

> Another way to fix the pfree issues was to add infrastructure to shallow copy paths. We could shallow copy paths whenever we borrow a Path from another RelOptInfo and then just reset the parent to the new RelOptInfo. That unfortunately makes your memory issue a bit worse. We discussed this a bit in [1]. It also seems there was some discussion about wrapping a Path up in a ProjectionPath in there. That unfortunately also takes us in the opposite direction in terms of your memory reduction goals.

Yeah.

> Maybe we can try to move forward with your refcount idea and see how the performance looks. If that's intolerable then that might help us decide on the next best alternative solution.

I will report performance numbers (planning time) with my current patchset, which has the limitations listed in [1]. If performance gets worse, we will abandon this idea. If not, I will work on completing the patch. Does that work for you?

[1] https://www.postgresql.org/message-id/CAExHW5stmOUobE55pMt83r8UxvfCph+Pvo5dNpdrVCsBgXEzDQ@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
On Fri, 8 Dec 2023 at 18:02, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> given path. E.g. say we have three path chains as follows:
> 1. joinpath_1->joinpath_2->seqpath_1
> 2. joinpath_3->joinpath_4->seqpath_1
> 3. joinpath_5->joinpath_2->seqpath_1
>
> Please note that this is not the full path tree/forest; it is difficult to describe the whole path forest in text format, but this should be sufficient to get the gist of how linking and unlinking work.
>
> Let's say all these paths are listed in the pathlists of the respective RelOptInfos, and assume that joinpath_1, joinpath_3 and joinpath_5 are all part of the topmost RelOptInfo. Then the refcounts will be as follows:
> joinpath_1, joinpath_3 and joinpath_5 - 1 each (RelOptInfo)
> joinpath_2 - 3 (joinpath_1, joinpath_5 and RelOptInfo)
> joinpath_4 - 2 (joinpath_3, RelOptInfo)
> seqpath_1 - 3 (joinpath_2, joinpath_4, RelOptInfo)

I think this likely works fine providing you always unlink paths from the root of the path tree. When you start unlinking from non-root paths I think you could start to see problems unless you increment refcounts recursively.

The problem I had when working on improving the union planner was when building the final paths for a union child.

Let's say I have the query: SELECT a,b FROM t1 UNION SELECT x,y FROM t2;

When planning the first union child, I'd like to provide the union planner with a sorted path and also the cheapest scan path on t1 to allow it to Hash Aggregate on the cheapest path.

Let's say the planner produces the following paths on t1:

SeqScan_t1

When I create the final union child paths I want to try and produce a few sorted paths to allow MergeAppend to work. I'll loop over each path in t1's pathlist. SeqScan_t1 isn't sorted, so I'll make Sort1->SeqScan_t1 and add_path that to final_rel.

Now, I also want to allow cheap Hash Aggregates to implement the UNION, so I'll include SeqScan_t1 without sorting it. If I now do add_path(final_rel, SeqScan_t1), add_path() could see that the SeqScan cost is fuzzily the same as the Sort->SeqScan_t1 and since the sort version has better PathKeys, add_path will unlink SeqScan_t1.

Now, I don't think your scheme for refcounting paths is broken by this because you'll increment the refcount of the SeqScan_t1 when the Sort uses it as its subpath, but if instead of a Sort->SeqScan that was IncrementalSort->Sort->SeqScan, then suddenly you're not incrementing the refcount for the SeqScan when you create the Incremental Sort due to the Sort being between the two.

So, if I now do add_path(final_rel, Sort2->Incremental_Sort->SeqScan_t1):

Sort1 is rejected due to cost and we unlink it. Sort1's refcount gets to 0 and we free the path, and SeqScan_t1's refcount goes down by 1 but does not reach 0.

We then add_path(final_rel, SeqScan_t1), but this path is fuzzily the same cost as Sort2, so we reject SeqScan_t1 due to Sort2 having better pathkeys and we unlink SeqScan_t1. The refcount on SeqScan_t1 gets to 0 and we free the path. We need SeqScan_t1 for the incremental sort.

Perhaps in reality here, since SeqScan_t1 exists in the base rel's pathlist we'd still have a refcount from that, but I assume at some point you want to start freeing up paths from RelOptInfos that we're done with, in which case SeqScan_t1's refcount would reach 0 at that point and we'd crash during create_plan of the Sort2->Incremental_Sort->SeqScan_t1 plan.

Have I misunderstood something here?
As far as I see it with your non-recursive refcounts, it's only safe to free from the root of a path tree, and it isn't safe when unlinking paths that are the subpath of another path unless the unlinking is done from the root.

David
On Fri, Dec 8, 2023 at 1:02 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Fri, 8 Dec 2023 at 18:02, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> > given path. E.g. say we have three path chains as follows:
> > 1. joinpath_1->joinpath_2->seqpath_1
> > 2. joinpath_3->joinpath_4->seqpath_1
> > 3. joinpath_5->joinpath_2->seqpath_1
> >
> > Please note that this is not the full path tree/forest; it is difficult to describe the whole path forest in text format, but this should be sufficient to get the gist of how linking and unlinking work.
> >
> > Let's say all these paths are listed in the pathlists of the respective RelOptInfos, and assume that joinpath_1, joinpath_3 and joinpath_5 are all part of the topmost RelOptInfo. Then the refcounts will be as follows:
> > joinpath_1, joinpath_3 and joinpath_5 - 1 each (RelOptInfo)
> > joinpath_2 - 3 (joinpath_1, joinpath_5 and RelOptInfo)
> > joinpath_4 - 2 (joinpath_3, RelOptInfo)
> > seqpath_1 - 3 (joinpath_2, joinpath_4, RelOptInfo)
>
> I think this likely works fine providing you always unlink paths from the root of the path tree. When you start unlinking from non-root paths I think you could start to see problems unless you increment refcounts recursively.
>
> The problem I had when working on improving the union planner was when building the final paths for a union child.
>
> Let's say I have the query: SELECT a,b FROM t1 UNION SELECT x,y FROM t2;
>
> When planning the first union child, I'd like to provide the union planner with a sorted path and also the cheapest scan path on t1 to allow it to Hash Aggregate on the cheapest path.
>
> Let's say the planner produces the following paths on t1:
>
> SeqScan_t1
>
> When I create the final union child paths I want to try and produce a few sorted paths to allow MergeAppend to work. I'll loop over each path in t1's pathlist. SeqScan_t1 isn't sorted, so I'll make Sort1->SeqScan_t1 and add_path that to final_rel.
>
> Now, I also want to allow cheap Hash Aggregates to implement the UNION, so I'll include SeqScan_t1 without sorting it. If I now do add_path(final_rel, SeqScan_t1), add_path() could see that the SeqScan cost is fuzzily the same as the Sort->SeqScan_t1 and since the sort version has better PathKeys, add_path will unlink SeqScan_t1.
>
> Now, I don't think your scheme for refcounting paths is broken by this because you'll increment the refcount of the SeqScan_t1 when the Sort uses it as its subpath, but if instead of a Sort->SeqScan that was IncrementalSort->Sort->SeqScan, then suddenly you're not incrementing the refcount for the SeqScan when you create the Incremental Sort due to the Sort being between the two.
>
> So, if I now do add_path(final_rel, Sort2->Incremental_Sort->SeqScan_t1):
>
> Sort1 is rejected due to cost and we unlink it. Sort1's refcount gets to 0 and we free the path, and SeqScan_t1's refcount goes down by 1 but does not reach 0.

What is Sort1? Sort1->Seqscan_t1 OR incremental_sort->Sort->seqscan_t1? I am getting lost here. But I think you have misunderstood the code, so this question is irrelevant. See next.

> We then add_path(final_rel, SeqScan_t1), but this path is fuzzily the same cost as Sort2, so we reject SeqScan_t1 due to Sort2 having better pathkeys and we unlink SeqScan_t1. The refcount on SeqScan_t1 gets to 0 and we free the path. We need SeqScan_t1 for the incremental sort.

The path being presented to add_path() is never passed to unlink_path().
If it's suboptimal compared to an existing path, it is passed to free_path(), which in its current form will throw an error. But that can be fixed: free_path() can just ignore a path with refcount > 0, if we want to present the same path to add_path() multiple times.

> Perhaps in reality here, since SeqScan_t1 exists in the base rel's pathlist we'd still have a refcount from that, but I assume at some point you want to start freeing up paths from RelOptInfos that we're done with, in which case SeqScan_t1's refcount would reach 0 at that point and we'd crash during create_plan of the Sort2->Incremental_Sort->SeqScan_t1 plan.
>
> Have I misunderstood something here? As far as I see it with your non-recursive refcounts, it's only safe to free from the root of a path tree, and it isn't safe when unlinking paths that are the subpath of another path unless the unlinking is done from the root.

As long as we call link_path() every time we reference a path, we could start freeing paths from anywhere. The reference count takes care of not freeing referenced paths. Of course, I will need to fix free_path() as above.

--
Best Wishes,
Ashutosh Bapat
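A minimal sketch of that adjustment, building on the earlier sketch (this is an assumption based on the description above, not the patch code):

/*
 * Adjusted free_path(): a path that something still references is simply
 * left alone instead of raising an error, so a path rejected by add_path()
 * but linked from elsewhere survives.
 */
static void
free_path(Path *path)
{
	if (path->ref_count > 0)
		return;					/* still referenced somewhere; keep it */

	/* ... unlink subpaths and pfree(path) as in the earlier sketch ... */
}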
On Thu, Dec 7, 2023 at 6:19 PM David Rowley <dgrowleyml@gmail.com> wrote:
> Maybe we can try to move forward with your refcount idea and see how the performance looks. If that's intolerable then that might help us decide on the next best alternative solution.

Here are performance numbers.

Setup:

create table t1 (a integer primary key, b integer);
create table t1_parted (a integer primary key, b integer) partition by range(a);
create 1000 partitions of t1_parted

Queries (five-way self-joins):

select * from t1 a, t1 b, t1 c, t1 d, t1 e where a.a = b.a and b.a = c.a and c.a = d.a and d.a = e.a; -- unpartitioned table case
select * from t1_parted a, t1_parted b, t1_parted c, t1_parted d, t1_parted e where a.a = b.a and b.a = c.a and c.a = d.a and d.a = e.a; -- partitioned table case

The numbers are with the patches attached to [1], which have the limitations listed in that email. I ran each query 10 times through EXPLAIN (SUMMARY) and averaged the planning time with and without the patch.

unpartitioned case
without patch: 0.25
with patch: 0.19
This improvement is probably misleading; the planning time for this query changes a lot between runs.

partitioned case (without partitionwise join)
without patch: 14580.65
with patch: 14625.12
% degradation: 0.3%

partitioned case (with partitionwise join)
without patch: 23273.69
with patch: 23328.52
% degradation: 0.2%

That looks pretty small considering the benefits. What do you think?

[1] https://www.postgresql.org/message-id/CAExHW5stmOUobE55pMt83r8UxvfCph+Pvo5dNpdrVCsBgXEzDQ@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
Forgot to mention,

On Thu, Dec 14, 2023 at 5:34 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>
> On Thu, Dec 7, 2023 at 6:19 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > Maybe we can try to move forward with your refcount idea and see how the performance looks. If that's intolerable then that might help us decide on the next best alternative solution.
>
> Here are performance numbers.
>
> Setup:
>
> create table t1 (a integer primary key, b integer);
> create table t1_parted (a integer primary key, b integer) partition by range(a);
> create 1000 partitions of t1_parted
>
> Queries (five-way self-joins):
>
> select * from t1 a, t1 b, t1 c, t1 d, t1 e where a.a = b.a and b.a = c.a and c.a = d.a and d.a = e.a; -- unpartitioned table case
> select * from t1_parted a, t1_parted b, t1_parted c, t1_parted d, t1_parted e where a.a = b.a and b.a = c.a and c.a = d.a and d.a = e.a; -- partitioned table case
>
> The numbers are with the patches attached to [1], which have the limitations listed in that email. I ran each query 10 times through EXPLAIN (SUMMARY) and averaged the planning time with and without the patch.
>
> unpartitioned case
> without patch: 0.25
> with patch: 0.19
> This improvement is probably misleading; the planning time for this query changes a lot between runs.
>
> partitioned case (without partitionwise join)
> without patch: 14580.65
> with patch: 14625.12
> % degradation: 0.3%
>
> partitioned case (with partitionwise join)
> without patch: 23273.69
> with patch: 23328.52
> % degradation: 0.2%
>
> That looks pretty small considering the benefits. What do you think?
>
> [1] https://www.postgresql.org/message-id/CAExHW5stmOUobE55pMt83r8UxvfCph+Pvo5dNpdrVCsBgXEzDQ@mail.gmail.com

If you want to experiment, please use the attached patches. There's a fix for a segfault during initdb in them. The patches are still raw.

--
Best Wishes,
Ashutosh Bapat
On Fri, Dec 15, 2023 at 5:22 AM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> > That looks pretty small considering the benefits. What do you think?
> >
> > [1] https://www.postgresql.org/message-id/CAExHW5stmOUobE55pMt83r8UxvfCph+Pvo5dNpdrVCsBgXEzDQ@mail.gmail.com
>
> If you want to experiment, please use the attached patches. There's a fix for a segfault during initdb in them. The patches are still raw.

The first patch is no longer required. Here's the rebased set.

The patches are raw. make check has some crashes that I need to fix. I am waiting to hear whether this is useful and whether the design is on the right track.

--
Best Wishes,
Ashutosh Bapat
On 6/2/2024 19:51, Ashutosh Bapat wrote:
> On Fri, Dec 15, 2023 at 5:22 AM Ashutosh Bapat
> The patches are raw. make check has some crashes that I need to fix. I am waiting to hear whether this is useful and whether the design is on the right track.

Let me write a few words of opinion on this feature.

I generally like the idea of a refcount field. We had problems generating alternative paths in extensions and designing the Asymmetric Join feature [1], where we proposed an alternative path one level below and called the add_path() routine: we lost the path in the pathlist referenced from the upper RelOptInfo. This approach would allow handier solutions than hacking around with copying and restoring the pathlist. But I'm not sure about freeing unreferenced paths; I would want to still see alternatives in the pathlist.

About partitioning: as I discovered while looking into planning issues connected to partitions, the painful problem is the rule according to which we try the whole range of possible paths for each partition. With indexes, this quickly increases the optimization work. IMO, a 'symmetrical' approach could help here: it could restrict the scope of possible paths for upcoming partitions if we filter some paths in a set of previously planned partitions.

Also, I am glad to see a positive opinion about the path_walker() routine. Somewhere else, for example in [2], it seems we need it too.

[1] https://www.postgresql.org/message-id/flat/CAOP8fzaVL_2SCJayLL9kj5pCA46PJOXXjuei6-3aFUV45j4LJQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/CAMbWs496%2BN%3DUAjOc%3DrcD3P7B6oJe4rZw08e_TZRUsWbPxZW3Tw%40mail.gmail.com

--
regards,
Andrei Lepikhov
Postgres Professional
On Thu, Feb 15, 2024 at 9:41 AM Andrei Lepikhov <a.lepikhov@postgrespro.ru> wrote:
>
> On 6/2/2024 19:51, Ashutosh Bapat wrote:
> > The patches are raw. make check has some crashes that I need to fix. I am waiting to hear whether this is useful and whether the design is on the right track.
>
> Let me write a few words of opinion on this feature.
>
> I generally like the idea of a refcount field. We had problems generating alternative paths in extensions and designing the Asymmetric Join feature [1], where we proposed an alternative path one level below and called the add_path() routine: we lost the path in the pathlist referenced from the upper RelOptInfo. This approach would allow handier solutions than hacking around with copying and restoring the pathlist.

Thanks for your comments. I am glad that you find this useful.

> But I'm not sure about freeing unreferenced paths; I would want to still see alternatives in the pathlist.

I didn't understand this. Can you please elaborate? A path in any pathlist is referenced; an unreferenced path should not be in any pathlist.

> About partitioning: as I discovered while looking into planning issues connected to partitions, the painful problem is the rule according to which we try the whole range of possible paths for each partition. With indexes, this quickly increases the optimization work. IMO, a 'symmetrical' approach could help here: it could restrict the scope of possible paths for upcoming partitions if we filter some paths in a set of previously planned partitions.

Filter or free?

> Also, I am glad to see a positive opinion about the path_walker() routine. Somewhere else, for example in [2], it seems we need it too.

Yes, I agree that we should introduce path_walker(). I am waiting for David's opinion about the performance numbers shared in [1] before working further on the patches and bringing them out of WIP state.

[1] postgr.es/m/CAExHW5uDkMQL8SicV3_=AYcsWwMTNR8GzJo310J7sv7TMLoL6Q@mail.gmail.com

--
Best Wishes,
Ashutosh Bapat
On 15/2/2024 19:06, Ashutosh Bapat wrote:
> On Thu, Feb 15, 2024 at 9:41 AM Andrei Lepikhov
>> But I'm not sure about freeing unreferenced paths; I would want to still see alternatives in the pathlist.
>
> I didn't understand this. Can you please elaborate? A path in any pathlist is referenced; an unreferenced path should not be in any pathlist.

I mean that at some point, an extension can reconsider the path tree after building the top node of this path. I vaguely recall that we already have (or had) a kind of such optimization in the core, where part of the plan changes after it has been built.

Live example: right now, I am working on code like MSSQL has - a combination of NestLoop and HashJoin paths with switching between them in real time. It requires both paths to be in the pathlist at the moment the extension's hook is called. Even if one of them isn't referenced from the upper pathlist, it may still be helpful for the extension.

>> About partitioning: as I discovered while looking into planning issues connected to partitions, the painful problem is the rule according to which we try the whole range of possible paths for each partition. With indexes, this quickly increases the optimization work. IMO, a 'symmetrical' approach could help here: it could restrict the scope of possible paths for upcoming partitions if we filter some paths in a set of previously planned partitions.
>
> Filter or free?

Filter. I meant that Postgres tries to apply IndexScan, BitmapScan, IndexOnlyScan and other strategies, going through all the partition indexes, and the optimizer spends a lot of time doing that. So, why not introduce a symmetrical strategy and exclude from the search some scan types and indexes based on the experience of pathifying previous partitions of the same table: if you have dozens of partitions, is it beneficial for the system to find a slightly more optimal IndexScan on one partition while having SeqScans on the 999 others?

--
regards,
Andrei Lepikhov
Postgres Professional
On Fri, Feb 16, 2024 at 8:42 AM Andrei Lepikhov <a.lepikhov@postgrespro.ru> wrote:
> Live example: right now, I am working on code like MSSQL has - a combination of NestLoop and HashJoin paths with switching between them in real time. It requires both paths to be in the pathlist at the moment the extension's hook is called. Even if one of them isn't referenced from the upper pathlist, it may still be helpful for the extension.

There is no guarantee that every path presented to add_path() will be preserved. Suboptimal paths are freed as and when add_path() discovers that they are suboptimal. So I don't think an extension can rely on the existence of a path. But having a refcount makes it easy to preserve the required paths by referencing them.

> >> About partitioning: as I discovered while looking into planning issues connected to partitions, the painful problem is the rule according to which we try the whole range of possible paths for each partition. With indexes, this quickly increases the optimization work. IMO, a 'symmetrical' approach could help here: it could restrict the scope of possible paths for upcoming partitions if we filter some paths in a set of previously planned partitions.
> >
> > Filter or free?
>
> Filter. I meant that Postgres tries to apply IndexScan, BitmapScan, IndexOnlyScan and other strategies, going through all the partition indexes, and the optimizer spends a lot of time doing that. So, why not introduce a symmetrical strategy and exclude from the search some scan types and indexes based on the experience of pathifying previous partitions of the same table: if you have dozens of partitions, is it beneficial for the system to find a slightly more optimal IndexScan on one partition while having SeqScans on the 999 others?

IIUC, you are suggesting that instead of planning each partition/partitionwise join, we only create paths with the strategies which were found to be optimal for previous partitions. That's a good heuristic, but it won't work if partition properties - statistics, indexes etc. - differ between groups of partitions.

--
Best Wishes,
Ashutosh Bapat
On 19/2/2024 19:25, Ashutosh Bapat wrote:
> On Fri, Feb 16, 2024 at 8:42 AM Andrei Lepikhov
> <a.lepikhov@postgrespro.ru> wrote:
>> Live example: right now, I am working on code like MSSQL has - a combination of NestLoop and HashJoin paths with switching between them in real time. It requires both paths to be in the pathlist at the moment the extension's hook is called. Even if one of them isn't referenced from the upper pathlist, it may still be helpful for the extension.
>
> There is no guarantee that every path presented to add_path() will be preserved. Suboptimal paths are freed as and when add_path() discovers that they are suboptimal. So I don't think an extension can rely on the existence of a path. But having a refcount makes it easy to preserve the required paths by referencing them.

I don't insist, just providing my use case. It would be ideal if you would provide some external routines for extensions that allow sticking a path into the pathlist even when it has a terrible cost estimate.

>>>> About partitioning: as I discovered while looking into planning issues connected to partitions, the painful problem is the rule according to which we try the whole range of possible paths for each partition. With indexes, this quickly increases the optimization work. IMO, a 'symmetrical' approach could help here: it could restrict the scope of possible paths for upcoming partitions if we filter some paths in a set of previously planned partitions.
>>>
>>> Filter or free?
>>
>> Filter. I meant that Postgres tries to apply IndexScan, BitmapScan, IndexOnlyScan and other strategies, going through all the partition indexes, and the optimizer spends a lot of time doing that. So, why not introduce a symmetrical strategy and exclude from the search some scan types and indexes based on the experience of pathifying previous partitions of the same table: if you have dozens of partitions, is it beneficial for the system to find a slightly more optimal IndexScan on one partition while having SeqScans on the 999 others?
>
> IIUC, you are suggesting that instead of planning each partition/partitionwise join, we only create paths with the strategies which were found to be optimal for previous partitions. That's a good heuristic, but it won't work if partition properties - statistics, indexes etc. - differ between groups of partitions.

Sure, but the "Symmetry" strategy assumes that at the scale of a thousand partitions, especially with a parallel append involved, it doesn't cause noticeable performance degradation if we find a slightly suboptimal path for a small subset of partitions. Does that make sense? As far as I have seen, when people use 10-100 partitions for a table, they usually strive to keep indexes symmetrical across all partitions.

--
regards,
Andrei Lepikhov
Postgres Professional
On Tue, Feb 20, 2024 at 8:19 AM Andrei Lepikhov <a.lepikhov@postgrespro.ru> wrote:
>
> On 19/2/2024 19:25, Ashutosh Bapat wrote:
> > On Fri, Feb 16, 2024 at 8:42 AM Andrei Lepikhov <a.lepikhov@postgrespro.ru> wrote:
> >> Live example: right now, I am working on code like MSSQL has - a combination of NestLoop and HashJoin paths with switching between them in real time. It requires both paths to be in the pathlist at the moment the extension's hook is called. Even if one of them isn't referenced from the upper pathlist, it may still be helpful for the extension.
> >
> > There is no guarantee that every path presented to add_path() will be preserved. Suboptimal paths are freed as and when add_path() discovers that they are suboptimal. So I don't think an extension can rely on the existence of a path. But having a refcount makes it easy to preserve the required paths by referencing them.
>
> I don't insist, just providing my use case. It would be ideal if you would provide some external routines for extensions that allow sticking a path into the pathlist even when it has a terrible cost estimate.

With refcounts, you can reference a path and store it somewhere other than the pathlist; the path won't be lost until it is dereferenced. RelOptInfo::pathlist is for optimal paths.

> > IIUC, you are suggesting that instead of planning each partition/partitionwise join, we only create paths with the strategies which were found to be optimal for previous partitions. That's a good heuristic, but it won't work if partition properties - statistics, indexes etc. - differ between groups of partitions.
>
> Sure, but the "Symmetry" strategy assumes that at the scale of a thousand partitions, especially with a parallel append involved, it doesn't cause noticeable performance degradation if we find a slightly suboptimal path for a small subset of partitions. Does that make sense? As far as I have seen, when people use 10-100 partitions for a table, they usually strive to keep indexes symmetrical across all partitions.

I agree that we need something like that. In order to do that, we need machinery to prove that all partitions have similar properties. Once that is proved, we can skip creating paths for similar partitions. But that's out of scope for this work and complements it.

--
Best Wishes,
Ashutosh Bapat
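As an illustration of referencing a path from outside the pathlist, here is a sketch of how an extension could pin a path under the proposed scheme. set_rel_pathlist_hook is an existing PostgreSQL hook, but link_path() and its signature are assumptions based on the patch description, and pinned_path/my_set_rel_pathlist are hypothetical names used only for this example.

#include "postgres.h"

#include "fmgr.h"
#include "optimizer/paths.h"

PG_MODULE_MAGIC;

/* link_path() as assumed in the earlier sketches; presumably declared by the patch */
extern void link_path(Path **link, Path *path);

static Path *pinned_path = NULL;
static set_rel_pathlist_hook_type prev_set_rel_pathlist_hook = NULL;

static void
my_set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
					Index rti, RangeTblEntry *rte)
{
	if (prev_set_rel_pathlist_hook)
		prev_set_rel_pathlist_hook(root, rel, rti, rte);

	/*
	 * Pin the first surviving path so later add_path() calls cannot free it.
	 * A real extension would unlink any previously pinned path first.
	 */
	if (rel->pathlist != NIL)
		link_path(&pinned_path, (Path *) linitial(rel->pathlist));
}

void
_PG_init(void)
{
	prev_set_rel_pathlist_hook = set_rel_pathlist_hook;
	set_rel_pathlist_hook = my_set_rel_pathlist;
}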
On 6/2/2024 13:51, Ashutosh Bapat wrote:
> On Fri, Dec 15, 2023 at 5:22 AM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
>>> That looks pretty small considering the benefits. What do you think?
>>>
>>> [1] https://www.postgresql.org/message-id/CAExHW5stmOUobE55pMt83r8UxvfCph+Pvo5dNpdrVCsBgXEzDQ@mail.gmail.com
>>
>> If you want to experiment, please use the attached patches. There's a fix for a segfault during initdb in them. The patches are still raw.
>
> The first patch is no longer required. Here's the rebased set.
>
> The patches are raw. make check has some crashes that I need to fix. I am waiting to hear whether this is useful and whether the design is on the right track.

I think this work is promising, especially in the scope of partitioning. I've analysed how it works by rebasing these patches on the current master. You propose freeing unused paths after the end of the standard path search.

In my view, upper paths generally consume less memory than join and scan paths; this is primarily due to the limited options provided by Postgres so far. At the same time, this technique (while highly useful in general) adds fragility and increases complexity: a developer needs to remember to link the path using the pointer in different places in the code.

So, maybe go the alternative way? Invent a subquery memory context and store all the path allocations there. It can be freed after setrefs finishes planning this subquery, provided the subquery is not pulled up.

--
regards,
Andrei Lepikhov
On Thu, Sep 19, 2024 at 4:18 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
>
> On 6/2/2024 13:51, Ashutosh Bapat wrote:
> > The patches are raw. make check has some crashes that I need to fix. I am waiting to hear whether this is useful and whether the design is on the right track.
>
> I think this work is promising, especially in the scope of partitioning. I've analysed how it works by rebasing these patches on the current master. You propose freeing unused paths after the end of the standard path search.
>
> In my view, upper paths generally consume less memory than join and scan paths; this is primarily due to the limited options provided by Postgres so far. At the same time, this technique (while highly useful in general) adds fragility and increases complexity: a developer needs to remember to link the path using the pointer in different places in the code.
>
> So, maybe go the alternative way? Invent a subquery memory context and store all the path allocations there. It can be freed after setrefs finishes planning this subquery, provided the subquery is not pulled up.

Using a memory context per subquery won't help with partitioning, right? If the outermost query has a partitioned table with thousands of partitions, it will still accumulate those paths till the very end of planning.

We could instead use memory context(s) to store all the paths created, then copy the optimal paths into a new memory context at strategic points and blow up the old memory context, and repeat this till we choose the final path and create a plan out of it; at that point we could blow up the memory context containing the remaining paths as well. That would free the paths as soon as they are rendered useless. I discussed this idea with Alvaro offline. We thought that this approach needs some code to copy paths, and copying paths recursively has some overhead of its own.

The current work of adding a reference count, OTOH, has the potential to bring discipline into the way we handle paths. We need to avoid the risks posed by dangling pointers, for which Alvaro suggested looking at the way we manage snapshots. But I haven't had time to explore that idea yet.

--
Best Wishes,
Ashutosh Bapat
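A rough sketch of the "copy the optimal paths and blow up the old context" alternative described above. Everything here is hypothetical: copy_path_tree() does not exist, prune_path_context() is a made-up name, and the strategic points at which this would be called are not identified; it only illustrates the shape of the idea.

#include "postgres.h"

#include "nodes/pathnodes.h"
#include "utils/memutils.h"

/* Hypothetical helper: deep-copy a path tree into CurrentMemoryContext. */
extern Path *copy_path_tree(Path *path);

/*
 * At a strategic point, copy the paths still worth keeping for 'rel' into a
 * fresh context and destroy the context holding everything else.
 */
static MemoryContext
prune_path_context(MemoryContext old_path_cxt, RelOptInfo *rel)
{
	MemoryContext new_path_cxt;
	MemoryContext oldcxt;

	new_path_cxt = AllocSetContextCreate(CurrentMemoryContext,
										 "surviving planner paths",
										 ALLOCSET_DEFAULT_SIZES);
	oldcxt = MemoryContextSwitchTo(new_path_cxt);

	/* Copy the optimal path; rel->pathlist entries would need the same treatment. */
	rel->cheapest_total_path = copy_path_tree(rel->cheapest_total_path);

	MemoryContextSwitchTo(oldcxt);

	/* Everything left behind, i.e. all the rejected paths, goes away here. */
	MemoryContextDelete(old_path_cxt);

	return new_path_cxt;
}

The cost of the recursive copies is exactly the overhead mentioned above, which is why the refcount approach is being weighed against it.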
On 19/9/2024 13:12, Ashutosh Bapat wrote:
> On Thu, Sep 19, 2024 at 4:18 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
>> At the same time, this technique (while highly useful in general) adds fragility and increases complexity: a developer needs to remember to link the path using the pointer in different places in the code.
>>
>> So, maybe go the alternative way? Invent a subquery memory context and store all the path allocations there. It can be freed after setrefs finishes planning this subquery, provided the subquery is not pulled up.
>
> Using a memory context per subquery won't help with partitioning, right? If the outermost query has a partitioned table with thousands of partitions, it will still accumulate those paths till the very end of planning.

I got it; I just hadn't dealt with huge tables in the outer query before.

> We could instead use memory context(s) to store all the paths created, then copy the optimal paths into a new memory context at strategic points and blow up the old memory context, and repeat this till we choose the final path and create a plan out of it; at that point we could blow up the memory context containing the remaining paths as well. That would free the paths as soon as they are rendered useless.

I think any scalable solution should be based on a per-partition cleanup. For starters, why not adopt Tom's patch [1] for selectivity estimations? We will see the profit in the case of long lists of clauses.

> I discussed this idea with Alvaro offline. We thought that this approach needs some code to copy paths, and copying paths recursively has some overhead of its own.

It needs a path_tree_walker first. We discussed it before, but that didn't go anywhere. Maybe design it beforehand and use it in the re-parameterisation code?

> The current work of adding a reference count, OTOH, has the potential to bring discipline into the way we handle paths. We need to avoid the risks posed by dangling pointers.

Both hands up for having reference counters: in extensions it is always painful to invent an approach for safely removing a path you want to replace with a custom one. I just want to say that it looks too dangerous compared to the value of a possible positive outcome.

> For which Alvaro suggested looking at the way we manage snapshots. But I haven't had time to explore that idea yet.

Unfortunately, I can't understand this idea without an explanation.

[1] Optimize planner memory consumption for huge arrays
https://www.postgresql.org/message-id/1367418.1708816059@sss.pgh.pa.us

--
regards,
Andrei Lepikhov