Thread: [HACKERS] Partitioned tables and relfilenode
The new partitioned tables do not contain any data by themselves. Any data inserted into a partitioned table is routed to and stored in one of its partitions. In fact, it is impossible to insert *any* data before a partition (to be precise, a leaf partition) is created. It seems wasteful then to allocate physical storage (files) for partitioned tables. If we do not allocate the storage, then we must make sure that the right thing happens when a command that is intended to manipulate a table's storage encounters a partitioned table, the "right thing" here being that the command's code either throws an error or warning (in some cases) if the specified table is a partitioned table or ignores any partitioned tables when it reads the list of relations to process from pg_class. Commands that need to be taught about this are vacuum, analyze, truncate, and alter table. Specifically: - In case of vacuum, specifying a partitioned table causes a warning - In case of analyze, we do not throw an error or warning but simply avoid calling do_analyze_rel() *non-recursively*. Further in acquire_inherited_sample_rows(), any partitioned tables in the list returned by find_all_inheritors() are skipped. - In case of truncate, only the part which manipulates table's physical storage is skipped for partitioned tables. - ATRewriteTables() skips on the AlteredTableInfo entries for partitioned tables, because there is nothing to be done. - Since we cannot create indexes on partitioned tables anyway, there is no need to handle cluster and reindex (they throw a meaningful error already due to the lack of indexes.) Patches 0001 and 0002 of the attached implement the above part. 0001 teaches the above mentioned commands to do the "right thing" as described above and 0002 teaches heap_create() and heap_create_with_catalog() to not create any physical storage (none of the forks) for partitioned tables. Then comes 0003, which concerns inheritance planning. In a regular inheritance set (ie, the inheritance set corresponding to an inheritance hierarchy whose root is a regular table), the inheritance parents are included in their role as child members, because they might contribute rows to the final result. So AppendRelInfo's are created for each such table by the planner prep phase, which the later planning steps use to create a scan plan for those tables as the Append's child plans. Currently, the partitioned tables are also processed by the optimizer as inheritance sets. Partitioned table inheritance parents do not own any storage, so we *must* not create scan plans for them. So we do not need to process them as child members of the inheritance set. 0003 teaches expand_inherited_rtentry() to not add partitioned tables as child members. Also, since the root partitioned table RTE is no longer added to the Append list as the 1st child member, inheritance_planner() cannot assume that it can install the 1st child RTE as the nominalRelation of a given ModifyTable node, instead the original root parent table RTE is installed as the nominalRelation. Together the above patches implement the first item listed in "Some notes:" part of an email [1] on the original declarative partitioning thread, which says: "We should try to teach the executor never to scan the parent. That's never necessary with this system, and it might add significant overhead. We should also try to get rid of the idea of the parent having storage (i.e. a relfilenode)." Thoughts, comments? Thanks, Amit [1] https://postgr.es/m/CA%2BTgmob6DJEQBd%3DqH2wZG1onYZ9R1Sji3q%2BGqCRUqQi%2Bug28Rw%40mail.gmail.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On 10 February 2017 at 06:19, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > The new partitioned tables do not contain any data by themselves. Any > data inserted into a partitioned table is routed to and stored in one of > its partitions. In fact, it is impossible to insert *any* data before a > partition (to be precise, a leaf partition) is created. Where is all of this documented? All I see at the moment is old docs. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017/02/10 15:58, Simon Riggs wrote: > On 10 February 2017 at 06:19, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> The new partitioned tables do not contain any data by themselves. Any >> data inserted into a partitioned table is routed to and stored in one of >> its partitions. In fact, it is impossible to insert *any* data before a >> partition (to be precise, a leaf partition) is created. > > Where is all of this documented? All I see at the moment is old docs. Last week I sent documentation patches in a separate message titled "Documentation improvements for partitioning" [1]. One of the patches adds a separate section in the DDL chapter describing partitioned tables. Thanks, Amit [1] https://www.postgresql.org/message-id/bf9ebbb3-fb1e-3206-b67c-e7a803a747d9%40lab.ntt.co.jp
On Fri, Feb 10, 2017 at 3:19 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > The new partitioned tables do not contain any data by themselves. Any > data inserted into a partitioned table is routed to and stored in one of > its partitions. In fact, it is impossible to insert *any* data before a > partition (to be precise, a leaf partition) is created. It seems wasteful > then to allocate physical storage (files) for partitioned tables. If we > do not allocate the storage, then we must make sure that the right thing > happens when a command that is intended to manipulate a table's storage > encounters a partitioned table, the "right thing" here being that the > command's code either throws an error or warning (in some cases) if the > specified table is a partitioned table or ignores any partitioned tables > when it reads the list of relations to process from pg_class. Commands > that need to be taught about this are vacuum, analyze, truncate, and alter > table. Specifically: Thanks. I have been looking a bit at this set of patches. > - In case of vacuum, specifying a partitioned table causes a warning > - In case of analyze, we do not throw an error or warning but simply > avoid calling do_analyze_rel() *non-recursively*. Further in > acquire_inherited_sample_rows(), any partitioned tables in the list > returned by find_all_inheritors() are skipped. > - In case of truncate, only the part which manipulates table's physical > storage is skipped for partitioned tables. I am wondering if instead those should not just be errors saying that such operations are just not support. This could be relaxed in the future to mean that a vacuum, truncate or analyze on the partitioned table triggers an operator in cascade. > - ATRewriteTables() skips on the AlteredTableInfo entries for partitioned > tables, because there is nothing to be done. Perhaps that makes sense, foreign tables do that. > - Since we cannot create indexes on partitioned tables anyway, there is > no need to handle cluster and reindex (they throw a meaningful error > already due to the lack of indexes.) Yep. > Patches 0001 and 0002 of the attached implement the above part. 0001 > teaches the above mentioned commands to do the "right thing" as described > above and 0002 teaches heap_create() and heap_create_with_catalog() to not > create any physical storage (none of the forks) for partitioned tables. - else + /* + * Although we cannot analyze partitioned tables themselves, we are + * still be able to do the recursive ANALYZE. + */ + else if (onerel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE) { /* No need for a WARNING if we already complainedduring VACUUM */ if (!(options & VACOPT_VACUUM)) It seems to me that it is better style to use an empty else if with only the comment. Comments related to a condition that are outside this condition should be conditional in their formulations. Comments within a condition can be affirmations when they refer to a decided state of things. From patch 2: @@ -1351,7 +1352,12 @@ heap_create_with_catalog(const char *relname, relkind == RELKIND_TOASTVALUE || relkind == RELKIND_PARTITIONED_TABLE); - heap_create_init_fork(new_rel_desc); + /* + * We do not want to create any storage objects for a partitioned + * table. + */ + if (relkind != RELKIND_PARTITIONED_TABLE) + heap_create_init_fork(new_rel_desc); } This does not make much sense with an assertion telling exactly the contrary a couple of lines before. I think that it would make more sense to move the assertion on relkind directly in heap_create_init_fork(). > Then comes 0003, which concerns inheritance planning. In a regular > inheritance set (ie, the inheritance set corresponding to an inheritance > hierarchy whose root is a regular table), the inheritance parents are > included in their role as child members, because they might contribute > rows to the final result. So AppendRelInfo's are created for each such > table by the planner prep phase, which the later planning steps use to > create a scan plan for those tables as the Append's child plans. > Currently, the partitioned tables are also processed by the optimizer as > inheritance sets. Partitioned table inheritance parents do not own any > storage, so we *must* not create scan plans for them. So we do not need > to process them as child members of the inheritance set. 0003 teaches > expand_inherited_rtentry() to not add partitioned tables as child members. > Also, since the root partitioned table RTE is no longer added to the > Append list as the 1st child member, inheritance_planner() cannot assume > that it can install the 1st child RTE as the nominalRelation of a given > ModifyTable node, instead the original root parent table RTE is installed > as the nominalRelation. This is a collection of checks on relkind == RELKIND_PARTITIONED_TABLE to avoid interactions with partition tables. Did you consider doing something in the executor instead? -- Michael
On Wed, Feb 15, 2017 at 2:14 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > This is a collection of checks on relkind == RELKIND_PARTITIONED_TABLE > to avoid interactions with partition tables. Did you consider doing > something in the executor instead? That seems inferior, because the planner really ought to know that the partitioning root doesn't need to be scanned. That way, it can do a better job getting the cost and selectivity estimates right. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 10 February 2017 at 06:19, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > the "right thing" here being that the > command's code either throws an error or warning (in some cases) if the > specified table is a partitioned table or ignores any partitioned tables > when it reads the list of relations to process from pg_class. This is a massive assumption and deserves major discussion. My expectation is that "partitioned tables" are "tables". Anything else seems to fly in the face of both the SQL Standard and the POLA principle for users coming from other database systems. IMHO all the main actions should all "just work" not throw errors. > Commands > that need to be taught about this are vacuum, analyze, truncate, and alter > table. Specifically: > > - In case of vacuum, specifying a partitioned table causes a warning Why not vacuum all partitions? > - In case of analyze, we do not throw an error or warning but simply > avoid calling do_analyze_rel() *non-recursively*. Further in > acquire_inherited_sample_rows(), any partitioned tables in the list > returned by find_all_inheritors() are skipped. Why not analyze all partitions? > - In case of truncate, only the part which manipulates table's physical > storage is skipped for partitioned tables. Truncate all partitions > - ATRewriteTables() skips on the AlteredTableInfo entries for partitioned > tables, because there is nothing to be done. > > - Since we cannot create indexes on partitioned tables anyway, there is > no need to handle cluster and reindex (they throw a meaningful error > already due to the lack of indexes.) Create index on all partitions > Thoughts, comments? We already have Inheritance. Anybody that likes that behaviour can use it. Most people don't like that behaviour and wish to see change. They want the same behaviour as they get from other database products where a partitioned table is a table and you can do stuff to it just like other tables. We have the opportunity to change things here and we should do so. (It also seems like wasted effort to try to remove the overhead caused by a parent table for partitioning. Why introduce a special case to save a few bytes? Premature optimization, surely?) -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Feb 16, 2017 at 6:32 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > Why not vacuum all partitions? > Why not analyze all partitions? > Truncate all partitions I agree. But, we need to be careful that a database-wide VACUUM or ANALYZE doesn't hit the partitions multiple times, once for the parent and again for each child. Actually, a database-wide VACUUM should hit each partition individually and do nothing for the parents, but a database-wide ANALYZE should process the parents and do nothing for the children, so that the inheritance statistics get updated. >> - ATRewriteTables() skips on the AlteredTableInfo entries for partitioned >> tables, because there is nothing to be done. >> >> - Since we cannot create indexes on partitioned tables anyway, there is >> no need to handle cluster and reindex (they throw a meaningful error >> already due to the lack of indexes.) > > Create index on all partitions That one's more complicated, per what I wrote in https://www.postgresql.org/message-id/CA+TgmoZUwj=QYnaK+F7xEf4w_e2g3XxdMnSNZMZjuinHRcOB8A@mail.gmail.com > (It also seems like wasted effort to try to remove the overhead caused > by a parent table for partitioning. Why introduce a special case to > save a few bytes? Premature optimization, surely?) I don't think it's wasted effort, actually. My concern isn't so much the empty file on disk (which is stupid, but maybe harmless) as eliminating the dummy scan from the query plan. I believe the do-nothing scan can actually be a noticeable drag on performance in some cases - e.g. if the scan of the partitioned table is on the inside of a nested loop, so that instead of repeatedly doing an index scan on each of 4 partitions, you repeatedly do an index scan on each of 4 partitions and a sequential scan of an empty table. A zero-page sequential scan is pretty fast, but not free. An even bigger problem is that the planner may think that always-empty parent can contain some rows, throwing planner estimates off and messing up the whole plan. We've been living with that problem for a long time, but now that we have an opportunity to fix it, it would be good to do so. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2017/02/15 16:14, Michael Paquier wrote: > On Fri, Feb 10, 2017 at 3:19 PM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> The new partitioned tables do not contain any data by themselves. Any >> data inserted into a partitioned table is routed to and stored in one of >> its partitions. In fact, it is impossible to insert *any* data before a >> partition (to be precise, a leaf partition) is created. It seems wasteful >> then to allocate physical storage (files) for partitioned tables. If we >> do not allocate the storage, then we must make sure that the right thing >> happens when a command that is intended to manipulate a table's storage >> encounters a partitioned table, the "right thing" here being that the >> command's code either throws an error or warning (in some cases) if the >> specified table is a partitioned table or ignores any partitioned tables >> when it reads the list of relations to process from pg_class. Commands >> that need to be taught about this are vacuum, analyze, truncate, and alter >> table. Specifically: > > Thanks. I have been looking a bit at this set of patches. Thanks for reviewing! >> - In case of vacuum, specifying a partitioned table causes a warning >> - In case of analyze, we do not throw an error or warning but simply >> avoid calling do_analyze_rel() *non-recursively*. Further in >> acquire_inherited_sample_rows(), any partitioned tables in the list >> returned by find_all_inheritors() are skipped. >> - In case of truncate, only the part which manipulates table's physical >> storage is skipped for partitioned tables. > > I am wondering if instead those should not just be errors saying that > such operations are just not support. This could be relaxed in the > future to mean that a vacuum, truncate or analyze on the partitioned > table triggers an operator in cascade. As the discussion down-thread suggests, that might be a path worth considering. That is, vacuum or analyze on a partitioned table causes all the (leaf) partitions to be vacuumed or analyzed. Truncate already does that (that is, recurse to leaf partitions). This patch simply prevents ExecuteTruncate() from trying to manipulate the partitioned table's relfilenode. >> - Since we cannot create indexes on partitioned tables anyway, there is >> no need to handle cluster and reindex (they throw a meaningful error >> already due to the lack of indexes.) > > Yep. This one is still under discussion. >> Patches 0001 and 0002 of the attached implement the above part. 0001 >> teaches the above mentioned commands to do the "right thing" as described >> above and 0002 teaches heap_create() and heap_create_with_catalog() to not >> create any physical storage (none of the forks) for partitioned tables. > > - else > + /* > + * Although we cannot analyze partitioned tables themselves, we are > + * still be able to do the recursive ANALYZE. > + */ > + else if (onerel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE) > { > /* No need for a WARNING if we already complained during VACUUM */ > if (!(options & VACOPT_VACUUM)) > It seems to me that it is better style to use an empty else if with > only the comment. Comments related to a condition that are outside > this condition should be conditional in their formulations. Comments > within a condition can be affirmations when they refer to a decided > state of things. Not sure if this will survive based on the decisions on what to do about vacuum and analyze, but how about the following rewrite of the above: else if (onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE) { /* * Although we cannot analyze partitioned tables themselves, we are * still be able to do the recursive ANALYZE,which we do below. */ } else { >>From patch 2: > @@ -1351,7 +1352,12 @@ heap_create_with_catalog(const char *relname, > relkind == RELKIND_TOASTVALUE || > relkind == RELKIND_PARTITIONED_TABLE); > > - heap_create_init_fork(new_rel_desc); > + /* > + * We do not want to create any storage objects for a partitioned > + * table. > + */ > + if (relkind != RELKIND_PARTITIONED_TABLE) > + heap_create_init_fork(new_rel_desc); > } > This does not make much sense with an assertion telling exactly the > contrary a couple of lines before. I think that it would make more > sense to move the assertion on relkind directly in > heap_create_init_fork(). Hmm, perhaps the below is saner, with the Assert moved to the start of heap_create_init_fork() as shown below: /** We do not want to create any storage objects for a partitioned* table, including the init fork.*/ if (relpersistence == RELPERSISTENCE_UNLOGGED && relkind != RELKIND_PARTITIONED_TABLE) heap_create_init_fork(new_rel_desc); @@ -1382,6 +1376,8 @@ heap_create_with_catalog(const char *relname,voidheap_create_init_fork(Relation rel){ + Assert(relkind == RELKIND_RELATION || relkind == RELKIND_MATVIEW || + relkind == RELKIND_TOASTVALUE); >> Then comes 0003, which concerns inheritance planning. In a regular >> inheritance set (ie, the inheritance set corresponding to an inheritance >> hierarchy whose root is a regular table), the inheritance parents are >> included in their role as child members, because they might contribute >> rows to the final result. So AppendRelInfo's are created for each such >> table by the planner prep phase, which the later planning steps use to >> create a scan plan for those tables as the Append's child plans. >> Currently, the partitioned tables are also processed by the optimizer as >> inheritance sets. Partitioned table inheritance parents do not own any >> storage, so we *must* not create scan plans for them. So we do not need >> to process them as child members of the inheritance set. 0003 teaches >> expand_inherited_rtentry() to not add partitioned tables as child members. >> Also, since the root partitioned table RTE is no longer added to the >> Append list as the 1st child member, inheritance_planner() cannot assume >> that it can install the 1st child RTE as the nominalRelation of a given >> ModifyTable node, instead the original root parent table RTE is installed >> as the nominalRelation. > > This is a collection of checks on relkind == RELKIND_PARTITIONED_TABLE > to avoid interactions with partition tables. Did you consider doing > something in the executor instead? Getting rid of the parent relation (of which there could be many if a given partition tree is multi-level) well within the optimizer means that the optimizer makes a plan that does not have redundant Scan nodes for partitioned tables (and if the 2nd patch is worthy of consideration, we absolutely cannot create Scan nodes for them at all). And Robert's point about optimizer's cost and selectivity estimations. I will post the updated patches after addressing other comments in the thread. Thanks, Amit
On 2017/02/16 23:40, Robert Haas wrote: > On Thu, Feb 16, 2017 at 6:32 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> Why not vacuum all partitions? >> Why not analyze all partitions? >> Truncate all partitions > > I agree. But, we need to be careful that a database-wide VACUUM or > ANALYZE doesn't hit the partitions multiple times, once for the parent > and again for each child. Actually, a database-wide VACUUM should hit > each partition individually and do nothing for the parents, but a This is what would happen even without the patch. Patch only modifies what happens when a partitioned table is specified in the vacuum command. It emits a warning: WARNING: skipping "%s" --- cannot vacuum partitioned tables It seems both you and Simon agree that instead of this warning, we should recurse to process the leaf partitions (ignoring any partitioned tables in the hierarchy for which there is nothing to do). If that's right, I will modify the patch to do that. > database-wide ANALYZE should process the parents and do nothing for > the children, so that the inheritance statistics get updated. Currently vacuum() processes the following relations: /* * Process all plain relations and materialized views listed in * pg_class */ while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL) { Form_pg_class classForm = (Form_pg_class)GETSTRUCT(tuple); if (classForm->relkind != RELKIND_RELATION && classForm->relkind != RELKIND_MATVIEW) continue; Do you mean that if database-wide analyze is to be run, we should also exclude those RELKIND_RELATION relations that are partitions? So the only way to update a partition's statistics is to directly specify it in the command or by autovacuum. Truncate already recurses to partitions by way of inheritance recursion that's already in place. The patch simply teaches ExecuteTruncate() to ignore partitioned tables when we get to the part where relfilenodes are manipulated. >>> - ATRewriteTables() skips on the AlteredTableInfo entries for partitioned >>> tables, because there is nothing to be done. >>> >>> - Since we cannot create indexes on partitioned tables anyway, there is >>> no need to handle cluster and reindex (they throw a meaningful error >>> already due to the lack of indexes.) >> >> Create index on all partitions > > That one's more complicated, per what I wrote in > https://www.postgresql.org/message-id/CA+TgmoZUwj=QYnaK+F7xEf4w_e2g3XxdMnSNZMZjuinHRcOB8A@mail.gmail.com I agree that it would be nice to fix this usability issue. But, it's also important as you say in the quoted email that we should not underestimate the amount of work that might be required to properly implement it. >> (It also seems like wasted effort to try to remove the overhead caused >> by a parent table for partitioning. Why introduce a special case to >> save a few bytes? Premature optimization, surely?) > > I don't think it's wasted effort, actually. My concern isn't so much > the empty file on disk (which is stupid, but maybe harmless) as > eliminating the dummy scan from the query plan. I believe the > do-nothing scan can actually be a noticeable drag on performance in > some cases - e.g. if the scan of the partitioned table is on the > inside of a nested loop, so that instead of repeatedly doing an index > scan on each of 4 partitions, you repeatedly do an index scan on each > of 4 partitions and a sequential scan of an empty table. A zero-page > sequential scan is pretty fast, but not free. An even bigger problem > is that the planner may think that always-empty parent can contain > some rows, throwing planner estimates off and messing up the whole > plan. We've been living with that problem for a long time, but now > that we have an opportunity to fix it, it would be good to do so. Agreed. I would have reconsidered if it were a more invasive patch. Thanks, Amit
On Fri, Feb 17, 2017 at 1:12 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> I agree. But, we need to be careful that a database-wide VACUUM or >> ANALYZE doesn't hit the partitions multiple times, once for the parent >> and again for each child. Actually, a database-wide VACUUM should hit >> each partition individually and do nothing for the parents, but a > > This is what would happen even without the patch. Patch only modifies > what happens when a partitioned table is specified in the vacuum command. > It emits a warning: > > WARNING: skipping "%s" --- cannot vacuum partitioned tables > > It seems both you and Simon agree that instead of this warning, we should > recurse to process the leaf partitions (ignoring any partitioned tables in > the hierarchy for which there is nothing to do). If that's right, I will > modify the patch to do that. Yeah, that sounds fine. >> database-wide ANALYZE should process the parents and do nothing for >> the children, so that the inheritance statistics get updated. > > Currently vacuum() processes the following relations: > > /* > * Process all plain relations and materialized views listed in > * pg_class > */ > > while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL) > { > Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tuple); > > if (classForm->relkind != RELKIND_RELATION && > classForm->relkind != RELKIND_MATVIEW) > continue; > > Do you mean that if database-wide analyze is to be run, we should also > exclude those RELKIND_RELATION relations that are partitions? > > So the only way to update a partition's statistics is to directly specify > it in the command or by autovacuum. I think if you type: ANALYZE; ...that should process all partitioned tables and all tables that are not themselves partitions. If you type: ANALYZE name; ...that should ANALYZE that relation, whatever it is. If it's a partitioned table, it should recurse. > Truncate already recurses to partitions by way of inheritance recursion > that's already in place. The patch simply teaches ExecuteTruncate() to > ignore partitioned tables when we get to the part where relfilenodes are > manipulated. Oh, OK. That seems fine. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 16 February 2017 at 11:32, Simon Riggs <simon@2ndquadrant.com> wrote: > On 10 February 2017 at 06:19, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: > >> the "right thing" here being that the >> command's code either throws an error or warning (in some cases) if the >> specified table is a partitioned table or ignores any partitioned tables >> when it reads the list of relations to process from pg_class. > > This is a massive assumption and deserves major discussion. > > My expectation is that "partitioned tables" are "tables". Anything > else seems to fly in the face of both the SQL Standard and the POLA > principle for users coming from other database systems. > > IMHO all the main actions should all "just work" not throw errors. This included DROP TABLE, which I commented on before. CASCADE should not be required. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017/02/20 5:31, Simon Riggs wrote: > On 16 February 2017 at 11:32, Simon Riggs <simon@2ndquadrant.com> wrote: >> On 10 February 2017 at 06:19, Amit Langote >> <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> >>> the "right thing" here being that the >>> command's code either throws an error or warning (in some cases) if the >>> specified table is a partitioned table or ignores any partitioned tables >>> when it reads the list of relations to process from pg_class. >> >> This is a massive assumption and deserves major discussion. >> >> My expectation is that "partitioned tables" are "tables". Anything >> else seems to fly in the face of both the SQL Standard and the POLA >> principle for users coming from other database systems. >> >> IMHO all the main actions should all "just work" not throw errors. > > This included DROP TABLE, which I commented on before. > > CASCADE should not be required. Yeah, it seemed like that is the consensus so I posted a patch [0], which re-posted in a new thread titled "dropping partitioned tables without CASCADE" [1]. Thanks, Amit [0] https://postgr.es/m/ca132b99-0d18-439a-fe65-024085449259%40lab.ntt.co.jp [1] https://postgr.es/m/6c420206-45d7-3f56-8325-4bd7b76483ba%40lab.ntt.co.jp
On 2017/02/19 18:53, Robert Haas wrote: > On Fri, Feb 17, 2017 at 1:12 PM, Amit Langote wrote: >> Do you mean that if database-wide analyze is to be run, we should also >> exclude those RELKIND_RELATION relations that are partitions? >> >> So the only way to update a partition's statistics is to directly specify >> it in the command or by autovacuum. > > I think if you type: > > ANALYZE; > > ...that should process all partitioned tables and all tables that are > not themselves partitions. OK. > If you type: > > ANALYZE name; > > ...that should ANALYZE that relation, whatever it is. If it's a > partitioned table, it should recurse. To be clear, by "recurse" I assume you mean to perform ANALYZE on individual partitions, not just collect the inheritance statistics. So ANALYZE partitioned_table would both a) collect the inheritance statistics for the specified table and other partitioned tables in the hierarchy, b) ANALYZE every leaf partitions updating their statistics in pg_class. While working on this, I noticed that autovacuum.c does not collect RELKIND_PARTITIONED_TABLE relations, which I think is not right. It should match what get_rel_oids() does, which in the database-wide VACUUM/ANALYZE case collects them. Attached updated patches: Updated 0001 addresses the following comments: - should recurse when vacuum/analyze is performed on a partitioned table - database-wide vacuum should ignore partitioned tables - database-wide analyze should ignore partitions; only the inheritance statistics of the partitioned tables must be collected in this case Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Some comments about 0003 patch. @@ -996,10 +996,20 @@ inheritance_planner(PlannerInfo *root) Index rti; + RangeTblEntry *parent_rte; There's already another variable declared in that function within a loop foreach(lc, root->append_rel_list) { ... RangeTblEntry *parent_rte; RangeTblEntry *child_rte; You might want to choose a different name or delete the one within the loop. I am wondering whether we should deal with inh flat reset in a slightly different way. Let expand_inherited_rtentry() mark inh = false for the partitioned tables without any partitions and deal with those at the time of estimating size by marking those as dummy. That might be better than multiple changes here. I will try to cook up a patch soon for that. Also we should add tests to make sure the scans on partitioned tables without any partitions do not get into problems. PFA patch which adds those tests. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On 2017/02/21 22:21, Ashutosh Bapat wrote: > Some comments about 0003 patch. > > @@ -996,10 +996,20 @@ inheritance_planner(PlannerInfo *root) > Index rti; > + RangeTblEntry *parent_rte; > There's already another variable declared in that function within a loop > foreach(lc, root->append_rel_list) > { > ... > RangeTblEntry *parent_rte; > RangeTblEntry *child_rte; > > You might want to choose a different name or delete the one within the loop. Deleted the one within the loop. > I am wondering whether we should deal with inh flat reset in a > slightly different way. Let expand_inherited_rtentry() mark inh = > false for the partitioned tables without any partitions and deal with > those at the time of estimating size by marking those as dummy. That > might be better than multiple changes here. I will try to cook up a > patch soon for that. Are thinking something along the lines of the attached rewritten patch 0003? I also tend to think that's probably a cleaner patch. Thanks for the idea. > Also we should add tests to make sure the scans on partitioned tables > without any partitions do not get into problems. PFA patch which adds > those tests. I added the test case you suggest, but kept just the first one. I am including the patches 0001 and 0002 to keep all patches being discussed on this thread together. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
> >> I am wondering whether we should deal with inh flat reset in a >> slightly different way. Let expand_inherited_rtentry() mark inh = >> false for the partitioned tables without any partitions and deal with >> those at the time of estimating size by marking those as dummy. That >> might be better than multiple changes here. I will try to cook up a >> patch soon for that. > > Are thinking something along the lines of the attached rewritten patch > 0003? I also tend to think that's probably a cleaner patch. Thanks for > the idea. Yes. Thanks for working on it. > >> Also we should add tests to make sure the scans on partitioned tables >> without any partitions do not get into problems. PFA patch which adds >> those tests. > > I added the test case you suggest, but kept just the first one. The second one was testing multi-level partitioning case, which deserves a testcase by its own. Some more comments The comment below seems too verbose. All it can say is "A partitioned table without any partitions results in a dummy relation." I don't think we need to be explicit about rte->inh. But it's more of personal preference. We will leave it to the committer, if you don't agree. + /* + * An empty partitioned table, i.e., one without any leaf + * partitions, as signaled by rte->inh being set to false + * by the prep phase (see expand_inherited_rtentry). + */ We don't need this comment as well. Rather than repeating what's been said at the top of the function, it should just refer to it like "nominal relation has been set above for partitioned tables. For other parent relations, we'll use the first child ...". + * + * If the parent is a partitioned table, we do not have a separate RTE + * representing it as a member of the inheritance set, because we do + * not create a scan plan for it. As mentioned at the top of this + * function, we make the parent RTE itself the nominal relation. */ Following condition is not very readable. It's not evident that it's of the form (A && B) || C, at a glance it looks like it's A && (B || C). + if ((rte->relkind != RELKIND_PARTITIONED_TABLE && + list_length(appinfos) < 2) || list_length(appinfos) < 1) Instead you may rearrage it as min_child_rels = (rte->relkind == RELKIND_PARTITIONED_TABLE ? 1 : 2); if (list_length(appinfos) < min_child_rels) -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Thanks for the review. On 2017/02/22 21:58, Ashutosh Bapat wrote: >>> Also we should add tests to make sure the scans on partitioned tables >>> without any partitions do not get into problems. PFA patch which adds >>> those tests. >> >> I added the test case you suggest, but kept just the first one. > > The second one was testing multi-level partitioning case, which > deserves a testcase by its own. Ah, okay. Added it back. > Some more comments > > The comment below seems too verbose. All it can say is "A partitioned table > without any partitions results in a dummy relation." I don't think we need to > be explicit about rte->inh. But it's more of personal preference. We will leave > it to the committer, if you don't agree. > + /* > + * An empty partitioned table, i.e., one without any leaf > + * partitions, as signaled by rte->inh being set to false > + * by the prep phase (see expand_inherited_rtentry). > + */ It does seem poorly worded. How about: /* * A partitioned table without leaf partitions is marked * as a dummy rel. */ > We don't need this comment as well. Rather than repeating what's been said at > the top of the function, it should just refer to it like "nominal relation has > been set above for partitioned tables. For other parent relations, we'll use > the first child ...". > + * > + * If the parent is a partitioned table, we do not have a separate RTE > + * representing it as a member of the inheritance set, because we do > + * not create a scan plan for it. As mentioned at the top of this > + * function, we make the parent RTE itself the nominal relation. > */ Rewrote that comment block as: * * If the parent is a partitioned table, we already set the nominal * relation. */ > Following condition is not very readable. It's not evident that it's of the > form (A && B) || C, at a glance it looks like it's A && (B || C). > + if ((rte->relkind != RELKIND_PARTITIONED_TABLE && > + list_length(appinfos) < 2) || list_length(appinfos) < 1) > > Instead you may rearrage it as > min_child_rels = (rte->relkind == RELKIND_PARTITIONED_TABLE ? 1 : 2); > if (list_length(appinfos) < min_child_rels) OK, done that way. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Thu, Feb 23, 2017 at 11:19 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Thanks for the review. > > On 2017/02/22 21:58, Ashutosh Bapat wrote: >>>> Also we should add tests to make sure the scans on partitioned tables >>>> without any partitions do not get into problems. PFA patch which adds >>>> those tests. >>> >>> I added the test case you suggest, but kept just the first one. >> >> The second one was testing multi-level partitioning case, which >> deserves a testcase by its own. > > Ah, okay. Added it back. Thanks. > >> Some more comments >> >> The comment below seems too verbose. All it can say is "A partitioned table >> without any partitions results in a dummy relation." I don't think we need to >> be explicit about rte->inh. But it's more of personal preference. We will leave >> it to the committer, if you don't agree. >> + /* >> + * An empty partitioned table, i.e., one without any leaf >> + * partitions, as signaled by rte->inh being set to false >> + * by the prep phase (see expand_inherited_rtentry). >> + */ > > It does seem poorly worded. How about: > > /* > * A partitioned table without leaf partitions is marked > * as a dummy rel. > */ > >> We don't need this comment as well. Rather than repeating what's been said at >> the top of the function, it should just refer to it like "nominal relation has >> been set above for partitioned tables. For other parent relations, we'll use >> the first child ...". >> + * >> + * If the parent is a partitioned table, we do not have a separate RTE >> + * representing it as a member of the inheritance set, because we do >> + * not create a scan plan for it. As mentioned at the top of this >> + * function, we make the parent RTE itself the nominal relation. >> */ > > Rewrote that comment block as: > > * > * If the parent is a partitioned table, we already set the nominal > * relation. > */ > I reworded those comments a bit and corrected grammar. Please check in the attached patch. > >> Following condition is not very readable. It's not evident that it's of the >> form (A && B) || C, at a glance it looks like it's A && (B || C). >> + if ((rte->relkind != RELKIND_PARTITIONED_TABLE && >> + list_length(appinfos) < 2) || list_length(appinfos) < 1) >> >> Instead you may rearrage it as >> min_child_rels = (rte->relkind == RELKIND_PARTITIONED_TABLE ? 1 : 2); >> if (list_length(appinfos) < min_child_rels) > > OK, done that way. On a second thought, I added a boolean to check if there were any children created and then reset rte->inh based on that value. That's better than relying on appinfos length now. Let me know what you think. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Thanks for the review. On 2017/02/23 15:44, Ashutosh Bapat wrote: > On Thu, Feb 23, 2017 at 11:19 AM, Amit Langote wrote: >> Rewrote that comment block as: >> >> * >> * If the parent is a partitioned table, we already set the nominal >> * relation. >> */ >> > > I reworded those comments a bit and corrected grammar. Please check in > the attached patch. What was there sounds grammatically correct to me, but fine. >>> Following condition is not very readable. It's not evident that it's of the >>> form (A && B) || C, at a glance it looks like it's A && (B || C). >>> + if ((rte->relkind != RELKIND_PARTITIONED_TABLE && >>> + list_length(appinfos) < 2) || list_length(appinfos) < 1) >>> >>> Instead you may rearrage it as >>> min_child_rels = (rte->relkind == RELKIND_PARTITIONED_TABLE ? 1 : 2); >>> if (list_length(appinfos) < min_child_rels) >> >> OK, done that way. > > On a second thought, I added a boolean to check if there were any > children created and then reset rte->inh based on that value. That's > better than relying on appinfos length now. @@ -996,10 +996,20 @@ inheritance_planner(PlannerInfo *root) /* + * Partitioned tables do not have storage for themselves and should not be + * scanned. @@ -1450,6 +1451,21 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti) /* + * Partitioned tables themselves do not have any storage and should not + * be scanned. So, do not create child relations for those. + */ I guess we should not have to repeat "partitioned tables do not have storage" in all these places. + * a partitioned relation as dummy. The duplicate RTE we added for the + * parent table is harmless, so we don't bother to get rid of it; ditto for + * the useless PlanRowMark node. There is no duplicate RTE in the partitioned table case, which even my original comment failed to consider. Can you, maybe? Thanks, Amit
On Thu, Feb 23, 2017 at 1:08 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Thanks for the review. > > On 2017/02/23 15:44, Ashutosh Bapat wrote: >> On Thu, Feb 23, 2017 at 11:19 AM, Amit Langote wrote: >>> Rewrote that comment block as: >>> >>> * >>> * If the parent is a partitioned table, we already set the nominal >>> * relation. >>> */ >>> >> >> I reworded those comments a bit and corrected grammar. Please check in >> the attached patch. > > What was there sounds grammatically correct to me, but fine. > >>>> Following condition is not very readable. It's not evident that it's of the >>>> form (A && B) || C, at a glance it looks like it's A && (B || C). >>>> + if ((rte->relkind != RELKIND_PARTITIONED_TABLE && >>>> + list_length(appinfos) < 2) || list_length(appinfos) < 1) >>>> >>>> Instead you may rearrage it as >>>> min_child_rels = (rte->relkind == RELKIND_PARTITIONED_TABLE ? 1 : 2); >>>> if (list_length(appinfos) < min_child_rels) >>> >>> OK, done that way. >> >> On a second thought, I added a boolean to check if there were any >> children created and then reset rte->inh based on that value. That's >> better than relying on appinfos length now. > > @@ -996,10 +996,20 @@ inheritance_planner(PlannerInfo *root) > /* > + * Partitioned tables do not have storage for themselves and should not be > + * scanned. > > @@ -1450,6 +1451,21 @@ expand_inherited_rtentry(PlannerInfo *root, > RangeTblEntry *rte, Index rti) > /* > + * Partitioned tables themselves do not have any storage and should not > + * be scanned. So, do not create child relations for those. > + */ > > I guess we should not have to repeat "partitioned tables do not have > storage" in all these places. Hmm, you are right. But they are two different functions and the code blocks are located away from each other. So, I thought it would be better to have complete comment there, instead of pointing to other places. I am fine, if we can reword it without compromising readability. > > + * a partitioned relation as dummy. The duplicate RTE we added for the > + * parent table is harmless, so we don't bother to get rid of it; ditto for > + * the useless PlanRowMark node. > > There is no duplicate RTE in the partitioned table case, which even my > original comment failed to consider. Can you, maybe? May be we could says "If we have added duplicate RTE for the parent table, it is harmless ...". Does that sound good? -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On 2017/02/23 16:48, Ashutosh Bapat wrote: > On Thu, Feb 23, 2017 at 1:08 PM, Amit Langote wrote: >> @@ -996,10 +996,20 @@ inheritance_planner(PlannerInfo *root) >> /* >> + * Partitioned tables do not have storage for themselves and should not be >> + * scanned. >> >> @@ -1450,6 +1451,21 @@ expand_inherited_rtentry(PlannerInfo *root, >> RangeTblEntry *rte, Index rti) >> /* >> + * Partitioned tables themselves do not have any storage and should not >> + * be scanned. So, do not create child relations for those. >> + */ >> >> I guess we should not have to repeat "partitioned tables do not have >> storage" in all these places. > > Hmm, you are right. But they are two different functions and the code > blocks are located away from each other. So, I thought it would be > better to have complete comment there, instead of pointing to other > places. I am fine, if we can reword it without compromising > readability. I was saying in general. I agree that different sites in the optimizer may have different considerations for why partitioned tables are to be handled specially, common theme being that we do not have to scan the parent relation. If the implementation changes someday such that we don't manipulate child tables (especially, partitions) through most planning phases anymore, then maybe we will start using some different terminology where we don't have to stress this fact too much. We're not there yet though. >> + * a partitioned relation as dummy. The duplicate RTE we added for the >> + * parent table is harmless, so we don't bother to get rid of it; ditto for >> + * the useless PlanRowMark node. >> >> There is no duplicate RTE in the partitioned table case, which even my >> original comment failed to consider. Can you, maybe? > > May be we could says "If we have added duplicate RTE for the parent > table, it is harmless ...". Does that sound good? Duplicate RTE added in the non-partitioned table case is harmless, so we don't bother to get rid of it; ditto for the useless PlanRowMark node. Thanks, Amit
On Thu, Feb 23, 2017 at 1:38 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/02/23 16:48, Ashutosh Bapat wrote: >> On Thu, Feb 23, 2017 at 1:08 PM, Amit Langote wrote: >>> @@ -996,10 +996,20 @@ inheritance_planner(PlannerInfo *root) >>> /* >>> + * Partitioned tables do not have storage for themselves and should not be >>> + * scanned. >>> >>> @@ -1450,6 +1451,21 @@ expand_inherited_rtentry(PlannerInfo *root, >>> RangeTblEntry *rte, Index rti) >>> /* >>> + * Partitioned tables themselves do not have any storage and should not >>> + * be scanned. So, do not create child relations for those. >>> + */ >>> >>> I guess we should not have to repeat "partitioned tables do not have >>> storage" in all these places. >> >> Hmm, you are right. But they are two different functions and the code >> blocks are located away from each other. So, I thought it would be >> better to have complete comment there, instead of pointing to other >> places. I am fine, if we can reword it without compromising >> readability. > > I was saying in general. I agree that different sites in the optimizer > may have different considerations for why partitioned tables are to be > handled specially, common theme being that we do not have to scan the > parent relation. If the implementation changes someday such that we don't > manipulate child tables (especially, partitions) through most planning > phases anymore, then maybe we will start using some different terminology > where we don't have to stress this fact too much. We're not there yet though. I agree. > >>> + * a partitioned relation as dummy. The duplicate RTE we added for the >>> + * parent table is harmless, so we don't bother to get rid of it; ditto for >>> + * the useless PlanRowMark node. >>> >>> There is no duplicate RTE in the partitioned table case, which even my >>> original comment failed to consider. Can you, maybe? >> >> May be we could says "If we have added duplicate RTE for the parent >> table, it is harmless ...". Does that sound good? > > Duplicate RTE added in the non-partitioned table case is harmless, so we > don't bother to get rid of it; ditto for the useless PlanRowMark node. Fine with me. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On Fri, Feb 10, 2017 at 03:19:47PM +0900, Amit Langote wrote: > The new partitioned tables do not contain any data by themselves. Any > data inserted into a partitioned table is routed to and stored in one of > its partitions. In fact, it is impossible to insert *any* data before a > partition (to be precise, a leaf partition) is created. It seems wasteful > then to allocate physical storage (files) for partitioned tables. If we > do not allocate the storage, then we must make sure that the right thing > happens when a command that is intended to manipulate a table's storage > encounters a partitioned table, the "right thing" here being that the > command's code either throws an error or warning (in some cases) if the > specified table is a partitioned table or ignores any partitioned tables > when it reads the list of relations to process from pg_class. Commands > that need to be taught about this are vacuum, analyze, truncate, and alter > table. Specifically: > > - In case of vacuum, specifying a partitioned table causes a warning > > - In case of analyze, we do not throw an error or warning but simply > avoid calling do_analyze_rel() *non-recursively*. Further in > acquire_inherited_sample_rows(), any partitioned tables in the list > returned by find_all_inheritors() are skipped. > > - In case of truncate, only the part which manipulates table's physical > storage is skipped for partitioned tables. > > - ATRewriteTables() skips on the AlteredTableInfo entries for partitioned > tables, because there is nothing to be done. > > - Since we cannot create indexes on partitioned tables anyway, there is > no need to handle cluster and reindex (they throw a meaningful error > already due to the lack of indexes.) I don't think we are doing this, but if the parent table doesn't have a physical file pg_upgrade will need to be taught that. We have that case now for unlogged tables on standby servers that we need to address. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On 2017/02/28 3:54, Bruce Momjian wrote: > On Fri, Feb 10, 2017 at 03:19:47PM +0900, Amit Langote wrote: >> The new partitioned tables do not contain any data by themselves. Any >> data inserted into a partitioned table is routed to and stored in one of >> its partitions. In fact, it is impossible to insert *any* data before a >> partition (to be precise, a leaf partition) is created. It seems wasteful >> then to allocate physical storage (files) for partitioned tables. If we >> do not allocate the storage, then we must make sure that the right thing >> happens when a command that is intended to manipulate a table's storage >> encounters a partitioned table, the "right thing" here being that the >> command's code either throws an error or warning (in some cases) if the >> specified table is a partitioned table or ignores any partitioned tables >> when it reads the list of relations to process from pg_class. Commands >> that need to be taught about this are vacuum, analyze, truncate, and alter >> table. Specifically: >> >> - In case of vacuum, specifying a partitioned table causes a warning >> >> - In case of analyze, we do not throw an error or warning but simply >> avoid calling do_analyze_rel() *non-recursively*. Further in >> acquire_inherited_sample_rows(), any partitioned tables in the list >> returned by find_all_inheritors() are skipped. >> >> - In case of truncate, only the part which manipulates table's physical >> storage is skipped for partitioned tables. >> >> - ATRewriteTables() skips on the AlteredTableInfo entries for partitioned >> tables, because there is nothing to be done. >> >> - Since we cannot create indexes on partitioned tables anyway, there is >> no need to handle cluster and reindex (they throw a meaningful error >> already due to the lack of indexes.) > > I don't think we are doing this, but if the parent table doesn't have a > physical file pg_upgrade will need to be taught that. We have that case > now for unlogged tables on standby servers that we need to address. Partitioned tables do have physical files as of now. This thread is to discuss the proposal to get rid of the physical file for partitioned tables. By the way, I just tried pg_upgrade on top of this patch, and it seems partitioned tables without the physical file migrate just fine to the new cluster. To test I did: created a partitioned table and few partitions, inserted some data into it, pg_upgraded the cluster to find the partitioned table intact with its data in the new cluster (to be clear, the data is contained in partitions). Is there something that wouldn't work that I should instead be testing? Also, it seems that the partitioned tables (without physical files) won't have the same issue on the standby servers as unlogged tables. It's just that we route inserts into a partitioned table to its partitions and hence the partitioned table itself does not contain because all the incoming data is routed. Am I missing something? Thanks, Amit
On Thu, Feb 23, 2017 at 11:19 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Thanks for the review. In 0001, the documentation which you are patching has a section for limitations that apply only to both partitioning and constraint exclusion, and another for limitations that apply only to constraint exclusion. Surely the patch should be moving a limitation that will no longer apply to partitioning from the first section to the second section, rather than leaving it in the section for limitations that apply to both systems and just adding a note that say "this doesn't apply to partitioning any more". In acquire_inherited_sample_rows(), instead of inserting a whole stanza of logic just above the existing dispatch on relkind, I think we can get by with a very slightly update to what's already there. You can't use the result of a & b as a bool. You need to write (a & b) != 0, because the bool should always use 1 for true and 0 for false; it should not set some higher-numbered bit. The changes to autovacuum.c look useless, because relation_needs_vacanalyze() will presumably never fire for a table with no tuples of its own. Updated patch with those changes and a few cosmetic tweaks attached. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Tue, Feb 28, 2017 at 11:53:16AM +0900, Amit Langote wrote: > > I don't think we are doing this, but if the parent table doesn't have a > > physical file pg_upgrade will need to be taught that. We have that case > > now for unlogged tables on standby servers that we need to address. > > Partitioned tables do have physical files as of now. This thread is to > discuss the proposal to get rid of the physical file for partitioned tables. > > By the way, I just tried pg_upgrade on top of this patch, and it seems > partitioned tables without the physical file migrate just fine to the new > cluster. To test I did: created a partitioned table and few partitions, > inserted some data into it, pg_upgraded the cluster to find the > partitioned table intact with its data in the new cluster (to be clear, > the data is contained in partitions). Is there something that wouldn't > work that I should instead be testing? > > Also, it seems that the partitioned tables (without physical files) won't > have the same issue on the standby servers as unlogged tables. It's just > that we route inserts into a partitioned table to its partitions and hence > the partitioned table itself does not contain because all the incoming > data is routed. Am I missing something? I see why it works now. pg_upgrade only upgrades relations and materialized views: " WHERE relkind IN ('r', 'm') AND " but partitions are defined as 'P': #define RELKIND_PARTITIONED_TABLE 'P' /* partitioned table */ I am a little confused by the above. Is a partitioned table the parent or the children? Reading the code it seems it is the parent, which explains why it works. Can I clarify that? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Tue, Feb 28, 2017 at 12:23 PM, Bruce Momjian <bruce@momjian.us> wrote: > I am a little confused by the above. Is a partitioned table the parent > or the children? Reading the code it seems it is the parent, which > explains why it works. Can I clarify that? As I understand things, a partitioned table is a parent relation that exists only to link to its child portions, holding as well the definitions linking to each partition. -- Michael
On 2017/02/28 12:29, Michael Paquier wrote: > On Tue, Feb 28, 2017 at 12:23 PM, Bruce Momjian <bruce@momjian.us> wrote: >> I am a little confused by the above. Is a partitioned table the parent >> or the children? Reading the code it seems it is the parent, which >> explains why it works. Can I clarify that? > > As I understand things, a partitioned table is a parent relation that > exists only to link to its child portions, holding as well the > definitions linking to each partition. Yes. As I mentioned in my previous email, data inserted into the partitioned table is routed to its partitions. A partition may be itself a partitioned table, so the routing continues until we find a leaf partition. All the partitioned tables in this chain leading to the leaf partition are RELKIND_PARTITIONED_TABLE ('P') relations and only the leaf tables are RELKIND_RELATION tables. Thanks, Amit
Amit's original patch had an assertion failure in extract_autovac_opts(). His patch adds partitioned tables in the list of tables to be auto-analyzed. But there's an assertion in extract_autovac_opts(), which did not consider partitioned tables. When a partitioned table is created, the assertion trips in autovacuum worker and resets the cluster, restarting all the backends. I have updated Robert's patch with the assertion fixed. On Tue, Feb 28, 2017 at 8:50 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Feb 23, 2017 at 11:19 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Thanks for the review. > > In 0001, the documentation which you are patching has a section for > limitations that apply only to both partitioning and constraint > exclusion, and another for limitations that apply only to constraint > exclusion. Surely the patch should be moving a limitation that will > no longer apply to partitioning from the first section to the second > section, rather than leaving it in the section for limitations that > apply to both systems and just adding a note that say "this doesn't > apply to partitioning any more". > > In acquire_inherited_sample_rows(), instead of inserting a whole > stanza of logic just above the existing dispatch on relkind, I think > we can get by with a very slightly update to what's already there. > > You can't use the result of a & b as a bool. You need to write (a & > b) != 0, because the bool should always use 1 for true and 0 for > false; it should not set some higher-numbered bit. > > The changes to autovacuum.c look useless, because > relation_needs_vacanalyze() will presumably never fire for a table > with no tuples of its own. > > Updated patch with those changes and a few cosmetic tweaks attached. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On 2017/02/28 13:52, Ashutosh Bapat wrote: > Amit's original patch had an assertion failure in > extract_autovac_opts(). His patch adds partitioned tables in the list > of tables to be auto-analyzed. But there's an assertion in > extract_autovac_opts(), which did not consider partitioned tables. > When a partitioned table is created, the assertion trips in autovacuum > worker and resets the cluster, restarting all the backends. I have > updated Robert's patch with the assertion fixed. Since Robert's patch backed out all autovacuum.c changes, partitioned tables case would no longer be able to reach that Assert. Thanks, Amit
Thanks for the review. I was just about to send a new version of the patches. On 2017/02/28 12:20, Robert Haas wrote: > On Thu, Feb 23, 2017 at 11:19 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Thanks for the review. > > In 0001, the documentation which you are patching has a section for > limitations that apply only to both partitioning and constraint > exclusion, and another for limitations that apply only to constraint > exclusion. Surely the patch should be moving a limitation that will > no longer apply to partitioning from the first section to the second > section, rather than leaving it in the section for limitations that > apply to both systems and just adding a note that say "this doesn't > apply to partitioning any more". So we have the following headings under 5.11.6 Caveats: "The following caveats apply to partitioned tables implemented using either method (unless noted otherwise)" and, "The following caveats apply to constraint exclusion" The latter lists only the caveats about the planner capability (constraint exclusion) for partitioning - that it cannot be performed against volatile WHERE conditions, only works with comparisons using btree operators, does not scale beyond hundred partitions, etc. These limitations are common to both systems (inheritance and declarative partitioned tables), because both rely on constraint exclusion. How about separating the limitations listed under the first heading into: "Limitations of using inheritance for partitioning" and, "Limitations of using partitioned tables" This particular patch gets rid of the restriction that vacuum/analyze do not recurse to child tables (partitions) only for the second of the above. How about the documentation changes in the attached updated 0001? I know that this page needs a much larger rewrite as we are discussing in the other thread. > In acquire_inherited_sample_rows(), instead of inserting a whole > stanza of logic just above the existing dispatch on relkind, I think > we can get by with a very slightly update to what's already there. > > You can't use the result of a & b as a bool. You need to write (a & > b) != 0, because the bool should always use 1 for true and 0 for > false; it should not set some higher-numbered bit. Oops, thanks for fixing that. I suppose you are referring to this hunk in the original patch: - relations = get_rel_oids(relid, relation); + relations = get_rel_oids(relid, relation, options & VACOPT_VACUUM); And we need to do it this way in *this* case, because we're passing it as a bool argument. I see that it's OK to do this: stmttype = (options & VACOPT_VACUUM) ? "VACUUM" : "ANALYZE"; Or this: if (options & VACOPT_VACUUM) { PreventTransactionChain(isTopLevel, stmttype); > The changes to autovacuum.c look useless, because > relation_needs_vacanalyze() will presumably never fire for a table > with no tuples of its own. Hmm, I guess that makes sense. Inheritance statistics are never collected by autovacuum (at least in the partitioning setups and especially with the partitioned tables now). I recall that you mentioned [1] this as one of the limitations of the system currently. > Updated patch with those changes and a few cosmetic tweaks attached. I have incorporated the changes in your patch in the attached updated 0001, with documentation tweaks as mentioned above. Other than that, in acquire_inherited_sample_rows() I've used a method suggested by Ashutosh Bapat to track if we encountered a child table, which is much less ugly than below: - if (nrels < 2) + if ((onerel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE && nrels < 2) || + nrels < 1) I keep updated 0002 and 0003 here, but I've changed their order. Don't scan partitioned tables thing is now 0002 and don't allocate file for partitioned tables is 0003. Sorry about the confusion. Thanks, Amit [1] https://postgr.es/m/CA%2BTgmobTxn2%2B0x96h5Le%2BGOK5kw3J37SRveNfzEdx9s5-Yd8vA%40mail.gmail.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Tue, Feb 28, 2017 at 11:35 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > How about the documentation changes in the attached updated 0001? I know > that this page needs a much larger rewrite as we are discussing in the > other thread. Looks good. >> In acquire_inherited_sample_rows(), instead of inserting a whole >> stanza of logic just above the existing dispatch on relkind, I think >> we can get by with a very slightly update to what's already there. >> >> You can't use the result of a & b as a bool. You need to write (a & >> b) != 0, because the bool should always use 1 for true and 0 for >> false; it should not set some higher-numbered bit. > > Oops, thanks for fixing that. I suppose you are referring to this hunk in > the original patch: > > - relations = get_rel_oids(relid, relation); > + relations = get_rel_oids(relid, relation, options & VACOPT_VACUUM); > > And we need to do it this way in *this* case, because we're passing it as > a bool argument. I see that it's OK to do this: > > stmttype = (options & VACOPT_VACUUM) ? "VACUUM" : "ANALYZE"; > > Or this: > > if (options & VACOPT_VACUUM) > { > PreventTransactionChain(isTopLevel, stmttype); In those cases it's still clearer, IMHO, to use != 0, but it's not necessary. However, when you're explicitly creating a value of type "bool", then it's necessary. Actually, looking at this again, I now think this part is wrong: + /* + * If only ANALYZE is to be performed, there is no need to include + * partitions in the list. In a database-wide ANALYZE, we only + * update the inheritance statistics of partitioned tables, not + * the statistics of individual partitions. + */ + if (!is_vacuum && classForm->relispartition) continue; I was thinking earlier that an ANALYZE on the parent would also update the statistics for each child, but now I see that's not so. So now I think we should omit this logic (and change the documentation to match). That is, a database-wide ANALYZE should update the statistics for each child as well as for the parent. Otherwise direct queries against the children (and partitionwise joins, once we have that) are going to go haywire. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2017/03/02 18:36, Robert Haas wrote: > On Tue, Feb 28, 2017 at 11:35 AM, Amit Langote wrote: >>> In acquire_inherited_sample_rows(), instead of inserting a whole >>> stanza of logic just above the existing dispatch on relkind, I think >>> we can get by with a very slightly update to what's already there. >>> >>> You can't use the result of a & b as a bool. You need to write (a & >>> b) != 0, because the bool should always use 1 for true and 0 for >>> false; it should not set some higher-numbered bit. >> >> Oops, thanks for fixing that. I suppose you are referring to this hunk in >> the original patch: >> >> - relations = get_rel_oids(relid, relation); >> + relations = get_rel_oids(relid, relation, options & VACOPT_VACUUM); >> >> And we need to do it this way in *this* case, because we're passing it as >> a bool argument. I see that it's OK to do this: >> >> stmttype = (options & VACOPT_VACUUM) ? "VACUUM" : "ANALYZE"; >> >> Or this: >> >> if (options & VACOPT_VACUUM) >> { >> PreventTransactionChain(isTopLevel, stmttype); > > In those cases it's still clearer, IMHO, to use != 0, but it's not > necessary. However, when you're explicitly creating a value of type > "bool", then it's necessary. Agreed. > Actually, looking at this again, I now think this part is wrong: > > + /* > + * If only ANALYZE is to be performed, there is no need to include > + * partitions in the list. In a database-wide ANALYZE, we only > + * update the inheritance statistics of partitioned tables, not > + * the statistics of individual partitions. > + */ > + if (!is_vacuum && classForm->relispartition) > continue; > > I was thinking earlier that an ANALYZE on the parent would also update > the statistics for each child, but now I see that's not so. So now I Yep, the patch enables ANALYZE to be propagated to partitions when the parent table is specified in the command. The above logic in the patch made the database-wide ANALYZE to ignore partitions, in which case, only the inheritance statistics would be updated. I can also see why that'd be undesirable. > think we should omit this logic (and change the documentation to > match). That is, a database-wide ANALYZE should update the statistics > for each child as well as for the parent. Otherwise direct queries > against the children (and partitionwise joins, once we have that) are > going to go haywire. OK, done. I updated both analyze.sgml and vacuum.sgml to be more up to date. Both pages previously omitted materialized views. Attached updated patches. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Thu, Mar 2, 2017 at 3:52 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> think we should omit this logic (and change the documentation to >> match). That is, a database-wide ANALYZE should update the statistics >> for each child as well as for the parent. Otherwise direct queries >> against the children (and partitionwise joins, once we have that) are >> going to go haywire. > > OK, done. I updated both analyze.sgml and vacuum.sgml to be more up to > date. Both pages previously omitted materialized views. > > Attached updated patches. Thanks, committed 0001 with a slight change to the wording. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2017/03/02 21:48, Robert Haas wrote: > On Thu, Mar 2, 2017 at 3:52 PM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>> think we should omit this logic (and change the documentation to >>> match). That is, a database-wide ANALYZE should update the statistics >>> for each child as well as for the parent. Otherwise direct queries >>> against the children (and partitionwise joins, once we have that) are >>> going to go haywire. >> >> OK, done. I updated both analyze.sgml and vacuum.sgml to be more up to >> date. Both pages previously omitted materialized views. >> >> Attached updated patches. > > Thanks, committed 0001 with a slight change to the wording. Thanks. I noticed that 'and' is duplicated in a line added by the commit to analyze.sgml. Attached 0001 fixes that. 0002 and 0003 same as the last version. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Fri, Mar 3, 2017 at 10:02 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Thanks. I noticed that 'and' is duplicated in a line added by the commit > to analyze.sgml. Attached 0001 fixes that. 0002 and 0003 same as the > last version. /* - * If all the children were temp tables, pretend it's a non-inheritance - * situation. The duplicate RTE we added for the parent table is - * harmless, so we don't bother to get rid of it; ditto for the useless - * PlanRowMark node. + * If all the children were temp tables or if the parent is a partitioned + * table without any leaf partitions, pretend it's a non-inheritance + * situation. The duplicate RTE for the parent table we added in the + * non-partitioned table case is harmless, so we don't bother to get rid + * of it; ditto for the useless PlanRowMark node. */ - if (list_length(appinfos) < 2) + if (!has_child) This comment is not completely correct. Children can be temp tables, they just cannot be temp tables of other backends. It seems to me that you could still keep this code simple and remove has_child.. @@ -932,7 +932,6 @@ extractRelOptions(HeapTuple tuple, TupleDesc tupdesc, case RELKIND_RELATION: case RELKIND_TOASTVALUE: case RELKIND_MATVIEW: - case RELKIND_PARTITIONED_TABLE: options = heap_reloptions(classForm->relkind, datum, false); break; Partitioned tables cannot have reloptions? What about all the autovacuum_* parameters then? Those are mainly not storage-related. -- Michael
Thanks for the review. On 2017/03/06 15:41, Michael Paquier wrote: > On Fri, Mar 3, 2017 at 10:02 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Thanks. I noticed that 'and' is duplicated in a line added by the commit >> to analyze.sgml. Attached 0001 fixes that. 0002 and 0003 same as the >> last version. > > /* > - * If all the children were temp tables, pretend it's a non-inheritance > - * situation. The duplicate RTE we added for the parent table is > - * harmless, so we don't bother to get rid of it; ditto for the useless > - * PlanRowMark node. > + * If all the children were temp tables or if the parent is a partitioned > + * table without any leaf partitions, pretend it's a non-inheritance > + * situation. The duplicate RTE for the parent table we added in the > + * non-partitioned table case is harmless, so we don't bother to get rid > + * of it; ditto for the useless PlanRowMark node. > */ > - if (list_length(appinfos) < 2) > + if (!has_child) > This comment is not completely correct. Children can be temp tables, > they just cannot be temp tables of other backends. It seems to me that > you could still keep this code simple and remove has_child.. I updated the comment. I recall having posted a patch for that once, but perhaps went unnoticed. About has_child, the other option is to make the minimum length of appinfos list relkind-based, but the condition quickly becomes ugly. Do you have a suggestion? > @@ -932,7 +932,6 @@ extractRelOptions(HeapTuple tuple, TupleDesc tupdesc, > case RELKIND_RELATION: > case RELKIND_TOASTVALUE: > case RELKIND_MATVIEW: > - case RELKIND_PARTITIONED_TABLE: > options = heap_reloptions(classForm->relkind, datum, false); > break; > Partitioned tables cannot have reloptions? What about all the > autovacuum_* parameters then? Those are mainly not storage-related. AFAIK, none of the heap reloptions will be applicable to partitioned table relations once we eliminate storage. About autovacuum_* parameters - we currently don't handle partitioned tables in autovacuum.c, because no statistics are reported for them. That is, relation_needs_vacanalyze() will never return true for dovacuum, doanalyze and wraparound if it is passed a RELKIND_PARTITIONED_TABLE relation. That's something to be fixed separately though. When we add autovacuum support for partitioned tables, we may want to add a new set of reloptions (new because partitioned tables still won't support all options returned by heap_reloptions()). Am I missing something? Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Mon, Mar 6, 2017 at 12:48 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Thanks for the review. > > On 2017/03/06 15:41, Michael Paquier wrote: >> On Fri, Mar 3, 2017 at 10:02 AM, Amit Langote >> <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>> Thanks. I noticed that 'and' is duplicated in a line added by the commit >>> to analyze.sgml. Attached 0001 fixes that. 0002 and 0003 same as the >>> last version. >> >> /* >> - * If all the children were temp tables, pretend it's a non-inheritance >> - * situation. The duplicate RTE we added for the parent table is >> - * harmless, so we don't bother to get rid of it; ditto for the useless >> - * PlanRowMark node. >> + * If all the children were temp tables or if the parent is a partitioned >> + * table without any leaf partitions, pretend it's a non-inheritance >> + * situation. The duplicate RTE for the parent table we added in the >> + * non-partitioned table case is harmless, so we don't bother to get rid >> + * of it; ditto for the useless PlanRowMark node. >> */ >> - if (list_length(appinfos) < 2) >> + if (!has_child) >> This comment is not completely correct. Children can be temp tables, >> they just cannot be temp tables of other backends. It seems to me that >> you could still keep this code simple and remove has_child.. > > I updated the comment. I recall having posted a patch for that once, but > perhaps went unnoticed. The existing comment only specifies "temp tables" and not "temp table of other backends". The new comment keeps that part same and adds partitioned table case. So, I don't see any reason to change the "temp tables" to "temp table of other backends" in this patch. > > About has_child, the other option is to make the minimum length of > appinfos list relkind-based, but the condition quickly becomes ugly. Do > you have a suggestion? > +1. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On 2017/03/06 16:49, Ashutosh Bapat wrote: > On Mon, Mar 6, 2017 at 12:48 PM, Amit Langote wrote: >> On 2017/03/06 15:41, Michael Paquier wrote: >>> This comment is not completely correct. Children can be temp tables, >>> they just cannot be temp tables of other backends. It seems to me that >>> you could still keep this code simple and remove has_child.. >> >> I updated the comment. I recall having posted a patch for that once, but >> perhaps went unnoticed. > > The existing comment only specifies "temp tables" and not "temp table > of other backends". The new comment keeps that part same and adds > partitioned table case. So, I don't see any reason to change the "temp > tables" to "temp table of other backends" in this patch. Hmm. A separate patch might be fine but why not fix the incorrect part while we are updating the whole comment anyway. Thanks, Amit
On Mon, Mar 6, 2017 at 1:26 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/03/06 16:49, Ashutosh Bapat wrote: >> On Mon, Mar 6, 2017 at 12:48 PM, Amit Langote wrote: >>> On 2017/03/06 15:41, Michael Paquier wrote: >>>> This comment is not completely correct. Children can be temp tables, >>>> they just cannot be temp tables of other backends. It seems to me that >>>> you could still keep this code simple and remove has_child.. >>> >>> I updated the comment. I recall having posted a patch for that once, but >>> perhaps went unnoticed. >> >> The existing comment only specifies "temp tables" and not "temp table >> of other backends". The new comment keeps that part same and adds >> partitioned table case. So, I don't see any reason to change the "temp >> tables" to "temp table of other backends" in this patch. > > Hmm. A separate patch might be fine but why not fix the incorrect part > while we are updating the whole comment anyway. There must be a reason why that comment is written the way it's written. I guess, "from other backends" part of "temp tables" story has been explained just few lines above and the author/s didn't want to repeat it again. There's no point in changing it while we are not touching temp tables handling. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On Mon, Mar 6, 2017 at 4:18 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > About autovacuum_* parameters - we currently don't handle partitioned > tables in autovacuum.c, because no statistics are reported for them. That > is, relation_needs_vacanalyze() will never return true for dovacuum, > doanalyze and wraparound if it is passed a RELKIND_PARTITIONED_TABLE > relation. That's something to be fixed separately though. When we add > autovacuum support for partitioned tables, we may want to add a new set of > reloptions (new because partitioned tables still won't support all options > returned by heap_reloptions()). Am I missing something? OK. I got confused by the fact that settings on parents should super-seed the settings of the children. Or if you want if a value is set on the parent by default it would apply to the child if it has no value set, which is where autovacuum_enabled makes sense even for partitioned tables. Leading to the point that parents could have reloptions, with a new category of the type RELOPT_KIND_PARTITION. Still, it is sensible as well to bypass the parents in autovacuum as well, now that I read it. And the handling is more simple. -- Michael
On 2017/03/06 17:01, Ashutosh Bapat wrote: > On Mon, Mar 6, 2017 at 1:26 PM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> On 2017/03/06 16:49, Ashutosh Bapat wrote: >>> On Mon, Mar 6, 2017 at 12:48 PM, Amit Langote wrote: >>>> On 2017/03/06 15:41, Michael Paquier wrote: >>>>> This comment is not completely correct. Children can be temp tables, >>>>> they just cannot be temp tables of other backends. It seems to me that >>>>> you could still keep this code simple and remove has_child.. >>>> >>>> I updated the comment. I recall having posted a patch for that once, but >>>> perhaps went unnoticed. >>> >>> The existing comment only specifies "temp tables" and not "temp table >>> of other backends". The new comment keeps that part same and adds >>> partitioned table case. So, I don't see any reason to change the "temp >>> tables" to "temp table of other backends" in this patch. >> >> Hmm. A separate patch might be fine but why not fix the incorrect part >> while we are updating the whole comment anyway. > > There must be a reason why that comment is written the way it's > written. I guess, "from other backends" part of "temp tables" story > has been explained just few lines above and the author/s didn't want > to repeat it again. That's perhaps true. I guess it depends on who reads it. Someone might think the comment is incorrect because *not all* temporary child tables are excluded. > There's no point in changing it while we are not > touching temp tables handling. We can leave it for the committer to decide, maybe. Committers often rewrite surrounding comments to improve wording, correcting factual errors, etc. Thanks, Amit
> > We can leave it for the committer to decide, maybe. Committers often > rewrite surrounding comments to improve wording, correcting factual > errors, etc. > Sure. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On 2017/03/06 17:22, Michael Paquier wrote: > On Mon, Mar 6, 2017 at 4:18 PM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> About autovacuum_* parameters - we currently don't handle partitioned >> tables in autovacuum.c, because no statistics are reported for them. That >> is, relation_needs_vacanalyze() will never return true for dovacuum, >> doanalyze and wraparound if it is passed a RELKIND_PARTITIONED_TABLE >> relation. That's something to be fixed separately though. When we add >> autovacuum support for partitioned tables, we may want to add a new set of >> reloptions (new because partitioned tables still won't support all options >> returned by heap_reloptions()). Am I missing something? > > OK. I got confused by the fact that settings on parents should > super-seed the settings of the children. Or if you want if a value is > set on the parent by default it would apply to the child if it has no > value set, which is where autovacuum_enabled makes sense even for > partitioned tables. Hmm, setting autovacuum_enabled on partitioned parent should be made to work after we have fixed autovacuum support for partitioned tables. Using the parent's value as a default for partitions may not be what we'd want eventually. > Leading to the point that parents could have > reloptions, with a new category of the type RELOPT_KIND_PARTITION. > Still, it is sensible as well to bypass the parents in autovacuum as > well, now that I read it. And the handling is more simple. We will need it though, because lack of automatically updated "inheritance" (or whole tree) statistics on partitioned tables is a problem. Thanks, Amit
I see that all the changes by Amit and myself to what was earlier 0003 patch are now part of 0002 patch. The patch looks ready for committer. Some comments about 0003 patch. CREATE TABLE syntax seems to allow specifying reloptions for a partitioned table. But extractRelOptions() and heap_reloptions() seem to be ignoring those. Is that intentional? There are two callers of extractRelOptions(), extract_autovac_opts() and RelationParseRelOptions(). There's an assert in extract_autovac_opts() preventing that function from getting called for partitioned table. RelationParseRelOptions() seems to filter the relations before calling extractRelOptions(). I am wondering whether it's better to leave extractRelOptions() and filter partitioned relations in RelationParseRelOptions(). Do we need similar condition modification in the other caller of i.e. heap_create_init_fork(), ExecuteTruncate()? On Mon, Mar 6, 2017 at 12:48 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Thanks for the review. > > On 2017/03/06 15:41, Michael Paquier wrote: >> On Fri, Mar 3, 2017 at 10:02 AM, Amit Langote >> <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>> Thanks. I noticed that 'and' is duplicated in a line added by the commit >>> to analyze.sgml. Attached 0001 fixes that. 0002 and 0003 same as the >>> last version. >> >> /* >> - * If all the children were temp tables, pretend it's a non-inheritance >> - * situation. The duplicate RTE we added for the parent table is >> - * harmless, so we don't bother to get rid of it; ditto for the useless >> - * PlanRowMark node. >> + * If all the children were temp tables or if the parent is a partitioned >> + * table without any leaf partitions, pretend it's a non-inheritance >> + * situation. The duplicate RTE for the parent table we added in the >> + * non-partitioned table case is harmless, so we don't bother to get rid >> + * of it; ditto for the useless PlanRowMark node. >> */ >> - if (list_length(appinfos) < 2) >> + if (!has_child) >> This comment is not completely correct. Children can be temp tables, >> they just cannot be temp tables of other backends. It seems to me that >> you could still keep this code simple and remove has_child.. > > I updated the comment. I recall having posted a patch for that once, but > perhaps went unnoticed. > > About has_child, the other option is to make the minimum length of > appinfos list relkind-based, but the condition quickly becomes ugly. Do > you have a suggestion? > >> @@ -932,7 +932,6 @@ extractRelOptions(HeapTuple tuple, TupleDesc tupdesc, >> case RELKIND_RELATION: >> case RELKIND_TOASTVALUE: >> case RELKIND_MATVIEW: >> - case RELKIND_PARTITIONED_TABLE: >> options = heap_reloptions(classForm->relkind, datum, false); >> break; >> Partitioned tables cannot have reloptions? What about all the >> autovacuum_* parameters then? Those are mainly not storage-related. > > AFAIK, none of the heap reloptions will be applicable to partitioned table > relations once we eliminate storage. > > About autovacuum_* parameters - we currently don't handle partitioned > tables in autovacuum.c, because no statistics are reported for them. That > is, relation_needs_vacanalyze() will never return true for dovacuum, > doanalyze and wraparound if it is passed a RELKIND_PARTITIONED_TABLE > relation. That's something to be fixed separately though. When we add > autovacuum support for partitioned tables, we may want to add a new set of > reloptions (new because partitioned tables still won't support all options > returned by heap_reloptions()). Am I missing something? > > Thanks, > Amit -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On Thu, Mar 2, 2017 at 8:02 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Thanks. I noticed that 'and' is duplicated in a line added by the commit > to analyze.sgml. Attached 0001 fixes that. 0002 and 0003 same as the > last version. Oh, rats. Thanks for noticing. Committed 0001. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Mar 7, 2017 at 12:11 AM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote: > I see that all the changes by Amit and myself to what was earlier 0003 > patch are now part of 0002 patch. The patch looks ready for committer. Reviewing 0002: This patch seems to have falsified the header comments for expand_inherited_rtentry. I don't quite understand the need for the change to set_rel_size(). The rte->inh case is handled before reaching the new code, and IIUC the !rte->inh case should never happen given the changes to expand_inherited_rtentry. Or am I confused? If not, I think we should just add an Assert() to the "plain relation" case that this is not RELKIND_PARTITIONED_TABLE (with a comment explaining why it can't happen). In inheritance_planner, this comment addition does not seem adequate: + * If the parent is a partitioned table, we already set the nominal + * relation. That basically contradicts the previous paragraph. I think you need to adjust the previous paragraph instead of adding a new one, probably in multiple places. For example, the parenthesized bit now only applies in the non-partitioned case. - rel = mtstate->resultRelInfo->ri_RelationDesc; + nominalRTE = rt_fetch(node->nominalRelation, estate->es_range_table); + nominalRel = heap_open(nominalRTE->relid, NoLock); No lock? Another thing that bothers me about this is that, apparently, the use of nominalRelation is no longer strictly for EXPLAIN, as the comments in plannodes.h/relation.h still claim that it is. I'm not sure how to adjust that exactly; there's not a lot of room in those comments to give all the details. Maybe they should simply say something like /* Parent RT index */ instead of /* Parent RT index for use of EXPLAIN */. But we can't just go around changing the meaning of things without updating the comments accordingly. A related question is whether we should really be using nominalRelation for this or whether there's some other way we should be getting at the parent -- I don't have another idea, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2017/03/08 2:27, Robert Haas wrote: > On Tue, Mar 7, 2017 at 12:11 AM, Ashutosh Bapat > <ashutosh.bapat@enterprisedb.com> wrote: >> I see that all the changes by Amit and myself to what was earlier 0003 >> patch are now part of 0002 patch. The patch looks ready for committer. > > Reviewing 0002: Thanks for the review. > > This patch seems to have falsified the header comments for > expand_inherited_rtentry. Oops, updated. > I don't quite understand the need for the change to set_rel_size(). > The rte->inh case is handled before reaching the new code, and IIUC > the !rte->inh case should never happen given the changes to > expand_inherited_rtentry. Or am I confused? In expand_inherited_rtentry(), patch only changes the rule about the minimum number of child RTEs required to keep rte->inh true. In the traditional case, it is 2 (the table itself and at least one child). For a partitioned table, since we don't want to scan the table itself, that becomes 1 (at least one leaf partition). So, it's still possible for rte->inh to become false if the required minimum is not met, even for partitioned tables. > If not, I think we > should just add an Assert() to the "plain relation" case that this is > not RELKIND_PARTITIONED_TABLE (with a comment explaining why it can't > happen). How about adding Assert(rte->relkind != RELKIND_PARTITIONED_TABLE) at the beginning of the following functions as the updated patch does: set_plain_rel_size() set_tablesample_rel_size() set_plain_rel_pathlist() set_tablesample_rel_pathlist() > In inheritance_planner, this comment addition does not seem adequate: > > + * If the parent is a partitioned table, we already set the nominal > + * relation. > > That basically contradicts the previous paragraph. I think you need > to adjust the previous paragraph instead of adding a new one, probably > in multiple places. For example, the parenthesized bit now only > applies in the non-partitioned case. Hmm, that's true. I rewrote the entire comment. > > - rel = mtstate->resultRelInfo->ri_RelationDesc; > + nominalRTE = rt_fetch(node->nominalRelation, estate->es_range_table); > + nominalRel = heap_open(nominalRTE->relid, NoLock); > > No lock? Hmm, I think I missed that a partitioned parent table would not be locked by the *executor* at all after applying this patch. As of now, InitPlan() takes care of locking all the result relations in the PlannedStmt.resultRelations list. This patch removes partitioned parent(s) from appearing in this list. But that makes me wonder - aren't the locks taken by expand_inherited_rtentry() kept long enough? Why does InitPlan need to acquire them again? I see this comment in expand_inherited_rtentry: * If the parent relation is the query's result relation, then we need * RowExclusiveLock. Otherwise, if it's accessed FOR UPDATE/SHARE, we * need RowShareLock; otherwise AccessShareLock. We can't just grab * AccessShareLock because then the executor would be trying to upgrade * the lock, leading to possible deadlocks. (This code should match the * parser and rewriter.) So, I assumed all the necessary locking is being taken care of here. Anyway, I changed NoLock to RowExclusiveLock here for now, but maybe it's either not necessary or a way should be found to do that in InitPlan along with other result relations. > Another thing that bothers me about this is that, apparently, the use > of nominalRelation is no longer strictly for EXPLAIN, as the comments > in plannodes.h/relation.h still claim that it is. I'm not sure how to > adjust that exactly; there's not a lot of room in those comments to > give all the details. Maybe they should simply say something like /* > Parent RT index */ instead of /* Parent RT index for use of EXPLAIN > */. But we can't just go around changing the meaning of things > without updating the comments accordingly. It seems that this comment and one more need to be updated. The other comment change is in explain.c as follows: * Adds the relid of each referenced RTE to *rels_used. The result controls * which RTEs are assigned aliases by select_rtable_names_for_explain. * This ensures that we don't confusingly assign un-suffixed aliases to RTEs - * that never appear in the EXPLAIN output (such as inheritance parents). + * that never appear in the EXPLAIN output (such as non-partitioned + * inheritance parents). */ static bool ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used) I studied the EXPLAIN code a bit to see whether the problem cited for using inheritance parent RTE as nominalRelation (for example, in comments in inheritance_planner) apply to the partitioned parent RTE case, which it doesn't. The partitioned parent RTE won't appear anywhere else in the plan. So, different aliases for what's the same table/RTE won't happen in the partitioned parent case. > A related question is > whether we should really be using nominalRelation for this or whether > there's some other way we should be getting at the parent -- I don't > have another idea, though. This is just for the ModifyTable node. Instead of adding any new member, I thought nominalRelation would work fine. But as I said above regarding locking, we *may* need to find a different place for the partitioned parent RTE (other than ModifyTable.nominalRelation). Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Wed, Mar 8, 2017 at 5:36 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> I don't quite understand the need for the change to set_rel_size(). >> The rte->inh case is handled before reaching the new code, and IIUC >> the !rte->inh case should never happen given the changes to >> expand_inherited_rtentry. Or am I confused? > > In expand_inherited_rtentry(), patch only changes the rule about the > minimum number of child RTEs required to keep rte->inh true. In the > traditional case, it is 2 (the table itself and at least one child). For > a partitioned table, since we don't want to scan the table itself, that > becomes 1 (at least one leaf partition). > > So, it's still possible for rte->inh to become false if the required > minimum is not met, even for partitioned tables. OK. >> - rel = mtstate->resultRelInfo->ri_RelationDesc; >> + nominalRTE = rt_fetch(node->nominalRelation, estate->es_range_table); >> + nominalRel = heap_open(nominalRTE->relid, NoLock); >> >> No lock? > > Hmm, I think I missed that a partitioned parent table would not be locked > by the *executor* at all after applying this patch. As of now, InitPlan() > takes care of locking all the result relations in the > PlannedStmt.resultRelations list. This patch removes partitioned > parent(s) from appearing in this list. But that makes me wonder - aren't > the locks taken by expand_inherited_rtentry() kept long enough? Why does > InitPlan need to acquire them again? I see this comment in > expand_inherited_rtentry: Parse-time locks, plan-time locks, and execution-time locks are all separate. See the header comments for AcquireRewriteLocks in rewriteHandler.c; think about a view, where the parsed query has been stored and is later rewritten and planned. Similarly, a plan can be reused even after the transaction that created that plan has committed; see AcquireExecutorLocks in plancache.c. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2017/03/08 22:36, Robert Haas wrote: > On Wed, Mar 8, 2017 at 5:36 AM, Amit Langote wrote: >>> - rel = mtstate->resultRelInfo->ri_RelationDesc; >>> + nominalRTE = rt_fetch(node->nominalRelation, estate->es_range_table); >>> + nominalRel = heap_open(nominalRTE->relid, NoLock); >>> >>> No lock? >> >> Hmm, I think I missed that a partitioned parent table would not be locked >> by the *executor* at all after applying this patch. As of now, InitPlan() >> takes care of locking all the result relations in the >> PlannedStmt.resultRelations list. This patch removes partitioned >> parent(s) from appearing in this list. But that makes me wonder - aren't >> the locks taken by expand_inherited_rtentry() kept long enough? Why does >> InitPlan need to acquire them again? I see this comment in >> expand_inherited_rtentry: > > Parse-time locks, plan-time locks, and execution-time locks are all > separate. See the header comments for AcquireRewriteLocks in > rewriteHandler.c; think about a view, where the parsed query has been > stored and is later rewritten and planned. Similarly, a plan can be > reused even after the transaction that created that plan has > committed; see AcquireExecutorLocks in plancache.c. Thanks for those references. I took a step back here and thought a bit more about the implications this patch. It occurred to me that the complete absence of partitioned table RT entries in the plan-tree has more consequences than I originally imagined. I will post an updated patch by Monday latest. Thanks, Amit
On 2017/03/10 17:57, Amit Langote wrote: > On 2017/03/08 22:36, Robert Haas wrote: >> On Wed, Mar 8, 2017 at 5:36 AM, Amit Langote wrote: >>>> - rel = mtstate->resultRelInfo->ri_RelationDesc; >>>> + nominalRTE = rt_fetch(node->nominalRelation, estate->es_range_table); >>>> + nominalRel = heap_open(nominalRTE->relid, NoLock); >>>> >>>> No lock? >>> >>> Hmm, I think I missed that a partitioned parent table would not be locked >>> by the *executor* at all after applying this patch. As of now, InitPlan() >>> takes care of locking all the result relations in the >>> PlannedStmt.resultRelations list. This patch removes partitioned >>> parent(s) from appearing in this list. But that makes me wonder - aren't >>> the locks taken by expand_inherited_rtentry() kept long enough? Why does >>> InitPlan need to acquire them again? I see this comment in >>> expand_inherited_rtentry: >> >> Parse-time locks, plan-time locks, and execution-time locks are all >> separate. See the header comments for AcquireRewriteLocks in >> rewriteHandler.c; think about a view, where the parsed query has been >> stored and is later rewritten and planned. Similarly, a plan can be >> reused even after the transaction that created that plan has >> committed; see AcquireExecutorLocks in plancache.c. > > Thanks for those references. > > I took a step back here and thought a bit more about the implications this > patch. It occurred to me that the complete absence of partitioned table > RT entries in the plan-tree has more consequences than I originally > imagined. I will post an updated patch by Monday latest. Here is the updated patch. Since this patch proposes to avoid creating scan nodes for non-leaf tables in a partition tree, they won't be referenced anywhere in the resulting plan tree. So the executor will not lock those tables in the select/update/delete cases. Insert is different since we lock all tables in the partition tree when setting up tuple-routing in ExecInitModifyTable. Not taking executor locks on the tables means that the cached plans that should be invalidated upon adding/removing a partition somewhere in the partition tree won't be. First I thought that we could remember just the root table RT index using a new Index field partitionRoot in Append, MergeAppend, and ModifyTable nodes and use it to locate and lock the root table during executor node initialization. But soon realized that that's not enough, because it ignores the fact that adding/removing partitions at lower levels does not require taking a lock on the root table; only the immediate parent. So any cached select/update/delete plans won't be invalidated when a new lower-level partition is added/removed, because the immediate parent would not have been added to the query's range table and hence the PlannedStmt.relationOids list. Remember that the latter is used by plancache.c to remember the relations that a given cached plan depends on remaining unchanged. So the patch now adds a list member called partitioned_rels to Append, MergeAppend, and ModifyTable nodes and stores the RT indexes of all the non-leaf tables in a partition tree with root table RT index at the head (note that these RT indexes are of the RTEs added by expand_inherited_rtenrty; also see below). Since the partitioned_rels list is constructed when building paths and must be propagated to the plan nodes, the same field is also present in the corresponding Path nodes. ExecInit* routines for the aforementioned nodes now locate and lock the non-leaf tables using the RT indexes in partitioned_rels. Leaf tables are locked, as before, either by InitPlan (update/delete result relations case) or by ExecInitAppend or ExecInitMergeAppend when initializing the appendplans/mergeplans (select case). The previous proposal was for expand_inherited_rtentry to not create RT entries and AppendRelInfo's for the non-leaf tables, but I think that doesn't work, as I tried to explain above. We need RTEs because that seems to be the only way currently for informing the executor of the non-leaf tables. We need AppendRelInfo's to store the RT indexes of those RTEs for the latter planning steps to collect them in partitioned_rels mentioned above. So with the latest patch, we do create the RT entry and AppendRelInfo for non-leaf tables. AppendRelInfo created in this case is a minimal one; only parent_relid and child_relid are valid. To make the latter planning steps ignore these minimal AppendRelInfo's, every AppendRelInfo is now marked with child_relkind. Only set_append_rel_pathlist() and inheritance_planner() process them to collect the child_relid into the partitioned_rels list to be stored in AppendPath/MergeAppendPath and ModifyTablePath, respectively. Thoughts? Thanks, Amit
On 2017/03/13 19:24, Amit Langote wrote: > On 2017/03/10 17:57, Amit Langote wrote: >> On 2017/03/08 22:36, Robert Haas wrote: >>> On Wed, Mar 8, 2017 at 5:36 AM, Amit Langote wrote: >>>>> - rel = mtstate->resultRelInfo->ri_RelationDesc; >>>>> + nominalRTE = rt_fetch(node->nominalRelation, estate->es_range_table); >>>>> + nominalRel = heap_open(nominalRTE->relid, NoLock); >>>>> >>>>> No lock? >>>> >>>> Hmm, I think I missed that a partitioned parent table would not be locked >>>> by the *executor* at all after applying this patch. As of now, InitPlan() >>>> takes care of locking all the result relations in the >>>> PlannedStmt.resultRelations list. This patch removes partitioned >>>> parent(s) from appearing in this list. But that makes me wonder - aren't >>>> the locks taken by expand_inherited_rtentry() kept long enough? Why does >>>> InitPlan need to acquire them again? I see this comment in >>>> expand_inherited_rtentry: >>> >>> Parse-time locks, plan-time locks, and execution-time locks are all >>> separate. See the header comments for AcquireRewriteLocks in >>> rewriteHandler.c; think about a view, where the parsed query has been >>> stored and is later rewritten and planned. Similarly, a plan can be >>> reused even after the transaction that created that plan has >>> committed; see AcquireExecutorLocks in plancache.c. >> >> Thanks for those references. >> >> I took a step back here and thought a bit more about the implications this >> patch. It occurred to me that the complete absence of partitioned table >> RT entries in the plan-tree has more consequences than I originally >> imagined. I will post an updated patch by Monday latest. > > Here is the updated patch. > > Since this patch proposes to avoid creating scan nodes for non-leaf tables > in a partition tree, they won't be referenced anywhere in the resulting > plan tree. So the executor will not lock those tables in the > select/update/delete cases. Insert is different since we lock all tables > in the partition tree when setting up tuple-routing in > ExecInitModifyTable. Not taking executor locks on the tables means that > the cached plans that should be invalidated upon adding/removing a > partition somewhere in the partition tree won't be. > > First I thought that we could remember just the root table RT index using > a new Index field partitionRoot in Append, MergeAppend, and ModifyTable > nodes and use it to locate and lock the root table during executor node > initialization. But soon realized that that's not enough, because it > ignores the fact that adding/removing partitions at lower levels does not > require taking a lock on the root table; only the immediate parent. So > any cached select/update/delete plans won't be invalidated when a new > lower-level partition is added/removed, because the immediate parent would > not have been added to the query's range table and hence the > PlannedStmt.relationOids list. Remember that the latter is used by > plancache.c to remember the relations that a given cached plan depends on > remaining unchanged. So the patch now adds a list member called > partitioned_rels to Append, MergeAppend, and ModifyTable nodes and stores > the RT indexes of all the non-leaf tables in a partition tree with root > table RT index at the head (note that these RT indexes are of the RTEs > added by expand_inherited_rtenrty; also see below). Since the > partitioned_rels list is constructed when building paths and must be > propagated to the plan nodes, the same field is also present in the > corresponding Path nodes. ExecInit* routines for the aforementioned nodes > now locate and lock the non-leaf tables using the RT indexes in > partitioned_rels. Leaf tables are locked, as before, either by InitPlan > (update/delete result relations case) or by ExecInitAppend or > ExecInitMergeAppend when initializing the appendplans/mergeplans (select > case). > > The previous proposal was for expand_inherited_rtentry to not create RT > entries and AppendRelInfo's for the non-leaf tables, but I think that > doesn't work, as I tried to explain above. We need RTEs because that > seems to be the only way currently for informing the executor of the > non-leaf tables. We need AppendRelInfo's to store the RT indexes of those > RTEs for the latter planning steps to collect them in partitioned_rels > mentioned above. So with the latest patch, we do create the RT entry and > AppendRelInfo for non-leaf tables. AppendRelInfo created in this case is > a minimal one; only parent_relid and child_relid are valid. To make the > latter planning steps ignore these minimal AppendRelInfo's, every > AppendRelInfo is now marked with child_relkind. Only > set_append_rel_pathlist() and inheritance_planner() process them to > collect the child_relid into the partitioned_rels list to be stored in > AppendPath/MergeAppendPath and ModifyTablePath, respectively. Sorry, forgot to attach the patches themselves. Attached this time. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On 2017/03/13 19:30, Amit Langote wrote: >> Here is the updated patch. >> >> Since this patch proposes to avoid creating scan nodes for non-leaf tables >> in a partition tree, they won't be referenced anywhere in the resulting >> plan tree. So the executor will not lock those tables in the >> select/update/delete cases. Insert is different since we lock all tables >> in the partition tree when setting up tuple-routing in >> ExecInitModifyTable. Not taking executor locks on the tables means that >> the cached plans that should be invalidated upon adding/removing a >> partition somewhere in the partition tree won't be. >> >> First I thought that we could remember just the root table RT index using >> a new Index field partitionRoot in Append, MergeAppend, and ModifyTable >> nodes and use it to locate and lock the root table during executor node >> initialization. But soon realized that that's not enough, because it >> ignores the fact that adding/removing partitions at lower levels does not >> require taking a lock on the root table; only the immediate parent. So >> any cached select/update/delete plans won't be invalidated when a new >> lower-level partition is added/removed, because the immediate parent would >> not have been added to the query's range table and hence the >> PlannedStmt.relationOids list. Remember that the latter is used by >> plancache.c to remember the relations that a given cached plan depends on >> remaining unchanged. So the patch now adds a list member called >> partitioned_rels to Append, MergeAppend, and ModifyTable nodes and stores >> the RT indexes of all the non-leaf tables in a partition tree with root >> table RT index at the head (note that these RT indexes are of the RTEs >> added by expand_inherited_rtenrty; also see below). Since the >> partitioned_rels list is constructed when building paths and must be >> propagated to the plan nodes, the same field is also present in the >> corresponding Path nodes. ExecInit* routines for the aforementioned nodes >> now locate and lock the non-leaf tables using the RT indexes in >> partitioned_rels. Leaf tables are locked, as before, either by InitPlan >> (update/delete result relations case) or by ExecInitAppend or >> ExecInitMergeAppend when initializing the appendplans/mergeplans (select >> case). >> >> The previous proposal was for expand_inherited_rtentry to not create RT >> entries and AppendRelInfo's for the non-leaf tables, but I think that >> doesn't work, as I tried to explain above. We need RTEs because that >> seems to be the only way currently for informing the executor of the >> non-leaf tables. We need AppendRelInfo's to store the RT indexes of those >> RTEs for the latter planning steps to collect them in partitioned_rels >> mentioned above. So with the latest patch, we do create the RT entry and >> AppendRelInfo for non-leaf tables. AppendRelInfo created in this case is >> a minimal one; only parent_relid and child_relid are valid. To make the >> latter planning steps ignore these minimal AppendRelInfo's, every >> AppendRelInfo is now marked with child_relkind. Only >> set_append_rel_pathlist() and inheritance_planner() process them to >> collect the child_relid into the partitioned_rels list to be stored in >> AppendPath/MergeAppendPath and ModifyTablePath, respectively. > > Sorry, forgot to attach the patches themselves. Attached this time. I made some changes to 0001: * No need to create RelOptInfo's for partitioned RTEs added by expand_inherited_rtentry * Instead of storing relkind of child tables in AppendRelInfo's, simply store if they are partitioned; IOW, change child_relkind to child_is_partitioned Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Mon, Mar 13, 2017 at 6:24 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > The previous proposal was for expand_inherited_rtentry to not create RT > entries and AppendRelInfo's for the non-leaf tables, but I think that > doesn't work, as I tried to explain above. We need RTEs because that > seems to be the only way currently for informing the executor of the > non-leaf tables. We need AppendRelInfo's to store the RT indexes of those > RTEs for the latter planning steps to collect them in partitioned_rels > mentioned above. So with the latest patch, we do create the RT entry and > AppendRelInfo for non-leaf tables. AppendRelInfo created in this case is > a minimal one; only parent_relid and child_relid are valid. To make the > latter planning steps ignore these minimal AppendRelInfo's, every > AppendRelInfo is now marked with child_relkind. Only > set_append_rel_pathlist() and inheritance_planner() process them to > collect the child_relid into the partitioned_rels list to be stored in > AppendPath/MergeAppendPath and ModifyTablePath, respectively. I see your point, but I still think this kind of stinks. You've got all kinds of logic that is now conditional on child_is_partitioned, and that seems like a recipe for bugs and future maintenance difficulties. It would be much nicer if we could come up with a design that doesn't create the AppendRelInfo in the first place, because then all of that stuff could just work. Can we get away with creating an RTE for each partitioned table (other than the parent, perhaps; for that one it would be nice to use the inh-flagged RTE we already have) but NOT creating an AppendRelInfo to go with it? If we've got the list of RTIs for the new RTEs associated with the append path in some other form, can't we get by without also having an AppendRelInfo to hold onto that translation? The comments in InitPlan() explain why locks on result relations are taken in that function directly rather than during the ExecInitNode pass over the tree; it's because we need to make sure we take the strongest lock on any given relation first. But the changes in ExecInitAppend and ExecInitMergeAppend are problematic in that regard; some AccessShareLocks may already have been taken, and now we're taking locks that in some case may be RowShareLock, which could cause a lock upgrade. Remember that there's no reason that you couldn't join a table to one of its own partitions, or something like that. I think you need to try to jigger things so that InitPlan() takes all locks stronger than AccessShareLock that are required anywhere in the query, and then other nodes can take anything at AccessShareLock that they need. I think that eliding the Append node when there's only one child may be unsafe in the case where the child's attribute numbers are different from the parent's attribute numbers. I remember Tom making some comment about this when I was working on MergeAppend, although I no longer remember the specific details. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Mar 15, 2017 at 3:39 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Mar 13, 2017 at 6:24 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> The previous proposal was for expand_inherited_rtentry to not create RT >> entries and AppendRelInfo's for the non-leaf tables, but I think that >> doesn't work, as I tried to explain above. We need RTEs because that >> seems to be the only way currently for informing the executor of the >> non-leaf tables. We need AppendRelInfo's to store the RT indexes of those >> RTEs for the latter planning steps to collect them in partitioned_rels >> mentioned above. So with the latest patch, we do create the RT entry and >> AppendRelInfo for non-leaf tables. AppendRelInfo created in this case is >> a minimal one; only parent_relid and child_relid are valid. To make the >> latter planning steps ignore these minimal AppendRelInfo's, every >> AppendRelInfo is now marked with child_relkind. Only >> set_append_rel_pathlist() and inheritance_planner() process them to >> collect the child_relid into the partitioned_rels list to be stored in >> AppendPath/MergeAppendPath and ModifyTablePath, respectively. > > I see your point, but I still think this kind of stinks. You've got > all kinds of logic that is now conditional on child_is_partitioned, > and that seems like a recipe for bugs and future maintenance > difficulties. It would be much nicer if we could come up with a > design that doesn't create the AppendRelInfo in the first place, > because then all of that stuff could just work. Can we get away with > creating an RTE for each partitioned table (other than the parent, > perhaps; for that one it would be nice to use the inh-flagged RTE we > already have) but NOT creating an AppendRelInfo to go with it? If > we've got the list of RTIs for the new RTEs associated with the append > path in some other form, can't we get by without also having an > AppendRelInfo to hold onto that translation? > Will it help to retain the partition hierarchy as inheritance hierarchy and then collapse it while creating append paths. That will be needed by partition-wise join, will be helpful in partition pruning without using constraints and so on. So, may be could use that infrastructure to simplify the logic here. The patch is available as 0013 in [1]. [1] CAFjFpRfqotRR6cM3sooBHMHEVdkFfAZ6PyYg4GRZsoMuW08HjQ@mail.gmail.com -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On 2017/03/15 13:38, Ashutosh Bapat wrote: > On Wed, Mar 15, 2017 at 3:39 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Mon, Mar 13, 2017 at 6:24 AM, Amit Langote >> <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>> The previous proposal was for expand_inherited_rtentry to not create RT >>> entries and AppendRelInfo's for the non-leaf tables, but I think that >>> doesn't work, as I tried to explain above. We need RTEs because that >>> seems to be the only way currently for informing the executor of the >>> non-leaf tables. We need AppendRelInfo's to store the RT indexes of those >>> RTEs for the latter planning steps to collect them in partitioned_rels >>> mentioned above. So with the latest patch, we do create the RT entry and >>> AppendRelInfo for non-leaf tables. AppendRelInfo created in this case is >>> a minimal one; only parent_relid and child_relid are valid. To make the >>> latter planning steps ignore these minimal AppendRelInfo's, every >>> AppendRelInfo is now marked with child_relkind. Only >>> set_append_rel_pathlist() and inheritance_planner() process them to >>> collect the child_relid into the partitioned_rels list to be stored in >>> AppendPath/MergeAppendPath and ModifyTablePath, respectively. >> >> I see your point, but I still think this kind of stinks. You've got >> all kinds of logic that is now conditional on child_is_partitioned, >> and that seems like a recipe for bugs and future maintenance >> difficulties. It would be much nicer if we could come up with a >> design that doesn't create the AppendRelInfo in the first place, >> because then all of that stuff could just work. Can we get away with >> creating an RTE for each partitioned table (other than the parent, >> perhaps; for that one it would be nice to use the inh-flagged RTE we >> already have) but NOT creating an AppendRelInfo to go with it? If >> we've got the list of RTIs for the new RTEs associated with the append >> path in some other form, can't we get by without also having an >> AppendRelInfo to hold onto that translation? >> > > Will it help to retain the partition hierarchy as inheritance > hierarchy and then collapse it while creating append paths. That will > be needed by partition-wise join, will be helpful in partition pruning > without using constraints and so on. So, may be could use that > infrastructure to simplify the logic here. The patch is available as > 0013 in [1]. > > [1] CAFjFpRfqotRR6cM3sooBHMHEVdkFfAZ6PyYg4GRZsoMuW08HjQ@mail.gmail.com IMHO, it would be better to keep those patches separate because the problems being solved are different. By the way, one of the reasons that patch (as I had written it) was skipped was because it didn't cover the inheritance_planner() case [1]. Your patch 0013 at the link should be updated (maybe I should report on the partitionwise joins thread as well) in some way to handle the update/delete case, because this happens: create table p (a int, b char) partition by list (a); create table p1 partition of p for values in (1) partition by list (b); create table p1a partition of p1 for values in ('a'); create table p2 partition of p for values in (2); explain (costs off) update p set a = a, b = 'b'; QUERY PLAN -----------------------------------Update on p Update on p Update on p1 p Update on p2 -> Seq Scan on p -> Result -> Append -> Seq Scan on p1 -> Seq Scan on p1a -> Seq Scan on p2 (10 rows) update p set a = a, b = 'b'; server closed the connection unexpectedlyThis probably means the server terminated abnormallybeforeor while processing the request. The connection to the server was lost. Attempting reset: Failed. Thanks, Amit [1] https://www.postgresql.org/message-id/CA%2BTgmobMy%3DrqM%3DMTN_FUEfD-PiWSCSonH%2BZ1_SjL6ZmQ2GGz1w%40mail.gmail.com
>> >> Will it help to retain the partition hierarchy as inheritance >> hierarchy and then collapse it while creating append paths. That will >> be needed by partition-wise join, will be helpful in partition pruning >> without using constraints and so on. So, may be could use that >> infrastructure to simplify the logic here. The patch is available as >> 0013 in [1]. >> >> [1] CAFjFpRfqotRR6cM3sooBHMHEVdkFfAZ6PyYg4GRZsoMuW08HjQ@mail.gmail.com > > IMHO, it would be better to keep those patches separate because the > problems being solved are different. By the way, one of the reasons that > patch (as I had written it) was skipped was because it didn't cover the > inheritance_planner() case [1]. Your patch 0013 at the link should be > updated (maybe I should report on the partitionwise joins thread as well) > in some way to handle the update/delete case, because this happens: > > create table p (a int, b char) partition by list (a); > create table p1 partition of p for values in (1) partition by list (b); > create table p1a partition of p1 for values in ('a'); > create table p2 partition of p for values in (2); > > explain (costs off) update p set a = a, b = 'b'; > QUERY PLAN > ----------------------------------- > Update on p > Update on p > Update on p1 p > Update on p2 > -> Seq Scan on p > -> Result > -> Append > -> Seq Scan on p1 > -> Seq Scan on p1a > -> Seq Scan on p2 > (10 rows) > > update p set a = a, b = 'b'; > server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > The connection to the server was lost. Attempting reset: Failed. Thanks for pointing that out. I am able to reproduce the crash. I think, we will need to teach it to add the indirect children as well. Looks like we are missing a testcase for this scenario. I had run regression with that patch, and didn't catch any failures and crashes. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On 2017/03/15 7:09, Robert Haas wrote: > On Mon, Mar 13, 2017 at 6:24 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> The previous proposal was for expand_inherited_rtentry to not create RT >> entries and AppendRelInfo's for the non-leaf tables, but I think that >> doesn't work, as I tried to explain above. We need RTEs because that >> seems to be the only way currently for informing the executor of the >> non-leaf tables. We need AppendRelInfo's to store the RT indexes of those >> RTEs for the latter planning steps to collect them in partitioned_rels >> mentioned above. So with the latest patch, we do create the RT entry and >> AppendRelInfo for non-leaf tables. AppendRelInfo created in this case is >> a minimal one; only parent_relid and child_relid are valid. To make the >> latter planning steps ignore these minimal AppendRelInfo's, every >> AppendRelInfo is now marked with child_relkind. Only >> set_append_rel_pathlist() and inheritance_planner() process them to >> collect the child_relid into the partitioned_rels list to be stored in >> AppendPath/MergeAppendPath and ModifyTablePath, respectively. > > I see your point, but I still think this kind of stinks. You've got > all kinds of logic that is now conditional on child_is_partitioned, > and that seems like a recipe for bugs and future maintenance > difficulties. It would be much nicer if we could come up with a > design that doesn't create the AppendRelInfo in the first place, > because then all of that stuff could just work. Can we get away with > creating an RTE for each partitioned table (other than the parent, > perhaps; for that one it would be nice to use the inh-flagged RTE we > already have) but NOT creating an AppendRelInfo to go with it? If > we've got the list of RTIs for the new RTEs associated with the append > path in some other form, can't we get by without also having an > AppendRelInfo to hold onto that translation? I think we'll need to store *somewhere* the mapping of which inh=false partitioned table RTE is the child of which inh=true (IOW, parent) partitioned table RTE. I've come to think that AppendRelInfos, although contain extraneous information that won't be used, are better than inventing something new altogether for time being. AppendRelInfos are referred to a few times by query_planner() steps before we eventually get to either set_append_rel_pathlist() or inheritance_planner(), so not changing that approach seems less worrisome for now. So now if we both create child RTEs and AppendRelInfos for the partitioned tables, we don't need to change expand_inherited_rtentry() at all with this patch. Finally, set_append_rel_size/pathlist() and inheritance_planner() skip the child partitioned table RTEs, because no path/plan need to be created. We can do away with having to create RelOptInfos for child partitioned table RTEs, which I found to be not that invasive. We seem to agree that we need to carry the partitioned table RT indexes over to the executor somehow so that they are locked properly during execution. To that end, Append, MergeAppend, and ModifyTable (and the corresponding path nodes) nodes will contain a list of those RT indexes in a field called partitioned_rels. In the result relation case, the list from ModifyTable is copied to a field in PlannerGlobal called nonleafResultRelations (next to the old resultRelations) and further to a field of the same name in PlannedStmt. I considered if it can be done without a new field, but realized that the old resultRelations is tied closely with how the EState manages result relations. InitPlan() and AcquireExecutorLocks() lock the tables denoted by the RT indexes in PlannedStmt.nonleafResultRelations using RowExclusiveLock, just like they would those in PlannedStmt.resultRelations. Since expand_inherited_rtentries() adds PlanRowMarks for partitioned tables just like previously, InitPlan() and AcquireExecutorLocks() will lock them with RowShareLock if needed. Finally, ExecInitAppend() and ExecInitMergeAppend() must acquire AccessShareLock on the partition tables, if not already locked by InitPlan(). > The comments in InitPlan() explain why locks on result relations are > taken in that function directly rather than during the ExecInitNode > pass over the tree; it's because we need to make sure we take the > strongest lock on any given relation first. But the changes in > ExecInitAppend and ExecInitMergeAppend are problematic in that regard; > some AccessShareLocks may already have been taken, and now we're > taking locks that in some case may be RowShareLock, which could cause > a lock upgrade. Remember that there's no reason that you couldn't > join a table to one of its own partitions, or something like that. I > think you need to try to jigger things so that InitPlan() takes all > locks stronger than AccessShareLock that are required anywhere in the > query, and then other nodes can take anything at AccessShareLock that > they need. Thanks for the explanation. If the scheme in the new patch that I described above sounds OK, I think it takes care of locking the partitioned tables without the upgrade hazards. > I think that eliding the Append node when there's only one child may > be unsafe in the case where the child's attribute numbers are > different from the parent's attribute numbers. I remember Tom making > some comment about this when I was working on MergeAppend, although I > no longer remember the specific details. Append node elision does not occur in the one-child case. With the patch: create table q (a int) partition by list (a); explain select * from q; QUERY PLAN ------------------------------------------ Result (cost=0.00..0.00 rows=0 width=4) One-Time Filter: false (2 rows) create table q1 partition of q for values in (1); explain select * from q; QUERY PLAN ------------------------------------------------------------ Append (cost=0.00..35.50 rows=2550 width=4) -> Seq Scan on q1 (cost=0.00..35.50 rows=2550 width=4) (2 rows) Maybe that should be done, but this patch doesn't implement that. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Thu, Mar 16, 2017 at 6:03 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > I think we'll need to store *somewhere* the mapping of which inh=false > partitioned table RTE is the child of which inh=true (IOW, parent) > partitioned table RTE. I mean, for the children you're going to scan, that seems to be necessary so that you can do things like translate targetlists to use the correct varno. But for the children you're not going to scan, well, you need to know which ones they are so you can lock them, but do you really need the parent-child mappings? Or just a list of which ones there are? > I've come to think that AppendRelInfos, although > contain extraneous information that won't be used, are better than > inventing something new altogether for time being. AppendRelInfos are > referred to a few times by query_planner() steps before we eventually get > to either set_append_rel_pathlist() or inheritance_planner(), so not > changing that approach seems less worrisome for now. So now if we both > create child RTEs and AppendRelInfos for the partitioned tables, we don't > need to change expand_inherited_rtentry() at all with this patch. > Finally, set_append_rel_size/pathlist() and inheritance_planner() skip the > child partitioned table RTEs, because no path/plan need to be created. We > can do away with having to create RelOptInfos for child partitioned table > RTEs, which I found to be not that invasive. Yes, but on the flip side, you're having to add code in a lot of places -- I think I counted 7 -- where you turn around and ignore those AppendRelInfos. That's a lot; how do we know we've got them all? I'm not sure what the patch would look like the other way, but I'm hoping that you could just keep the list of partitioned table RTIs someplace that mostly gets ignored, and then all of that special handling could be ripped out. > Append node elision does not occur in the one-child case. With the patch: Oh, OK. Somehow the commit message you included led me to the contrary conclusion. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2017/03/16 22:16, Robert Haas wrote: > On Thu, Mar 16, 2017 at 6:03 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> I think we'll need to store *somewhere* the mapping of which inh=false >> partitioned table RTE is the child of which inh=true (IOW, parent) >> partitioned table RTE. > > I mean, for the children you're going to scan, that seems to be > necessary so that you can do things like translate targetlists to use > the correct varno. But for the children you're not going to scan, > well, you need to know which ones they are so you can lock them, but > do you really need the parent-child mappings? Or just a list of which > ones there are? I thought we'd want to keep the RT indexes segregated per controlling node, which is why we'd want to store the mapping somehow. There may be multiple partitioned tables in a query and one of them might be the result relation. If we propagate all of them in one list to the executor, InitPlan wouldn't know which ones are target relations, for example. But I guess you don't mean passing all of them in one big list to the executor. >> I've come to think that AppendRelInfos, although >> contain extraneous information that won't be used, are better than >> inventing something new altogether for time being. AppendRelInfos are >> referred to a few times by query_planner() steps before we eventually get >> to either set_append_rel_pathlist() or inheritance_planner(), so not >> changing that approach seems less worrisome for now. So now if we both >> create child RTEs and AppendRelInfos for the partitioned tables, we don't >> need to change expand_inherited_rtentry() at all with this patch. >> Finally, set_append_rel_size/pathlist() and inheritance_planner() skip the >> child partitioned table RTEs, because no path/plan need to be created. We >> can do away with having to create RelOptInfos for child partitioned table >> RTEs, which I found to be not that invasive. > > Yes, but on the flip side, you're having to add code in a lot of > places -- I think I counted 7 -- where you turn around and ignore > those AppendRelInfos. Perhaps you were looking at the previous version with "minimal" appinfos containing the child_is_partitioned field? Following are the instances in the latest patch posted yesterday: - In set_append_re_size, do not count the partitioned child tables - In set_append_rel_pathlist, do not create paths, but add to the list of the partitioned tables to be included in the final Append path - In create_lateral_join_info(), no need to propagate lateral join info to partitioned child tables - In inheritance_planner, no need to create plans for partitioned child tables, but add to the list of the partitioned tables to be included in the final ModifyTable path In the previous version, because AppendRelInfo structure was itself modified, there were more instances because more places would need to know about the new kinds of appinfos (the minimal ones), which isn't the case with the latest patch. Although admittedly, there will be a lot of useless manipulation of those appinfos whose only purpose is to carry the parent-child relid mapping up to where we finally consume that information. Anyway, I've tried to implement an approach that doesn't create AppendRelInfos for partitioned child tables as described below. > That's a lot; how do we know we've got them > all? I'm not sure what the patch would look like the other way, but > I'm hoping that you could just keep the list of partitioned table RTIs > someplace that mostly gets ignored, and then all of that special > handling could be ripped out. Since we'll want to save the RT indexes when we create child RTEs, that is in expand_inherited_rtentry(), we'd do use a data structure that parallels root->append_rel_list. The structure might consist of a the parent_relid and a list of partitioned child RT indexes. Since we build this information early during subquery_planner() it might equally be subject to errors of omission, although that somehow sounds less likely. I've tried to implement something like that in the attached updated 0001. The mapping structure PartitionedChildRelInfo (a Node in fact) consists of parent_relid and the list of partitioned child RT indexes. A global list of these nodes is maintained in the root PlannerInfo called pcinfo_list. In set_append_rel_pathlist() and inheritance_planner(), we search the aforementioned list and copy the correct list to Append/MergeAppend/ModifyTable path created for the parent. Thoughts? I also noticed that PlanRowMarks that are created by expand_inherited_rtentry() for partitioned child tables are problematic - they must somehow be ignored in the executor. One way to do that is to set isParent (making it a dummy entry) when initializing them, which is what the patch does. Comment is updated to say that just like inheritance parent relations, PlanRowMarks for partitioned child tables are also marked as dummy. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Fri, Mar 17, 2017 at 4:57 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Yes, but on the flip side, you're having to add code in a lot of >> places -- I think I counted 7 -- where you turn around and ignore >> those AppendRelInfos. > > Perhaps you were looking at the previous version with "minimal" appinfos > containing the child_is_partitioned field? Yes, I think I was. I think this version looks a lot better. /* + * Close the root partitioned rel if we opened it above, but keep the + * lock. + */ + if (rel != mtstate->resultRelInfo->ri_RelationDesc) + heap_close(rel, NoLock); We didn't take a lock above, though, so drop everything in the comment from "but" onward. - add_paths_to_append_rel(root, rel, live_childrels); + add_paths_to_append_rel(root, rel, live_childrels, partitioned_rels); I think it would make more sense to put the new logic into add_paths_to_append_rel, instead of passing this down as an additional parameter. + * do not appear anywhere else in the plan. Situation is exactly the The situation is... + if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE) + { + foreach(lc, root->pcinfo_list) + { + PartitionedChildRelInfo *pc = lfirst(lc); + + if (pc->parent_relid == parentRTindex) + { + partitioned_rels = pc->child_rels; + break; + } + } + } You seem to have a few copies of this logic. I think it would be worth factoring it out into a separate function. + root->glob->nonleafResultRelations = + list_concat(root->glob->nonleafResultRelations, + list_copy(splan->partitioned_rels)); Please add a brief comment. One line is fine. + newrc->isParent = childrte->relkind == RELKIND_PARTITIONED_TABLE; I'm not sure what project style is, but I personally find these kinds of assignments easier to read with an extra set of parantheses: newrc->isParent = (childrte->relkind == RELKIND_PARTITIONED_TABLE); + if (partitioned_rels == NIL) + return; + + foreach(lc, partitioned_rels) I think the if-test is pointless; the foreach loop is going to start by comparing the initial value with NIL. Why doesn't ExecSerializePlan() need to transfer a proper value for nonleafResultRelations to workers? Seems like it should. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2017/03/21 1:16, Robert Haas wrote: > On Fri, Mar 17, 2017 at 4:57 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>> Yes, but on the flip side, you're having to add code in a lot of >>> places -- I think I counted 7 -- where you turn around and ignore >>> those AppendRelInfos. >> >> Perhaps you were looking at the previous version with "minimal" appinfos >> containing the child_is_partitioned field? > > Yes, I think I was. I think this version looks a lot better. Just to clarify, I assume you reviewed the latest version which does not create AppendRelInfos, but instead creates PartitionedChildRelInfos (as also evident from your comments below). Sorry about the confusion. > /* > + * Close the root partitioned rel if we opened it above, but keep the > + * lock. > + */ > + if (rel != mtstate->resultRelInfo->ri_RelationDesc) > + heap_close(rel, NoLock); > > We didn't take a lock above, though, so drop everything in the comment > from "but" onward. Oh, right. > - add_paths_to_append_rel(root, rel, live_childrels); > + add_paths_to_append_rel(root, rel, live_childrels, partitioned_rels); > > I think it would make more sense to put the new logic into > add_paths_to_append_rel, instead of passing this down as an additional > parameter. OK, done. > + * do not appear anywhere else in the plan. Situation is exactly the > > The situation is... Fixed. > > + if (parent_rte->relkind == RELKIND_PARTITIONED_TABLE) > + { > + foreach(lc, root->pcinfo_list) > + { > + PartitionedChildRelInfo *pc = lfirst(lc); > + > + if (pc->parent_relid == parentRTindex) > + { > + partitioned_rels = pc->child_rels; > + break; > + } > + } > + } > > You seem to have a few copies of this logic. I think it would be > worth factoring it out into a separate function. OK, done. Put the definition in in planner.c > + root->glob->nonleafResultRelations = > + list_concat(root->glob->nonleafResultRelations, > + list_copy(splan->partitioned_rels)); > > Please add a brief comment. One line is fine. Done. > > + newrc->isParent = childrte->relkind == RELKIND_PARTITIONED_TABLE; > > I'm not sure what project style is, but I personally find these kinds > of assignments easier to read with an extra set of parantheses: > > newrc->isParent = (childrte->relkind == RELKIND_PARTITIONED_TABLE); Ah, done. > > + if (partitioned_rels == NIL) > + return; > + > + foreach(lc, partitioned_rels) > > I think the if-test is pointless; the foreach loop is going to start > by comparing the initial value with NIL. Right, fixed. > Why doesn't ExecSerializePlan() need to transfer a proper value for > nonleafResultRelations to workers? Seems like it should. It doesn't transfer resultRelations either, presumably because workers do not handle result relations yet. Also, both resultRelations and nonleafResultRelations are set *only* if there is a ModifyTable node. Attached updated patches. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Tue, Mar 21, 2017 at 5:05 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Attached updated patches. Committed 0001 after removing a comma. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 16 March 2017 at 10:03, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/03/15 7:09, Robert Haas wrote: >> I think that eliding the Append node when there's only one child may >> be unsafe in the case where the child's attribute numbers are >> different from the parent's attribute numbers. I remember Tom making >> some comment about this when I was working on MergeAppend, although I >> no longer remember the specific details. > > Append node elision does not occur in the one-child case. With the patch: ... > create table q1 partition of q for values in (1); > explain select * from q; > QUERY PLAN > ------------------------------------------------------------ > Append (cost=0.00..35.50 rows=2550 width=4) > -> Seq Scan on q1 (cost=0.00..35.50 rows=2550 width=4) > (2 rows) > > Maybe that should be done, but this patch doesn't implement that. Robert raises the possible problem that removing the Append wouldn't work when the parent and child attribute numbers don't match. Surely that never happens with partitions, by definition? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Mar 21, 2017 at 12:19 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 16 March 2017 at 10:03, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> On 2017/03/15 7:09, Robert Haas wrote: > >>> I think that eliding the Append node when there's only one child may >>> be unsafe in the case where the child's attribute numbers are >>> different from the parent's attribute numbers. I remember Tom making >>> some comment about this when I was working on MergeAppend, although I >>> no longer remember the specific details. >> >> Append node elision does not occur in the one-child case. With the patch: > ... >> create table q1 partition of q for values in (1); >> explain select * from q; >> QUERY PLAN >> ------------------------------------------------------------ >> Append (cost=0.00..35.50 rows=2550 width=4) >> -> Seq Scan on q1 (cost=0.00..35.50 rows=2550 width=4) >> (2 rows) >> >> Maybe that should be done, but this patch doesn't implement that. > > Robert raises the possible problem that removing the Append wouldn't > work when the parent and child attribute numbers don't match. Surely > that never happens with partitions, by definition? No, the attribute numbers don't have to match. This decision was made a long time ago, and there have been a whole bunch of followup commits since the original partitioning patch that were dedicated to fixing up cases where that wasn't working properly in the original commit. It seems superficially attractive to require the attribute numbers to match, but it creates some really unpleasant cases. For example, suppose a user tries to creates a partitioned table, drops a column, then creates a standalone table which matches the apparent column list of the partitioned table, then tries to attach it as a partition. ERROR: the columns you previously dropped from the parent that you can't see and don't know about aren't the same as the ones dropped from the standalone table you're trying to attach as a partition DETAIL: Try recreating your proposed new partition with a pass-by-value column of width 8 after the third column, and then dropping that column before trying to attach it as a partition. HINT: Abandon all hope, ye who enter here. Not cool with that. The decision not to require the attribute numbers to match doesn't necessarily mean we can't get rid of the Append node, though. First of all, in a lot of practical cases the attribute numbers will all match. Second, if they don't, the most that would be required is a projection step, which could usually be done without a separate node because most nodes are projection-capable. And maybe not even that much is needed; I'd have to go back and look at what Tom was worried about the last time this came up. (Hmm, maybe the problem had to do with varnos matching, rather then attribute numbers?) Another and independent problem with eliding the Append node is that, if we did that, we'd still have to guarantee that the parent relation corresponding to the Append node got locked somehow. Otherwise, we'll be accessing the tuple routing information for a table on which we don't have a lock. That's probably a solvable problem, too, but it hasn't been solved yet. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 21 March 2017 at 16:33, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Mar 21, 2017 at 12:19 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On 16 March 2017 at 10:03, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>> On 2017/03/15 7:09, Robert Haas wrote: >> >>>> I think that eliding the Append node when there's only one child may >>>> be unsafe in the case where the child's attribute numbers are >>>> different from the parent's attribute numbers. I remember Tom making >>>> some comment about this when I was working on MergeAppend, although I >>>> no longer remember the specific details. >>> >>> Append node elision does not occur in the one-child case. With the patch: >> ... >>> create table q1 partition of q for values in (1); >>> explain select * from q; >>> QUERY PLAN >>> ------------------------------------------------------------ >>> Append (cost=0.00..35.50 rows=2550 width=4) >>> -> Seq Scan on q1 (cost=0.00..35.50 rows=2550 width=4) >>> (2 rows) >>> >>> Maybe that should be done, but this patch doesn't implement that. >> >> Robert raises the possible problem that removing the Append wouldn't >> work when the parent and child attribute numbers don't match. Surely >> that never happens with partitions, by definition? > > No, the attribute numbers don't have to match. This decision was made > a long time ago, and there have been a whole bunch of followup commits > since the original partitioning patch that were dedicated to fixing up > cases where that wasn't working properly in the original commit. It > seems superficially attractive to require the attribute numbers to > match, but it creates some really unpleasant cases. For example, > suppose a user tries to creates a partitioned table, drops a column, > then creates a standalone table which matches the apparent column list > of the partitioned table, then tries to attach it as a partition. > > ERROR: the columns you previously dropped from the parent that you > can't see and don't know about aren't the same as the ones dropped > from the standalone table you're trying to attach as a partition > DETAIL: Try recreating your proposed new partition with a > pass-by-value column of width 8 after the third column, and then > dropping that column before trying to attach it as a partition. > HINT: Abandon all hope, ye who enter here. > > Not cool with that. Thanks for the explanation. > The decision not to require the attribute numbers to match doesn't > necessarily mean we can't get rid of the Append node, though. First > of all, in a lot of practical cases the attribute numbers will all > match. Second, if they don't, the most that would be required is a > projection step, which could usually be done without a separate node > because most nodes are projection-capable. And maybe not even that > much is needed; I'd have to go back and look at what Tom was worried > about the last time this came up. (Hmm, maybe the problem had to do > with varnos matching, rather then attribute numbers?) There used to be some code there to fix them up, not sure where that went. > Another and independent problem with eliding the Append node is that, > if we did that, we'd still have to guarantee that the parent relation > corresponding to the Append node got locked somehow. Otherwise, we'll > be accessing the tuple routing information for a table on which we > don't have a lock. That's probably a solvable problem, too, but it > hasn't been solved yet. Hmm, why would we need to access tuple routing information? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Mar 21, 2017 at 1:21 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> The decision not to require the attribute numbers to match doesn't >> necessarily mean we can't get rid of the Append node, though. First >> of all, in a lot of practical cases the attribute numbers will all >> match. Second, if they don't, the most that would be required is a >> projection step, which could usually be done without a separate node >> because most nodes are projection-capable. And maybe not even that >> much is needed; I'd have to go back and look at what Tom was worried >> about the last time this came up. (Hmm, maybe the problem had to do >> with varnos matching, rather then attribute numbers?) > > There used to be some code there to fix them up, not sure where that went. Me neither. To be clear in case I haven't been already, I'm totally fine with somebody doing the work to get rid of the Append node; I just think it'll take some investigation and work that hasn't been done yet. (I'm also a little skeptical about the value of the work. The Append node doesn't cost much; what's expensive is that the planner isn't smart about planning queries that involve Append nodes and so getting rid of one can improve the whole plan shape. But I think the answer to that problem is optimizations like partition-wise join and partition-wise aggregate, which can handle cases where an Append has any number of surviving children. Eliminating the Append only helps when the number of surviving children is exactly one. Now, that's not to say I'm going to fight a patch if somebody writes one, but I think to some extent it's just a band-aid.) >> Another and independent problem with eliding the Append node is that, >> if we did that, we'd still have to guarantee that the parent relation >> corresponding to the Append node got locked somehow. Otherwise, we'll >> be accessing the tuple routing information for a table on which we >> don't have a lock. That's probably a solvable problem, too, but it >> hasn't been solved yet. > > Hmm, why would we need to access tuple routing information? I didn't state that very well. It's not so much that we need access to the tuple routing information as that we need to replan if it changes, because if the tuple routing information changes then we might need to include partitions that were previously being pruned. If we haven't got some kind of a lock on the parent, I'm pretty sure that's not going to work reliably. Synchronization of invalidation traffic relies on DDL statements holding a lock that conflicts with the lock held by the process using the table; if there is no such lock, we might fail to notice that we need to replan. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Off-list by accident. Re-adding the list. On Tue, Mar 21, 2017 at 10:37 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Mar 21, 2017 at 9:49 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Tue, Mar 21, 2017 at 5:05 AM, Amit Langote >> <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>> Attached updated patches. >> >> Committed 0001 after removing a comma. > > Regarding 0002, I notice this surprising behavior: > > rhaas=# create table foo (a int, b text) partition by list (a); > CREATE TABLE > rhaas=# select relfilenode from pg_class where oid = 'foo'::regclass; > relfilenode > ------------- > 16385 > (1 row) > > Why do we end up with a relfilenode if there's no storage? I would > have expected to see 0 there. > > Other than that, there's not much to see here. I'm a little worried > that you might have missed some case that can result in an access to > the file, but I can't find one. Stuff I tried: > > VACUUM > VACUUM FULL > CLUSTER > ANALYZE > CREATE INDEX > ALTER TABLE .. ALTER COLUMN .. TYPE > TRUNCATE > > It would be good to go through and make sure all of those - and any > others you can think of - are represented in the regression tests. > > Also, a documentation update is probably in order to explain that > we're not going to accept heap_reloptions on the parent because the > parent has no storage; such options must be set on the child in order > to have effect. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Tue, Mar 21, 2017 at 10:37 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Tue, Mar 21, 2017 at 9:49 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Tue, Mar 21, 2017 at 5:05 AM, Amit Langote >>> <Langote_Amit_f8@lab.ntt.co.jp> wrote: >>>> Attached updated patches. >>> >>> Committed 0001 after removing a comma. >> >> Regarding 0002, Thanks for the review. >> I notice this surprising behavior: >> >> rhaas=# create table foo (a int, b text) partition by list (a); >> CREATE TABLE >> rhaas=# select relfilenode from pg_class where oid = 'foo'::regclass; >> relfilenode >> ------------- >> 16385 >> (1 row) >> >> Why do we end up with a relfilenode if there's no storage? I would >> have expected to see 0 there. Hmm, I see that happening even with views (and other relkinds that do not have storage): create table foo (a int); create view foov as select * from foo; select relfilenode from pg_class where relname = 'foov'; -[ RECORD 1 ]------ relfilenode | 24606 create type typ as (a int); select relfilenode from pg_class where relname = 'typ'; -[ RECORD 1 ]------ relfilenode | 24610 After staring at the relevant code, fixing things so that relfilenode is 0 for relations without storage seems to be slightly involved. In the header comment of relmapper.c, there is a line: Mapped catalogs have zero in their pg_class.relfilenode entries. which is kind of significant for the relcache initialization code. RelationInitPhysicalAddr() called when building a relcache entry initializes the rd_rel.relfilenode from pg_class.relfilenode if it's valid and from relation mapper if invalid (i.e. 0). So when relcache entries for the mapped catalogs are being built, pg_class.relfilenode being InvalidOid is a signal to get the same from the relmapper. Since the same code path is exercised in other cases, all relations supposedly without storage would wrongly be looked up in the relmapper. It may be possible to fix that, but will require non-trivial rearrangement of relevant code. >> Other than that, there's not much to see here. I'm a little worried >> that you might have missed some case that can result in an access to >> the file, but I can't find one. Stuff I tried: >> >> VACUUM >> VACUUM FULL >> CLUSTER >> ANALYZE >> CREATE INDEX >> ALTER TABLE .. ALTER COLUMN .. TYPE >> TRUNCATE >> >> It would be good to go through and make sure all of those - and any >> others you can think of - are represented in the regression tests. We now have tests that cover commands that previously would try to access file for partitioned tables (most of those your listed above). The commit yesterday to eliminate scan nodes is also covered by tests. Speaking of - do you think the following error message is reasonable when a a partitioned table is passed to pg_prewarm(): select pg_prewarm('p'); ERROR: fork "main" does not exist for this relation It's the same if a view name is passed. >> Also, a documentation update is probably in order to explain that >> we're not going to accept heap_reloptions on the parent because the >> parent has no storage; such options must be set on the child in order >> to have effect. Hmm, actually I am no longer sure if we should completely get rid of the reloptions. Some options like fillfactor don't make sense, but we might want to use the values for autovacuum_* options when we will improve autovacuum's handling of partitioned tables. Maybe even parallel_workers could be useful to specify for parent tables. So, I took out reloptions.c changes from the patch for now and documented that specifying the storage options for partitioned parents is ineffective currently. Does that make sense? Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Hi! I have noticed that there is scheduled unlinking of nonexistent physical storage under partitioned table when we execute DROP TABLE statement on this partitioned table. Though this action doesn't generate any error under typical behavior of postgres because the error of storage's lack is caught through if-statement [1] I think it is not safe. My patch fixes this issue. 1. src/backend/storage/smgr/md.c:1385 -- Maksim Milyutin Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Hi, On Thu, Mar 23, 2017 at 11:27 PM, Maksim Milyutin <m.milyutin@postgrespro.ru> wrote: > Hi! > > I have noticed that there is scheduled unlinking of nonexistent physical > storage under partitioned table when we execute DROP TABLE statement on this > partitioned table. Though this action doesn't generate any error under > typical behavior of postgres because the error of storage's lack is caught > through if-statement [1] I think it is not safe. > > My patch fixes this issue. > > 1. src/backend/storage/smgr/md.c:1385 Good catch, will incorporate that in the main patch. Thanks, Amit
On 2017/03/23 23:47, Amit Langote wrote: > On Thu, Mar 23, 2017 at 11:27 PM, Maksim Milyutin > <m.milyutin@postgrespro.ru> wrote: >> Hi! >> >> I have noticed that there is scheduled unlinking of nonexistent physical >> storage under partitioned table when we execute DROP TABLE statement on this >> partitioned table. Though this action doesn't generate any error under >> typical behavior of postgres because the error of storage's lack is caught >> through if-statement [1] I think it is not safe. >> >> My patch fixes this issue. >> >> 1. src/backend/storage/smgr/md.c:1385 > > Good catch, will incorporate that in the main patch. And here is the updated patch. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Thu, Mar 23, 2017 at 8:54 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/03/23 23:47, Amit Langote wrote: >> On Thu, Mar 23, 2017 at 11:27 PM, Maksim Milyutin >> <m.milyutin@postgrespro.ru> wrote: >>> Hi! >>> >>> I have noticed that there is scheduled unlinking of nonexistent physical >>> storage under partitioned table when we execute DROP TABLE statement on this >>> partitioned table. Though this action doesn't generate any error under >>> typical behavior of postgres because the error of storage's lack is caught >>> through if-statement [1] I think it is not safe. >>> >>> My patch fixes this issue. >>> >>> 1. src/backend/storage/smgr/md.c:1385 >> >> Good catch, will incorporate that in the main patch. > > And here is the updated patch. I think you should go back to the earlier strategy of disallowing heap reloptions to be set on the partitioned table. The thing is, we don't really know which way we're going to want to go in the future. Maybe we'll come up with a system where you can set options on the partitioned table, and those options will cascade to the children. Or maybe we'll come up with a system where partitioned tables have a completely different set of options to control behaviors specific to partitioned tables. If we do the latter, then we don't want to also have to support useless heap reloptions for forward compatibility, nor do we want to break backward-compatibility to remove support. If we do the former, then it's better if we allow it in the same release where it starts working. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 24.03.2017 03:54, Amit Langote wrote: > > And here is the updated patch. > Perhaps you forgot my fix in the updated patch diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c index 3999e6e..36917c8 100644 --- a/src/backend/catalog/heap.c +++ b/src/backend/catalog/heap.c @@ -1823,7 +1823,8 @@ heap_drop_with_catalog(Oid relid) */ if (rel->rd_rel->relkind != RELKIND_VIEW && rel->rd_rel->relkind!= RELKIND_COMPOSITE_TYPE && - rel->rd_rel->relkind != RELKIND_FOREIGN_TABLE) + rel->rd_rel->relkind != RELKIND_FOREIGN_TABLE && + rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE) { RelationDropStorage(rel); } -- Maksim Milyutin Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Hi, On 2017/03/28 0:13, Maksim Milyutin wrote: > On 24.03.2017 03:54, Amit Langote wrote: >> >> And here is the updated patch. >> > > Perhaps you forgot my fix in the updated patch > > diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c > index 3999e6e..36917c8 100644 > --- a/src/backend/catalog/heap.c > +++ b/src/backend/catalog/heap.c > @@ -1823,7 +1823,8 @@ heap_drop_with_catalog(Oid relid) > */ > if (rel->rd_rel->relkind != RELKIND_VIEW && > rel->rd_rel->relkind != RELKIND_COMPOSITE_TYPE && > - rel->rd_rel->relkind != RELKIND_FOREIGN_TABLE) > + rel->rd_rel->relkind != RELKIND_FOREIGN_TABLE && > + rel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE) > { > RelationDropStorage(rel); > } Oops, my bad. I will include it in the patch I'll send after addressing Robert's comments. Thanks again! Regards, Amit
On Mon, Mar 27, 2017 at 8:59 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Oops, my bad. I will include it in the patch I'll send after addressing > Robert's comments. Thanks again! That patch coming soon? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2017/03/27 23:27, Robert Haas wrote: > On Thu, Mar 23, 2017 at 8:54 PM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> On 2017/03/23 23:47, Amit Langote wrote: >>> On Thu, Mar 23, 2017 at 11:27 PM, Maksim Milyutin >>> <m.milyutin@postgrespro.ru> wrote: >>>> Hi! >>>> >>>> I have noticed that there is scheduled unlinking of nonexistent physical >>>> storage under partitioned table when we execute DROP TABLE statement on this >>>> partitioned table. Though this action doesn't generate any error under >>>> typical behavior of postgres because the error of storage's lack is caught >>>> through if-statement [1] I think it is not safe. >>>> >>>> My patch fixes this issue. >>>> >>>> 1. src/backend/storage/smgr/md.c:1385 >>> >>> Good catch, will incorporate that in the main patch. >> >> And here is the updated patch. > > I think you should go back to the earlier strategy of disallowing heap > reloptions to be set on the partitioned table. The thing is, we don't > really know which way we're going to want to go in the future. Maybe > we'll come up with a system where you can set options on the > partitioned table, and those options will cascade to the children. Or > maybe we'll come up with a system where partitioned tables have a > completely different set of options to control behaviors specific to > partitioned tables. If we do the latter, then we don't want to also > have to support useless heap reloptions for forward compatibility, nor > do we want to break backward-compatibility to remove support. If we > do the former, then it's better if we allow it in the same release > where it starts working. You're right, modified the patch accordingly. By the way, the previous version of the patch didn't really "disallow" specifying heap reloptions though. What I'd imagine that should entail is CREATE TABLE or ALTER TABLE raising error if one of those reloptions is specified, which didn't really occur with the patch. The options were silently accepted and stored into pg_class, but their values were never used. I modified the patch so that an error occurs instead of silently accepting the user input. create table p (a int) partition by list (a) with (fillfactor = 10); ERROR: unrecognized parameter "fillfactor" for a partitioned table Thanks, Amit
Hello, At Wed, 29 Mar 2017 15:40:20 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in <c4d71df2-9e0e-3912-dc81-9a72e080c238@lab.ntt.co.jp> > On 2017/03/27 23:27, Robert Haas wrote: > > On Thu, Mar 23, 2017 at 8:54 PM, Amit Langote > > <Langote_Amit_f8@lab.ntt.co.jp> wrote: > >> On 2017/03/23 23:47, Amit Langote wrote: > >>> On Thu, Mar 23, 2017 at 11:27 PM, Maksim Milyutin > >>> <m.milyutin@postgrespro.ru> wrote: > >>>> Hi! > >>>> > >>>> I have noticed that there is scheduled unlinking of nonexistent physical > >>>> storage under partitioned table when we execute DROP TABLE statement on this > >>>> partitioned table. Though this action doesn't generate any error under > >>>> typical behavior of postgres because the error of storage's lack is caught > >>>> through if-statement [1] I think it is not safe. > >>>> > >>>> My patch fixes this issue. > >>>> > >>>> 1. src/backend/storage/smgr/md.c:1385 > >>> > >>> Good catch, will incorporate that in the main patch. > >> > >> And here is the updated patch. > > > > I think you should go back to the earlier strategy of disallowing heap > > reloptions to be set on the partitioned table. The thing is, we don't > > really know which way we're going to want to go in the future. Maybe > > we'll come up with a system where you can set options on the > > partitioned table, and those options will cascade to the children. Or > > maybe we'll come up with a system where partitioned tables have a > > completely different set of options to control behaviors specific to > > partitioned tables. If we do the latter, then we don't want to also > > have to support useless heap reloptions for forward compatibility, nor > > do we want to break backward-compatibility to remove support. If we > > do the former, then it's better if we allow it in the same release > > where it starts working. > > You're right, modified the patch accordingly. > > By the way, the previous version of the patch didn't really "disallow" > specifying heap reloptions though. What I'd imagine that should entail is > CREATE TABLE or ALTER TABLE raising error if one of those reloptions is > specified, which didn't really occur with the patch. The options were > silently accepted and stored into pg_class, but their values were never > used. I modified the patch so that an error occurs instead of silently > accepting the user input. > > create table p (a int) partition by list (a) with (fillfactor = 10); > ERROR: unrecognized parameter "fillfactor" for a partitioned table The following attracted my eyes. + if (def->defnamespace == NULL && + pg_strcasecmp(def->defname, "oids") != 0) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("unrecognized parameter \"%s\" for a partitioned table", This works since defnamespace is always NULL here, but if I understand correctly what we should do here is "reject any option other than "(default).OID"". So I think that the condition should be like the following. + if (def->defnamespace != NULL || + pg_strcasecmp(def->defname, "oids") != 0) regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Horiguchi-san, Thanks for taking a look. On 2017/03/29 16:49, Kyotaro HORIGUCHI wrote: > At Wed, 29 Mar 2017 15:40:20 +0900, Amit Langote wrote: >> On 2017/03/27 23:27, Robert Haas wrote: >>>> And here is the updated patch. >>> >>> I think you should go back to the earlier strategy of disallowing heap >>> reloptions to be set on the partitioned table. The thing is, we don't >>> really know which way we're going to want to go in the future. Maybe >>> we'll come up with a system where you can set options on the >>> partitioned table, and those options will cascade to the children. Or >>> maybe we'll come up with a system where partitioned tables have a >>> completely different set of options to control behaviors specific to >>> partitioned tables. If we do the latter, then we don't want to also >>> have to support useless heap reloptions for forward compatibility, nor >>> do we want to break backward-compatibility to remove support. If we >>> do the former, then it's better if we allow it in the same release >>> where it starts working. >> >> You're right, modified the patch accordingly. >> >> By the way, the previous version of the patch didn't really "disallow" >> specifying heap reloptions though. What I'd imagine that should entail is >> CREATE TABLE or ALTER TABLE raising error if one of those reloptions is >> specified, which didn't really occur with the patch. The options were >> silently accepted and stored into pg_class, but their values were never >> used. I modified the patch so that an error occurs instead of silently >> accepting the user input. >> >> create table p (a int) partition by list (a) with (fillfactor = 10); >> ERROR: unrecognized parameter "fillfactor" for a partitioned table > > The following attracted my eyes. > > + if (def->defnamespace == NULL && > + pg_strcasecmp(def->defname, "oids") != 0) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("unrecognized parameter \"%s\" for a partitioned table", > > This works since defnamespace is always NULL here, but if I > understand correctly what we should do here is "reject any option > other than "(default).OID"". So I think that the condition should > be like the following. You're right. The following *wrongly* succeeds: create table p (a int) partition by list (a) with (toast.autovacuum_enabled = true); CREATE TABLE > + if (def->defnamespace != NULL || > + pg_strcasecmp(def->defname, "oids") != 0) Looks correct, so incorporated in the attached updated patch. Thanks. Regards, Amit
At Wed, 29 Mar 2017 17:21:26 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in <b1234f04-53e0-011d-9f02-2437b909cce4@lab.ntt.co.jp> > > Thanks for taking a look. This patch is small enough to look at in a short time:p > > The following attracted my eyes. > > > > + if (def->defnamespace == NULL && > > + pg_strcasecmp(def->defname, "oids") != 0) > > + ereport(ERROR, > > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > + errmsg("unrecognized parameter \"%s\" for a partitioned table", > > > > This works since defnamespace is always NULL here, but if I > > understand correctly what we should do here is "reject any option > > other than "(default).OID"". So I think that the condition should > > be like the following. > > You're right. The following *wrongly* succeeds: Ouch! I briefly checked that by "hoge.oids" without confirming around. > create table p (a int) partition by list (a) with > (toast.autovacuum_enabled = true); > CREATE TABLE > > > + if (def->defnamespace != NULL || > > + pg_strcasecmp(def->defname, "oids") != 0) > > Looks correct, so incorporated in the attached updated patch. Thanks. -- Kyotaro Horiguchi NTT Open Source Software Center
On Wed, Mar 29, 2017 at 4:21 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Looks correct, so incorporated in the attached updated patch. Thanks. This seems like a hacky way to limit the reloptions to just OIDs. Shouldn't we instead have a new RELOPT_KIND_PARTITIONED or something like that? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2017/03/29 23:58, Robert Haas wrote: > On Wed, Mar 29, 2017 at 4:21 AM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Looks correct, so incorporated in the attached updated patch. Thanks. > > This seems like a hacky way to limit the reloptions to just OIDs. > Shouldn't we instead have a new RELOPT_KIND_PARTITIONED or something > like that? OK, I tried that in the updated patch. DefineRelation() and ATExecSetRelOptions() still calls heap_reloptions(), but passes the new relopt_kind. None of the options listed in reloptions.c are of this kind as of now, so parseRelOptions() simply outputs the "unrecognized parameter %s" in the case of partitioned tables (except in some cases described below, but not because of the limitations of this patch). If and when partitioned tables start supporting the existing parameters, we'll update the relopt_gen.kinds bitmask of those parameters to allow them to be specified for partitioned tables. With the patch: create table parted (a int) partition by list (a) with (fillfactor = 10); ERROR: unrecognized parameter "fillfactor" create table parted (a int) partition by list (a) with (toast.fillfactor = 10); ERROR: unrecognized parameter "fillfactor" and: create table parted (a int) partition by list (a) with (oids = true); alter table parted set (fillfactor = 90); ERROR: unrecognized parameter "fillfactor" but: -- appears to succeed, whereas an error should have been reported (I think) alter table parted set (toast.fillfactor = 10); ALTER TABLE -- even with views create table foo (a int); create view foov with (toast.fillfactor = 10) as select * from foo; CREATE VIEW alter view foov set (toast.fillfactor = 20); ALTER VIEW Nothing is stored in pg_class.reloptions really, but the validation that should have occurred in parseRelOptions() doesn't. This happens because of the way transformRelOptions() works. It will ignore the DefElem's that don't apply to the specified relation based on the value of the namspace parameter ("toast" or NULL). That means it won't be included in the array of options later received by parseRelOptions(), which is where the validation occurs. Also, alter table reset (xxx) never validates anything: alter table foo reset (foo); ALTER TABLE alter table foo reset (foo.bar); ALTER TABLE Again, no pg_class.reloptions update occurs in this case. The reason this happens is because transformRelOptions() never includes the options to be reset in the array of options received by parseRelOptions(), so no validation occurs. But since both are existing behaviors, perhaps we can worry about it some other time. Thanks, Amit
Hello, At Thu, 30 Mar 2017 18:24:16 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in <ee5f1cd4-4783-42e8-0263-648ae9c11264@lab.ntt.co.jp> > On 2017/03/29 23:58, Robert Haas wrote: > > On Wed, Mar 29, 2017 at 4:21 AM, Amit Langote > > <Langote_Amit_f8@lab.ntt.co.jp> wrote: > >> Looks correct, so incorporated in the attached updated patch. Thanks. > > > > This seems like a hacky way to limit the reloptions to just OIDs. > > Shouldn't we instead have a new RELOPT_KIND_PARTITIONED or something > > like that? > > OK, I tried that in the updated patch. The name RELOPT_KIND_PARTITIONED looks somewhat odd. RELKIND for partitioned tables is RELKIND_PARTITIONED_TABLE, so is this better to be _TABLE, even if a bit longer? parseRelOptions seems to return garbage pointer if no option of the kind is available. > DefineRelation() and ATExecSetRelOptions() still calls heap_reloptions(), > but passes the new relopt_kind. None of the options listed in > reloptions.c are of this kind as of now, so parseRelOptions() simply > outputs the "unrecognized parameter %s" in the case of partitioned tables > (except in some cases described below, but not because of the limitations > of this patch). If and when partitioned tables start supporting the > existing parameters, we'll update the relopt_gen.kinds bitmask of those > parameters to allow them to be specified for partitioned tables. > > With the patch: > > create table parted (a int) partition by list (a) with (fillfactor = 10); > ERROR: unrecognized parameter "fillfactor" > > create table parted (a int) partition by list (a) with (toast.fillfactor = > 10); > ERROR: unrecognized parameter "fillfactor" > > and: > > create table parted (a int) partition by list (a) with (oids = true); > alter table parted set (fillfactor = 90); > ERROR: unrecognized parameter "fillfactor" > > but: > > -- appears to succeed, whereas an error should have been reported (I think) > alter table parted set (toast.fillfactor = 10); > ALTER TABLE > > -- even with views > create table foo (a int); > create view foov with (toast.fillfactor = 10) as select * from foo; > CREATE VIEW > alter view foov set (toast.fillfactor = 20); > ALTER VIEW > > Nothing is stored in pg_class.reloptions really, but the validation that > should have occurred in parseRelOptions() doesn't. This happens because > of the way transformRelOptions() works. It will ignore the DefElem's that > don't apply to the specified relation based on the value of the namspace > parameter ("toast" or NULL). That means it won't be included in the array > of options later received by parseRelOptions(), which is where the > validation occurs. > > Also, alter table reset (xxx) never validates anything: > > alter table foo reset (foo); > ALTER TABLE > alter table foo reset (foo.bar); > ALTER TABLE > > Again, no pg_class.reloptions update occurs in this case. The reason this > happens is because transformRelOptions() never includes the options to be > reset in the array of options received by parseRelOptions(), so no > validation occurs. > > But since both are existing behaviors, perhaps we can worry about it some > other time. -- Kyotaro Horiguchi NTT Open Source Software Center
Thanks for the review. On Thu, Mar 30, 2017 at 7:37 PM, Kyotaro HORIGUCHI wrote: > At Thu, 30 Mar 2017 18:24:16 +0900, Amit Langote wrote: >> On 2017/03/29 23:58, Robert Haas wrote: >> > On Wed, Mar 29, 2017 at 4:21 AM, Amit Langote >> > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> >> Looks correct, so incorporated in the attached updated patch. Thanks. >> > >> > This seems like a hacky way to limit the reloptions to just OIDs. >> > Shouldn't we instead have a new RELOPT_KIND_PARTITIONED or something >> > like that? >> >> OK, I tried that in the updated patch. > > The name RELOPT_KIND_PARTITIONED looks somewhat odd. RELKIND for > partitioned tables is RELKIND_PARTITIONED_TABLE, so is this > better to be _TABLE, even if a bit longer? Hm, OK. Done. > parseRelOptions seems to return garbage pointer if no option of > the kind is available. Oops, fixed that too. Updated patch attached. Thanks, Amit
Attachment
At Thu, 30 Mar 2017 20:58:35 +0900, Amit Langote <amitlangote09@gmail.com> wrote in <CA+HiwqFmrJGe4eid2EaqnCjjQxrAYR5hJfzc5Cz929WA9p=A+w@mail.gmail.com> > Thanks for the review. > > On Thu, Mar 30, 2017 at 7:37 PM, Kyotaro HORIGUCHI wrote: > > At Thu, 30 Mar 2017 18:24:16 +0900, Amit Langote wrote: > >> On 2017/03/29 23:58, Robert Haas wrote: > >> > On Wed, Mar 29, 2017 at 4:21 AM, Amit Langote > >> > <Langote_Amit_f8@lab.ntt.co.jp> wrote: > >> >> Looks correct, so incorporated in the attached updated patch. Thanks. > >> > > >> > This seems like a hacky way to limit the reloptions to just OIDs. > >> > Shouldn't we instead have a new RELOPT_KIND_PARTITIONED or something > >> > like that? > >> > >> OK, I tried that in the updated patch. > > > > The name RELOPT_KIND_PARTITIONED looks somewhat odd. RELKIND for > > partitioned tables is RELKIND_PARTITIONED_TABLE, so is this > > better to be _TABLE, even if a bit longer? > > Hm, OK. Done. > > > parseRelOptions seems to return garbage pointer if no option of > > the kind is available. > > Oops, fixed that too. > > Updated patch attached. Thank you. - Applies cleanly on master (f90d23d) - Compiled without error - Code seems fine. - Documentaion seems fine.. for me. - Passes regression test. - Leaving the ALTER-on-toast.* problem is fine for me. The regression contains the tests to fail with several reloptions only for partitioned tables. Are they still required even though it is now in the same framework with other kind of reloptions? regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On 2017/04/03 11:39, Amit Langote wrote: > On 2017/04/01 5:29, Robert Haas wrote: >> Thanks for reviewing, and thanks to Maksim as well, and thanks to Amit >> for writing the patch. > > Thanks for committing. :) I noticed that I had missed a couple of places that would try to scan partitioned tables, resulting in file access. 1. In validateCheckConstraint(), along with foreign tables, must ignore partitioned tables. 2. DefineQueryRewrite() may try to scan a partitioned table in the case of converting a table to view, where we must make sure that the table being converted is empty. It's checked by scanning the heap, which we should not do for a partitioned table. Nor should we try to drop the storage once ready to make the table into a REKIND_VIEW relation (because all other checks passed okaying the conversion). Tests are added for both the cases. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Looking at the number of issues where we have to fix tests based on the relkind checks, I think, we have to consider creating macros as described in my reply to thread with subject " Allowing extended stats on foreign and partitioned tables". On Tue, Apr 11, 2017 at 2:46 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2017/04/03 11:39, Amit Langote wrote: >> On 2017/04/01 5:29, Robert Haas wrote: >>> Thanks for reviewing, and thanks to Maksim as well, and thanks to Amit >>> for writing the patch. >> >> Thanks for committing. :) > > I noticed that I had missed a couple of places that would try to scan > partitioned tables, resulting in file access. > > 1. In validateCheckConstraint(), along with foreign tables, must ignore > partitioned tables. > > 2. DefineQueryRewrite() may try to scan a partitioned table in the case of > converting a table to view, where we must make sure that the table being > converted is empty. It's checked by scanning the heap, which we should > not do for a partitioned table. Nor should we try to drop the storage > once ready to make the table into a REKIND_VIEW relation (because all > other checks passed okaying the conversion). > > Tests are added for both the cases. > > Thanks, > Amit > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: > 2. DefineQueryRewrite() may try to scan a partitioned table in the case of > converting a table to view, where we must make sure that the table being > converted is empty. It's checked by scanning the heap, which we should > not do for a partitioned table. Nor should we try to drop the storage > once ready to make the table into a REKIND_VIEW relation (because all > other checks passed okaying the conversion). It looks like this patch intends to allow converting a partitioned table to a view. I would lobby for refusing the command, instead. There is no good reason to allow it, and it might well be a user error. regards, tom lane
On Tue, Apr 11, 2017 at 9:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: >> 2. DefineQueryRewrite() may try to scan a partitioned table in the case of >> converting a table to view, where we must make sure that the table being >> converted is empty. It's checked by scanning the heap, which we should >> not do for a partitioned table. Nor should we try to drop the storage >> once ready to make the table into a REKIND_VIEW relation (because all >> other checks passed okaying the conversion). > > It looks like this patch intends to allow converting a partitioned table > to a view. I would lobby for refusing the command, instead. There is > no good reason to allow it, and it might well be a user error. Yeah, I agree. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2017/04/12 2:41, Robert Haas wrote: > On Tue, Apr 11, 2017 at 9:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: >>> 2. DefineQueryRewrite() may try to scan a partitioned table in the case of >>> converting a table to view, where we must make sure that the table being >>> converted is empty. It's checked by scanning the heap, which we should >>> not do for a partitioned table. Nor should we try to drop the storage >>> once ready to make the table into a REKIND_VIEW relation (because all >>> other checks passed okaying the conversion). >> >> It looks like this patch intends to allow converting a partitioned table >> to a view. I would lobby for refusing the command, instead. There is >> no good reason to allow it, and it might well be a user error. > > Yeah, I agree. Alright. So I made it into two patches instead: 0001 fixes the bug that validateCheckConstraint() tries to scan partitioned tables and 0002 makes trying to convert a partitioned table to a view a user error. Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Tue, Apr 11, 2017 at 10:15 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > Alright. So I made it into two patches instead: 0001 fixes the bug that > validateCheckConstraint() tries to scan partitioned tables and 0002 makes > trying to convert a partitioned table to a view a user error. Committed together, after updating the regression test outputs to make the tests pass. You forgot to update the expected output file in 0002. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2017/04/13 0:36, Robert Haas wrote: > On Tue, Apr 11, 2017 at 10:15 PM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> Alright. So I made it into two patches instead: 0001 fixes the bug that >> validateCheckConstraint() tries to scan partitioned tables and 0002 makes >> trying to convert a partitioned table to a view a user error. > > Committed together, after updating the regression test outputs to make > the tests pass. You forgot to update the expected output file in > 0002. Oops, sorry about that. Thanks for fixing and committing. Regards, Amit