Thread: Using multiple extended statistics for estimates
Hi,

PostgreSQL 10 introduced extended statistics, allowing us to consider
correlation between columns to improve estimates, and PostgreSQL 12
added support for MCV statistics. But we still had the limitation that
we only allowed using a single extended statistic per relation, i.e.
given a table with two extended stats

  CREATE TABLE t (a int, b int, c int, d int);
  CREATE STATISTICS s1 (mcv) ON a, b FROM t;
  CREATE STATISTICS s2 (mcv) ON c, d FROM t;

and a query

  SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1;

we only ever used one of the statistics (and we considered them in a not
particularly well determined order).

This patch addresses this by using as many extended stats as possible,
by adding a loop to statext_mcv_clauselist_selectivity(). In each step
we pick the "best" applicable statistics (in the sense of covering the
most attributes) and factor it into the overall estimate.

All this happens where we'd originally consider applying a single MCV
list, i.e. before even considering the functional dependencies, so
roughly like this:

  while ()
  {
      ... apply another MCV list ...
  }

  ... apply functional dependencies ...

I've considered applying the functional dependencies in the loop too,
but I think that'd be wrong - the MCV list is expected to contain more
information about individual values (compared to functional deps, which
are column-level).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Oct 28, 2019 at 04:20:48PM +0100, Tomas Vondra wrote:
>Hi,
>
>PostgreSQL 10 introduced extended statistics, allowing us to consider
>correlation between columns to improve estimates, and PostgreSQL 12
>added support for MCV statistics. But we still had the limitation that
>we only allowed using a single extended statistic per relation, i.e.
>given a table with two extended stats
>
>  CREATE TABLE t (a int, b int, c int, d int);
>  CREATE STATISTICS s1 (mcv) ON a, b FROM t;
>  CREATE STATISTICS s2 (mcv) ON c, d FROM t;
>
>and a query
>
>  SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1;
>
>we only ever used one of the statistics (and we considered them in a not
>particularly well determined order).
>
>This patch addresses this by using as many extended stats as possible,
>by adding a loop to statext_mcv_clauselist_selectivity(). In each step
>we pick the "best" applicable statistics (in the sense of covering the
>most attributes) and factor it into the overall estimate.
>
>All this happens where we'd originally consider applying a single MCV
>list, i.e. before even considering the functional dependencies, so
>roughly like this:
>
>  while ()
>  {
>      ... apply another MCV list ...
>  }
>
>  ... apply functional dependencies ...
>
>I've considered applying the functional dependencies in the loop too,
>but I think that'd be wrong - the MCV list is expected to contain more
>information about individual values (compared to functional deps, which
>are column-level).
>

Here is a slightly polished v2 of the patch, the main difference being
that computing clause_attnums was moved to a separate function.

This is a fairly simple patch, and it's not entirely new functionality
(applying multiple statistics was part of the very first patch series,
although of course in a very different form). So unless there are
objections, I'd like to get this committed sometime next week.

There's room for improvement, of course, for example when handling
overlapping statistics.
Consider a table with columns (a,b,c) and two extended statistics on
(a,b) and (b,c), and a query with one clause per column

  SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1

In this case the patch does not help, because we apply (a,b) and then
we have just a single clause remaining. What we could do is still apply
the (b,c) statistic, using the already-estimated clause on b as a
condition. So essentially we'd compute

  P(a=1 && b=1) * P(c=1 | b=1)

But that'll require larger changes, and I see it as an evolution of the
current patch.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Nov 06, 2019 at 08:54:40PM +0100, Tomas Vondra wrote:
>On Mon, Oct 28, 2019 at 04:20:48PM +0100, Tomas Vondra wrote:
>>Hi,
>>
>>PostgreSQL 10 introduced extended statistics, allowing us to consider
>>correlation between columns to improve estimates, and PostgreSQL 12
>>added support for MCV statistics. But we still had the limitation that
>>we only allowed using a single extended statistic per relation, i.e.
>>given a table with two extended stats
>>
>>  CREATE TABLE t (a int, b int, c int, d int);
>>  CREATE STATISTICS s1 (mcv) ON a, b FROM t;
>>  CREATE STATISTICS s2 (mcv) ON c, d FROM t;
>>
>>and a query
>>
>>  SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1;
>>
>>we only ever used one of the statistics (and we considered them in a not
>>particularly well determined order).
>>
>>This patch addresses this by using as many extended stats as possible,
>>by adding a loop to statext_mcv_clauselist_selectivity(). In each step
>>we pick the "best" applicable statistics (in the sense of covering the
>>most attributes) and factor it into the overall estimate.
>>
>>All this happens where we'd originally consider applying a single MCV
>>list, i.e. before even considering the functional dependencies, so
>>roughly like this:
>>
>>  while ()
>>  {
>>      ... apply another MCV list ...
>>  }
>>
>>  ... apply functional dependencies ...
>>
>>I've considered applying the functional dependencies in the loop too,
>>but I think that'd be wrong - the MCV list is expected to contain more
>>information about individual values (compared to functional deps, which
>>are column-level).
>>
>
>Here is a slightly polished v2 of the patch, the main difference being
>that computing clause_attnums was moved to a separate function.
>

This time with the attachment ;-)

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello.

At Wed, 6 Nov 2019 20:58:49 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
>>Here is a slightly polished v2 of the patch, the main difference being
>>that computing clause_attnums was moved to a separate function.
>
>This time with the attachment ;-)

This patch is fairly straightforward: it repeats what the previous
statext_mcv_clauselist_selectivity did as long as the remaining clauses
match any of the MV-MCVs. There is almost no regression in the cases
where zero or just one MV-MCV applies to the given clause list.

It applies cleanly on the current master and seems to work as expected.

I have some comments.

Could we have a description in the documentation of how multiple
MV-MCVs are used in a query? And don't we need some regression tests?

+/*
+ * statext_mcv_clause_attnums
+ *		Recalculate attnums from compatible but not-yet-estimated clauses.

It returns attnums collected from multiple clause*s*. Is the name OK
with "clause_attnums"?

The comment reads as if the function checks the compatibility of each
clause, but that work is done on the caller side. I'm not sure such
strictness is required, but it might be better for the comment to
describe what exactly the function does.

+ */
+static Bitmapset *
+statext_mcv_clause_attnums(int nclauses, Bitmapset **estimatedclauses,
+						   Bitmapset **list_attnums)

The last two parameters have the same type in notation but different
actual types: one is a pointer to a Bitmapset*, and the other is an
array of Bitmapset*. The code in the function itself suggests that, but
it would be helpful if a brief explanation of the parameters appeared
in the function comment.

+	/*
+	 * Recompute attnums in the remaining clauses (we simply use the bitmaps
+	 * computed earlier, so that we don't have to inspect the clauses again).
+	 */
+	clauses_attnums = statext_mcv_clause_attnums(list_length(clauses),

Couldn't we avoid calling this function twice with the same parameters
in the first round of the loop?
+	foreach(l, clauses)
 	{
-		stat_clauses = lappend(stat_clauses, (Node *) lfirst(l));
-		*estimatedclauses = bms_add_member(*estimatedclauses, listidx);
+		/*
+		 * If the clause is compatible with the selected statistics, mark it
+		 * as estimated and add it to the list to estimate.
+		 */
+		if (list_attnums[listidx] != NULL &&
+			bms_is_subset(list_attnums[listidx], stat->keys))
+		{
+			stat_clauses = lappend(stat_clauses, (Node *) lfirst(l));
+			*estimatedclauses = bms_add_member(*estimatedclauses, listidx);
+		}

The loop runs through all clauses every time. I agree that that is
better than using a copy of the clauses to avoid stepping on
already-estimated clauses, but maybe we need an Assertion that listidx
is not part of estimatedclauses, to make sure no clause is estimated
twice.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Nov 07, 2019 at 01:38:20PM +0900, Kyotaro Horiguchi wrote:
>Hello.
>
>At Wed, 6 Nov 2019 20:58:49 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in
>>>Here is a slightly polished v2 of the patch, the main difference being
>>>that computing clause_attnums was moved to a separate function.
>>
>>This time with the attachment ;-)
>
>This patch is fairly straightforward: it repeats what the previous
>statext_mcv_clauselist_selectivity did as long as the remaining clauses
>match any of the MV-MCVs. There is almost no regression in the cases
>where zero or just one MV-MCV applies to the given clause list.
>
>It applies cleanly on the current master and seems to work as expected.
>
>I have some comments.
>
>Could we have a description in the documentation of how multiple
>MV-MCVs are used in a query? And don't we need some regression tests?
>

Yes, regression tests are certainly needed - I thought I had added
them, but it seems I failed to include them in the patch. Will fix.

I agree it's probably worth mentioning that we can consider multiple
stats, but I'm a bit hesitant to put the exact rules for how we pick
the "best" statistic into the docs. It's not 100% deterministic and
it's likely we'll need to tweak it a bit in the future. I'd prefer
showing the stats in EXPLAIN, but that's a separate patch.

>+/*
>+ * statext_mcv_clause_attnums
>+ *		Recalculate attnums from compatible but not-yet-estimated clauses.
>
>It returns attnums collected from multiple clause*s*. Is the name OK
>with "clause_attnums"?
>
>The comment reads as if the function checks the compatibility of each
>clause, but that work is done on the caller side. I'm not sure such
>strictness is required, but it might be better for the comment to
>describe what exactly the function does.
>

But the incompatible clauses have the pre-computed attnums set to NULL,
so technically the comment is correct. But I'll clarify.
>+ */
>+static Bitmapset *
>+statext_mcv_clause_attnums(int nclauses, Bitmapset **estimatedclauses,
>+						   Bitmapset **list_attnums)
>
>The last two parameters have the same type in notation but different
>actual types: one is a pointer to a Bitmapset*, and the other is an
>array of Bitmapset*. The code in the function itself suggests that, but
>it would be helpful if a brief explanation of the parameters appeared
>in the function comment.
>

OK, will explain in a comment.

>+	/*
>+	 * Recompute attnums in the remaining clauses (we simply use the bitmaps
>+	 * computed earlier, so that we don't have to inspect the clauses again).
>+	 */
>+	clauses_attnums = statext_mcv_clause_attnums(list_length(clauses),
>
>Couldn't we avoid calling this function twice with the same parameters
>in the first round of the loop?
>

Hmmm, yeah. That's a good point.

>+	foreach(l, clauses)
> 	{
>-		stat_clauses = lappend(stat_clauses, (Node *) lfirst(l));
>-		*estimatedclauses = bms_add_member(*estimatedclauses, listidx);
>+		/*
>+		 * If the clause is compatible with the selected statistics, mark it
>+		 * as estimated and add it to the list to estimate.
>+		 */
>+		if (list_attnums[listidx] != NULL &&
>+			bms_is_subset(list_attnums[listidx], stat->keys))
>+		{
>+			stat_clauses = lappend(stat_clauses, (Node *) lfirst(l));
>+			*estimatedclauses = bms_add_member(*estimatedclauses, listidx);
>+		}
>
>The loop runs through all clauses every time. I agree that that is
>better than using a copy of the clauses to avoid stepping on
>already-estimated clauses, but maybe we need an Assertion that listidx
>is not part of estimatedclauses, to make sure no clause is estimated
>twice.
>

Well, we can't really operate on a smaller "copy" of the list anyway,
because that would break the precalculation logic (the listidx value
would be incorrect for the new list), and tweaking it would be more
expensive than just iterating over all clauses.
The assumption is that we won't see an extremely large number of
clauses here. Adding an assert seems reasonable. And maybe a comment
explaining why we should not see any already-estimated clauses here.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11/6/19 11:58 AM, Tomas Vondra wrote:
> On Wed, Nov 06, 2019 at 08:54:40PM +0100, Tomas Vondra wrote:
>> On Mon, Oct 28, 2019 at 04:20:48PM +0100, Tomas Vondra wrote:
>>> Hi,
>>>
>>> PostgreSQL 10 introduced extended statistics, allowing us to consider
>>> correlation between columns to improve estimates, and PostgreSQL 12
>>> added support for MCV statistics. But we still had the limitation that
>>> we only allowed using a single extended statistic per relation, i.e.
>>> given a table with two extended stats
>>>
>>>   CREATE TABLE t (a int, b int, c int, d int);
>>>   CREATE STATISTICS s1 (mcv) ON a, b FROM t;
>>>   CREATE STATISTICS s2 (mcv) ON c, d FROM t;
>>>
>>> and a query
>>>
>>>   SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1;
>>>
>>> we only ever used one of the statistics (and we considered them in a not
>>> particularly well determined order).
>>>
>>> This patch addresses this by using as many extended stats as possible,
>>> by adding a loop to statext_mcv_clauselist_selectivity(). In each step
>>> we pick the "best" applicable statistics (in the sense of covering the
>>> most attributes) and factor it into the overall estimate.

Tomas,

Your patch compiles and passes the regression tests for me on debian
linux under master.

Since your patch does not include modified regression tests, I wrote a
test that I expected to improve under this new code, but running it
both before and after applying your patch, there is no change. Please
find the modified test attached. Am I wrong to expect some change in
this test's output? If so, can you provide a test example that works
differently under your patch?

Thanks!

-- 
Mark Dilger
On 11/9/19 12:33 PM, Mark Dilger wrote:
>
> On 11/6/19 11:58 AM, Tomas Vondra wrote:
>> On Wed, Nov 06, 2019 at 08:54:40PM +0100, Tomas Vondra wrote:
>>> On Mon, Oct 28, 2019 at 04:20:48PM +0100, Tomas Vondra wrote:
>>>> Hi,
>>>>
>>>> PostgreSQL 10 introduced extended statistics, allowing us to consider
>>>> correlation between columns to improve estimates, and PostgreSQL 12
>>>> added support for MCV statistics. But we still had the limitation that
>>>> we only allowed using a single extended statistic per relation, i.e.
>>>> given a table with two extended stats
>>>>
>>>>   CREATE TABLE t (a int, b int, c int, d int);
>>>>   CREATE STATISTICS s1 (mcv) ON a, b FROM t;
>>>>   CREATE STATISTICS s2 (mcv) ON c, d FROM t;
>>>>
>>>> and a query
>>>>
>>>>   SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1;
>>>>
>>>> we only ever used one of the statistics (and we considered them in a
>>>> not particularly well determined order).
>>>>
>>>> This patch addresses this by using as many extended stats as possible,
>>>> by adding a loop to statext_mcv_clauselist_selectivity(). In each step
>>>> we pick the "best" applicable statistics (in the sense of covering the
>>>> most attributes) and factor it into the overall estimate.
>
> Tomas,
>
> Your patch compiles and passes the regression tests for me on debian
> linux under master.
>
> Since your patch does not include modified regression tests, I wrote a
> test that I expected to improve under this new code, but running it both
> before and after applying your patch, there is no change.

Ok, the attached test passes before applying your patch and fails
afterward, owing to the estimates improving and no longer matching the
expected output. To be clear, this confirms your patch is working as
expected.

I haven't seen any crashes in several hours of running different
tests, so I think it looks good.

-- 
Mark Dilger
On Sat, Nov 09, 2019 at 12:33:05PM -0800, Mark Dilger wrote:
>
>On 11/6/19 11:58 AM, Tomas Vondra wrote:
>>On Wed, Nov 06, 2019 at 08:54:40PM +0100, Tomas Vondra wrote:
>>>On Mon, Oct 28, 2019 at 04:20:48PM +0100, Tomas Vondra wrote:
>>>>Hi,
>>>>
>>>>PostgreSQL 10 introduced extended statistics, allowing us to consider
>>>>correlation between columns to improve estimates, and PostgreSQL 12
>>>>added support for MCV statistics. But we still had the limitation that
>>>>we only allowed using a single extended statistic per relation, i.e.
>>>>given a table with two extended stats
>>>>
>>>>  CREATE TABLE t (a int, b int, c int, d int);
>>>>  CREATE STATISTICS s1 (mcv) ON a, b FROM t;
>>>>  CREATE STATISTICS s2 (mcv) ON c, d FROM t;
>>>>
>>>>and a query
>>>>
>>>>  SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1;
>>>>
>>>>we only ever used one of the statistics (and we considered them in a not
>>>>particularly well determined order).
>>>>
>>>>This patch addresses this by using as many extended stats as possible,
>>>>by adding a loop to statext_mcv_clauselist_selectivity(). In each step
>>>>we pick the "best" applicable statistics (in the sense of covering the
>>>>most attributes) and factor it into the overall estimate.
>
>Tomas,
>
>Your patch compiles and passes the regression tests for me on debian
>linux under master.
>

Thanks.

>Since your patch does not include modified regression tests, I wrote a
>test that I expected to improve under this new code, but running it
>both before and after applying your patch, there is no change. Please
>find the modified test attached. Am I wrong to expect some change in
>this test's output? If so, can you provide a test example that works
>differently under your patch?
>

Those queries are not improved by the patch, because we only support
clauses "Var op Const" for now - your tests are using "Var op Var" so
that doesn't work.
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Nov 09, 2019 at 02:32:27PM -0800, Mark Dilger wrote:
>
>On 11/9/19 12:33 PM, Mark Dilger wrote:
>>
>>On 11/6/19 11:58 AM, Tomas Vondra wrote:
>>>On Wed, Nov 06, 2019 at 08:54:40PM +0100, Tomas Vondra wrote:
>>>>On Mon, Oct 28, 2019 at 04:20:48PM +0100, Tomas Vondra wrote:
>>>>>Hi,
>>>>>
>>>>>PostgreSQL 10 introduced extended statistics, allowing us to consider
>>>>>correlation between columns to improve estimates, and PostgreSQL 12
>>>>>added support for MCV statistics. But we still had the limitation that
>>>>>we only allowed using a single extended statistic per relation, i.e.
>>>>>given a table with two extended stats
>>>>>
>>>>>  CREATE TABLE t (a int, b int, c int, d int);
>>>>>  CREATE STATISTICS s1 (mcv) ON a, b FROM t;
>>>>>  CREATE STATISTICS s2 (mcv) ON c, d FROM t;
>>>>>
>>>>>and a query
>>>>>
>>>>>  SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1 AND d = 1;
>>>>>
>>>>>we only ever used one of the statistics (and we considered them in a
>>>>>not particularly well determined order).
>>>>>
>>>>>This patch addresses this by using as many extended stats as possible,
>>>>>by adding a loop to statext_mcv_clauselist_selectivity(). In each step
>>>>>we pick the "best" applicable statistics (in the sense of covering the
>>>>>most attributes) and factor it into the overall estimate.
>>
>>Tomas,
>>
>>Your patch compiles and passes the regression tests for me on debian
>>linux under master.
>>
>>Since your patch does not include modified regression tests, I wrote
>>a test that I expected to improve under this new code, but running
>>it both before and after applying your patch, there is no change.
>
>Ok, the attached test passes before applying your patch and fails
>afterward, owing to the estimates improving and no longer matching the
>expected output. To be clear, this confirms your patch is working as
>expected.
>
>I haven't seen any crashes in several hours of running different
>tests, so I think it looks good.
>

Yep, thanks for adding the tests.
I'll include them in the patch.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

here's an updated patch, with some minor tweaks based on the review and
added tests (I ended up reworking those a bit, to make them more like
the existing ones).

There's also a new piece, dealing with functional dependencies. Until
now we did the same thing as for MCV lists - we picked the "best"
extended statistics object (with functional dependencies built) and
just used that.

At first I thought we might simply do the same loop as for MCV lists,
but that does not really make sense, because we might end up applying a
"weaker" dependency first.

Say for example we have a table with columns (a,b,c,d,e) and functional
dependencies on (a,b,c,d) and (c,d,e), where all the dependencies on
(a,b,c,d) are weaker than (c,d => e). In a query with clauses on all
attributes this is guaranteed to apply all dependencies from the first
statistic first, which is clearly wrong.

So what this does instead is simply merging all the dependencies from
all the relevant stats, and treating them as a single collection.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11/13/19 7:28 AM, Tomas Vondra wrote:
> Hi,
>
> here's an updated patch, with some minor tweaks based on the review and
> added tests (I ended up reworking those a bit, to make them more like
> the existing ones).

Thanks, Tomas, for the new patch set!

Attached are my review comments so far, in the form of a patch applied
on top of yours.

-- 
Mark Dilger
On Wed, Nov 13, 2019 at 10:04:36AM -0800, Mark Dilger wrote:
>
>On 11/13/19 7:28 AM, Tomas Vondra wrote:
>>Hi,
>>
>>here's an updated patch, with some minor tweaks based on the review and
>>added tests (I ended up reworking those a bit, to make them more like
>>the existing ones).
>
>Thanks, Tomas, for the new patch set!
>
>Attached are my review comments so far, in the form of a patch applied
>on top of yours.
>

Thanks.

1) It's not clear to me why adding 'const' to the List parameters would
   be useful? Can you explain?

2) I think you're right we can change find_strongest_dependency to do

     /* also skip weaker dependencies when attribute count matches */
     if (strongest->nattributes == dependency->nattributes &&
         strongest->degree >= dependency->degree)
         continue;

   That'll skip some additional dependencies, which seems OK.

3) It's not clear to me what you mean by

     * TODO: Improve this code comment.  Specifically, why would we
     * ignore that no rows will match?  It seems that such a discovery
     * would allow us to return an estimate of 0 rows, and that would
     * be useful.

   added to dependencies_clauselist_selectivity. Are you saying we
   should also compute selectivity estimates for individual clauses and
   use Min() as a limit? Maybe, but that seems unrelated to the patch.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11/14/19 7:55 AM, Tomas Vondra wrote:
> On Wed, Nov 13, 2019 at 10:04:36AM -0800, Mark Dilger wrote:
>>
>> On 11/13/19 7:28 AM, Tomas Vondra wrote:
>>> Hi,
>>>
>>> here's an updated patch, with some minor tweaks based on the review and
>>> added tests (I ended up reworking those a bit, to make them more like
>>> the existing ones).
>>
>> Thanks, Tomas, for the new patch set!
>>
>> Attached are my review comments so far, in the form of a patch applied
>> on top of yours.
>>
>
> Thanks.
>
> 1) It's not clear to me why adding 'const' to the List parameters would
>    be useful? Can you explain?

When I first started reviewing the functions, I didn't know if those
lists were intended to be modified by the function. Adding 'const'
helps document that the function does not intend to change them.

> 2) I think you're right we can change find_strongest_dependency to do
>
>      /* also skip weaker dependencies when attribute count matches */
>      if (strongest->nattributes == dependency->nattributes &&
>          strongest->degree >= dependency->degree)
>          continue;
>
>    That'll skip some additional dependencies, which seems OK.
>
> 3) It's not clear to me what you mean by
>
>      * TODO: Improve this code comment.  Specifically, why would we
>      * ignore that no rows will match?  It seems that such a discovery
>      * would allow us to return an estimate of 0 rows, and that would
>      * be useful.
>
>    added to dependencies_clauselist_selectivity. Are you saying we
>    should also compute selectivity estimates for individual clauses and
>    use Min() as a limit? Maybe, but that seems unrelated to the patch.

I mean that the comment right above that TODO is hard to understand.
You seem to be saying that it is good and proper to only take the
selectivity estimate from the final clause in the list, but then go on
to say that other clauses might prove that no rows will match. So that
implies that by ignoring all but the last clause, we're ignoring such
other clauses that prove no rows can match.
But why would we be ignoring those? I am not arguing that your code is wrong. I'm just critiquing the hard-to-understand phrasing of that code comment. -- Mark Dilger
On Thu, Nov 14, 2019 at 10:23:44AM -0800, Mark Dilger wrote:
>
>On 11/14/19 7:55 AM, Tomas Vondra wrote:
>>On Wed, Nov 13, 2019 at 10:04:36AM -0800, Mark Dilger wrote:
>>>
>>>On 11/13/19 7:28 AM, Tomas Vondra wrote:
>>>>Hi,
>>>>
>>>>here's an updated patch, with some minor tweaks based on the review and
>>>>added tests (I ended up reworking those a bit, to make them more like
>>>>the existing ones).
>>>
>>>Thanks, Tomas, for the new patch set!
>>>
>>>Attached are my review comments so far, in the form of a patch
>>>applied on top of yours.
>>>
>>
>>Thanks.
>>
>>1) It's not clear to me why adding 'const' to the List parameters would
>>   be useful? Can you explain?
>
>When I first started reviewing the functions, I didn't know if those
>lists were intended to be modified by the function. Adding 'const'
>helps document that the function does not intend to change them.
>

Hmmm, ok. I'll think about it, but we're not really using const* in
this way very much I think - at least not in the surrounding code.

>>2) I think you're right we can change find_strongest_dependency to do
>>
>>     /* also skip weaker dependencies when attribute count matches */
>>     if (strongest->nattributes == dependency->nattributes &&
>>         strongest->degree >= dependency->degree)
>>         continue;
>>
>>   That'll skip some additional dependencies, which seems OK.
>>
>>3) It's not clear to me what you mean by
>>
>>     * TODO: Improve this code comment.  Specifically, why would we
>>     * ignore that no rows will match?  It seems that such a discovery
>>     * would allow us to return an estimate of 0 rows, and that would
>>     * be useful.
>>
>>   added to dependencies_clauselist_selectivity. Are you saying we
>>   should also compute selectivity estimates for individual clauses and
>>   use Min() as a limit? Maybe, but that seems unrelated to the patch.
>
>I mean that the comment right above that TODO is hard to understand.
>You seem to be saying that it is good and proper to only take the
>selectivity estimate from the final clause in the list, but then go on
>to say that other clauses might prove that no rows will match. So
>that implies that by ignoring all but the last clause, we're ignoring
>such other clauses that prove no rows can match. But why would we be
>ignoring those?
>
>I am not arguing that your code is wrong. I'm just critiquing the
>hard-to-understand phrasing of that code comment.
>

Aha, I think I understand now - thanks for the explanation. You're right
the comment is trying to explain why just taking the last clause for a
given attnum is fine. I'll try to make the comment clearer.

For the case with equal Const values that should be mostly obvious, i.e.
"a=1 AND a=1 AND a=1" has the same selectivity as "a=1".

The case with different Const values is harder, unfortunately. It might
seem obvious that "a=1 AND a=2" means there are no matching rows, but
that heavily relies on the semantics of the equality operator. And we
can't simply compare the Const values either, I'm afraid, because there
are cases with cross-type operators like

  a = 1::int AND a = 1.0::numeric

where the Consts are of different type, yet both conditions can be true.

So it would be pretty tricky to do this, and the current code does not
even try to do that.

Instead, it just assumes that it's mostly fine to overestimate, because
then at runtime we'll simply end up with 0 rows here.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> For the case with equal Const values that should be mostly obvious, i.e.
> "a=1 AND a=1 AND a=1" has the same selectivity as "a=1".

> The case with different Const values is harder, unfortunately. It might
> seem obvious that "a=1 AND a=2" means there are no matching rows, but
> that heavily relies on the semantics of the equality operator. And we
> can't simply compare the Const values either, I'm afraid, because there
> are cases with cross-type operators like
>   a = 1::int AND a = 1.0::numeric
> where the Consts are of different type, yet both conditions can be true.

FWIW, there's code in predtest.c to handle exactly that, at least for
types sharing a btree opfamily. Whether it's worth applying that logic
here is unclear, but note that we've had the ability to recognize
redundant and contradictory clauses for a long time:

regression=# explain select * from tenk1 where two = 1;
                         QUERY PLAN
------------------------------------------------------------
 Seq Scan on tenk1  (cost=0.00..470.00 rows=5000 width=244)
   Filter: (two = 1)
(2 rows)

regression=# explain select * from tenk1 where two = 1 and two = 1::bigint;
                         QUERY PLAN
------------------------------------------------------------
 Seq Scan on tenk1  (cost=0.00..470.00 rows=5000 width=244)
   Filter: (two = 1)
(2 rows)

regression=# explain select * from tenk1 where two = 1 and two = 2::bigint;
                           QUERY PLAN
---------------------------------------------------------------
 Result  (cost=0.00..470.00 rows=1 width=244)
   One-Time Filter: false
   ->  Seq Scan on tenk1  (cost=0.00..470.00 rows=1 width=244)
         Filter: (two = 1)
(4 rows)

It falls down on

regression=# explain select * from tenk1 where two = 1 and two = 2::numeric;
                        QUERY PLAN
-----------------------------------------------------------
 Seq Scan on tenk1  (cost=0.00..520.00 rows=25 width=244)
   Filter: ((two = 1) AND ((two)::numeric = '2'::numeric))
(2 rows)

because numeric isn't in the same opfamily, so these clauses can't be
compared easily.

			regards, tom lane
On 11/14/19 12:04 PM, Tomas Vondra wrote:
> Aha, I think I understand now - thanks for the explanation. You're right
> the comment is trying to explain why just taking the last clause for a
> given attnum is fine. I'll try to make the comment clearer.
>
> For the case with equal Const values that should be mostly obvious, i.e.
> "a=1 AND a=1 AND a=1" has the same selectivity as "a=1".
>
> The case with different Const values is harder, unfortunately. It might
> seem obvious that "a=1 AND a=2" means there are no matching rows, but
> that heavily relies on the semantics of the equality operator. And we
> can't simply compare the Const values either, I'm afraid, because there
> are cases with cross-type operators like
>
>   a = 1::int AND a = 1.0::numeric
>
> where the Consts are of different type, yet both conditions can be true.
>
> So it would be pretty tricky to do this, and the current code does not
> even try to do that.
>
> Instead, it just assumes that it's mostly fine to overestimate, because
> then at runtime we'll simply end up with 0 rows here.

I'm unsure whether that could be a performance problem at runtime. I
could imagine the planner short-circuiting additional planning when it
finds a plan with zero rows, and so we'd save planner time if we avoid
overestimating. I don't recall if the planner does anything like that,
or if there are plans to implement such logic, but it might be good not
to rule it out.

Tom's suggestion elsewhere in this thread to use code in predtest.c
sounds good to me. I don't know if you want to expand the scope of this
particular patch to include that, though.

-- 
Mark Dilger
On Thu, Nov 14, 2019 at 03:16:04PM -0500, Tom Lane wrote: >Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: >> For the case with equal Const values that should be mostly obvious, i.e. >> "a=1 AND a=1 AND a=1" has the same selectivity as "a=1". > >> The case with different Const values is harder, unfortunately. It might >> seem obvious that "a=1 AND a=2" means there are no matching rows, but >> that heavily relies on the semantics of the equality operator. And we >> can't simply compare the Const values either, I'm afraid, because there >> are cases with cross-type operators like >> a = 1::int AND a = 1.0::numeric >> where the Consts are of different type, yet both conditions can be true. > >FWIW, there's code in predtest.c to handle exactly that, at least for >types sharing a btree opfamily. Whether it's worth applying that logic >here is unclear, but note that we've had the ability to recognize >redundant and contradictory clauses for a long time: > >regression=# explain select * from tenk1 where two = 1; > QUERY PLAN >------------------------------------------------------------ > Seq Scan on tenk1 (cost=0.00..470.00 rows=5000 width=244) > Filter: (two = 1) >(2 rows) > >regression=# explain select * from tenk1 where two = 1 and two = 1::bigint; > QUERY PLAN >------------------------------------------------------------ > Seq Scan on tenk1 (cost=0.00..470.00 rows=5000 width=244) > Filter: (two = 1) >(2 rows) > >regression=# explain select * from tenk1 where two = 1 and two = 2::bigint; > QUERY PLAN >--------------------------------------------------------------- > Result (cost=0.00..470.00 rows=1 width=244) > One-Time Filter: false > -> Seq Scan on tenk1 (cost=0.00..470.00 rows=1 width=244) > Filter: (two = 1) >(4 rows) > >It falls down on > >regression=# explain select * from tenk1 where two = 1 and two = 2::numeric; > QUERY PLAN >----------------------------------------------------------- > Seq Scan on tenk1 (cost=0.00..520.00 rows=25 width=244) > Filter: ((two 
= 1) AND ((two)::numeric = '2'::numeric)) >(2 rows) > >because numeric isn't in the same opfamily, so these clauses can't be >compared easily. > > regards, tom lane Yeah, and this logic still works - the redundant clauses won't even get to the selectivity estimation, I think. So maybe the comment is not quite necessary, because the problem does not even exist ... Maybe we could do something about the cases that predtest.c can't solve, but it's not clear if we can be much smarter for types with different opfamilies. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Nov 14, 2019 at 01:17:02PM -0800, Mark Dilger wrote: > > >On 11/14/19 12:04 PM, Tomas Vondra wrote: >>Aha, I think I understand now - thanks for the explanation. You're right >>the comment is trying to explain why just taking the last clause for a >>given attnum is fine. I'll try to make the comment clearer. >> >>For the case with equal Const values that should be mostly obvious, i.e. >>"a=1 AND a=1 AND a=1" has the same selectivity as "a=1". >> >>The case with different Const values is harder, unfortunately. It might >>seem obvious that "a=1 AND a=2" means there are no matching rows, but >>that heavily relies on the semantics of the equality operator. And we >>can't simply compare the Const values either, I'm afraid, because there >>are cases with cross-type operators like >> >> a = 1::int AND a = 1.0::numeric >> >>where the Consts are of different type, yet both conditions can be true. >> >>So it would be pretty tricky to do this, and the current code does not >>even try to do that. >> >>Instead, it just assumes that it's mostly fine to overestimate, because >>then at runtime we'll simply end up with 0 rows here. > >I'm unsure whether that could be a performance problem at runtime. > >I could imagine the planner short-circuiting additional planning when >it finds a plan with zero rows, and so we'd save planner time if we >avoid overestimating. I don't recall if the planner does anything like >that, or if there are plans to implement such logic, but it might be >good not to rule it out. Tom's suggestion elsewhere in this thread to >use code in predtest.c sounds good to me. > No, AFAIK the planner does not do anything like that - it might probably do that if it could prove there are no such rows, but that's hardly the case for estimates based on approximate information (i.e. statistics). It could do that based on the predicate analysis in predtest.c mentioned by Tom, although I don't think it does anything beyond tweaking the row estimate to ~1 row.
>I don't know if you want to expand the scope of this particular patch to >include that, though. > Certainly not. It's an interesting but surprisingly complicated problem, and this patch simply aims to add a different improvement. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11/14/19 12:04 PM, Tomas Vondra wrote: > On Thu, Nov 14, 2019 at 10:23:44AM -0800, Mark Dilger wrote: >> >> >> On 11/14/19 7:55 AM, Tomas Vondra wrote: >>> On Wed, Nov 13, 2019 at 10:04:36AM -0800, Mark Dilger wrote: >>>> >>>> >>>> On 11/13/19 7:28 AM, Tomas Vondra wrote: >>>>> Hi, >>>>> >>>>> here's an updated patch, with some minor tweaks based on the review >>>>> and >>>>> added tests (I ended up reworking those a bit, to make them more like >>>>> the existing ones). >>>> >>>> Thanks, Tomas, for the new patch set! >>>> >>>> Attached are my review comments so far, in the form of a patch >>>> applied on top of yours. >>>> >>> >>> Thanks. >>> >>> 1) It's not clear to me why adding 'const' to the List parameters would >>> be useful? Can you explain? >> >> When I first started reviewing the functions, I didn't know if those >> lists were intended to be modified by the function. Adding 'const' >> helps document that the function does not intend to change them. >> > > Hmmm, ok. I'll think about it, but we're not really using const* in this > way very much I think - at least not in the surrounding code. > >>> 2) I think you're right we can change find_strongest_dependency to do >>> >>> /* also skip weaker dependencies when attribute count matches */ >>> if (strongest->nattributes == dependency->nattributes && >>> strongest->degree >= dependency->degree) >>> continue; >>> >>> That'll skip some additional dependencies, which seems OK. >>> >>> 3) It's not clear to me what you mean by >>> >>> * TODO: Improve this code comment. Specifically, why would we >>> * ignore that no rows will match? It seems that such a discovery >>> * would allow us to return an estimate of 0 rows, and that would >>> * be useful. >>> >>> added to dependencies_clauselist_selectivity. Are you saying we >>> should also compute selectivity estimates for individual clauses and >>> use Min() as a limit? Maybe, but that seems unrelated to the patch. 
>> >> I mean that the comment right above that TODO is hard to understand. >> You seem to be saying that it is good and proper to only take the >> selectivity estimate from the final clause in the list, but then go on >> to say that other clauses might prove that no rows will match. So >> that implies that by ignoring all but the last clause, we're ignoring >> such other clauses that prove no rows can match. But why would we be >> ignoring those? >> >> I am not arguing that your code is wrong. I'm just critiquing the >> hard-to-understand phrasing of that code comment. >> > > Aha, I think I understand now - thanks for the explanation. You're right > the comment is trying to explain why just taking the last clause for a > given attnum is fine. I'll try to make the comment clearer. Are you planning to submit a revised patch for this? -- Mark Dilger
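The find_strongest_dependency selection rule discussed above, including the suggested extra skip for same-attribute-count dependencies with a lower degree, can be sketched like this (hypothetical Python; the real code works on deserialized pg_dependencies data, and the dict representation here is invented for illustration):

```python
def find_strongest_dependency(dependencies, clause_attnums):
    """Pick the strongest applicable functional dependency.

    A dependency applies only when all of its attributes are referenced
    by the clauses.  Among applicable ones, prefer more attributes, and
    on a tie in attribute count prefer the higher degree.
    """
    strongest = None
    for dep in dependencies:  # dep: {"attributes": [...], "degree": float}
        if not set(dep["attributes"]) <= clause_attnums:
            continue  # not all of its columns are referenced by clauses
        if strongest is not None:
            if len(dep["attributes"]) < len(strongest["attributes"]):
                continue  # covers fewer attributes: weaker
            if (len(dep["attributes"]) == len(strongest["attributes"])
                    and dep["degree"] <= strongest["degree"]):
                continue  # same attribute count but not a higher degree
        strongest = dep
    return strongest
```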
On Sat, Nov 30, 2019 at 03:01:31PM -0800, Mark Dilger wrote: > >Are you planning to submit a revised patch for this? > Yes, I'll submit a rebased version of this patch shortly. It got broken because of the recent fix in choose_best_statistics; it shouldn't take long to update the patch. I do have a couple more related patches in the queue, so I want to submit them all at once. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Dec 01, 2019 at 08:08:58PM +0100, Tomas Vondra wrote: >On Sat, Nov 30, 2019 at 03:01:31PM -0800, Mark Dilger wrote: >> >>Are you planning to submit a revised patch for this? >> > >Yes, I'll submit a rebased version of this patch shortly. I got broken >because of the recent fix in choose_best_statistics, shouldn't take long >to update the patch. I do have a couple more related patches in the >queue, so I want to submit them all at once. > OK, here we go - these two patches allow applying multiple extended statistics, both for MCV and functional dependencies. Functional dependencies are simply merged and then applied at once (so without choose_best_statistics), while MCV statistics are considered in a greedy manner by calling choose_best_statistics in a loop. I do have some additional enhancements in the queue, but those are not fully baked yet, so I'll post them later in separate patches. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
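The greedy loop over MCV statistics described above can be sketched roughly like this (hypothetical Python, not the actual C code in statext_mcv_clauselist_selectivity): repeatedly pick the statistics object covering the most attributes among the remaining clauses, factor its estimate into the running selectivity, and drop the clauses it handled. `estimate_with` stands in for the per-statistics MCV estimation, and the dict/frozenset representation is invented for illustration.

```python
def multi_stat_selectivity(stats, clauses, estimate_with):
    """stats: [{"name": ..., "attrs": set-of-attnums}, ...]
    clauses: list of frozensets of attnums referenced by each clause.
    Returns (combined selectivity, clauses left for later estimation)."""
    sel = 1.0
    remaining = list(clauses)
    stats = list(stats)
    while remaining and stats:
        attnums = set().union(*remaining)
        # choose_best_statistics stand-in: most covered attributes wins
        best = max(stats, key=lambda s: len(s["attrs"] & attnums))
        if len(best["attrs"] & attnums) < 2:
            break  # nothing covers two or more columns: stop, fall back
        matched = [c for c in remaining if c <= best["attrs"]]
        if not matched:
            break  # clauses span statistics objects; this sketch gives up
        sel *= estimate_with(best, matched)  # apply one MCV list
        remaining = [c for c in remaining if c not in matched]
        stats.remove(best)  # each statistics object is used at most once
    return sel, remaining
```

With the example from the start of the thread (s1 on (a,b), s2 on (c,d), four equality clauses), both statistics objects are applied and their estimates multiply, instead of only one of them being used.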
On Thu, Dec 05, 2019 at 06:15:54PM +0100, Tomas Vondra wrote: >On Sun, Dec 01, 2019 at 08:08:58PM +0100, Tomas Vondra wrote: >>On Sat, Nov 30, 2019 at 03:01:31PM -0800, Mark Dilger wrote: >>> >>>Are you planning to submit a revised patch for this? >>> >> >>Yes, I'll submit a rebased version of this patch shortly. I got broken >>because of the recent fix in choose_best_statistics, shouldn't take long >>to update the patch. I do have a couple more related patches in the >>queue, so I want to submit them all at once. >> > >OK, here we go - these two patched allow applying multiple extended >statistics, both for MCV and functional dependencies. Functional >dependencies are simply merged and then applied at once (so withouth >choose_best_statistics), statistics are considered in greedy manner by >calling choose_best_statistics in a loop. > OK, this time with the patches actually attached ;-) regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 12/5/19 9:51 AM, Tomas Vondra wrote: > On Thu, Dec 05, 2019 at 06:15:54PM +0100, Tomas Vondra wrote: >> On Sun, Dec 01, 2019 at 08:08:58PM +0100, Tomas Vondra wrote: >>> On Sat, Nov 30, 2019 at 03:01:31PM -0800, Mark Dilger wrote: >>>> >>>> Are you planning to submit a revised patch for this? >>>> >>> >>> Yes, I'll submit a rebased version of this patch shortly. I got broken >>> because of the recent fix in choose_best_statistics, shouldn't take long >>> to update the patch. I do have a couple more related patches in the >>> queue, so I want to submit them all at once. >>> >> >> OK, here we go - these two patched allow applying multiple extended >> statistics, both for MCV and functional dependencies. Functional >> dependencies are simply merged and then applied at once (so withouth >> choose_best_statistics), statistics are considered in greedy manner by >> calling choose_best_statistics in a loop. >> > > OK, this time with the patches actually attached ;-) These look good to me. I added extra tests (not included in this email) to verify the code on more interesting test cases, such as partitioned tables and joins. Your test cases are pretty trivial, just being selects from a single table. I'll go mark this "ready for committer". -- Mark Dilger
On Mon, Dec 09, 2019 at 11:56:39AM -0800, Mark Dilger wrote: > > >On 12/5/19 9:51 AM, Tomas Vondra wrote: >>On Thu, Dec 05, 2019 at 06:15:54PM +0100, Tomas Vondra wrote: >>>On Sun, Dec 01, 2019 at 08:08:58PM +0100, Tomas Vondra wrote: >>>>On Sat, Nov 30, 2019 at 03:01:31PM -0800, Mark Dilger wrote: >>>>> >>>>>Are you planning to submit a revised patch for this? >>>>> >>>> >>>>Yes, I'll submit a rebased version of this patch shortly. I got broken >>>>because of the recent fix in choose_best_statistics, shouldn't take long >>>>to update the patch. I do have a couple more related patches in the >>>>queue, so I want to submit them all at once. >>>> >>> >>>OK, here we go - these two patched allow applying multiple extended >>>statistics, both for MCV and functional dependencies. Functional >>>dependencies are simply merged and then applied at once (so withouth >>>choose_best_statistics), statistics are considered in greedy manner by >>>calling choose_best_statistics in a loop. >>> >> >>OK, this time with the patches actually attached ;-) >These look good to me. I added extra tests (not included in this email) >to verify the code on more interesting test cases, such as partitioned >tables and within joins. Your test cases are pretty trivial, just being >selects from a single table. > Adding more complex tests like that seems like a good idea - maybe you'd like to share them? >I'll go mark this "ready for committer". > Thanks for the review. I'll hold off on the commit until the next CF, though, just to give others a proper opportunity to look at it. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 12/9/19 2:00 PM, Tomas Vondra wrote: >> >> These look good to me. I added extra tests (not included in this email) >> to verify the code on more interesting test cases, such as partitioned >> tables and within joins. Your test cases are pretty trivial, just being >> selects from a single table. >> > > Adding such more complex tests seem like a good idea, maybe you'd like > to share them? You can find them attached. I did not include them in my earlier email because they seem a bit unrefined, taking too many lines of code for the amount of coverage they provide. But you can prune them down and add them to the patch if you like. These only test the functional dependencies. If you want to include something like them in your commit, you might create similar tests for the mcv statistics, too. -- Mark Dilger
Hi, I've pushed these two patches after some minor improvements, mostly to the comments. I ended up not using the extra tests, as it wasn't clear to me they were worth the extra run time. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services