Thread: plan time of MASSIVE partitioning ...
hello everybody,

we came across an issue which turned out to be more serious than previously expected. imagine a system with, say, 1000 partitions (heavily indexed) or so. the time taken by the planner is already fairly heavy in this case.

i tried this one with 5000 unindexed tables (just one col):

test=# \timing
Timing is on.
test=# prepare x(int4) AS select * from t_data order by id desc;
PREPARE
Time: 361.552 ms

you will see similar or higher runtimes in the case of 500 partitions and a handful of indexes.

does anybody see a potential way to take a shortcut through the planner? a prepared query is no solution here, as constraint exclusion would be dead in this case (making the entire thing an even bigger drama).

did anybody think of a solution to this problem. or more precisely: can there be a solution to this problem?

many thanks,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
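For reference, a minimal sketch that reproduces this kind of planner overhead follows. It is only an approximation of the setup described above (the loop and the child-table names are illustrative; the original test scripts are not part of this thread), and the DO block needs PostgreSQL 9.0 or later; on older releases the same statements can be generated by an external script.

-- One-column parent table plus a few thousand inheritance children,
-- mirroring the "5000 unindexed tables (just one col)" experiment.
CREATE TABLE t_data (id int4);

DO $$
BEGIN
    FOR i IN 1..5000 LOOP
        EXECUTE 'CREATE TABLE t_data_' || i || ' () INHERITS (t_data)';
    END LOOP;
END;
$$;

-- With \timing on in psql, the PREPARE alone exposes the planning cost of
-- expanding the whole inheritance tree; expect something in the ballpark of
-- the 361 ms figure quoted above, depending on hardware:
--   PREPARE x(int4) AS SELECT * FROM t_data ORDER BY id DESC;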
* PostgreSQL - Hans-Jürgen Schönig (postgres@cybertec.at) wrote: > did anybody think of a solution to this problem. > or more precisely: can there be a solution to this problem? Please post to the correct list (-performance) and provide information like PG version, postgresql.conf, the actual table definition, the resulting query plan, etc, etc... Thanks, Stephen
On Sep 3, 2010, at 2:04 PM, Stephen Frost wrote:

> * PostgreSQL - Hans-Jürgen Schönig (postgres@cybertec.at) wrote:
>> did anybody think of a solution to this problem.
>> or more precisely: can there be a solution to this problem?
>
> Please post to the correct list (-performance) and provide information
> like PG version, postgresql.conf, the actual table definition, the
> resulting query plan, etc, etc...
>
> Thanks,
>
> Stephen

hello stephen,

this seems like more of a developer question to me than a performance one.
it is not related to the table structure at all - it is basically an issue with incredibly large inheritance lists.
it applies to postgres 9 and most likely to everything before. postgresql.conf is not relevant at all at this point. the plan is pretty fine.
the question is rather: does anybody see a chance to handle such lists more efficiently inside postgres?
also, it is not the point whether my data structure is sane or not. it is really more generic - namely a shortcut for this case inside the planning process.

many thanks,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
* PostgreSQL - Hans-Jürgen Schönig (postgres@cybertec.at) wrote:
> this seems like more of a developer question to me than a performance one.
> it is not related to the table structure at all - it is basically an issue with incredibly large inheritance lists.
> it applies to postgres 9 and most likely to everything before.
> postgresql.conf is not relevant at all at this point.

Really? What's constraint_exclusion set to? Is GEQO getting used? What are the GEQO parameters set to? Do you have any CHECK constraints on the tables? If you want someone else to build a test case and start looking into it, it's best to not make them have to guess at what you've done.

> the plan is pretty fine.
> the question is rather: does anybody see a chance to handle such lists more efficiently inside postgres?
> also, it is not the point whether my data structure is sane or not. it is really more generic - namely a shortcut for this case inside the planning process.

Coming up with cases where PG doesn't perform well in a completely contrived unrealistic environment isn't likely to impress anyone to do anything. A small (but complete..) test case which mimics a real world environment and real world problem would go a lot farther. I can certainly believe that people out there have partitions set up with lots of tables and that it will likely grow -- but they probably also have CHECK constraints, have tweaked what constraint_exclusion is set to, have adjusted their work_mem and related settings, maybe tweaked some planner GUCs, etc, etc. This is an area I'm actually interested in and curious about, I'd rather work together on it than be told that the questions I'm asking aren't relevant.

Thanks,

Stephen
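For readers assembling the kind of report asked for here, the relevant settings can be read straight from a psql session; a quick checklist (these are standard GUC names, shown as a sketch rather than an exhaustive list) is:

SHOW server_version;
SHOW constraint_exclusion;   -- 'partition' is the usual setting for inheritance partitioning
SHOW geqo;
SHOW geqo_threshold;
SHOW work_mem;

-- or dump all planner-related settings in one go:
SELECT name, setting
FROM pg_settings
WHERE category LIKE 'Query Tuning%';

Together with the table, index, and CHECK constraint definitions, that is usually enough for someone else to reproduce the planning behaviour.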
2010/9/3 PostgreSQL - Hans-Jürgen Schönig <postgres@cybertec.at>: > i tried this one with 5000 unindexed tables (just one col): > > test=# \timing > Timing is on. > test=# prepare x(int4) AS select * from t_data order by id desc; > PREPARE > Time: 361.552 ms > > you will see similar or higher runtimes in case of 500 partitions and a handful of indexes. I'd like to see (1) a script to reproduce your test environment (as Stephen also requested) and (2) gprof or oprofile results. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
PostgreSQL - Hans-Jürgen Schönig <postgres@cybertec.at> writes:
> imagine a system with, say, 1000 partitions (heavily indexed) or so. the time taken by the planner is already fairly heavy in this case.

As the fine manual points out, the current scheme for managing partitioned tables isn't intended to scale past a few dozen partitions.

I think we'll be able to do better when we have an explicit representation of partitioning, since then the planner won't have to expend large amounts of effort reverse-engineering knowledge about how an inheritance tree is partitioned. Before that happens, it's not really worth the trouble to worry about such cases.

regards, tom lane
On Sep 3, 2010, at 4:40 PM, Tom Lane wrote:

> PostgreSQL - Hans-Jürgen Schönig <postgres@cybertec.at> writes:
>> imagine a system with, say, 1000 partitions (heavily indexed) or so. the time taken by the planner is already fairly heavy in this case.
>
> As the fine manual points out, the current scheme for managing
> partitioned tables isn't intended to scale past a few dozen partitions.
>
> I think we'll be able to do better when we have an explicit
> representation of partitioning, since then the planner won't
> have to expend large amounts of effort reverse-engineering knowledge
> about how an inheritance tree is partitioned. Before that happens,
> it's not really worth the trouble to worry about such cases.
>
> regards, tom lane

thank you ... the manual is clear here, but we wanted to see if there is some reasonably low hanging fruit to get around this. it is no solution but at least a clear statement ...

many thanks,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
On Tue, Sep 7, 2010 at 2:14 PM, Boszormenyi Zoltan <zb@cybertec.at> wrote: > Hi, > > Robert Haas írta: >> 2010/9/3 PostgreSQL - Hans-Jürgen Schönig <postgres@cybertec.at>: >> >>> i tried this one with 5000 unindexed tables (just one col): >>> >>> test=# \timing >>> Timing is on. >>> test=# prepare x(int4) AS select * from t_data order by id desc; >>> PREPARE >>> Time: 361.552 ms >>> >>> you will see similar or higher runtimes in case of 500 partitions and a handful of indexes. >>> >> >> I'd like to see (1) a script to reproduce your test environment (as >> Stephen also requested) and (2) gprof or oprofile results. >> > > attached are the test scripts, create_tables.sql and childtables.sql. > The following query takes 4.7 seconds according to psql with \timing on: > EXPLAIN SELECT * FROM qdrs > WHERE streamstart BETWEEN '2010-04-06' AND '2010-06-25' > ORDER BY streamhash; Neat. Have you checked what effect this has on memory consumption? Also, don't forget to add it to https://commitfest.postgresql.org/action/commitfest_view/open -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
hello ...

no, we have not checked memory consumption.
there is still some stuff left to optimize away - it seems we are going close to O(n^2) somewhere.
"equal" is called really often in our sample case as well:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds     calls  s/call  s/call  name
18.87      0.80     0.80      4812    0.00    0.00  make_canonical_pathkey
15.33      1.45     0.65      4811    0.00    0.00  get_eclass_for_sort_expr
14.15      2.05     0.60   8342410    0.00    0.00  equal
 6.13      2.31     0.26    229172    0.00    0.00  SearchCatCache
 3.66      2.47     0.16   5788835    0.00    0.00  _equalList
 3.07      2.60     0.13   1450043    0.00    0.00  hash_search_with_hash_value
 2.36      2.70     0.10   2272545    0.00    0.00  AllocSetAlloc
 2.12      2.79     0.09    811460    0.00    0.00  hash_any
 1.89      2.87     0.08   3014983    0.00    0.00  list_head
 1.89      2.95     0.08    574526    0.00    0.00  _bt_compare
 1.77      3.02     0.08  11577670    0.00    0.00  list_head
 1.42      3.08     0.06      1136    0.00    0.00  tzload
 0.94      3.12     0.04   2992373    0.00    0.00  AllocSetFreeIndex

look at the number of calls ... "equal" is scary ...
make_canonical_pathkey is fixed, it seems. get_eclass_for_sort_expr seems a little more complex to fix.

great you like it ...

regards,

hans

On Sep 8, 2010, at 3:54 PM, Robert Haas wrote:

> On Tue, Sep 7, 2010 at 2:14 PM, Boszormenyi Zoltan <zb@cybertec.at> wrote:
>> Hi,
>>
>> Robert Haas írta:
>>> 2010/9/3 PostgreSQL - Hans-Jürgen Schönig <postgres@cybertec.at>:
>>>> i tried this one with 5000 unindexed tables (just one col):
>>>>
>>>> test=# \timing
>>>> Timing is on.
>>>> test=# prepare x(int4) AS select * from t_data order by id desc;
>>>> PREPARE
>>>> Time: 361.552 ms
>>>>
>>>> you will see similar or higher runtimes in case of 500 partitions and a handful of indexes.
>>>
>>> I'd like to see (1) a script to reproduce your test environment (as
>>> Stephen also requested) and (2) gprof or oprofile results.
>>
>> attached are the test scripts, create_tables.sql and childtables.sql.
>> The following query takes 4.7 seconds according to psql with \timing on:
>> EXPLAIN SELECT * FROM qdrs
>> WHERE streamstart BETWEEN '2010-04-06' AND '2010-06-25'
>> ORDER BY streamhash;
>
> Neat. Have you checked what effect this has on memory consumption?
>
> Also, don't forget to add it to
> https://commitfest.postgresql.org/action/commitfest_view/open
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise Postgres Company

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
* Hans-Jürgen Schönig (postgres@cybertec.at) wrote: > no, we have not checked memory consumption. > there is still some stuff left to optimize away - it seems we are going close to O(n^2) somewhere. > "equal" is called really often in our sample case as well: Did the mail with the scripts, etc, get hung up due to size or something..? I didn't see it on the mailing list nor in the archives.. If so, could you post them somewhere so others could look..? Thanks, Stephen
here is the patch again. we accidentally attached a wrong test file to the original posting so it grew too big. we had to revoke it from the moderator (this happens if you code from 8am to 10pm). here is just the patch - it is nice and small.

you can easily test it by making yourself a nice parent table, many subtables (hundreds or thousands) and a decent number of indexes per partition. then run PREPARE with \timing to see what happens. you should get scary planning times. the more potential indexes and tables, the more scary it will be. using this wonderful RB tree the time for this function call goes down to basically zero.

i hope this is something which is useful to some folks out there.

many thanks,

hans

On Sep 8, 2010, at 4:18 PM, Stephen Frost wrote:

> * Hans-Jürgen Schönig (postgres@cybertec.at) wrote:
>> no, we have not checked memory consumption.
>> there is still some stuff left to optimize away - it seems we are going close to O(n^2) somewhere.
>> "equal" is called really often in our sample case as well:
>
> Did the mail with the scripts, etc, get hung up due to size or
> something..? I didn't see it on the mailing list nor in the archives..
> If so, could you post them somewhere so others could look..?
>
> Thanks,
>
> Stephen

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
Attachment
* Robert Haas (robertmhaas@gmail.com) wrote: > Neat. Have you checked what effect this has on memory consumption? > > Also, don't forget to add it to > https://commitfest.postgresql.org/action/commitfest_view/open Would be good to have the patch updated to be against HEAD before posting to the commitfest. Thanks, Stephen
On Sep 8, 2010, at 4:57 PM, Stephen Frost wrote:

> * Robert Haas (robertmhaas@gmail.com) wrote:
>> Neat. Have you checked what effect this has on memory consumption?
>>
>> Also, don't forget to add it to
>> https://commitfest.postgresql.org/action/commitfest_view/open
>
> Would be good to have the patch updated to be against HEAD before
> posting to the commitfest.
>
> Thanks,
>
> Stephen

we will definitely provide something which is for HEAD.
but, it seems the problem we are looking at is not sufficiently fixed yet.
in our case we shaved off some 18% of planning time or so - looking at the other top 2 functions i got the feeling that more can be done to reduce this. i guess we have to attack this as well.

regards,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
* Hans-Jürgen Schönig (postgres@cybertec.at) wrote:
> but, it seems the problem we are looking at is not sufficiently fixed yet.
> in our case we shaved off some 18% of planning time or so - looking at the other top 2 functions i got the feeling that more can be done to reduce this. i guess we have to attack this as well.

An 18% improvement is certainly nice, provided it doesn't slow down or break other things.. I'm looking through the patch now actually and I'm not really happy with the naming, comments, or some of the code flow, but I think the concept looks reasonable.

Thanks,

Stephen
2010/9/8 Hans-Jürgen Schönig <postgres@cybertec.at>:
> On Sep 8, 2010, at 4:57 PM, Stephen Frost wrote:
>
>> * Robert Haas (robertmhaas@gmail.com) wrote:
>>> Neat. Have you checked what effect this has on memory consumption?
>>>
>>> Also, don't forget to add it to
>>> https://commitfest.postgresql.org/action/commitfest_view/open
>>
>> Would be good to have the patch updated to be against HEAD before
>> posting to the commitfest.
>
> we will definitely provide something which is for HEAD.
> but, it seems the problem we are looking at is not sufficiently fixed yet.
> in our case we shaved off some 18% of planning time or so - looking at the other top 2 functions i got the feeling that more can be done to reduce this. i guess we have to attack this as well.

Just remember that four small patches (say) are apt to get committed faster than one big one.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
* Robert Haas (robertmhaas@gmail.com) wrote:
> 2010/9/8 Hans-Jürgen Schönig <postgres@cybertec.at>:
> > but, it seems the problem we are looking at is not sufficiently fixed yet.
> > in our case we shaved off some 18% of planning time or so - looking at the other top 2 functions i got the feeling that more can be done to reduce this. i guess we have to attack this as well.
>
> Just remember that four small patches (say) are apt to get committed
> faster than one big one.

Indeed, but code like this makes me wonder if this is really working the way it's supposed to:

+       val1 = (long)pk_left->pk_eclass;
+       val2 = (long)pk_right->pk_eclass;
+
+       if (val1 < val2)
+               return -1;
+       else if (val1 > val2)
+               return 1;

Have you compared how big the tree gets to the size of the list it's supposed to be replacing..?

Thanks,

Stephen
Excerpts from Stephen Frost's message of mié sep 08 11:26:55 -0400 2010:
> * Hans-Jürgen Schönig (postgres@cybertec.at) wrote:
> > but, it seems the problem we are looking at is not sufficiently fixed yet.
> > in our case we shaved off some 18% of planning time or so - looking at the other top 2 functions i got the feeling that more can be done to reduce this. i guess we have to attack this as well.
>
> An 18% improvement is certainly nice, provided it doesn't slow down or
> break other things.. I'm looking through the patch now actually and
> I'm not really happy with the naming, comments, or some of the code
> flow, but I think the concept looks reasonable.

I don't understand the layering between pg_tree and rbtree. Why does it exist at all? At first I thought this was another implementation of rbtrees, but then I noticed it sits on top of it. Is this really necessary?

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Hans-Jürgen Schönig <postgres@cybertec.at> writes: > here is the patch again. This patch seems remarkably documentation-free. What is it trying to accomplish and what is it doing to the planner data structures? (Which do have documentation BTW.) Also, what will it do to runtime in normal cases where the pathkeys list isn't that long? regards, tom lane
Stephen Frost írta: > * Robert Haas (robertmhaas@gmail.com) wrote: > >> 2010/9/8 Hans-Jürgen Schönig <postgres@cybertec.at>: >> >>> but, it seems the problem we are looking is not sufficiently fixed yet. >>> in our case we shaved off some 18% of planning time or so - looking at the other top 2 functions i got the feeling thatmore can be done to reduce this. i guess we have to attack this as well. >>> >> Just remember that four small patches (say) are apt to get committed >> faster than one big one. >> > > Indeed, but code like this makes me wonder if this is really working the > way it's supposed to: > > + val1 = (long)pk_left->pk_eclass; > + val2 = (long)pk_right->pk_eclass; > + > + if (val1 < val2) > + return -1; > + else if (val1 > val2) > + return 1; > The original code checked for pointers being equal among other conditions. This was an (almost) straight conversion to a comparison function for rbtree. Do you mean casting the pointer to long? Yes, e.g. on 64-bit Windows it wouldn't work. Back to plain pointer comparison. > Have you compared how big the tree gets to the size of the list it's > supposed to be replacing..? > No, but I think it's obvious. Now we have one TreeCell where we had one ListCell. Best regards, Zoltán Böszörményi -- ---------------------------------- Zoltán Böszörményi Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt, Austria Web: http://www.postgresql-support.de http://www.postgresql.at/
Alvaro Herrera írta: > Excerpts from Stephen Frost's message of mié sep 08 11:26:55 -0400 2010: > >> * Hans-Jürgen Schönig (postgres@cybertec.at) wrote: >> >>> but, it seems the problem we are looking is not sufficiently fixed yet. >>> in our case we shaved off some 18% of planning time or so - looking at the other top 2 functions i got the feeling thatmore can be done to reduce this. i guess we have to attack this as well. >>> >> An 18% increase is certainly nice, provided it doesn't slow down or >> break other things.. I'm looking through the patch now actually and >> I'm not really happy with the naming, comments, or some of the code >> flow, but I think the concept looks reasonable. >> > > I don't understand the layering between pg_tree and rbtree. Why does it > exist at all? At first I thought this was another implementation of > rbtrees, but then I noticed it sits on top of it. Is this really > necessary? > No, if it's acceptable to omit PlannerInfo from outfuncs.c. Or at least its canon_pathkeys member. Otherwise yes, it's necessary. We need to store (Node *) in a fast searchable way. This applies to anything else that may need to be converted from list to tree to decrease planning time. Like ec_members in EquivalenceClass. Best regards, Zoltán Böszörményi -- ---------------------------------- Zoltán Böszörményi Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt, Austria Web: http://www.postgresql-support.de http://www.postgresql.at/
Stephen Frost <sfrost@snowman.net> writes: > Indeed, but code like this makes me wonder if this is really working the > way it's supposed to: > + val1 = (long)pk_left->pk_eclass; > + val2 = (long)pk_right->pk_eclass; Ugh. Casting a pointer to long? We do have portable ways to do what this is trying to do, but that is not one. (For example, this is guaranteed to misbehave on 64-bit Windows.) Offhand I think PointerGetDatum might be the best way. regards, tom lane
Boszormenyi Zoltan <zb@cybertec.at> writes: > This applies to anything else that may need to be converted > from list to tree to decrease planning time. Like ec_members > in EquivalenceClass. AFAIR, canonical pathkeys are the *only* thing in the planner where pure pointer equality is interesting. So I doubt this hack is of any use for EquivalenceClass, even if the lists were likely to be long which they aren't. regards, tom lane
Stephen Frost <sfrost@snowman.net> writes: > I'm not really happy with the naming, comments, or some of the code > flow, but I think the concept looks reasonable. There seems to be a lot of unnecessary palloc/pfree traffic in this implementation. Getting rid of that might help the speed. regards, tom lane
Tom Lane írta:
> Boszormenyi Zoltan <zb@cybertec.at> writes:
>> This applies to anything else that may need to be converted
>> from list to tree to decrease planning time. Like ec_members
>> in EquivalenceClass.
>
> AFAIR, canonical pathkeys are the *only* thing in the planner where pure
> pointer equality is interesting. So I doubt this hack is of any use for
> EquivalenceClass, even if the lists were likely to be long which they
> aren't.
>
> regards, tom lane

No, for EquivalenceClass->ec_member, I need to do something funnier, like implement compare(Node *, Node *) and use that instead of equal(Node *, Node *)... Something like nodeToString() on both Node * and strcmp() the resulting strings.

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
http://www.postgresql.at/
Tom Lane írta: > Stephen Frost <sfrost@snowman.net> writes: > >> I'm not really happy with the naming, comments, or some of the code >> flow, but I think the concept looks reasonable. >> > > There seems to be a lot of unnecessary palloc/pfree traffic in this > implementation. Getting rid of that might help the speed. > > regards, tom lane > This was a first WIP implementation, I will look into it, thanks. -- ---------------------------------- Zoltán Böszörményi Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt, Austria Web: http://www.postgresql-support.de http://www.postgresql.at/
Boszormenyi Zoltan <zb@cybertec.at> writes:
> Tom Lane írta:
>> AFAIR, canonical pathkeys are the *only* thing in the planner where pure
>> pointer equality is interesting. So I doubt this hack is of any use for
>> EquivalenceClass, even if the lists were likely to be long which they
>> aren't.

> No, for EquivalenceClass->ec_member, I need to do something
> funnier, like implement compare(Node *, Node *) and use that
> instead of equal(Node *, Node *)... Something like nodeToString()
> on both Node * and strcmp() the resulting strings.

Well, (a) that doesn't work (hint: there are fields in nodes that are intentionally ignored by equal()), and (b) I still don't believe that there's an actual bottleneck there. ECs generally aren't very big.

regards, tom lane
Tom Lane írta:
> Boszormenyi Zoltan <zb@cybertec.at> writes:
>> Tom Lane írta:
>>> AFAIR, canonical pathkeys are the *only* thing in the planner where pure
>>> pointer equality is interesting. So I doubt this hack is of any use for
>>> EquivalenceClass, even if the lists were likely to be long which they
>>> aren't.
>
>> No, for EquivalenceClass->ec_member, I need to do something
>> funnier, like implement compare(Node *, Node *) and use that
>> instead of equal(Node *, Node *)... Something like nodeToString()
>> on both Node * and strcmp() the resulting strings.
>
> Well, (a) that doesn't work (hint: there are fields in nodes that are
> intentionally ignored by equal()),

Then this compare() needs to work like equal(), which can ignore the fields that are ignored by equal(), too. nodeToString would need more space anyway and comparing non-equal Nodes can return the !0 result faster.

> and (b) I still don't believe that
> there's an actual bottleneck there. ECs generally aren't very big.

Actually, PlannerInfo->eq_classes needs to be a Tree somehow, the comparator function is not yet final in my head. equal() is called over 8 million times with or without our patch:

without:

  %   cumulative   self              self     total
 time   seconds   seconds     calls  s/call  s/call  name
18.87      0.80     0.80      4812    0.00    0.00  make_canonical_pathkey
15.33      1.45     0.65      4811    0.00    0.00  get_eclass_for_sort_expr
14.15      2.05     0.60   8342410    0.00    0.00  equal
 6.13      2.31     0.26    229172    0.00    0.00  SearchCatCache
 3.66      2.47     0.16   5788835    0.00    0.00  _equalList
 3.07      2.60     0.13   1450043    0.00    0.00  hash_search_with_hash_value
 2.36      2.70     0.10   2272545    0.00    0.00  AllocSetAlloc
 2.12      2.79     0.09    811460    0.00    0.00  hash_any
 1.89      2.87     0.08   3014983    0.00    0.00  list_head
 1.89      2.95     0.08    574526    0.00    0.00  _bt_compare
 1.77      3.02     0.08  11577670    0.00    0.00  list_head
 1.42      3.08     0.06      1136    0.00    0.00  tzload
 0.94      3.12     0.04   2992373    0.00    0.00  AllocSetFreeIndex
 0.94      3.16     0.04     91427    0.00    0.00  _bt_checkkeys
 ...

with:

  %   cumulative   self              self     total
 time   seconds   seconds     calls  s/call  s/call  name
24.51      0.88     0.88      4811    0.00    0.00  get_eclass_for_sort_expr
14.21      1.39     0.51   8342410    0.00    0.00  equal
 8.22      1.69     0.30   5788835    0.00    0.00  _equalList
 5.29      1.88     0.19    229172    0.00    0.00  SearchCatCache
 2.51      1.97     0.09      1136    0.00    0.00  tzload
 2.23      2.05     0.08   3014983    0.00    0.00  list_head
 2.23      2.13     0.08   2283130    0.00    0.00  AllocSetAlloc
 2.09      2.20     0.08    811547    0.00    0.00  hash_any
 2.09      2.28     0.08  11577670    0.00    0.00  list_head
 1.95      2.35     0.07   1450180    0.00    0.00  hash_search_with_hash_value
 1.39      2.40     0.05    640690    0.00    0.00  _bt_compare
 1.39      2.45     0.05    157944    0.00    0.00  LockAcquireExtended
 1.39      2.50     0.05     11164    0.00    0.00  AllocSetCheck
 1.11      2.54     0.04   3010547    0.00    0.00  AllocSetFreeIndex
 1.11      2.58     0.04    874975    0.00    0.00  AllocSetFree
 1.11      2.62     0.04     66211    0.00    0.00  heap_form_tuple
 0.84      2.65     0.03    888128    0.00    0.00  LWLockRelease
 ...

The number of calls is the same for equal and _equalList in both cases, as you can see. And if you compare the number of AllocSetAlloc calls, it's actually smaller with our patch, so it's not that the conversion to Tree caused this.

Best regards,
Zoltán Böszörményi

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
http://www.postgresql.at/
Boszormenyi Zoltan <zb@cybertec.at> writes: > equal() is called over 8 million times with or without our patch: From where, though? You've provided not a shred of evidence that searching large ec_member lists is the problem. Also, did the test case you're using ever make it to the list? regards, tom lane
Tom Lane írta:
> Boszormenyi Zoltan <zb@cybertec.at> writes:
>> equal() is called over 8 million times with or without our patch:
>
> From where, though? You've provided not a shred of evidence that
> searching large ec_member lists is the problem.

Indeed. I have put elog(NOTICE) calls in there to see which list is how long. It turned out that the length of ec_members is either 0 or 1, mostly 1, but the length of eq_classes is constantly growing. This is what I need to attack then.

> Also, did the test case you're using ever make it to the list?

No, because it was too large and because the test case accidentally contained confidential information, we asked Bruce to delete it from the moderator queue. Attached is a shortened test case that does the same and also shows the same problem. The create_table.sql creates the parent table, the table_generator.sql produces the list of constraint-excluded child tables and their indexes. The gprof output of this is only slightly modified, the number of equal() calls is still over 8 million; it is also attached.

Best regards,
Zoltán Böszörményi

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
http://www.postgresql.at/
Attachment
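The attached scripts themselves are not reproduced in the archive; the following is only an approximate sketch of such a generator. The table and column names (inh_parent, timestamp1, timestamp2) are taken from the test query quoted later in the thread, while the number of children and the constraint boundaries are guesses based on the "three months in 4-hour slices" description.

-- Parent table; shape inferred from the test query used later in the thread.
CREATE TABLE inh_parent (
    timestamp1 timestamptz,
    timestamp2 timestamptz
);

-- One child per 4-hour slice over roughly three months, each with a CHECK
-- constraint (so constraint exclusion can prune it) and an index.
DO $$
DECLARE
    t0 timestamptz := '2010-04-01 00:00:00';
BEGIN
    FOR i IN 0..540 LOOP
        EXECUTE 'CREATE TABLE inh_child_' || i
            || ' (CHECK (timestamp1 >= '
            || quote_literal((t0 + i * interval '4 hours')::text)
            || ' AND timestamp1 < '
            || quote_literal((t0 + (i + 1) * interval '4 hours')::text)
            || ')) INHERITS (inh_parent)';
        EXECUTE 'CREATE INDEX inh_child_' || i || '_idx ON inh_child_' || i || ' (timestamp2)';
    END LOOP;
END;
$$;

-- Then, with constraint_exclusion = partition:
--   EXPLAIN SELECT * FROM inh_parent
--    WHERE timestamp1 BETWEEN '2010-04-06' AND '2010-06-25'
--    ORDER BY timestamp2;

A setup along these lines should reproduce the multi-second planning times discussed in the thread, though the exact figures depend on hardware and on the real scripts.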
Hi,

attached is a WIP patch against 9.1 current GIT that converts eq_classes and canon_pathkeys in PlannerInfo. Also attached is the test case again; the slow query is:

explain select * from inh_parent where timestamp1 between '2010-04-06' and '2010-06-25' order by timestamp2;

There is intentionally no data, the planning time is slow. The current GIT version plans this query in 2.4 seconds, the patched version does it in 0.59 seconds according to gprof. The gprof outputs are also attached.

There is one problem with the patch, it doesn't survive "make check". One of the regression tests fails the

Assert(!cur_em->em_is_child);

line in process_equivalence() in equivclass.c, but I couldn't yet find what causes it. The "why" is vaguely clear: something modifies the ec_members list in the eq_classes' tree nodes while the node is in the tree. Because I didn't find the offender yet, I couldn't fix it, so I send this patch as is. I'll try to fix it if someone doesn't beat me in fixing it. :)

The query produces the same EXPLAIN output for both the stock and the patched version, they were checked with diff. I didn't attach it to this mail because of the size constraints. Almost all files are compressed because of this.

Best regards,
Zoltán Böszörményi

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
http://www.postgresql.at/
Attachment
Hi,

Boszormenyi Zoltan írta:
> There is one problem with the patch, it doesn't survive
> "make check". One of the regression tests fails the
> Assert(!cur_em->em_is_child);
> line in process_equivalence() in equivclass.c, but I couldn't
> yet find what causes it. The "why" is vaguely clear:
> something modifies the ec_members list in the eq_classes'
> tree nodes while the node is in the tree. Because I didn't find
> the offender yet, I couldn't fix it, so I send this patch as is.
> I'll try to fix it if someone doesn't beat me in fixing it. :)

I am a little closer to this bug, maybe I even found the cause of it. I found that process_equivalence() is called from:

path/equivclass.c:
    reconsider_outer_join_clause()
    reconsider_full_join_clause()
plan/initsplan.c:
    distribute_qual_to_rels()

The problem is with the two functions in path/equivclass.c, as process_equivalence() and those functions all walk the tree, and the current RBTree code can only deal with one walk at a time. We need to push/pop the iterator state to be able to serve more than one walker. Also, we need to split out the tree-modifying part from process_equivalence() somehow, as the tree walking also cannot deal with node additions and deletions.

Best regards,
Zoltán Böszörményi

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
http://www.postgresql.at/
Boszormenyi Zoltan <zb@cybertec.at> writes:
> The problem is with the two functions in path/equivclass.c,
> as process_equivalence() and those functions all walk
> the tree, and the current RBTree code can only deal with
> one walk at a time. We need to push/pop the iterator state
> to be able to serve more than one walker.

Good luck with that --- the iteration state is embedded in the rbtree.

> Also, we need to split out the tree-modifying part from
> process_equivalence() somehow, as the tree walking
> also cannot deal with node additions and deletions.

That's not happening either. Maybe you need to think of some other data structure to use. Hashtable maybe? dynahash.c at least has reasonably well-defined limitations in this area.

regards, tom lane
Hi,

Tom Lane írta:
> Boszormenyi Zoltan <zb@cybertec.at> writes:
>> The problem is with the two functions in path/equivclass.c,
>> as process_equivalence() and those functions all walk
>> the tree, and the current RBTree code can only deal with
>> one walk at a time. We need to push/pop the iterator state
>> to be able to serve more than one walker.
>
> Good luck with that --- the iteration state is embedded in the rbtree.
>
>> Also, we need to split out the tree-modifying part from
>> process_equivalence() somehow, as the tree walking
>> also cannot deal with node additions and deletions.
>
> That's not happening either. Maybe you need to think of some other data
> structure to use. Hashtable maybe? dynahash.c at least has reasonably
> well-defined limitations in this area.
>
> regards, tom lane

thank you very much for pointing me to dynahash, here is the next version that finally seems to work.

Two patches are attached. The first is the absolute minimum for making it work; this still has the Tree type for canon_pathkeys, and eq_classes got the same treatment as join_rel_list/join_rel_hash has in the current sources: if the list grows larger than 32, a hash table is created. It seems to be enough to do it for get_eclass_for_sort_expr() only, the other users of eq_classes aren't bothered by this change.

The total speedup figure is in the 70+ percent range from these two changes; a somewhat later GIT version than the tree I tested with before shows 1.74 vs. 0.41 seconds runtime for the example query. These are with asserts and profiling enabled of course. Without asserts and profiling enabled, the "time psql" figures are:

$ time psql -p 54321 -c "explain select * from inh_parent where timestamp1 between '2010-04-06' and '2010-06-25' order by timestamp2" >/dev/null

real    0m1.932s
user    0m0.035s
sys     0m0.002s

vs.

real    0m0.630s
user    0m0.033s
sys     0m0.002s

The second patch contains extra infrastructure for the Tree type, it's currently unused, it was created for experimenting with eq_classes being a tree. It may be useful for someone, though.

Best regards,
Zoltán Böszörményi

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
http://www.postgresql.at/
Attachment
On 26.10.2010 18:34, Boszormenyi Zoltan wrote: > thank you very much for pointing me to dynahash, here is the > next version that finally seems to work. > > Two patches are attached, the first is the absolute minimum for > making it work, this still has the Tree type for canon_pathkeys > and eq_classes got the same treatment as join_rel_list/join_rel_hash > has in the current sources: if the list grows larger than 32, a hash table > is created. It seems to be be enough for doing in for > get_eclass_for_sort_expr() > only, the other users of eq_classes aren't bothered by this change. That's better, but can't you use dynahash for canon_pathkeys as well? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas írta: > On 26.10.2010 18:34, Boszormenyi Zoltan wrote: >> thank you very much for pointing me to dynahash, here is the >> next version that finally seems to work. >> >> Two patches are attached, the first is the absolute minimum for >> making it work, this still has the Tree type for canon_pathkeys >> and eq_classes got the same treatment as join_rel_list/join_rel_hash >> has in the current sources: if the list grows larger than 32, a hash >> table >> is created. It seems to be be enough for doing in for >> get_eclass_for_sort_expr() >> only, the other users of eq_classes aren't bothered by this change. > > That's better, but can't you use dynahash for canon_pathkeys as well? Here's a purely dynahash solution. It's somewhat slower than the tree version, 0.45 vs 0.41 seconds in the cached case for the previously posted test case. Best regards, Zoltán Böszörményi -- ---------------------------------- Zoltán Böszörményi Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt, Austria Web: http://www.postgresql-support.de http://www.postgresql.at/
Attachment
Boszormenyi Zoltan írta: > Heikki Linnakangas írta: > >> On 26.10.2010 18:34, Boszormenyi Zoltan wrote: >> >>> thank you very much for pointing me to dynahash, here is the >>> next version that finally seems to work. >>> >>> Two patches are attached, the first is the absolute minimum for >>> making it work, this still has the Tree type for canon_pathkeys >>> and eq_classes got the same treatment as join_rel_list/join_rel_hash >>> has in the current sources: if the list grows larger than 32, a hash >>> table >>> is created. It seems to be be enough for doing in for >>> get_eclass_for_sort_expr() >>> only, the other users of eq_classes aren't bothered by this change. >>> >> That's better, but can't you use dynahash for canon_pathkeys as well? >> > > Here's a purely dynahash solution. It's somewhat slower than > the tree version, 0.45 vs 0.41 seconds in the cached case for the > previously posted test case. > And now in context diff, sorry for my affection towards unified diffs. :-) > Best regards, > Zoltán Böszörményi > > -- ---------------------------------- Zoltán Böszörményi Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt, Austria Web: http://www.postgresql-support.de http://www.postgresql.at/
Attachment
Boszormenyi Zoltan írta: > Boszormenyi Zoltan írta: > >> Heikki Linnakangas írta: >> >> >>> On 26.10.2010 18:34, Boszormenyi Zoltan wrote: >>> >>> >>>> thank you very much for pointing me to dynahash, here is the >>>> next version that finally seems to work. >>>> >>>> Two patches are attached, the first is the absolute minimum for >>>> making it work, this still has the Tree type for canon_pathkeys >>>> and eq_classes got the same treatment as join_rel_list/join_rel_hash >>>> has in the current sources: if the list grows larger than 32, a hash >>>> table >>>> is created. It seems to be be enough for doing in for >>>> get_eclass_for_sort_expr() >>>> only, the other users of eq_classes aren't bothered by this change. >>>> >>>> >>> That's better, but can't you use dynahash for canon_pathkeys as well? >>> >>> >> Here's a purely dynahash solution. It's somewhat slower than >> the tree version, 0.45 vs 0.41 seconds in the cached case for the >> previously posted test case. >> >> > > And now in context diff, sorry for my affection towards unified diffs. :-) > A little better version, no need for the heavy hash_any, hash_uint32 on the lower 32 bits on pk_eclass is enough. The profiling runtime is now 0.42 seconds vs the previous 0.41 seconds for the tree version. Best regards, Zoltán Böszörményi -- ---------------------------------- Zoltán Böszörményi Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt, Austria Web: http://www.postgresql-support.de http://www.postgresql.at/
Attachment
Boszormenyi Zoltan írta:
> Boszormenyi Zoltan írta:
>> Boszormenyi Zoltan írta:
>>> Heikki Linnakangas írta:
>>>> On 26.10.2010 18:34, Boszormenyi Zoltan wrote:
>>>>> thank you very much for pointing me to dynahash, here is the
>>>>> next version that finally seems to work.
>>>>>
>>>>> Two patches are attached, the first is the absolute minimum for
>>>>> making it work, this still has the Tree type for canon_pathkeys
>>>>> and eq_classes got the same treatment as join_rel_list/join_rel_hash
>>>>> has in the current sources: if the list grows larger than 32, a hash
>>>>> table is created. It seems to be be enough for doing in for
>>>>> get_eclass_for_sort_expr() only, the other users of eq_classes
>>>>> aren't bothered by this change.
>>>>
>>>> That's better, but can't you use dynahash for canon_pathkeys as well?
>>>
>>> Here's a purely dynahash solution. It's somewhat slower than
>>> the tree version, 0.45 vs 0.41 seconds in the cached case for the
>>> previously posted test case.
>>
>> And now in context diff, sorry for my affection towards unified diffs. :-)
>
> A little better version, no need for the heavy hash_any, hash_uint32
> on the lower 32 bits on pk_eclass is enough. The profiling runtime
> is now 0.42 seconds vs the previous 0.41 seconds for the tree version.
>
> Best regards,
> Zoltán Böszörményi

Btw, the top entries in the current gprof output are:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds     calls ms/call ms/call  name
19.05      0.08     0.08       482    0.17    0.29  add_child_rel_equivalences
11.90      0.13     0.05   1133447    0.00    0.00  bms_is_subset
 9.52      0.17     0.04    331162    0.00    0.00  hash_search_with_hash_value
 7.14      0.20     0.03    548971    0.00    0.00  AllocSetAlloc
 4.76      0.22     0.02      2858    0.01    0.01  get_tabstat_entry
 4.76      0.24     0.02      1136    0.02    0.02  tzload

This means add_child_rel_equivalences() still takes too much time; the previously posted test case calls this function 482 times - it's called for almost every 10th entry added to eq_classes. The elog() I put into this function says that at the last call list_length(eq_classes) == 4754.

Best regards,
Zoltán Böszörményi

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
http://www.postgresql.at/
On 28.10.2010 13:54, Boszormenyi Zoltan wrote: > A little better version, no need for the heavy hash_any, hash_uint32 > on the lower 32 bits on pk_eclass is enough. The profiling runtime > is now 0.42 seconds vs the previous 0.41 seconds for the tree version. Actually, I wonder if we could just have a separate canon_pathkeys list for each EquivalenceClass, instead of one big list in PlannerInfo. I'm not too familiar with equivalence classes and all that, but the attached patch at least passes the regressions. I haven't done any performance testing, but I would expect this to be even faster than the hashtable or tree implementations, and a lot simpler. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Attachment
Boszormenyi Zoltan <zb@cybertec.at> writes: > This means add_child_rel_equivalences() is still takes > too much time, the previously posted test case calls this > function 482 times, it's called for almost every 10th entry > added to eq_classes. The elog() I put into this function says > that at the last call list_length(eq_classes) == 4754. That seems like a ridiculously large number of ECs. What is the test query again? regards, tom lane
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Actually, I wonder if we could just have a separate canon_pathkeys list > for each EquivalenceClass, instead of one big list in PlannerInfo. I'm > not too familiar with equivalence classes and all that, Hm. I don't like getting rid of the main canon_pathkeys list like that. The whole point of a canonical pathkey is that there is only one, so it seems like we need a central list. But it might be sane for each EC to have an additional, side list of PKs made from it. regards, tom lane
Tom Lane írta: > Boszormenyi Zoltan <zb@cybertec.at> writes: > >> This means add_child_rel_equivalences() is still takes >> too much time, the previously posted test case calls this >> function 482 times, it's called for almost every 10th entry >> added to eq_classes. The elog() I put into this function says >> that at the last call list_length(eq_classes) == 4754. >> > > That seems like a ridiculously large number of ECs. What is the > test query again? > > regards, tom lane > The test case is here: http://archives.postgresql.org/message-id/4CBD9DDC.4040304@cybertec.at create_table.sql for the main table plus childtables.sql.gz, the EXPLAIN query is in the message body. Basically, it's a model for a lot of data for three months, partitioned by 4 hour intervals for every day. Imagine the call list handled by a phone company in a large country. Best regards, Zoltán Böszörményi -- ---------------------------------- Zoltán Böszörményi Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt, Austria Web: http://www.postgresql-support.de http://www.postgresql.at/
Boszormenyi Zoltan <zb@cybertec.at> writes:
> Tom Lane írta:
>> That seems like a ridiculously large number of ECs. What is the
>> test query again?

> The test case is here:
> http://archives.postgresql.org/message-id/4CBD9DDC.4040304@cybertec.at

After poking through that a bit, I think that the real issue is in this division of labor:

        index_pathkeys = build_index_pathkeys(root, index,
                                              ForwardScanDirection);
        useful_pathkeys = truncate_useless_pathkeys(root, rel,
                                                    index_pathkeys);

If you trace what is happening here, the index pathkeys that actually survive the "usefulness" test all refer to exactly ONE equivalence class, namely the one arising from the query's "order by timestamp2" clause. All the other pathkeys that get created are immediately discarded as being irrelevant to the query. The reason that we end up with so many equivalence classes is that there is nothing causing the variables of the different child tables to be recognized as all sort-equivalent.

Maybe that's a bug in itself, but I would argue that the right way to make this faster is to refactor things so that we don't generate useless equivalence classes in the first place, or at least don't keep them around in the planner's lists once we realize they're useless.

I like Heikki's hack to cut down on searching in make_canonical_pathkey, but I think that complicating the data structure searching beyond that is just a band-aid. Reasonably-sized queries shouldn't contain very many equivalence classes: they should only come from equality clauses or sort conditions that appeared in the query text. Therefore, there also shouldn't be all that many distinct pathkeys.

regards, tom lane
I wrote: > the right way to make this faster is to refactor things so that we > don't generate useless equivalence classes in the first place, or > at least don't keep them around in the planner's lists once we realize > they're useless. After a bit of hacking, I propose the attached patch. > I like Heikki's hack to cut down on searching in make_canonical_pathkey, > but I think that complicating the data structure searching beyond that > is just a band-aid. With the given test case and this patch, we end up with exactly two canonical pathkeys referencing a single EquivalenceClass. So as far as I can tell there's not a lot of point in refining the pathkey searching. Now, the EquivalenceClass has got 483 members, which means that there's still some O(N^2) behavior in get_eclass_for_sort_expr. There might be some use in refining the search for a matching eclass member. It's not sticking out in profiling like it did before though. regards, tom lane diff --git a/src/backend/optimizer/README b/src/backend/optimizer/README index d6402cf911817b1b8c17da91019a1fac19bf051a..5c0786f2fe6dea9a72ad66ba93aa8833ab0e26ba 100644 *** a/src/backend/optimizer/README --- b/src/backend/optimizer/README *************** sort ordering was important; and so usin *** 632,640 **** orderings doesn't create any real problem. - Though Bob Devine <bob.devine@worldnet.att.net> was not involved in the - coding of our optimizer, he is available to field questions about - optimizer topics. -- bjm & tgl --- 632,670 ---- orderings doesn't create any real problem. + Order of processing for EquivalenceClasses and PathKeys + ------------------------------------------------------- + + As alluded to above, there is a specific sequence of phases in the + processing of EquivalenceClasses and PathKeys during planning. During the + initial scanning of the query's quals (deconstruct_jointree followed by + reconsider_outer_join_clauses), we construct EquivalenceClasses based on + mergejoinable clauses found in the quals. At the end of this process, + we know all we can know about equivalence of different variables, so + subsequently there will be no further merging of EquivalenceClasses. + At that point it is possible to consider the EquivalenceClasses as + "canonical" and build canonical PathKeys that reference them. Before + we reach that point (actually, before entering query_planner at all) + we also ensure that we have constructed EquivalenceClasses for all the + expressions used in the query's ORDER BY and related clauses. These + classes might or might not get merged together, depending on what we + find in the quals. + + Because all the EquivalenceClasses are known before we begin path + generation, we can use them as a guide to which indexes are of interest: + if an index's column is not mentioned in any EquivalenceClass then that + index's sort order cannot possibly be helpful for the query. This allows + short-circuiting of much of the processing of create_index_paths() for + irrelevant indexes. + + There are some cases where planner.c constructs additional + EquivalenceClasses and PathKeys after query_planner has completed. + In these cases, the extra ECs/PKs are needed to represent sort orders + that were not considered during query_planner. Such situations should be + minimized since it is impossible for query_planner to return a plan + producing such a sort order, meaning a explicit sort will always be needed. 
+ Currently this happens only for queries involving multiple window functions + with different orderings, so extra sorts are needed anyway. -- bjm & tgl diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c index e44e960b5454d4698ed82e4e857794ffe2a9adf2..c101c272a14b2f1b9d92a54670688df057d84a13 100644 *** a/src/backend/optimizer/path/equivclass.c --- b/src/backend/optimizer/path/equivclass.c *************** static bool reconsider_full_join_clause( *** 78,83 **** --- 78,87 ---- * join. (This is the reason why we need a failure return. It's more * convenient to check this case here than at the call sites...) * + * On success return, we have also initialized the clause's left_ec/right_ec + * fields to point to the EquivalenceClass built from it. This saves lookup + * effort later. + * * Note: constructing merged EquivalenceClasses is a standard UNION-FIND * problem, for which there exist better data structures than simple lists. * If this code ever proves to be a bottleneck then it could be sped up --- *************** process_equivalence(PlannerInfo *root, R *** 106,111 **** --- 110,119 ---- *em2; ListCell *lc1; + /* Should not already be marked as having generated an eclass */ + Assert(restrictinfo->left_ec == NULL); + Assert(restrictinfo->right_ec == NULL); + /* Extract info from given clause */ Assert(is_opclause(clause)); opno = ((OpExpr *) clause)->opno; *************** process_equivalence(PlannerInfo *root, R *** 236,243 **** { ec1->ec_sources = lappend(ec1->ec_sources, restrictinfo); ec1->ec_below_outer_join |= below_outer_join; /* mark the RI as usable with this pair of EMs */ - /* NB: can't set left_ec/right_ec until merging is finished */ restrictinfo->left_em = em1; restrictinfo->right_em = em2; return true; --- 244,253 ---- { ec1->ec_sources = lappend(ec1->ec_sources, restrictinfo); ec1->ec_below_outer_join |= below_outer_join; + /* mark the RI as associated with this eclass */ + restrictinfo->left_ec = ec1; + restrictinfo->right_ec = ec1; /* mark the RI as usable with this pair of EMs */ restrictinfo->left_em = em1; restrictinfo->right_em = em2; return true; *************** process_equivalence(PlannerInfo *root, R *** 266,271 **** --- 276,284 ---- ec2->ec_relids = NULL; ec1->ec_sources = lappend(ec1->ec_sources, restrictinfo); ec1->ec_below_outer_join |= below_outer_join; + /* mark the RI as associated with this eclass */ + restrictinfo->left_ec = ec1; + restrictinfo->right_ec = ec1; /* mark the RI as usable with this pair of EMs */ restrictinfo->left_em = em1; restrictinfo->right_em = em2; *************** process_equivalence(PlannerInfo *root, R *** 276,281 **** --- 289,297 ---- em2 = add_eq_member(ec1, item2, item2_relids, false, item2_type); ec1->ec_sources = lappend(ec1->ec_sources, restrictinfo); ec1->ec_below_outer_join |= below_outer_join; + /* mark the RI as associated with this eclass */ + restrictinfo->left_ec = ec1; + restrictinfo->right_ec = ec1; /* mark the RI as usable with this pair of EMs */ restrictinfo->left_em = em1; restrictinfo->right_em = em2; *************** process_equivalence(PlannerInfo *root, R *** 286,291 **** --- 302,310 ---- em1 = add_eq_member(ec2, item1, item1_relids, false, item1_type); ec2->ec_sources = lappend(ec2->ec_sources, restrictinfo); ec2->ec_below_outer_join |= below_outer_join; + /* mark the RI as associated with this eclass */ + restrictinfo->left_ec = ec2; + restrictinfo->right_ec = ec2; /* mark the RI as usable with this pair of EMs */ restrictinfo->left_em = em1; restrictinfo->right_em = em2; 
*************** process_equivalence(PlannerInfo *root, R *** 311,316 **** --- 330,338 ---- root->eq_classes = lappend(root->eq_classes, ec); + /* mark the RI as associated with this eclass */ + restrictinfo->left_ec = ec; + restrictinfo->right_ec = ec; /* mark the RI as usable with this pair of EMs */ restrictinfo->left_em = em1; restrictinfo->right_em = em2; *************** add_eq_member(EquivalenceClass *ec, Expr *** 362,376 **** /* * get_eclass_for_sort_expr * Given an expression and opfamily info, find an existing equivalence ! * class it is a member of; if none, build a new single-member * EquivalenceClass for it. * * sortref is the SortGroupRef of the originating SortGroupClause, if any, * or zero if not. (It should never be zero if the expression is volatile!) * * This can be used safely both before and after EquivalenceClass merging; * since it never causes merging it does not invalidate any existing ECs ! * or PathKeys. * * Note: opfamilies must be chosen consistently with the way * process_equivalence() would do; that is, generated from a mergejoinable --- 384,402 ---- /* * get_eclass_for_sort_expr * Given an expression and opfamily info, find an existing equivalence ! * class it is a member of; if none, optionally build a new single-member * EquivalenceClass for it. * * sortref is the SortGroupRef of the originating SortGroupClause, if any, * or zero if not. (It should never be zero if the expression is volatile!) * + * If create_it is TRUE, we'll build a new EquivalenceClass when there is no + * match. If create_it is FALSE, we just return NULL when no match. + * * This can be used safely both before and after EquivalenceClass merging; * since it never causes merging it does not invalidate any existing ECs ! * or PathKeys. However, ECs added after path generation has begun are ! * of limited usefulness, so usually it's best to create them beforehand. * * Note: opfamilies must be chosen consistently with the way * process_equivalence() would do; that is, generated from a mergejoinable *************** get_eclass_for_sort_expr(PlannerInfo *ro *** 382,388 **** Expr *expr, Oid expr_datatype, List *opfamilies, ! Index sortref) { EquivalenceClass *newec; EquivalenceMember *newem; --- 408,415 ---- Expr *expr, Oid expr_datatype, List *opfamilies, ! Index sortref, ! bool create_it) { EquivalenceClass *newec; EquivalenceMember *newem; *************** get_eclass_for_sort_expr(PlannerInfo *ro *** 426,433 **** } } /* ! * No match, so build a new single-member EC * * Here, we must be sure that we construct the EC in the right context. We * can assume, however, that the passed expr is long-lived. --- 453,464 ---- } } + /* No match; does caller want a NULL result? */ + if (!create_it) + return NULL; + /* ! * OK, build a new single-member EC * * Here, we must be sure that we construct the EC in the right context. We * can assume, however, that the passed expr is long-lived. *************** create_join_clause(PlannerInfo *root, *** 1094,1101 **** rinfo->parent_ec = parent_ec; /* ! * We can set these now, rather than letting them be looked up later, ! * since this is only used after EC merging is complete. */ rinfo->left_ec = ec; rinfo->right_ec = ec; --- 1125,1132 ---- rinfo->parent_ec = parent_ec; /* ! * We know the correct values for left_ec/right_ec, ie this particular EC, ! * so we can just set them directly instead of forcing another lookup. 
 */
		rinfo->left_ec = ec;
		rinfo->right_ec = ec;

diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index ab3b9cd394d9af54bcd59e9830a9fbcc0a773fa8..bea134b3f1a02db8c611a6fc46d5bba041961682 100644
*** a/src/backend/optimizer/path/joinpath.c
--- b/src/backend/optimizer/path/joinpath.c
*************** select_mergejoin_clauses(PlannerInfo *ro
*** 1041,1047 ****
  		 * mergejoin is not really all that big a deal, and so it's not clear
  		 * that improving this is important.
  		 */
! 		cache_mergeclause_eclasses(root, restrictinfo);
  
  		if (EC_MUST_BE_REDUNDANT(restrictinfo->left_ec) ||
  			EC_MUST_BE_REDUNDANT(restrictinfo->right_ec))
--- 1041,1047 ----
  		 * mergejoin is not really all that big a deal, and so it's not clear
  		 * that improving this is important.
  		 */
! 		update_mergeclause_eclasses(root, restrictinfo);
  
  		if (EC_MUST_BE_REDUNDANT(restrictinfo->left_ec) ||
  			EC_MUST_BE_REDUNDANT(restrictinfo->right_ec))
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 643d57a92dc83f6adf4e2d1da223909528d59805..20c6d73617d507b88fce427a681320a684b47cd0 100644
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
*************** static PathKey *make_pathkey_from_sortin
*** 40,45 ****
--- 40,46 ----
  					  Expr *expr, Oid ordering_op,
  					  bool nulls_first,
  					  Index sortref,
+ 					  bool create_it,
  					  bool canonicalize);
  static Var *find_indexkey_var(PlannerInfo *root, RelOptInfo *rel,
  				  AttrNumber varattno);
*************** canonicalize_pathkeys(PlannerInfo *root,
*** 230,235 ****
--- 231,240 ----
   * If the PathKey is being generated from a SortGroupClause, sortref should be
   * the SortGroupClause's SortGroupRef; otherwise zero.
   *
+  * create_it is TRUE if we should create any missing EquivalenceClass
+  * needed to represent the sort key.  If it's FALSE, we return NULL if the
+  * sort key isn't already present in any EquivalenceClass.
+  *
   * canonicalize should always be TRUE after EquivalenceClass merging has
   * been performed, but FALSE if we haven't done EquivalenceClass merging yet.
   */
*************** make_pathkey_from_sortinfo(PlannerInfo *
*** 238,243 ****
--- 243,249 ----
  						   Expr *expr, Oid ordering_op,
  						   bool nulls_first,
  						   Index sortref,
+ 						   bool create_it,
  						   bool canonicalize)
  {
  	Oid			opfamily,
*************** make_pathkey_from_sortinfo(PlannerInfo *
*** 300,308 ****
  							   COERCE_DONTCARE);
  	}
  
! 	/* Now find or create a matching EquivalenceClass */
  	eclass = get_eclass_for_sort_expr(root, expr, opcintype, opfamilies,
! 									  sortref);
  
  	/* And finally we can find or create a PathKey node */
  	if (canonicalize)
--- 306,318 ----
  							   COERCE_DONTCARE);
  	}
  
! 	/* Now find or (optionally) create a matching EquivalenceClass */
  	eclass = get_eclass_for_sort_expr(root, expr, opcintype, opfamilies,
! 									  sortref, create_it);
! 
! 	/* Fail if no EC and !create_it */
! 	if (!eclass)
! 		return NULL;
  
  	/* And finally we can find or create a PathKey node */
  	if (canonicalize)
*************** get_cheapest_fractional_path_for_pathkey
*** 478,485 ****
   * The result is canonical, meaning that redundant pathkeys are removed;
   * it may therefore have fewer entries than there are index columns.
   *
! * We generate the full pathkeys list whether or not all are useful for the
! * current query.  Caller should do truncate_useless_pathkeys().
   */
  List *
  build_index_pathkeys(PlannerInfo *root,
--- 488,498 ----
   * The result is canonical, meaning that redundant pathkeys are removed;
   * it may therefore have fewer entries than there are index columns.
   *
!  * Another reason for stopping early is that we may be able to tell that
!  * an index column's sort order is uninteresting for this query.  However,
!  * that test is just based on the existence of an EquivalenceClass and not
!  * on position in pathkey lists, so it's not complete.  Caller should call
!  * truncate_useless_pathkeys() to possibly remove more pathkeys.
   */
  List *
  build_index_pathkeys(PlannerInfo *root,
*************** build_index_pathkeys(PlannerInfo *root,
*** 527,540 ****
  			indexprs_item = lnext(indexprs_item);
  		}
  
! 		/* OK, make a canonical pathkey for this sort key */
  		cpathkey = make_pathkey_from_sortinfo(root,
  											  indexkey,
  											  sortop,
  											  nulls_first,
  											  0,
  											  true);
  
  		/* Add to list unless redundant */
  		if (!pathkey_is_redundant(cpathkey, retval))
  			retval = lappend(retval, cpathkey);
--- 540,562 ----
  			indexprs_item = lnext(indexprs_item);
  		}
  
! 		/* OK, try to make a canonical pathkey for this sort key */
  		cpathkey = make_pathkey_from_sortinfo(root,
  											  indexkey,
  											  sortop,
  											  nulls_first,
  											  0,
+ 											  false,
  											  true);
  
+ 		/*
+ 		 * If the sort key isn't already present in any EquivalenceClass,
+ 		 * then it's not an interesting sort order for this query.  So
+ 		 * we can stop now --- lower-order sort keys aren't useful either.
+ 		 */
+ 		if (!cpathkey)
+ 			break;
+ 
  		/* Add to list unless redundant */
  		if (!pathkey_is_redundant(cpathkey, retval))
  			retval = lappend(retval, cpathkey);
*************** convert_subquery_pathkeys(PlannerInfo *r
*** 644,656 ****
  												 outer_expr,
  												 sub_member->em_datatype,
  												 sub_eclass->ec_opfamilies,
! 												 0);
! 				best_pathkey =
! 					make_canonical_pathkey(root,
! 										   outer_ec,
! 										   sub_pathkey->pk_opfamily,
! 										   sub_pathkey->pk_strategy,
! 										   sub_pathkey->pk_nulls_first);
  			}
  		}
  		else
--- 666,685 ----
  												 outer_expr,
  												 sub_member->em_datatype,
  												 sub_eclass->ec_opfamilies,
! 												 0,
! 												 false);
! 
! 				/*
! 				 * If we don't find a matching EC, sub-pathkey isn't
! 				 * interesting to the outer query
! 				 */
! 				if (outer_ec)
! 					best_pathkey =
! 						make_canonical_pathkey(root,
! 											   outer_ec,
! 											   sub_pathkey->pk_opfamily,
! 											   sub_pathkey->pk_strategy,
! 											   sub_pathkey->pk_nulls_first);
  			}
  		}
  		else
*************** convert_subquery_pathkeys(PlannerInfo *r
*** 738,744 ****
  											 outer_expr,
  											 sub_member->em_datatype,
  											 sub_eclass->ec_opfamilies,
! 											 0);
  				outer_pk = make_canonical_pathkey(root,
  												  outer_ec,
  												  sub_pathkey->pk_opfamily,
--- 767,782 ----
  											 outer_expr,
  											 sub_member->em_datatype,
  											 sub_eclass->ec_opfamilies,
! 											 0,
! 											 false);
! 
! 				/*
! 				 * If we don't find a matching EC, this sub-pathkey isn't
! 				 * interesting to the outer query
! 				 */
! 				if (!outer_ec)
! 					continue;
! 
  				outer_pk = make_canonical_pathkey(root,
  												  outer_ec,
  												  sub_pathkey->pk_opfamily,
*************** make_pathkeys_for_sortclauses(PlannerInf
*** 859,864 ****
--- 897,903 ----
  											   sortcl->sortop,
  											   sortcl->nulls_first,
  											   sortcl->tleSortGroupRef,
+ 											   true,
  											   canonicalize);
  
  		/* Canonical form eliminates redundant ordering keys */
*************** make_pathkeys_for_sortclauses(PlannerInf
*** 878,923 ****
   ****************************************************************************/
  
  /*
!  * cache_mergeclause_eclasses
!  *		Make the cached EquivalenceClass links valid in a mergeclause
!  *		restrictinfo.
   *
   * RestrictInfo contains fields in which we may cache pointers to
   * EquivalenceClasses for the left and right inputs of the mergeclause.
   * (If the mergeclause is a true equivalence clause these will be the
!  * same EquivalenceClass, otherwise not.)
   */
  void
! cache_mergeclause_eclasses(PlannerInfo *root, RestrictInfo *restrictinfo)
  {
  	Assert(restrictinfo->mergeopfamilies != NIL);
  
! 	/* the cached values should be either both set or both not */
! 	if (restrictinfo->left_ec == NULL)
! 	{
! 		Expr	   *clause = restrictinfo->clause;
! 		Oid			lefttype,
! 					righttype;
! 
! 		/* Need the declared input types of the operator */
! 		op_input_types(((OpExpr *) clause)->opno, &lefttype, &righttype);
! 
! 		/* Find or create a matching EquivalenceClass for each side */
! 		restrictinfo->left_ec =
! 			get_eclass_for_sort_expr(root,
! 									 (Expr *) get_leftop(clause),
! 									 lefttype,
! 									 restrictinfo->mergeopfamilies,
! 									 0);
! 		restrictinfo->right_ec =
! 			get_eclass_for_sort_expr(root,
! 									 (Expr *) get_rightop(clause),
! 									 righttype,
! 									 restrictinfo->mergeopfamilies,
! 									 0);
! 	}
! 	else
! 		Assert(restrictinfo->right_ec != NULL);
  }
  
  /*
--- 917,997 ----
   ****************************************************************************/
  
  /*
!  * initialize_mergeclause_eclasses
!  *		Set the EquivalenceClass links in a mergeclause restrictinfo.
   *
   * RestrictInfo contains fields in which we may cache pointers to
   * EquivalenceClasses for the left and right inputs of the mergeclause.
   * (If the mergeclause is a true equivalence clause these will be the
!  * same EquivalenceClass, otherwise not.)  If the mergeclause is either
!  * used to generate an EquivalenceClass, or derived from an EquivalenceClass,
!  * then it's easy to set up the left_ec and right_ec members --- otherwise,
!  * this function should be called to set them up.  We will generate new
!  * EquivalenceClauses if necessary to represent the mergeclause's left and
!  * right sides.
!  *
!  * Note this is called before EC merging is complete, so the links won't
!  * necessarily point to canonical ECs.  Before they are actually used for
!  * anything, update_mergeclause_eclasses must be called to ensure that
!  * they've been updated to point to canonical ECs.
   */
  void
! initialize_mergeclause_eclasses(PlannerInfo *root, RestrictInfo *restrictinfo)
  {
+ 	Expr	   *clause = restrictinfo->clause;
+ 	Oid			lefttype,
+ 				righttype;
+ 
+ 	/* Should be a mergeclause ... */
  	Assert(restrictinfo->mergeopfamilies != NIL);
+ 	/* ... with links not yet set */
+ 	Assert(restrictinfo->left_ec == NULL);
+ 	Assert(restrictinfo->right_ec == NULL);
  
! 	/* Need the declared input types of the operator */
! 	op_input_types(((OpExpr *) clause)->opno, &lefttype, &righttype);
! 
! 	/* Find or create a matching EquivalenceClass for each side */
! 	restrictinfo->left_ec =
! 		get_eclass_for_sort_expr(root,
! 								 (Expr *) get_leftop(clause),
! 								 lefttype,
! 								 restrictinfo->mergeopfamilies,
! 								 0,
! 								 true);
! 	restrictinfo->right_ec =
! 		get_eclass_for_sort_expr(root,
! 								 (Expr *) get_rightop(clause),
! 								 righttype,
! 								 restrictinfo->mergeopfamilies,
! 								 0,
! 								 true);
! }
! 
! /*
!  * update_mergeclause_eclasses
!  *		Make the cached EquivalenceClass links valid in a mergeclause
!  *		restrictinfo.
!  *
!  * These pointers should have been set by process_equivalence or
!  * initialize_mergeclause_eclasses, but they might have been set to
!  * non-canonical ECs that got merged later.  Chase up to the canonical
!  * merged parent if so.
!  */
! void
! update_mergeclause_eclasses(PlannerInfo *root, RestrictInfo *restrictinfo)
! {
! 	/* Should be a merge clause ... */
! 	Assert(restrictinfo->mergeopfamilies != NIL);
! 	/* ... with pointers already set */
! 	Assert(restrictinfo->left_ec != NULL);
! 	Assert(restrictinfo->right_ec != NULL);
! 
! 	/* Chase up to the top as needed */
! 	while (restrictinfo->left_ec->ec_merged)
! 		restrictinfo->left_ec = restrictinfo->left_ec->ec_merged;
! 	while (restrictinfo->right_ec->ec_merged)
! 		restrictinfo->right_ec = restrictinfo->right_ec->ec_merged;
  }
  
  /*
*************** find_mergeclauses_for_pathkeys(PlannerIn
*** 953,959 ****
  	{
  		RestrictInfo *rinfo = (RestrictInfo *) lfirst(i);
  
! 		cache_mergeclause_eclasses(root, rinfo);
  	}
  
  	foreach(i, pathkeys)
--- 1027,1033 ----
  	{
  		RestrictInfo *rinfo = (RestrictInfo *) lfirst(i);
  
! 		update_mergeclause_eclasses(root, rinfo);
  	}
  
  	foreach(i, pathkeys)
*************** select_outer_pathkeys_for_merge(PlannerI
*** 1089,1095 ****
  		ListCell   *lc2;
  
  		/* get the outer eclass */
! 		cache_mergeclause_eclasses(root, rinfo);
  
  		if (rinfo->outer_is_left)
  			oeclass = rinfo->left_ec;
--- 1163,1169 ----
  		ListCell   *lc2;
  
  		/* get the outer eclass */
! 		update_mergeclause_eclasses(root, rinfo);
  
  		if (rinfo->outer_is_left)
  			oeclass = rinfo->left_ec;
*************** make_inner_pathkeys_for_merge(PlannerInf
*** 1250,1256 ****
  		EquivalenceClass *ieclass;
  		PathKey    *pathkey;
  
! 		cache_mergeclause_eclasses(root, rinfo);
  
  		if (rinfo->outer_is_left)
  		{
--- 1324,1330 ----
  		EquivalenceClass *ieclass;
  		PathKey    *pathkey;
  
! 		update_mergeclause_eclasses(root, rinfo);
  
  		if (rinfo->outer_is_left)
  		{
*************** pathkeys_useful_for_merging(PlannerInfo
*** 1366,1372 ****
  			if (restrictinfo->mergeopfamilies == NIL)
  				continue;
  
! 			cache_mergeclause_eclasses(root, restrictinfo);
  
  			if (pathkey->pk_eclass == restrictinfo->left_ec ||
  				pathkey->pk_eclass == restrictinfo->right_ec)
--- 1440,1446 ----
  			if (restrictinfo->mergeopfamilies == NIL)
  				continue;
  
! 			update_mergeclause_eclasses(root, restrictinfo);
  
  			if (pathkey->pk_eclass == restrictinfo->left_ec ||
  				pathkey->pk_eclass == restrictinfo->right_ec)
diff --git a/src/backend/optimizer/plan/initsplan.c b/src/backend/optimizer/plan/initsplan.c
index a2ff314679daa30003d59d04ec01066049e1cb13..77ebadd5cc6a634f5e07886c19a3e4c71dbe3c31 100644
*** a/src/backend/optimizer/plan/initsplan.c
--- b/src/backend/optimizer/plan/initsplan.c
*************** distribute_qual_to_rels(PlannerInfo *roo
*** 1066,1071 ****
--- 1066,1077 ----
  	 *
  	 * If none of the above hold, pass it off to
  	 * distribute_restrictinfo_to_rels().
+ 	 *
+ 	 * In all cases, it's important to initialize the left_ec and right_ec
+ 	 * fields of a mergejoinable clause, so that all possibly mergejoinable
+ 	 * expressions have representations in EquivalenceClasses.  If
+ 	 * process_equivalence is successful, it will take care of that;
+ 	 * otherwise, we have to call initialize_mergeclause_eclasses to do it.
  	 */
  	if (restrictinfo->mergeopfamilies)
  	{
*************** distribute_qual_to_rels(PlannerInfo *roo
*** 1073,1082 ****
  		{
  			if (process_equivalence(root, restrictinfo, below_outer_join))
  				return;
! 			/* EC rejected it, so pass to distribute_restrictinfo_to_rels */
  		}
  		else if (maybe_outer_join && restrictinfo->can_join)
  		{
  			if (bms_is_subset(restrictinfo->left_relids,
  							  outerjoin_nonnullable) &&
  				!bms_overlap(restrictinfo->right_relids,
--- 1079,1093 ----
  		{
  			if (process_equivalence(root, restrictinfo, below_outer_join))
  				return;
! 			/* EC rejected it, so set left_ec/right_ec the hard way ... */
! 			initialize_mergeclause_eclasses(root, restrictinfo);
! 			/* ... and fall through to distribute_restrictinfo_to_rels */
  		}
  		else if (maybe_outer_join && restrictinfo->can_join)
  		{
+ 			/* we need to set up left_ec/right_ec the hard way */
+ 			initialize_mergeclause_eclasses(root, restrictinfo);
+ 			/* now see if it should go to any outer-join lists */
  			if (bms_is_subset(restrictinfo->left_relids,
  							  outerjoin_nonnullable) &&
  				!bms_overlap(restrictinfo->right_relids,
*************** distribute_qual_to_rels(PlannerInfo *roo
*** 1104,1109 ****
--- 1115,1126 ----
  											restrictinfo);
  				return;
  			}
+ 			/* nope, so fall through to distribute_restrictinfo_to_rels */
+ 		}
+ 		else
+ 		{
+ 			/* we still need to set up left_ec/right_ec */
+ 			initialize_mergeclause_eclasses(root, restrictinfo);
  		}
  	}
*************** process_implied_equality(PlannerInfo *ro
*** 1404,1409 ****
--- 1421,1429 ----
   *
   * This overlaps the functionality of process_implied_equality(), but we
   * must return the RestrictInfo, not push it into the joininfo tree.
+  *
+  * Note: we do not do initialize_mergeclause_eclasses() here.  It is
+  * caller's responsibility that left_ec/right_ec be set as necessary.
   */
  RestrictInfo *
  build_implied_join_equality(Oid opno,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 5f628eeeb96174a3c9051257e235baff43efcacd..6bce53cce57d5c3d8551ad12c371e36100c009ff 100644
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** extern EquivalenceClass *get_eclass_for_
*** 116,122 ****
  					  Expr *expr,
  					  Oid expr_datatype,
  					  List *opfamilies,
! 					  Index sortref);
  extern void generate_base_implied_equalities(PlannerInfo *root);
  extern List *generate_join_implied_equalities(PlannerInfo *root,
  							  RelOptInfo *joinrel,
--- 116,123 ----
  					  Expr *expr,
  					  Oid expr_datatype,
  					  List *opfamilies,
! 					  Index sortref,
! 					  bool create_it);
  extern void generate_base_implied_equalities(PlannerInfo *root);
  extern List *generate_join_implied_equalities(PlannerInfo *root,
  							  RelOptInfo *joinrel,
*************** extern List *make_pathkeys_for_sortclaus
*** 172,178 ****
  							  List *sortclauses,
  							  List *tlist,
  							  bool canonicalize);
! extern void cache_mergeclause_eclasses(PlannerInfo *root,
  							  RestrictInfo *restrictinfo);
  extern List *find_mergeclauses_for_pathkeys(PlannerInfo *root,
  							  List *pathkeys,
--- 173,181 ----
  							  List *sortclauses,
  							  List *tlist,
  							  bool canonicalize);
! extern void initialize_mergeclause_eclasses(PlannerInfo *root,
! 							  RestrictInfo *restrictinfo);
! extern void update_mergeclause_eclasses(PlannerInfo *root,
  							  RestrictInfo *restrictinfo);
  extern List *find_mergeclauses_for_pathkeys(PlannerInfo *root,
  							  List *pathkeys,
Tom Lane írta:
> I wrote:
>> the right way to make this faster is to refactor things so that we
>> don't generate useless equivalence classes in the first place, or
>> at least don't keep them around in the planner's lists once we realize
>> they're useless.
>
> After a bit of hacking, I propose the attached patch.
>
>> I like Heikki's hack to cut down on searching in make_canonical_pathkey,
>> but I think that complicating the data structure searching beyond that
>> is just a band-aid.
>
> With the given test case and this patch, we end up with exactly two
> canonical pathkeys referencing a single EquivalenceClass.  So as far
> as I can tell there's not a lot of point in refining the pathkey
> searching.  Now, the EquivalenceClass has got 483 members, which
> means that there's still some O(N^2) behavior in
> get_eclass_for_sort_expr.  There might be some use in refining the
> search for a matching eclass member.  It's not sticking out in
> profiling like it did before though.
>
> 			regards, tom lane

Thanks, this patch made get_eclass_for_sort_expr almost disappear from
the gprof timing, and made make_canonical_pathkey and
add_child_rel_equivalences disappear from it completely. +1 for
including this in 9.1.

On the other hand, if I use a test case similar to my original one
(i.e. the tables are much wider), then query planning takes 1.42 seconds
in 9.1 with this patch, instead of the roughly 4.7 seconds we observed
with PostgreSQL 9.0.0. The beginning of the gprof output now looks like
this:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 21.13      0.30     0.30   235091     0.00     0.00  SearchCatCache
  7.04      0.40     0.10  1507206     0.00     0.00  hash_search_with_hash_value
  3.52      0.45     0.05  2308219     0.00     0.00  AllocSetAlloc
  3.52      0.50     0.05   845776     0.00     0.00  hash_any
  3.52      0.55     0.05   341637     0.00     0.00  HeapTupleSatisfiesNow
  3.52      0.60     0.05     1136     0.00     0.00  tzload
  2.82      0.64     0.04      547     0.00     0.00  get_rel_data_width
  2.11      0.67     0.03   669414     0.00     0.00  hash_search
  2.11      0.70     0.03   235091     0.00     0.00  SearchSysCache
  2.11      0.73     0.03   192590     0.00     0.00  copyObject
  2.11      0.76     0.03   164457     0.00     0.00  pgstat_initstats
  2.11      0.79     0.03   152999     0.00     0.00  index_getnext
...

Use the attached synthetic create_table_wide.sql together with the
previous childtables.sql. The full compressed gprof output is attached.

Your patch gives a 70% speedup in planning time, which is excellent.

Best regards,
Zoltán Böszörményi

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
     http://www.postgresql.at/
> On the other hand, if I use a similar test case to my original one
> (i.e. the tables are much wider) then the query planning takes
> 1.42 seconds in 9.1 with this patch instead of about 4.7 seconds
> as we observed it using PostgreSQL 9.0.0. The beginning of the gprof
> output now looks like this:

Hi,

I'm really interested in this patch. I tried a simple test case:

create table t (a integer, b text);

DO $$DECLARE i int;
BEGIN
FOR i IN 0..9000 LOOP
EXECUTE 'create table t' || i || ' ( CHECK (a >' || i*10 || ' and a <= ' || (i+1)*10 || ' ) ) INHERITS (t)';
EXECUTE 'create index tidx' || i || ' ON t' || i || ' (a)';
END LOOP;
END$$;

explain select * from t where a > 1060 and a < 1090;

but I don't get any gain from the patch... explain time is still around
250 ms. Tried with 9000 partitions, time is still 2 secs.

Maybe I've completely missed the purpose of the patch?

(I tried the test case at
http://archives.postgresql.org/message-id/4CBD9DDC.4040304@cybertec.at
and that, in fact, gets a boost with this patch.)


Leonardo
> but I don't get any gain from the patch... explain time is still
> around 250 ms.
> Tried with 9000 partitions, time is still 2 secs.

Small correction: I tried with 3000 partitions (FOR i IN 0..3000 ...) and
got 250 ms with both versions; with 9000 partitions, 2 secs (again, no
gain from the patch).
Leonardo Francalanci <m_lists@yahoo.it> writes: > I tried a simple test case: > create table t (a integer, b text); > DO $$DECLARE i int; > BEGIN > FOR i IN 0..9000 LOOP > EXECUTE 'create table t' || i || ' ( CHECK (a >' || i*10 || ' > and a <= ' || (i+1)*10 || ' ) ) INHERITS (t)'; > EXECUTE 'create index tidx' || i || ' ON t' || i || ' (a)'; > END LOOP; > END$$; > explain select * from t where a > 1060 and a < 1090; This is going to be dominated by constraint exclusion checking. There's basically no fix for that except a more explicit representation of the partitioning rules. If the planner has to make 8999 theorem proofs to remove the 8999 unwanted partitions from the plan, it's gonna take awhile. regards, tom lane
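To make the arithmetic concrete: constraint exclusion runs one proof
attempt per child table, whether or not the child survives, so planning
cost grows with the partition count even when almost everything is
excluded. The toy program below only mimics the shape of that loop --- a
simple range comparison stands in for the real machinery around
relation_excluded_by_constraints and the predicate prover, which is far
more expensive per child --- and the partition layout and query range
follow the test case quoted above.

#include <stdio.h>
#include <stdlib.h>

/* Stand-in for one partition's CHECK (a > lo AND a <= hi) constraint. */
typedef struct ChildConstraint
{
    int     lo;
    int     hi;
} ChildConstraint;

/*
 * Stand-in for the planner's per-child proof: can we show that the
 * child's constraint contradicts "a > qlo AND a < qhi"?  The real
 * prover handles arbitrary predicates, so each call costs far more
 * than this comparison.
 */
static int
child_is_excluded(const ChildConstraint *c, int qlo, int qhi)
{
    return c->hi <= qlo || c->lo >= qhi - 1;
}

int
main(void)
{
    const int   nchildren = 9000;
    ChildConstraint *children = malloc(nchildren * sizeof(ChildConstraint));
    int         kept = 0;
    int         i;

    for (i = 0; i < nchildren; i++)
    {
        children[i].lo = i * 10;
        children[i].hi = (i + 1) * 10;
    }

    /* One proof attempt per child, regardless of how many survive. */
    for (i = 0; i < nchildren; i++)
    {
        if (!child_is_excluded(&children[i], 1060, 1090))
            kept++;
    }

    printf("%d of %d children kept; %d proofs attempted\n",
           kept, nchildren, nchildren);
    free(children);
    return 0;
}

With 9000 children and a query that touches only three of them, the loop
still makes 9000 proof attempts, which is exactly the cost being
described here.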
> This is going to be dominated by constraint exclusion checking. There's
> basically no fix for that except a more explicit representation of the
> partitioning rules.

Damn, I knew that was going to be more complicated :)

So in which cases does this patch help? I guess in a multi-index scenario?

childtables.sql is kind of hard to read (I think a FOR loop would have
been more self-explanatory?).
Boszormenyi Zoltan <zb@cybertec.at> writes:
> On the other hand, if I use a similar test case to my original one
> (i.e. the tables are much wider) then the query planning takes
> 1.42 seconds in 9.1 with this patch instead of about 4.7 seconds
> as we observed it using PostgreSQL 9.0.0. The beginning of the gprof
> output now looks like this:

>   %   cumulative   self              self     total
>  time   seconds   seconds    calls   s/call   s/call  name
>  21.13      0.30     0.30   235091     0.00     0.00  SearchCatCache
>   7.04      0.40     0.10  1507206     0.00     0.00  hash_search_with_hash_value
>   3.52      0.45     0.05  2308219     0.00     0.00  AllocSetAlloc

Yeah, for me it looks even worse: oprofile shows about 77% of time in
SearchCatCache.  I poked around a little and it seems that probably most
of the time is going into searches of the STATRELATTINH syscache, which
looks like this:

$13 = {id = 41, cc_next = 0x2b43a60,
  cc_relname = 0x7f6bc6ed2218 "pg_statistic", cc_reloid = 2619,
  cc_indexoid = 2696, cc_relisshared = 0 '\000', cc_tupdesc = 0x7f6bc6ed11d8,
  cc_ntup = 68922, cc_nbuckets = 1024, cc_nkeys = 3, cc_key = {1, 2, 3, 0},
  ...

Most of those entries are "negative" cache entries, since we don't have
any actual stats in this toy example.

I think that we probably should be very circumspect about believing that
this example is still a good guide to what to optimize next; in
particular, in a real-world example with real stats, I'm not sure that
the hot spots will still be in the same places.  I'd advise loading up
some real data and doing more profiling.

However, if the hot spot does stay in SearchCatCache, I can't help
noticing that those bucket chains are looking a bit overloaded ---
sixty-plus entries per bucket ain't good.  Maybe it's time to teach
catcache.c how to reorganize its hashtables once the load factor
exceeds a certain level.  Or more drastically, maybe it should lose
its private hashtable logic and use dynahash.c; I'm not sure at the
moment if the private implementation has any important characteristics
dynahash hasn't got.

			regards, tom lane
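For reference, "reorganize its hashtables once the load factor exceeds a
certain level" could look roughly like the standalone sketch below. This
is not catcache.c code --- the types and names here are invented for the
example --- but it shows the policy: double the bucket array and
redistribute the chains whenever entries outnumber buckets by some
factor, so chains stay short. The cache is assumed to start with a
power-of-two bucket count.

#include <stdlib.h>

typedef unsigned int uint32;

typedef struct Entry
{
    struct Entry *next;
    uint32        hash;
    /* payload would go here */
} Entry;

typedef struct HashCache
{
    Entry  **buckets;
    int      nbuckets;          /* always a power of two */
    int      nentries;
} HashCache;

/* Double the bucket array once entries outnumber buckets 2:1. */
void
rehash_if_needed(HashCache *cache)
{
    Entry  **newbuckets;
    int      newsize;
    int      i;

    if (cache->nentries < cache->nbuckets * 2)
        return;

    newsize = cache->nbuckets * 2;
    newbuckets = calloc(newsize, sizeof(Entry *));
    if (newbuckets == NULL)
        return;                 /* keep the old table on OOM */

    /* move every entry to its slot in the larger array */
    for (i = 0; i < cache->nbuckets; i++)
    {
        Entry  *e = cache->buckets[i];

        while (e != NULL)
        {
            Entry  *next = e->next;
            int     idx = e->hash & (newsize - 1);

            e->next = newbuckets[idx];
            newbuckets[idx] = e;
            e = next;
        }
    }

    free(cache->buckets);
    cache->buckets = newbuckets;
    cache->nbuckets = newsize;
}

void
insert_entry(HashCache *cache, Entry *e)
{
    int     idx;

    rehash_if_needed(cache);
    idx = e->hash & (cache->nbuckets - 1);
    e->next = cache->buckets[idx];
    cache->buckets[idx] = e;
    cache->nentries++;
}

Under that policy, the 68922-entry, 1024-bucket cache in the dump above
would have been resized long before its chains grew to sixty-plus
entries per bucket.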
I wrote:
> This is going to be dominated by constraint exclusion checking.

Hmm, maybe I spoke too soon.  With 9000 child tables I get a profile
like this:

samples  %        symbol name
447433   47.1553  get_tabstat_entry
185458   19.5456  find_all_inheritors
53064     5.5925  SearchCatCache
33864     3.5690  pg_strtok
26301     2.7719  hash_search_with_hash_value
22577     2.3794  AllocSetAlloc
6696      0.7057  MemoryContextAllocZeroAligned
6250      0.6587  expression_tree_walker
5141      0.5418  LockReleaseAll
4779      0.5037  get_relation_info
4506      0.4749  MemoryContextAlloc
4467      0.4708  expression_tree_mutator
4136      0.4359  pgstat_initstats
3914      0.4125  relation_excluded_by_constraints

get_tabstat_entry and find_all_inheritors are both obviously O(N^2) in
the number of tables they have to deal with.  However, the constant
factors are small enough that you need a heck of a lot of tables
before they become significant consumers of runtime.  I'm not convinced
that we should be optimizing for 9000-child-table cases.

It'd be worth fixing these if we can do it without either introducing a
lot of complexity, or slowing things down for typical cases that have
only a few tables.  Offhand not sure about how to do either.

			regards, tom lane
On Fri, Oct 29, 2010 at 12:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Boszormenyi Zoltan <zb@cybertec.at> writes: >> On the other hand, if I use a similar test case to my original one >> (i.e. the tables are much wider) then the query planning takes >> 1.42 seconds in 9.1 with this patch instead of about 4.7 seconds >> as we observed it using PostgreSQL 9.0.0. The beginning of the gprof >> output now looks like this: > >> % cumulative self self total >> time seconds seconds calls s/call s/call name >> 21.13 0.30 0.30 235091 0.00 0.00 SearchCatCache >> 7.04 0.40 0.10 1507206 0.00 0.00 hash_search_with_hash_value >> 3.52 0.45 0.05 2308219 0.00 0.00 AllocSetAlloc > > Yeah, for me it looks even worse: oprofile shows about 77% of time in > SearchCatCache. I poked around a little and it seems that probably most > of the time is going into searches of the STATRELATTINH syscache, which > looks like this: > > $13 = {id = 41, cc_next = 0x2b43a60, > cc_relname = 0x7f6bc6ed2218 "pg_statistic", cc_reloid = 2619, > cc_indexoid = 2696, cc_relisshared = 0 '\000', cc_tupdesc = 0x7f6bc6ed11d8, > cc_ntup = 68922, cc_nbuckets = 1024, cc_nkeys = 3, cc_key = {1, 2, 3, 0}, > ... > > Most of those entries are "negative" cache entries, since we don't have > any actual stats in this toy example. > > I think that we probably should be very circumspect about believing that > this example is still a good guide to what to optimize next; in > particular, in a real-world example with real stats, I'm not sure that > the hot spots will still be in the same places. I'd advise loading up > some real data and doing more profiling. > > However, if the hot spot does stay in SearchCatCache, I can't help > noticing that those bucket chains are looking a bit overloaded --- > sixty-plus entries per bucket ain't good. Maybe it's time to teach > catcache.c how to reorganize its hashtables once the load factor > exceeds a certain level. Or more drastically, maybe it should lose > its private hashtable logic and use dynahash.c; I'm not sure at the > moment if the private implementation has any important characteristics > dynahash hasn't got. I'm not sure what's happening in this particular case, but I seem to remember poking at a case a while back where we were doing a lot of repeated statistics lookups for the same columns. If that's also the the case here and if there is some way to avoid it (hang a pointer to the stats off the node tree somewhere?) we might be able to cut down on the number of hash probes, as an alternative to or in addition to making them faster. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> Hmm, maybe I spoke too soon.  With 9000 child tables I get a profile
> like this:

Well, the 9000-table test case was meant to check the difference in
performance with/without the patch... I don't see the reason for trying
to optimize such an unrealistic case.

BTW, can someone explain to me in which cases the patch actually helps?
Leonardo Francalanci <m_lists@yahoo.it> writes: > BTW can someone explain to me which are the cases where the > patch actually helps? Cases with lots of irrelevant indexes. Zoltan's example had 4 indexes per child table, only one of which was relevant to the query. In your test case there are no irrelevant indexes, which is why the runtime didn't change. regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Oct 29, 2010 at 12:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> However, if the hot spot does stay in SearchCatCache, I can't help
>> noticing that those bucket chains are looking a bit overloaded ---
>> sixty-plus entries per bucket ain't good.  Maybe it's time to teach
>> catcache.c how to reorganize its hashtables once the load factor
>> exceeds a certain level.  Or more drastically, maybe it should lose
>> its private hashtable logic and use dynahash.c; I'm not sure at the
>> moment if the private implementation has any important characteristics
>> dynahash hasn't got.

> I'm not sure what's happening in this particular case, but I seem to
> remember poking at a case a while back where we were doing a lot of
> repeated statistics lookups for the same columns.  If that's also the
> the case here and if there is some way to avoid it (hang a pointer to
> the stats off the node tree somewhere?) we might be able to cut down
> on the number of hash probes, as an alternative to or in addition to
> making them faster.

I think there are already layers of caching in the planner to avoid
fetching the same stats entries more than once per query.  The problem
here is that there are so many child tables that even fetching stats
once per table per query starts to add up.  (Also, as I said, I'm
worried that we're being misled by the fact that there are no stats to
fetch --- so we're not seeing the costs of actually doing something
with the stats if they existed.)

			regards, tom lane
Excerpts from Tom Lane's message of vie oct 29 14:15:55 -0300 2010: > I wrote: > > This is going to be dominated by constraint exclusion checking. > > Hmm, maybe I spoke too soon. With 9000 child tables I get a profile > like this: > > samples % symbol name > 447433 47.1553 get_tabstat_entry Is there a reason for keeping the pgstat info in plain lists? We could use dynahash there too, I think. It would have more palloc overhead that way, though (hmm, but perhaps that can be fixed by having a smart "alloc" function for it, preallocating a bunch of entries instead of one by one). -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
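A sketch of the "preallocate a bunch of entries" idea (the names below
are invented for the example, not pgstat code): hand out fixed-size
entries from chunks that are themselves allocated in batches, so the
underlying allocator is hit once per 64 entries instead of once per
entry.

#include <stdlib.h>

#define ENTRIES_PER_CHUNK 64

typedef struct TabEntry
{
    unsigned int t_id;          /* table OID in the real thing */
    long         tuples_inserted;
    long         tuples_updated;
    long         tuples_deleted;
} TabEntry;

typedef struct EntryChunk
{
    struct EntryChunk *next;
    int        used;
    TabEntry   entries[ENTRIES_PER_CHUNK];
} EntryChunk;

static EntryChunk *chunks = NULL;

/* Hand out the next free entry, allocating a new chunk only when needed. */
TabEntry *
alloc_tab_entry(void)
{
    if (chunks == NULL || chunks->used >= ENTRIES_PER_CHUNK)
    {
        EntryChunk *c = calloc(1, sizeof(EntryChunk));

        if (c == NULL)
            return NULL;
        c->next = chunks;
        chunks = c;
    }
    return &chunks->entries[chunks->used++];
}

Entries obtained this way could then be linked into a dynahash-style
table without paying a separate allocation per table touched.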
Alvaro Herrera <alvherre@commandprompt.com> writes: > Excerpts from Tom Lane's message of vie oct 29 14:15:55 -0300 2010: >> samples % symbol name >> 447433 47.1553 get_tabstat_entry > Is there a reason for keeping the pgstat info in plain lists? Yeah: anything else loses for small numbers of tables per query, which is the normal case. I'd guess you'd need ~100 tables touched in a single transaction before a hashtable is even worth thinking about. We could possibly adopt a solution similar to the planner's approach for joinrels: start with a simple list, and switch over to hashing if the list gets too long. But I'm really doubtful that it's worth the code space. Even with Zoltan's 500-or-so-table case, this wasn't on the radar screen. regards, tom lane
> Cases with lots of irrelevant indexes.  Zoltan's example had 4 indexes
> per child table, only one of which was relevant to the query.  In your
> test case there are no irrelevant indexes, which is why the runtime
> didn't change.

Mmh... I must be doing something wrong. It looks to me it's not just
the irrelevant indexes: it's the "order by" that counts. If I remove
that, times are the same with and without the patch. Using the test case:

explain select * from inh_parent
  where timestamp1 between '2010-04-06' and '2010-06-25'

this one runs in the same time with the patch; but adding

order by timestamp2

made the non-patched version run 3 times slower.

My test case:

create table t (a integer, b integer, c integer, d integer, e text);

DO $$DECLARE i int;
BEGIN
FOR i IN 0..2000 LOOP
EXECUTE 'create table t' || i || ' ( CHECK (a >' || i*10 || ' and a <= ' || (i+1)*10 || ' ) ) INHERITS (t)';
EXECUTE 'create index taidx' || i || ' ON t' || i || ' (a)';
EXECUTE 'create index tbidx' || i || ' ON t' || i || ' (b)';
EXECUTE 'create index tcidx' || i || ' ON t' || i || ' (c)';
EXECUTE 'create index tdidx' || i || ' ON t' || i || ' (d)';
END LOOP;
END$$;

explain select * from t where a > 1060 and a < 109000

this runs in 1.5 secs with and without the patch. But if I add

order by b

the non-patched version runs in 10 seconds.

Am I getting it wrong?
Leonardo Francalanci <m_lists@yahoo.it> writes: >> Cases with lots of irrelevant indexes. Zoltan's example had 4 indexes >> per child table, only one of which was relevant to the query. In your >> test case there are no irrelevant indexes, which is why the runtime >> didn't change. > Mmh... I must be doing something wrong. It looks to me it's not just > the irrelevant indexes: it's the "order by" that counts. Ah, I oversimplified a bit: actually, if you don't have an ORDER BY or any mergejoinable join clauses, then the possibly_useful_pathkeys test in find_usable_indexes figures out that we aren't interested in the sort ordering of *any* indexes, so the whole thing gets short-circuited. You need at least the possibility of interest in sorted output from an indexscan before any of this code runs. regards, tom lane
I wrote: > samples % symbol name > 447433 47.1553 get_tabstat_entry > 185458 19.5456 find_all_inheritors > 53064 5.5925 SearchCatCache > 33864 3.5690 pg_strtok > get_tabstat_entry and find_all_inheritors are both obviously O(N^2) in > the number of tables they have to deal with. However, the constant > factors are small enough that you need a heck of a lot of tables > before they become significant consumers of runtime. I'm not convinced > that we should be optimizing for 9000-child-table cases. > It'd be worth fixing these if we can do it without either introducing a > lot of complexity, or slowing things down for typical cases that have > only a few tables. Offhand not sure about how to do either. I had a thought about how to make get_tabstat_entry() faster without adding overhead: what if we just plain remove the search, and always assume that a new entry has to be added to the tabstat array? The existing code seems to be designed to make no assumptions about how it's being used, but that's a bit silly. We know that the links are coming from the relcache, which will have only one entry per relation, and that the relcache is designed to hang onto the links for (at least) the life of a transaction. So rather than optimizing for the case where the relcache fails to remember the tabstat link, maybe we should optimize for the case where it does remember. The worst-case consequence AFAICS would be multiple tabstat entries for the same relation, which seems pretty noncritical anyway. Thoughts? regards, tom lane
[ for the archives' sake ]

I wrote:
> I had a thought about how to make get_tabstat_entry() faster without
> adding overhead: what if we just plain remove the search, and always
> assume that a new entry has to be added to the tabstat array?

I spent some time looking into this idea.  It doesn't really work,
because there are places that will break if a transaction has more than
one tabstat entry for the same relation.  The one that seems most
difficult to fix is that pgstat_recv_tabstat() clamps its n_live_tuples
and n_dead_tuples values to be nonnegative after adding in each delta
received from a backend.  That is a good idea because it prevents insane
results if some messages get lost --- but if a transaction's updates get
randomly spread into several tabstat items, the intermediate counts
might get clamped, resulting in a wrong final answer even though nothing
was lost.

I also added some instrumentation printouts and found that in our
regression tests:

* about 10% of get_tabstat_entry() calls find an existing entry for the
relation OID.  This seems to happen only when a relcache entry gets
flushed mid-transaction, but that does happen, and not so infrequently
either.

* about half of the transactions use as many as 20 tabstats, and 10% use
50 or more; but it drops off fast after that.  Almost no transactions
use as many as 100 tabstats.

It's not clear that these numbers are representative of typical database
applications, but they're something to start with anyway.

I also did some testing to compare the cost of get_tabstat_entry's
linear search against a dynahash.c table with OID key.  As I suspected,
a hash table would make this code a *lot* slower for small numbers of
tabstat entries: about a factor of 10 slower.  You need upwards of 100
tabstats touched in a transaction before the hash table begins to pay
for itself.  This is largely because dynahash doesn't have any cheap way
to reset a hashtable to empty, so you have to initialize and destroy the
table for each transaction.  By the time you've eaten that overhead,
you've already expended as many cycles as the linear search takes to
handle several dozen entries.

I conclude that if we wanted to do something about this, the most
practical solution would be the one of executing linear searches until
we get to 100+ tabstat entries in a transaction, and then building a
hashtable for subsequent searches.  However, it's exceedingly unclear
that it will ever be worth the effort or code space to do that.

			regards, tom lane
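For completeness, the hybrid described there --- linear searches until a
transaction has touched 100+ tables, then a hash table for subsequent
lookups --- could be sketched like this. It is a standalone illustration,
not the pgstat code: the structure names, the fixed capacity, and the
bucket count are invented, and only the 100-entry threshold comes from
the measurements above.

#define MAX_TABSTATS   4096
#define HASH_THRESHOLD 100
#define HASH_BUCKETS   1024     /* power of two */

typedef unsigned int Oid;

typedef struct TabStat
{
    Oid     t_id;
    int     hash_next;          /* index of next entry in bucket, or -1 */
    long    tuples_inserted;    /* stand-in for the real counters */
} TabStat;

typedef struct TabStatArray
{
    TabStat entries[MAX_TABSTATS];
    int     nentries;
    int     hash_head[HASH_BUCKETS];    /* unused until hashing starts */
    int     hashed;             /* have we switched to hashed lookups? */
} TabStatArray;

static void
hash_insert(TabStatArray *a, int idx)
{
    int     bucket = a->entries[idx].t_id & (HASH_BUCKETS - 1);

    a->entries[idx].hash_next = a->hash_head[bucket];
    a->hash_head[bucket] = idx;
}

/* Return the index of the entry for relid, creating one (-1 if full). */
int
get_tabstat_idx(TabStatArray *a, Oid relid)
{
    int     i;

    if (a->hashed)
    {
        for (i = a->hash_head[relid & (HASH_BUCKETS - 1)];
             i >= 0;
             i = a->entries[i].hash_next)
        {
            if (a->entries[i].t_id == relid)
                return i;
        }
    }
    else
    {
        /* cheap path: short transactions never pay for hashing */
        for (i = 0; i < a->nentries; i++)
        {
            if (a->entries[i].t_id == relid)
                return i;
        }
    }

    if (a->nentries >= MAX_TABSTATS)
        return -1;

    /* not found: append a new entry */
    i = a->nentries++;
    a->entries[i].t_id = relid;
    a->entries[i].hash_next = -1;
    a->entries[i].tuples_inserted = 0;

    /* switch to hashed lookups once the linear search would get pricey */
    if (!a->hashed && a->nentries > HASH_THRESHOLD)
    {
        int     j;

        for (j = 0; j < HASH_BUCKETS; j++)
            a->hash_head[j] = -1;
        for (j = 0; j < a->nentries; j++)
            hash_insert(a, j);
        a->hashed = 1;
    }
    else if (a->hashed)
        hash_insert(a, i);

    return i;
}

As the message says, below that threshold the plain array scan is
already cheaper than setting up and tearing down a hash table per
transaction, which is exactly why the switch is deferred.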
Tom Lane wrote: > Leonardo Francalanci <m_lists@yahoo.it> writes: > >> Cases with lots of irrelevant indexes. Zoltan's example had 4 indexes > >> per child table, only one of which was relevant to the query. In your > >> test case there are no irrelevant indexes, which is why the runtime > >> didn't change. > > > Mmh... I must be doing something wrong. It looks to me it's not just > > the irrelevant indexes: it's the "order by" that counts. > > Ah, I oversimplified a bit: actually, if you don't have an ORDER BY or > any mergejoinable join clauses, then the possibly_useful_pathkeys test > in find_usable_indexes figures out that we aren't interested in the sort > ordering of *any* indexes, so the whole thing gets short-circuited. > You need at least the possibility of interest in sorted output from an > indexscan before any of this code runs. FYI, I always wondered if the rare use of mergejoins justified the extra planning time of carrying around all those joinpaths. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Bruce Momjian <bruce@momjian.us> writes: > FYI, I always wondered if the rare use of mergejoins justified the extra > planning time of carrying around all those joinpaths. They're hardly rare. regards, tom lane
On Fri, Nov 12, 2010 at 10:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>> FYI, I always wondered if the rare use of mergejoins justified the extra
>> planning time of carrying around all those joinpaths.
>
> They're hardly rare.

They're fairly rare in the sorts of queries I normally issue, but I'd
quibble with the statement on other grounds: IME, we generate far more
nestloop paths than anything else.  The comment in match_unsorted_outer()
says it all:

 * We always generate a nestloop path for each available outer path.
 * In fact we may generate as many as five: one on the cheapest-total-cost
 * inner path, one on the same with materialization, one on the
 * cheapest-startup-cost inner path (if different), one on the
 * cheapest-total inner-indexscan path (if any), and one on the
 * cheapest-startup inner-indexscan path (if different).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company