Thread: plan time of MASSIVE partitioning ...

plan time of MASSIVE partitioning ...

From
PostgreSQL - Hans-Jürgen Schönig
Date:
hello everybody,

we came across an issue which turned out to be more serious than previously expected.
imagine a system with, say, 1000 partitions (heavily indexed) or so. the time taken by the planner is already fairly heavy in this case.

i tried this one with 5000 unindexed tables (just one col):

test=# \timing
Timing is on.
test=# prepare x(int4) AS select * from t_data order by id desc;
PREPARE
Time: 361.552 ms

you will see similar or higher runtimes in case of 500 partitions and a handful of indexes.
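
for reference, the 5000 tables were created with something roughly like this (just a sketch, not the exact script; DO needs 9.0):

create table t_data (id int4);

do $$
begin
    for i in 1 .. 5000 loop
        execute 'create table t_data_' || i || ' () inherits (t_data)';
    end loop;
end;
$$;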

does anybody see a potential way to do a shortcut through the planner?
a prepared query is no solution here as constraint exclusion would be dead in this case (making the entire thing an even bigger drama).
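
(e.g. even if the children had proper CHECK constraints, a prepared statement like

    prepare q(int4) as select * from t_data where id = $1;

would still have to carry a plan covering every single child, because the parameter value is not known at plan time.)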

did anybody think of a solution to this problem.
or more precisely: can there be a solution to this problem?
many thanks,
    hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de



Re: plan time of MASSIVE partitioning ...

From
Stephen Frost
Date:
* PostgreSQL - Hans-Jürgen Schönig (postgres@cybertec.at) wrote:
> did anybody think of a solution to this problem.
> or more precisely: can there be a solution to this problem?

Please post to the correct list (-performance) and provide information
like PG version, postgresql.conf, the actual table definition, the
resulting query plan, etc, etc...
Thanks,
    Stephen

Re: plan time of MASSIVE partitioning ...

From
PostgreSQL - Hans-Jürgen Schönig
Date:
On Sep 3, 2010, at 2:04 PM, Stephen Frost wrote:

> * PostgreSQL - Hans-Jürgen Schönig (postgres@cybertec.at) wrote:
>> did anybody think of a solution to this problem.
>> or more precisely: can there be a solution to this problem?
>
> Please post to the correct list (-performance) and provide information
> like PG version, postgresql.conf, the actual table definition, the
> resulting query plan, etc, etc...
>
>     Thanks,
>
>         Stephen



hello stephen,

this seems like more a developer question to me than a -performance one.
it is not related to the table structure at all - it is basically an issue with incredibly large inheritance lists.
it applies to postgres 9 and most likely to everything before.
postgresql.conf is not relevant at all at this point.

the plan is pretty fine.
the question is rather: does anybody see a chance to handle such lists more efficiently inside postgres?
also, it is not the point if my data structure is sane or not. it is really more generic - namely a shortcut for this case inside the planning process.
many thanks,
    hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de



Re: plan time of MASSIVE partitioning ...

From
Stephen Frost
Date:
* PostgreSQL - Hans-Jürgen Schönig (postgres@cybertec.at) wrote:
> this seems like more a developer question to me than a pre performance one.
> it is not related to the table structure at all - it is basically an issue with incredibly large inheritance lists.
> it applies to postgres 9 and most likely to everything before.
> postgresql.conf is not relevant at all at this point.

Really?  What's constraint_exclusion set to?  Is GEQO getting used?
What are the GEQO parameters set to?  Do you have any CHECK constraints
on the tables?

If you want someone else to build a test case and start looking into it,
it's best to not make them have to guess at what you've done.

> the plan is pretty fine.
> the question is rather: does anybody see a chance to handle such lists more efficiently inside postgres?
> also, it is not the point if my data structure is sane or not. it is really more generic - namely a shortcut for this case inside the planning process.

Coming up with cases where PG doesn't perform well in a completely
contrived unrealistic environment isn't likely to impress anyone to
do anything.

A small (but complete..) test case which mimics a real world environment
and real world problem would go a lot farther.  I can certainly believe
that people out there have partitions set up with lots of tables and
that it will likely grow- but they probably also have CHECK constraints,
have tweaked what constraint_exclusion is set to, have adjusted their
work_mem and related settings, maybe tweaked some planner GUCs, etc,
etc.
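
To be concrete about what I mean (names entirely made up, just an illustration):

SET constraint_exclusion = partition;   -- the default since 8.4

CREATE TABLE measurements (logdate date NOT NULL, value int);

CREATE TABLE measurements_2010_09 (
    CHECK (logdate >= DATE '2010-09-01' AND logdate < DATE '2010-10-01')
) INHERITS (measurements);

CREATE INDEX measurements_2010_09_logdate
    ON measurements_2010_09 (logdate);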

This is an area I'm actually interested in and curious about, I'd rather
work together on it than be told that the questions I'm asking aren't
relevant.
Thanks,
    Stephen

Re: plan time of MASSIVE partitioning ...

From
Robert Haas
Date:
2010/9/3 PostgreSQL - Hans-Jürgen Schönig <postgres@cybertec.at>:
> i tried this one with 5000 unindexed tables (just one col):
>
> test=# \timing
> Timing is on.
> test=# prepare x(int4) AS select * from t_data order by id desc;
> PREPARE
> Time: 361.552 ms
>
> you will see similar or higher runtimes in case of 500 partitions and a handful of indexes.

I'd like to see (1) a script to reproduce your test environment (as
Stephen also requested) and (2) gprof or oprofile results.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: plan time of MASSIVE partitioning ...

From
Tom Lane
Date:
PostgreSQL - Hans-Jürgen Schönig <postgres@cybertec.at> writes:
> imagine a system with, say, 1000 partitions (heavily indexed) or so. the time taken by the planner is already fairly heavy in this case.

As the fine manual points out, the current scheme for managing
partitioned tables isn't intended to scale past a few dozen partitions.

I think we'll be able to do better when we have an explicit
representation of partitioning, since then the planner won't
have to expend large amounts of effort reverse-engineering knowledge
about how an inheritance tree is partitioned.  Before that happens,
it's not really worth the trouble to worry about such cases.
        regards, tom lane


Re: plan time of MASSIVE partitioning ...

From
PostgreSQL - Hans-Jürgen Schönig
Date:
On Sep 3, 2010, at 4:40 PM, Tom Lane wrote:

> PostgreSQL - Hans-Jürgen Schönig <postgres@cybertec.at> writes:
>> imagine a system with, say, 1000 partitions (heavily indexed) or so. the time taken by the planner is already fairly heavy in this case.
>
> As the fine manual points out, the current scheme for managing
> partitioned tables isn't intended to scale past a few dozen partitions.
>
> I think we'll be able to do better when we have an explicit
> representation of partitioning, since then the planner won't
> have to expend large amounts of effort reverse-engineering knowledge
> about how an inheritance tree is partitioned.  Before that happens,
> it's not really worth the trouble to worry about such cases.
>
>             regards, tom lane
>


thank you ... - the manual is clear here but we wanted to see if there is some reasonably low hanging fruit to get around this.
it is no solution but at least a clear statement ...
many thanks,
    hans


--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de



Re: plan time of MASSIVE partitioning ...

From
Robert Haas
Date:
On Tue, Sep 7, 2010 at 2:14 PM, Boszormenyi Zoltan <zb@cybertec.at> wrote:
> Hi,
>
>> Robert Haas wrote:
>> 2010/9/3 PostgreSQL - Hans-Jürgen Schönig <postgres@cybertec.at>:
>>
>>> i tried this one with 5000 unindexed tables (just one col):
>>>
>>> test=# \timing
>>> Timing is on.
>>> test=# prepare x(int4) AS select * from t_data order by id desc;
>>> PREPARE
>>> Time: 361.552 ms
>>>
>>> you will see similar or higher runtimes in case of 500 partitions and a handful of indexes.
>>>
>>
>> I'd like to see (1) a script to reproduce your test environment (as
>> Stephen also requested) and (2) gprof or oprofile results.
>>
>
> attached are the test scripts, create_tables.sql and childtables.sql.
> The following query takes 4.7 seconds according to psql with \timing on:
> EXPLAIN SELECT * FROM qdrs
> WHERE streamstart BETWEEN '2010-04-06' AND '2010-06-25'
> ORDER BY streamhash;

Neat.  Have you checked what effect this has on memory consumption?

Also, don't forget to add it to
https://commitfest.postgresql.org/action/commitfest_view/open

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: plan time of MASSIVE partitioning ...

From
Hans-Jürgen Schönig
Date:
hello ...

no, we have not checked memory consumption.
there is still some stuff left to optimize away - it seems we are going close to O(n^2) somewhere.
"equal" is called really often in our sample case as well:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
18.87      0.80     0.80     4812     0.00     0.00  make_canonical_pathkey
15.33      1.45     0.65     4811     0.00     0.00  get_eclass_for_sort_expr
14.15      2.05     0.60  8342410     0.00     0.00  equal
 6.13      2.31     0.26   229172     0.00     0.00  SearchCatCache
 3.66      2.47     0.16  5788835     0.00     0.00  _equalList
 3.07      2.60     0.13  1450043     0.00     0.00  hash_search_with_hash_value
 2.36      2.70     0.10  2272545     0.00     0.00  AllocSetAlloc
 2.12      2.79     0.09   811460     0.00     0.00  hash_any
 1.89      2.87     0.08  3014983     0.00     0.00  list_head
 1.89      2.95     0.08   574526     0.00     0.00  _bt_compare
 1.77      3.02     0.08 11577670     0.00     0.00  list_head
 1.42      3.08     0.06     1136     0.00     0.00  tzload
 0.94      3.12     0.04  2992373     0.00     0.00  AllocSetFreeIndex


look at the number of calls ...
"equal" is scary ...

make_canonical_pathkey seems to be fixed.
get_eclass_for_sort_expr seems a little more complex to fix.

great you like it ...
regards,
    hans



On Sep 8, 2010, at 3:54 PM, Robert Haas wrote:

> On Tue, Sep 7, 2010 at 2:14 PM, Boszormenyi Zoltan <zb@cybertec.at> wrote:
>> Hi,
>>
>> Robert Haas wrote:
>>> 2010/9/3 PostgreSQL - Hans-Jürgen Schönig <postgres@cybertec.at>:
>>>
>>>> i tried this one with 5000 unindexed tables (just one col):
>>>>
>>>> test=# \timing
>>>> Timing is on.
>>>> test=# prepare x(int4) AS select * from t_data order by id desc;
>>>> PREPARE
>>>> Time: 361.552 ms
>>>>
>>>> you will see similar or higher runtimes in case of 500 partitions and a handful of indexes.
>>>>
>>>
>>> I'd like to see (1) a script to reproduce your test environment (as
>>> Stephen also requested) and (2) gprof or oprofile results.
>>>
>>
>> attached are the test scripts, create_tables.sql and childtables.sql.
>> The following query takes 4.7 seconds according to psql with \timing on:
>> EXPLAIN SELECT * FROM qdrs
>> WHERE streamstart BETWEEN '2010-04-06' AND '2010-06-25'
>> ORDER BY streamhash;
>
> Neat.  Have you checked what effect this has on memory consumption?
>
> Also, don't forget to add it to
> https://commitfest.postgresql.org/action/commitfest_view/open
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise Postgres Company
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: plan time of MASSIVE partitioning ...

From
Stephen Frost
Date:
* Hans-Jürgen Schönig (postgres@cybertec.at) wrote:
> no, we have not checked memory consumption.
> there is still some stuff left to optimize away - it seems we are going close to O(n^2) somewhere.
> "equal" is called really often in our sample case as well:

Did the mail with the scripts, etc, get hung up due to size or
something..?  I didn't see it on the mailing list nor in the archives..
If so, could you post them somewhere so others could look..?
Thanks,
    Stephen

Re: plan time of MASSIVE partitioning ...

From
Hans-Jürgen Schönig
Date:
here is the patch again.
we accidentally attached a wrong test file to the original posting so it grew too big. we had to revoke it from the moderator (this happens if you code from 8am to 10pm).
here is just the patch - it is nice and small.

you can easily test it by making yourself a nice parent table, many subtables (hundreds or thousands) and a decent number of indexes per partition.
then run PREPARE with \timing to see what happens.
you should get scary planning times. the more potential indexes and tables the more scary it will be.
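
something along these lines does the trick (again just a sketch, not the script we actually used):

create table parent (a int4, b int4, c int4);

do $$
begin
    for i in 1 .. 1000 loop
        execute 'create table part_' || i || ' () inherits (parent)';
        execute 'create index part_' || i || '_a on part_' || i || ' (a)';
        execute 'create index part_' || i || '_b on part_' || i || ' (b)';
        execute 'create index part_' || i || '_c on part_' || i || ' (c)';
    end loop;
end;
$$;

prepare p(int4) as select * from parent order by a desc;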

using this wonderful RB tree the time for this function call goes down to basically zero.
i hope this is something which is useful to some folks out there.

    many thanks,

        hans







On Sep 8, 2010, at 4:18 PM, Stephen Frost wrote:

> * Hans-Jürgen Schönig (postgres@cybertec.at) wrote:
>> no, we have not checked memory consumption.
>> there is still some stuff left to optimize away - it seems we are going close to O(n^2) somewhere.
>> "equal" is called really often in our sample case as well:
>
> Did the mail with the scripts, etc, get hung up due to size or
> something..?  I didn't see it on the mailing list nor in the archives..
> If so, could you post them somewhere so others could look..?
>
>     Thanks,
>
>         Stephen


--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


Attachment

Re: plan time of MASSIVE partitioning ...

From
Stephen Frost
Date:
* Robert Haas (robertmhaas@gmail.com) wrote:
> Neat.  Have you checked what effect this has on memory consumption?
>
> Also, don't forget to add it to
> https://commitfest.postgresql.org/action/commitfest_view/open

Would be good to have the patch updated to be against HEAD before
posting to the commitfest.
Thanks,
    Stephen

Re: plan time of MASSIVE partitioning ...

From
Hans-Jürgen Schönig
Date:
On Sep 8, 2010, at 4:57 PM, Stephen Frost wrote:

> * Robert Haas (robertmhaas@gmail.com) wrote:
>> Neat.  Have you checked what effect this has on memory consumption?
>>
>> Also, don't forget to add it to
>> https://commitfest.postgresql.org/action/commitfest_view/open
>
> Would be good to have the patch updated to be against HEAD before
> posting to the commitfest.
>
>     Thanks,
>
>         Stephen



we will definitely provide something for HEAD.
but, it seems the problem we are looking at is not sufficiently fixed yet.
in our case we shaved off some 18% of planning time or so - looking at the other top 2 functions i got the feeling that more can be done to reduce this. i guess we have to attack this as well.
regards,
    hans


--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: plan time of MASSIVE partitioning ...

From
Stephen Frost
Date:
* Hans-Jürgen Schönig (postgres@cybertec.at) wrote:
> but, it seems the problem we are looking is not sufficiently fixed yet.
> in our case we shaved off some 18% of planning time or so - looking at the other top 2 functions i got the feeling that more can be done to reduce this. i guess we have to attack this as well.

An 18% increase is certainly nice, provided it doesn't slow down or
break other things..  I'm looking through the patch now actually and
I'm not really happy with the naming, comments, or some of the code
flow, but I think the concept looks reasonable.
Thanks,
    Stephen

Re: plan time of MASSIVE partitioning ...

From
Robert Haas
Date:
2010/9/8 Hans-Jürgen Schönig <postgres@cybertec.at>:
> On Sep 8, 2010, at 4:57 PM, Stephen Frost wrote:
>
>> * Robert Haas (robertmhaas@gmail.com) wrote:
>>> Neat.  Have you checked what effect this has on memory consumption?
>>>
>>> Also, don't forget to add it to
>>> https://commitfest.postgresql.org/action/commitfest_view/open
>>
>> Would be good to have the patch updated to be against HEAD before
>> posting to the commitfest.
>
>
> we will definitely provide something which is for HEAD.
> but, it seems the problem we are looking is not sufficiently fixed yet.
> in our case we shaved off some 18% of planning time or so - looking at the other top 2 functions i got the feeling that more can be done to reduce this. i guess we have to attack this as well.

Just remember that four small patches (say) are apt to get committed
faster than one big one.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: plan time of MASSIVE partitioning ...

From
Stephen Frost
Date:
* Robert Haas (robertmhaas@gmail.com) wrote:
> 2010/9/8 Hans-Jürgen Schönig <postgres@cybertec.at>:
> > but, it seems the problem we are looking is not sufficiently fixed yet.
> > in our case we shaved off some 18% of planning time or so - looking at the other top 2 functions i got the feeling that more can be done to reduce this. i guess we have to attack this as well.
>
> Just remember that four small patches (say) are apt to get committed
> faster than one big one.

Indeed, but code like this makes me wonder if this is really working the
way it's supposed to:

+   val1 = (long)pk_left->pk_eclass;
+   val2 = (long)pk_right->pk_eclass;
+
+   if (val1 < val2)
+       return -1;
+   else if (val1 > val2)
+       return 1;

Have you compared how big the tree gets to the size of the list it's
supposed to be replacing..?
Thanks,
    Stephen

Re: plan time of MASSIVE partitioning ...

From
Alvaro Herrera
Date:
Excerpts from Stephen Frost's message of Wed Sep 08 11:26:55 -0400 2010:
> * Hans-Jürgen Schönig (postgres@cybertec.at) wrote:
> > but, it seems the problem we are looking is not sufficiently fixed yet.
> > in our case we shaved off some 18% of planning time or so - looking at the other top 2 functions i got the feeling that more can be done to reduce this. i guess we have to attack this as well.
> 
> An 18% increase is certainly nice, provided it doesn't slow down or
> break other things..  I'm looking through the patch now actually and
> I'm not really happy with the naming, comments, or some of the code
> flow, but I think the concept looks reasonable.

I don't understand the layering between pg_tree and rbtree.  Why does it
exist at all?  At first I thought this was another implementation of
rbtrees, but then I noticed it sits on top of it.  Is this really
necessary?

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: plan time of MASSIVE partitioning ...

From
Tom Lane
Date:
Hans-Jürgen Schönig <postgres@cybertec.at> writes:
> here is the patch again.

This patch seems remarkably documentation-free.  What is it trying to
accomplish and what is it doing to the planner data structures?
(Which do have documentation BTW.)  Also, what will it do to runtime
in normal cases where the pathkeys list isn't that long?
        regards, tom lane


Re: plan time of MASSIVE partitioning ...

From
Boszormenyi Zoltan
Date:
Stephen Frost wrote:
> * Robert Haas (robertmhaas@gmail.com) wrote:
>   
>> 2010/9/8 Hans-Jürgen Schönig <postgres@cybertec.at>:
>>     
>>> but, it seems the problem we are looking is not sufficiently fixed yet.
>>> in our case we shaved off some 18% of planning time or so - looking at the other top 2 functions i got the feeling that more can be done to reduce this. i guess we have to attack this as well.
>>>       
>> Just remember that four small patches (say) are apt to get committed
>> faster than one big one.
>>     
>
> Indeed, but code like this makes me wonder if this is really working the
> way it's supposed to:
>
> +   val1 = (long)pk_left->pk_eclass;
> +   val2 = (long)pk_right->pk_eclass;
> + 
> +   if (val1 < val2)
> +       return -1;
> +   else if (val1 > val2)
> +       return 1;
>   

The original code checked for pointers being equal among
other conditions. This was an (almost) straight conversion
to a comparison function for rbtree. Do you mean casting
the pointer to long? Yes, e.g. on 64-bit Windows it wouldn't
work. Back to plain pointer comparison.

> Have you compared how big the tree gets to the size of the list it's
> supposed to be replacing..?
>   

No, but I think it's obvious. Now we have one TreeCell
where we had one ListCell.

Best regards,
Zoltán Böszörményi

-- 
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de    http://www.postgresql.at/



Re: plan time of MASSIVE partitioning ...

From
Boszormenyi Zoltan
Date:
Alvaro Herrera wrote:
> Excerpts from Stephen Frost's message of Wed Sep 08 11:26:55 -0400 2010:
>   
>> * Hans-Jürgen Schönig (postgres@cybertec.at) wrote:
>>     
>>> but, it seems the problem we are looking is not sufficiently fixed yet.
>>> in our case we shaved off some 18% of planning time or so - looking at the other top 2 functions i got the feeling that more can be done to reduce this. i guess we have to attack this as well.
>>>       
>> An 18% increase is certainly nice, provided it doesn't slow down or
>> break other things..  I'm looking through the patch now actually and
>> I'm not really happy with the naming, comments, or some of the code
>> flow, but I think the concept looks reasonable.
>>     
>
> I don't understand the layering between pg_tree and rbtree.  Why does it
> exist at all?  At first I thought this was another implementation of
> rbtrees, but then I noticed it sits on top of it.  Is this really
> necessary?
>   

No, if it's acceptable to omit PlannerInfo from outfuncs.c.
Or at least its canon_pathkeys member. Otherwise yes, it's
necessary. We need to store (Node *) in a fast searchable way.

This applies to anything else that may need to be converted
from list to tree to decrease planning time. Like ec_members
in EquivalenceClass.

Best regards,
Zoltán Böszörményi

-- 
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de    http://www.postgresql.at/



Re: plan time of MASSIVE partitioning ...

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> Indeed, but code like this makes me wonder if this is really working the
> way it's supposed to:

> +   val1 = (long)pk_left->pk_eclass;
> +   val2 = (long)pk_right->pk_eclass;

Ugh.  Casting a pointer to long?  We do have portable ways to do what
this is trying to do, but that is not one.  (For example, this is
guaranteed to misbehave on 64-bit Windows.)

Offhand I think PointerGetDatum might be the best way.
        regards, tom lane


Re: plan time of MASSIVE partitioning ...

From
Tom Lane
Date:
Boszormenyi Zoltan <zb@cybertec.at> writes:
> This applies to anything else that may need to be converted
> from list to tree to decrease planning time. Like ec_members
> in EquivalenceClass.

AFAIR, canonical pathkeys are the *only* thing in the planner where pure
pointer equality is interesting.  So I doubt this hack is of any use for
EquivalenceClass, even if the lists were likely to be long which they
aren't.
        regards, tom lane


Re: plan time of MASSIVE partitioning ...

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> I'm not really happy with the naming, comments, or some of the code
> flow, but I think the concept looks reasonable.

There seems to be a lot of unnecessary palloc/pfree traffic in this
implementation.  Getting rid of that might help the speed.
        regards, tom lane


Re: plan time of MASSIVE partitioning ...

From
Boszormenyi Zoltan
Date:
Tom Lane wrote:
> Boszormenyi Zoltan <zb@cybertec.at> writes:
>   
>> This applies to anything else that may need to be converted
>> from list to tree to decrease planning time. Like ec_members
>> in EquivalenceClass.
>>     
>
> AFAIR, canonical pathkeys are the *only* thing in the planner where pure
> pointer equality is interesting.  So I doubt this hack is of any use for
> EquivalenceClass, even if the lists were likely to be long which they
> aren't.
>
>             regards, tom lane
>   

No, for EquivalenceClass->ec_members, I need to do something
funnier, like implementing compare(Node *, Node *) and using that
instead of equal(Node *, Node *)... Something like nodeToString()
on both Node * and strcmp() on the resulting strings.

-- 
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de    http://www.postgresql.at/



Re: plan time of MASSIVE partitioning ...

From
Boszormenyi Zoltan
Date:
Tom Lane wrote:
> Stephen Frost <sfrost@snowman.net> writes:
>   
>> I'm not really happy with the naming, comments, or some of the code
>> flow, but I think the concept looks reasonable.
>>     
>
> There seems to be a lot of unnecessary palloc/pfree traffic in this
> implementation.  Getting rid of that might help the speed.
>
>             regards, tom lane
>   

This was a first WIP implementation, I will look into it, thanks.

-- 
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de    http://www.postgresql.at/



Re: plan time of MASSIVE partitioning ...

From
Tom Lane
Date:
Boszormenyi Zoltan <zb@cybertec.at> writes:
> Tom Lane wrote:
>> AFAIR, canonical pathkeys are the *only* thing in the planner where pure
>> pointer equality is interesting.  So I doubt this hack is of any use for
>> EquivalenceClass, even if the lists were likely to be long which they
>> aren't.

> No, for EquivalanceClass->ec_member, I need to do something
> funnier, like implement compare(Node *, Node *) and use that
> instead of equal(Node *, Node *)... Something like nodeToString()
> on both Node * and strcmp() the resulting strings.

Well, (a) that doesn't work (hint: there are fields in nodes that are
intentionally ignored by equal()), and (b) I still don't believe that
there's an actual bottleneck there.  ECs generally aren't very big.
        regards, tom lane


Re: plan time of MASSIVE partitioning ...

From
Boszormenyi Zoltan
Date:
Tom Lane wrote:
> Boszormenyi Zoltan <zb@cybertec.at> writes:
>   
>> Tom Lane wrote:
>>     
>>> AFAIR, canonical pathkeys are the *only* thing in the planner where pure
>>> pointer equality is interesting.  So I doubt this hack is of any use for
>>> EquivalenceClass, even if the lists were likely to be long which they
>>> aren't.
>>>       
>
>   
>> No, for EquivalanceClass->ec_member, I need to do something
>> funnier, like implement compare(Node *, Node *) and use that
>> instead of equal(Node *, Node *)... Something like nodeToString()
>> on both Node * and strcmp() the resulting strings.
>>     
>
> Well, (a) that doesn't work (hint: there are fields in nodes that are
> intentionally ignored by equal()),

Then this compare() needs to work like equal() and ignore the
same fields that equal() ignores. nodeToString() would need more
space anyway, and comparing non-equal Nodes can return a
non-zero result faster.

>  and (b) I still don't believe that
> there's an actual bottleneck there.  ECs generally aren't very big.
>   

Actually, PlannerInfo->eq_classes needs to be a Tree somehow,
the comparator function is not yet final in my head.

equal() is called over 8 million times with or without our patch:

without
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
18.87      0.80     0.80     4812     0.00     0.00  make_canonical_pathkey
15.33      1.45     0.65     4811     0.00     0.00  get_eclass_for_sort_expr
14.15      2.05     0.60  8342410     0.00     0.00  equal
 6.13      2.31     0.26   229172     0.00     0.00  SearchCatCache
 3.66      2.47     0.16  5788835     0.00     0.00  _equalList
 3.07      2.60     0.13  1450043     0.00     0.00  hash_search_with_hash_value
 2.36      2.70     0.10  2272545     0.00     0.00  AllocSetAlloc
 2.12      2.79     0.09   811460     0.00     0.00  hash_any
 1.89      2.87     0.08  3014983     0.00     0.00  list_head
 1.89      2.95     0.08   574526     0.00     0.00  _bt_compare
 1.77      3.02     0.08 11577670     0.00     0.00  list_head
 1.42      3.08     0.06     1136     0.00     0.00  tzload
 0.94      3.12     0.04  2992373     0.00     0.00  AllocSetFreeIndex
 0.94      3.16     0.04    91427     0.00     0.00  _bt_checkkeys
 
...

with
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
24.51      0.88     0.88     4811     0.00     0.00  get_eclass_for_sort_expr
14.21      1.39     0.51  8342410     0.00     0.00  equal
 8.22      1.69     0.30  5788835     0.00     0.00  _equalList
 5.29      1.88     0.19   229172     0.00     0.00  SearchCatCache
 2.51      1.97     0.09     1136     0.00     0.00  tzload
 2.23      2.05     0.08  3014983     0.00     0.00  list_head
 2.23      2.13     0.08  2283130     0.00     0.00  AllocSetAlloc
 2.09      2.20     0.08   811547     0.00     0.00  hash_any
 2.09      2.28     0.08 11577670     0.00     0.00  list_head
 1.95      2.35     0.07  1450180     0.00     0.00  hash_search_with_hash_value
 1.39      2.40     0.05   640690     0.00     0.00  _bt_compare
 1.39      2.45     0.05   157944     0.00     0.00  LockAcquireExtended
 1.39      2.50     0.05    11164     0.00     0.00  AllocSetCheck
 1.11      2.54     0.04  3010547     0.00     0.00  AllocSetFreeIndex
 1.11      2.58     0.04   874975     0.00     0.00  AllocSetFree
 1.11      2.62     0.04    66211     0.00     0.00  heap_form_tuple
 0.84      2.65     0.03   888128     0.00     0.00  LWLockRelease
 
...

The number of calls is the same for equal and _equalList
in both cases as you can see.

And if you compare the number of AllocSetAlloc calls,
it's actually smaller with our patch, so it's not that the
conversion to Tree caused this.

Best regards,
Zoltán Böszörményi

-- 
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de    http://www.postgresql.at/



Re: plan time of MASSIVE partitioning ...

From
Tom Lane
Date:
Boszormenyi Zoltan <zb@cybertec.at> writes:
> equal() is called over 8 million times with or without our patch:

From where, though?  You've provided not a shred of evidence that
searching large ec_member lists is the problem.

Also, did the test case you're using ever make it to the list?
        regards, tom lane


Re: plan time of MASSIVE partitioning ...

From
Boszormenyi Zoltan
Date:
Tom Lane wrote:
> Boszormenyi Zoltan <zb@cybertec.at> writes:
>
>> equal() is called over 8 million times with or without our patch:
>>
>
> From where, though?  You've provided not a shred of evidence that
> searching large ec_member lists is the problem.
>

Indeed. I have put elog(NOTICE) calls in there to see how long
each list is. It turned out that the length of ec_members is either 0
or 1, mostly 1, but the length of eq_classes is constantly growing.
This is what I need to attack then.

> Also, did the test case you're using ever make it to the list?
>

No, because it was too large and because the test case
accidentally contained confidential information, we asked
Bruce to delete it from the moderator queue.

Attached is a shortened test case that does the same and
also shows the same problem. The create_table.sql
creates the parent table, the table_generator.sql produces
the list of constraint excluded child tables and their indexes.

The gprof output of this is only slightly modified, the
number of equal() calls is still over 8 million, it is also
attached.

Best regards,
Zoltán Böszörményi

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
     http://www.postgresql.at/


Attachment

Re: plan time of MASSIVE partitioning ...

From
Boszormenyi Zoltan
Date:
Hi,

attached is a WIP patch against 9.1 current GIT that converts
eq_classes and canon_pathkeys in PlannerInfo.

Also attached is the test case again; the slow query is:

explain select * from inh_parent
where timestamp1 between '2010-04-06' and '2010-06-25'
order by timestamp2;

There is intentionally no data; it's the planning time that is slow.
The current GIT version plans this query in 2.4 seconds,
the patched version does it in 0.59 seconds according to
gprof. The gprof outputs are also attached.

There is one problem with the patch, it doesn't survive
"make check". One of the regression tests fails the
    Assert(!cur_em->em_is_child);
line in process_equivalence() in equivclass.c, but I couldn't
yet find what causes it. The "why" is vaguely clear:
something modifies the ec_members list in the eq_classes'
tree nodes while the node is in the tree. Because I didn't find
the offender yet, I couldn't fix it, so I send this patch as is.
I'll try to fix it if someone doesn't beat me in fixing it. :)

The query produces the same EXPLAIN output for both the
stock and the patched version, they were checked with diff.
I didn't attach it to this mail because of the size constraints.
Almost all files are compressed because of this.

Best regards,
Zoltán Böszörményi

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
     http://www.postgresql.at/


Attachment

Re: plan time of MASSIVE partitioning ...

From
Boszormenyi Zoltan
Date:
Hi,

Boszormenyi Zoltan wrote:
> There is one problem with the patch, it doesn't survive
> "make check". One of the regression tests fails the
>     Assert(!cur_em->em_is_child);
> line in process_equivalence() in equivclass.c, but I couldn't
> yet find it what causes it. The "why" is vaguely clear:
> something modifies the ec_members list in the eq_classes'
> tree nodes while the node is in the tree. Because I didn't find
> the offender yet, I couldn't fix it, so I send this patch as is.
> I'll try to fix it if someone doesn't beat me in fixing it. :)
>   

I am a little closer to this bug, maybe I even found the cause of it.
I found that process_equivalence() is called from:

path/equivclass.c:
    reconsider_outer_join_clause()
    reconsider_full_join_clause()
plan/initsplan.c:
    distribute_qual_to_rels()

The problem is with the two functions in path/equivclass.c:
process_equivalence() and those functions all walk the tree,
and the current RBTree code can only deal with one walk at a
time. We need to push/pop the iterator state to be able to
serve more than one walker.

Also, we need to split out the tree modifying part from
process_equivalence() somehow, as the tree walking
also cannot deal with node additions and deletions.

Best regards,
Zoltán Böszörményi

-- 
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de    http://www.postgresql.at/



Re: plan time of MASSIVE partitioning ...

From
Tom Lane
Date:
Boszormenyi Zoltan <zb@cybertec.at> writes:
> The problem is with the two functions in path/equivclass.c,
> as process_equivalance() and those functions are all walk
> the tree, and the current RBTree code can only deal with
> one walk at a time. We need to push/pop the iterator state
> to be able to serve more than one walkers.

Good luck with that --- the iteration state is embedded in the rbtree.

> Also, we need to split out the tree modifying part from
> process_equivalence() somehow, as the tree walking
> also cannot deal with node additions and deletions.

That's not happening either.  Maybe you need to think of some other data
structure to use.  Hashtable maybe?  dynahash.c at least has reasonably
well-defined limitations in this area.
        regards, tom lane


Re: plan time of MASSIVE partitioning ...

From
Boszormenyi Zoltan
Date:
Hi,

Tom Lane wrote:
> Boszormenyi Zoltan <zb@cybertec.at> writes:
>
>> The problem is with the two functions in path/equivclass.c,
>> as process_equivalance() and those functions are all walk
>> the tree, and the current RBTree code can only deal with
>> one walk at a time. We need to push/pop the iterator state
>> to be able to serve more than one walkers.
>>
>
> Good luck with that --- the iteration state is embedded in the rbtree.
>
>
>> Also, we need to split out the tree modifying part from
>> process_equivalence() somehow, as the tree walking
>> also cannot deal with node additions and deletions.
>>
>
> That's not happening either.  Maybe you need to think of some other data
> structure to use.  Hashtable maybe?  dynahash.c at least has reasonably
> well-defined limitations in this area.
>
>             regards, tom lane
>

thank you very much for pointing me to dynahash, here is the
next version that finally seems to work.

Two patches are attached, the first is the absolute minimum for
making it work, this still has the Tree type for canon_pathkeys
and eq_classes got the same treatment as join_rel_list/join_rel_hash
has in the current sources: if the list grows larger than 32, a hash table
is created. It seems to be enough to do this for
     get_eclass_for_sort_expr()
only; the other users of eq_classes aren't bothered by this change.

The total speedup from these two changes is in the 70+ percent
range: a slightly later GIT version than the tree I tested with
before shows 1.74 vs. 0.41 seconds runtime for the example query.
These are with asserts and profiling enabled, of course. Without
asserts and profiling enabled, the "time psql" figures are:

$ time psql -p 54321 -c "explain select * from inh_parent where
timestamp1 between '2010-04-06' and '2010-06-25' order by timestamp2"
>/dev/null

real    0m1.932s
user    0m0.035s
sys    0m0.002s

vs.

real    0m0.630s
user    0m0.033s
sys    0m0.002s

The second patch contains extra infrastructure for the Tree type,
it's currently unused, it was created for experimenting with eq_classes
being a tree. It may be useful for someone, though.

Best regards,
Zoltán Böszörményi

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
     http://www.postgresql.at/


Attachment

Re: plan time of MASSIVE partitioning ...

From
Heikki Linnakangas
Date:
On 26.10.2010 18:34, Boszormenyi Zoltan wrote:
> thank you very much for pointing me to dynahash, here is the
> next version that finally seems to work.
>
> Two patches are attached, the first is the absolute minimum for
> making it work, this still has the Tree type for canon_pathkeys
> and eq_classes got the same treatment as join_rel_list/join_rel_hash
> has in the current sources: if the list grows larger than 32, a hash table
> is created. It seems to be be enough for doing in for
>       get_eclass_for_sort_expr()
> only, the other users of eq_classes aren't bothered by this change.

That's better, but can't you use dynahash for canon_pathkeys as well?


--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: plan time of MASSIVE partitioning ...

From
Boszormenyi Zoltan
Date:
Heikki Linnakangas wrote:
> On 26.10.2010 18:34, Boszormenyi Zoltan wrote:
>> thank you very much for pointing me to dynahash, here is the
>> next version that finally seems to work.
>>
>> Two patches are attached, the first is the absolute minimum for
>> making it work, this still has the Tree type for canon_pathkeys
>> and eq_classes got the same treatment as join_rel_list/join_rel_hash
>> has in the current sources: if the list grows larger than 32, a hash
>> table
>> is created. It seems to be be enough for doing in for
>>       get_eclass_for_sort_expr()
>> only, the other users of eq_classes aren't bothered by this change.
>
> That's better, but can't you use dynahash for canon_pathkeys as well?

Here's a purely dynahash solution. It's somewhat slower than
the tree version, 0.45 vs 0.41 seconds in the cached case for the
previously posted test case.

Best regards,
Zoltán Böszörményi

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
     http://www.postgresql.at/


Attachment

Re: plan time of MASSIVE partitioning ...

From
Boszormenyi Zoltan
Date:
Boszormenyi Zoltan wrote:
> Heikki Linnakangas wrote:
>
>> On 26.10.2010 18:34, Boszormenyi Zoltan wrote:
>>
>>> thank you very much for pointing me to dynahash, here is the
>>> next version that finally seems to work.
>>>
>>> Two patches are attached, the first is the absolute minimum for
>>> making it work, this still has the Tree type for canon_pathkeys
>>> and eq_classes got the same treatment as join_rel_list/join_rel_hash
>>> has in the current sources: if the list grows larger than 32, a hash
>>> table
>>> is created. It seems to be be enough for doing in for
>>>       get_eclass_for_sort_expr()
>>> only, the other users of eq_classes aren't bothered by this change.
>>>
>> That's better, but can't you use dynahash for canon_pathkeys as well?
>>
>
> Here's a purely dynahash solution. It's somewhat slower than
> the tree version, 0.45 vs 0.41 seconds in the cached case for the
> previously posted test case.
>

And now in context diff, sorry for my affection towards unified diffs. :-)

> Best regards,
> Zoltán Böszörményi
>
>


--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
     http://www.postgresql.at/


Attachment

Re: plan time of MASSIVE partitioning ...

From
Boszormenyi Zoltan
Date:
Boszormenyi Zoltan wrote:
> Boszormenyi Zoltan wrote:
>
>> Heikki Linnakangas wrote:
>>
>>
>>> On 26.10.2010 18:34, Boszormenyi Zoltan wrote:
>>>
>>>
>>>> thank you very much for pointing me to dynahash, here is the
>>>> next version that finally seems to work.
>>>>
>>>> Two patches are attached, the first is the absolute minimum for
>>>> making it work, this still has the Tree type for canon_pathkeys
>>>> and eq_classes got the same treatment as join_rel_list/join_rel_hash
>>>> has in the current sources: if the list grows larger than 32, a hash
>>>> table
>>>> is created. It seems to be be enough for doing in for
>>>>       get_eclass_for_sort_expr()
>>>> only, the other users of eq_classes aren't bothered by this change.
>>>>
>>>>
>>> That's better, but can't you use dynahash for canon_pathkeys as well?
>>>
>>>
>> Here's a purely dynahash solution. It's somewhat slower than
>> the tree version, 0.45 vs 0.41 seconds in the cached case for the
>> previously posted test case.
>>
>>
>
> And now in context diff, sorry for my affection towards unified diffs. :-)
>

A little better version: no need for the heavy hash_any, hash_uint32
on the lower 32 bits of pk_eclass is enough. The profiling runtime
is now 0.42 seconds vs. the previous 0.41 seconds for the tree version.

Best regards,
Zoltán Böszörményi

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
     http://www.postgresql.at/


Attachment

Re: plan time of MASSIVE partitioning ...

From
Boszormenyi Zoltan
Date:
Boszormenyi Zoltan wrote:
> Boszormenyi Zoltan wrote:
>
>> Boszormenyi Zoltan wrote:
>>
>>
>>> Heikki Linnakangas wrote:
>>>   
>>>     
>>>       
>>>> On 26.10.2010 18:34, Boszormenyi Zoltan wrote:
>>>>     
>>>>       
>>>>         
>>>>> thank you very much for pointing me to dynahash, here is the
>>>>> next version that finally seems to work.
>>>>>
>>>>> Two patches are attached, the first is the absolute minimum for
>>>>> making it work, this still has the Tree type for canon_pathkeys
>>>>> and eq_classes got the same treatment as join_rel_list/join_rel_hash
>>>>> has in the current sources: if the list grows larger than 32, a hash
>>>>> table
>>>>> is created. It seems to be be enough for doing in for
>>>>>       get_eclass_for_sort_expr()
>>>>> only, the other users of eq_classes aren't bothered by this change.
>>>>>       
>>>>>         
>>>>>           
>>>> That's better, but can't you use dynahash for canon_pathkeys as well?
>>>>     
>>>>       
>>>>         
>>> Here's a purely dynahash solution. It's somewhat slower than
>>> the tree version, 0.45 vs 0.41 seconds in the cached case for the
>>> previously posted test case.
>>>   
>>>     
>>>       
>> And now in context diff, sorry for my affection towards unified diffs. :-)
>>   
>>     
>
> A little better version, no need for the heavy hash_any, hash_uint32
> on the lower 32 bits on pk_eclass is enough. The profiling runtime
> is now 0.42 seconds vs the previous 0.41 seconds for the tree version.
>
> Best regards,
> Zoltán Böszörményi
>   

Btw, the top entries in the current gprof output are:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
19.05      0.08     0.08      482     0.17     0.29  add_child_rel_equivalences
11.90      0.13     0.05  1133447     0.00     0.00  bms_is_subset
 9.52      0.17     0.04   331162     0.00     0.00  hash_search_with_hash_value
 7.14      0.20     0.03   548971     0.00     0.00  AllocSetAlloc
 4.76      0.22     0.02     2858     0.01     0.01  get_tabstat_entry
 4.76      0.24     0.02     1136     0.02     0.02  tzload

This means add_child_rel_equivalences() still takes too much
time: the previously posted test case calls this function 482
times, i.e. it's called for almost every 10th entry added to
eq_classes. The elog() I put into this function says that at
the last call list_length(eq_classes) == 4754.

Best regards,
Zoltán Böszörményi

-- 
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de    http://www.postgresql.at/



Re: plan time of MASSIVE partitioning ...

From
Heikki Linnakangas
Date:
On 28.10.2010 13:54, Boszormenyi Zoltan wrote:
> A little better version, no need for the heavy hash_any, hash_uint32
> on the lower 32 bits on pk_eclass is enough. The profiling runtime
> is now 0.42 seconds vs the previous 0.41 seconds for the tree version.

Actually, I wonder if we could just have a separate canon_pathkeys list
for each EquivalenceClass, instead of one big list in PlannerInfo. I'm
not too familiar with equivalence classes and all that, but the attached
patch at least passes the regressions. I haven't done any performance
testing, but I would expect this to be even faster than the hashtable or
tree implementations, and a lot simpler.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment

Re: plan time of MASSIVE partitioning ...

From
Tom Lane
Date:
Boszormenyi Zoltan <zb@cybertec.at> writes:
> This means add_child_rel_equivalences() is still takes
> too much time, the previously posted test case calls this
> function 482 times, it's called for almost  every 10th entry
> added to eq_classes. The elog() I put into this function says
> that at the last call list_length(eq_classes) == 4754.

That seems like a ridiculously large number of ECs.  What is the
test query again?
        regards, tom lane


Re: plan time of MASSIVE partitioning ...

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Actually, I wonder if we could just have a separate canon_pathkeys list 
> for each EquivalenceClass, instead of one big list in PlannerInfo. I'm 
> not too familiar with equivalence classes and all that,

Hm.  I don't like getting rid of the main canon_pathkeys list like that.
The whole point of a canonical pathkey is that there is only one, so
it seems like we need a central list.  But it might be sane for each
EC to have an additional, side list of PKs made from it.
        regards, tom lane


Re: plan time of MASSIVE partitioning ...

From
Boszormenyi Zoltan
Date:
Tom Lane wrote:
> Boszormenyi Zoltan <zb@cybertec.at> writes:
>   
>> This means add_child_rel_equivalences() is still takes
>> too much time, the previously posted test case calls this
>> function 482 times, it's called for almost  every 10th entry
>> added to eq_classes. The elog() I put into this function says
>> that at the last call list_length(eq_classes) == 4754.
>>     
>
> That seems like a ridiculously large number of ECs.  What is the
> test query again?
>
>             regards, tom lane
>   

The test case is here:
http://archives.postgresql.org/message-id/4CBD9DDC.4040304@cybertec.at

create_table.sql for the main table plus childtables.sql.gz, the EXPLAIN
query is in the message body.

Basically, it's a model for a lot of data for three months, partitioned by
4 hour intervals for every day. Imagine the call list handled by a
phone company in a large country.
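
Each child table looks roughly like this (reconstructed from the
description above, not copied from the attached script):

CREATE TABLE qdrs_20100406_00 (
    CHECK (streamstart >= '2010-04-06 00:00:00'
       AND streamstart <  '2010-04-06 04:00:00')
) INHERITS (qdrs);

and so on, six such tables per day over roughly three months,
i.e. around 480 child tables, each with its own indexes.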

Best regards,
Zoltán Böszörményi

-- 
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de    http://www.postgresql.at/



Re: plan time of MASSIVE partitioning ...

From
Tom Lane
Date:
Boszormenyi Zoltan <zb@cybertec.at> writes:
> Tom Lane wrote:
>> That seems like a ridiculously large number of ECs.  What is the
>> test query again?

> The test case is here:
> http://archives.postgresql.org/message-id/4CBD9DDC.4040304@cybertec.at

After poking through that a bit, I think that the real issue is in this
division of labor:
           index_pathkeys = build_index_pathkeys(root, index,
                                                 ForwardScanDirection);
           useful_pathkeys = truncate_useless_pathkeys(root, rel,
                                                       index_pathkeys);

If you trace what is happening here, the index pathkeys that actually
survive the "usefulness" test all refer to exactly ONE equivalence
class, namely the one arising from the query's "order by timestamp2"
clause.  All the other pathkeys that get created are immediately
discarded as being irrelevant to the query.  The reason that we end up
with so many equivalence classes is that there is nothing causing the
variables of the different child tables to be recognized as all
sort-equivalent.  Maybe that's a bug in itself, but I would argue that
the right way to make this faster is to refactor things so that we
don't generate useless equivalence classes in the first place, or
at least don't keep them around in the planner's lists once we realize
they're useless.

I like Heikki's hack to cut down on searching in make_canonical_pathkey,
but I think that complicating the data structure searching beyond that
is just a band-aid.  Reasonably-sized queries shouldn't contain very
many equivalence classes: they should only come from equality clauses
or sort conditions that appeared in the query text.  Therefore, there
also shouldn't be all that many distinct pathkeys.
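
For instance, in something like

    SELECT * FROM a JOIN b ON a.x = b.y ORDER BY a.z;

the only equivalence classes that ought to exist are the one built
from the join qual (a.x, b.y) and the one for the ORDER BY expression
(a.z), no matter how many child tables a and b are split into.
(Table and column names here are made up, of course.)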
        regards, tom lane


Re: plan time of MASSIVE partitioning ...

From
Tom Lane
Date:
I wrote:
> the right way to make this faster is to refactor things so that we
> don't generate useless equivalence classes in the first place, or
> at least don't keep them around in the planner's lists once we realize
> they're useless.

After a bit of hacking, I propose the attached patch.

> I like Heikki's hack to cut down on searching in make_canonical_pathkey,
> but I think that complicating the data structure searching beyond that
> is just a band-aid.

With the given test case and this patch, we end up with exactly two
canonical pathkeys referencing a single EquivalenceClass.  So as far
as I can tell there's not a lot of point in refining the pathkey
searching.  Now, the EquivalenceClass has got 483 members, which
means that there's still some O(N^2) behavior in
get_eclass_for_sort_expr.  There might be some use in refining the
search for a matching eclass member.  It's not sticking out in
profiling like it did before though.

            regards, tom lane

diff --git a/src/backend/optimizer/README b/src/backend/optimizer/README
index d6402cf911817b1b8c17da91019a1fac19bf051a..5c0786f2fe6dea9a72ad66ba93aa8833ab0e26ba 100644
*** a/src/backend/optimizer/README
--- b/src/backend/optimizer/README
*************** sort ordering was important; and so usin
*** 632,640 ****
  orderings doesn't create any real problem.



- Though Bob Devine <bob.devine@worldnet.att.net> was not involved in the
- coding of our optimizer, he is available to field questions about
- optimizer topics.

  -- bjm & tgl
--- 632,670 ----
  orderings doesn't create any real problem.


+ Order of processing for EquivalenceClasses and PathKeys
+ -------------------------------------------------------
+
+ As alluded to above, there is a specific sequence of phases in the
+ processing of EquivalenceClasses and PathKeys during planning.  During the
+ initial scanning of the query's quals (deconstruct_jointree followed by
+ reconsider_outer_join_clauses), we construct EquivalenceClasses based on
+ mergejoinable clauses found in the quals.  At the end of this process,
+ we know all we can know about equivalence of different variables, so
+ subsequently there will be no further merging of EquivalenceClasses.
+ At that point it is possible to consider the EquivalenceClasses as
+ "canonical" and build canonical PathKeys that reference them.  Before
+ we reach that point (actually, before entering query_planner at all)
+ we also ensure that we have constructed EquivalenceClasses for all the
+ expressions used in the query's ORDER BY and related clauses.  These
+ classes might or might not get merged together, depending on what we
+ find in the quals.
+
+ Because all the EquivalenceClasses are known before we begin path
+ generation, we can use them as a guide to which indexes are of interest:
+ if an index's column is not mentioned in any EquivalenceClass then that
+ index's sort order cannot possibly be helpful for the query.  This allows
+ short-circuiting of much of the processing of create_index_paths() for
+ irrelevant indexes.
+
+ There are some cases where planner.c constructs additional
+ EquivalenceClasses and PathKeys after query_planner has completed.
+ In these cases, the extra ECs/PKs are needed to represent sort orders
+ that were not considered during query_planner.  Such situations should be
+ minimized since it is impossible for query_planner to return a plan
+ producing such a sort order, meaning a explicit sort will always be needed.
+ Currently this happens only for queries involving multiple window functions
+ with different orderings, so extra sorts are needed anyway.


  -- bjm & tgl
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index e44e960b5454d4698ed82e4e857794ffe2a9adf2..c101c272a14b2f1b9d92a54670688df057d84a13 100644
*** a/src/backend/optimizer/path/equivclass.c
--- b/src/backend/optimizer/path/equivclass.c
*************** static bool reconsider_full_join_clause(
*** 78,83 ****
--- 78,87 ----
   * join.  (This is the reason why we need a failure return.  It's more
   * convenient to check this case here than at the call sites...)
   *
+  * On success return, we have also initialized the clause's left_ec/right_ec
+  * fields to point to the EquivalenceClass built from it.  This saves lookup
+  * effort later.
+  *
   * Note: constructing merged EquivalenceClasses is a standard UNION-FIND
   * problem, for which there exist better data structures than simple lists.
   * If this code ever proves to be a bottleneck then it could be sped up ---
*************** process_equivalence(PlannerInfo *root, R
*** 106,111 ****
--- 110,119 ----
                 *em2;
      ListCell   *lc1;

+     /* Should not already be marked as having generated an eclass */
+     Assert(restrictinfo->left_ec == NULL);
+     Assert(restrictinfo->right_ec == NULL);
+
      /* Extract info from given clause */
      Assert(is_opclause(clause));
      opno = ((OpExpr *) clause)->opno;
*************** process_equivalence(PlannerInfo *root, R
*** 236,243 ****
          {
              ec1->ec_sources = lappend(ec1->ec_sources, restrictinfo);
              ec1->ec_below_outer_join |= below_outer_join;
              /* mark the RI as usable with this pair of EMs */
-             /* NB: can't set left_ec/right_ec until merging is finished */
              restrictinfo->left_em = em1;
              restrictinfo->right_em = em2;
              return true;
--- 244,253 ----
          {
              ec1->ec_sources = lappend(ec1->ec_sources, restrictinfo);
              ec1->ec_below_outer_join |= below_outer_join;
+             /* mark the RI as associated with this eclass */
+             restrictinfo->left_ec = ec1;
+             restrictinfo->right_ec = ec1;
              /* mark the RI as usable with this pair of EMs */
              restrictinfo->left_em = em1;
              restrictinfo->right_em = em2;
              return true;
*************** process_equivalence(PlannerInfo *root, R
*** 266,271 ****
--- 276,284 ----
          ec2->ec_relids = NULL;
          ec1->ec_sources = lappend(ec1->ec_sources, restrictinfo);
          ec1->ec_below_outer_join |= below_outer_join;
+         /* mark the RI as associated with this eclass */
+         restrictinfo->left_ec = ec1;
+         restrictinfo->right_ec = ec1;
          /* mark the RI as usable with this pair of EMs */
          restrictinfo->left_em = em1;
          restrictinfo->right_em = em2;
*************** process_equivalence(PlannerInfo *root, R
*** 276,281 ****
--- 289,297 ----
          em2 = add_eq_member(ec1, item2, item2_relids, false, item2_type);
          ec1->ec_sources = lappend(ec1->ec_sources, restrictinfo);
          ec1->ec_below_outer_join |= below_outer_join;
+         /* mark the RI as associated with this eclass */
+         restrictinfo->left_ec = ec1;
+         restrictinfo->right_ec = ec1;
          /* mark the RI as usable with this pair of EMs */
          restrictinfo->left_em = em1;
          restrictinfo->right_em = em2;
*************** process_equivalence(PlannerInfo *root, R
*** 286,291 ****
--- 302,310 ----
          em1 = add_eq_member(ec2, item1, item1_relids, false, item1_type);
          ec2->ec_sources = lappend(ec2->ec_sources, restrictinfo);
          ec2->ec_below_outer_join |= below_outer_join;
+         /* mark the RI as associated with this eclass */
+         restrictinfo->left_ec = ec2;
+         restrictinfo->right_ec = ec2;
          /* mark the RI as usable with this pair of EMs */
          restrictinfo->left_em = em1;
          restrictinfo->right_em = em2;
*************** process_equivalence(PlannerInfo *root, R
*** 311,316 ****
--- 330,338 ----

          root->eq_classes = lappend(root->eq_classes, ec);

+         /* mark the RI as associated with this eclass */
+         restrictinfo->left_ec = ec;
+         restrictinfo->right_ec = ec;
          /* mark the RI as usable with this pair of EMs */
          restrictinfo->left_em = em1;
          restrictinfo->right_em = em2;
*************** add_eq_member(EquivalenceClass *ec, Expr
*** 362,376 ****
  /*
   * get_eclass_for_sort_expr
   *      Given an expression and opfamily info, find an existing equivalence
!  *      class it is a member of; if none, build a new single-member
   *      EquivalenceClass for it.
   *
   * sortref is the SortGroupRef of the originating SortGroupClause, if any,
   * or zero if not.    (It should never be zero if the expression is volatile!)
   *
   * This can be used safely both before and after EquivalenceClass merging;
   * since it never causes merging it does not invalidate any existing ECs
!  * or PathKeys.
   *
   * Note: opfamilies must be chosen consistently with the way
   * process_equivalence() would do; that is, generated from a mergejoinable
--- 384,402 ----
  /*
   * get_eclass_for_sort_expr
   *      Given an expression and opfamily info, find an existing equivalence
!  *      class it is a member of; if none, optionally build a new single-member
   *      EquivalenceClass for it.
   *
   * sortref is the SortGroupRef of the originating SortGroupClause, if any,
   * or zero if not.    (It should never be zero if the expression is volatile!)
   *
+  * If create_it is TRUE, we'll build a new EquivalenceClass when there is no
+  * match.  If create_it is FALSE, we just return NULL when no match.
+  *
   * This can be used safely both before and after EquivalenceClass merging;
   * since it never causes merging it does not invalidate any existing ECs
!  * or PathKeys.  However, ECs added after path generation has begun are
!  * of limited usefulness, so usually it's best to create them beforehand.
   *
   * Note: opfamilies must be chosen consistently with the way
   * process_equivalence() would do; that is, generated from a mergejoinable
*************** get_eclass_for_sort_expr(PlannerInfo *ro
*** 382,388 ****
                           Expr *expr,
                           Oid expr_datatype,
                           List *opfamilies,
!                          Index sortref)
  {
      EquivalenceClass *newec;
      EquivalenceMember *newem;
--- 408,415 ----
                           Expr *expr,
                           Oid expr_datatype,
                           List *opfamilies,
!                          Index sortref,
!                          bool create_it)
  {
      EquivalenceClass *newec;
      EquivalenceMember *newem;
*************** get_eclass_for_sort_expr(PlannerInfo *ro
*** 426,433 ****
          }
      }

      /*
!      * No match, so build a new single-member EC
       *
       * Here, we must be sure that we construct the EC in the right context. We
       * can assume, however, that the passed expr is long-lived.
--- 453,464 ----
          }
      }

+     /* No match; does caller want a NULL result? */
+     if (!create_it)
+         return NULL;
+
      /*
!      * OK, build a new single-member EC
       *
       * Here, we must be sure that we construct the EC in the right context. We
       * can assume, however, that the passed expr is long-lived.
*************** create_join_clause(PlannerInfo *root,
*** 1094,1101 ****
      rinfo->parent_ec = parent_ec;

      /*
!      * We can set these now, rather than letting them be looked up later,
!      * since this is only used after EC merging is complete.
       */
      rinfo->left_ec = ec;
      rinfo->right_ec = ec;
--- 1125,1132 ----
      rinfo->parent_ec = parent_ec;

      /*
!      * We know the correct values for left_ec/right_ec, ie this particular EC,
!      * so we can just set them directly instead of forcing another lookup.
       */
      rinfo->left_ec = ec;
      rinfo->right_ec = ec;
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index ab3b9cd394d9af54bcd59e9830a9fbcc0a773fa8..bea134b3f1a02db8c611a6fc46d5bba041961682 100644
*** a/src/backend/optimizer/path/joinpath.c
--- b/src/backend/optimizer/path/joinpath.c
*************** select_mergejoin_clauses(PlannerInfo *ro
*** 1041,1047 ****
           * mergejoin is not really all that big a deal, and so it's not clear
           * that improving this is important.
           */
!         cache_mergeclause_eclasses(root, restrictinfo);

          if (EC_MUST_BE_REDUNDANT(restrictinfo->left_ec) ||
              EC_MUST_BE_REDUNDANT(restrictinfo->right_ec))
--- 1041,1047 ----
           * mergejoin is not really all that big a deal, and so it's not clear
           * that improving this is important.
           */
!         update_mergeclause_eclasses(root, restrictinfo);

          if (EC_MUST_BE_REDUNDANT(restrictinfo->left_ec) ||
              EC_MUST_BE_REDUNDANT(restrictinfo->right_ec))
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 643d57a92dc83f6adf4e2d1da223909528d59805..20c6d73617d507b88fce427a681320a684b47cd0 100644
*** a/src/backend/optimizer/path/pathkeys.c
--- b/src/backend/optimizer/path/pathkeys.c
*************** static PathKey *make_pathkey_from_sortin
*** 40,45 ****
--- 40,46 ----
                             Expr *expr, Oid ordering_op,
                             bool nulls_first,
                             Index sortref,
+                            bool create_it,
                             bool canonicalize);
  static Var *find_indexkey_var(PlannerInfo *root, RelOptInfo *rel,
                    AttrNumber varattno);
*************** canonicalize_pathkeys(PlannerInfo *root,
*** 230,235 ****
--- 231,240 ----
   * If the PathKey is being generated from a SortGroupClause, sortref should be
   * the SortGroupClause's SortGroupRef; otherwise zero.
   *
+  * create_it is TRUE if we should create any missing EquivalenceClass
+  * needed to represent the sort key.  If it's FALSE, we return NULL if the
+  * sort key isn't already present in any EquivalenceClass.
+  *
   * canonicalize should always be TRUE after EquivalenceClass merging has
   * been performed, but FALSE if we haven't done EquivalenceClass merging yet.
   */
*************** make_pathkey_from_sortinfo(PlannerInfo *
*** 238,243 ****
--- 243,249 ----
                             Expr *expr, Oid ordering_op,
                             bool nulls_first,
                             Index sortref,
+                            bool create_it,
                             bool canonicalize)
  {
      Oid            opfamily,
*************** make_pathkey_from_sortinfo(PlannerInfo *
*** 300,308 ****
                                              COERCE_DONTCARE);
      }

!     /* Now find or create a matching EquivalenceClass */
      eclass = get_eclass_for_sort_expr(root, expr, opcintype, opfamilies,
!                                       sortref);

      /* And finally we can find or create a PathKey node */
      if (canonicalize)
--- 306,318 ----
                                              COERCE_DONTCARE);
      }

!     /* Now find or (optionally) create a matching EquivalenceClass */
      eclass = get_eclass_for_sort_expr(root, expr, opcintype, opfamilies,
!                                       sortref, create_it);
!
!     /* Fail if no EC and !create_it */
!     if (!eclass)
!         return NULL;

      /* And finally we can find or create a PathKey node */
      if (canonicalize)
*************** get_cheapest_fractional_path_for_pathkey
*** 478,485 ****
   * The result is canonical, meaning that redundant pathkeys are removed;
   * it may therefore have fewer entries than there are index columns.
   *
!  * We generate the full pathkeys list whether or not all are useful for the
!  * current query.  Caller should do truncate_useless_pathkeys().
   */
  List *
  build_index_pathkeys(PlannerInfo *root,
--- 488,498 ----
   * The result is canonical, meaning that redundant pathkeys are removed;
   * it may therefore have fewer entries than there are index columns.
   *
!  * Another reason for stopping early is that we may be able to tell that
!  * an index column's sort order is uninteresting for this query.  However,
!  * that test is just based on the existence of an EquivalenceClass and not
!  * on position in pathkey lists, so it's not complete.  Caller should call
!  * truncate_useless_pathkeys() to possibly remove more pathkeys.
   */
  List *
  build_index_pathkeys(PlannerInfo *root,
*************** build_index_pathkeys(PlannerInfo *root,
*** 527,540 ****
              indexprs_item = lnext(indexprs_item);
          }

!         /* OK, make a canonical pathkey for this sort key */
          cpathkey = make_pathkey_from_sortinfo(root,
                                                indexkey,
                                                sortop,
                                                nulls_first,
                                                0,
                                                true);

          /* Add to list unless redundant */
          if (!pathkey_is_redundant(cpathkey, retval))
              retval = lappend(retval, cpathkey);
--- 540,562 ----
              indexprs_item = lnext(indexprs_item);
          }

!         /* OK, try to make a canonical pathkey for this sort key */
          cpathkey = make_pathkey_from_sortinfo(root,
                                                indexkey,
                                                sortop,
                                                nulls_first,
                                                0,
+                                               false,
                                                true);

+         /*
+          * If the sort key isn't already present in any EquivalenceClass,
+          * then it's not an interesting sort order for this query.  So
+          * we can stop now --- lower-order sort keys aren't useful either.
+          */
+         if (!cpathkey)
+             break;
+
          /* Add to list unless redundant */
          if (!pathkey_is_redundant(cpathkey, retval))
              retval = lappend(retval, cpathkey);
*************** convert_subquery_pathkeys(PlannerInfo *r
*** 644,656 ****
                                               outer_expr,
                                               sub_member->em_datatype,
                                               sub_eclass->ec_opfamilies,
!                                              0);
!                 best_pathkey =
!                     make_canonical_pathkey(root,
!                                            outer_ec,
!                                            sub_pathkey->pk_opfamily,
!                                            sub_pathkey->pk_strategy,
!                                            sub_pathkey->pk_nulls_first);
              }
          }
          else
--- 666,685 ----
                                               outer_expr,
                                               sub_member->em_datatype,
                                               sub_eclass->ec_opfamilies,
!                                              0,
!                                              false);
!
!                 /*
!                  * If we don't find a matching EC, sub-pathkey isn't
!                  * interesting to the outer query
!                  */
!                 if (outer_ec)
!                     best_pathkey =
!                         make_canonical_pathkey(root,
!                                                outer_ec,
!                                                sub_pathkey->pk_opfamily,
!                                                sub_pathkey->pk_strategy,
!                                                sub_pathkey->pk_nulls_first);
              }
          }
          else
*************** convert_subquery_pathkeys(PlannerInfo *r
*** 738,744 ****
                                                          outer_expr,
                                                       sub_member->em_datatype,
                                                     sub_eclass->ec_opfamilies,
!                                                         0);
                      outer_pk = make_canonical_pathkey(root,
                                                        outer_ec,
                                                      sub_pathkey->pk_opfamily,
--- 767,782 ----
                                                          outer_expr,
                                                       sub_member->em_datatype,
                                                     sub_eclass->ec_opfamilies,
!                                                         0,
!                                                         false);
!
!                     /*
!                      * If we don't find a matching EC, this sub-pathkey isn't
!                      * interesting to the outer query
!                      */
!                     if (!outer_ec)
!                         continue;
!
                      outer_pk = make_canonical_pathkey(root,
                                                        outer_ec,
                                                      sub_pathkey->pk_opfamily,
*************** make_pathkeys_for_sortclauses(PlannerInf
*** 859,864 ****
--- 897,903 ----
                                               sortcl->sortop,
                                               sortcl->nulls_first,
                                               sortcl->tleSortGroupRef,
+                                              true,
                                               canonicalize);

          /* Canonical form eliminates redundant ordering keys */
*************** make_pathkeys_for_sortclauses(PlannerInf
*** 878,923 ****
   ****************************************************************************/

  /*
!  * cache_mergeclause_eclasses
!  *        Make the cached EquivalenceClass links valid in a mergeclause
!  *        restrictinfo.
   *
   * RestrictInfo contains fields in which we may cache pointers to
   * EquivalenceClasses for the left and right inputs of the mergeclause.
   * (If the mergeclause is a true equivalence clause these will be the
!  * same EquivalenceClass, otherwise not.)
   */
  void
! cache_mergeclause_eclasses(PlannerInfo *root, RestrictInfo *restrictinfo)
  {
      Assert(restrictinfo->mergeopfamilies != NIL);

!     /* the cached values should be either both set or both not */
!     if (restrictinfo->left_ec == NULL)
!     {
!         Expr       *clause = restrictinfo->clause;
!         Oid            lefttype,
!                     righttype;

!         /* Need the declared input types of the operator */
!         op_input_types(((OpExpr *) clause)->opno, &lefttype, &righttype);

!         /* Find or create a matching EquivalenceClass for each side */
!         restrictinfo->left_ec =
!             get_eclass_for_sort_expr(root,
!                                      (Expr *) get_leftop(clause),
!                                      lefttype,
!                                      restrictinfo->mergeopfamilies,
!                                      0);
!         restrictinfo->right_ec =
!             get_eclass_for_sort_expr(root,
!                                      (Expr *) get_rightop(clause),
!                                      righttype,
!                                      restrictinfo->mergeopfamilies,
!                                      0);
!     }
!     else
!         Assert(restrictinfo->right_ec != NULL);
  }

  /*
--- 917,997 ----
   ****************************************************************************/

  /*
!  * initialize_mergeclause_eclasses
!  *        Set the EquivalenceClass links in a mergeclause restrictinfo.
   *
   * RestrictInfo contains fields in which we may cache pointers to
   * EquivalenceClasses for the left and right inputs of the mergeclause.
   * (If the mergeclause is a true equivalence clause these will be the
!  * same EquivalenceClass, otherwise not.)  If the mergeclause is either
!  * used to generate an EquivalenceClass, or derived from an EquivalenceClass,
!  * then it's easy to set up the left_ec and right_ec members --- otherwise,
!  * this function should be called to set them up.  We will generate new
!  * EquivalenceClasses if necessary to represent the mergeclause's left and
!  * right sides.
!  *
!  * Note this is called before EC merging is complete, so the links won't
!  * necessarily point to canonical ECs.  Before they are actually used for
!  * anything, update_mergeclause_eclasses must be called to ensure that
!  * they've been updated to point to canonical ECs.
   */
  void
! initialize_mergeclause_eclasses(PlannerInfo *root, RestrictInfo *restrictinfo)
  {
+     Expr       *clause = restrictinfo->clause;
+     Oid            lefttype,
+                 righttype;
+
+     /* Should be a mergeclause ... */
      Assert(restrictinfo->mergeopfamilies != NIL);
+     /* ... with links not yet set */
+     Assert(restrictinfo->left_ec == NULL);
+     Assert(restrictinfo->right_ec == NULL);

!     /* Need the declared input types of the operator */
!     op_input_types(((OpExpr *) clause)->opno, &lefttype, &righttype);

!     /* Find or create a matching EquivalenceClass for each side */
!     restrictinfo->left_ec =
!         get_eclass_for_sort_expr(root,
!                                  (Expr *) get_leftop(clause),
!                                  lefttype,
!                                  restrictinfo->mergeopfamilies,
!                                  0,
!                                  true);
!     restrictinfo->right_ec =
!         get_eclass_for_sort_expr(root,
!                                  (Expr *) get_rightop(clause),
!                                  righttype,
!                                  restrictinfo->mergeopfamilies,
!                                  0,
!                                  true);
! }

! /*
!  * update_mergeclause_eclasses
!  *        Make the cached EquivalenceClass links valid in a mergeclause
!  *        restrictinfo.
!  *
!  * These pointers should have been set by process_equivalence or
!  * initialize_mergeclause_eclasses, but they might have been set to
!  * non-canonical ECs that got merged later.  Chase up to the canonical
!  * merged parent if so.
!  */
! void
! update_mergeclause_eclasses(PlannerInfo *root, RestrictInfo *restrictinfo)
! {
!     /* Should be a merge clause ... */
!     Assert(restrictinfo->mergeopfamilies != NIL);
!     /* ... with pointers already set */
!     Assert(restrictinfo->left_ec != NULL);
!     Assert(restrictinfo->right_ec != NULL);
!
!     /* Chase up to the top as needed */
!     while (restrictinfo->left_ec->ec_merged)
!         restrictinfo->left_ec = restrictinfo->left_ec->ec_merged;
!     while (restrictinfo->right_ec->ec_merged)
!         restrictinfo->right_ec = restrictinfo->right_ec->ec_merged;
  }

  /*
*************** find_mergeclauses_for_pathkeys(PlannerIn
*** 953,959 ****
      {
          RestrictInfo *rinfo = (RestrictInfo *) lfirst(i);

!         cache_mergeclause_eclasses(root, rinfo);
      }

      foreach(i, pathkeys)
--- 1027,1033 ----
      {
          RestrictInfo *rinfo = (RestrictInfo *) lfirst(i);

!         update_mergeclause_eclasses(root, rinfo);
      }

      foreach(i, pathkeys)
*************** select_outer_pathkeys_for_merge(PlannerI
*** 1089,1095 ****
          ListCell   *lc2;

          /* get the outer eclass */
!         cache_mergeclause_eclasses(root, rinfo);

          if (rinfo->outer_is_left)
              oeclass = rinfo->left_ec;
--- 1163,1169 ----
          ListCell   *lc2;

          /* get the outer eclass */
!         update_mergeclause_eclasses(root, rinfo);

          if (rinfo->outer_is_left)
              oeclass = rinfo->left_ec;
*************** make_inner_pathkeys_for_merge(PlannerInf
*** 1250,1256 ****
          EquivalenceClass *ieclass;
          PathKey    *pathkey;

!         cache_mergeclause_eclasses(root, rinfo);

          if (rinfo->outer_is_left)
          {
--- 1324,1330 ----
          EquivalenceClass *ieclass;
          PathKey    *pathkey;

!         update_mergeclause_eclasses(root, rinfo);

          if (rinfo->outer_is_left)
          {
*************** pathkeys_useful_for_merging(PlannerInfo
*** 1366,1372 ****

                  if (restrictinfo->mergeopfamilies == NIL)
                      continue;
!                 cache_mergeclause_eclasses(root, restrictinfo);

                  if (pathkey->pk_eclass == restrictinfo->left_ec ||
                      pathkey->pk_eclass == restrictinfo->right_ec)
--- 1440,1446 ----

                  if (restrictinfo->mergeopfamilies == NIL)
                      continue;
!                 update_mergeclause_eclasses(root, restrictinfo);

                  if (pathkey->pk_eclass == restrictinfo->left_ec ||
                      pathkey->pk_eclass == restrictinfo->right_ec)
diff --git a/src/backend/optimizer/plan/initsplan.c b/src/backend/optimizer/plan/initsplan.c
index a2ff314679daa30003d59d04ec01066049e1cb13..77ebadd5cc6a634f5e07886c19a3e4c71dbe3c31 100644
*** a/src/backend/optimizer/plan/initsplan.c
--- b/src/backend/optimizer/plan/initsplan.c
*************** distribute_qual_to_rels(PlannerInfo *roo
*** 1066,1071 ****
--- 1066,1077 ----
       *
       * If none of the above hold, pass it off to
       * distribute_restrictinfo_to_rels().
+      *
+      * In all cases, it's important to initialize the left_ec and right_ec
+      * fields of a mergejoinable clause, so that all possibly mergejoinable
+      * expressions have representations in EquivalenceClasses.  If
+      * process_equivalence is successful, it will take care of that;
+      * otherwise, we have to call initialize_mergeclause_eclasses to do it.
       */
      if (restrictinfo->mergeopfamilies)
      {
*************** distribute_qual_to_rels(PlannerInfo *roo
*** 1073,1082 ****
          {
              if (process_equivalence(root, restrictinfo, below_outer_join))
                  return;
!             /* EC rejected it, so pass to distribute_restrictinfo_to_rels */
          }
          else if (maybe_outer_join && restrictinfo->can_join)
          {
              if (bms_is_subset(restrictinfo->left_relids,
                                outerjoin_nonnullable) &&
                  !bms_overlap(restrictinfo->right_relids,
--- 1079,1093 ----
          {
              if (process_equivalence(root, restrictinfo, below_outer_join))
                  return;
!             /* EC rejected it, so set left_ec/right_ec the hard way ... */
!             initialize_mergeclause_eclasses(root, restrictinfo);
!             /* ... and fall through to distribute_restrictinfo_to_rels */
          }
          else if (maybe_outer_join && restrictinfo->can_join)
          {
+             /* we need to set up left_ec/right_ec the hard way */
+             initialize_mergeclause_eclasses(root, restrictinfo);
+             /* now see if it should go to any outer-join lists */
              if (bms_is_subset(restrictinfo->left_relids,
                                outerjoin_nonnullable) &&
                  !bms_overlap(restrictinfo->right_relids,
*************** distribute_qual_to_rels(PlannerInfo *roo
*** 1104,1109 ****
--- 1115,1126 ----
                                                    restrictinfo);
                  return;
              }
+             /* nope, so fall through to distribute_restrictinfo_to_rels */
+         }
+         else
+         {
+             /* we still need to set up left_ec/right_ec */
+             initialize_mergeclause_eclasses(root, restrictinfo);
          }
      }

*************** process_implied_equality(PlannerInfo *ro
*** 1404,1409 ****
--- 1421,1429 ----
   *
   * This overlaps the functionality of process_implied_equality(), but we
   * must return the RestrictInfo, not push it into the joininfo tree.
+  *
+  * Note: we do not do initialize_mergeclause_eclasses() here.  It is
+  * caller's responsibility that left_ec/right_ec be set as necessary.
   */
  RestrictInfo *
  build_implied_join_equality(Oid opno,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 5f628eeeb96174a3c9051257e235baff43efcacd..6bce53cce57d5c3d8551ad12c371e36100c009ff 100644
*** a/src/include/optimizer/paths.h
--- b/src/include/optimizer/paths.h
*************** extern EquivalenceClass *get_eclass_for_
*** 116,122 ****
                           Expr *expr,
                           Oid expr_datatype,
                           List *opfamilies,
!                          Index sortref);
  extern void generate_base_implied_equalities(PlannerInfo *root);
  extern List *generate_join_implied_equalities(PlannerInfo *root,
                                   RelOptInfo *joinrel,
--- 116,123 ----
                           Expr *expr,
                           Oid expr_datatype,
                           List *opfamilies,
!                          Index sortref,
!                          bool create_it);
  extern void generate_base_implied_equalities(PlannerInfo *root);
  extern List *generate_join_implied_equalities(PlannerInfo *root,
                                   RelOptInfo *joinrel,
*************** extern List *make_pathkeys_for_sortclaus
*** 172,178 ****
                                List *sortclauses,
                                List *tlist,
                                bool canonicalize);
! extern void cache_mergeclause_eclasses(PlannerInfo *root,
                             RestrictInfo *restrictinfo);
  extern List *find_mergeclauses_for_pathkeys(PlannerInfo *root,
                                 List *pathkeys,
--- 173,181 ----
                                List *sortclauses,
                                List *tlist,
                                bool canonicalize);
! extern void initialize_mergeclause_eclasses(PlannerInfo *root,
!                                             RestrictInfo *restrictinfo);
! extern void update_mergeclause_eclasses(PlannerInfo *root,
                             RestrictInfo *restrictinfo);
  extern List *find_mergeclauses_for_pathkeys(PlannerInfo *root,
                                 List *pathkeys,

Re: plan time of MASSIVE partitioning ...

From
Boszormenyi Zoltan
Date:
Tom Lane wrote:
> I wrote:
>
>> the right way to make this faster is to refactor things so that we
>> don't generate useless equivalence classes in the first place, or
>> at least don't keep them around in the planner's lists once we realize
>> they're useless.
>>
>
> After a bit of hacking, I propose the attached patch.
>
>
>> I like Heikki's hack to cut down on searching in make_canonical_pathkey,
>> but I think that complicating the data structure searching beyond that
>> is just a band-aid.
>>
>
> With the given test case and this patch, we end up with exactly two
> canonical pathkeys referencing a single EquivalenceClass.  So as far
> as I can tell there's not a lot of point in refining the pathkey
> searching.  Now, the EquivalenceClass has got 483 members, which
> means that there's still some O(N^2) behavior in
> get_eclass_for_sort_expr.  There might be some use in refining the
> search for a matching eclass member.  It's not sticking out in
> profiling like it did before though.
>
>             regards, tom lane
>

Thanks, this patch made get_eclass_for_sort_expr almost disappear,
and made make_canonical_pathkeys and add_child_rel_equivalences
completely disappear from the gprof timing.

+1 for including this into 9.1.

On the other hand, if I use a similar test case to my original one
(i.e. the tables are much wider) then the query planning takes
1.42 seconds in 9.1 with this patch instead of about 4.7 seconds
as we observed it using PostgreSQL 9.0.0. The beginning of the gprof
output now looks like this:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 21.13      0.30     0.30   235091     0.00     0.00  SearchCatCache
  7.04      0.40     0.10  1507206     0.00     0.00  hash_search_with_hash_value
  3.52      0.45     0.05  2308219     0.00     0.00  AllocSetAlloc
  3.52      0.50     0.05   845776     0.00     0.00  hash_any
  3.52      0.55     0.05   341637     0.00     0.00  HeapTupleSatisfiesNow
  3.52      0.60     0.05     1136     0.00     0.00  tzload
  2.82      0.64     0.04      547     0.00     0.00  get_rel_data_width
  2.11      0.67     0.03   669414     0.00     0.00  hash_search
  2.11      0.70     0.03   235091     0.00     0.00  SearchSysCache
  2.11      0.73     0.03   192590     0.00     0.00  copyObject
  2.11      0.76     0.03   164457     0.00     0.00  pgstat_initstats
  2.11      0.79     0.03   152999     0.00     0.00  index_getnext
...

Use the attached synthetic create_table_wide.sql together with the
previous childtables.sql. The full compressed gprof output is attached.
Your patch cuts planning time by about 70%, which is excellent.

Best regards,
Zoltán Böszörményi

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
     http://www.postgresql.at/


Attachment

Re: plan time of MASSIVE partitioning ...

From
Leonardo Francalanci
Date:
> On the other hand, if I use a similar test case to my original  one
> (i.e. the tables are much wider) then the query planning takes
> 1.42  seconds in 9.1 with this patch instead of about 4.7 seconds
> as we observed it  using PostgreSQL 9.0.0. The beginning of the gprof
> output now looks like  this:

Hi,

I'm really interested in this patch.

I tried a simple test case:

create table t (a integer, b text);

DO $$DECLARE i int;
BEGIN
    FOR i IN 0..9000 LOOP
        EXECUTE 'create table t' || i || ' ( CHECK (a >' || i*10 ||
            ' and a <= ' || (i+1)*10 || ' ) ) INHERITS (t)';
        EXECUTE 'create index tidx' || i || ' ON t' || i || ' (a)';
    END LOOP;
END$$;

explain select * from t where a > 1060 and a < 1090;

but I don't get any gain from the patch... explain time is still around 250 ms.

Tried with 9000 partitions, time is still 2 secs.


Maybe I've completely missed the purpose of the patch?


(I tried the test case at

http://archives.postgresql.org/message-id/4CBD9DDC.4040304@cybertec.at

and that, in fact, gets a boost with this patch).



Leonardo





Re: plan time of MASSIVE partitioning ...

From
Leonardo Francalanci
Date:
> but I don't get any gain from the patch... explain time is still around 250 ms.
> Tried with 9000 partitions, time is still 2 secs.


Small correction: I tried with 3000 partitions (FOR i IN 0..3000 ...)
and got 250 ms with both versions; with 9000 partitions, 2 secs (again
no gain from the patch).





Re: plan time of MASSIVE partitioning ...

From
Tom Lane
Date:
Leonardo Francalanci <m_lists@yahoo.it> writes:
> I tried a simple test case:

> create table t (a integer, b text);

> DO $$DECLARE i int;
> BEGIN
>             FOR i IN 0..9000 LOOP
>                 EXECUTE 'create table t' || i || ' ( CHECK (a >' || i*10 || ' 
> and a <= ' || (i+1)*10 || ' ) ) INHERITS (t)';
>    EXECUTE 'create index tidx' || i || ' ON t' || i || '  (a)';
> END LOOP;
> END$$;

> explain select * from t where a > 1060 and a < 1090;

This is going to be dominated by constraint exclusion checking.  There's
basically no fix for that except a more explicit representation of the
partitioning rules.  If the planner has to make 8999 theorem proofs to
remove the 8999 unwanted partitions from the plan, it's gonna take
awhile.
        regards, tom lane


Re: plan time of MASSIVE partitioning ...

From
Leonardo Francalanci
Date:
> This is going to be dominated by constraint exclusion  checking.  There's
> basically no fix for that except a more explicit  representation of the
> partitioning rules.

Damn, I knew that was going to be more complicated :)

So in which cases does this patch help? I guess in a multi-index
scenario? childtables.sql is kind of hard to read (I think a FOR loop
would have been more self-explanatory).





Re: plan time of MASSIVE partitioning ...

From
Tom Lane
Date:
Boszormenyi Zoltan <zb@cybertec.at> writes:
> On the other hand, if I use a similar test case to my original one
> (i.e. the tables are much wider) then the query planning takes
> 1.42 seconds in 9.1 with this patch instead of about 4.7 seconds
> as we observed it using PostgreSQL 9.0.0. The beginning of the gprof
> output now looks like this:

>   %   cumulative   self              self     total          
>  time   seconds   seconds    calls   s/call   s/call  name   
>  21.13      0.30     0.30   235091     0.00     0.00  SearchCatCache
>   7.04      0.40     0.10  1507206     0.00     0.00  hash_search_with_hash_value
>   3.52      0.45     0.05  2308219     0.00     0.00  AllocSetAlloc

Yeah, for me it looks even worse: oprofile shows about 77% of time in
SearchCatCache.  I poked around a little and it seems that probably most
of the time is going into searches of the STATRELATTINH syscache, which
looks like this:

$13 = {id = 41, cc_next = 0x2b43a60,
  cc_relname = 0x7f6bc6ed2218 "pg_statistic", cc_reloid = 2619,
  cc_indexoid = 2696, cc_relisshared = 0 '\000', cc_tupdesc = 0x7f6bc6ed11d8,
  cc_ntup = 68922, cc_nbuckets = 1024, cc_nkeys = 3, cc_key = {1, 2, 3, 0},
  ...

Most of those entries are "negative" cache entries, since we don't have
any actual stats in this toy example.

I think that we probably should be very circumspect about believing that
this example is still a good guide to what to optimize next; in
particular, in a real-world example with real stats, I'm not sure that
the hot spots will still be in the same places.  I'd advise loading up
some real data and doing more profiling.

However, if the hot spot does stay in SearchCatCache, I can't help
noticing that those bucket chains are looking a bit overloaded ---
sixty-plus entries per bucket ain't good.  Maybe it's time to teach
catcache.c how to reorganize its hashtables once the load factor
exceeds a certain level.  Or more drastically, maybe it should lose
its private hashtable logic and use dynahash.c; I'm not sure at the
moment if the private implementation has any important characteristics
dynahash hasn't got.
        regards, tom lane
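
For concreteness, the reorganization being suggested here is just the
classic grow-and-rehash step for a chained hash table.  Below is a minimal,
self-contained C sketch of that idea; the names (Entry, Table, MAX_LOAD) are
invented for illustration and are not catcache.c's actual structures.

/*
 * Sketch only: grow and rehash a chained hash table once the load factor
 * gets too high.  Entry, Table and MAX_LOAD are invented names.
 */
#include <stdlib.h>

typedef struct Entry
{
    unsigned int hash;          /* full hash value, cached in the entry */
    struct Entry *next;         /* bucket chain link */
    /* ... key and payload would live here ... */
} Entry;

typedef struct Table
{
    Entry      **buckets;
    unsigned int nbuckets;      /* always a power of two */
    unsigned int nentries;
} Table;

#define MAX_LOAD 2              /* rebuild when nentries > MAX_LOAD * nbuckets */

static void
maybe_rehash(Table *tab)
{
    unsigned int newnbuckets,
                i;
    Entry      **newbuckets;

    if (tab->nentries <= MAX_LOAD * tab->nbuckets)
        return;                 /* load factor still fine, chains stay short */

    newnbuckets = tab->nbuckets * 2;
    newbuckets = calloc(newnbuckets, sizeof(Entry *));
    if (newbuckets == NULL)
        return;                 /* out of memory: just keep the old table */

    /* move every entry onto its new chain, using the cached hash value */
    for (i = 0; i < tab->nbuckets; i++)
    {
        Entry      *e = tab->buckets[i];

        while (e != NULL)
        {
            Entry      *next = e->next;
            unsigned int b = e->hash & (newnbuckets - 1);

            e->next = newbuckets[b];
            newbuckets[b] = e;
            e = next;
        }
    }

    free(tab->buckets);
    tab->buckets = newbuckets;
    tab->nbuckets = newnbuckets;
}

Caching the full hash value in each entry is what makes the rebuild cheap:
moving an entry to the new bucket array is a couple of pointer assignments,
with no key comparisons.  (If memory serves, dynahash already grows its
tables on its own as they fill up, which is presumably part of the
attraction of just switching to it.)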


Re: plan time of MASSIVE partitioning ...

From
Tom Lane
Date:
I wrote:
> This is going to be dominated by constraint exclusion checking.

Hmm, maybe I spoke too soon.  With 9000 child tables I get a profile
like this:

samples  %        symbol name
447433   47.1553  get_tabstat_entry
185458   19.5456  find_all_inheritors
53064     5.5925  SearchCatCache
33864     3.5690  pg_strtok
26301     2.7719  hash_search_with_hash_value
22577     2.3794  AllocSetAlloc
6696      0.7057  MemoryContextAllocZeroAligned
6250      0.6587  expression_tree_walker
5141      0.5418  LockReleaseAll
4779      0.5037  get_relation_info
4506      0.4749  MemoryContextAlloc
4467      0.4708  expression_tree_mutator
4136      0.4359  pgstat_initstats
3914      0.4125  relation_excluded_by_constraints

get_tabstat_entry and find_all_inheritors are both obviously O(N^2) in
the number of tables they have to deal with.  However, the constant
factors are small enough that you need a heck of a lot of tables
before they become significant consumers of runtime.  I'm not convinced
that we should be optimizing for 9000-child-table cases.

It'd be worth fixing these if we can do it without either introducing a
lot of complexity, or slowing things down for typical cases that have
only a few tables.  Offhand not sure about how to do either.
        regards, tom lane


Re: plan time of MASSIVE partitioning ...

From
Robert Haas
Date:
On Fri, Oct 29, 2010 at 12:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Boszormenyi Zoltan <zb@cybertec.at> writes:
>> On the other hand, if I use a similar test case to my original one
>> (i.e. the tables are much wider) then the query planning takes
>> 1.42 seconds in 9.1 with this patch instead of about 4.7 seconds
>> as we observed it using PostgreSQL 9.0.0. The beginning of the gprof
>> output now looks like this:
>
>>   %   cumulative   self              self     total
>>  time   seconds   seconds    calls   s/call   s/call  name
>>  21.13      0.30     0.30   235091     0.00     0.00  SearchCatCache
>>   7.04      0.40     0.10  1507206     0.00     0.00  hash_search_with_hash_value
>>   3.52      0.45     0.05  2308219     0.00     0.00  AllocSetAlloc
>
> Yeah, for me it looks even worse: oprofile shows about 77% of time in
> SearchCatCache.  I poked around a little and it seems that probably most
> of the time is going into searches of the STATRELATTINH syscache, which
> looks like this:
>
> $13 = {id = 41, cc_next = 0x2b43a60,
>  cc_relname = 0x7f6bc6ed2218 "pg_statistic", cc_reloid = 2619,
>  cc_indexoid = 2696, cc_relisshared = 0 '\000', cc_tupdesc = 0x7f6bc6ed11d8,
>  cc_ntup = 68922, cc_nbuckets = 1024, cc_nkeys = 3, cc_key = {1, 2, 3, 0},
>  ...
>
> Most of those entries are "negative" cache entries, since we don't have
> any actual stats in this toy example.
>
> I think that we probably should be very circumspect about believing that
> this example is still a good guide to what to optimize next; in
> particular, in a real-world example with real stats, I'm not sure that
> the hot spots will still be in the same places.  I'd advise loading up
> some real data and doing more profiling.
>
> However, if the hot spot does stay in SearchCatCache, I can't help
> noticing that those bucket chains are looking a bit overloaded ---
> sixty-plus entries per bucket ain't good.  Maybe it's time to teach
> catcache.c how to reorganize its hashtables once the load factor
> exceeds a certain level.  Or more drastically, maybe it should lose
> its private hashtable logic and use dynahash.c; I'm not sure at the
> moment if the private implementation has any important characteristics
> dynahash hasn't got.

I'm not sure what's happening in this particular case, but I seem to
remember poking at a case a while back where we were doing a lot of
repeated statistics lookups for the same columns.  If that's also
the case here and if there is some way to avoid it (hang a pointer to
the stats off the node tree somewhere?) we might be able to cut down
on the number of hash probes, as an alternative to or in addition to
making them faster.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: plan time of MASSIVE partitioning ...

From
Leonardo Francalanci
Date:
> Hmm, maybe I spoke too soon.  With 9000 child tables I get  a profile
> like this:


Well, the 9000-table-test-case was meant to check the difference in
performance with/without the patch... I don't see the reason for trying
to optimize such an unrealistic case.

BTW can someone explain to me which are the cases where the
patch actually helps?




Re: plan time of MASSIVE partitioning ...

From
Tom Lane
Date:
Leonardo Francalanci <m_lists@yahoo.it> writes:
> BTW can someone explain to me which are the cases where the
> patch actually helps?

Cases with lots of irrelevant indexes.  Zoltan's example had 4 indexes
per child table, only one of which was relevant to the query.  In your
test case there are no irrelevant indexes, which is why the runtime
didn't change.
        regards, tom lane


Re: plan time of MASSIVE partitioning ...

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Oct 29, 2010 at 12:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> However, if the hot spot does stay in SearchCatCache, I can't help
>> noticing that those bucket chains are looking a bit overloaded ---
>> sixty-plus entries per bucket ain't good.  Maybe it's time to teach
>> catcache.c how to reorganize its hashtables once the load factor
>> exceeds a certain level.  Or more drastically, maybe it should lose
>> its private hashtable logic and use dynahash.c; I'm not sure at the
>> moment if the private implementation has any important characteristics
>> dynahash hasn't got.

> I'm not sure what's happening in this particular case, but I seem to
> remember poking at a case a while back where we were doing a lot of
> repeated statistics lookups for the same columns.  If that's also
> the case here and if there is some way to avoid it (hang a pointer to
> the stats off the node tree somewhere?) we might be able to cut down
> on the number of hash probes, as an alternative to or in addition to
> making them faster.

I think there are already layers of caching in the planner to avoid
fetching the same stats entries more than once per query.  The problem
here is that there are so many child tables that even fetching stats
once per table per query starts to add up.  (Also, as I said, I'm
worried that we're being misled by the fact that there are no stats to
fetch --- so we're not seeing the costs of actually doing something with
the stats if they existed.)
        regards, tom lane


Re: plan time of MASSIVE partitioning ...

From
Alvaro Herrera
Date:
Excerpts from Tom Lane's message of Fri Oct 29 14:15:55 -0300 2010:
> I wrote:
> > This is going to be dominated by constraint exclusion checking.
> 
> Hmm, maybe I spoke too soon.  With 9000 child tables I get a profile
> like this:
> 
> samples  %        symbol name
> 447433   47.1553  get_tabstat_entry

Is there a reason for keeping the pgstat info in plain lists?  We could
use dynahash there too, I think.  It would have more palloc overhead
that way, though (hmm, but perhaps that can be fixed by having a smart
"alloc" function for it, preallocating a bunch of entries instead of one
by one).

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: plan time of MASSIVE partitioning ...

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
Excerpts from Tom Lane's message of Fri Oct 29 14:15:55 -0300 2010:
>> samples  %        symbol name
>> 447433   47.1553  get_tabstat_entry

> Is there a reason for keeping the pgstat info in plain lists?

Yeah: anything else loses for small numbers of tables per query, which
is the normal case.  I'd guess you'd need ~100 tables touched in
a single transaction before a hashtable is even worth thinking about.

We could possibly adopt a solution similar to the planner's approach for
joinrels: start with a simple list, and switch over to hashing if the
list gets too long.  But I'm really doubtful that it's worth the code
space.  Even with Zoltan's 500-or-so-table case, this wasn't on the
radar screen.
        regards, tom lane


Re: plan time of MASSIVE partitioning ...

From
Leonardo Francalanci
Date:
> Cases with lots of irrelevant indexes.  Zoltan's example had  4 indexes
> per child table, only one of which was relevant to the query.   In your
> test case there are no irrelevant indexes, which is why the  runtime
> didn't change.



Mmh... I must be doing something wrong. It looks to me it's not just
the irrelevant indexes: it's the "order by" that counts. If I remove that,
times are the same with and without the patch:

using the test case:

explain select * from inh_parent
where timestamp1 between '2010-04-06' and '2010-06-25'

this one runs in the same time with the patch; but adding:


order by timestamp2

made the non-patched version run 3 times slower.

My test case:

create table t (a integer, b integer, c integer, d integer, e text);

DO $$DECLARE i int;
BEGIN
    FOR i IN 0..2000 LOOP
        EXECUTE 'create table t' || i || ' ( CHECK (a >' || i*10 ||
            ' and a <= ' || (i+1)*10 || ' ) ) INHERITS (t)';
        EXECUTE 'create index taidx' || i || ' ON t' || i || ' (a)';
        EXECUTE 'create index tbidx' || i || ' ON t' || i || ' (b)';
        EXECUTE 'create index tcidx' || i || ' ON t' || i || ' (c)';
        EXECUTE 'create index tdidx' || i || ' ON t' || i || ' (d)';
    END LOOP;
END$$;


explain select * from t where a > 1060 and a < 109000

this runs in 1.5 secs with and without the patch. But if I add
order by b

the non-patched version runs in 10 seconds.

Am I getting it wrong?





Re: plan time of MASSIVE partitioning ...

From
Tom Lane
Date:
Leonardo Francalanci <m_lists@yahoo.it> writes:
>> Cases with lots of irrelevant indexes.  Zoltan's example had  4 indexes
>> per child table, only one of which was relevant to the query.   In your
>> test case there are no irrelevant indexes, which is why the  runtime
>> didn't change.

> Mmh... I must be doing something wrong. It looks to me it's not just
> the irrelevant indexes: it's the "order by" that counts.

Ah, I oversimplified a bit: actually, if you don't have an ORDER BY or
any mergejoinable join clauses, then the possibly_useful_pathkeys test
in find_usable_indexes figures out that we aren't interested in the sort
ordering of *any* indexes, so the whole thing gets short-circuited.
You need at least the possibility of interest in sorted output from an
indexscan before any of this code runs.
        regards, tom lane


Re: plan time of MASSIVE partitioning ...

From
Tom Lane
Date:
I wrote:
> samples  %        symbol name
> 447433   47.1553  get_tabstat_entry
> 185458   19.5456  find_all_inheritors
> 53064     5.5925  SearchCatCache
> 33864     3.5690  pg_strtok

> get_tabstat_entry and find_all_inheritors are both obviously O(N^2) in
> the number of tables they have to deal with.  However, the constant
> factors are small enough that you need a heck of a lot of tables
> before they become significant consumers of runtime.  I'm not convinced
> that we should be optimizing for 9000-child-table cases.

> It'd be worth fixing these if we can do it without either introducing a
> lot of complexity, or slowing things down for typical cases that have
> only a few tables.  Offhand not sure about how to do either.

I had a thought about how to make get_tabstat_entry() faster without
adding overhead: what if we just plain remove the search, and always
assume that a new entry has to be added to the tabstat array?

The existing code seems to be designed to make no assumptions about
how it's being used, but that's a bit silly.  We know that the links are
coming from the relcache, which will have only one entry per relation,
and that the relcache is designed to hang onto the links for (at least)
the life of a transaction.  So rather than optimizing for the case where
the relcache fails to remember the tabstat link, maybe we should
optimize for the case where it does remember.

The worst-case consequence AFAICS would be multiple tabstat entries for
the same relation, which seems pretty noncritical anyway.

Thoughts?
        regards, tom lane


Re: plan time of MASSIVE partitioning ...

From
Tom Lane
Date:
[ for the archives' sake ]

I wrote:
> I had a thought about how to make get_tabstat_entry() faster without
> adding overhead: what if we just plain remove the search, and always
> assume that a new entry has to be added to the tabstat array?

I spent some time looking into this idea.  It doesn't really work,
because there are places that will break if a transaction has more than
one tabstat entry for the same relation.  The one that seems most
difficult to fix is that pgstat_recv_tabstat() clamps its n_live_tuples
and n_dead_tuples values to be nonnegative after adding in each delta
received from a backend.  That is a good idea because it prevents insane
results if some messages get lost --- but if a transaction's updates get
randomly spread into several tabstat items, the intermediate counts
might get clamped, resulting in a wrong final answer even though nothing
was lost.

I also added some instrumentation printouts and found that in our
regression tests:

* about 10% of get_tabstat_entry() calls find an existing entry for the
  relation OID.  This seems to happen only when a relcache entry gets
  flushed mid-transaction, but that does happen, and not so infrequently
  either.

* about half of the transactions use as many as 20 tabstats, and 10% use
  50 or more; but it drops off fast after that.  Almost no transactions
  use as many as 100 tabstats.

It's not clear that these numbers are representative of typical
database applications, but they're something to start with anyway.

I also did some testing to compare the cost of get_tabstat_entry's
linear search against a dynahash.c table with OID key.  As I suspected,
a hash table would make this code a *lot* slower for small numbers of
tabstat entries: about a factor of 10 slower.  You need upwards of 100
tabstats touched in a transaction before the hash table begins to pay
for itself.  This is largely because dynahash doesn't have any cheap way
to reset a hashtable to empty, so you have to initialize and destroy the
table for each transaction.  By the time you've eaten that overhead,
you've already expended as many cycles as the linear search takes to
handle several dozen entries.

I conclude that if we wanted to do something about this, the most
practical solution would be the one of executing linear searches until
we get to 100+ tabstat entries in a transaction, and then building a
hashtable for subsequent searches.  However, it's exceedingly unclear
that it will ever be worth the effort or code space to do that.
        regards, tom lane
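
To make the shape of that hybrid concrete, here is a self-contained C sketch
of the "linear search until 100+ entries, then build a hashtable" idea.  This
is not the real pgstat code: TabStat, TABSTAT_HASH_THRESHOLD and the helper
names are invented, and the per-table counters and error handling are
omitted.  The point is that the hash can be a throwaway index over the same
append-only array, so small transactions never pay for building it.

/*
 * Sketch only: linear search for small transactions, switching to a hash
 * index once many tables have been touched.  All names are invented.
 */
#include <stdlib.h>

typedef unsigned int Oid;

typedef struct TabStat
{
    Oid         relid;
    /* ... per-table counters would live here ... */
} TabStat;

#define TABSTAT_HASH_THRESHOLD 100

static TabStat *entries = NULL;     /* append-only array, reset per transaction */
static int      nentries = 0;
static int      maxentries = 0;
static int     *hashslots = NULL;   /* slot -> entries[] index, or -1 if empty */
static unsigned int nslots = 0;     /* power of two; 0 means "hash not built" */

static unsigned int
oid_hash(Oid relid)
{
    return relid * 2654435761u;     /* multiplicative hashing */
}

/* (re)build an open-addressing index over the entries array */
static void
rebuild_hash(void)
{
    unsigned int i;

    nslots = 256;
    while (nslots < (unsigned int) nentries * 4)
        nslots *= 2;                /* keep the table at most 25% full */
    free(hashslots);
    hashslots = malloc(nslots * sizeof(int));
    for (i = 0; i < nslots; i++)
        hashslots[i] = -1;
    for (i = 0; i < (unsigned int) nentries; i++)
    {
        unsigned int h = oid_hash(entries[i].relid) & (nslots - 1);

        while (hashslots[h] != -1)
            h = (h + 1) & (nslots - 1);     /* linear probing */
        hashslots[h] = (int) i;
    }
}

static TabStat *
get_tabstat_entry(Oid relid)
{
    int         i;
    unsigned int h;

    if (nslots == 0)
    {
        /* the common case: few tables per transaction, linear scan wins */
        for (i = 0; i < nentries; i++)
            if (entries[i].relid == relid)
                return &entries[i];
    }
    else
    {
        for (h = oid_hash(relid) & (nslots - 1);
             hashslots[h] != -1;
             h = (h + 1) & (nslots - 1))
            if (entries[hashslots[h]].relid == relid)
                return &entries[hashslots[h]];
    }

    /* not found: append a new entry */
    if (nentries >= maxentries)
    {
        maxentries = maxentries ? maxentries * 2 : 16;
        entries = realloc(entries, maxentries * sizeof(TabStat));
    }
    entries[nentries].relid = relid;
    i = nentries++;

    /* build the hash only for big transactions; enlarge it if it fills up */
    if (nentries > TABSTAT_HASH_THRESHOLD &&
        (unsigned int) nentries * 4 > nslots)
        rebuild_hash();
    else if (nslots != 0)
    {
        for (h = oid_hash(relid) & (nslots - 1);
             hashslots[h] != -1;
             h = (h + 1) & (nslots - 1))
            ;                       /* find an empty slot for the new index */
        hashslots[h] = i;
    }
    return &entries[i];
}

In this sketch, resetting at transaction end would just mean zeroing nentries
and nslots while keeping the allocations around for reuse, which sidesteps the
per-transaction setup cost that makes a full dynahash table a loser for small
transactions.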


Re: plan time of MASSIVE partitioning ...

From
Bruce Momjian
Date:
Tom Lane wrote:
> Leonardo Francalanci <m_lists@yahoo.it> writes:
> >> Cases with lots of irrelevant indexes.  Zoltan's example had  4 indexes
> >> per child table, only one of which was relevant to the query.   In your
> >> test case there are no irrelevant indexes, which is why the  runtime
> >> didn't change.
> 
> > Mmh... I must be doing something wrong. It looks to me it's not just
> > the irrelevant indexes: it's the "order by" that counts.
> 
> Ah, I oversimplified a bit: actually, if you don't have an ORDER BY or
> any mergejoinable join clauses, then the possibly_useful_pathkeys test
> in find_usable_indexes figures out that we aren't interested in the sort
> ordering of *any* indexes, so the whole thing gets short-circuited.
> You need at least the possibility of interest in sorted output from an
> indexscan before any of this code runs.

FYI, I always wondered if the rare use of mergejoins justified the extra
planning time of carrying around all those joinpaths.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: plan time of MASSIVE partitioning ...

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> FYI, I always wondered if the rare use of mergejoins justified the extra
> planning time of carrying around all those joinpaths.

They're hardly rare.
        regards, tom lane


Re: plan time of MASSIVE partitioning ...

From
Robert Haas
Date:
On Fri, Nov 12, 2010 at 10:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>> FYI, I always wondered if the rare use of mergejoins justified the extra
>> planning time of carrying around all those joinpaths.
>
> They're hardly rare.

They're fairly rare in the sorts of queries I normally issue, but I'd
quibble with the statement on other grounds: IME, we generate far more
nestloop paths than anything else.  The comment in
match_unsorted_outer() says it all:
 * We always generate a nestloop path for each available outer path.
 * In fact we may generate as many as five: one on the cheapest-total-cost
 * inner path, one on the same with materialization, one on the
 * cheapest-startup-cost inner path (if different), one on the
 * cheapest-total inner-indexscan path (if any), and one on the
 * cheapest-startup inner-indexscan path (if different).
 

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company