Thread: inheritance: planning time vs children number vs column number
Hi,
I've been facing a very large (more than 15 seconds) planning time in a partitioned configuration. The number of partitions wasn't completely crazy, around 500, not in the thousands. The problem was that there were nearly 1000 columns in the parent table (very special use case, there is a reason for this application having this many columns). The check constraint was extremely simple (for each child, 1 column = 1 constant, always the same column).
As I was surprised by this very large planning time, I have been trying to study the variation of planning time against several parameters:
- number of columns
- number of children tables
- constraint exclusion's value (partition or off)
What (I think) I measured is that the planning time seems to be O(n^2) for the number of columns, and O(n^2) for the number of children tables.
Constraint exclusion had a limited impact on planning time (it added between 20% and 100% planning time when there were many columns).
I'd like to know if this is a known behavior? And if I could mitigate it somehow?
Attached is a zipped csv file containing the result of the tests for constraint_exclusion=partition, for children from 100 to 1000 in steps of 100, and for columns from 10 to 1590 in steps of 20.
A few values are a bit off-chart as this was done on my personal computer, and it was sometimes used for other things at the same time.
The tests were done with a parent table made of only integer columns, and every child having a CHECK (col0=id_of_child) constraint (I can also provide the perl script, of course).
The test query was "SELECT * FROM parent_table WHERE col0=id_of_child_0". Replacing it with "SELECT col0 FROM parent_table WHERE col0=id_of_child_0" didn't change the planning time significantly: it was around 5% lower, but still O(n^2). This query returned nothing (every partition is empty).
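For reference, a minimal Python sketch of the kind of DDL the benchmark generates (the actual harness was a perl script that is not shown in the thread; the helper names below are invented, though the table/column naming matches the query in this message):

```python
# Hypothetical sketch of the benchmark generator described above; the
# original was a perl script not included in the thread.

def make_parent(n_cols: int) -> str:
    """DDL for a parent table with n_cols integer columns."""
    cols = ", ".join(f"col{i} int" for i in range(n_cols))
    return f"CREATE TABLE parent_table ({cols});"

def make_children(n_children: int) -> list[str]:
    """One empty child per partition, each with a trivial CHECK on col0."""
    return [
        f"CREATE TABLE child_{i} (CHECK (col0={i})) INHERITS (parent_table);"
        for i in range(n_children)
    ]

def build_query(child_id: int = 0) -> str:
    """The test query: returns nothing, since every partition is empty."""
    return f"SELECT * FROM parent_table WHERE col0={child_id};"

if __name__ == "__main__":
    for stmt in [make_parent(1000), *make_children(500)]:
        print(stmt)
    print(build_query(0))
```

Feeding the printed statements to psql and timing EXPLAIN on the generated query would reproduce the setup described.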
I've also made an OpenOffice spreadsheet graphing all of this, but as it's 130kB I won't send it before being told to do so :)
The computer running the tests was an Intel Core i7 870. PostgreSQL was 9.0.3.
Anything else I could add ?
Cheers
On 28.02.2011 11:38, Marc Cousin wrote:
> I've been facing a very large (more than 15 seconds) planning time in a
> partitioned configuration. The amount of partitions wasn't completely crazy,
> around 500, not in the thousands. The problem was that there were nearly 1000
> columns in the parent table (very special use case, there is a reason for this
> application for having these many columns). The check constraint was extremely
> simple (for each child, 1 column = 1 constant, always the same column).
>
> As I was surprised by this very large planning time, I have been trying to
> study the variation of planning time against several parameters:
> - number of columns
> - number of children tables
> - constraint exclusion's value (partition or off)
>
> What (I think) I measured is that the planning time seems to be O(n^2) for the
> number of columns, and O(n^2) for the number of children tables.
>
> Constraint exclusion had a limited impact on planning time (it added between
> 20% and 100% planning time when there were many columns).

Testing here with a table with 1000 columns and 100 partitions, about 80%
of the planning time is looking up the statistics on attribute width, to
calculate average tuple width. I don't see O(n^2) behavior, though, it
seems linear.

> I'd like to know if this is a known behavior ? And if I could mitigate it
> somehow ?

I'm out of ideas on how to make it faster, I'm afraid.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Monday 28 February 2011 13:57:45, Heikki Linnakangas wrote:
> On 28.02.2011 11:38, Marc Cousin wrote:
> > I've been facing a very large (more than 15 seconds) planning time in a
> > partitioned configuration. The amount of partitions wasn't completely
> > crazy, around 500, not in the thousands. The problem was that there were
> > nearly 1000 columns in the parent table (very special use case, there is
> > a reason for this application for having these many columns). The check
> > constraint was extremely simple (for each child, 1 column = 1 constant,
> > always the same column).
> >
> > As I was surprised by this very large planning time, I have been trying
> > to study the variation of planning time against several parameters:
> > - number of columns
> > - number of children tables
> > - constraint exclusion's value (partition or off)
> >
> > What (I think) I measured is that the planning time seems to be O(n^2)
> > for the number of columns, and O(n^2) for the number of children tables.
> >
> > Constraint exclusion had a limited impact on planning time (it added
> > between 20% and 100% planning time when there were many columns).
>
> Testing here with a table with 1000 columns and 100 partitions, about
> 80% of the planning time is looking up the statistics on attribute
> width, to calculate average tuple width. I don't see O(n^2) behavior,
> though, it seems linear.
It is only based on experimentation, for my part, of course…
If you measure the planning time while varying either the number of columns or the number of partitions, the square root of the planning time is almost perfectly proportional to the parameter you're playing with.
Marc Cousin <cousinmarc@gmail.com> writes:
> On Monday 28 February 2011 13:57:45, Heikki Linnakangas wrote:
>> Testing here with a table with 1000 columns and 100 partitions, about
>> 80% of the planning time is looking up the statistics on attribute
>> width, to calculate average tuple width. I don't see O(n^2) behavior,
>> though, it seems linear.

> It is only based on experimentation, for my part, of course…

> If you measure the planning time, modifying either the columns or the
> partitions number, the square root of the planning time is almost perfectly
> proportional with the parameter you're playing with.

Could we see a concrete example demonstrating that? I agree with Heikki
that it's not obvious what you are testing that would have such behavior.
I can think of places that would have O(N^2) behavior in the length of
the targetlist, but it seems unlikely that they'd come to dominate
runtime at a mere 1000 columns.

			regards, tom lane
We have a medium-sized catalog (about 5 million rows), but some of our customers only want to see portions of it. I've been experimenting with a customer-specific schema that contains nothing but a "join table" -- just the primary keys of that portion of the data that each customer wants to see, which is used to create a view that looks like the original table. But the most important query, the one that customers use to scan page-by-page through search results, turns out to be far too slow (65 seconds versus 55 milliseconds).

Below are the results of two explain/analyze statements. The first one uses the view, the second one goes directly to the original tables. I thought this would be a slam-dunk, that it would return results in a flash because the view is created from two tables with the same primary keys.

My guess (and it's just a wild guess) is that the "left join" is forcing a sequence scan or something. But we need the left join, because it's on a "hitlist" that recorded all the matches to a customer's earlier query, and if rows have been removed from the tables, the customer needs to see a blank row.
Here is the "bad" query, which is run on the view:

em=> explain analyze
select version.version_id, version.isosmiles
from hitlist_rows_reset_140
left join version on (hitlist_rows_reset_140.objectid = version.version_id)
where hitlist_rows_reset_140.sortorder >= 1
and hitlist_rows_reset_140.sortorder <= 10
order by hitlist_rows_reset_140.sortorder;

                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=23687.51..215315.74 rows=1 width=54) (actual time=2682.662..63680.076 rows=10 loops=1)
   Join Filter: (hitlist_rows_reset_140.objectid = v.version_id)
   ->  Index Scan using hitlist_rows_reset_140_pkey on hitlist_rows_reset_140  (cost=0.00..8.36 rows=1 width=8) (actual time=0.015..0.049 rows=10 loops=1)
         Index Cond: ((sortorder >= 1) AND (sortorder <= 10))
   ->  Hash Join  (cost=23687.51..204666.54 rows=851267 width=50) (actual time=31.829..6263.403 rows=851267 loops=10)
         Hash Cond: (v.version_id = mv.version_id)
         ->  Seq Scan on version v  (cost=0.00..116146.68 rows=5631968 width=50) (actual time=0.006..859.758 rows=5632191 loops=10)
         ->  Hash  (cost=13046.67..13046.67 rows=851267 width=4) (actual time=317.488..317.488 rows=851267 loops=1)
               ->  Seq Scan on my_version mv  (cost=0.00..13046.67 rows=851267 width=4) (actual time=2.888..115.166 rows=851267 loops=1)
 Total runtime: 63680.162 ms

Here is the "good" query, which is run directly on the data tables.
em=> explain analyze
select registry.version.version_id, registry.version.isosmiles
from hitlist_rows_reset_140
left join registry.version on (hitlist_rows_reset_140.objectid = registry.version.version_id)
where hitlist_rows_reset_140.sortorder >= 1
and hitlist_rows_reset_140.sortorder <= 10
order by hitlist_rows_reset_140.SortOrder;

                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.00..17.73 rows=1 width=54) (actual time=36.022..55.558 rows=10 loops=1)
   ->  Index Scan using hitlist_rows_reset_140_pkey on hitlist_rows_reset_140  (cost=0.00..8.36 rows=1 width=8) (actual time=0.021..0.025 rows=10 loops=1)
         Index Cond: ((sortorder >= 1) AND (sortorder <= 10))
   ->  Index Scan using version_pkey on version  (cost=0.00..9.35 rows=1 width=50) (actual time=5.551..5.552 rows=1 loops=10)
         Index Cond: (hitlist_rows_reset_140.objectid = version.version_id)
 Total runtime: 55.608 ms
(6 rows)

The view is defined like this:

em=> \d my_version
Table "test_schema.my_version"
   Column   |  Type   | Modifiers
------------+---------+-----------
 version_id | integer | not null
Indexes:
    "my_version_pkey" PRIMARY KEY, btree (version_id)

em=> \d version
View "test_schema.version"
   Column   |  Type   | Modifiers
------------+---------+-----------
 version_id | integer |
 parent_id  | integer |
 isosmiles  | text    |
 coord_2d   | text    |
View definition:
 SELECT v.version_id, v.parent_id, v.isosmiles, v.coord_2d
   FROM registry.version v
   JOIN my_version mv ON mv.version_id = v.version_id;

This is:

Postgres 8.4.4
Ubuntu Linux 2.6.32-27
Database: 8x7200 RAID 10, LSI RAID controller with BBU
WAL: 2x7200 RAID1

Non-default config parameters:

max_connections = 500
shared_buffers = 1000MB
work_mem = 128MB
synchronous_commit = off
full_page_writes = off
wal_buffers = 256kB
checkpoint_segments = 30
effective_cache_size = 4GB
track_activities = on
track_counts = off
track_functions = none
escape_string_warning = off

Thanks,
Craig
On Monday 28 February 2011 16:35:37, Tom Lane wrote:
> Marc Cousin <cousinmarc@gmail.com> writes:
> > The Monday 28 February 2011 13:57:45, Heikki Linnakangas wrote :
> >> Testing here with a table with 1000 columns and 100 partitions, about
> >> 80% of the planning time is looking up the statistics on attribute
> >> width, to calculate average tuple width. I don't see O(n^2) behavior,
> >> though, it seems linear.
> >
> > It is only based on experimentation, for my part, of course…
> >
> > If you measure the planning time, modifying either the columns or the
> > partitions number, the square root of the planning time is almost
> > perfectly proportional with the parameter you're playing with.
>
> Could we see a concrete example demonstrating that? I agree with Heikki
> that it's not obvious what you are testing that would have such behavior.
> I can think of places that would have O(N^2) behavior in the length of
> the targetlist, but it seems unlikely that they'd come to dominate
> runtime at a mere 1000 columns.
>
> regards, tom lane
I feel a little silly not having provided a test case from the start…
A script doing a complete test is attached to this email.
It's doing a simple
CREATE TABLE test_father (col0 int,col1 int,col2 int,col3 int,col4 int,col5 int,col6 int,col7 int,col8 int,col9 int,col10 int,col11 int,col12 int,col13 int,col14 int,col15 int,col16 int,col17 int,col18 int,col19 int,col20 int,col21 int,col22 int,col23 int,…)
Followed by 600
CREATE TABLE test_child_0 (CHECK (col0=0)) INHERITS (test_father);
And a single
SELECT col0 FROM test_father WHERE col0=0;
Here are my results (from the same machine). I've done it with 600 partitions, to have big planning times. If you need a smaller one (this one takes nearly ten minutes to run) tell me.
COLS:100 PARTITIONS:600
Time : 513,764 ms (sqrt : 22.6)
COLS:200 PARTITIONS:600
Time : 906,214 ms (sqrt : 30.1)
COLS:300 PARTITIONS:600
Time : 2255,390 ms (sqrt : 47.48)
COLS:400 PARTITIONS:600
Time : 4816,820 ms (sqrt : 69.4)
COLS:500 PARTITIONS:600
Time : 5736,602 ms (sqrt : 75.73)
COLS:600 PARTITIONS:600
Time : 7659,617 ms (sqrt : 87.51)
COLS:700 PARTITIONS:600
Time : 9313,260 ms (sqrt : 96.5)
COLS:800 PARTITIONS:600
Time : 13700,353 ms (sqrt : 117.04)
COLS:900 PARTITIONS:600
Time : 13914,765 ms (sqrt : 117.95)
COLS:1000 PARTITIONS:600
Time : 20335,750 ms (sqrt : 142.6)
COLS:1100 PARTITIONS:600
Time : 21048,958 ms (sqrt : 145.08)
COLS:1200 PARTITIONS:600
Time : 27619,559 ms (sqrt : 166.18)
COLS:1300 PARTITIONS:600
Time : 31357,353 ms (sqrt : 177.08)
COLS:1400 PARTITIONS:600
Time : 34435,711 ms (sqrt : 185.57)
COLS:1500 PARTITIONS:600
Time : 38954,676 ms (sqrt : 197.37)
As for my previous results, these ones are from a machine doing a bit of other work, so some values may be a bit off, and it's only one measurement each time anyway.
The CSV file I sent from the first email is obtained running the exact same commands, but playing on both columns and partitions, and averaged over 3 measures.
Regards.
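One way to sanity-check the claimed quadratic trend in the column sweep above is a log-log fit of planning time against column count: a pure O(n^2) law would give a slope near 2, a linear one a slope near 1. A quick Python sketch (the timings are hand-copied from the table above, with the locale decimal commas converted to points):

```python
import math

# Timings hand-copied from the PARTITIONS:600 sweep above, decimal
# commas converted to points: {number of columns: planning time in ms}.
times_ms = {
    100: 513.764, 200: 906.214, 300: 2255.390, 400: 4816.820,
    500: 5736.602, 600: 7659.617, 700: 9313.260, 800: 13700.353,
    900: 13914.765, 1000: 20335.750, 1100: 21048.958, 1200: 27619.559,
    1300: 31357.353, 1400: 34435.711, 1500: 38954.676,
}

# Least-squares slope in log-log space.
cols = sorted(times_ms)
xs = [math.log(c) for c in cols]
ys = [math.log(times_ms[c]) for c in cols]
mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
print(f"log-log slope: {slope:.2f}")  # well above 1: clearly super-linear
```

On these noisy single measurements the slope comes out between the linear and quadratic exponents, which is at least consistent with super-linear growth.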
Craig James <craig_james@emolecules.com> writes:
> My guess (and it's just a wild guess) is that the "left join" is
> forcing a sequence scan or something.

No, that's forcing the other join to be done in toto because it can't
reorder the left join and regular join.

			regards, tom lane
> Craig James<craig_james@emolecules.com> writes:
>> Here is the "bad" query, which is run on the view:
>>
>> em=> explain analyze
>> select version.version_id, version.isosmiles
>> from hitlist_rows_reset_140
>> left join version on (hitlist_rows_reset_140.objectid = version.version_id)
>> where hitlist_rows_reset_140.sortorder >= 1
>> and hitlist_rows_reset_140.sortorder <= 10
>> order by hitlist_rows_reset_140.sortorder;
>>
>> QUERY PLAN
>> -----------------------------------------------------------------------------------------------------------------------------
>> Nested Loop Left Join  (cost=23687.51..215315.74 rows=1 width=54) (actual time=2682.662..63680.076 rows=10 loops=1)
>>   Join Filter: (hitlist_rows_reset_140.objectid = v.version_id)
>>   ->  Index Scan using hitlist_rows_reset_140_pkey on hitlist_rows_reset_140  (cost=0.00..8.36 rows=1 width=8) (actual time=0.015..0.049 rows=10 loops=1)
>>         Index Cond: ((sortorder >= 1) AND (sortorder <= 10))
>>   ->  Hash Join  (cost=23687.51..204666.54 rows=851267 width=50) (actual time=31.829..6263.403 rows=851267 loops=10)
>>         Hash Cond: (v.version_id = mv.version_id)
>>         ->  Seq Scan on version v  (cost=0.00..116146.68 rows=5631968 width=50) (actual time=0.006..859.758 rows=5632191 loops=10)
>>         ->  Hash  (cost=13046.67..13046.67 rows=851267 width=4) (actual time=317.488..317.488 rows=851267 loops=1)
>>               ->  Seq Scan on my_version mv  (cost=0.00..13046.67 rows=851267 width=4) (actual time=2.888..115.166 rows=851267 loops=1)
>> Total runtime: 63680.162 ms

On 2/28/11 10:57 AM, Tom Lane wrote:
>> My guess (and it's just a wild guess) is that the "left join" is
>> forcing a sequence scan or something.
>
> No, that's forcing the other join to be done in toto because it can't
> reorder the left join and regular join.

I changed the "left join" to just "join" and confirmed that it's fast -- the join on the view drops from 65 seconds back down to a few milliseconds.
Then I thought maybe putting a foreign-key constraint on table "my_version" would solve the problem:

alter table my_version add constraint fk_my_view foreign key(version_id)
  references registry.version(version_id) on delete cascade;

That way, the planner would know that every key in table "my_version" has to also be in table "version", thus avoiding that part about "forcing the other join to be done in toto". But the foreign-key constraint makes no difference, it still does the full join and takes 65 seconds.

So here's how I see it:

- The select can only return ten rows from table "hitlist_rows_reset_140"
- The left join could be applied to table "my_version"
- The results of that could be joined to table "version"

It seems to me that with the foreign-key constraint, it shouldn't have to examine more than ten rows from any of the three tables. Or have I overlooked something?

Thanks,
Craig
Craig James <craig_james@emolecules.com> writes:
> Then I thought maybe putting a foreign-key constraint on table "my_version" would solve the problem:

> alter table my_version add constraint fk_my_view foreign key(version_id)
> references registry.version(version_id) on delete cascade;

> That way, the planner would know that every key in table "my_version" has
> to also be in table "version", thus avoiding that part about "forcing the
> other join to be done in toto". But the foreign-key constraint makes no
> difference, it still does the full join and takes 65 seconds.

That's just wishful thinking I'm afraid. The planner doesn't currently
make any deductions whatsoever from the presence of a foreign key
constraint; and even if it did, I'm not sure that this would help it
decide that a join order constraint could safely be dropped.

			regards, tom lane
Marc Cousin <cousinmarc@gmail.com> writes:
> On Monday 28 February 2011 16:35:37, Tom Lane wrote:
>> Could we see a concrete example demonstrating that? I agree with Heikki
>> that it's not obvious what you are testing that would have such behavior.
>> I can think of places that would have O(N^2) behavior in the length of
>> the targetlist, but it seems unlikely that they'd come to dominate
>> runtime at a mere 1000 columns.

> I feel a little silly not having provided a test case from the start…
> A script doing a complete test is attached to this email.

I did some oprofile analysis of this test case. It's spending
essentially all its time in SearchCatCache, on failed searches of
pg_statistic. The cache accumulates negative entries for each probed
column, and then the searches take time proportional to the number of
entries, so indeed there is an O(N^2) behavior --- but N is the number
of columns times number of tables in your test case, not just the number
of columns.

The cache is a hash table, so ideally the search time would be more or
less constant as the table grows, but to make that happen we'd need to
reallocate with more buckets as the table grows, and catcache.c doesn't
do that yet. We've seen a few cases that make that look worth doing,
but they tend to be pretty extreme, like this one.

It's worth pointing out that the only reason this effect is dominating
the runtime is that you don't have any statistics for these toy test
tables. If you did, the cycles spent using those entries would dwarf
the lookup costs, I think. So it's hard to get excited about doing
anything based on this test case --- it's likely the bottleneck would be
somewhere else entirely if you'd bothered to load up some data.

			regards, tom lane
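The fixed-bucket effect described here can be sketched in miniature. This is a toy Python model, not the actual catcache.c code: the bucket count, keys, and cost accounting are simplifications, but they show why a hash table that never grows its bucket array takes O(N^2) total time for N distinct failed lookups.

```python
# Toy model of a catalog cache whose bucket array is sized once and
# never reallocated: each failed lookup walks a chain whose length
# grows with the number of (negative) entries already cached.

class FixedBucketCache:
    def __init__(self, nbuckets=1024):
        self.buckets = [[] for _ in range(nbuckets)]
        self.comparisons = 0  # total key comparisons performed

    def lookup_or_add_negative(self, key):
        chain = self.buckets[hash(key) % len(self.buckets)]
        for k in chain:          # walk the chain: cost ~ chain length
            self.comparisons += 1
            if k == key:
                return True      # cache hit
        chain.append(key)        # miss: remember it as a negative entry
        return False

def total_cost(n_lookups, nbuckets=1024):
    """Total comparisons for n_lookups distinct (always-failing) probes."""
    cache = FixedBucketCache(nbuckets)
    for i in range(n_lookups):
        cache.lookup_or_add_negative(("pg_statistic", i))
    return cache.comparisons

# Doubling the number of distinct failed lookups roughly quadruples the
# total comparison count once chains get long -- the O(N^2) behavior.
print(total_cost(20_000), total_cost(40_000))
```

Rehashing into more buckets as the entry count grows (as Tom suggests catcache.c could do) would keep chains short and make the total cost roughly linear instead.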
On Tuesday 01 March 2011 07:20:19, Tom Lane wrote:
> Marc Cousin <cousinmarc@gmail.com> writes:
> > On Monday 28 February 2011 16:35:37, Tom Lane wrote:
> >> Could we see a concrete example demonstrating that? I agree with Heikki
> >> that it's not obvious what you are testing that would have such
> >> behavior. I can think of places that would have O(N^2) behavior in the
> >> length of the targetlist, but it seems unlikely that they'd come to
> >> dominate runtime at a mere 1000 columns.
> >
> > I feel a little silly not having provided a test case from the start…
> >
> > A script doing a complete test is attached to this email.
>
> I did some oprofile analysis of this test case. It's spending
> essentially all its time in SearchCatCache, on failed searches of
> pg_statistic. The cache accumulates negative entries for each probed
> column, and then the searches take time proportional to the number of
> entries, so indeed there is an O(N^2) behavior --- but N is the number
> of columns times number of tables in your test case, not just the number
> of columns.
>
> The cache is a hash table, so ideally the search time would be more or
> less constant as the table grows, but to make that happen we'd need to
> reallocate with more buckets as the table grows, and catcache.c doesn't
> do that yet. We've seen a few cases that make that look worth doing,
> but they tend to be pretty extreme, like this one.
>
> It's worth pointing out that the only reason this effect is dominating
> the runtime is that you don't have any statistics for these toy test
> tables. If you did, the cycles spent using those entries would dwarf
> the lookup costs, I think. So it's hard to get excited about doing
> anything based on this test case --- it's likely the bottleneck would be
> somewhere else entirely if you'd bothered to load up some data.
>
> 			regards, tom lane

Yes, for the same test case, with a bit of data in every partition and
statistics up to date, planning time goes from 20 seconds to 125ms for the
600 children/1000 columns case. Which is of course more than acceptable.

Now I've got to check that it's the same problem on the real environment. I
think it has quite a few empty partitions, so no statistics for them…

Thanks a lot.

Marc
Marc Cousin <cousinmarc@gmail.com> writes:
> On Tuesday 01 March 2011 07:20:19, Tom Lane wrote:
>> It's worth pointing out that the only reason this effect is dominating
>> the runtime is that you don't have any statistics for these toy test
>> tables. If you did, the cycles spent using those entries would dwarf
>> the lookup costs, I think. So it's hard to get excited about doing
>> anything based on this test case --- it's likely the bottleneck would be
>> somewhere else entirely if you'd bothered to load up some data.

> Yes, for the same test case, with a bit of data in every partition and
> statistics up to date, planning time goes from 20 seconds to 125ms for the
> 600 children/1000 columns case. Which is of course more than acceptable.

[ scratches head ... ] Actually, I was expecting the runtime to go up
not down. Maybe there's something else strange going on here.

			regards, tom lane
On Tuesday 01 March 2011 16:33:51, Tom Lane wrote:
> Marc Cousin <cousinmarc@gmail.com> writes:
> > On Tuesday 01 March 2011 07:20:19, Tom Lane wrote:
> >> It's worth pointing out that the only reason this effect is dominating
> >> the runtime is that you don't have any statistics for these toy test
> >> tables. If you did, the cycles spent using those entries would dwarf
> >> the lookup costs, I think. So it's hard to get excited about doing
> >> anything based on this test case --- it's likely the bottleneck would be
> >> somewhere else entirely if you'd bothered to load up some data.
> >
> > Yes, for the same test case, with a bit of data in every partition and
> > statistics up to date, planning time goes from 20 seconds to 125ms for
> > the 600 children/1000 columns case. Which is of course more than
> > acceptable.
>
> [ scratches head ... ] Actually, I was expecting the runtime to go up
> not down. Maybe there's something else strange going on here.
>
> regards, tom lane
Then, what can I do to help?
I wrote:
> Marc Cousin <cousinmarc@gmail.com> writes:
>> Yes, for the same test case, with a bit of data in every partition and
>> statistics up to date, planning time goes from 20 seconds to 125ms for the
>> 600 children/1000 columns case. Which is of course more than acceptable.

> [ scratches head ... ] Actually, I was expecting the runtime to go up
> not down. Maybe there's something else strange going on here.

Oh, doh: the failing pg_statistic lookups are all coming from the part of
estimate_rel_size() where it tries to induce a reasonable tuple width
estimate for an empty table (see get_rel_data_width). Probably not a case
we need to get really tense about.

Of course, you could also argue that this code is stupid because it's
very unlikely that there will be any pg_statistic entries either. Maybe
we should just have it go directly to the datatype-based estimate instead
of making a boatload of useless pg_statistic probes.

			regards, tom lane
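For intuition, the datatype-based fallback amounts to something like the following. This is a rough Python sketch, not the real C code in get_rel_data_width; the typlen table and the varlena default width here are illustrative assumptions, not PostgreSQL's actual values.

```python
# Conceptual sketch of a datatype-based tuple width estimate: sum fixed
# type lengths and guess for variable-length types, without probing
# pg_statistic per column at all. (Illustrative only -- the real logic
# is C code in the PostgreSQL planner.)

# Hypothetical typlen table: positive = fixed size in bytes, -1 = varlena.
TYPLEN = {"int4": 4, "int8": 8, "float8": 8, "bool": 1, "text": -1}
VARLENA_DEFAULT_WIDTH = 32  # arbitrary stand-in guess for varlena types

def estimate_tuple_width(column_types):
    width = 0
    for t in column_types:
        typlen = TYPLEN[t]
        width += typlen if typlen > 0 else VARLENA_DEFAULT_WIDTH
    return width

# A 1000-integer-column parent like the test case: one table lookup per
# column, no per-column catalog probe needed.
print(estimate_tuple_width(["int4"] * 1000))  # 4000
```

The point of Tom's suggestion is exactly this shape: for an empty table with no statistics, the width can be derived from the column type definitions alone, skipping the boatload of failing pg_statistic probes.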
On Mon, Feb 28, 2011 at 2:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Craig James <craig_james@emolecules.com> writes:
>> Then I thought maybe putting a foreign-key constraint on table "my_version" would solve the problem:
>
>> alter table my_version add constraint fk_my_view foreign key(version_id)
>> references registry.version(version_id) on delete cascade;
>
>> That way, the planner would know that every key in table "my_version" has
>> to also be in table "version", thus avoiding that part about "forcing the
>> other join to be done in toto". But the foreign-key constraint makes no
>> difference, it still does the full join and takes 65 seconds.
>
> That's just wishful thinking I'm afraid. The planner doesn't currently
> make any deductions whatsoever from the presence of a foreign key
> constraint; and even if it did, I'm not sure that this would help it
> decide that a join order constraint could safely be dropped.

I've previously mused on -hackers about teaching the planner the
concept of an inner-or-left-join; that is, a join that's guaranteed to
return the same results whichever way we choose to implement it.
Proving that an inner join is actually inner-or-left would allow the
join removal logic to consider removing it altogether, and would allow
reordering in cases that aren't otherwise known to be safe. Proving
that a left join is actually inner-or-left doesn't help with join
removal, but it might allow the join to be reordered.

Maybe "non-row-reducing-join" is better terminology than
"inner-or-left-join", but in any case I have a suspicion that inner
join removal will end up being implemented as a special case of
noticing that an inner join falls into this class.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company