Thread: Many joins: monthly summaries S-L--O--W
Hello all! I've been working with PostgreSQL now for about a year and have really enjoyed it. And I've been able to impress the boss with very simple PHP/PostgreSQL web applications, so that's been great too! Nearly all I've done has been very straightforward and basic: simple selects from a couple of different tables joined by where clauses. But now I'd like to do something a bit more complex: show sales per month, current inventory, and expected duration of inventory at present sales rates for each item we have. All in one relation :) Here are the relevant relations products (code TEXT PK, name TEXT) orders (product_code TEXT FK products(id), date DATE, branch_id INTEGER FK branches(id), qty INTEGER) inventory (product_code INTEGER FK products(id), date DATE, qty INTEGER) What I'm trying to make is a table with columns product_code, product_name, jan03_qty,..., half1_qty, jul03_qty, ... sep03_qty, half2_qty, total_qty, inv_qty, inv_date, est_inv_qty, months_remaining where product_name is products(name), jan03_qty is the quantity of sales in January, 2003, half1_qty is quantity of sales from January through June, 2003, inv_qty is the latest inventory data we have, inv_date is the date of that inventory data, est_inv_qty is the estimated current inventory based on inv_qty and sales since inv_date, and months_remaining is an estimate of how many months the estimated inventory will last at average sales rates (calculated using the average monthly sales for the previous months). I've got something that works, but it's *extremely* slow. It takes about 10 minutes running on a 500MHz G4 Mac with 1GB RAM, running OS 10.2.8 and PostgreSQL 7.3.3. Here's what I've done: select products.code as product_code, products.name as product_name, jan03.qty as jan03_qty, feb03.qty as feb03_qty ... inv.qty as inv_qty, est_inv.qty as est_inv_qty, months_remaining::numeric(8,1) from products left join ( select products.code as product_code, sum(qty) as qty from orders, products where products.code = orders.id and date between '2003-01-01' and '2003-01-31' group by product_code) as jan03 on (jan03.product_code = products.code) left join ( select products.code as product_code, sum(qty) as qty from orders, products where products.code = orders.id and date between '2003-02-01' and '2003-02-28' group by product_code) as feb03 on (feb03.product_code = products.code) left join -- repeat through total_qty total on (total.product_code = products.code) left join ( -- this is where it get's hairy select est_inventory.product_code as product_code, est_inventory.qty, monthly_averages.monthly_average, (est_inventory.qty/monthly_average)::numeric(10,1) as months_remaining from ( select inventory.product_code as product_code, inventory.qty - coalesce(orders.qty,0) as qty from ( select product_code, date, qty from current_inventory_view ) as inventory left outer join ( select orders.product_code as product_code, sum(orders.qty) as qty from ( select product_code, date, qty from current_inventory_view ) as inventory, orders where orders.date > inventory.date and orders.product_code = inventory.product_code group by orders.product_code ) as orders on (inventory.product_code = orders.product_code) ) as est_inventory left outer join ( select product_code as product_code, sum(qty)/ageinmonths(timestamp '9/30/2003', timestamp '1/1/2003') as monthly_average from orders where date between '1/1/2003' and '9/30/2003' group by product_code ) as monthly_averages on (est_inventory.product_code = monthly_averages.product_code) where monthly_average > 0 ) as remaining on (remaining.product_code = products.code) left join ( select distinct product_code, date, qty from current_inventory_view order by date desc ) as inventory on (inventory.product_code = products.code) ; -- finally ageinmonths is an sql function that returns a double precision float, the number of months between $1 and $2, defined as follows select 12 * date_part ('year',age($1, $2)) + date_part('month',age($1,$2)) + date_part('day'), age($1, $2))/30; current_inventory_view, showing the most recent inventory qty and date, is defined as SELECT inv.date, inv.product_code, inv.qty FROM inventory inv, (SELECT max(inventory.date) AS date, inventory.product_code FROM inventory GROUP BY inventory.product_code) curr_inv WHERE ((inv.date = curr_inv.date) AND (inv.product_code = curr_inv.product_code)); orders is indexed on date and product_code (and branch_id, but that shouldn't matter) inventory is indexed on date products is unindexed besides the order_pkey index that's auto-generated. I'm using left joins because not all items sell each month. Items that have zero sales would fall out if I used WHERE clauses. Now, I've done an EXPLAIN on the monstrosity and it comes to just over 3 pages printed at 8 pt on A3 landscape. It's much more than I can take in. I've looked through the EXPLAIN documentation and am willing to learn (which is good, because I definitely need to). I tried a version of this, truncating the 'hairy' part. It definitely ran faster, taking less than 5 minutes. Also, this query requires more upkeep than I'd like. I modify it every time I want to add a new month. It seems like I should be able to use a custom function so I can just put in the date span I'd like to cover. I've started to work on a function and have run into some problems (see separate post: Custom function problems if you're interested ) Any suggestion would be appreciated. It seems like this should be something that's easy, and I'm approaching it wrong. If I'm completely on the wrong track, I'd love to know! Also, places I should/could look for ways to accomplish this, that'd be great. If you've made it to here, thanks for your perseverance! : ) Regards, Michael Glaesemann grzm myrealbox com
Micheal, > where product_name is products(name), jan03_qty is the quantity of > sales in January, 2003, half1_qty is quantity of sales from January > through June, 2003, inv_qty is the latest inventory data we have, > inv_date is the date of that inventory data, est_inv_qty is the > estimated current inventory based on inv_qty and sales since inv_date, > and months_remaining is an estimate of how many months the estimated > inventory will last at average sales rates (calculated using the > average monthly sales for the previous months). Well, I wouldn't recommend your left outer join approach. You're making the query do far too much work. There are a number of different ways to solve the "crosstab" problem, and the outer join method is only really good for small data sets. I'd suggest instead that you use the "aggregate grid" method: Construct a table like this, called month_xtab: month_no jan_ct feb_ct mar_ct .... half_ct half2_ct 1 1 0 0 1 0 2 0 1 0 1 0 3 0 0 1 1 0 ... 12 0 0 0 0 1 Then you can do monthly crosstabs like: SELECT item, SUM(no_sold * jan_ct) as jan_qty, SUM(no_sold * feb_ct) as feb_qty .... SUM (no_sold * half_ct) as half_qty, SUM(no_sold) as tot_qty FROM sales, month_xtab WHERE (extract(month from sales_date) = month_no and extract(year from sales_date) = 2003) This is much, much faster than the outer join method for large numbers of columns. For better performance, make an index on extract(month from sales_date). This doesn't solve your whole query problem, but it's a good start. This solution, and other tips, can be found in Joe Celko's "SQL for Smarties" -- Josh Berkus Aglio Database Solutions San Francisco
Michael Glaesmann <grzm@myrealbox.com> writes: > [ a very messy query ] It does seem like that is a big query with a small query struggling to get out. I don't have any immediate suggestions, but I wonder whether you wouldn't profit by reading something about how to do crosstabs and statistics in SQL. Joe Celko's book "SQL For Smarties" is an invaluable resource for hard problems in SQL (be sure to get the 2nd edition). Also, if you decide that a crosstab would help, take a look at the crosstab functions in contrib/tablefunc. (Celko's book only covers SQL-standard solutions, not methods that depend on nonstandard features, so he's at a disadvantage when covering crosstab methods.) Also, I do have a suggestion for this: > current_inventory_view, showing the most recent inventory qty and date, > is defined as > SELECT inv.date, inv.product_code, inv.qty FROM inventory inv, (SELECT > max(inventory.date) AS date, inventory.product_code FROM inventory > GROUP BY inventory.product_code) curr_inv WHERE ((inv.date = > curr_inv.date) AND (inv.product_code = curr_inv.product_code)); If you don't mind using a Postgres-specific feature, you should be able to make this view faster by using DISTINCT ON. Look at the "weather reports" example in the SELECT reference page. regards, tom lane
Functional index problems. (Was: Many joins: monthly summaries S-L--O--W)
From
Michael Glaesmann
Date:
I've implemented the aggregate grid method, building a "xtab" table as outlined. The orders relation has about 300K records, which isn't that big from the numbers thrown around on these lists. The basic query (just the monthly columns, none of the inventory mess) took about 2 minutes. That's going through the 300K records and creating monthly totals. Something tells me there's room for improvement, so I'll keep trying. (Adding the inventory joins slows it back down to about 10 minutes, so there's a lot of room for improvement there, but one thing at a time, right?) On Wednesday, Oct 22, 2003, at 01:48 Asia/Tokyo, Josh Berkus wrote: <excerpt>I'd suggest instead that you use the "aggregate grid" method: </excerpt><<snip/> <excerpt>This is much, much faster than the outer join method for large numbers of columns. For better performance, make an index on extract(month from sales_date). </excerpt> Searching for ways to improve performance, I tried to create a index on the extract function, but for some reason I couldn't get it to work. Following the documentation for CREATE INDEX and EXTRACT, I tried CREATE INDEX dborders_extract_month_idx ON dborders (EXTRACT(MONTH from date)); which gave me ERROR: parser: parse error at or near "(" at character 61 I also tried CREATE INDEX dborders_extract_month_idx ON dborders EXTRACT(MONTH from date); ERROR: parser: parse error at or near "EXTRACT" at character 53 and just for good measure, not that I thought it work CREATE INDEX dborders_extract_month_idx on dborders EXTRACT(MONTH from (date)); ERROR: parser: parse error at or near "EXTRACT" at character 53 What am I missing from the CREATE statement? It seems in line with the documentation for CREATE INDEX: <fontfamily><param>Courier</param>CREATE [ UNIQUE ] INDEX <italic>index_name </italic>ON <italic>table </italic>[ USING <italic>acc_method </italic>] ( <italic>func_name </italic>(<italic>column </italic>[, ... ]) [ <italic>ops_name </italic>] ) [ WHERE <italic>predicate </italic>]</fontfamily> and the example given in 8.5 Functional Indexes of the User's Guide: <fontfamily><param>Courier</param>CREATE INDEX test1_lower_col1_idx ON test1 (lower(col1)); I know this isn't the solution to all my problems, but I am a bit curious why my CREATE INDEX isn't working. Further debugging attempts: Try a different fuction. CREATE INDEX dborders_date_trunc_idx on dborders (date_trunc('month',date)); This failed with ERROR: parser: parse error at or near "'month'" at character 62 I even renamed the 'date' column (type timestamp with timezone) to 'datetime' on the off chance that using and SQL key word (though not a PostgreSQL key word) was causing some problem. Same errors. What is that simple thing I'm overlooking? Any ideas what else I should check? (I did a search on 'functional index' in the list archives but kept getting timed out :( Michael </fontfamily> I've implemented the aggregate grid method, building a "xtab" table as outlined. The orders relation has about 300K records, which isn't that big from the numbers thrown around on these lists. The basic query (just the monthly columns, none of the inventory mess) took about 2 minutes. That's going through the 300K records and creating monthly totals. Something tells me there's room for improvement, so I'll keep trying. (Adding the inventory joins slows it back down to about 10 minutes, so there's a lot of room for improvement there, but one thing at a time, right?) On Wednesday, Oct 22, 2003, at 01:48 Asia/Tokyo, Josh Berkus wrote: > I'd suggest instead that you use the "aggregate grid" method: <snip/> > This is much, much faster than the outer join method for large numbers > of > columns. For better performance, make an index on extract(month from > sales_date). Searching for ways to improve performance, I tried to create a index on the extract function, but for some reason I couldn't get it to work. Following the documentation for CREATE INDEX and EXTRACT, I tried CREATE INDEX dborders_extract_month_idx ON dborders (EXTRACT(MONTH from date)); which gave me ERROR: parser: parse error at or near "(" at character 61 I also tried CREATE INDEX dborders_extract_month_idx ON dborders EXTRACT(MONTH from date); ERROR: parser: parse error at or near "EXTRACT" at character 53 and just for good measure, not that I thought it work CREATE INDEX dborders_extract_month_idx on dborders EXTRACT(MONTH from (date)); ERROR: parser: parse error at or near "EXTRACT" at character 53 What am I missing from the CREATE statement? It seems in line with the documentation for CREATE INDEX: CREATE [ UNIQUE ] INDEX index_name ON table [ USING acc_method ] ( func_name (column [, ... ]) [ ops_name ] ) [ WHERE predicate ] and the example given in 8.5 Functional Indexes of the User's Guide: CREATE INDEX test1_lower_col1_idx ON test1 (lower(col1)); I know this isn't the solution to all my problems, but I am a bit curious why my CREATE INDEX isn't working. Further debugging attempts: Try a different fuction. CREATE INDEX dborders_date_trunc_idx on dborders (date_trunc('month',date)); This failed with ERROR: parser: parse error at or near "'month'" at character 62 I even renamed the 'date' column (type timestamp with timezone) to 'datetime' on the off chance that using and SQL key word (though not a PostgreSQL key word) was causing some problem. Same errors. What is that simple thing I'm overlooking? Any ideas what else I should check? (I did a search on 'functional index' in the list archives but kept getting timed out :( Michael
Michael Glaesmann <grzm@myrealbox.com> writes: > Searching for ways to improve performance, I tried to create a index on > the extract function, but for some reason I couldn't get it to work. > Following the documentation for CREATE INDEX and EXTRACT, I tried > CREATE INDEX dborders_extract_month_idx ON dborders (EXTRACT(MONTH from > date)); > ERROR: parser: parse error at or near "(" at character 61 You can't do that in pre-7.4 releases; the syntax of a functional index can only be "ON table (func(col1,col2,...))" --- that is, a simple, standard-notation function applied to one or more columns of the table. So to do this, you'd need to create an intermediate function along the lines of "month_trunc(date)"; and you'd have to use it in your queries as well as in the index definition. 7.4 is more flexible though --- it will take the above as long as you put an extra set of parentheses in there... regards, tom lane
Re: Functional index problems. (Was: Many joins: monthly summaries S-L--O--W)
From
Josh Berkus
Date:
Michael, > What is that simple thing I'm overlooking? Any ideas what else I should > check? (I did a search on 'functional index' in the list archives but > kept getting timed out :( Oh, sorry. There's an implementation issue with funcional indexes, where they can't take parameters other than column names. So you need to do: CREATE FUNCTION get_month ( TIMESTAMPTZ ) RETURNS INTEGER AS ' SELECT EXTRACT(MONTH from $1); ' LANGUAGE sql IMMUTABLE STRICT; Then do CREATE INDEX dborders_date_trunc_idx on dborders (get_month(date)); -- -Josh Berkus Aglio Database Solutions San Francisco
Re: Functional index problems. (Was: Many joins: monthly summaries S-L--O--W)
From
Michael Glaesmann
Date:
On Thursday, Oct 23, 2003, at 02:44 Asia/Tokyo, Tom Lane wrote: > Michael Glaesmann <grzm@myrealbox.com> writes: >> CREATE INDEX dborders_extract_month_idx ON dborders (EXTRACT(MONTH >> from >> date)); >> ERROR: parser: parse error at or near "(" at character 61 > > You can't do that in pre-7.4 releases; the syntax of a functional index > can only be "ON table (func(col1,col2,...))" --- that is, a simple, > standard-notation function applied to one or more columns of the table. > So to do this, you'd need to create an intermediate function along > the lines of "month_trunc(date)"; and you'd have to use it in your > queries as well as in the index definition. On Thursday, Oct 23, 2003, at 02:53 Asia/Tokyo, Josh Berkus wrote: > Oh, sorry. There's an implementation issue with funcional indexes, > where they > can't take parameters other than column names. So you need to do: > > CREATE FUNCTION get_month ( > TIMESTAMPTZ ) RETURNS INTEGER AS > ' SELECT EXTRACT(MONTH from $1); ' > LANGUAGE sql IMMUTABLE STRICT; Thanks, Tom and Josh! Added a type cast of the extract (which returns a double precision) and it's all good. Tom commented: > 7.4 is more flexible though --- it will take the above as long as you > put an extra set of parentheses in there... I took a gander at the documentation for 7.4beta. I can tell it's been reorganized. There's no longer a specific section on functional indexes that I can see, though there is mention of it in the SQL CREATE INDEX entry <http://developer.postgresql.org/docs/postgres/sql-createindex.html> The 7.3.2 documents I have say that there cannot be multicolumn functional indexes, though there's no mention of this in the new documentation. Does this mean this proscription has been lifted? Thanks again for your help! Michael
Michael Glaesmann <grzm@myrealbox.com> writes: > On Thursday, Oct 23, 2003, at 02:44 Asia/Tokyo, Tom Lane wrote: >> 7.4 is more flexible though --- it will take the above as long as you >> put an extra set of parentheses in there... > I took a gander at the documentation for 7.4beta. I can tell it's been > reorganized. There's no longer a specific section on functional indexes > that I can see, though there is mention of it in the SQL CREATE INDEX > entry > <http://developer.postgresql.org/docs/postgres/sql-createindex.html> > The 7.3.2 documents I have say that there cannot be multicolumn > functional indexes, though there's no mention of this in the new > documentation. Does this mean this proscription has been lifted? Yes. Any column of an index can now be an expression; the former functional-index capability is now just a special case of "expressional indexes". For syntactic reasons we had to require an extra pair of parens around expressions in the CREATE INDEX statement --- although it proved possible to not require these when the expression looks like a standard-syntax function call, thus providing backwards compatibility with the old functional-index syntax. See http://developer.postgresql.org/docs/postgres/indexes-expressional.html as well as the CREATE INDEX reference page. I'm not particularly happy with the phrase "expressional index", and would like to think of something else to use before the 7.4 docs go final. Any ideas? regards, tom lane
Re: Functional index problems. (Was: Many joins: monthly summaries S-L--O--W)
From
Josh Berkus
Date:
Michael, > <http://developer.postgresql.org/docs/postgres/sql-createindex.html> > The 7.3.2 documents I have say that there cannot be multicolumn > functional indexes, though there's no mention of this in the new > documentation. Does this mean this proscription has been lifted? Yes. -- -Josh Berkus Aglio Database Solutions San Francisco
Re: Functional index problems. (Was: Many joins: monthly summaries S-L--O--W)
From
Michael Glaesmann
Date:
On Thursday, Oct 23, 2003, at 04:17 Asia/Tokyo, Tom Lane wrote: <excerpt>I'm not particularly happy with the phrase "expressional index", and would like to think of something else to use before the 7.4 docs go final. Any ideas? </excerpt> Though it might be nice to be creative and name it something like "Bill", for a defined language like SQL I think it's best to work within the framework already used to describe the language. Looking in the PostgreSQL docs (which I'm sure you've done), I find this definition of "value expression" <fontfamily><param>Lucida Grande</param><bigger>A value expression is one of the following: - A constant or literal value; see <color><param>0000,0000,6666</param>Section 1.1.2 </color>. - A column reference. - A positional parameter reference, in the body of a function declaration. - An operator invocation. - A function call. - An aggregate expression. - A type cast. - A scalar subquery. - Another value expression in parentheses, useful to group subexpressions and override precedence. Preceding this is a paragraph mentioning that value expressions are also called "scalar expressions" or just "expressions". I was surprised to find column references are included in this list, but thinking about it, a column has a scalar value, which is what any function or operator returns as well. According to the above description of "value expression", a normal column index is just a special case of the more general "expression" index. (Actually, I'm a bit confused with the definition of value expression. Does column reference here mean value in the same row (and same table)? That's the only way I can see getting a scalar value. If it means reference in the constraint meaning, it's a set of allowed values, rather than a specific one, which definitely wouldn't work for an index. Perhaps the use of the word "reference" here is inappropriate, though I haven't looked through the rest of the documentation to see how the term "column reference" is used. My bad, I know. I'll definitely accept criticism for not doing my homework thoroughly.) To me, an operator is just a special class of function that doesn't use the more general func(arg [, arg]) construction. 2 + 2 could just as easily be add(2,2) — and maybe it is already defined this way as well. I haven't checked. However, I can see that it might be useful to distinguish between functions and column references. One term I thought about was "derived index". Another was "evaluated index". But both of these terms seem to introduce unnecessary language. My final thought would be to continue using "functional index", noting that operators are a special class of function. Out of the above description of "value expression", I believe only functions and operators are allowed as expressions in the CREATE INDEX syntax, correct? (Besides referring to a column.) What do you think? Michael</bigger></fontfamily>On Thursday, Oct 23, 2003, at 04:17 Asia/Tokyo, Tom Lane wrote: > I'm not particularly happy with the phrase "expressional index", and > would like to think of something else to use before the 7.4 docs go > final. Any ideas? Though it might be nice to be creative and name it something like "Bill", for a defined language like SQL I think it's best to work within the framework already used to describe the language. Looking in the PostgreSQL docs (which I'm sure you've done), I find this definition of "value expression" A value expression is one of the following: - A constant or literal value; see Section 1.1.2 . - A column reference. - A positional parameter reference, in the body of a function declaration. - An operator invocation. - A function call. - An aggregate expression. - A type cast. - A scalar subquery. - Another value expression in parentheses, useful to group subexpressions and override precedence. Preceding this is a paragraph mentioning that value expressions are also called "scalar expressions" or just "expressions". I was surprised to find column references are included in this list, but thinking about it, a column has a scalar value, which is what any function or operator returns as well. According to the above description of "value expression", a normal column index is just a special case of the more general "expression" index. (Actually, I'm a bit confused with the definition of value expression. Does column reference here mean value in the same row (and same table)? That's the only way I can see getting a scalar value. If it means reference in the constraint meaning, it's a set of allowed values, rather than a specific one, which definitely wouldn't work for an index. Perhaps the use of the word "reference" here is inappropriate, though I haven't looked through the rest of the documentation to see how the term "column reference" is used. My bad, I know. I'll definitely accept criticism for not doing my homework thoroughly.) To me, an operator is just a special class of function that doesn't use the more general func(arg [, arg]) construction. 2 + 2 could just as easily be add(2,2) — and maybe it is already defined this way as well. I haven't checked. However, I can see that it might be useful to distinguish between functions and column references. One term I thought about was "derived index". Another was "evaluated index". But both of these terms seem to introduce unnecessary language. My final thought would be to continue using "functional index", noting that operators are a special class of function. Out of the above description of "value expression", I believe only functions and operators are allowed as expressions in the CREATE INDEX syntax, correct? (Besides referring to a column.) What do you think? Michael
Michael Glaesmann <grzm@myrealbox.com> writes: > My final thought would be to continue using "functional index", noting > that operators are a special class of function. Out of the above > description of "value expression", I believe only functions and > operators are allowed as expressions in the CREATE INDEX syntax, > correct? (Besides referring to a column.) Actually, you can use any scalar-valued expression that does not contain a sub-SELECT or aggregate function. (Aggregates would be meaningless --- what are you aggregating over? The prohibition of sub-SELECTs is partly because we don't do planning for index expressions, which could be fixed if anyone were sufficiently motivated, and partly because it seems quite unlikely that the result of a SELECT could reasonably be considered immutable. We do *not* want to get into updating indexes for reasons other than updates of the indexed row.) I see your point that "functional index" is still a valid description, but I'm inclined to bow to Peter's position that it has another meaning that could distract people. Presently I'm going to go with "expression index" unless someone can provide a better choice. regards, tom lane