Thread: SEVEN cross joins?!?!?
I have a table with only 434 rows in it. Two important columns are "itemid" and "locn". Each item must be in one of seven locations. We need to create a "combo" by selecting one item from each of the seven locations, then determine which combo is the "best" according to our analysis (see below).

A subselect for items in a location looks something like:

    (SELECT * FROM suit_item WHERE locn='Head' AND username='Walter'
     ORDER BY itemid LIMIT 10) AS Head

One subselect for each location; cross join them all and the query generates 10,000,000 combinations! Without the "LIMIT 10", there are 78 * 37 * 91 * 81 * 99 * 47 * 1 = 98,981,901,018 results returned for username='Walter' (the only user at the moment). That volume is causing problems for my systems! The "ORDER BY itemid" was added only so that the same 10 items were processed on the different computer systems I tested this query on. There is only one item for the 7th locn in the database at the moment.

Every item has three key properties: val1, val2 and val3. For each combo, we calculate:

    (Head.val1 + Arm.val1 + ... + Leg.val1) AS calc1
    (Head.val2 + Arm.val2 + ... + Leg.val2) AS calc2
    (Head.val3 + Arm.val3 + ... + Leg.val3) AS calc3

Each calculation has a pseudo "max" value coded so that values above this max are considered equal:

    CASE WHEN calc1 > 70 THEN 70 ELSE calc1 END AS ordcalc1
    CASE WHEN calc2 > 15 THEN 15 ELSE calc2 END AS ordcalc2
    CASE WHEN calc3 > 60 THEN 60 ELSE calc3 END AS ordcalc3

Then I use:

    ORDER BY ordcalc1 DESC, ordcalc2 DESC, ordcalc3 DESC

When I activated a couple of my brain cells, I realised that adding "WHERE ordcalc1 >= 70 AND ordcalc2 >= 15 AND ordcalc3 >= 60" after the cross joins might help things out a bit. The 10,000,000 results were reduced significantly (8K - 30K with different samples). Because an output alias like "ordcalc1" cannot be used in a WHERE clause, the entire expression had to be repeated there. I used PHP to generate the query from pieces so that I could avoid lots of repetition in my code (the repetition is still there in the final query, which is about 6K when assembled).

After that big introduction, I have a few questions:

1) Did I approach the problem incorrectly? Is there another way to approach this query so that fewer combos are analysed?

2) Are there any optimisations that could improve query speed? Since the table is so small, I guessed that indexes wouldn't help. I created an index on (username, itemid), but it doesn't get used. Output of EXPLAIN ANALYSE can be found here: http://www.execulink.com/~fbax/JOINS/

3) When run on P2 and P4 systems, I would expect to see a huge improvement in the time taken to process the query, but I don't (only 35-40% better)?

    i   = number of items in LIMIT of subselect
    rc  = raw record count
    rcw = record count with "limits" in WHERE clause
    p2  = seconds for query to run on P2-400 (pg=7.4.3, ram=32M)
    p4  = seconds for query to run on P4-2.8G (pg=7.3.5, ram=1G)

    i=10 - rc=1,000,000 rcw=27,086 p2=81  p4=49
    i=11 - rc=1,771,561 rcw=41,121 p2=141 p4=86
    i=12 - rc=2,985,984 rcw=56,425 p2=216 p4=142
    i=13 - rc=4,826,809 rcw=81,527 p2=??? p4=228

On the P2 system, the i=13 query returns an empty page with no errors on the server. On the P4 system, i=15 results in:

    PostgreSQL Error: 1 (ERROR: tuplestore: write failed)

I suppose this is a temp file - is it created in $DATA? OpenBSD has several partitions, so I'll need to know which one is too small.
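For reference, here is a rough sketch of the kind of generator Frank describes. His was PHP; this one is Python, and anything not in his post (the actual locn values, the exact layout of his 6K query) is guessed. Since ordcalcN is effectively min(calcN, capN), the filter "ordcalcN >= capN" is equivalent to testing the raw sum against the cap, which is what the generated WHERE clause does:

    # Sketch of a query generator along the lines Frank describes (his was PHP).
    # Table and column names come from his post; the locn values are invented.
    locations = ["Head", "Arm", "Torso", "Hand", "Waist", "Leg", "Foot"]
    calcs = [("ordcalc1", "val1", 70), ("ordcalc2", "val2", 15), ("ordcalc3", "val3", 60)]

    def sum_expr(col):
        # The raw sum is repeated verbatim in SELECT and WHERE, because an
        # output alias such as ordcalc1 cannot be referenced in WHERE.
        return " + ".join(f"{locn}.{col}" for locn in locations)

    subselects = " CROSS JOIN ".join(
        f"(SELECT * FROM suit_item WHERE locn='{locn}' AND username='Walter' "
        f"ORDER BY itemid LIMIT 10) AS {locn}"
        for locn in locations
    )
    select_list = ", ".join(
        f"CASE WHEN {sum_expr(col)} > {cap} THEN {cap} ELSE {sum_expr(col)} END AS {alias}"
        for alias, col, cap in calcs
    )
    where_clause = " AND ".join(f"{sum_expr(col)} >= {cap}" for _, col, cap in calcs)

    query = (
        f"SELECT {select_list} FROM {subselects} WHERE {where_clause} "
        f"ORDER BY ordcalc1 DESC, ordcalc2 DESC, ordcalc3 DESC LIMIT 100;"
    )
    print(query)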
Frank Bax wrote:
> I have a table with only 434 rows in it. Two important columns are
> "itemid" and "locn". Each item must be in one of seven locations. We
> need to create a "combo" by selecting one item from each of the seven
> locations, then determine which combo is the "best" according to our
> analysis (see below).
> [snip]

Frank - it might just be me, but I've read your email twice and despite all the information I still don't have any idea what you are trying to do.

Are you saying that you WANT to generate a cross-join, score the millions of results and then pick the best 10? It's doing what you want, but you'd like it to be faster.

Or are you saying that you'd like to avoid the explosion in rows altogether?

In either case - I don't suppose you could provide a real example of the query, so we can see exactly what you're trying to do.

--
Richard Huxton
Archonet Ltd
At 08:29 AM 10/11/05, Richard Huxton wrote:
>Frank Bax wrote:
>>[snip]
>
>Are you saying that you WANT to generate a cross-join, score the millions
>of results and then pick the best 10? It's doing what you want, but you'd
>like it to be faster.
>
>Or are you saying that you'd like to avoid the explosion in rows altogether?
>
>In either case - I don't suppose you could provide a real example of the
>query, so we can see exactly what you're trying to do.

There is no "best 10". I currently limit each subselect to 10 items so that the query will actually run without crashing. I would like to remove the "ORDER BY itemid LIMIT 10" mentioned above. At the end of the query I have a "LIMIT 100" clause which will stay and produces a list of the "best 100" combos.

Either of your solutions would be acceptable, since avoiding the "explosion" would also make the query faster. Current calculations indicate that running the query without "LIMIT 10" in the subselects would take years to process.

The query is filled with expressions. I'm not sure I can shorten it without making typos or deleting something important, so I've made it available on the web here: http://www.execulink.com/~fbax/JOINS/ - results of "explain analyse" are also there.
Aha! A gamer... playing with armor and hit points and things....

-----Original Message-----
From: Frank Bax
Sent: Tuesday, October 11, 2005 1:06 PM
Subject: Re: [SQL] SEVEN cross joins?!?!?
[snip]
Frank Bax wrote:
>> Are you saying that you WANT to generate a cross-join, score the
>> millions of results and then pick the best 10? [snip]
>
> There is no "best 10". I currently limit each subselect to 10 items so
> that the query will actually run without crashing. I would like to remove
> the "ORDER BY itemid LIMIT 10" mentioned above. At the end of the query
> I have a "LIMIT 100" clause which will stay and produces a list of the
> "best 100" combos.
>
> Either of your solutions would be acceptable, since avoiding the
> "explosion" would also make the query faster. Current calculations
> indicate that running the query without "LIMIT 10" in the subselects
> would take years to process.

OK - so at the heart of the problem is the fact that you want to search a space with 100 billion possible states. There are three main options:

1. Brute force and patience - simple, and guaranteed to produce the "best" answers. You can use cursors/multiple queries to manage the workload. The disadvantage is that it's probably slower than you'd like.

2. Smarter algorithms - possibly something genetic to work towards local maxima. Or, if you'd like to keep it simple, just split your 7 locations into 2,2,3 and solve for each separately.

3. Statistical - examine a subset of possible states and accept you'll only be able to say "almost best" to 99% confidence or similar.

I'd be tempted by #2 - there are probably some combinations you can rule out (one example of such ruling-out is sketched below), which combined with a split/recombine should reduce the number of states to query.

That any help?

--
Richard Huxton
Archonet Ltd
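One concrete, safe kind of ruling-out for this problem is dominance pruning: an item that another item in the same location matches or beats on all three values can never be needed for a best combo, because swapping in the dominator never lowers any calc. A minimal sketch, assuming items are loaded as (itemid, val1, val2, val3) tuples per location; none of this code is from the thread itself:

    # Dominance pruning: drop any item that another item in the same location
    # matches or beats on all three values (and strictly beats on at least one).
    # Replacing a dominated item with its dominator never lowers calc1/calc2/calc3,
    # so dominated items cannot be required for a "best" combo.
    def prune_dominated(items):
        """items: list of (itemid, val1, val2, val3) tuples for one location."""
        kept = []
        for a in items:
            dominated = any(
                b[1] >= a[1] and b[2] >= a[2] and b[3] >= a[3]
                and (b[1] > a[1] or b[2] > a[2] or b[3] > a[3])
                for b in items
            )
            if not dominated:
                kept.append(a)
        return kept

    # Example with hypothetical values: item 2 is dominated by item 1,
    # and item 1 is dominated by item 3, so only item 3 survives.
    head = [(1, 30, 5, 20), (2, 25, 4, 15), (3, 30, 5, 25)]
    print(prune_dominated(head))   # [(3, 30, 5, 25)]

The check is O(n^2) per location, which is nothing for a few hundred rows, and every item it removes shrinks the cross-join product multiplicatively.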
Richard Huxton wrote:
> Frank Bax wrote:
>> [snip]
>
> OK - so at the heart of the problem is the fact that you want to search
> a space with 100 billion possible states. There are three main options:
>
> 1. Brute force and patience [snip]
> 2. Smarter algorithms - possibly something genetic to work towards local
> maxima. Or, if you'd like to keep it simple, just split your 7 locations
> into 2,2,3 and solve for each separately.
> 3. Statistical [snip]
>
> I'd be tempted by #2 - there are probably some combinations you can rule
> out, which combined with a split/recombine should reduce the number of
> states to query.

I'll second this recommendation. The OP is trying to drive a nail with a screwdriver. He needs a program, not a query.

> That any help?

--
Daryl Richter
(daryl (at)(brandywine dot com))
At 09:00 AM 10/12/05, Daryl Richter wrote:
>Richard Huxton wrote:
>>[snip]
>>
>>I'd be tempted by #2 - there are probably some combinations you can rule
>>out, which combined with a split/recombine should reduce the number of
>>states to query.
>
>I'll second this recommendation. The OP is trying to drive a nail with a
>screwdriver. He needs a program, not a query.

Richard, you've summed it up nicely.

Splitting locations into subsets (like 2,2,3) doesn't work because it is possible that low values in one location can be offset by high values in another location and still result in an excellent combo.

The good news is these suggestions got me thinking outside the box. I think I can program a modified brute-force that bypasses large numbers of combos early. It might still be too large/slow, so I'd be interested in finding more info about these "smarter algorithms" in option 2. Where do I look?

Greg: my son's the gamer; I'm just trying to help him out.
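The "bypass large numbers of combos early" idea is essentially branch and bound: walk the locations depth-first, and abandon a partial combo as soon as even the best possible remaining items cannot reach all three caps (70/15/60). A rough sketch with invented sample data - Frank's eventual version was Perl; this Python is only an illustration, and it implements just the cap-filtered variant of his query (the "WHERE ordcalc1 >= 70 ..." form):

    # Branch-and-bound sketch of the "bypass combos early" idea.
    # locations: one list per locn, each item a tuple (itemid, val1, val2, val3).
    # Hypothetical sample data; the real data would come from suit_item.
    locations = [
        [(1, 30, 5, 20), (2, 25, 8, 15)],      # Head
        [(3, 20, 4, 25), (4, 15, 6, 30)],      # Arm
        [(5, 25, 3, 25)],                      # ... and so on for 7 locations
    ]
    CAPS = (70, 15, 60)   # the pseudo-max values for calc1..calc3

    # suffix_max[d][k] = the most that locations d..end can still add to calc k.
    suffix_max = [(0, 0, 0)]
    for loc in reversed(locations):
        best = tuple(max(item[k + 1] for item in loc) for k in range(3))
        prev = suffix_max[0]
        suffix_max.insert(0, tuple(best[k] + prev[k] for k in range(3)))

    results = []

    def search(depth, sums, picked):
        # Prune: even picking the best remaining item in every location,
        # at least one calc cannot reach its cap, so no descendant of this
        # partial combo can survive the WHERE clause.
        if any(sums[k] + suffix_max[depth][k] < CAPS[k] for k in range(3)):
            return
        if depth == len(locations):
            results.append((picked, sums))
            return
        for item in locations[depth]:
            search(depth + 1,
                   tuple(sums[k] + item[k + 1] for k in range(3)),
                   picked + (item[0],))

    search(0, (0, 0, 0), ())
    print(len(results), "combos reach all three caps")

The pruning test is cheap (three comparisons per node) and cuts off whole subtrees at once, which is where the speedup over pure brute force comes from.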
> Splitting locations into subsets (like 2,2,3) doesn't work
> because it is possible that low values in one location can be offset by
> high values in another location, and still result in an excellent combo.
>
> The good news is these suggestions got me thinking outside
> the box. I think I can program a modified brute-force that bypasses
> large numbers of combos early. It might still be too large/slow, so I'd be
> interested in finding more info about these "smarter algorithms" in option 2.

Could you define a view with a calculated field, say 2 * a1 + 6 * a2 + 3 * a3, and then use this score to rank the individual rows? I haven't looked at the exact nature of the problem, but if you're multiplying a table by itself, it may be worth redefining the problem in terms of a simple ranking algorithm: define a column that calculates the score, then sort by that column.

Best regards,
Ben Stewart

--
Robert Bosch (Australia) Pty. Ltd.
Engineering Quality Services, Student Software Engineer (RBAU/EQS4)
Locked Bag 66 - Clayton South, VIC 3169 - AUSTRALIA
Tel: +61 3 9541-7002 Fax: +61 3 9541-7700
mailto:ben.stewart@au.bosch.com
http://www.bosch.com.au/
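Ben's suggestion amounts to collapsing the three values into one weighted score and ranking rows by it; the weights 2, 6, 3 are only his example numbers. A toy sketch with made-up rows:

    # Rank items by a single weighted score instead of three separate sums.
    # Weights (2, 6, 3) are just Ben's example numbers.
    items = [
        ("helm",   30, 5, 20),   # (itemid, a1, a2, a3) - hypothetical rows
        ("coif",   25, 8, 15),
        ("barbut", 28, 4, 25),
    ]

    def score(item):
        _, a1, a2, a3 = item
        return 2 * a1 + 6 * a2 + 3 * a3

    for item in sorted(items, key=score, reverse=True):
        print(item[0], score(item))

Note that this replaces Frank's capped, three-key lexicographic ordering with a single linear score, so it changes what "best" means; it trades exactness of the ranking for a problem that is much easier to index and sort.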
Frank Bax wrote:
> At 09:00 AM 10/12/05, Daryl Richter wrote:
> [snip]
>
> Richard, you've summed it up nicely.
>
> Splitting locations into subsets (like 2,2,3) doesn't work because it is
> possible that low values in one location can be offset by high values in
> another location, and still result in an excellent combo.
>
> The good news is these suggestions got me thinking outside the box. I
> think I can program a modified brute-force that bypasses large numbers
> of combos early. It might still be too large/slow, so I'd be interested
> in finding more info about these "smarter algorithms" in option 2.
> Where do I look?

If you're mathematically inclined, I would first look at Lagrangian Relaxation; it may be appropriate for your problem:

http://www.2112fx.com/lagrange.html

> Greg: my son's the gamer; I'm just trying to help him out.

--
Daryl Richter
Director of Technology
(( Brandywine Asset Management )
 ( "Expanding the Science of Global Investing" )
 ( http://www.brandywine.com ))
At 09:04 AM 10/13/05, Daryl Richter wrote:
>Frank Bax wrote:
>
>[snip]
>
>If you're mathematically inclined, I would first look at Lagrangian
>Relaxation; it may be appropriate for your problem:
>
>http://www.2112fx.com/lagrange.html

Thanks, but that's a little too complex for me to turn into code!

I did rewrite my code from a simple cross-join SQL query generated in PHP to a custom search in Perl. I sucked the subselects into arrays and then looked at all possible combinations. For those that forgot/missed the background: my table now has 514 rows. Using subselects, this table is split into 7 subtables. These seven subtables are cross joined with each other to produce 770 billion rows that need to be searched (to assemble a 'made-to-order' suit of armour).

By using more intelligent code (and not simple brute force), I was able to analyse a complete set of 770 billion states in just under 70 hours on a P4-2.8GHz system, which is fast enough for today. A faster CPU will help, since the process does no I/O except at the beginning and end of the script.

I realised that if I am ever able to figure out coding for multi-processor or multi-system setups (even remote, like seti@home), I can exploit either or both for this problem by splitting the work on the items in the first subtable into 50-60 subtasks, then merging the results from each of those subtasks. This might become a requirement if the underlying table grows to be quite large.

Thanks for pointing me in the right direction; it's been an interesting week.

Frank
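The split Frank describes parallelises naturally: fix the item chosen for the first location, search the remaining locations independently, then merge the per-task results. A minimal sketch using Python's multiprocessing; all data and names here are made up, and each subtask is shown as plain brute force for brevity (in practice each would run the pruned search):

    # Sketch of the split Frank describes: one subtask per item in the first
    # location; each searches the remaining locations independently, and the
    # per-task results are merged at the end.
    from multiprocessing import Pool

    first_locn = [(1, 30, 5, 20), (2, 25, 8, 15)]   # items for location 1
    rest = [
        [(3, 20, 4, 25), (4, 15, 6, 30)],           # locations 2..7 would
        [(5, 25, 3, 25)],                           # follow the same shape
    ]
    CAPS = (70, 15, 60)

    def subtask(first_item):
        """Search the remaining locations for one fixed first item."""
        combos = [((first_item[0],), first_item[1:])]
        for loc in rest:
            combos = [
                (ids + (item[0],), tuple(s + v for s, v in zip(sums, item[1:])))
                for ids, sums in combos
                for item in loc
            ]
        # Keep only combos that reach all three caps (Frank's WHERE clause).
        return [c for c in combos if all(s >= cap for s, cap in zip(c[1], CAPS))]

    if __name__ == "__main__":
        with Pool() as pool:
            merged = [c for part in pool.map(subtask, first_locn) for c in part]
        print(len(merged), "combos reach all three caps")

Because the subtasks share nothing until the final merge, the same split works across machines as well as across cores, which is exactly why the seti@home-style distribution Frank mentions would fit this problem.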