Thread: SEVEN cross joins?!?!?

SEVEN cross joins?!?!?

From
Frank Bax
Date:
I have a table with only 434 rows in it.  Two important columns are 
"itemid" and "locn".  Each item must be in one of seven locations.  We need 
to create a "combo" by selecting one item from each of seven locations; 
then determine which "combo" is the "best" according to our analysis (see 
below).

A subselect for items in a location looks something like:
(select * from suit_item where locn='Head' AND username='Walter' ORDER BY 
itemid LIMIT 10) as Head

One subselect for each location, cross join them all and the query 
generates 10,000,000 combinations!  Without the "LIMIT 10",  there are  78 
* 37 * 91 * 81 * 99 * 47 * 1 = 98,981,901,018 results returned for 
username='Walter' (the only user at the moment).  The large volume is 
causing a problem for my systems!  The "ORDER BY itemid" was added only so 
that same 10 items were processed on different computer systems I tested 
this query on.  Only one item for 7th locn in the database at the moment.

Every item has three key properties val1, val2 and val3.  For each combo, 
we calculate:(Head.val1 + Arm.val1 + ... Leg.val1) AS Calc1(Head.val2 + Arm.val2 + ... Leg.val2) AS Calc2(Head.val3 +
Arm.val3+ ... Leg.val3) AS Calc3
 
Each calculation has a pseudo "max" value coded so that values above this 
"max" are considered equal:CASE WHEN calc1 > 70 then 70 else calc1 END as ordcalc1CASE WHEN calc2 > 15 then 15 else
calc2END as ordcalc2CASE WHEN calc3 > 60 then 60 else calc3 END as ordcalc3
 
Then I use:ORDER BY ordcalc1 DESC, ordcalc2 DESC, ordcalc3 DESC

When I activated a couple of my brain cells, I realised that adding "WHERE 
ordcalc1 >= 70 AND ordcalc2 >= 15 AND ordcalc3 >= 60" after the cross joins 
might help things out a bit.  The 10,000,000 results was reduced 
significantly (8K - 30K with different samples).  Because the "ordcalc" 
cannot be used in a WHERE clause, the entire expression was repeated.

I used php to generate the query from pieces so that I could avoid lots of 
repetition in coding (but still there in final query).  The query itself is 
about 6K when assembled.

After that big introduction, I have a couple of questions:

1) Did I approach the problem incorrectly?  Is there another way to 
approach this query so that fewer combos are analysed?

2) Are there any optimisations that could improve query speed?  Since the 
table is so small, I guessed that indexes wouldn't help.  I created an 
index on (username, itemid), but it doesn't get used.  Output of EXPLAIN 
ANALYSE found here:http://www.execulink.com/~fbax/JOINS/

3) When run on P2 and P4 systems, I would expect to see huge improvement in 
time taken to process query, but I don't (only 35-40% better)?

i = number of items in LIMIT of subselect
rc = raw record count
rcw = record count with "limits" in WHERE clause
p2 = seconds for query to run on P2-400M pg=7.4.3 ram=32M
p4 = seconds for query to run on P4-2.8G pg=7.3.5 ram=1G

i=10 - rc=1,000,000 rcw=27,086 p2=81  p4=49
i=11 - rc=1,771,561 rcw=41,121 p2=141 p4=86
i=12 - rc=2,985,984 rcw=56,425 p2=216 p4=142
i=13 - rc=4,826,809 rcw=81,527 p2=??? p4=228

On P2 system i=13 query returns empty page with no errors on server.

On P4 system i=15 results in:
PostgreSQL Error: 1 (ERROR: tuplestore: write failed)

I suppose this is a temp file - is it created in $DATA?  OpenBSD has 
several partitions, so I'll need to know which one is too small. 



Re: SEVEN cross joins?!?!?

From
Richard Huxton
Date:
Frank Bax wrote:
> I have a table with only 434 rows in it.  Two important columns are 
> "itemid" and "locn".  Each item must be in one of seven locations.  We 
> need to create a "combo" by selecting one item from each of seven 
> locations; then determine which "combo" is the "best" according to our 
> analysis (see below).
> 
> A subselect for items in a location looks something like:
> (select * from suit_item where locn='Head' AND username='Walter' ORDER 
> BY itemid LIMIT 10) as Head
> 
> One subselect for each location, cross join them all and the query 
> generates 10,000,000 combinations!  Without the "LIMIT 10",  there are  
> 78 * 37 * 91 * 81 * 99 * 47 * 1 = 98,981,901,018 results returned for 
> username='Walter' (the only user at the moment).  The large volume is 
> causing a problem for my systems!  The "ORDER BY itemid" was added only 
> so that same 10 items were processed on different computer systems I 
> tested this query on.  Only one item for 7th locn in the database at the 
> moment.

Frank - it might just be me, but I've read your email twice and despite 
all the information I still don't have any idea what you are trying to do.

Are you saying that you WANT to generate a cross-join, score the 
millions of results and then pick the best 10? It's doing what you want, 
but you'd like it to be faster.

Or are you saying that you'd like to avoid the explosion in rows altogether?

In either case - I don't suppose you could provide a real example of the 
query, so we can see exactly what you're trying to do.
--  Richard Huxton  Archonet Ltd


Re: SEVEN cross joins?!?!?

From
Frank Bax
Date:
At 08:29 AM 10/11/05, Richard Huxton wrote:

>Frank Bax wrote:
>>I have a table with only 434 rows in it.  Two important columns are 
>>"itemid" and "locn".  Each item must be in one of seven locations.  We 
>>need to create a "combo" by selecting one item from each of seven 
>>locations; then determine which "combo" is the "best" according to our 
>>analysis (see below).
>>A subselect for items in a location looks something like:
>>(select * from suit_item where locn='Head' AND username='Walter' ORDER BY 
>>itemid LIMIT 10) as Head
>>One subselect for each location, cross join them all and the query 
>>generates 10,000,000 combinations!  Without the "LIMIT 10",  there are
>>78 * 37 * 91 * 81 * 99 * 47 * 1 = 98,981,901,018 results returned for 
>>username='Walter' (the only user at the moment).  The large volume is 
>>causing a problem for my systems!  The "ORDER BY itemid" was added only 
>>so that same 10 items were processed on different computer systems I 
>>tested this query on.  Only one item for 7th locn in the database at the 
>>moment.
>
>Frank - it might just be me, but I've read your email twice and despite 
>all the information I still don't have any idea what you are trying to do.
>
>Are you saying that you WANT to generate a cross-join, score the millions 
>of results and then pick the best 10? It's doing what you want, but you'd 
>like it to be faster.
>
>Or are you saying that you'd like to avoid the explosion in rows altogether?
>
>In either case - I don't suppose you could provide a real example of the 
>query, so we can see exactly what you're trying to do.


There is no "best 10".  I currently limit each subselect to 10 items so 
that query will actually run without crashing.  I would like to remove the 
"ORDER BY itemid LIMIT 10" mentioned above.  At the end of the query I have 
a "LIMIT 100" clause which will stay and produces a list of "best 100" combos.

Either of your solutions would be acceptable; since avoiding the 
"explosion" would also make the query faster.  Current calculations 
indicate that running the query without "LIMIT 10" in subselect would take 
years to process.

The query is filled with expressions.  I'm not sure I can shorten it 
without making typos or deleting something important, so I'll make it 
available on web here:        http://www.execulink.com/~fbax/JOINS/
Results of "explain analyse" is also there. 



Re: SEVEN cross joins?!?!?

From
"Greg Patnude"
Date:
Aha ! A gamer... playing with armor and hit points and things.... 




-----Original Message-----
From: pgsql-sql-owner@postgresql.org [mailto:pgsql-sql-owner@postgresql.org]
On Behalf Of Frank Bax
Sent: Tuesday, October 11, 2005 1:06 PM
To: pgsql-sql@postgresql.org
Subject: Re: [SQL] SEVEN cross joins?!?!?

At 08:29 AM 10/11/05, Richard Huxton wrote:

>Frank Bax wrote:
>>I have a table with only 434 rows in it.  Two important columns are 
>>"itemid" and "locn".  Each item must be in one of seven locations.  We 
>>need to create a "combo" by selecting one item from each of seven 
>>locations; then determine which "combo" is the "best" according to our 
>>analysis (see below).
>>A subselect for items in a location looks something like:
>>(select * from suit_item where locn='Head' AND username='Walter' ORDER BY 
>>itemid LIMIT 10) as Head
>>One subselect for each location, cross join them all and the query 
>>generates 10,000,000 combinations!  Without the "LIMIT 10",  there are
>>78 * 37 * 91 * 81 * 99 * 47 * 1 = 98,981,901,018 results returned for 
>>username='Walter' (the only user at the moment).  The large volume is 
>>causing a problem for my systems!  The "ORDER BY itemid" was added only 
>>so that same 10 items were processed on different computer systems I 
>>tested this query on.  Only one item for 7th locn in the database at the 
>>moment.
>
>Frank - it might just be me, but I've read your email twice and despite 
>all the information I still don't have any idea what you are trying to do.
>
>Are you saying that you WANT to generate a cross-join, score the millions 
>of results and then pick the best 10? It's doing what you want, but you'd 
>like it to be faster.
>
>Or are you saying that you'd like to avoid the explosion in rows
altogether?
>
>In either case - I don't suppose you could provide a real example of the 
>query, so we can see exactly what you're trying to do.


There is no "best 10".  I currently limit each subselect to 10 items so 
that query will actually run without crashing.  I would like to remove the 
"ORDER BY itemid LIMIT 10" mentioned above.  At the end of the query I have 
a "LIMIT 100" clause which will stay and produces a list of "best 100"
combos.

Either of your solutions would be acceptable; since avoiding the 
"explosion" would also make the query faster.  Current calculations 
indicate that running the query without "LIMIT 10" in subselect would take 
years to process.

The query is filled with expressions.  I'm not sure I can shorten it 
without making typos or deleting something important, so I'll make it 
available on web here:        http://www.execulink.com/~fbax/JOINS/
Results of "explain analyse" is also there. 


---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to      choose an index scan if your joining column's
datatypesdo not      match
 


Re: SEVEN cross joins?!?!?

From
Richard Huxton
Date:
Frank Bax wrote:
>> Are you saying that you WANT to generate a cross-join, score the 
>> millions of results and then pick the best 10? It's doing what you 
>> want, but you'd like it to be faster.
>>
>> Or are you saying that you'd like to avoid the explosion in rows 
>> altogether?
>>
>> In either case - I don't suppose you could provide a real example of 
>> the query, so we can see exactly what you're trying to do.
> 
> 
> There is no "best 10".  I currently limit each subselect to 10 items so 
> that query will actually run without crashing.  I would like to remove 
> the "ORDER BY itemid LIMIT 10" mentioned above.  At the end of the query 
> I have a "LIMIT 100" clause which will stay and produces a list of "best 
> 100" combos.
> 
> Either of your solutions would be acceptable; since avoiding the 
> "explosion" would also make the query faster.  Current calculations 
> indicate that running the query without "LIMIT 10" in subselect would 
> take years to process.

OK - so at the heart of the problem is the fact that you want to search 
a space with 100 billion possible states. There are three main options

1. Brute force and patience - simple and is guaranteed to produce the 
"best" answers. You can use cursors/multiple queries to manage the 
workload. The disadvantage is that it's probably slower than you'd like.
2. Smarter algorithms - possibly something genetic to work towards local 
maxima. Or, if you'd like to keep it simple, just split your 7 locations 
into 2,2,3 and solve for each separately.
3. Statistical - examine a subset of possible states and accept you'll 
only be able to say "almost best" to 99% confidence or similar.

I'd be tempted by #2 - there are probably some combinations you can rule 
out, which combined with a split/recombine should reduce the number of 
states to query.

That any help?
--  Richard Huxton  Archonet Ltd


Re: SEVEN cross joins?!?!?

From
Daryl Richter
Date:
Richard Huxton wrote:
> Frank Bax wrote:
> 
>>> Are you saying that you WANT to generate a cross-join, score the 
>>> millions of results and then pick the best 10? It's doing what you 
>>> want, but you'd like it to be faster.
>>>
>>> Or are you saying that you'd like to avoid the explosion in rows 
>>> altogether?
>>>
>>> In either case - I don't suppose you could provide a real example of 
>>> the query, so we can see exactly what you're trying to do.
>>
>>
>>
>> There is no "best 10".  I currently limit each subselect to 10 items 
>> so that query will actually run without crashing.  I would like to 
>> remove the "ORDER BY itemid LIMIT 10" mentioned above.  At the end of 
>> the query I have a "LIMIT 100" clause which will stay and produces a 
>> list of "best 100" combos.
>>
>> Either of your solutions would be acceptable; since avoiding the 
>> "explosion" would also make the query faster.  Current calculations 
>> indicate that running the query without "LIMIT 10" in subselect would 
>> take years to process.
> 
> 
> OK - so at the heart of the problem is the fact that you want to search 
> a space with 100 billion possible states. There are three main options
> 
> 1. Brute force and patience - simple and is guaranteed to produce the 
> "best" answers. You can use cursors/multiple queries to manage the 
> workload. The disadvantage is that it's probably slower than you'd like.
> 2. Smarter algorithms - possibly something genetic to work towards local 
> maxima. Or, if you'd like to keep it simple, just split your 7 locations 
> into 2,2,3 and solve for each separately.
> 3. Statistical - examine a subset of possible states and accept you'll 
> only be able to say "almost best" to 99% confidence or similar.
> 
> I'd be tempted by #2 - there are probably some combinations you can rule 
> out, which combined with a split/recombine should reduce the number of 
> states to query.
> 

I'll second this recommendation.  The OP is trying to drive a nail with 
a screwdriver.  He needs a program, not a query.

> That any help?
> -- 
>   Richard Huxton
>   Archonet Ltd
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings


-- 
Daryl Richter
(daryl (at)(brandywine dot com))



Re: SEVEN cross joins?!?!?

From
Frank Bax
Date:
At 09:00 AM 10/12/05, Daryl Richter wrote:

>Richard Huxton wrote:
>>Frank Bax wrote:
>>
>>>>Are you saying that you WANT to generate a cross-join, score the 
>>>>millions of results and then pick the best 10? It's doing what you 
>>>>want, but you'd like it to be faster.
>>>>
>>>>Or are you saying that you'd like to avoid the explosion in rows 
>>>>altogether?
>>>>
>>>>In either case - I don't suppose you could provide a real example of 
>>>>the query, so we can see exactly what you're trying to do.
>>>
>>>
>>>
>>>There is no "best 10".  I currently limit each subselect to 10 items so 
>>>that query will actually run without crashing.  I would like to remove 
>>>the "ORDER BY itemid LIMIT 10" mentioned above.  At the end of the query 
>>>I have a "LIMIT 100" clause which will stay and produces a list of "best 
>>>100" combos.
>>>
>>>Either of your solutions would be acceptable; since avoiding the 
>>>"explosion" would also make the query faster.  Current calculations 
>>>indicate that running the query without "LIMIT 10" in subselect would 
>>>take years to process.
>>
>>OK - so at the heart of the problem is the fact that you want to search a 
>>space with 100 billion possible states. There are three main options
>>1. Brute force and patience - simple and is guaranteed to produce the 
>>"best" answers. You can use cursors/multiple queries to manage the 
>>workload. The disadvantage is that it's probably slower than you'd like.
>>2. Smarter algorithms - possibly something genetic to work towards local 
>>maxima. Or, if you'd like to keep it simple, just split your 7 locations 
>>into 2,2,3 and solve for each separately.
>>3. Statistical - examine a subset of possible states and accept you'll 
>>only be able to say "almost best" to 99% confidence or similar.
>>I'd be tempted by #2 - there are probably some combinations you can rule 
>>out, which combined with a split/recombine should reduce the number of 
>>states to query.
>
>I'll second this recommendation.  The OP is trying to drive a nail with a 
>screwdriver.  He needs a program, not a query.


Richard, you've summed it up nicely.

Splitting locations into subsets (like 2,2,3) doesn't work because it is 
possible that low values in one location can be offset by high values in 
another location, and still result in an excellent combo.

The good news is these suggestions got me thinking outside the box.  I 
think I can program a modified brute-force that bypasses large numbers of 
combos early.  It might still be too large/slow, so I'd be interested in 
finding more info about these "smarter algorithms" in option 2.  Where do I 
look?

Greg: my son's the gamer; I'm just trying to help him out.



Re: SEVEN cross joins?!?!?

From
"Stewart Ben (RBAU/EQS4) *"
Date:
> Splitting locations into subsets (like 2,2,3) doesn't work
> because it is possible that low values in one location can be offset
by
> high values in another location, and still result in an excellent
combo.
>
> The good news is these suggestions got me thinking outside
> the box.  I think I can program a modified brute-force that bypasses
> large numbers of combos early.  It might still be too large/slow, so
I'd be
> interested in finding more info about these "smarter algorithms" in
option

Could you define an view with a calculated field, say, 2 * a1 + 6 * a2 +
3 * a3, and then use this aggregate to score the individual rows? I
haven't looked at the exact nature of the problem, but if you're
multiplying a table by itself, it may be worth redefining the problem in
terms of a simple ranking algorithm, define a column to calculate this,
then sort by that column.

Best regards,

Ben Stewart

--
Robert Bosch (Australia) Pty. Ltd.
Engineering Quality Services, Student Software Engineer (RBAU/EQS4)
Locked Bag 66 - Clayton South, VIC 3169 - AUSTRALIA
Tel: +61 3 9541-7002 Fax: +61 3 9541-7700
mailto:ben.stewart@au.bosch.com
http://www.bosch.com.au/


Re: SEVEN cross joins?!?!?

From
Daryl Richter
Date:
Frank Bax wrote:
> At 09:00 AM 10/12/05, Daryl Richter wrote:
> 
>> Richard Huxton wrote:
>>
>>> Frank Bax wrote:
>>>

[snip]

> 
> Richard, you've summed it up nicely.
> 
> Splitting locations into subsets (like 2,2,3) doesn't work because it is 
> possible that low values in one location can be offset by high values in 
> another location, and still result in an excellent combo.
> 
> The good news is these suggestions got me thinking outside the box.  I 
> think I can program a modified brute-force that bypasses large numbers 
> of combos early.  It might still be too large/slow, so I'd be interested 
> in finding more info about these "smarter algorithms" in option 2.  
> Where do I look?
>

If you're mathematically inclined, I would first look at using
Lagrangian Relexation, it may be appropriate for your problem:

http://www.2112fx.com/lagrange.html

> Greg: my son's the gamer; I'm just trying to help him out.
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend


-- 
Daryl Richter
Director of Technology

((         Brandywine Asset Management          ) ( "Expanding the Science of Global Investing"  ) (
http://www.brandywine.com          ))
 




Re: SEVEN cross joins?!?!?

From
Frank Bax
Date:
At 09:04 AM 10/13/05, Daryl Richter wrote:

>Frank Bax wrote:
>
>[snip]
>
>>Richard, you've summed it up nicely.
>>Splitting locations into subsets (like 2,2,3) doesn't work because it is 
>>possible that low values in one location can be offset by high values in 
>>another location, and still result in an excellent combo.
>>The good news is these suggestions got me thinking outside the box.  I 
>>think I can program a modified brute-force that bypasses large numbers of 
>>combos early.  It might still be too large/slow, so I'd be interested in 
>>finding more info about these "smarter algorithms" in option 2.
>>Where do I look?
>
>If you're mathematically inclined, I would first look at using
>Lagrangian Relexation, it may be appropriate for your problem:
>
>http://www.2112fx.com/lagrange.html


Thanks, but that's a little too complex for me to turn into code!

I did rewrite my code from a simple cross join SQL in PHP to custom 
searching in perl.  I sucked subselects into arrays and then looked at all 
possible combinations.

For those that forgot/missed the background, my table has 514 rows.  Using 
subselect, this table is split into 7 subtables.  These seven subtables are 
cross joined with each other to produce 770 billion rows that need to be 
searched (to assemble a 'made-to-order' suit of armour).

By using more intelligent code (and not simple brute-force), I was able to 
analyse a complete set of 770 billion states in just under 70 hours on a 
P4-2.8Ghz system, which is fast enough for today.  A faster cpu will help, 
since process does no i/o except at beginning and end of script. I realised 
that if I am ever able to figure out coding for multi-processor or systems 
(even remote like set@home), I can exploit either/both of these for this 
problem by slitting problem on items in first subtable into 50-60 subtasks, 
then merging results from each of those subtasks.  This might become a 
requirement if the underlying table grows to be quite large.

Thanks for pointing me in the right direction, it's been an interesting week.

Frank