Thread: Trouble running nested select - Join query

Trouble running nested select - Join query

From
RUSHI KAW
Date:
Hi,

I have been working with Postgresxc 1.1 on a cluster, each cluster compute node having 48GB RAM.
I have instantiated the coordinator on one of the compute nodes and have 4 datanodes (2+2) running on two other compute nodes. 
I have been trying to run the  following query which is a select on two large, same size tables - miami_2d(head int, tail int) and miami_directednetwork(head int, tail int) (both of them have 105400516 rows).
miami_2d has been replicated on all datanodes and miami_directednetwork has been hash distributed on the head column across all 4 datanodes.
I have indexes on both the tables.
The query is:
copy (select * from (select s1.head as h1,s1.tail as t1 ,s2.tail as neighbor from miami_2d s1, miami_directednetwork s2 where s1.tail=s2.head and s2.tail not in (select tail from miami_2d where head=s1.head)) as O where O.h1!=O.neighbor) to '/tmp/tmp.txt'

Basically, the two tables are just a copy of each other having 'tail' as neighbor of 'head'. I am trying to find the distance-2 or 2 hop neighbor of the head column.  
For example:


miami_directednetwork                                                      
head | tail
400 | 336
400 | 209
400 | 487
336 | 400
336 | 209
336 | 500
487 | 400
487 | 391

miami_2d 
head | tail
400 | 336
400 | 209
400 | 487
336 | 400
336 | 209
336 | 500
487 | 400
487 | 391

Result:
head | tail | neighbor
400 | 336 | 500 
400 | 209 | 391

The issue is that every time I run the above query, it either gets killed by the system or it just keeps running for hours( I kill it after I see it running for more than a day).
I am using the following settings in the postgers.conf file:
For the datanodes:
effective_cache_size=10480MB
shared_buffers=3072MB
work_mem=10480MB
For the coordinator:
effective_cache_size=1048MB
shared_buffers=512MB
work_mem=1024MB

My question is: Is there anything wrong with the way I have framed the query or perhaps something else wrong? 
Any help is appreciated!

Regards,
Rushi

Re: Trouble running nested select - Join query

From
David G Johnston
Date:
Rushi wrote
> copy (select * from (select s1.head as h1,s1.tail as t1 ,s2.tail as
> neighbor from miami_2d s1, miami_directednetwork s2 where s1.tail=s2.head
> and s2.tail not in (select tail from miami_2d where head=s1.head)) as O
> where O.h1!=O.neighbor) to '/tmp/tmp.txt'

Some thoughts:

The O sub-query could probably be top-level

A correlated sub-query in the where clause should generally use EXISTS, not
IN ...
... Though if you can get rid of the whole thing that would be better

You might want to use the statement_timeout GUC to kill the query if it
takes too long

You could separate the the select and the copy into two statements via use
of a temporary table - depending on how large the output is that would split
the action into two statements that could be independently monitored.
Though, to be honest now that I re-read I have no clue how this query
inter-plays with PostgresXC...

Can you run a smaller input size and use that estimate how long a 100M^2
query should take to complete?

Can you calculate in smaller block sizes on the input?

David J.






--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Trouble-running-nested-select-Join-query-tp5804938p5804995.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.