Thread: Trouble running nested select - Join query

Trouble running nested select - Join query

From

RUSHI KAW

Date:

26 May 2014, 05:01:48

Hi,

I have been working with Postgresxc 1.1 on a cluster, each cluster compute node having 48GB RAM.

I have instantiated the coordinator on one of the compute nodes and have 4 datanodes (2+2) running on two other compute nodes.

I have been trying to run the following query which is a select on two large, same size tables - miami_2d(head int, tail int) and miami_directednetwork(head int, tail int) (both of them have 105400516 rows).

miami_2d has been replicated on all datanodes and miami_directednetwork has been hash distributed on the head column across all 4 datanodes.

I have indexes on both the tables.

The query is:

copy (select * from (select s1.head as h1,s1.tail as t1 ,s2.tail as neighbor from miami_2d s1, miami_directednetwork s2 where s1.tail=s2.head and s2.tail not in (select tail from miami_2d where head=s1.head)) as O where O.h1!=O.neighbor) to '/tmp/tmp.txt'

Basically, the two tables are just a copy of each other having 'tail' as neighbor of 'head'. I am trying to find the distance-2 or 2 hop neighbor of the head column.

For example:

miami_directednetwork

head | tail

400 | 336

400 | 209

400 | 487

336 | 400

336 | 209

336 | 500

487 | 400

487 | 391

miami_2d

head | tail

400 | 336

400 | 209

400 | 487

336 | 400

336 | 209

336 | 500

487 | 400

487 | 391

Result:

head | tail | neighbor

400 | 336 | 500

400 | 209 | 391

The issue is that every time I run the above query, it either gets killed by the system or it just keeps running for hours( I kill it after I see it running for more than a day).

I am using the following settings in the postgers.conf file:

For the datanodes:

effective_cache_size=10480MB

shared_buffers=3072MB

work_mem=10480MB

For the coordinator:

effective_cache_size=1048MB

shared_buffers=512MB

work_mem=1024MB

My question is: Is there anything wrong with the way I have framed the query or perhaps something else wrong?

Any help is appreciated!

Regards,

Rushi

Re: Trouble running nested select - Join query

From

David G Johnston

Date:

26 May 2014, 20:14:37

Rushi wrote
> copy (select * from (select s1.head as h1,s1.tail as t1 ,s2.tail as
> neighbor from miami_2d s1, miami_directednetwork s2 where s1.tail=s2.head
> and s2.tail not in (select tail from miami_2d where head=s1.head)) as O
> where O.h1!=O.neighbor) to '/tmp/tmp.txt'

Some thoughts:

The O sub-query could probably be top-level

A correlated sub-query in the where clause should generally use EXISTS, not
IN ...
... Though if you can get rid of the whole thing that would be better

You might want to use the statement_timeout GUC to kill the query if it
takes too long

You could separate the the select and the copy into two statements via use
of a temporary table - depending on how large the output is that would split
the action into two statements that could be independently monitored.
Though, to be honest now that I re-read I have no clue how this query
inter-plays with PostgresXC...

Can you run a smaller input size and use that estimate how long a 100M^2
query should take to complete?

Can you calculate in smaller block sizes on the input?

David J.






--
View this message in context:
http://postgresql.1045698.n5.nabble.com/Trouble-running-nested-select-Join-query-tp5804938p5804995.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.