Re: Why hash join instead of nested loop? - Mailing list pgsql-performance
From | Havasvölgyi Ottó |
---|---|
Subject | Re: Why hash join instead of nested loop? |
Date | |
Msg-id | 001801c59a0a$f4be7480$9a00a8c0@OTTO Whole thread Raw |
In response to | Why hash join instead of nested loop? (Rhett Garber <rhettg@gmail.com>) |
Responses |
Re: Why hash join instead of nested loop?
|
List | pgsql-performance |
Rhett, Please post the explain analyze for both queries. From that we can see the predicted and the actual costs of them. Regards, Otto ----- Original Message ----- From: "Rhett Garber" <rhettg@gmail.com> To: <pgsql-performance@postgresql.org> Sent: Friday, August 05, 2005 8:35 PM Subject: [PERFORM] Why hash join instead of nested loop? I've got similiar queries that I think should be evaluated (as displayed through 'explain') the same, but don't. Hopefully this is the rigth place to send such a question and one of you can help explain this to me. The Tables: Connection - 1.2 million entries, about 60 megs, 3 integer fields that link two tables together (i.e. an identifier and two foreign keys). has an index on the identifier and either of the foreign keys. rtmessagestate - very small, 5 entries rtmessage - pretty big, 80,000 entries The Queries: select rtmessagestate.* from rtmessagestate, connection where connection_registry_id = 40105 and obj1 = 73582 and obj2 = rtmessagestate.id; returns 1 in 13.7 ms select rtmessage.id, subject from rtmessage, connection where connection_registry_id = 40003 and obj1 = 4666 and obj2 = rtmessage.id; returns 12 in 2 ms Some more possibly important details: entries in Connection with connection_registry_id = 40105: 30,000 entries with this id and obj1 = 73582: 1 entries in Connection with connection_registry_id = 40003: 6,000 entries with this id and obj1 = 4666: 20 but as I said before, there is an btree index on (connection_registry_id, obj1) Explain: The first query, breaks down as: Hash Join (cost=5.96..7.04 rows=1 width=14) Hash Cond: ("outer".id = "inner".obj2) -> Seq Scan on rtmessagestate (cost=0.00..1.05 rows=5 width=14) -> Hash (cost=5.96..5.96 rows=1 width=4) -> Index Scan using connection_regid_obj1_index on connection (cost=0.00..5.96 rows=1 width=4) Index Cond: ((connection_registry_id = 40105) AND (obj1 = 73582))(6 rows) While the second query is: Nested Loop (cost=0.00..11.62 rows=2 width=38) -> Index Scan using connection_regid_obj1_index on connection (cost=0.00..5.96 rows=1 width=4) Index Cond: ((connection_registry_id = 40003) AND (obj1 = 4666)) -> Index Scan using rtmessage_pkey on rtmessage (cost=0.00..5.65 rows=1 width=38) Index Cond: ("outer".obj2 = rtmessage.id) (5 rows) Actually running these queries shows that the second one (nested loop) is much faster than the hash join, presumably because of hash startup costs. Any ideas how I can make them both use the nested loop. I assume that this would be the fastest for both. Oddly enough, running the 1st query (rtmessagestate) as two queries or with a sub query is way faster than doing the join. And yes, I realize this schema may not be the most efficient for these examples, but it seems to be the most flexible. I'm working on some schema improvements also but if I could understand why this is slow that woudl probably help also. Thanks for you help, Rhett ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
pgsql-performance by date: