Joining on CTE is unusually slow? - Mailing list pgsql-general

From Jon Rosebaugh
Subject Joining on CTE is unusually slow?
Date
Msg-id 1406676160.968728.147067773.4399E74D@webmail.messagingengine.com
Whole thread Raw
Responses Re: Joining on CTE is unusually slow?  (David G Johnston <david.g.johnston@gmail.com>)
List pgsql-general
I have a CTE that produces some row ids. I want to do a query with a
complicated join based on those row ids. I've tried running them split
into two (run CTE query, collect row ids, then run the complicated query
with id IN (id_1, id_2, id_3)) and it takes only a few seconds to run,
but when I run this as a single query it takes several minutes. Is this
expected behavior?

We have a message table (msg) where each message has a parent_id (a
nullable FK to the same table). I use the CTE to assemble a list of all
messages downstream of a given message (including that message). It
works pretty well:

WITH RECURSIVE downstream_thread(id, parent_id) AS
(SELECT pn_msg.id AS id, pn_msg.parent_id AS parent_id
FROM msg_table AS pn_msg JOIN msg_asset_table ON pn_msg.id =
msg_asset_table.msg_id JOIN asset_table ON asset_table.id =
msg_asset_table.asset_id
WHERE pn_msg.message_type = 'interest_notification' AND asset_table.type
= 'tp' AND asset_table.public_id = '59c89bdcaf6711e3b67f12313b0a607d'
UNION SELECT pr_msg.id AS id, pr_msg.parent_id AS parent_id
FROM msg_table AS pr_msg JOIN downstream_thread ON pr_msg.parent_id =
downstream_thread.id)
SELECT id from downstream_thread;

There are 2.3 million rows in msg_table, but for any expected asset
public_id, this query will return only a dozen rows at most, and runs in
6 seconds.

Once I have the row ids, I use this complicated join; basically I'm
finding replies to the original message that come from any user at the
company the original message was addressed to. This takes under a
second.

SELECT notification_reply_msg.*
FROM msg_table AS notification_reply_msg
JOIN reference_table AS notification_reply_ref ON msg_table.id =
notification_reply_ref.msg_id
JOIN thread_table AS notification_reply_thread ON
notification_reply_ref.thread_id = notification_reply_thread.id
JOIN user_table AS notification_reply_user ON
notification_reply_thread.user_id = notification_reply_user.id
JOIN user_table AS interest_notification_user ON
interest_notification_user.company_id =
notification_reply_user.company_id
JOIN thread_table AS interest_notification_thread ON
interest_notification_thread.user_id = interest_notification_user.id
JOIN reference_table AS interest_notification_ref ON
interest_notification_ref.thread_id = interest_notification_thread.id
JOIN msg_table AS interest_notification_msg ON
interest_notification_ref.msg_id = interest_notification_msg.id
WHERE interest_notification_msg.id IN (2648995, 2648996) and
notification_reply_msg.id IN (2648995, 2648996)
AND interest_notification_msg.message_type = 'interest_notification' AND
interest_notification_ref.header = 'to' AND
notification_reply_ref.header = 'from';

However, I tried combining the two queries:

WITH RECURSIVE downstream_thread(id, parent_id) AS
(SELECT pn_msg.id AS id, pn_msg.parent_id AS parent_id
FROM msg_table AS pn_msg JOIN msg_asset_table ON pn_msg.id =
msg_asset_table.msg_id JOIN asset_table ON asset_table.id =
msg_asset_table.asset_id
WHERE pn_msg.message_type = 'interest_notification' AND asset_table.type
= 'tp' AND asset_table.public_id = '59c89bdcaf6711e3b67f12313b0a607d'
UNION SELECT pr_msg.id AS id, pr_msg.parent_id AS parent_id
FROM msg_table AS pr_msg JOIN downstream_thread ON pr_msg.parent_id =
downstream_thread.id)
SELECT notification_reply_msg.*
FROM msg_table AS notification_reply_msg
JOIN reference_table AS notification_reply_ref ON msg_table.id =
notification_reply_ref.msg_id
JOIN thread_table AS notification_reply_thread ON
notification_reply_ref.thread_id = notification_reply_thread.id
JOIN user_table AS notification_reply_user ON
notification_reply_thread.user_id = notification_reply_user.id
JOIN user_table AS interest_notification_user ON
interest_notification_user.company_id =
notification_reply_user.company_id
JOIN thread_table AS interest_notification_thread ON
interest_notification_thread.user_id = interest_notification_user.id
JOIN reference_table AS interest_notification_ref ON
interest_notification_ref.thread_id = interest_notification_thread.id
JOIN msg_table AS interest_notification_msg ON
interest_notification_ref.msg_id = interest_notification_msg.id
WHERE interest_notification_msg.id IN (SELECT id from downstream_thread)
and notification_reply_msg.id IN (SELECT id from downstream_thread)
AND interest_notification_msg.message_type = 'interest_notification' AND
interest_notification_ref.header = 'to' AND
notification_reply_ref.header = 'from';

The only difference is that I've added the WITH RECURSIVE expression at
the beginning and changed the id list to "SELECT id FROM
downstream_thread".

This takes over eight minutes to run. Is this the expected behavior when
joining on CTE expressions?

I realize I haven't given the full schema/metadata/explain output as
explained in the "Slow Query Questions" wiki page; I wasn't sure if that
applied in this case because it's more a question of why combining these
two parts is much slower. However, I'm happy to provide it all; I'm just
worried about how to identify what's relevant, because that's a lot of
tables in that join.


pgsql-general by date:

Previous
From: John R Pierce
Date:
Subject: Re: Is there any way to recover updated values from the table(Need old value)
Next
From: David G Johnston
Date:
Subject: Re: Joining on CTE is unusually slow?