Apparent missed query optimization with self-join and inner grouping - Mailing list pgsql-general
From | Zack Weinberg |
---|---|
Subject | Apparent missed query optimization with self-join and inner grouping |
Date | |
Msg-id | CAKCAbMh5V51XS1UQukk9t49tz5PfdTd56e61Hjt6u5wRVQf93g@mail.gmail.com Whole thread Raw |
Responses |
Re: Apparent missed query optimization with self-join and inner grouping
Re: Apparent missed query optimization with self-join and inner grouping |
List | pgsql-general |
I have a table recording the results of a web crawl. (Table definition at the end of this message.) The relevant part of the data stored in it looks like this: id | url_id | full_url_id | experiment_id | redirect_num ------+--------+-------------+---------------+-------------- 2617 | 1312 | 1312 | 16 | 0 2618 | 1312 | 2311 | 16 | 1 2619 | 1312 | 2312 | 16 | 2 2620 | 1312 | 2313 | 16 | 3 2631 | 1320 | 1320 | 43 | 0 2633 | 1320 | 2312 | 43 | 2 2632 | 1320 | 2317 | 43 | 1 2634 | 1320 | 2318 | 43 | 3 For each (experiment_id, url_id) pair for some small subset of the experiment_ids, I need to query the full_url_id corresponding to the *largest* value of redirect_num. The query planner does something reasonable with this SELECT: => explain (analyze, verbose) select b.experiment_id, b.url_id, b.full_url_id from blockpage b, (select experiment_id, url_id, max(redirect_num) as redirect_num from blockpage group by experiment_id, url_id) bm where b.experiment_id = bm.experiment_id and b.url_id = bm.url_id and b.redirect_num = bm.redirect_num and bm.experiment_id in (16, 43); Nested Loop (cost=1.14..88505.96 rows=20 width=12) (actual time=0.041..1.723 rows=501 loops=1) Output: b.experiment_id, b.url_id, b.full_url_id -> GroupAggregate (cost=0.57..15442.73 rows=8543 width=12) (actual time=0.033..0.501 rows=501 loops=1) Output: blockpage.experiment_id, blockpage.url_id, max(blockpage.redirect_num) Group Key: blockpage.experiment_id, blockpage.url_id -> Index Only Scan using blockpage_experiment_id_url_id_redirect_num_blockpage_reason__ on iclab.blockpage (cost=0.57..15293.19 rows=8547 width=12) (actual time=0.026..0.283 rows=803 loops=1) Output: blockpage.experiment_id, blockpage.url_id, blockpage.full_url_id, blockpage.redirect_num, blockpage.html_tag_id Index Cond: (blockpage.experiment_id = ANY ('{16,43}'::integer[])) Heap Fetches: 803 -> Index Only Scan using blockpage_experiment_id_url_id_redirect_num_blockpage_reason__ on iclab.blockpage b (cost=0.57..8.53 rows=1 width=16) (actual time=0.002..0.002 rows=1 loops=501) Output: b.experiment_id, b.url_id, b.full_url_id, b.redirect_num, b.html_tag_id Index Cond: ((b.experiment_id = blockpage.experiment_id) AND (b.url_id = blockpage.url_id) AND (b.redirect_num = (max(blockpage.redirect_num)))) Heap Fetches: 501 Planning Time: 0.331 ms Execution Time: 1.784 ms But if I change the final part of the WHERE to reference b.experiment_id instead of bm.experiment_id, I get this much more expensive query plan: => explain (analyze, verbose) select b.experiment_id, b.url_id, b.full_url_id from blockpage b, (select experiment_id, url_id, max(redirect_num) as redirect_num from blockpage group by experiment_id, url_id) bm where b.experiment_id = bm.experiment_id and b.url_id = bm.url_id and b.redirect_num = bm.redirect_num and b.experiment_id in (16, 43); Hash Join (cost=2749504.19..2764864.13 rows=2 width=12) (actual time=144028.343..144029.545 rows=501 loops=1) Output: b.experiment_id, b.url_id, b.full_url_id Inner Unique: true Hash Cond: ((b.experiment_id = blockpage.experiment_id) AND (b.url_id = blockpage.url_id) AND (b.redirect_num = (max(blockpage.redirect_num)))) -> Index Only Scan using blockpage_experiment_id_url_id_redirect_num_blockpage_reason__ on iclab.blockpage b (cost=0.57..15293.19 rows=8547 width=16) (actual time=0.039..0.387 rows=803 loops=1) Output: b.experiment_id, b.url_id, b.full_url_id, b.redirect_num, b.html_tag_id Index Cond: (b.experiment_id = ANY ('{16,43}'::integer[])) Heap Fetches: 803 -> Hash (cost=2595219.62..2595219.62 rows=8816229 width=12) (actual time=143941.931..143941.931 rows=57061228 loops=1) Output: blockpage.experiment_id, blockpage.url_id, (max(blockpage.redirect_num)) Buckets: 67108864 (originally 16777216) Batches: 1 (originally 1) Memory Usage: 2976138kB -> HashAggregate (cost=2418895.04..2507057.33 rows=8816229 width=12) (actual time=90020.851..122656.924 rows=57061228 loops=1) Output: blockpage.experiment_id, blockpage.url_id, max(blockpage.redirect_num) Group Key: blockpage.experiment_id, blockpage.url_id -> Seq Scan on iclab.blockpage (cost=0.00..1757677.88 rows=88162288 width=12) (actual time=0.020..32910.709 rows=88164599 loops=1) Output: blockpage.id, blockpage.url_id, blockpage.full_url_id, blockpage.experiment_id, blockpage.blockpage_reason_id, blockpage.html_tag_id, blockpage.body_len, blockpage.blockpage_diff, blockpage.has_blockpage_regex, blockpage.http_status, blockpage.redirect_num, blockpage.response_failure Planning Time: 0.409 ms JIT: Functions: 17 Options: Inlining true, Optimization true, Expressions true, Deforming true Timing: Generation 2.098 ms, Inlining 27.226 ms, Optimization 140.416 ms, Emission 79.840 ms, Total 249.580 ms Execution Time: 145108.471 ms Since we have `b.experiment_id = bm.experiment_id` as a join condition, it seems to me that the query planner ought to have recognized that the `experiment_id in (16, 43)` condition could be pushed down into the sub-select in both cases. What is the best way to report this to the developers? Should I file a bug report? I'm using Postgres 12.2. Thanks, zw CREATE TABLE iclab.blockpage ( id BIGSERIAL PRIMARY KEY, url_id INTEGER NOT NULL, full_url_id INTEGER NOT NULL, experiment_id INTEGER NOT NULL, blockpage_reason_id INTEGER, html_tag_id INTEGER NOT NULL, body_len INTEGER, blockpage_diff REAL, has_blockpage_regex BOOLEAN, http_status INTEGER, redirect_num INTEGER NOT NULL, response_failure INTEGER, FOREIGN KEY (experiment_id) REFERENCES iclab.experiment_results(id), FOREIGN KEY(url_id) REFERENCES iclab.URL(id), FOREIGN KEY(url_id) REFERENCES iclab.URL(id), FOREIGN KEY(blockpage_reason_id) REFERENCES iclab.blockpage_reason(id), FOREIGN KEY(html_tag_id) REFERENCES iclab.html_tag(id) ); create unique index blockpage__experiment_id_url_id_redirect_num_blockpage_reason on iclab.blockpage(experiment_id , url_id, full_url_id, redirect_num, html_tag_id);
pgsql-general by date: