Re: [PERFORM] Big IN() clauses etc : feature proposal - Mailing list pgsql-hackers
From | PFC |
---|---|
Subject | Re: [PERFORM] Big IN() clauses etc : feature proposal |
Date | |
Msg-id | op.s9alfhjrcigqcu@apollo13 Whole thread Raw |
In response to | Re: Big IN() clauses etc : feature proposal (Greg Stark <gsstark@mit.edu>) |
Responses |
Re: [PERFORM] Big IN() clauses etc : feature proposal
|
List | pgsql-hackers |
> Creating cursors for a simple plan like a single sequential scan is fast > because it's using the original data from the table. I used the following query : SELECT * FROM bookmarks ORDER BY annonce_id DESC LIMIT 20 It's a backward index scan + limit... not a seq scan. And it's damn fast : 0.042 ms BEGIN 0.153 ms DECLARE tmp SCROLL CURSOR WITHOUT HOLD FOR SELECT * FROM bookmarks ORDER BY annonce_id DESC LIMIT 20 0.246 ms FETCH ALL FROM tmp 0.048 ms MOVE FIRST IN tmp 0.246 ms FETCH ALL FROM tmp 0.048 ms CLOSE tmp 0.084 ms COMMIT > But your example was > predicated on this part of the job being a complex query. If it's a > complex > query involving joins and groupings, etc, then it will have to be > materialized > and there's no (good) reason for that to be any faster than a temporary > table > which is effectively the same thing. You mean the cursors'storage is in fact the same internal machinery as a temporary table ? In that case, this raises an interesting question : why is the cursor faster ? Let's try a real-life example from my website : it is a search query (quite complex) which is then joined to a lot of tables to resolve FKeys. To that query I must add add an application-made join using a big IN() clause extracted from the data. Timings includes the time to fetch the results into Python. The "running total" column is the sum of all timings since the BEGIN. query_time running_total rows query 0.061 ms 0.061 ms -1 BEGIN 23.420 ms 23.481 ms 85 SELECT * FROM (huge query with a lot of joins) 4.318 ms 27.799 ms 2 SELECT l.*, u.login, u.bg_color FROM annonces_log l, users u WHERE u.id=l.user_id AND l.annonce_id IN (list of ids from previous query) ORDER BY annonce_id, added 0.241 ms 28.040 ms -1 COMMIT (Just in case you want to hurt yourself, here's the EXPLAIN ANALYZE output : http://peufeu.com/temp/big_explain.txt) Using a cursor takes about the same time. Also, doing just the search query takes about 12 ms, the joins take up the rest. Now, I'll rewrite my query eliminating the joins and using a temp table. Storing the whole result in the temp table will be too slow, because there are too many columns. Therefore I will only store the primary and foreign key columns, and join again to the main table to get the full records. query_time running_total rows query 0.141 ms 0.141 ms -1 BEGIN Do the search : 8.229 ms 8.370 ms -1 CREATE TEMPORARY TABLE tmp AS SELECT id, city_id, zipcode, contact_id, contact_group_id, price/terrain as sort FROM (stripped down search query) 0.918 ms 9.287 ms -1 ANALYZE tmp Fetch the main data to display : 7.663 ms 16.951 ms 85 SELECT a.* FROM tmp t, annonces_display a WHERE a.id=t.id ORDER BY t.sort Fetch log entries associates with each row (one row to many log entries) : 1.021 ms 17.972 ms 2 SELECT l.*, u.login, u.bg_color FROM annonces_log l, users u, tmp t WHERE u.id=l.user_id AND l.annonce_id = t.id ORDER BY annonce_id, added 3.468 ms 21.440 ms 216 SELECT annonce_id, array_accum(list_id) AS list_ids, array_accum(COALESCE(user_id,0)) AS list_added_by, max(added) AS added_to_list FROM bookmarks GROUP BY annonce_id Resolve foreign key relations 1.034 ms 22.474 ms 37 SELECT r.annonce_id FROM read_annonces r, tmp t WHERE r.annonce_id = t.id 0.592 ms 23.066 ms 9 SELECT * FROM cities_dist_zipcode WHERE zipcode IN (SELECT zipcode FROM tmp) 0.716 ms 23.782 ms 11 SELECT * FROM cities_dist WHERE id IN (SELECT city_id FROM tmp) 1.125 ms 24.907 ms 45 SELECT * FROM contacts WHERE id IN (SELECT contact_id FROM tmp) 0.799 ms 25.705 ms 42 SELECT * FROM contact_groups WHERE id IN (SELECT contact_group_id FROM tmp) 0.463 ms 26.169 ms -1 DROP TABLE tmp 32.208 ms 58.377 ms -1 COMMIT From this we see : Using a temporary table is FASTER than doing the large query with all the joins. (26 ms versus 28 ms). It's also nicer and cleaner. However the COMMIT takes as much time as all the queries together ! Let's run with fsync=off : query_time running_total rows query 0.109 ms 0.109 ms -1 BEGIN 8.321 ms 8.430 ms -1 CREATE TEMPORARY TABLE tmp AS SELECT id, city_id, zipcode, contact_id, contact_group_id, price/terrain as sort FROM (stripped down search query) 0.849 ms 9.280 ms -1 ANALYZE tmp 7.360 ms 16.640 ms 85 SELECT a.* FROM tmp t, annonces_display a WHERE a.id=t.id ORDER BY t.sort 1.067 ms 17.707 ms 2 SELECT l.*, u.login, u.bg_color FROM annonces_log l, users u, tmp t WHERE u.id=l.user_id AND l.annonce_id = t.id ORDER BY annonce_id, added 3.322 ms 21.030 ms 216 SELECT annonce_id, array_accum(list_id) AS list_ids, array_accum(COALESCE(user_id,0)) AS list_added_by, max(added) AS added_to_list FROM bookmarks GROUP BY annonce_id 0.896 ms 21.926 ms 37 SELECT r.annonce_id FROM read_annonces r, tmp t WHERE r.annonce_id = t.id 0.573 ms 22.499 ms 9 SELECT * FROM cities_dist_zipcode WHERE zipcode IN (SELECT zipcode FROM tmp) 0.678 ms 23.177 ms 11 SELECT * FROM cities_dist WHERE id IN (SELECT city_id FROM tmp) 1.064 ms 24.240 ms 45 SELECT * FROM contacts WHERE id IN (SELECT contact_id FROM tmp) 0.772 ms 25.013 ms 42 SELECT * FROM contact_groups WHERE id IN (SELECT contact_group_id FROM tmp) 0.473 ms 25.485 ms -1 DROP TABLE tmp 1.777 ms 27.262 ms -1 COMMIT There, it's good again. So, when fsync=on, and temporary tables are used, something slow happens on commit (even if the temp table is ON COMMIT DROP...) Thoughts ?
pgsql-hackers by date: