Re: Hash join on int takes 8..114 seconds - Mailing list pgsql-performance
From: PFC
Subject: Re: Hash join on int takes 8..114 seconds
Msg-id: op.uk0xtfsrcigqcu@soyouz
In response to: Re: Hash join on int takes 8..114 seconds (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-performance
On Fri, 21 Nov 2008 21:07:02 +0100, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> PFC <lists@peufeu.com> writes:
>> Index on orders_products( product_id ) and orders_products( order_id ):
>> => Same plan
>
>> Note that in this case, a smarter planner would use the new index to
>> perform a BitmapAnd before hitting the heap to get the rows.
>
> Considering that the query has no constraint on
> orders_products.order_id, I'm not sure what you think the extra index is
> supposed to be used *for*.
>
> (Well, we could put orders as the outside of a nestloop and then we'd
> have such a constraint, but with 30000 orders rows to process that plan
> would lose big.)
>
> (And yes, the planner did consider such a plan along the way.
> See choose_bitmap_and.)
>
> 	regards, tom lane

I think I didn't express myself correctly. Here the indexes are small (and therefore well cached) but the orders_products table is large (and not cached). To reproduce this, I put this table on a crummy slow external USB drive. Between each of the following queries, pg was stopped, the USB drive unmounted and remounted, and pg restarted, to purge the orders_products table out of all caches. I also modified the statistical distribution (see the init script at the bottom of this message).
EXPLAIN ANALYZE SELECT count(*)
FROM orders
JOIN orders_products USING (order_id)
WHERE orders.order_date BETWEEN '2000-01-01' AND '2000-02-01'
AND orders_products.product_id = 2345;

                                       QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=5431.93..5431.94 rows=1 width=0) (actual time=5176.382..5176.382 rows=1 loops=1)
   ->  Hash Join  (cost=1575.13..5431.84 rows=35 width=0) (actual time=62.634..5176.332 rows=36 loops=1)
         Hash Cond: (orders_products.order_id = orders.order_id)
         ->  Bitmap Heap Scan on orders_products  (cost=21.27..3864.85 rows=1023 width=4) (actual time=7.041..5118.512 rows=1004 loops=1)
               Recheck Cond: (product_id = 2345)
               ->  Bitmap Index Scan on orders_products_product_order  (cost=0.00..21.02 rows=1023 width=0) (actual time=0.531..0.531 rows=1004 loops=1)
                     Index Cond: (product_id = 2345)
         ->  Hash  (cost=1130.58..1130.58 rows=33862 width=4) (actual time=55.526..55.526 rows=31999 loops=1)
               ->  Index Scan using orders_date on orders  (cost=0.00..1130.58 rows=33862 width=4) (actual time=0.139..33.466 rows=31999 loops=1)
                     Index Cond: ((order_date >= '2000-01-01'::date) AND (order_date <= '2000-02-01'::date))
 Total runtime: 5176.659 ms

This is the original query; what I don't like about it is that it bitmap-scans orders_products far too much, because it reads all orders for the specified product, not just the orders in the date range we want. However, since Postgres has already scanned all the order_ids in the date range to build the hash, the list of order_ids of interest is known at no extra cost.
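To make the waste concrete, here is a toy cost model in Python (not SQL, since the idea is about heap fetches rather than a runnable query). The row counts are taken from the EXPLAIN output above (~1004 rows for product_id = 2345, 36 of them joining); the particular order_ids are made up for illustration:

```python
# Hypothetical order_ids of orders_products rows matching product_id = 2345:
# 36 ids inside the date range (order_id 1..31999), the rest outside it.
matching = list(range(1, 37)) + list(range(100_000, 100_968))
assert len(matching) == 1004

# order_ids that end up in the join hash (orders in the date range).
in_date_range = set(range(1, 32000))

# Hash join: every heap row matching product_id must be fetched from the
# (uncached) table before the join can discard it.
fetches_hash_join = len(matching)

# A plan that filtered on the known order_ids first would fetch only:
fetches_filtered = sum(1 for oid in matching if oid in in_date_range)

print(fetches_hash_join, fetches_filtered)  # -> 1004 36
```

With the table on slow storage and each fetch a random read, the 1004-vs-36 difference is roughly the 30X gap seen below.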
In this case, additionally, the correlation between order_id and order_date is 100%, so I can do:

test=> SELECT max(order_id), min(order_id) FROM orders
       WHERE order_date BETWEEN '2000-01-01' AND '2000-02-01';
  max  | min
-------+-----
 31999 |   1

And I can add an extra condition to the query, like this:

EXPLAIN ANALYZE SELECT count(*)
FROM orders
JOIN orders_products USING (order_id)
WHERE orders.order_date BETWEEN '2000-01-01' AND '2000-02-01'
AND orders_products.order_id BETWEEN 1 AND 31999
AND orders_products.product_id = 2345;

                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=426.80..426.81 rows=1 width=0) (actual time=179.233..179.233 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..426.79 rows=1 width=0) (actual time=6.667..179.190 rows=36 loops=1)
         ->  Index Scan using orders_products_product_order on orders_products  (cost=0.00..142.11 rows=34 width=4) (actual time=6.559..177.597 rows=36 loops=1)
               Index Cond: ((product_id = 2345) AND (order_id >= 1) AND (order_id <= 31999))
         ->  Index Scan using orders_pkey on orders  (cost=0.00..8.36 rows=1 width=4) (actual time=0.039..0.041 rows=1 loops=36)
               Index Cond: (orders.order_id = orders_products.order_id)
               Filter: ((orders.order_date >= '2000-01-01'::date) AND (orders.order_date <= '2000-02-01'::date))
 Total runtime: 179.392 ms

This is with no cache on the orders_products table: about 30X faster. Interestingly, when everything is cached, it is faster still (about 100X). The plan I was thinking about was not a nested loop with 30K loops; that would lose big, as you said.
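The manual rewrite above amounts to translating the date predicate into an order_id range that the composite (product_id, order_id) index can use directly. A minimal Python sketch of that narrowing, with made-up index entries (order_id is assigned in date order here, as in the test script, which is what makes the trick valid):

```python
import bisect

# Hypothetical order_ids of index entries for product_id = 2345,
# in B-tree order within the (product_id, order_id) composite index.
order_ids_for_product = [5, 900, 17000, 31999, 32000, 500000]

# Bounds from: SELECT min(order_id), max(order_id) FROM orders
#              WHERE order_date BETWEEN '2000-01-01' AND '2000-02-01';
lo, hi = 1, 31999

# With "order_id BETWEEN lo AND hi" added, the index scan touches
# only this slice of the entries instead of all of them.
i = bisect.bisect_left(order_ids_for_product, lo)
j = bisect.bisect_right(order_ids_for_product, hi)
print(order_ids_for_product[i:j])  # -> [5, 900, 17000, 31999]
```

The caveat, of course, is that this only works because order_id and order_date are perfectly correlated; with uncorrelated data the derived range would cover almost the whole table.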
It would have been something like this:

- There is an index on (product_id, order_id).
- Build the hash from the orders table (can't avoid it):

-> Hash
   -> Index Scan using orders_date on orders
      Index Cond: ((order_date >= '2000-01-01'::date) AND (order_date <= '2000-02-01'::date))

- Then a slightly twisted form of bitmap scan:

-> Bitmap Heap Scan on orders_products
   Recheck Cond: ((product_id = 2345) AND (order_id IN (hash created above)))
   -> Bitmap Index Scan on orders_products_product_order
      Index Cond: ((product_id = 2345) AND (order_id IN (hash created above)))

The Bitmap Index Scan sees the order_ids in the index it is scanning; they could be checked against the hash before checking visibility in the heap of the big table.

Test script:

BEGIN;

CREATE TABLE orders (order_id INTEGER NOT NULL, order_date DATE NOT NULL);
CREATE TABLE products (product_id INTEGER NOT NULL, product_name TEXT NOT NULL);
CREATE TABLE orders_products (
    order_id INTEGER NOT NULL,
    product_id INTEGER NOT NULL,
    padding1 TEXT,
    padding2 TEXT
) TABLESPACE usb;

INSERT INTO products
    SELECT n, 'product number ' || n::TEXT
    FROM generate_series(1,10001) AS n;

INSERT INTO orders
    SELECT n, '2000-01-01'::date + (n/1000 * '1 DAY'::interval)
    FROM generate_series(1,1000000) AS n;

SET work_mem TO '1GB';

INSERT INTO orders_products
    SELECT a, b,
           'aibaifbaurgbyioubyfazierugybfoaybofauez',
           'hfohbdsqbhjhqsvdfiuazvfgiurvgazrhbazboifhaoifh'
    FROM (SELECT DISTINCT (1+(n/10))::INTEGER AS a,
                          (1+(random()*10000))::INTEGER AS b
          FROM generate_series(1,9999999) AS n) AS x;

DELETE FROM orders_products WHERE product_id NOT IN (SELECT product_id FROM products);
DELETE FROM orders_products WHERE order_id NOT IN (SELECT order_id FROM orders);

ALTER TABLE orders ADD PRIMARY KEY (order_id);
ALTER TABLE products ADD PRIMARY KEY (product_id);
ALTER TABLE orders_products ADD PRIMARY KEY (order_id, product_id);
ALTER TABLE orders_products ADD FOREIGN KEY (product_id) REFERENCES products( product_id ) ON DELETE CASCADE;
ALTER TABLE orders_products ADD FOREIGN KEY (order_id) REFERENCES orders( order_id ) ON DELETE CASCADE;

CREATE INDEX orders_date ON orders( order_date );
CREATE INDEX orders_products_product_order ON orders_products( product_id, order_id );

COMMIT;

SET work_mem TO DEFAULT;
ANALYZE;
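The hypothetical plan PFC describes — a bitmap index scan that consults the join hash before touching the heap — can be simulated in a few lines of Python (plain sets and tuples standing in for the hash table and the (product_id, order_id) B-tree; all names and values are illustrative, this is not how the PostgreSQL executor is structured):

```python
# Step 1: build the hash from orders, i.e. the order_ids in the date range
# (here, toy data: order_ids 1..31999 are all in range).
join_hash = set(range(1, 32000))

# The (product_id, order_id) index, sorted as a composite B-tree would be.
index_entries = sorted([(2345, 10), (2345, 31999), (2345, 500000), (9999, 10)])

# Step 2: bitmap index scan with the membership check pushed down.
# Because order_id is visible in the index entry itself, rows whose
# order_id is not in the hash never generate a heap fetch at all.
heap_to_fetch = [oid for (pid, oid) in index_entries
                 if pid == 2345 and oid in join_hash]
print(heap_to_fetch)  # -> [10, 31999]; (2345, 500000) is skipped before heap access
```

The payoff is exactly the point of the thread: the expensive random I/O on the big, uncached table is limited to rows that can actually join.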