window function induces full table scan - Mailing list pgsql-performance
From | Thomas Mayer |
---|---|
Subject | window function induces full table scan |
Date | |
Msg-id | 52C5DAD1.20109@student.kit.edu Whole thread Raw |
Responses |
Re: window function induces full table scan
|
List | pgsql-performance |
When querying a view with a WHERE condition, postgresql normally is able to perform an index scan which reduces time for evaluation dramatically. However, if a window function is evaluated in the view, postgresql is evaluating the window function before the WHERE condition is applied. This induces a full table scan. These are the results of EXPLAIN: -- without window function (non-equivalent) explain select * from without_window_function where user_id = 43; QUERY PLAN ----------------------------------------------------------------------------------------------- Index Scan using idx_checkin_node_user_id on checkin_node (cost=0.43..26.06 rows=2 width=20) Index Cond: (user_id = 43) Filter: (((id % 1000) + 1) = 1) -- with window function explain select * from last_position where user_id = 43; QUERY PLAN ------------------------------------------------------------------------------------------ Subquery Scan on tmp_last_position (cost=973803.66..1151820.09 rows=2 width=20) Filter: ((tmp_last_position.datepos = 1) AND (tmp_last_position.user_id = 43)) -> WindowAgg (cost=973803.66..1080613.52 rows=4747105 width=32) -> Sort (cost=973803.66..985671.42 rows=4747105 width=32) Sort Key: checkin_node.user_id, checkin_node.date, checkin_node.id -> Seq Scan on checkin_node (cost=0.00..106647.05 rows=4747105 width=32) To work around this, I avoid using a view for that (equivalent): EXPLAIN SELECT user_id, latitude, longitude FROM ( SELECT user_id, latitude, longitude, rank() OVER (PARTITION BY user_id ORDER BY date DESC, id DESC) AS datepos FROM checkin_node WHERE user_id = 43 ) AS tmp_last_position WHERE datepos = 1; -- takes 2 ms QUERY PLAN ------------------------------------------------------------------------------------------------------------------- Subquery Scan on tmp_last_position (cost=39.70..52.22 rows=2 width=20) Filter: (tmp_last_position.datepos = 1) -> WindowAgg (cost=39.70..47.40 rows=385 width=32) -> Sort (cost=39.70..40.67 rows=385 width=32) Sort Key: checkin_node.date, checkin_node.id -> Index Scan using idx_checkin_node_user_id on checkin_node (cost=0.43..23.17 rows=385 width=32) Index Cond: (user_id = 43) I would expect postgresql to apply this query plan also for the view last_position. It's 6621ms vs. 2ms, so the speedup is 3310! Is it a bug in the optimizer? How to reproduce: ================= OS: ubuntu 12.04 Postgresql v9.3.2 get some sample data: wget -qO- http://snap.stanford.edu/data/loc-brightkite_totalCheckins.txt.gz|gunzip -c|dos2unix|awk '{ if (length($0) > 20) print }'>test.csv execute psql script: \timing on BEGIN; DROP TABLE IF EXISTS checkin_node CASCADE; CREATE TABLE checkin_node ( id SERIAL NOT NULL PRIMARY KEY, user_id INTEGER NOT NULL, date TIMESTAMP NOT NULL, latitude DOUBLE PRECISION NOT NULL, longitude DOUBLE PRECISION NOT NULL, original_id VARCHAR NOT NULL ); \COPY checkin_node (user_id, date, latitude, longitude, original_id) FROM 'test.csv' WITH DELIMITER E'\t'; ALTER TABLE checkin_node DROP COLUMN original_id; CREATE INDEX idx_checkin_node_user_id ON checkin_node(user_id); CREATE INDEX idx_checkin_node_date ON checkin_node(date); COMMIT; VACUUM ANALYZE checkin_node; -- doing window function in a view DROP VIEW IF EXISTS last_position CASCADE; CREATE VIEW last_position (user_id, latitude, longitude) AS ( SELECT user_id, latitude, longitude FROM ( SELECT user_id, latitude, longitude, rank() OVER (PARTITION BY user_id ORDER BY date DESC, id DESC) AS datepos FROM checkin_node ) AS tmp_last_position WHERE datepos = 1 ); select * from last_position where user_id = 43; -- takes 6621ms -- similar view but without window function (non-equivalent) DROP VIEW IF EXISTS without_window_function CASCADE; CREATE VIEW without_window_function (user_id, latitude, longitude) AS ( SELECT user_id, latitude, longitude FROM ( SELECT user_id, latitude, longitude, (id % 1000)+1 AS datepos --to not use a constant here FROM checkin_node ) AS tmp_last_position WHERE datepos = 1 ); select * from without_window_function where user_id = 43; -- takes 10ms -- workaround: avoid using views (equivalent) SELECT user_id, latitude, longitude FROM ( SELECT user_id, latitude, longitude, rank() OVER (PARTITION BY user_id ORDER BY date DESC, id DESC) AS datepos FROM checkin_node WHERE user_id = 43 ) AS tmp_last_position WHERE datepos = 1; -- takes 2 ms
pgsql-performance by date: