window function induces full table scan - Mailing list pgsql-performance

From Thomas Mayer
Subject window function induces full table scan
Date
Msg-id 52C5DAD1.20109@student.kit.edu
Whole thread Raw
Responses Re: window function induces full table scan
List pgsql-performance
When querying a view with a WHERE condition, postgresql normally is able
to perform an index scan which reduces time for evaluation dramatically.

However, if a window function is evaluated in the view, postgresql is
evaluating the window function before the WHERE condition is applied.
This induces a full table scan.

These are the results of EXPLAIN:
-- without window function (non-equivalent)
explain select * from without_window_function where user_id = 43;
           QUERY PLAN
-----------------------------------------------------------------------------------------------
  Index Scan using idx_checkin_node_user_id on checkin_node
(cost=0.43..26.06 rows=2 width=20)
    Index Cond: (user_id = 43)
    Filter: (((id % 1000) + 1) = 1)

-- with window function
  explain select * from last_position where user_id = 43;
           QUERY PLAN
------------------------------------------------------------------------------------------
  Subquery Scan on tmp_last_position  (cost=973803.66..1151820.09 rows=2
width=20)
    Filter: ((tmp_last_position.datepos = 1) AND
(tmp_last_position.user_id = 43))
    ->  WindowAgg  (cost=973803.66..1080613.52 rows=4747105 width=32)
          ->  Sort  (cost=973803.66..985671.42 rows=4747105 width=32)
                Sort Key: checkin_node.user_id, checkin_node.date,
checkin_node.id
                ->  Seq Scan on checkin_node  (cost=0.00..106647.05
rows=4747105 width=32)

To work around this, I avoid using a view for that (equivalent):
EXPLAIN SELECT user_id, latitude, longitude
  FROM (
         SELECT
           user_id,
           latitude,
           longitude,
           rank() OVER (PARTITION BY user_id ORDER BY date DESC, id
DESC) AS datepos
         FROM checkin_node
         WHERE user_id = 43
       ) AS tmp_last_position
  WHERE datepos = 1; -- takes 2 ms
         QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
  Subquery Scan on tmp_last_position  (cost=39.70..52.22 rows=2 width=20)
    Filter: (tmp_last_position.datepos = 1)
    ->  WindowAgg  (cost=39.70..47.40 rows=385 width=32)
          ->  Sort  (cost=39.70..40.67 rows=385 width=32)
                Sort Key: checkin_node.date, checkin_node.id
                ->  Index Scan using idx_checkin_node_user_id on
checkin_node  (cost=0.43..23.17 rows=385 width=32)
                      Index Cond: (user_id = 43)


I would expect postgresql to apply this query plan also for the view
last_position. It's 6621ms vs. 2ms, so the speedup is 3310!

Is it a bug in the optimizer?

How to reproduce:
=================
OS: ubuntu 12.04
Postgresql v9.3.2

get some sample data:
wget -qO-
http://snap.stanford.edu/data/loc-brightkite_totalCheckins.txt.gz|gunzip
-c|dos2unix|awk '{ if (length($0) > 20) print }'>test.csv

execute psql script:


\timing on
BEGIN;
DROP TABLE IF EXISTS checkin_node CASCADE;
CREATE TABLE checkin_node (
   id SERIAL NOT NULL PRIMARY KEY,
   user_id INTEGER NOT NULL,
   date TIMESTAMP NOT NULL,
   latitude DOUBLE PRECISION NOT NULL,
   longitude DOUBLE PRECISION NOT NULL,
   original_id VARCHAR NOT NULL
);
\COPY checkin_node (user_id, date, latitude, longitude, original_id)
FROM 'test.csv' WITH DELIMITER E'\t';

ALTER TABLE checkin_node DROP COLUMN original_id;
CREATE INDEX idx_checkin_node_user_id ON checkin_node(user_id);
CREATE INDEX idx_checkin_node_date ON checkin_node(date);

COMMIT;

VACUUM ANALYZE checkin_node;

-- doing window function in a view

DROP VIEW IF EXISTS last_position CASCADE;
CREATE VIEW last_position (user_id, latitude, longitude) AS (
   SELECT user_id, latitude, longitude
   FROM (
          SELECT
            user_id,
            latitude,
            longitude,
            rank() OVER (PARTITION BY user_id ORDER BY date DESC, id
DESC) AS datepos
          FROM checkin_node
        ) AS tmp_last_position
   WHERE datepos = 1
);

select * from last_position where user_id = 43; -- takes 6621ms

-- similar view but without window function (non-equivalent)

DROP VIEW IF EXISTS without_window_function CASCADE;
CREATE VIEW without_window_function (user_id, latitude, longitude) AS (
   SELECT user_id, latitude, longitude
   FROM (
          SELECT
            user_id,
            latitude,
            longitude,
            (id % 1000)+1 AS datepos --to not use a constant here
          FROM checkin_node
        ) AS tmp_last_position
   WHERE datepos = 1
);
select * from without_window_function where user_id = 43; -- takes 10ms


-- workaround: avoid using views (equivalent)

SELECT user_id, latitude, longitude
FROM (
        SELECT
          user_id,
          latitude,
          longitude,
          rank() OVER (PARTITION BY user_id ORDER BY date DESC, id DESC)
AS datepos
        FROM checkin_node
        WHERE user_id = 43
      ) AS tmp_last_position
WHERE datepos = 1; -- takes 2 ms


pgsql-performance by date:

Previous
From: Dave Johansen
Date:
Subject: Re: DATE_TRUNC() and GROUP BY?
Next
From: Tom Lane
Date:
Subject: Re: window function induces full table scan