Thread: Is PostgreSQL 9.3 using indexes for pipelined top-N window function queries?

Is PostgreSQL 9.3 using indexes for pipelined top-N window function queries?

From
Behrang Saeedzadeh
Date:
Hi,

I just stumbled upon this article from 2012 [1], according to which (emphasis mine):

Window functions offer yet another way to implement pagination in SQL. This is a flexible, and above all, standards-compliant method. However, only SQL Server and the Oracle database can use them for a pipelined top-N query. PostgreSQL does not use indexes for those queries and therefore executes them very inefficiently. MySQL does not support window functions at all.

Is this still the case? Or is PostgreSQL 9.3 capable to execute suchlike queries efficiently?


Best regards,
Behrang Saeedzadeh

Re: Is PostgreSQL 9.3 using indexes for pipelined top-N window function queries?

From
Thomas Kellerer
Date:
Behrang Saeedzadeh, 15.02.2014 02:35:
> Hi,
>
> I just stumbled upon this article from 2012 [1], according to which
> (emphasis mine):
>
> Window functions offer yet another way to implement pagination in
> SQL. This is a flexible, and above all, standards-compliant method.
> However, only SQL Server and the Oracle database can use them for a
> pipelined top-N query. */PostgreSQL does not use indexes for those
> queries and therefore executes them very inefficiently./* MySQL does
> not support window functions at all.
>
>
> Is this still the case? Or is PostgreSQL 9.3 capable to execute
> suchlike queries efficiently?
>
> [1] http://use-the-index-luke.com/sql/partial-results/window-functions


My local Postgres 9.3 installation does use an index for such a query.

I ran a quick (an un-scientific) test on a sample table filled with auto-generated test data:

postgres=> \d+ products
                                    Table "public.products"
      Column       |          Type          | Modifiers | Storage  | Stats target | Description
-------------------+------------------------+-----------+----------+--------------+-------------
 product_id        | integer                | not null  | plain    |              |
 ean_code          | bigint                 | not null  | plain    |              |
 product_name      | character varying(100) | not null  | extended |              |
 manufacturer_name | character varying      | not null  | extended |              |
 price             | numeric(10,2)          | not null  | main     |              |
 publish_date      | date                   | not null  | plain    |              |
Indexes:
    "products_pkey" PRIMARY KEY, btree (product_id)
    "idx_publish_date" btree (publish_date, product_id)
Has OIDs: no


postgres=> select count(*) from products;
  count
---------
 1000000
(1 row)

Then I tried the following statement:

select *
from (
  select products.*,
         row_number() over (order by publish_date, product_id) as rn
  from products
) tmp
where rn between 200 and 300
order by publish_date, product_id;

http://explain.depesz.com/s/5u9

And Postgres does use the index idx_publish_date.

Interesting enough: my local Oracle 11.2 does *not* use an index scan for the above test (same test data).

On the other hand Oracle's table scan is much faster (about ~0.5 seconds) for the first "pages" but than gets slower
whenincreasing the limits of the pagincation.  

Oracle takes over 5 seconds when changing the limit to "between 900000 and 900100" whereas Postgres execution time
prettymuch stays the same. 











Re: Is PostgreSQL 9.3 using indexes for pipelined top-N window function queries?

From
Merlin Moncure
Date:
On Fri, Feb 14, 2014 at 7:35 PM, Behrang Saeedzadeh <behrangsa@gmail.com> wrote:
> Hi,
>
> I just stumbled upon this article from 2012 [1], according to which
> (emphasis mine):
>
> Window functions offer yet another way to implement pagination in SQL. This
> is a flexible, and above all, standards-compliant method. However, only SQL
> Server and the Oracle database can use them for a pipelined top-N query.
> PostgreSQL does not use indexes for those queries and therefore executes
> them very inefficiently. MySQL does not support window functions at all.
>
>
> Is this still the case? Or is PostgreSQL 9.3 capable to execute suchlike
> queries efficiently?

oracle:
SELECT *
  FROM ( SELECT sales.*
              , ROW_NUMBER() OVER (ORDER BY sale_date DESC
                                          , sale_id   DESC) rn
           FROM sales
       ) tmp
 WHERE rn between 11 and 20
 ORDER BY sale_date DESC, sale_id DESC;

postgres:
SELECT * FROM sales s
WHERE (sale_date, sale_id) < (last_date, last_Id)
ORDER BY sale_date DESC, sale_id DESC
LIMIT 10;

The postgres variant is superior in my opinion (it will be faster for
large offsets).  last_date, last_id are the lowest values you
previously read off. It will use an index on those two columns if you
have one.  One interesting distinction is that the postgres variant
will always move forward while the oracle variant can appear to move
backwards if you are doing a non transactional scan.

Also, you can always use a cursor in either database.

merlin


Hi!

I'd like to clarify this as the original author of the page in question.

Fist of all, I also recommend the row-value syntax as you can see on the "previous" page:
http://use-the-index-luke.com/sql/partial-results/fetch-next-page

I've also explained this procedure at conferences. Here are the slides:
http://use-the-index-luke.com/blog/2013-07/pagination-done-the-postgresql-way


And now about this quote:
> However, only SQL Server and the Oracle database can use them for a pipelined top-N query. PostgreSQL does not use
indexesfor those queries and therefore executes them very inefficiently. 

The very important thing here is "pipelined top-N query". This term is introduced two pages earlier:
http://use-the-index-luke.com/sql/partial-results/top-n-queries

The pipelined top-n query has two very important properties:
(1) it utilizes the index order to avoid the sort operation required to satisfy ORDER BY clause
(2) it realizes that it can stop processing as soon as it has delivered enough rows.

The execution plan from Thomas Kellerer sees to fulfill requirement (1) but definitively not (2).

Even with 9.3.2, I were not able to reproduce the result of Thomas (not showing any sort operation in the execution
plan)with the test data I also published at my website: 
   http://use-the-index-luke.com/sql/example-schema/postgresql/partial-results

Then I started fiddling around with the planner's cost settings and finally managed to get a plan similar to Thomas'
whensetting random_page_cost to 0.1 (setting it to 1, equal to seq_page_cost was not enough). However, that proves the
pointthat PostgreSQL can use an index to avoid the sort operation caused by order by (even for window functions). I'd
becurious what settings caused Thomas to get this result. 

The second requirement for pipelined top-n queries is not satisfied in Thomas' execution plan: it does read the full
index(actual rows=1000000), and applies the window function over all rows. Only in the end it throws away all
non-confirmingrows (Rows Removed by Filter: 999899). A pipelined top-n execution would not cause more than 300 rows
readfrom the index, and only 200 rows removed by filter. That's what the Oracle DB and SQL Server manage to do (Thomas:
pingme to track down why Oracle didn't for you). Considering that PG can use the index order to avoid the sort, it
stilldoesn't make very much sense if it cannot abort the index scan after fetching enough rows. So, not using the index
mighteven be right choice in unfiltered cases like this. 

Interestingly, I did not even get an Index Only Scan when just selecting column from the index with random_page_cost=1
andseq_page_cost=1. Again I had to reduce random_page_cost further down (0.1) to get an Index Only Scan. That does not
makesense to me. When both _page_costs are same, the Index Only Scan should get lower costs because the index is
smaller(338mb vs. 31 mb in my case). On top of that, the Index Only Scan avoids the sort. However, with 9.3.2 I get the
samecost for Index Only Scan as for Index Scan (had to enable_seqscan=off and enable_bitmapscan=off to get that). 

So, I have to change my page  (&book) to say something like this:

> PostgreSQL does not abort the index scan after fetching enough rows for those queries and therefore executes them
veryinefficiently. 


Thanks for the hint and always feel free to put my on CC regarding questions about stuff on Use The Index, Luke!

-markus

ps.: It's perfectly possible that PG could use indexes for window-functions before 9.3. I did definitively not fiddle
aroundwith cost settings at that time to force it into this plan. 
pps.: sorry for the delay, I'm not subscribed (just too much) but somebody was nice enough to ping me about this.
ppps: then I wondered how to properly reply without having the original messages. So I downloaded the .mbox from the
archiveand pushed reply there. Hope it ends up in the right thread :) 

Markus Winand
markus.winand@winand.at
T +43 1 9444047

> "A wonderful book…I highly recommend it." -Anders Janmyr
> http://sql-performance-explained.com/

Maderspergerstr. 1-3/9/11
1160 Wien
AUSTRIA





2014-03-14 13:02 GMT+01:00 Markus Winand <markus.winand@winand.at>:
Hi!

I'd like to clarify this as the original author of the page in question.

Fist of all, I also recommend the row-value syntax as you can see on the "previous" page:
http://use-the-index-luke.com/sql/partial-results/fetch-next-page

I've also explained this procedure at conferences. Here are the slides:
http://use-the-index-luke.com/blog/2013-07/pagination-done-the-postgresql-way


And now about this quote:
> However, only SQL Server and the Oracle database can use them for a pipelined top-N query. PostgreSQL does not use indexes for those queries and therefore executes them very inefficiently.

The very important thing here is "pipelined top-N query". This term is introduced two pages earlier:
http://use-the-index-luke.com/sql/partial-results/top-n-queries

The pipelined top-n query has two very important properties:
(1) it utilizes the index order to avoid the sort operation required to satisfy ORDER BY clause
(2) it realizes that it can stop processing as soon as it has delivered enough rows.

The execution plan from Thomas Kellerer sees to fulfill requirement (1) but definitively not (2).

Even with 9.3.2, I were not able to reproduce the result of Thomas (not showing any sort operation in the execution plan) with the test data I also published at my website:
   http://use-the-index-luke.com/sql/example-schema/postgresql/partial-results

Then I started fiddling around with the planner's cost settings and finally managed to get a plan similar to Thomas' when setting random_page_cost to 0.1 (setting it to 1, equal to seq_page_cost was not enough). However, that proves the point that PostgreSQL can use an index to avoid the sort operation caused by order by (even for window functions). I'd be curious what settings caused Thomas to get this result.

The second requirement for pipelined top-n queries is not satisfied in Thomas' execution plan: it does read the full index (actual rows=1000000), and applies the window function over all rows. Only in the end it throws away all non-confirming rows (Rows Removed by Filter: 999899). A pipelined top-n execution would not cause more than 300 rows read from the index, and only 200 rows removed by filter. That's what the Oracle DB and SQL Server manage to do (Thomas: ping me to track down why Oracle didn't for you). Considering that PG can use the index order to avoid the sort, it still doesn't make very much sense if it cannot abort the index scan after fetching enough rows. So, not using the index might even be right choice in unfiltered cases like this.

Interestingly, I did not even get an Index Only Scan when just selecting column from the index with random_page_cost=1 and seq_page_cost=1. Again I had to reduce random_page_cost further down (0.1) to get an Index Only Scan. That does not make sense to me. When both _page_costs are same, the Index Only Scan should get lower costs because the index is smaller (338mb vs. 31 mb in my case). On top of that, the Index Only Scan avoids the sort. However, with 9.3.2 I get the same cost for Index Only Scan as for Index Scan (had to enable_seqscan=off and enable_bitmapscan=off to get that).

So, I have to change my page  (&book) to say something like this:

> PostgreSQL does not abort the index scan after fetching enough rows for those queries and therefore executes them very inefficiently.

I am thinking so LIMIT is not propagated to down - so window function should be calculated in full range.

Regards

Pavel
 


Thanks for the hint and always feel free to put my on CC regarding questions about stuff on Use The Index, Luke!

-markus

ps.: It's perfectly possible that PG could use indexes for window-functions before 9.3. I did definitively not fiddle around with cost settings at that time to force it into this plan.
pps.: sorry for the delay, I'm not subscribed (just too much) but somebody was nice enough to ping me about this.
ppps: then I wondered how to properly reply without having the original messages. So I downloaded the .mbox from the archive and pushed reply there. Hope it ends up in the right thread :)

Markus Winand
markus.winand@winand.at
T +43 1 9444047

> "A wonderful book…I highly recommend it." -Anders Janmyr
> http://sql-performance-explained.com/

Maderspergerstr. 1-3/9/11
1160 Wien
AUSTRIA


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: Is PostgreSQL 9.3 using indexes for pipelined top-N window function queries?

From
Thomas Kellerer
Date:
Markus,

thanks for your reply.

> The pipelined top-n query has two very important properties:
> (1) it utilizes the index order to avoid the sort operation required to satisfy ORDER BY clause
> (2) it realizes that it can stop processing as soon as it has delivered enough rows.
>
> The execution plan from Thomas Kellerer sees to fulfill requirement
> (1) but definitively not (2).
>
> Even with 9.3.2, I were not able to reproduce the result of Thomas
> (not showing any sort operation in the execution plan) with the test
> data I also published at my website:
> http://use-the-index-luke.com/sql/example-schema/postgresql/partial-results
>
> Then I started fiddling around with the planner's cost settings and
> finally managed to get a plan similar to Thomas' when setting
> random_page_cost to 0.1 (setting it to 1, equal to seq_page_cost was
> not enough). However, that proves the point that PostgreSQL can use
> an index to avoid the sort operation caused by order by (even for
> window functions). I'd be curious what settings caused Thomas to get
> this result.

Good point, I should adopt the habit to mention config settings for this kind of things:

shared_buffers = 2048MB
temp_buffers = 16MB
work_mem = 32MB
seq_page_cost = 1.0
random_page_cost = 1.5
cpu_tuple_cost = 0.001
cpu_index_tuple_cost = 0.001
effective_cache_size = 2048MB

> The second requirement for pipelined top-n queries is not satisfied
> in Thomas' execution plan: it does read the full index (actual
> rows=1000000), and applies the window function over all rows. Only in
> the end it throws away all non-confirming rows (Rows Removed by
> Filter: 999899). A pipelined top-n execution would not cause more
> than 300 rows read from the index, and only 200 rows removed by
> filter. That's what the Oracle DB and SQL Server manage to do
> (Thomas: ping me to track down why Oracle didn't for you).
> Considering that PG can use the index order to avoid the sort, it
> still doesn't make very much sense if it cannot abort the index scan
> after fetching enough rows. So, not using the index might even be
> right choice in unfiltered cases like this.

I probably should have read the definition of "index usage" more carefully, thanks for the clarification.


I created the testdata from your webpage locally and then ran the window function statement from
http://use-the-index-luke.com/sql/example-schema/postgresql/partial-results

SELECT *
  FROM ( SELECT sales.*
              , ROW_NUMBER() OVER (ORDER BY sale_date DESC
                                          , sale_id   DESC) rn
           FROM sales
       ) tmp
 WHERE rn between 11 and 20
 ORDER BY sale_date DESC, sale_id DESC;

This does an "Index Scan Backward"  on an index defined as (sale_date, sale_id) but takes nearly 2.5 seconds on my
computer. 

The version using OFFSET .. LIMIT took about 0.05 seconds and increases when the offset increases which is to be
expected- whereas the window function version is pretty much constant even with a startpoint of 100001 - but the offset
versionis still *much* faster then. 

Regards
Thomas

P.S.: Btw: thanks for your book and your webpage, both are very insipring.