Thread: EXPLAIN (ANALYZE, BUFFERS) - puzzling numbers for a simple query.

EXPLAIN (ANALYZE, BUFFERS) - puzzling numbers for a simple query.

From
Pól Ua Laoínecháin
Date:
Hi all,

Noticed this today - relatively simple query - table with 7 records
(all code is shown at the bottom of this post and on the fiddle here):

https://dbfiddle.uk/?rdbms=postgres_12&fiddle=efe73a37d29af43f33d2bc79de2b6c97

Sample (2 of 7 records):

======================
INSERT INTO test
VALUES
(1, 'Pól'  , '2021-06-01', '2021-06-06'),
(2, 'Bill' , '2021-06-02', '2021-06-10');
=========================

Query:

========================
SELECT
  id,
  GENERATE_SERIES
    (t.start_date, t.end_date, '1 DAY')::DATE AS sd,
  t.end_date
FROM test t
ORDER BY t.id, t.start_date;
=========================

Now, the EXPLAIN (ANALYZE, BUFFERS) output of this query is as follows
- see fiddle - with 7 records:

====================
QUERY PLAN
Sort  (cost=165708.62..168608.62 rows=1160000 width=10) (actual time=0.083..0.087 rows=42 loops=1)
  Sort Key: id, (((generate_series((start_date)::timestamp with time zone, (end_date)::timestamp with time zone, '1 day'::interval)))::date)
  Sort Method: quicksort  Memory: 26kB
  Buffers: shared hit=1
  ->  Result  (cost=0.00..29036.10 rows=1160000 width=10) (actual time=0.019..0.056 rows=42 loops=1)
        Buffers: shared hit=1
        ->  ProjectSet  (cost=0.00..5836.10 rows=1160000 width=14) (actual time=0.018..0.042 rows=42 loops=1)
              Buffers: shared hit=1
              ->  Seq Scan on test t  (cost=0.00..21.60 rows=1160 width=10) (actual time=0.008..0.010 rows=7 loops=1)
                    Buffers: shared hit=1
Planning Time: 0.061 ms
Execution Time: 0.131 ms
12 rows
===========================

Now, the first line of the PLAN has

> Sort (cost=165708.62..168608.62 rows=1160000 width=10)

and two other lines below this also contain the figure 1,160,000.

Where *_on earth_* is PostgreSQL obtaining 1.16M rows? And where do
the cost numbers come from for a query on 7 records?

The same query, rewritten as a recursive CTE (no GENERATE_SERIES):

===================
WITH RECURSIVE cte (id, sd, ed) AS
(
  SELECT t.id, t.start_date, t.end_date
  FROM test t
  UNION ALL
  SELECT  c.id, (c.sd + INTERVAL '1 DAY')::DATE, c.ed
  FROM cte c
  WHERE c.sd < (SELECT z.end_date FROM test z WHERE z.id = c.id)
)
SELECT * FROM cte c2
ORDER BY c2.id, c2.sd, c2.ed;
=============================

gives a PLAN as follows:

==========================
QUERY PLAN
Sort  (cost=955181.47..955281.05 rows=39830 width=10) (actual time=0.262..0.266 rows=42 loops=1)
  Sort Key: c2.id, c2.sd, c2.ed
  Sort Method: quicksort  Memory: 26kB
  Buffers: shared hit=85
  CTE cte
    ->  Recursive Union  (cost=0.00..951341.55 rows=39830 width=10) (actual time=0.010..0.216 rows=42 loops=1)
          Buffers: shared hit=85
          ->  Seq Scan on test t  (cost=0.00..21.60 rows=1160 width=10) (actual time=0.009..0.010 rows=7 loops=1)
                Buffers: shared hit=1
          ->  WorkTable Scan on cte c  (cost=0.00..95052.33 rows=3867 width=10) (actual time=0.006..0.020 rows=4 loops=9)
                Filter: (sd < (SubPlan 1))
                Rows Removed by Filter: 1
                Buffers: shared hit=84
                SubPlan 1
                  ->  Index Scan using test_pkey on test z  (cost=0.15..8.17 rows=1 width=4) (actual time=0.003..0.003 rows=1 loops=42)
                        Index Cond: (id = c.id)
                        Buffers: shared hit=84
  ->  CTE Scan on cte c2  (cost=0.00..796.60 rows=39830 width=10) (actual time=0.011..0.233 rows=42 loops=1)
        Buffers: shared hit=85
Planning Time: 0.137 ms
Execution Time: 0.324 ms
21 rows
===============================

Now, this PLAN is more complicated - and I totally get that! However,
where do these numbers:

> (cost=955181.47..955281.05 rows=39830 width=10)

come from?

Now, we've gone from 1,160,000 to 39,830 rows (progress? :-) ) and a
cost of ~ 1M (compared with 168k for the first query).

I probably need to read up on the EXPLAIN (ANALYZE, BUFFERS)
functionality - but I would appreciate:

a) an (overview) explanation of what's going on here in particular and

b) any good references to URLs, papers whatever which would be of
benefit to a (hopefully) reasonably competent SQL programmer with a
desire to grasp internals - how to interpret PostgreSQL's EXPLAIN
functionality.

TIA and rgs,


Pól...



Re: EXPLAIN (ANALYZE, BUFFERS) - puzzling numbers for a simple query.

From
Vijaykumar Jain
Date:
How is cost calculated?

Row estimation

FYI, I just read up on this when I had questions like why rows were
estimated when the table has no rows. I think it requires a deeper
reading of the source for the cost calculation, which I did not do at
the time as I did not understand it :) But in case the above helps as
a starting point.



On Fri, 4 Jun 2021 at 16:29, Pól Ua Laoínecháin <linehanp@tcd.ie> wrote:
> Hi all,
>
> Noticed this today - relatively simple query - table with 7 records
> (all code is shown at the bottom of this post and on the fiddle here):
>
> https://dbfiddle.uk/?rdbms=postgres_12&fiddle=efe73a37d29af43f33d2bc79de2b6c97
>
> [...]



--
Thanks,
Vijay
Mumbai, India

Re: EXPLAIN (ANALYZE, BUFFERS) - puzzling numbers for a simple query.

From
David Rowley
Date:
On Fri, 4 Jun 2021 at 22:59, Pól Ua Laoínecháin <linehanp@tcd.ie> wrote:
> Now, we've gone from 1,160,000 to 39,830 rows (progress? :-) ) and a
> cost of ~ 1M (compared with 168k for the first query).

The estimates are not that meaningful due to a lack of table
statistics on the "test" table.  If you run ANALYZE on the table you
might get something closer to the truth.

If there are no stats on a table then the planner has a pretty hard
job guessing how many tuples there are.  All it does is count the
number of pages currently in the table and look at the columns in the
table and figure out how many tuples are likely to fit assuming each
of those pages is full of tuples.  If you just have a couple of tuples
and they only take up a tiny fraction of the page then you're not
going to get a very accurate number there.   Generally, since there's
so little to go on here, the code is purposefully designed to be more
likely to overestimate the number of tuples than underestimate.
Underestimations tend to produce worse plans than overestimations.
It's also common for people to create tables then quickly load a bunch
of records and start running queries. We want to do something sane
there if that all happens before auto-analyze can get a chance to
gather stats for the table.
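For example, a quick sketch against the fiddle schema:

```sql
-- Gather statistics so the planner stops guessing from page counts.
ANALYZE test;

-- Re-running the plan, the Seq Scan estimate should drop to rows=7;
-- the ProjectSet above it will then estimate 7 * 1000 = 7000 rows
-- (the set-returning function's default, see below in the thread).
EXPLAIN SELECT id,
       generate_series(t.start_date, t.end_date, '1 day')::date AS sd,
       t.end_date
FROM test t;
```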

As for the call to generate_series, you're not likely to ever get any
great estimation from that.  The number of rows returned by a call to
that particular function are just whatever is set in pg_proc.prorows,
in this case, 1000.  The other generate_series functions which take
INT and BIGINT inputs do have a prosupport function. Generally, those
will do a better job since those support functions look at the input
arguments.  However, that still might not go well since your inputs
are columns in a table.
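That is exactly where the 1,160,000 in the plan comes from: the
stats-less page-count guess of 1160 rows for the Seq Scan, multiplied
by the prorows default of 1000 for the set-returning function. You can
inspect the default directly:

```sql
-- prorows is the planner's default row-count estimate for a
-- set-returning function when no support function applies.
SELECT proname, prorows, prorettype::regtype
FROM pg_proc
WHERE proname = 'generate_series';
-- The timestamptz variant used by the OP's query shows prorows = 1000.
```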

David



Re: EXPLAIN (ANALYZE, BUFFERS) - puzzling numbers for a simple query.

From
Laurenz Albe
Date:
On Sat, 2021-06-05 at 00:12 +1200, David Rowley wrote:
> As for the call to generate_series, you're not likely to ever get any
> great estimation from that.  The number of rows returned by a call to
> that particular function are just whatever is set in pg_proc.prorows,
> in this case, 1000.  The other generate_series functions which take
> INT and BIGINT inputs do have a prosupport function.

"generate_series" has a support function from v12 on:

EXPLAIN SELECT * FROM generate_series(1, 25);
                             QUERY PLAN
═════════════════════════════════════════════════════════════════════
 Function Scan on generate_series  (cost=0.00..0.25 rows=25 width=4)
(1 row)

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com

Re: EXPLAIN (ANALYZE, BUFFERS) - puzzling numbers for a simple query.

From
David Rowley
Date:
On Sat, 5 Jun 2021 at 00:55, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
>
> On Sat, 2021-06-05 at 00:12 +1200, David Rowley wrote:
> > As for the call to generate_series, you're not likely to ever get any
> > great estimation from that.  The number of rows returned by a call to
> > that particular function are just whatever is set in pg_proc.prorows,
> > in this case, 1000.  The other generate_series functions which take
> > INT and BIGINT inputs do have a prosupport function.
>
> "generate_series" has a support function from v12 on:

I'd class that as one of "the other generate_series functions", which
I mentioned.  This is not the one being used in this case, which is
why I talked about prorows.

postgres=# select proname,prosupport,prorettype::regtype from pg_proc
where proname = 'generate_series';
     proname     |          prosupport          |         prorettype
-----------------+------------------------------+-----------------------------
 generate_series | generate_series_int4_support | integer
 generate_series | generate_series_int4_support | integer
 generate_series | generate_series_int8_support | bigint
 generate_series | generate_series_int8_support | bigint
 generate_series | -                            | numeric
 generate_series | -                            | numeric
 generate_series | -                            | timestamp without time zone
 generate_series | -                            | timestamp with time zone

I believe the one being used here is the 2nd last one in the above
list.  There's no prosupport function mentioned as of current master.

David



Re: EXPLAIN (ANALYZE, BUFFERS) - puzzling numbers for a simple query.

From
Tom Lane
Date:
Laurenz Albe <laurenz.albe@cybertec.at> writes:
> On Sat, 2021-06-05 at 00:12 +1200, David Rowley wrote:
>> As for the call to generate_series, you're not likely to ever get any
>> great estimation from that.

> "generate_series" has a support function from v12 on:

True, but I don't think it can do anything with non-constant inputs,
as we have in the OP's case.
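A sketch of the difference (the point is only that the planner can
fold constant arguments, not column references, at plan time):

```sql
-- Constant arguments: the int4 support function computes the exact count.
EXPLAIN SELECT * FROM generate_series(1, 25);
-- rows=25

-- The bound now comes from a column, so the support function has
-- nothing to inspect and the planner falls back to the
-- pg_proc.prorows default.
EXPLAIN SELECT generate_series(1, v.x) FROM (VALUES (25)) v(x);
```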

            regards, tom lane



Re: EXPLAIN (ANALYZE, BUFFERS) - puzzling numbers for a simple query.

From
Pól Ua Laoínecháin
Date:
> Laurenz Albe <laurenz.albe@cybertec.at> writes:

> > "generate_series" has a support function from v12 on:

> True, but I don't think it can do anything with non-constant inputs,
> as we have in the OP's case.


As confirmed by this:

https://dbfiddle.uk/?rdbms=postgres_12&fiddle=d0b70717faca3b094be8b4a096758b94

Plan:

==============
Sort  (cost=623.22..640.72 rows=7000 width=10) (actual time=0.067..0.070 rows=42 loops=1)
  Sort Key: id, (((generate_series((start_date)::timestamp with time zone, (end_date)::timestamp with time zone, '1 day'::interval)))::date)
  Sort Method: quicksort  Memory: 26kB
  Buffers: shared hit=1
  ->  Result  (cost=0.00..176.16 rows=7000 width=10) (actual time=0.014..0.045 rows=42 loops=1)
        Buffers: shared hit=1
        ->  ProjectSet  (cost=0.00..36.16 rows=7000 width=14) (actual time=0.013..0.033 rows=42 loops=1)
CUT
==============

It goes for the full 7,000 (7 rows x the default estimate of 1,000)
even after a VACUUM FULL VERBOSE ANALYZE test;


Pól...
