Thread: Strange behavior of function date_trunc

Strange behavior of function date_trunc

From
Pavel Luzanov
Date:
Hello,

It is very likely that the date_trunc function in the following example 
is executed for each line of the query. Although it marked as a STABLE 
and could only be called once.

EXPLAIN (ANALYZE)
SELECT * FROM generate_series('2021-01-01', '2021-06-01', '1 
s'::interval) AS g(x) WHERE g.x >= date_trunc('day', 
'2021-05-05'::timestamptz);
                                                           QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------
  Function Scan on generate_series g  (cost=0.00..15.00 rows=333 
width=8) (actual time=2801.884..3263.328 rows=2332801 loops=1)
    Filter: (x >= date_trunc('day'::text, '2021-05-05 
00:00:00+03'::timestamp with time zone))
    Rows Removed by Filter: 10713600
  Planning Time: 0.040 ms
  Execution Time: 3336.657 ms

When replacing date_trunc with now, the query is much faster:

EXPLAIN (ANALYZE)
SELECT * FROM generate_series('2021-01-01', '2021-06-01', '1 
s'::interval) AS g(x)
WHERE g.x >= now();
                                                           QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------
  Function Scan on generate_series g  (cost=0.00..15.00 rows=333 
width=8) (actual time=1648.777..1845.430 rows=2275325 loops=1)
    Filter: (x >= now())
    Rows Removed by Filter: 10771076
  Planning Time: 0.039 ms
  Execution Time: 1918.767 ms

The variant with now works almost as fast as with the constant. This 
suggests me that perhaps date_trunc is being executed for every line of 
the query:

EXPLAIN (ANALYZE)
SELECT * FROM generate_series('2021-01-01', '2021-06-01', '1 
s'::interval) AS g(x)
WHERE g.x >= '2021-05-05'::timestamptz;
                                                           QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------
  Function Scan on generate_series g  (cost=0.00..12.50 rows=333 
width=8) (actual time=1628.743..1826.841 rows=2332801 loops=1)
    Filter: (x >= '2021-05-05 00:00:00+03'::timestamp with time zone)
    Rows Removed by Filter: 10713600
  Planning Time: 0.033 ms
  Execution Time: 1901.680 ms

In this regard, I have two questions:
1. How can I find out exactly how many times the date_trunc function has 
been executed? So far, these are just my assumptions.
2. If date_trunc is indeed called multiple times, why is this happening?

-- 
Pavel Luzanov
Postgres Professional: https://postgrespro.com
The Russian Postgres Company




Re: Strange behavior of function date_trunc

From
Tomas Vondra
Date:

On 5/5/21 3:23 PM, Pavel Luzanov wrote:
> Hello,
> 
> It is very likely that the date_trunc function in the following example 
> is executed for each line of the query. Although it marked as a STABLE 
> and could only be called once.
> 

It could, but that's just an option - the database may do that, but it's 
not required to do it. In this case it might be beneficial, but it'd 
make the planner more complex etc.


> EXPLAIN (ANALYZE)
> SELECT * FROM generate_series('2021-01-01', '2021-06-01', '1 
> s'::interval) AS g(x) WHERE g.x >= date_trunc('day', 
> '2021-05-05'::timestamptz);
>                                                            QUERY PLAN
>
-------------------------------------------------------------------------------------------------------------------------------

> 
>   Function Scan on generate_series g  (cost=0.00..15.00 rows=333 
> width=8) (actual time=2801.884..3263.328 rows=2332801 loops=1)
>     Filter: (x >= date_trunc('day'::text, '2021-05-05 
> 00:00:00+03'::timestamp with time zone))
>     Rows Removed by Filter: 10713600
>   Planning Time: 0.040 ms
>   Execution Time: 3336.657 ms
> 
> When replacing date_trunc with now, the query is much faster:
> 
> EXPLAIN (ANALYZE)
> SELECT * FROM generate_series('2021-01-01', '2021-06-01', '1 
> s'::interval) AS g(x)
> WHERE g.x >= now();
>                                                            QUERY PLAN
>
-------------------------------------------------------------------------------------------------------------------------------

> 
>   Function Scan on generate_series g  (cost=0.00..15.00 rows=333 
> width=8) (actual time=1648.777..1845.430 rows=2275325 loops=1)
>     Filter: (x >= now())
>     Rows Removed by Filter: 10771076
>   Planning Time: 0.039 ms
>   Execution Time: 1918.767 ms
> 
> The variant with now works almost as fast as with the constant. This 
> suggests me that perhaps date_trunc is being executed for every line of 
> the query:
> 
> EXPLAIN (ANALYZE)
> SELECT * FROM generate_series('2021-01-01', '2021-06-01', '1 
> s'::interval) AS g(x)
> WHERE g.x >= '2021-05-05'::timestamptz;
>                                                            QUERY PLAN
>
-------------------------------------------------------------------------------------------------------------------------------

> 
>   Function Scan on generate_series g  (cost=0.00..12.50 rows=333 
> width=8) (actual time=1628.743..1826.841 rows=2332801 loops=1)
>     Filter: (x >= '2021-05-05 00:00:00+03'::timestamp with time zone)
>     Rows Removed by Filter: 10713600
>   Planning Time: 0.033 ms
>   Execution Time: 1901.680 ms
> 
> In this regard, I have two questions:
> 1. How can I find out exactly how many times the date_trunc function has 
> been executed? So far, these are just my assumptions.
> 2. If date_trunc is indeed called multiple times, why is this happening?
> 

Well, it'd not like date_trunc is executed for each row while now() is 
executed only once. The functions are executed for each row in both 
cases, but now() is simply much cheaper - it just returns a value that 
is already calculated, while date_trunc has to parse and truncate the 
value, etc.

You can use CTE to execute it just once, I think:

   with x as (select date_trunc('day', '2021-04-01'::timestamptz) as x)
   select * from t where a > (select x from x);


regards
Tomas



Re: Strange behavior of function date_trunc

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> On 5/5/21 3:23 PM, Pavel Luzanov wrote:
>> It is very likely that the date_trunc function in the following example
>> is executed for each line of the query. Although it marked as a STABLE
>> and could only be called once.

> It could, but that's just an option - the database may do that, but it's
> not required to do it. In this case it might be beneficial, but it'd
> make the planner more complex etc.

Yeah, there simply is not any provision for caching the results of
stable functions in the way Pavel seems to be imagining.  People
have played around with patches for that, but nothing's been accepted.

> You can use CTE to execute it just once, I think:
>    with x as (select date_trunc('day', '2021-04-01'::timestamptz) as x)
>    select * from t where a > (select x from x);

Actually it's sufficient to write

select * from t where a > (select date_trunc('day', '2021-04-01'::timestamptz))

Postgres interprets that as an uncorrelated sub-select, so it's only done
once per outer query.  I think that these days, the CTE form would be
flattened into that anyway (without MATERIALIZED).

            regards, tom lane



Re: Strange behavior of function date_trunc

From
Pavel Luzanov
Date:
Hello,

On 05.05.2021 16:55, Tomas Vondra wrote:
Well, it'd not like date_trunc is executed for each row while now() is executed only once. The functions are executed for each row in both cases, but now() is simply much cheaper - it just returns a value that is already calculated, while date_trunc has to parse and truncate the value, etc.


Thanks for the explanation.


You can use CTE to execute it just once, I think:

  with x as (select date_trunc('day', '2021-04-01'::timestamptz) as x)
  select * from t where a > (select x from x);


I think it could be even easier with scalar subquery:

EXPLAIN (ANALYZE)
SELECT * FROM generate_series('2021-01-01', '2021-06-01', '1 s'::interval) AS g(x)
WHERE g.x >= (SELECT date_trunc('day', '2021-05-05'::timestamptz));
                                                          QUERY PLAN                                                          
-------------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series g  (cost=0.02..12.51 rows=333 width=8) (actual time=1615.436..1815.724 rows=2332801 loops=1)
   Filter: (x >= $0)
   Rows Removed by Filter: 10713600
   InitPlan 1 (returns $0)
     ->  Result  (cost=0.00..0.01 rows=1 width=8) (actual time=0.005..0.005 rows=1 loops=1)
 Planning Time: 0.051 ms
 Execution Time: 1889.434 ms

--
Pavel Luzanov
Postgres Professional: https://postgrespro.com
The Russian Postgres Company

Re: Strange behavior of function date_trunc

From
Pavel Luzanov
Date:
On 05.05.2021 17:11, Tom Lane wrote:
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
On 5/5/21 3:23 PM, Pavel Luzanov wrote:
It is very likely that the date_trunc function in the following example is executed for each line of the query. Although it marked as a STABLE and could only be called once.
It could, but that's just an option - the database may do that, but it's not required to do it. In this case it might be beneficial, but it'd make the planner more complex etc.
Yeah, there simply is not any provision for caching the results of stable functions in the way Pavel seems to be imagining. People have played around with patches for that, but nothing's been accepted.

Thank you for the clarification. It's not very obvious, at least for me.


--
Pavel Luzanov
Postgres Professional: https://postgrespro.com
The Russian Postgres Company

Re: Strange behavior of function date_trunc

From
Pavel Luzanov
Date:
Hello,

On 05.05.2021 16:55, Tomas Vondra wrote:
> Well, it'd not like date_trunc is executed for each row while now() is 
> executed only once. The functions are executed for each row in both 
> cases...
A couple more experiments.
Since I can't to track usage of system functions, I decided to play with 
user defined stable function.

In this case it is a wrapper for date_trunc:
CREATE OR REPLACE FUNCTION user_date_trunc(field text, source 
timestamptz) RETURNS timestamptz
AS $$
BEGIN
     RETURN date_trunc(field, source);
END;
$$ LANGUAGE plpgsql STABLE;

SET track_functions = 'all';
SELECT pg_stat_reset_single_function_counters('user_date_trunc'::regproc);


Let's create a test table:
CREATE TABLE t AS
     SELECT g.x, g.x::text AS y
     FROM generate_series('2021-01-01', '2021-06-01', '1 s'::interval) 
AS g(x) ORDER BY random();
VACUUM ANALYZE t;
SET random_page_cost = 1.1; -- for ssd disks


Previous query with user defined function:
EXPLAIN (ANALYZE, SETTINGS)
SELECT * FROM t
WHERE t.x >= user_date_trunc('year', '2021-05-06'::timestamptz);
                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
  Seq Scan on t  (cost=0.00..3520631.00 rows=13045175 width=31) (actual 
time=63.977..7157.852 rows=13046401 loops=1)
    Filter: (x >= user_date_trunc('year'::text, '2021-05-06 
00:00:00+03'::timestamp with time zone))
  Settings: random_page_cost = '1.1'
  Planning Time: 0.373 ms
  JIT:
    Functions: 2
    Options: Inlining true, Optimization true, Expressions true, 
Deforming true
    Timing: Generation 0.404 ms, Inlining 53.122 ms, Optimization 7.100 
ms, Emission 3.600 ms, Total 64.227 ms
  Execution Time: 7502.564 ms

SELECT funcname, calls FROM pg_stat_user_functions WHERE funcid = 
'user_date_trunc'::regproc;
     funcname     |  calls
-----------------+----------
  user_date_trunc | 13046402

The function is executed once for each row in the table +1. As you 
explained earlier.
Ok.

But when there is an index on the column, the situation changes:
CREATE INDEX ON t(x);
ANALYZE t;
SELECT pg_stat_reset_single_function_counters('user_date_trunc'::regproc);

EXPLAIN (ANALYZE, SETTINGS)
SELECT * FROM t
WHERE t.x >= user_date_trunc('year', '2021-05-06'::timestamptz);
                                                             QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------
  Index Scan using t_x_idx on t  (cost=0.69..373188.48 rows=13046480 
width=31) (actual time=3.163..26476.012 rows=13046401 loops=1)
    Index Cond: (x >= user_date_trunc('year'::text, '2021-05-06 
00:00:00+03'::timestamp with time zone))
  Settings: random_page_cost = '1.1'
  Planning Time: 0.294 ms
  JIT:
    Functions: 3
    Options: Inlining false, Optimization false, Expressions true, 
Deforming true
    Timing: Generation 0.551 ms, Inlining 0.000 ms, Optimization 0.380 
ms, Emission 2.587 ms, Total 3.518 ms
  Execution Time: 26787.377 ms

SELECT funcname, calls FROM pg_stat_user_functions WHERE funcid = 
'user_date_trunc'::regproc;
     funcname     | calls
-----------------+-------
  user_date_trunc |     2

Only two times!
Does having an index allow the function value to be cached?

By the way, an index is used to access the table, although seq scan are 
much better.

Now with scalar subquery:
EXPLAIN (ANALYZE, SETTINGS)
SELECT * FROM t
WHERE t.x >= (SELECT user_date_trunc('year', '2021-05-06'::timestamptz));
                                                             QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------
  Index Scan using t_x_idx on t  (cost=0.70..194745.67 rows=4348827 
width=31) (actual time=3.000..26543.395 rows=13046401 loops=1)
    Index Cond: (x >= $0)
    InitPlan 1 (returns $0)
      ->  Result  (cost=0.00..0.26 rows=1 width=8) (actual 
time=0.030..0.030 rows=1 loops=1)
  Settings: random_page_cost = '1.1'
  Planning Time: 0.071 ms
  JIT:
    Functions: 4
    Options: Inlining false, Optimization false, Expressions true, 
Deforming true
    Timing: Generation 0.671 ms, Inlining 0.000 ms, Optimization 0.188 
ms, Emission 2.583 ms, Total 3.441 ms
  Execution Time: 26875.105 ms

SELECT funcname, calls FROM pg_stat_user_functions WHERE funcid = 
'user_date_trunc'::regproc;
     funcname     | calls
-----------------+-------
  user_date_trunc |     3

The function is executed once, but the index is still in use.

The second question. What is the reason for choosing an index scan?

Seq scan with constant value only:
EXPLAIN (ANALYZE, SETTINGS)
SELECT * FROM t
WHERE t.x >= '2021-01-01'::timestamptz;
                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
  Seq Scan on t  (cost=0.00..259011.00 rows=13046480 width=31) (actual 
time=2.135..913.628 rows=13046401 loops=1)
    Filter: (x >= '2021-01-01 00:00:00+03'::timestamp with time zone)
  Settings: random_page_cost = '1.1'
  Planning Time: 0.140 ms
  JIT:
    Functions: 2
    Options: Inlining false, Optimization false, Expressions true, 
Deforming true
    Timing: Generation 0.392 ms, Inlining 0.000 ms, Optimization 0.163 
ms, Emission 1.837 ms, Total 2.391 ms
  Execution Time: 1195.985 ms


The statistics on t.x shows that the condition in the query is not 
selective and seq scan are preferred over index scan.
SELECT n_distinct, (histogram_bounds::text::text[])[1]
FROM pg_stats WHERE tablename = 't' AND attname = 'x';
  n_distinct |    histogram_bounds
------------+------------------------
          -1 | 2021-01-01 00:01:10+03

--
Pavel Luzanov
Postgres Professional: https://postgrespro.com
The Russian Postgres Company




Re: Strange behavior of function date_trunc

From
Tom Lane
Date:
Pavel Luzanov <p.luzanov@postgrespro.ru> writes:
> Does having an index allow the function value to be cached?

For an indexscan, the comparison value is evaluated once and used to
search the index.  The point of the "stable" marking is actually to
promise that this will give the same result as the naive interpretation
of a WHERE clause, ie that the WHERE expression is notionally evaluated
at every row.

This case is the reason we invented the "stable" attribute to begin
with.  People have since misinterpreted it as authorizing caching of
function results, but that's not what it was intended for.

> The second question. What is the reason for choosing an index scan?

Probably the planner is picking that precisely to reduce the number
of calls of the user-defined function.  Since you left the function's
cost as default, which for PL functions is 100x the default cost of
a built-in function, that could well be a large enough number to
change the plan choice.  (You could experiment with altering the
COST property to see where the plan changes.)

            regards, tom lane



Re: Strange behavior of function date_trunc

From
"David G. Johnston"
Date:
On Thu, May 6, 2021 at 6:44 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
This case is the reason we invented the "stable" attribute to begin
with.  People have since misinterpreted it as authorizing caching of
function results, but that's not what it was intended for.


This is a good paragraph...if something like it gets added to the create function documentation mis-interpretations are likely to decrease.  I personally did not draw this conclusion after having read that particular piece of documentation multiple times over the years.

David J.

Re: Strange behavior of function date_trunc

From
Pavel Luzanov
Date:
Hello,

On 06.05.2021 16:44, Tom Lane wrote:
Pavel Luzanov <p.luzanov@postgrespro.ru> writes:
Does having an index allow the function value to be cached?
For an indexscan, the comparison value is evaluated once and used to search the index. The point of the "stable" marking is actually to promise that this will give the same result as the naive interpretation of a WHERE clause, ie that the WHERE expression is notionally evaluated at every row. This case is the reason we invented the "stable" attribute to begin with. People have since misinterpreted it as authorizing caching of function results, but that's not what it was intended for.

I think I'm starting to understand! ))

I knew that the STABLE mark was not a guarantee for the value to be cached. The planner has the right to execute the function once, but this is not required. Now it is clear under what conditions this happens. Stable functions can be executed once, when they are used in an index expression. In other cases (in a select list, expression for seq scan) they are evaluated for each row.

The second question. What is the reason for choosing an index scan?
Probably the planner is picking that precisely to reduce the number of calls of the user-defined function. Since you left the function's cost as default, which for PL functions is 100x the default cost of a built-in function, that could well be a large enough number to change the plan choice. (You could experiment with altering the COST property to see where the plan changes.)

Yes, if the cost of the function is reduced to 3 (or less), than seq scan begins to be used. And the function is executed for each row.
It's clear now.

One thing remains unclear.

Why, if a scalar subquery is used to materialize the function value(even constant), then an inefficient index scan is chosen:

EXPLAIN (ANALYZE, SETTINGS)

SELECT * FROM t                                 
WHERE t.x >= (SELECT '2021-01-01'::timestamptz);
                                                            QUERY PLAN                                                           
----------------------------------------------------------------------------------------------------------------------------------
 Index Scan using t_x_idx on t  (cost=0.45..194740.46 rows=4348742 width=31) (actual time=2.831..26947.394 rows=13046401 loops=1)
   Index Cond: (x >= $0)
   InitPlan 1 (returns $0)
     ->  Result  (cost=0.00..0.01 rows=1 width=8) (actual time=0.005..0.005 rows=1 loops=1)
 Settings: random_page_cost = '1.1'
 Planning Time: 0.077 ms
 JIT:
   Functions: 4
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 0.720 ms, Inlining 0.000 ms, Optimization 0.184 ms, Emission 2.429 ms, Total 3.333 ms
 Execution Time: 27262.793 ms
--
Pavel Luzanov
Postgres Professional: https://postgrespro.com
The Russian Postgres Company

Re: Strange behavior of function date_trunc

From
Tom Lane
Date:
Pavel Luzanov <p.luzanov@postgrespro.ru> writes:
> One thing remains unclear.
> Why, if a scalar subquery is used to materialize the function value(even 
> constant), then an inefficient index scan is chosen:

The scalar subquery prevents the planner from seeing the actual
comparison value, so it falls back to a default selectivity estimate
(notice the fairly bad rowcount estimate).  I'm a bit surprised that
that would end in choosing an indexscan over a seqscan, but that
might be a consequence of the small random_page_cost you're using.

            regards, tom lane



Re: Strange behavior of function date_trunc

From
Pavel Luzanov
Date:
I will try to summarize what was said before.

We have discussed the details of executing STABLE functions in queries 
of the form:
SELECT * FROM t WHERE col oper stable_func();

* Checking STABLE does not guarantee that the function will be executed 
only once. If the table is scanned sequentially, the function is 
executed for each row of the query.

* If the table has an index on the col column, the planner can choose to 
scan the index. In this case, the STABLE mark gives the right to 
calculate the function value once and use that value to search the index.

* In the case of a sequential scan, the total cost of the plan includes, 
among other things, the cost of the function multiplied by the number of 
rows.  For user-defined functions, the default cost is 100. It may be 
worth changing this value for a more adequate estimate. Decreasing the 
cost of a function will decrease the cost of a seq scan and vice versa. 
Refining the function cost estimate will enable the planner to make a 
more accurate choice between seq scan and index scan.

* If seq scan is preferred, you can avoid executing the function 
multiple times by materializing the result of the function.

* There are two ways to materialize the result: a scalar subquery and a CTE.
     SELECT * FROM t WHERE col oper (SELECT stable_func();
     WITH m AS MATERIALIZED (SELECT stable_func() AS f) SELECT * FROM t, 
m WHERE col oper m.f;

* When materializing a function result, the planner has no way to use 
the function value to build the plan. Therefore, it will not be able to 
use the statistics for the t.col to select the optimal plan.  The 
generic algorithm will be used.

Thank you very much for sharing.

--
Pavel Luzanov
Postgres Professional: https://postgrespro.com
The Russian Postgres Company





Re: Strange behavior of function date_trunc

From
Pavel Luzanov
Date:
David,


On 06.05.2021 17:28, David G. Johnston wrote:
On Thu, May 6, 2021 at 6:44 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
This case is the reason we invented the "stable" attribute to begin
with.  People have since misinterpreted it as authorizing caching of
function results, but that's not what it was intended for.


This is a good paragraph...if something like it gets added to the create function documentation mis-interpretations are likely to decrease.


I found additional details in the documentation. In particular about the index scanning for stable functions:
https://www.postgresql.org/docs/13/xfunc-volatility.html

The link to this section there is in the create function page. Maybe that's enough.
--
Pavel Luzanov
Postgres Professional: https://postgrespro.com
The Russian Postgres Company