Thread: Performance Evaluation of Result Cache by using TPC-DS

Performance Evaluation of Result Cache by using TPC-DS

From
Yuya Watari
Date:
Hello,

Recently, the result cache feature was committed to PostgreSQL. I
tested its performance by running the TPC-DS benchmark. As a result,
I found some regressions in query performance.

I used TPC-DS at scale factor 100 in the evaluation. I executed all
99 TPC-DS queries, and the result cache was used in 21 of them.
However, some queries took too much time, so I skipped their
execution. I set work_mem to 256MB and
max_parallel_workers_per_gather to 0.

The evaluation results are as follows. A negative speedup ratio
indicates that the result cache increased the execution time.

 Query No | With result cache (s) | Without result cache (s) | Speedup ratio
 ---------+-----------------------+--------------------------+--------------
        7 |                 142.1 |                    142.2 |         0.03%
        8 |                 144.0 |                    142.8 |        -0.82%
       13 |                 164.6 |                    162.0 |        -1.65%
       27 |                 138.9 |                    138.7 |        -0.16%
       34 |                 135.7 |                    137.1 |         1.02%
       43 |                 209.5 |                    207.2 |        -1.10%
       48 |                 181.5 |                    170.7 |        -6.32%
       55 |                 130.6 |                    123.8 |        -5.48%
       61 |                 0.014 |                    0.037 |        62.06%
       62 |                  66.7 |                     59.9 |       -11.36%
       68 |                 131.3 |                    127.2 |        -3.17%
       72 |                 567.0 |                    563.4 |        -0.65%
       73 |                 130.1 |                    129.8 |        -0.29%
       88 |                1044.5 |                   1048.7 |         0.40%
       91 |                   1.2 |                      1.1 |        -7.93%
       96 |                 132.2 |                    131.7 |        -0.37%

As you can see from these results, many queries have a negative
speedup ratio, which means the result cache negatively impacted their
performance. In query 62, the execution time increased by 11.36%. I
suspect these regressions are due to cost misestimation in the
planner. I have attached the execution plan of query 62.

The result cache is currently enabled by default. However, if this
kind of performance regression is common, we have to change its
default behavior.

Best regards,
Yuya Watari

Attachment

Re: Performance Evaluation of Result Cache by using TPC-DS

From
David Rowley
Date:
On Tue, 13 Apr 2021 at 21:29, Yuya Watari <watari.yuya@gmail.com> wrote:
> I used TPC-DS at scale factor 100 in the evaluation. I executed all
> 99 TPC-DS queries, and the result cache was used in 21 of them.
> However, some queries took too much time, so I skipped their
> execution. I set work_mem to 256MB and
> max_parallel_workers_per_gather to 0.

Many thanks for testing this.

> As you can see from these results, many queries have a negative
> speedup ratio, which means the result cache negatively impacted their
> performance. In query 62, the execution time increased by 11.36%. I
> suspect these regressions are due to cost misestimation in the
> planner. I have attached the execution plan of query 62.

Can you share if these times were to run EXPLAIN ANALYZE or if they
were just the queries being executed normally?

The times in the two files you attached do look very similar to the
times in your table, so I suspect either TIMING ON is not that high an
overhead on your machine, or the results are that of EXPLAIN ANALYZE.

It would be really great if you could show the EXPLAIN (ANALYZE,
TIMING OFF) for query 62.   There's a chance that the slowdown comes
from the additional EXPLAIN ANALYZE timing overhead with the Result
Cache version.

> The result cache is currently enabled by default. However, if this
> kind of performance regression is common, we have to change its
> default behavior.

Yes, the feedback we get during the beta period will help drive that
decision or if the costing needs to be adjusted.

David



Re: Performance Evaluation of Result Cache by using TPC-DS

From
Yuya Watari
Date:
Hello David,

Thank you for your reply.

> Can you share if these times were to run EXPLAIN ANALYZE or if they
> were just the queries being executed normally?

These times are from running EXPLAIN ANALYZE. I executed each query
twice, and the table in my last e-mail shows the **average** execution
time. Therefore, the numbers in the table do not exactly match those
in the attached files. I'm sorry for the insufficient explanation.

> It would be really great if you could show the EXPLAIN (ANALYZE,
> TIMING OFF) for query 62.   There's a chance that the slowdown comes
> from the additional EXPLAIN ANALYZE timing overhead with the Result
> Cache version.

I ran query 62 with "EXPLAIN (ANALYZE, TIMING OFF)" and also
normally, and attached the execution results to this e-mail. This
time, I executed each query only once (not twice). The results are as
follows.

 Method                        | With result cache (s) | Without result cache (s) | Speedup ratio
 ------------------------------+-----------------------+--------------------------+--------------
 EXPLAIN (ANALYZE, TIMING ON)  |                67.161 |                   59.615 |       -12.66%
 EXPLAIN (ANALYZE, TIMING OFF) |                66.142 |                   60.660 |        -9.04%
 Normal                        |                66.611 |                   60.955 |        -9.28%

Although there is some variation in the execution times, the speedup
ratio is around -10%, so the result cache causes a roughly 10%
regression in query 62. The overhead of EXPLAIN ANALYZE and TIMING ON
does not seem to be high.

Best regards,
Yuya Watari

On Tue, Apr 13, 2021 at 7:13 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Tue, 13 Apr 2021 at 21:29, Yuya Watari <watari.yuya@gmail.com> wrote:
> > I used TPC-DS at scale factor 100 in the evaluation. I executed all
> > 99 TPC-DS queries, and the result cache was used in 21 of them.
> > However, some queries took too much time, so I skipped their
> > execution. I set work_mem to 256MB and
> > max_parallel_workers_per_gather to 0.
>
> Many thanks for testing this.
>
> > As you can see from these results, many queries have a negative
> > speedup ratio, which means the result cache negatively impacted their
> > performance. In query 62, the execution time increased by 11.36%. I
> > suspect these regressions are due to cost misestimation in the
> > planner. I have attached the execution plan of query 62.
>
> Can you share if these times were to run EXPLAIN ANALYZE or if they
> were just the queries being executed normally?
>
> The times in the two files you attached do look very similar to the
> times in your table, so I suspect either TIMING ON is not that high an
> overhead on your machine, or the results are that of EXPLAIN ANALYZE.
>
> It would be really great if you could show the EXPLAIN (ANALYZE,
> TIMING OFF) for query 62.   There's a chance that the slowdown comes
> from the additional EXPLAIN ANALYZE timing overhead with the Result
> Cache version.
>
> > The result cache is currently enabled by default. However, if this
> > kind of performance regression is common, we have to change its
> > default behavior.
>
> Yes, the feedback we get during the beta period will help drive that
> decision or if the costing needs to be adjusted.
>
> David

Attachment

Re: Performance Evaluation of Result Cache by using TPC-DS

From
David Rowley
Date:
On Wed, 14 Apr 2021 at 17:11, Yuya Watari <watari.yuya@gmail.com> wrote:
> I ran query 62 with "EXPLAIN (ANALYZE, TIMING OFF)" and also
> normally, and attached the execution results to this e-mail. This
> time, I executed each query only once (not twice). The results are as
> follows.

Thanks for running that again.  I see from the EXPLAIN ANALYZE output
that the planner did cost the Result Cache plan slightly more
expensive than the Hash Join plan.  It's likely that add_path() did
not consider the Hash Join plan to be worth keeping because it was not
more than 1% better than the Result Cache plan. STD_FUZZ_FACTOR is set
so new paths need to be at least 1% better than existing paths for
them to be kept.  That's pretty unfortunate, but that alone does not
mean the costs are incorrect.  It would be good to know if that's the
case for the other queries too.

To test that, I've set up TPC-DS locally, however, it would be good if
you could send me the list of indexes that you've created.  I see the
TPC-DS tool from the Transaction Processing Performance Council only
comes with the list of tables.

Can you share the output of:

select pg_get_indexdef(indexrelid) from pg_index where indrelid::regclass in (
'call_center',
'catalog_page',
'catalog_returns',
'catalog_sales',
'customer',
'customer_address',
'customer_demographics',
'date_dim',
'dbgen_version',
'household_demographics',
'income_band',
'inventory',
'item',
'promotion',
'reason',
'ship_mode',
'store',
'store_returns',
'store_sales',
'time_dim')
order by indrelid;

from your TPC-DS database?

David



Re: Performance Evaluation of Result Cache by using TPC-DS

From
Yuya Watari
Date:
Hello David,

Thank you for your reply.

> Thanks for running that again.  I see from the EXPLAIN ANALYZE output
> that the planner did cost the Result Cache plan slightly more
> expensive than the Hash Join plan.  It's likely that add_path() did
> not consider the Hash Join plan to be worth keeping because it was not
> more than 1% better than the Result Cache plan. STD_FUZZ_FACTOR is set
> so new paths need to be at least 1% better than existing paths for
> them to be kept.  That's pretty unfortunate, but that alone does not
> mean the costs are incorrect.  It would be good to know if that's the
> case for the other queries too.

Thanks for your analysis. Now I understand why the Hash Join was not
selected in this query plan.

> To test that, I've set up TPC-DS locally, however, it would be good if
> you could send me the list of indexes that you've created.  I see the
> TPC-DS tool from the Transaction Processing Performance Council only
> comes with the list of tables.
>
> Can you share the output of:

I ran your query to list all the indexes on my machine and attached
the result to this e-mail. I hope it helps.

Best regards,
Yuya Watari

Attachment

Re: Performance Evaluation of Result Cache by using TPC-DS

From
David Rowley
Date:
On Tue, 20 Apr 2021 at 16:43, Yuya Watari <watari.yuya@gmail.com> wrote:
> I ran your query to list all the indexes on my machine and attached
> the result to this e-mail. I hope it helps.

Thanks for sending that.

I've now run some benchmarks of TPC-DS both with enable_resultcache on
and off.  I think I've used the same scale of test as you did. -SCALE
10.

tpcds=# \l+ tpcds
                                              List of databases
 Name  |  Owner  | Encoding |   Collate   |    Ctype    | Access privileges | Size  | Tablespace | Description
-------+---------+----------+-------------+-------------+-------------------+-------+------------+-------------
 tpcds | drowley | UTF8     | en_NZ.UTF-8 | en_NZ.UTF-8 |                   | 28 GB | pg_default |
(1 row)

The following settings were non-standard:

tpcds=# select name,setting from pg_Settings where setting <> boot_val;
               name               |      setting
----------------------------------+--------------------
 application_name                 | psql
 archive_command                  | (disabled)
 client_encoding                  | UTF8
 data_directory_mode              | 0700
 DateStyle                        | ISO, DMY
 default_text_search_config       | pg_catalog.english
 enable_resultcache               | off
 fsync                            | off
 jit                              | off
 lc_collate                       | en_NZ.UTF-8
 lc_ctype                         | en_NZ.UTF-8
 lc_messages                      | en_NZ.UTF-8
 lc_monetary                      | en_NZ.UTF-8
 lc_numeric                       | en_NZ.UTF-8
 lc_time                          | en_NZ.UTF-8
 log_file_mode                    | 0600
 log_timezone                     | Pacific/Auckland
 max_parallel_maintenance_workers | 10
 max_parallel_workers_per_gather  | 0
 max_stack_depth                  | 2048
 server_encoding                  | UTF8
 shared_buffers                   | 2621440
 TimeZone                         | Pacific/Auckland
 unix_socket_permissions          | 0777
 wal_buffers                      | 2048
 work_mem                         | 262144
(26 rows)

This is an AMD 3990x CPU with 64GB of RAM.

I didn't run all of the queries. To reduce the benchmark times and to
make the analysis easier, I just ran the queries where EXPLAIN shows
at least 1 Result Cache node.

The queries in question are: 1 2 6 7 15 16 21 23 24 27 34 43 44 45 66
69 73 79 88 89 91 94 99.

The one exception here is query 58. It did use a Result Cache node
when enable_resultcache=on, but the query took more than 6 hours to
run.  This slowness is not due to Result Cache. It's due to the
following correlated subquery.

  and i.i_current_price > 1.2 *
      (select avg(j.i_current_price)
       from item j
       where j.i_category = i.i_category)

That results in:

SubPlan 2
  ->  Aggregate  (cost=8264.44..8264.45 rows=1 width=32) (actual time=87.592..87.592 rows=1 loops=255774)

87.592 * 255774 is 6.22 hours.  So 6.22 hours of executing that
subplan. The query took 6.23 hours in total. (A Result Cache on the
subplan would help here! :-)  there are only 10 distinct categories)

Results
======

Out of the 23 queries that used Result Cache, only 7 executed more
quickly than with enable_resultcache = off.  Also, in 15 of the 23
queries, the Result Cache plan was not the cheapest: the planner
rejected some other join method that would have made a cheaper plan.
This is likely due to the add_path() fuzziness not keeping the
cheaper path.

In only 5 of the 23 queries was the Result Cache plan both cheaper
and slower to execute. These are queries 1, 6, 27, 88 and 99, which
cost just 0.55%, 0.04%, 0.25%, 0.25% and 0.01% less than the plan
that was picked when enable_resultcache=off. None of those plans is
significantly cheaper than the alternative.

So, in summary, I'd say there are two separate problems here:

1. The planner does not always pick the cheapest plan due to add_path
fuzziness.  (15 of the 23 queries have this problem; however, 4 of
these 15 queries were faster with result cache, despite costing more.)
2. Sometimes the Result Cache plan is cheaper but slower than the
plan that is picked with enable_resultcache = off. (5 of the 23
queries have this problem.)

Overall, with result cache enabled, the benchmark ran 1.15% faster.
This is mostly due to query 69, which ran over 40 seconds more quickly
with result cache enabled.  Unfortunately, 16 of the 23 queries became
slower due to result cache, with only the remaining 7 becoming faster.
That's not a good track record.  I never expected that we'd use a
Result Cache node correctly in every planning problem we ever try to
solve, but only getting that right 30.4% of the time is not quite as
close to that 100% mark as I'd have liked. However, maybe that's
overly harsh on the Result Cache code, as only 5 queries were costed
cheaper and ran slower. So 18 of 23 seem to have more realistic
costs, which is 78% of queries.

What can be done?
===============

I'm not quite sure. The biggest problem is add_path's fuzziness.  I
could go and add some penalty cost to Result Cache paths so that
they're picked less often.  If I make that penalty more than 1% of the
cost, then that should get around add_path rejecting the other join
method that is not fuzzily good enough.  Adding some sort of penalty
might also help the 5 of 23 queries that were cheaper and slower than
the alternative.

I've attached a spreadsheet with all of the results and also the
EXPLAIN / EXPLAIN ANALYZE and times from both runs.

The query times in the spreadsheet are from running each query once
with pgbench (i.e. -t 1), not the EXPLAIN ANALYZE time.

I've also zipped the entire benchmark results and attached as results.tar.bz2.

David

Attachment

Re: Performance Evaluation of Result Cache by using TPC-DS

From
Yuya Watari
Date:
Hello David,

Thank you for running the experiments on your machine. I really
appreciate your deep analysis.

Your results are very interesting. In 5 queries, the result cache is
cheaper but slower. In particular, in query 88, although the plan
with the result cache is costed cheaper, its execution time degrades
by 34.23%. This is a big regression.

> What can be done?
> ===============
>
> I'm not quite sure. The biggest problem is add_path's fuzziness.  I
> could go and add some penalty cost to Result Cache paths so that
> they're picked less often.  If I make that penalty more than 1% of the
> cost, then that should get around add_path rejecting the other join
> method that is not fuzzily good enough.  Adding some sort of penalty
> might also help the 5 of 23 queries that were cheaper and slower than
> the alternative.

Based on your idea, I have implemented a penalty for the cost of the
result cache. I attached the patch to this e-mail. Please note that
this patch is experimental, so it lacks comments, documentation,
tests, etc. It adds a new GUC, resultcache_cost_factor, by which the
cost of the result cache is multiplied. If the factor is greater than
1, we impose a penalty on the result cache.

The cost calculation has been modified as follows.

=====
@@ -2541,6 +2542,13 @@ cost_resultcache_rescan(PlannerInfo *root, ResultCachePath *rcpath,
         */
        startup_cost += cpu_tuple_cost;

+       /*
+        * We multiply the costs by resultcache_cost_factor to control the
+        * aggressiveness of result cache.
+        */
+       startup_cost *= resultcache_cost_factor;
+       total_cost *= resultcache_cost_factor;
=====
@@ -1618,9 +1618,14 @@ create_resultcache_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
         * Add a small additional charge for caching the first entry.  All the
         * harder calculations for rescans are performed in
         * cost_resultcache_rescan().
+        *
+        * We multiply the costs by resultcache_cost_factor to control the
+        * aggressiveness of result cache.
         */
-       pathnode->path.startup_cost = subpath->startup_cost + cpu_tuple_cost;
-       pathnode->path.total_cost = subpath->total_cost + cpu_tuple_cost;
+       pathnode->path.startup_cost =
+               (subpath->startup_cost + cpu_tuple_cost) * resultcache_cost_factor;
+       pathnode->path.total_cost =
+               (subpath->total_cost + cpu_tuple_cost) * resultcache_cost_factor;
        pathnode->path.rows = subpath->rows;

        return pathnode;
=====

As this factor increases, the result cache becomes less and less
likely to be chosen. I conducted an experiment to determine the
threshold of the factor: I ran EXPLAIN (not EXPLAIN ANALYZE) with
different factors, and the threshold is defined as the smallest
factor at which the result cache no longer appears in the query plan.
For factors above the threshold, the planner does not select the
result cache.

This experiment was conducted on my machine, so the results may differ
from those on your machine.

I attached the thresholds as Excel and PDF files. The thresholds vary
from 1.1 to 9.6. A threshold of 9.6 indicates that a penalty of 860%
must be imposed to avoid the result cache.

The Excel and PDF files also contain a chart showing the relationship
between the speedup ratio and the threshold. Unfortunately, there is
no clear correlation. If we set the factor to 5, we can avoid the 11%
degradation of query 62, because that query's threshold is 4.7.
However, we then lose the 62% speedup of query 61. Therefore, this
factor does not work well and should be reconsidered.

In this patch, I impose the penalty on the result cache node itself.
An alternative is to increase the cost of a nested loop that contains
a result cache. In that case, there is no need for a penalty of 860%;
a penalty of about 1% is enough.

Introducing resultcache_cost_factor is not a fundamental solution.
However, it is reasonable to offer a way of controlling the
aggressiveness of the result cache.

Again, this patch is experimental, so it needs feedback and
modifications.

Best regards,
Yuya Watari

Attachment

Re: Performance Evaluation of Result Cache by using TPC-DS

From
David Rowley
Date:
Thanks for doing further analysis on this.

On Mon, 26 Apr 2021 at 20:31, Yuya Watari <watari.yuya@gmail.com> wrote:
> Thank you for running the experiments on your machine. I really
> appreciate your deep analysis.
>
> Your results are very interesting. In 5 queries, the result cache is
> cheaper but slower. In particular, in query 88, although the plan
> with the result cache is costed cheaper, its execution time degrades
> by 34.23%. This is a big regression.

That's certainly one side of it.   On the other side, it's pretty
important to also note that in 4 of 23 queries the result cache plan
executed faster but the planner costed it as more expensive.

I'm not saying the costing is perfect, but what I am saying is, as you
noted above, in 5 of 23 queries the result cache was cheaper and
slower, and, as I just noted, in 4 of 23 queries, result cache was
more expensive and faster.  We know that costing is never going to be
a perfect representation of what the execution time will be.  However,
in these examples, we've just happened to get quite a good balance. If
we add a penalty to result cache then it'll just subtract from one
problem group and add to the other.

Overall, in my tests execution was 1.15% faster with result cache
enabled than it was without.

I could maybe get on board with adding a small fixed cost penalty. I'm
not sure exactly what it would be, maybe a cpu_tuple_cost instead of a
cpu_operator_cost and count it in for forming/deforming cached tuples.
I think the patch you wrote to add the resultcache_cost_factor is only
suitable for running experiments with.

The bigger concerns I have with the costing are around what happens
when an n_distinct estimate is far too low on one of the join columns.
I think it is more likely to be concerns like that one which would
cause us to default enable_resultcache to off.

David



Re: Performance Evaluation of Result Cache by using TPC-DS

From
Yuya Watari
Date:
Hello David,

Thank you for your reply.

> That's certainly one side of it.   On the other side, it's pretty
> important to also note that in 4 of 23 queries the result cache plan
> executed faster but the planner costed it as more expensive.
>
> I'm not saying the costing is perfect, but what I am saying is, as you
> noted above, in 5 of 23 queries the result cache was cheaper and
> slower, and, as I just noted, in 4 of 23 queries, result cache was
> more expensive and faster.  We know that costing is never going to be
> a perfect representation of what the execution time will be.  However,
> in these examples, we've just happened to get quite a good balance. If
> we add a penalty to result cache then it'll just subtract from one
> problem group and add to the other.
>
> Overall, in my tests execution was 1.15% faster with result cache
> enabled than it was without.

Thank you for your analysis. I agree with your opinion.

> I think it is more likely to be concerns like that one which would
> cause us to default enable_resultcache to off.

I am not sure whether this kind of degradation is common, but setting
the default to off is one of the realistic solutions.

Best regards,
Yuya Watari