Thread: Cardinality estimate of the inner relation

From: Frédéric Yhuel
Date: 22/11/2024 22:53
My colleague Christophe Courtois and I have been trying to fix a bad 
plan for one of Dalibo's clients. It is a (probably well-known) problem 
with skewed data and a parameterized Nested Loop with an underestimation 
of the cardinality of the inner relation.

Here is a test case (the script to create and populate the two tables is 
at the end):

EXPLAIN (ANALYZE, SETTINGS)
   SELECT name, sum(quantity)
   FROM products p
   JOIN orders o ON (p.id = o.product_id)
   WHERE p.name = 'babar'        /*    28% of orders    */
   GROUP BY name;
  
                                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
  GroupAggregate  (cost=5.50..1022.27 rows=1 width=40) (actual time=66.213..66.215 rows=1 loops=1)
    ->  Nested Loop  (cost=5.50..1019.29 rows=1189 width=36) (actual time=0.083..56.991 rows=320000 loops=1)
          ->  Bitmap Heap Scan on products p  (cost=5.08..250.17 rows=85 width=40) (actual time=0.058..0.153 rows=80 loops=1)
                Recheck Cond: (name = 'babar'::text)
                Heap Blocks: exact=80
                ->  Bitmap Index Scan on products_name_idx  (cost=0.00..5.06 rows=85 width=0) (actual time=0.030..0.031 rows=80 loops=1)
                      Index Cond: (name = 'babar'::text)
          ->  Index Scan using orders_product_id_idx on orders o  (cost=0.43..8.79 rows=26 width=12) (actual time=0.006..0.440 rows=4000 loops=80)
                Index Cond: (product_id = p.id)
  Planning Time: 0.560 ms
  Execution Time: 66.268 ms

The cardinality estimate on 'orders' is wrong (26 rows estimated vs. 
4000 actual), hence the Nested Loop.
Any other value of products.name yields fewer rows and a good plan 
(i.e. the Nested Loop makes more sense).

If I'm not wrong, the row count estimate for the inner table is computed 
as follows:

SELECT n_distinct AS product_id_n_distinct FROM pg_stats
   WHERE tablename = 'orders' AND attname = 'product_id' \gset

SELECT reltuples AS orders_tuples FROM pg_class
   WHERE relname = 'orders' \gset

SELECT :orders_tuples / :product_id_n_distinct AS inner_estimation;
   inner_estimation
---------------------
  26.4049450290190157
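
For what it's worth, the same per-value average shows up if you force a 
generic plan for a parameterized scan on orders, since the planner then 
doesn't know which product_id will be supplied. A small sketch (assuming 
PostgreSQL 12+ for plan_cache_mode; the value 1000 is just one of the 
'babar' product ids from the test data):

PREPARE inner_scan (bigint) AS
   SELECT * FROM orders WHERE product_id = $1;

SET plan_cache_mode = force_generic_plan;

EXPLAIN EXECUTE inner_scan(1000);
-- the index scan on orders should be estimated at roughly rows=26,
-- whatever parameter value is passed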

I have two questions:

1) Is there a known workaround for this problem (besides unnatural 
partitioning of the inner table)?

2) Would it be possible to add cross-table statistics to fix this? Has 
this been discussed?


Here is the script to create and populate the two tables:

  SELECT 80000 AS nbp \gset

DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS products;

CREATE UNLOGGED TABLE products(
         id bigint PRIMARY KEY,
         name TEXT,
         price INT
);

CREATE UNLOGGED TABLE orders(
         id bigint generated always as identity,
         product_id BIGINT, -- REFERENCES products(id),
         quantity INT
);

INSERT INTO products (id, name, price)
   SELECT
     i,
     -- every 1000th product is named 'babar' (80 products out of 80000)
     CASE WHEN i % 1000 = 0 THEN 'babar' ELSE md5(i::text) END,
     random() * 1000
   FROM generate_series(1, :nbp) AS T(i);

-- each 'babar' product gets 4000 orders, every other product only 10
INSERT INTO orders (product_id, quantity)
   SELECT product_id, quantity FROM (
     SELECT
       id AS product_id,
       random() * 10 AS quantity,
       name
     FROM products,
     LATERAL (SELECT * FROM generate_series(1,
        CASE WHEN name = 'babar' THEN 4000 ELSE 10 END)) T) U;

CREATE INDEX ON orders (product_id);
VACUUM ANALYZE products, orders;

ALTER TABLE orders ADD FOREIGN KEY (product_id) REFERENCES products(id);

CREATE INDEX ON products (name);
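
As a quick sanity check (not part of the original script), the ~28% share 
of 'babar' orders can be verified with something like:

SELECT round(100.0 * count(*) FILTER (WHERE p.name = 'babar') / count(*), 1)
       AS babar_order_pct
FROM orders o
JOIN products p ON p.id = o.product_id;
-- should return about 28.6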



Re: Cardinality estimate of the inner relation

From: Andrei Lepikhov
Date: 23/11/2024 03:07
On 22/11/2024 22:53, Frédéric Yhuel wrote:
> My colleague Christophe Courtois and I have been trying to fix a bad 
> plan for one of Dalibo's clients. It is a (probably well-known) problem 
> with skewed data and a parameterized Nested Loop with an underestimation 
> of the cardinality of the inner relation.
> 
> Here is a test case (the script to create and populate the two tables is 
> at the end):
Thanks for the case provided!

I wonder if data science has invented a statistic or selectivity 
estimation technique that could tackle your case in general. As I see 
it, we would have to look into the products table to find out which 
specific ID belongs to the product named 'babar'.
I can imagine a trick where you build an MCV on (id, name), get a chance 
to find this popular ID, cache it, and use it later during join clause 
estimation. But it seems too expensive to do the same for arbitrary 
incoming queries.
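
For reference, the declarative side of that building block already exists 
as extended statistics, roughly as sketched below. Note, however, that the 
planner currently applies such statistics only to clauses on a single 
table, not to the cross-table join estimation discussed here, and with a 
unique id column the MCV list is of little value anyway.

CREATE STATISTICS products_id_name (mcv) ON id, name FROM products;
ANALYZE products;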

If you want a workaround, such cases are a good fit for query-driven 
techniques. Among open-source ones, I can point you to the AQO extension 
(honestly, designed under my command). It can save information about an 
estimation error and correct the plan the next time the query runs.
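
A minimal setup sketch, assuming AQO is built and installed (GUC and 
mode names may differ slightly between AQO versions):

-- postgresql.conf (restart required):
--   shared_preload_libraries = 'aqo'

CREATE EXTENSION aqo;
SET aqo.mode = 'learn';  -- record cardinality errors and reuse them
                         -- to correct subsequent plans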

On the enterprise-grade side, you can pick the sr_plan extension, which 
is designed to store the plan for a specific query (you can choose the 
parameterisation yourself) and spread it globally across all instances' 
backends.

-- 
regards, Andrei Lepikhov




Re: Cardinality estimate of the inner relation

From: Frédéric Yhuel
Date:

On 11/23/24 03:07, Andrei Lepikhov wrote:
> Thanks for the case provided!
> 

Thanks for your answer!

> I wonder if data science has invented a statistic or selectivity 
> estimation technique that could tackle your case in general. As I see 
> it, we would have to look into the products table to find out which 
> specific ID belongs to the product named 'babar'.

There are many such IDs in my test case (only one in our initial 
customer's case, but I thought the test case should be more generic). It 
doesn't really change the reasoning much, but I mention it for the sake 
of accuracy.

> I can imagine a trick where you build an MCV on (id, name), get a 
> chance to find this popular ID, cache it, and use it later during join 
> clause estimation. But it seems too expensive to do the same for 
> arbitrary incoming queries.
> 

I'm not sure that I understand the idea, but I've never tried to hack 
into the planner's code yet, so...

> If you want a workaround, such cases are a good fit for query-driven 
> techniques. Among open-source ones, I can point you to the AQO 
> extension (honestly, designed under my command). It can save 
> information about an estimation error and correct the plan the next 
> time the query runs.

Interesting, thanks!