Thread: Planner misestimation for JOIN with VARCHAR

Planner misestimation for JOIN with VARCHAR

From
Sebastian Dressler
Date:
Helloes,

I do have a set of tables which contain user data and users can choose to have columns as constrained VARCHAR, limit is
typically100. While users can also choose from different types, quite often they go the VARCHAR route. Furthermore,
theycan pick PKs almost freely. As a result, I quite often see tables with the following DDL:
 

    CREATE TABLE example_1(
        a VARCHAR(100)
      , b VARCHAR(100)
      , c VARCHAR(100)
      , payload TEXT
    );
    ALTER TABLE example_1 ADD PRIMARY KEY (a, b, c);

Due to processing, these need to be joined together sometimes considering the complete PK. For instance, assume
example_1and example_2 having the same structure as above. Then, when I do
 

    SELECT *
    FROM example_1 t1
    INNER JOIN example_2 t2 ON(
          t1.a = t2.a
      AND t1.b = t2.b
      AND t1.c = t2.c
    );

the planner will very likely estimate a single resulting row for this operation. For instance:

     Gather  (cost=1510826.53..3100992.19 rows=1 width=138)
       Workers Planned: 13
       ->  Parallel Hash Join  (cost=1510726.53..3100892.04 rows=1 width=138)
             Hash Cond: (((t1.a)::text = (t2.a)::text) AND ((t1.b)::text = (t2.b)::text) AND ((t1.c)::text =
(t1.c)::text))
             ->  Parallel Seq Scan on example_1 t1  (cost=0.00..1351848.61 rows=7061241 width=69)
             ->  Parallel Hash  (cost=1351848.61..1351848.61 rows=7061241 width=69)
                   ->  Parallel Seq Scan on example_2 t2  (cost=0.00..1351848.61 rows=7061241 width=69)

This does not create a problem when joining just two tables on their own. However, with a more complex query, there
willbe more than one single-row estimates. Hence, I typically see a nested loop which takes very long to process
eventually.

This runs on PG 12, and I ensured that the tables are analyzed, my default_statistics_target is 2500. However, it
seems,that with more VARCHARs being in the JOIN, the estimates becomes worse. Given the table definition as above, I
wonderwhether I have overlooked anything in terms of settings or additional indices which could help here.
 

Things tried so far without any noticeable change:

- Add an index on top of the whole PK
- Add indexes onto other columns trying to help the JOIN
- Add additional statistics on two related columns

Another idea I had was to make use of generated columns and hash the PKs together to an BIGINT and solely use this for
theJOIN. However, this would not work when not all columns of the PK are used for the JOIN.
 


Thanks,
Sebastian

--

Sebastian Dressler, Solution Architect 
+49 30 994 0496 72 | sebastian@swarm64.com 

Swarm64 AS
Parkveien 41 B | 0258 Oslo | Norway
Registered at Brønnøysundregistrene in Norway under Org.-Number 911 662 787
CEO/Geschäftsführer (Daglig Leder): Thomas Richter; Chairman/Vorsitzender (Styrets Leder): Dr. Sverre Munck 

Swarm64 AS Zweigstelle Hive
Ullsteinstr. 120 | 12109 Berlin | Germany
Registered at Amtsgericht Charlottenburg - HRB 154382 B 


Re: Planner misestimation for JOIN with VARCHAR

From
Michael Lewis
Date:
On Tue, Jun 9, 2020 at 12:34 PM Sebastian Dressler <sebastian@swarm64.com> wrote:
- Add an index on top of the whole PK
- Add indexes onto other columns trying to help the JOIN
- Add additional statistics on two related columns

Another idea I had was to make use of generated columns and hash the PKs together to an BIGINT and solely use this for the JOIN. However, this would not work when not all columns of the PK are used for the JOIN.

Can you expand on the additional statistics you created? Why was it on only two columns? Did you include MCVs type of extended stats?

Re: Planner misestimation for JOIN with VARCHAR

From
David Rowley
Date:
On Wed, 10 Jun 2020 at 07:31, Michael Lewis <mlewis@entrata.com> wrote:
>
> On Tue, Jun 9, 2020 at 12:34 PM Sebastian Dressler <sebastian@swarm64.com> wrote:
>>
>> - Add an index on top of the whole PK
>> - Add indexes onto other columns trying to help the JOIN
>> - Add additional statistics on two related columns
>>
>> Another idea I had was to make use of generated columns and hash the PKs together to an BIGINT and solely use this
forthe JOIN. However, this would not work when not all columns of the PK are used for the JOIN.
 
>
>
> Can you expand on the additional statistics you created? Why was it on only two columns? Did you include MCVs type of
extendedstats?
 

Unfortunately, the join selectivity functions have yet to learn about
extended statistics.

David



Re: Planner misestimation for JOIN with VARCHAR

From
Michael Lewis
Date:
the join selectivity functions have yet to learn about extended statistics.

That is very interesting to me. So, extended statistics would help to properly estimate the result set coming out of a single table when comparing each of those columns to one or many values, but not when joining up another table? Is that on a roadmap or in progress as far as you know?

Re: Planner misestimation for JOIN with VARCHAR

From
David Rowley
Date:
On Wed, 10 Jun 2020 at 09:05, Michael Lewis <mlewis@entrata.com> wrote:
>>
>> the join selectivity functions have yet to learn about extended statistics.
>
>
> That is very interesting to me. So, extended statistics would help to properly estimate the result set coming out of
asingle table when comparing each of those columns to one or many values, but not when joining up another table? Is
thaton a roadmap or in progress as far as you know? 

Yeah, they're currently only used for selectivity estimates on base
table.  Using extended statistics for join selectivity estimation is
being worked on [1], so there's a chance that it may happen for PG14.

David

[1] https://commitfest.postgresql.org/28/2386/



Re: Planner misestimation for JOIN with VARCHAR

From
Sebastian Dressler
Date:

> On 9. Jun 2020, at 21:30, Michael Lewis <mlewis@entrata.com> wrote:
> 
>> On Tue, Jun 9, 2020 at 12:34 PM Sebastian Dressler <sebastian@swarm64.com> wrote:
>> - Add an index on top of the whole PK
>> - Add indexes onto other columns trying to help the JOIN
>> - Add additional statistics on two related columns
>> 
>> Another idea I had was to make use of generated columns and hash the PKs together to an BIGINT and solely use this
forthe JOIN. However, this would not work when not all columns of the PK are used for the JOIN.
 
> 
> Can you expand on the additional statistics you created? Why was it on only two columns? Did you include MCVs type of
extendedstats?
 

Sure, for the MCVs specifically I tried to select which columns belong together to also have value for the end-user
whenrunning a query. Basically in accordance what the docs suggest, i.e. [1] section 14.2.2.3 last paragraph. Given the
natureof the data I however think this can be very difficult to do without requiring further user input. Likewise, as
theothers suggested, it did not help for this particular case.
 

Cheers,
Sebastian

[1]: https://www.postgresql.org/docs/current/planner-stats.html#PLANNER-STATS-EXTENDED

--

Sebastian Dressler, Solution Architect 
+49 30 994 0496 72 | sebastian@swarm64.com 

Swarm64 AS
Parkveien 41 B | 0258 Oslo | Norway
Registered at Brønnøysundregistrene in Norway under Org.-Number 911 662 787
CEO/Geschäftsführer (Daglig Leder): Thomas Richter; Chairman/Vorsitzender (Styrets Leder): Dr. Sverre Munck 

Swarm64 AS Zweigstelle Hive
Ullsteinstr. 120 | 12109 Berlin | Germany
Registered at Amtsgericht Charlottenburg - HRB 154382 B