Re: pgsql: Extended statistics on expressions - Mailing list pgsql-committers

From Tomas Vondra
Subject Re: pgsql: Extended statistics on expressions
Date
Msg-id c61cfe32-251f-e9cb-1be1-0e15218cda17@enterprisedb.com
Whole thread Raw
In response to Re: pgsql: Extended statistics on expressions  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: pgsql: Extended statistics on expressions  (David Rowley <dgrowleyml@gmail.com>)
List pgsql-committers

On 3/31/21 7:54 PM, Tomas Vondra wrote:
> On 3/31/21 7:08 PM, Alvaro Herrera wrote:
>> On 2021-Mar-31, Tomas Vondra wrote:
>>
>>> Thanks for the report, I'll take a look. You're right this seems like an
>>> out-of-bounds access, but mcv_match_expression is only expected to be
>>> run on expressions we know are in the statistics (because we pick the
>>> statistics like that). Clearly, that does not happen here, not sure why.
>>>
>>> It's quite weird that we end up running textlike(), when the statistics
>>> is on (double precision, boolean) columns ...
>>
>> Uninitialized values somewhere?  Maybe valgrind would help.
>>
> 
> Unlikely, I've ran it through valgrind repeatedly, including right
> before commit (both on x86_64 and arm).
> 
> FWIW I'm unable to reproduce it, so not sure what's going on. David,
> what configure option are you using? Anything special?
> 
> 
> It's a bit strange, because statext_mcv_clauselist_selectivity should
> only estimate "matching" clauses on the statistics. So how come this
> estimates such a complex expression using textlike(), when neither of
> those columns is text?
> 
> It'd be interesting to know what's happening in the code after
> 
>   stat = choose_best_statistics(...);
> 
> i.e. what clauses it considers "compatible" with the statistics and why.
> In fact, I wouldn't have expected the statistics to be used at all.
> 

OK, I managed to reproduce/trigger the issue. The simplest query that
triggers the issue for me is this:

    SELECT t1.c0 FROM ONLY t1 WHERE
    (
      upper('x') LIKE ('x'||('[0,1]'::int4range))
      AND
      (t1.c0 IN (0, 1) OR t1.c1)
    )

I think the code matching clauses to the statistics gets a bit confused
when processing the AND clause. It extracts 2 attnums for the OR part,
but the first part should be "incompatible" with the statistics. But
after picking the statistics to apply, it gets confused and includes the
first expression (the whole LIKE clause) as compatible too.

The attached patch fixes this for me. David, can you check if this
resolves the issue for you?

I don't feel like I want to push a fix at midnight, and I'd like to
think about maybe making this part of the code a bit clearer tomorrow.
It's not very comprehensible, I'm afraid.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

pgsql-committers by date:

Previous
From: Alvaro Herrera
Date:
Subject: pgsql: Disable force_parallel_mode in libpq_pipeline
Next
From: Alvaro Herrera
Date:
Subject: pgsql: Initialize conn->Pfdebug to NULL when creating a connection