Estimates not taking null_frac element into account with @@ operator? (8.4 .. git-head) - Mailing list pgsql-hackers

From Jesper Krogh
Subject Estimates not taking null_frac element into account with @@ operator? (8.4 .. git-head)
Date
Msg-id 4D5D9224.7080105@krogh.cc
Whole thread Raw
Responses Re: Estimates not taking null_frac element into account with @@ operator? (8.4 .. git-head)  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Estimates not taking null_frac element into account with @@ operator? (8.4 .. git-head)  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hi All.

The NULL element always suprises me in unpleasant ways..  my brain simply
cant really understand the logic, so please let me know if this is
one of the cases where I just should spend way more efforts into fixing
that instead.

I have a table with a "null_frac" of 0.5 and i have tested that a where
clause that evaluates to null isnt included in the result:

testdb=# select id from testtable where  null @@ to_tsquery('testterm80');
  id
----
(0 rows)

Then I'd expect to have the null_fraq taken into account when computing the
estimates for the query:

testdb=# explain select id from testtable where fts @@
to_tsquery('testterm80');
                           QUERY PLAN
---------------------------------------------------------------
  Seq Scan on testtable  (cost=0.00..1985.03 rows=1966 width=4)
    Filter: (fts @@ to_tsquery('testterm80'::text))
(2 rows)

Whereas it actually does it if I explicitly add the "fts is not null"
clause to the query.

testdb=# explain select id from testtable where fts @@
to_tsquery('testterm80') and fts is not null;
                                      QUERY PLAN
-------------------------------------------------------------------------------------
  Bitmap Heap Scan on testtable  (cost=130.34..1735.19 rows=983 width=4)
    Recheck Cond: ((fts @@ to_tsquery('testterm80'::text)) AND (fts IS
NOT NULL))
    ->  Bitmap Index Scan on testtable_fts_idx  (cost=0.00..130.09
rows=983 width=0)
          Index Cond: ((fts @@ to_tsquery('testterm80'::text)) AND (fts
IS NOT NULL))
(4 rows)

When something evaluates to "null" isn't included in the result,
shouldn't the query-planner
then take the null_frac into account when computing the estimate?

Trying to do the same thing using integers and the < operator seem to
take the null_frac into
account.

Below snippet allows to reproduce the dataset.


create table testtable (id serial primary key, document text, fts tsvector);
create index on testtable using gist(fts);
CREATE OR REPLACE FUNCTION public.filltable(rows integer)
  RETURNS boolean
  LANGUAGE plpgsql
AS $function$
     DECLARE
         count integer;
     BEGIN
         count := 0;
         LOOP
             EXIT WHEN count = rows;
             count := count +1;
             insert into testtable(document,fts) select
document,to_tsvector('english',document) from (select
string_agg(concat,' ') as document from (select concat('testterm' ||
generate_series(1,floor(random()*100)::integer))) as foo) as bar;
         END LOOP;
         RETURN TRUE;
     END;
$function$

select filltable(10000);
testdb=# update testtable set fts = null where id % 2 = 0;
UPDATE 5001
testdb=# ANALYZE verbose testtable;
INFO:  analyzing "public.testtable"
INFO:  "testtable": scanned 1835 of 1835 pages, containing 10002 live
rows and 5001 dead rows; 10002 rows in sample, 10002 estimated total rows
ANALYZE
testdb=# select null_frac from pg_stats where attname = 'fts';
  null_frac
-----------
        0.5
(1 row)

... trying with integers:

testdb=# ALTER  TABLE testtable add column testint integer;
ALTER TABLE
testdb=# update testtable set testint = floor(random()*100);
UPDATE 10002
testdb=# ANALYZE verbose testtable;
INFO:  analyzing "public.testtable"
INFO:  "testtable": scanned 2186 of 2186 pages, containing 10002 live
rows and 10002 dead rows; 10002 rows in sample, 10002 estimated total rows
ANALYZE
testdb=# update testtable set testint = null where id %2 = 0;
UPDATE 5001
testdb=# ANALYZE verbose testtable;
INFO:  analyzing "public.testtable"
INFO:  "testtable": scanned 2282 of 2282 pages, containing 10002 live
rows and 13335 dead rows; 10002 rows in sample, 10002 estimated total rows
analyzze ANALYZE
testdb=# explain select id from testtable where testint = 50;
                          QUERY PLAN
-------------------------------------------------------------
  Seq Scan on testtable  (cost=0.00..2407.03 rows=64 width=4)
    Filter: (testint = 50)
(2 rows)

testdb=# explain select id from testtable where testint = 1;
                          QUERY PLAN
-------------------------------------------------------------
  Seq Scan on testtable  (cost=0.00..2407.03 rows=48 width=4)
    Filter: (testint = 1)
(2 rows)

testdb=# explain select id from testtable where testint < 50;
                           QUERY PLAN
---------------------------------------------------------------
  Seq Scan on testtable  (cost=0.00..2407.03 rows=2470 width=4)
    Filter: (testint < 50)
(2 rows)


(found on 8.4 and reproduced on git-head)

Attached patch tries to align the behaviour

Thanks.

--
Jesper

Attachment

pgsql-hackers by date:

Previous
From: Simon Riggs
Date:
Subject: Re: Replication server timeout patch
Next
From: Robert Haas
Date:
Subject: Re: Replication server timeout patch