[BUGS] BUG #14495: Cost of comparator is not taken into account in sorting - Mailing list pgsql-bugs

From zszabo@chemaxon.com
Subject [BUGS] BUG #14495: Cost of comparator is not taken into account in sorting
Date
Msg-id 20170116111312.1441.16240@wrigleys.postgresql.org
Whole thread Raw
Responses Re: [BUGS] BUG #14495: Cost of comparator is not taken into account in sorting
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      14495
Logged by:          Zsuzsanna Szabo
Email address:      zszabo@chemaxon.com
PostgreSQL version: 9.5.4
Operating system:   ubuntu linux
Description:

Dear PostgreSQL Support,

We are developing a chemical extension to PostgreSQL with a user defined
type Molecule. 

We define a btree operator class for this Molecule type to enable sorting by
that type:

CREATE OR REPLACE FUNCTION molecule_relevance_compare(Molecule, Molecule)
RETURNS integer
     AS 'MODULE_PATHNAME'
     LANGUAGE C RETURNS NULL ON NULL INPUT IMMUTABLE COST 40000;

CREATE OPERATOR CLASS molecule_btree_ops
DEFAULT FOR TYPE Molecule USING btree 
AS
    OPERATOR        1       &%<%&,
    OPERATOR        2        &%<=%&,
    OPERATOR        3       &%=%&,
    OPERATOR        4       &%>=%&,
    OPERATOR        5       &%>%&,
    FUNCTION        1       molecule_relevance_compare(Molecule,  Molecule);



Sorting is performed correctly by PostgreSQL if ORDER BY Molecule is used,
our function molecule_relevance_compare() is called for performing the sort.
There is no problem with that.

Our problem is that the cost of sorting by the Molecule column does not take
into account the cost of the comparison by molecule_relevance_compare() at
all!

We have defined a pre-sorted index on the Molecule type called
sortedchemindex:

CREATE OPERATOR CLASS sortedchemindex_int_ops
DEFAULT FOR TYPE Molecule USING sortedchemindex
AS
    OPERATOR        1       &%<%&,
    OPERATOR        2        &%<=%&,
    OPERATOR        3       &%=%&,
    OPERATOR        4       &%>=%&,
    OPERATOR        5       &%>%&,
    OPERATOR        6       |>|,
    OPERATOR        7        |<|,
    OPERATOR        8       |=|,
    OPERATOR        9       |~>| (Molecule, sim_filter),
    OPERATOR        10      |<~| (sim_filter, Molecule),
    OPERATOR        11      |<~| (Molecule, sim_filter),
    OPERATOR        12      |~>| (sim_filter, Molecule),
    OPERATOR        13      |~>| (Molecule, sim_order),
    OPERATOR        14      |<~| (sim_order, Molecule),
    OPERATOR        15      |<~| (Molecule, sim_order),
    OPERATOR        16      |~>| (sim_order, Molecule);
    
We also set amcanorderby to true in sortedchemindex enrty of pg_am. The
index is used nicely for simple sorted queries, like in the example below.

The table is defined as:
postgres=# \d pbch
                               Table "public.pbch"
 Column |        Type         |                     Modifiers
     
--------+---------------------+---------------------------------------------------
 mol    | molecule('default') | 
 id     | integer             | not null default
nextval('pbch_id_seq'::regclass)
Indexes:
    "pbch_id_idx" PRIMARY KEY, btree (id)
    "pbch_index" sortedchemindex (mol)

And we get the explain plan:
postgres=# explain select mol from pbch where 'Br' |<| mol order by mol;
                                    QUERY PLAN
      
-----------------------------------------------------------------------------------
 Index Scan using pbch_index on pbch  (cost=102.60..17729.49 rows=2349
width=1146)
   Index Cond: ('Br'::molecule |<| mol)
(2 rows)

No sorting is performed, great!

However on joint queries the plan optimizer decides to use hash join and
sorting although it is much more expensive than performing a nested loop and
rely on the sorting in the index:

postgres=# explain select * from pbch p, real_table r where 'Br' |<| p.mol
and r.val < 0.8 and p.id = r.id order by mol;
                                           QUERY PLAN
                    
-------------------------------------------------------------------------------------------------
 Sort  (cost=18531.27..18535.95 rows=1872 width=1158)
   Sort Key: p.mol
   ->  Hash Join  (cost=775.10..18429.52 rows=1872 width=1158)
         Hash Cond: (p.id = r.id)
         ->  Index Scan using pbch_index on pbch p  (cost=102.60..17729.49
rows=2349 width=1150)
               Index Cond: ('Br'::molecule |<| mol)
         ->  Hash  (cost=423.50..423.50 rows=19920 width=8)
               ->  Seq Scan on real_table r  (cost=0.00..423.50 rows=19920
width=8)
                     Filter: (val < '0.8'::double precision)
(9 rows)

postgres=# select * from pbch p, real_table r where 'Br' |<| p.mol and r.val
< 0.8 and p.id = r.id order by mol;
Time: 5881,363 ms

If I set not to use sorting explicitely:
postgres=# set enable_sort to off;
SET

Then the plan becomes:
postgres=# explain select * from pbch p, real_table r where 'Br' |<| p.mol
and r.val < 0.8 and p.id = r.id order by mol;
                                         QUERY PLAN
               
--------------------------------------------------------------------------------------------
 Nested Loop  (cost=102.89..19203.30 rows=1872 width=1158)
   ->  Index Scan using pbch_index on pbch p  (cost=102.60..17729.49
rows=2349 width=1150)
         Index Cond: ('Br'::molecule |<| mol)
   ->  Index Scan using real_table_id_idx on real_table r  (cost=0.29..0.62
rows=1 width=8)
         Index Cond: (id = p.id)
         Filter: (val < '0.8'::double precision)
(6 rows)

And the execution is much faster because the expensive comparisons performed
by molecule_relevance_compare() do not need to be performed:
postgres=# select * from pbch p, real_table r where 'Br' |<| p.mol and r.val
< 0.8 and p.id = r.id order by mol;
Time: 74,368 ms

It can be seen that the cost of molecule_relevance_compare() has no effect
on the sorting cost because if I reset it to another value the hash join
execution plan remains the same with the same costs.

The below examle illustrates it:

1. molecule_relevance_compare() cost set to 40000:
postgres=# explain select molecule_relevance_compare(mol, 'C') from pbch;

                           QUERY PLAN                            
-----------------------------------------------------------------
 Seq Scan on pbch  (cost=0.00..2324917.10 rows=23210 width=1146)
(1 row)

postgres=# explain select * from pbch p join real_table r on p.id = r.id
where 'Br' |<| p.mol and r.val < 0.8 order by mol;
                                           QUERY PLAN
                    
-------------------------------------------------------------------------------------------------
 Sort  (cost=18531.27..18535.95 rows=1872 width=1153)
   Sort Key: p.mol
   ->  Hash Join  (cost=775.10..18429.52 rows=1872 width=1153)
         Hash Cond: (p.id = r.id)
         ->  Index Scan using pbch_index on pbch p  (cost=102.60..17729.49
rows=2349 width=1145)
               Index Cond: ('Br'::molecule |<| mol)
         ->  Hash  (cost=423.50..423.50 rows=19920 width=8)
               ->  Seq Scan on real_table r  (cost=0.00..423.50 rows=19920
width=8)
                     Filter: (val < '0.8'::double precision)
(9 rows)

2. molecule_relevance_compare() cost set to 80000: 
postgres=# explain select molecule_relevance_compare(mol, 'C') from pbch;
                           QUERY PLAN                            
-----------------------------------------------------------------
 Seq Scan on pbch  (cost=0.00..4645917.10 rows=23210 width=1141)
(1 row)

Cost increased from 2324917.10 to 4645917.10!

postgres=# explain select * from pbch p join real_table r on p.id = r.id
where 'Br' |<| p.mol and r.val < 0.8 order by mol;
                                           QUERY PLAN
                    
-------------------------------------------------------------------------------------------------
 Sort  (cost=18531.27..18535.95 rows=1872 width=1153)
   Sort Key: p.mol
   ->  Hash Join  (cost=775.10..18429.52 rows=1872 width=1153)
         Hash Cond: (p.id = r.id)
         ->  Index Scan using pbch_index on pbch p  (cost=102.60..17729.49
rows=2349 width=1145)
               Index Cond: ('Br'::molecule |<| mol)
         ->  Hash  (cost=423.50..423.50 rows=19920 width=8)
               ->  Seq Scan on real_table r  (cost=0.00..423.50 rows=19920
width=8)
                     Filter: (val < '0.8'::double precision)
(9 rows)

But there is no change in the cost here!!!

Please reply and let me know when this problem can be fixed.

Best regards,
Zsuzsanna Szabo
Developer at chemaxon.com




--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

pgsql-bugs by date:

Previous
From: "David G. Johnston"
Date:
Subject: Re: [BUGS] BUG #14494: Regression - Null arrays are not queryable
Next
From: zszabo@chemaxon.com
Date:
Subject: [BUGS] BUG #14496: Cost of comparator is not taken into account in sorting