Thread: string casting for index usage

string casting for index usage

From
Michael Adler
Date:
In porting an application from v7.2 and v7.3, I noticed that a join on a varchar column and a text column was ignoring
indicesthat were helpful in v7.2. When I explicitly cast the text to a varchar (or set ENABLE_SEQSCAN TO false) the
indexis scanned and it works as efficiently as in v7.2.  

Obviously there were many casting improvements made in 7.3, but our application doesn't exactly see it that way. Is
explicitcasting the only solution (other than schema modification)? I haven't found anything in the documentation on
thissubject. 

Thanks,
Mike

Re: string casting for index usage

From
Tom Lane
Date:
Michael Adler <adler@pobox.com> writes:
> In porting an application from v7.2 and v7.3, I noticed that a join on a varchar column and a text column was
ignoringindices that were helpful in v7.2. When I explicitly cast the text to a varchar (or set ENABLE_SEQSCAN TO
false)the index is scanned and it works as efficiently as in v7.2.  

Maybe you should be moving to 7.4, instead.

A desultory test didn't show any difference between 7.2.4 and 7.3.6
in this respect, however.  Perhaps you forgot to ANALYZE yet in the
new database?

            regards, tom lane

Re: string casting for index usage

From
Tom Lane
Date:
Michael Adler <adler@pobox.com> writes:
> On Thu, Mar 18, 2004 at 03:39:12PM -0500, Tom Lane wrote:
>> A desultory test didn't show any difference between 7.2.4 and 7.3.6
>> in this respect, however.  Perhaps you forgot to ANALYZE yet in the
>> new database?

> I have a test with sample data and queries to demonstrate what I'm seeing.

Ah.  I had been testing the equivalent of this query with an INNER join
instead of a LEFT join.  Both 7.2 and 7.3 pick a plan with an inner
indexscan on t1 in that case.  The LEFT join prevents use of such a
plan, and the only way to do it quickly in those releases is to use an
inner indexscan on t2.

7.2 is really cheating here, because what is happening under the hood is
that the parser resolves the query as "textcol texteq varcharcol::text",
there not being any direct text=varchar operator.  (text is chosen as
the preferred type over varchar when it would otherwise be a coin flip.)
But then the planner would simply assume that it's okay to substitute
varchareq for texteq, apparently on the grounds that if the input types
are binary compatible then the operators must be interchangeable.  That
made it possible to match the join clause to the varchar-opclass index
on t2.  But of course this theory is ridiculous on its face ... it
happens to be okay for varchar and text but in general you'd not have
the same comparison semantics for two different operators.  (As an
example, int4 and OID are binary compatible but their index operators
are definitely not interchangeable, because one is signed comparison and
the other unsigned.)

7.3 is an intermediate state in which we'd ripped out the bogus planner
assumption but not developed fully adequate substitutes.

7.4 is substantially smarter than either, and can generate merge and
hash joins as well as ye plain olde indexed nestloop for this query.
In a quick test, it seemed that all three plan types yielded about the
same runtimes for this query with this much data.  I didn't have time
to try scaling up the amount of data to see where things went, but I'd
expect the nestloop to be a loser at large scales even with an inner
indexscan.

Anyway, bottom line is that 7.4 and CVS tip are competitive with 7.2
again, only they do it right this time ...

            regards, tom lane

Re: string casting for index usage

From
Michael Adler
Date:
On Thu, Mar 18, 2004 at 03:39:12PM -0500, Tom Lane wrote:
> Michael Adler <adler@pobox.com> writes:
> > In porting an application from v7.2 and v7.3, I noticed that a join on a varchar column and a text column was
ignoringindices that were helpful in v7.2. When I explicitly cast the text to a varchar (or set ENABLE_SEQSCAN TO
false)the index is scanned and it works as efficiently as in v7.2.  
>
> Maybe you should be moving to 7.4, instead.

That's a fair suggestion, but it's not practical for our 75 sites, most without decent network access. If this is in
factaddressed in newer releases, then my point is mostly inconsequential. 

We use Debian stable (7.2.1-2woody4) and Debian testing (7.3.4-9).

> A desultory test didn't show any difference between 7.2.4 and 7.3.6
> in this respect, however.  Perhaps you forgot to ANALYZE yet in the
> new database?

I have a test with sample data and queries to demonstrate what I'm seeing. I hope it is useful.

Having to do manual casts is not cruel and unusual, but it's not encouraging to see performance go down after an
upgrade.If anyone has any clever solutions, let me know.  

tables, data, and queries:
http://www.panix.com/~adler/manual-cast-for-index-scan.sql

my test output:
http://www.panix.com/~adler/manual-cast-for-index-scan_7.3.4-9.out
http://www.panix.com/~adler/manual-cast-for-index-scan_7.2.1-2woody4.out

(the times are not horrific in these specific examples, but the sequential scan makes them unscalable).


manual-cast-for-index-scan_7.3.4-9.out:

DROP TABLE t1;
DROP TABLE
DROP TABLE t2;
DROP TABLE
CREATE TABLE t1 (
    key_col text,
    grp text
);
CREATE TABLE
COPY t1 FROM stdin;
CREATE UNIQUE INDEX tempindex1 ON t1 USING btree (key_col);
CREATE INDEX
CREATE TABLE t2 (
    item_num character varying(5),
    key_col character varying(14)
);
CREATE TABLE
COPY t2 FROM stdin;
CREATE INDEX tempindex2 ON t2 USING btree (key_col);
CREATE INDEX
VACUUM ANALYZE;
VACUUM
SELECT version();
 PostgreSQL 7.3.4 on i386-pc-linux-gnu, compiled by GCC i386-linux-gcc (GCC) 3.3.2 (Debian)

EXPLAIN ANALYZE  SELECT item_num, t1.key_col FROM t1 LEFT JOIN t2 ON (t2.key_col = t1.key_col) WHERE grp = '24';
 Nested Loop  (cost=0.00..23803.27 rows=194 width=31) (actual time=20.95..1401.46 rows=69 loops=1)
   Join Filter: (("inner".key_col)::text = "outer".key_col)
   ->  Seq Scan on t1  (cost=0.00..492.94 rows=194 width=18) (actual time=0.32..30.27 rows=69 loops=1)
         Filter: (grp = '24'::text)
   ->  Seq Scan on t2  (cost=0.00..66.87 rows=4287 width=13) (actual time=0.01..12.06 rows=4287 loops=69)
 Total runtime: 1401.73 msec

EXPLAIN ANALYZE  SELECT item_num, t1.key_col FROM t1 LEFT JOIN t2 ON (t2.key_col::text = t1.key_col) WHERE grp = '24';
 Nested Loop  (cost=0.00..23803.27 rows=194 width=31) (actual time=20.27..1398.82 rows=69 loops=1)
   Join Filter: (("inner".key_col)::text = "outer".key_col)
   ->  Seq Scan on t1  (cost=0.00..492.94 rows=194 width=18) (actual time=0.26..25.91 rows=69 loops=1)
         Filter: (grp = '24'::text)
   ->  Seq Scan on t2  (cost=0.00..66.87 rows=4287 width=13) (actual time=0.01..12.02 rows=4287 loops=69)
 Total runtime: 1399.08 msec

EXPLAIN ANALYZE  SELECT item_num, t1.key_col FROM t1 LEFT JOIN t2 ON (t2.key_col = t1.key_col::varchar(24)) WHERE grp =
'24';
 Nested Loop  (cost=0.00..4819.13 rows=194 width=31) (actual time=0.52..27.46 rows=69 loops=1)
   ->  Seq Scan on t1  (cost=0.00..492.94 rows=194 width=18) (actual time=0.27..25.94 rows=69 loops=1)
         Filter: (grp = '24'::text)
   ->  Index Scan using tempindex2 on t2  (cost=0.00..22.17 rows=12 width=13) (actual time=0.01..0.01 rows=0 loops=69)
         Index Cond: (t2.key_col = ("outer".key_col)::character varying(24))
 Total runtime: 27.70 msec



manual-cast-for-index-scan_7.2.1-2woody4.out:

DROP TABLE t1;
DROP
DROP TABLE t2;
DROP
CREATE TABLE t1 (
    key_col text,
    grp text
);
CREATE
COPY t1 FROM stdin;
CREATE UNIQUE INDEX tempindex1 ON t1 USING btree (key_col);
CREATE
CREATE TABLE t2 (
    item_num character varying(5),
    key_col character varying(14)
);
CREATE
COPY t2 FROM stdin;
CREATE INDEX tempindex2 ON t2 USING btree (key_col);
CREATE
VACUUM ANALYZE;
VACUUM
SELECT version();
 PostgreSQL 7.2.1 on i686-pc-linux-gnu, compiled by GCC 2.95.4

EXPLAIN ANALYZE  SELECT item_num, t1.key_col FROM t1 LEFT JOIN t2 ON (t2.key_col = t1.key_col) WHERE grp = '24';
psql:castedneed.sql:29127: NOTICE:  QUERY PLAN:

Nested Loop  (cost=0.00..1405.88 rows=204 width=32) (actual time=0.46..40.60 rows=69 loops=1)
  ->  Seq Scan on t1  (cost=0.00..505.94 rows=204 width=18) (actual time=0.35..39.09 rows=69 loops=1)
  ->  Index Scan using tempindex2 on t2  (cost=0.00..4.27 rows=11 width=14) (actual time=0.01..0.01 rows=0 loops=69)
Total runtime: 40.81 msec

EXPLAIN
EXPLAIN ANALYZE  SELECT item_num, t1.key_col FROM t1 LEFT JOIN t2 ON (t2.key_col::text = t1.key_col) WHERE grp = '24';
psql:castedneed.sql:29128: NOTICE:  QUERY PLAN:

Nested Loop  (cost=0.00..1405.88 rows=204 width=32) (actual time=0.40..39.88 rows=69 loops=1)
  ->  Seq Scan on t1  (cost=0.00..505.94 rows=204 width=18) (actual time=0.35..38.44 rows=69 loops=1)
  ->  Index Scan using tempindex2 on t2  (cost=0.00..4.27 rows=11 width=14) (actual time=0.01..0.01 rows=0 loops=69)
Total runtime: 40.07 msec

EXPLAIN
EXPLAIN ANALYZE  SELECT item_num, t1.key_col FROM t1 LEFT JOIN t2 ON (t2.key_col = t1.key_col::varchar(24)) WHERE grp =
'24';
psql:castedneed.sql:29129: NOTICE:  QUERY PLAN:

Nested Loop  (cost=0.00..1416.66 rows=4383 width=32) (actual time=0.40..41.59 rows=69 loops=1)
  ->  Seq Scan on t1  (cost=0.00..505.94 rows=204 width=18) (actual time=0.36..40.05 rows=69 loops=1)
  ->  Index Scan using tempindex2 on t2  (cost=0.00..4.30 rows=11 width=14) (actual time=0.01..0.01 rows=0 loops=69)
Total runtime: 41.78 msec

EXPLAIN