Re: indexing elements of a csv ? - Mailing list pgsql-general

From Ian Lawrence Barwick
Subject Re: indexing elements of a csv ?
Date
Msg-id CAB8KJ=iJtdtRdw69fcFQLnWBd8OW6CR9PHUndU=C9mAbhCEvfg@mail.gmail.com
Whole thread Raw
In response to Re: indexing elements of a csv ?  (Ian Lawrence Barwick <barwick@gmail.com>)
List pgsql-general
2013/3/13 Ian Lawrence Barwick <barwick@gmail.com>:
> 2013/3/12 Gauthier, Dave <dave.gauthier@intel.com>:
>> Hi:
>>
>> v9.0.1 on linux.
>>
>> I have a table with a column that is a csv.  Users will select records based
>> upon the existence of an element of the csv.  There is an index on that
>> column but I'm thinking that it won't be of much use in this situation.  Is
>> there a way to facilitate these queries?
>>
>> Example:
>>
>> create table foo (col0 text, col1 text);
>>
>> create index foo_col1 on foo (col1);
>>
>> insert into foo (col0,col1) values
>> ('moe','aa,bbb,c'),('larry','xxxxx,bbb,yyy'),('curly','m,nnnn,oo');
>>
>> now...
>>
>> select col0 from foo where <the csv element 'bbb' exists as a csv element of
>> col1>
>>
>>
>> Some attempts, which get the right answers, but which probably won't be very
>> efficient...
>>
>> select col0 from foo where string_to_array('bbb','') <@
>> string_to_array(col1);
>>
>> select col0 from foo where ','||col1||','  like  '%,bbb,%';
>>
>> select col0 from foo where ((col1 like 'bbb,%') or (col1 like '%,bbb,%') or
>> (col1 like '%,bbb'));
>>
>> Long shot, but I thought I'd ask anyway.
>
> A GIN index might do the trick:
>
> CREATE INDEX ix_col1_ix ON foo USING GIN(string_to_array(col1,','));
>
> (This is assuming the CSV values can be cleanly converted to
> an array using "string_to_array()").
>
> You could then query it with:
> SELECT col0 FROM foo WHERE string_to_array(col1,',') @>  '{bbb}'::text[];

Just out of interest, I populated the table with around 1,000,000 rows of
randomly generated data (three items of random upper case characters
in col1), results with and without index below (using an untuned 9.2
installation
on a laptop with a slow hard drive).
Note that adding the index doubled the total table size, which might
be something
to watch out for if the table is very big and you have a lot of unique
values in the
"CSV" column.

Regards

Ian Barwick


testdb=# SELECT * from foo where string_to_array(col1,',') @>  '{PKRY}'::text[];
  col0  |       col1
--------+-------------------
 ARWC   | JIJ,MBDVU,PKRY
 FUNWOA | JKEK,PKRY,MQFUQTJ
 PJGTD  | KSO,HSTB,PKRY
(3 rows)

Time: 1325.536 ms
testdb=# explain SELECT * from foo where string_to_array(col1,',') @>
'{PKRY}'::text[];
                            QUERY PLAN
------------------------------------------------------------------
 Seq Scan on foo  (cost=0.00..28400.42 rows=5021 width=76)
   Filter: (string_to_array(col1, ','::text) @> '{PKRY}'::text[])
(2 rows)

testdb=# CREATE INDEX ix_col1_ix ON foo using gin(string_to_array(col1,','));
CREATE INDEX
Time: 170533.158 ms
testdb=# ANALYZE foo;
ANALYZE
Time: 1431.665 ms
testdb=# SELECT * from foo where string_to_array(col1,',') @>  '{PKRY}'::text[];
  col0  |       col1
--------+-------------------
 ARWC   | JIJ,MBDVU,PKRY
 FUNWOA | JKEK,PKRY,MQFUQTJ
 PJGTD  | KSO,HSTB,PKRY
(3 rows)

Time: 0.906 ms
testdb=# explain SELECT * from foo where string_to_array(col1,',') @>
'{PKRY}'::text[];
                                 QUERY PLAN
----------------------------------------------------------------------------
 Bitmap Heap Scan on foo  (cost=20.79..389.58 rows=101 width=24)
   Recheck Cond: (string_to_array(col1, ','::text) @> '{PKRY}'::text[])
   ->  Bitmap Index Scan on ix_col1_ix  (cost=0.00..20.76 rows=101 width=0)
         Index Cond: (string_to_array(col1, ','::text) @> '{PKRY}'::text[])
(4 rows)

Time: 0.377 ms


pgsql-general by date:

Previous
From: Steve Erickson
Date:
Subject: Re: indexing elements of a csv ?
Next
From: Perry Smith
Date:
Subject: Testing Technique when using a DB