Re: grep -f keyword data query - Mailing list pgsql-general

From Arjen Nienhuis
Subject Re: grep -f keyword data query
Date
Msg-id CAG6W84JXVLnQ6njNyM4-itC=guZBq9ScPaQuTWzxWMmrJfUz2g@mail.gmail.com
Whole thread Raw
In response to Re: grep -f keyword data query  (Hiroyuki Sato <hiroysato@gmail.com>)
Responses Re: grep -f keyword data query  (Hiroyuki Sato <hiroysato@gmail.com>)
List pgsql-general


On Dec 28, 2015 00:55, "Hiroyuki Sato" <hiroysato@gmail.com> wrote:
>
> Hello Andreas and Tom
>
> Thank you for replying.
>
> Sorry, I re-created my questions. I was mis-pasted query log on previous question.
> (@~ operator is PGroonga extension (http://pgroonga.github.io))
> Please ignore it. 
>
> Best regards.
>
> 1, Problem.
>   (1) Following query is exteme slow. (478sec)
>     SELECT
>       u.url 
>     FROM 
>       url_lists4 u,
>       keywords4 k
>     WHERE
>       u.url like k.url
>     AND
>       k.name = 'esc_url';
>
>
>   (2) grep -f kwd.txt sample.txt (exec time under 1sec)
>

These are not the same 'query'. Grep will match any of the patterns for each url while postgres finds all matching combinations. You need to use '... OR ... OR ...' or 'ANY()' in postgres to get the same result.

> 2, Questions
>
>   (1) Is it possible to improve this query like the command ``grep -f keyword data``?

I get the best results by using OR of all the different patterns:

SELECT url FROM url_lists4
WHERE
    url LIKE 'http://ak.yahoo.co.jp/xwv/%'
    OR url LIKE 'http://ao.yahoo.co.jp/wdl/%'
    OR ...

In theory you could use:

... WHERE url LIKE ANY(ARRAY(SELECT url FROM keywords4 k WHERE k.name = 'esc_url'));

but that's very slow.

>   (2) What kind of Index should I create on url_lists table?

Both btree text_pattern_ops and gin trigram on the URL.

>     
> 3, Environment 
>   OS: CentOS7
>   PostgreSQL 9.4
>
> 4, sample source
>   https://github.com/hiroyuki-sato/postgres_like_test
>     
>
> 5, Create table 
>
>     drop table if exists url_lists4;
>     create table url_lists4 (
>       id int not null primary key,
>       url text not null
>     );
>     create index ix_url_url_lists4 on url_lists4(url);
>       
>     drop table if exists keywords4;
>     create table keywords4 (
>       id int not null primary key,
>       name varchar(40) not null,
>       url text not null
>     );
>
>     create index ix_url_keywords4 on keywords4(url);
>     create index ix_name_keywords4 on keywords4(name);
>       
>
>     \copy url_lists4(id,url) from 'sample.txt' with delimiter ',';
>     \copy keywords4(id,name,url) from 'keyword.txt' with delimiter ',';
>
>     vacuum url_lists4;
>     vacuum keywords4;
>     analyze url_lists4;
>     analyze keywords4;
>
> 6, Query
>
>     EXPLAIN SELECT
>       u.url 
>     FROM 
>       url_lists4 u,
>       keywords4 k
>     WHERE
>       u.url like k.url
>     AND
>       k.name = 'esc_url';
>
>     EXPLAIN ANALYZE SELECT
>       u.url 
>     FROM 
>       url_lists4 u,
>       keywords4 k
>     WHERE
>       u.url like k.url
>     AND
>       k.name = 'esc_url';
>
>
>     SELECT
>       u.url 
>     FROM 
>       url_lists4 u,
>       keywords4 k
>     WHERE
>       u.url like k.url
>     AND
>       k.name = 'esc_url';
>
> 7, EXPLAIN
>
>                                      QUERY PLAN                                  
>     -----------------------------------------------------------------------------
>      Nested Loop  (cost=0.00..37510799.00 rows=12500000 width=57)
>        Join Filter: (u.url ~~ k.url)
>        ->  Seq Scan on url_lists4 u  (cost=0.00..10682.00 rows=500000 width=57)
>        ->  Materialize  (cost=0.00..129.50 rows=5000 width=28)
>              ->  Seq Scan on keywords4 k  (cost=0.00..104.50 rows=5000 width=28)
>                    Filter: ((name)::text = 'esc_url'::text)
>     (6 rows)
>
> 8, EXPLAIN ANALYZE
>                                                              QUERY PLAN                                                          
>     -----------------------------------------------------------------------------------------------------------------------------
>      Nested Loop  (cost=0.00..37510799.00 rows=12500000 width=57) (actual time=6011.642..478011.117 rows=4850 loops=1)
>        Join Filter: (u.url ~~ k.url)
>        Rows Removed by Join Filter: 2499995150
>        ->  Seq Scan on url_lists4 u  (cost=0.00..10682.00 rows=500000 width=57) (actual time=0.034..192.646 rows=500000 loops=1)
>        ->  Materialize  (cost=0.00..129.50 rows=5000 width=28) (actual time=0.000..0.261 rows=5000 loops=500000)
>              ->  Seq Scan on keywords4 k  (cost=0.00..104.50 rows=5000 width=28) (actual time=0.021..1.705 rows=5000 loops=1)
>                    Filter: ((name)::text = 'esc_url'::text)
>      Planning time: 0.061 ms
>      Execution time: 478011.773 ms
>     (9 rows)
>
>
> 2015年12月28日(月) 3:39 Tom Lane <tgl@sss.pgh.pa.us>:
>>
>> Andreas Kretschmer <andreas@a-kretschmer.de> writes:
>> >> Tom Lane <tgl@sss.pgh.pa.us> hat am 27. Dezember 2015 um 19:11 geschrieben:
>> >> What in the world is this @~ operator?  And what sort of index are
>> >> you using now, that can accept it?  Are the rowcount estimates in
>> >> the EXPLAIN output accurate?  (If they are, it's hardly surprising
>> >> that the query takes a long time.)
>>
>> > in a privat mail he called an other operator: ~~. I think, the @~ is an error.
>>
>> Well, ~~ isn't directly indexable by btree indexes either, so there's
>> still something wrong with either the EXPLAIN output or the claimed
>> index definitions.
>>
>>                         regards, tom lane

pgsql-general by date:

Previous
From: Tom Lane
Date:
Subject: Re: Regex "embedded options" does only work on the whole pattern?
Next
From: Hiroyuki Sato
Date:
Subject: Re: grep -f keyword data query