Re: grep -f keyword data query - Mailing list pgsql-general

From Hiroyuki Sato
Subject Re: grep -f keyword data query
Date
Msg-id CA+Tq-RqUm+pPii60VCOxsf+K=9-oypZLth+aW3VW6g9BgfnSBQ@mail.gmail.com
Whole thread Raw
In response to Re: grep -f keyword data query  (Arjen Nienhuis <a.g.nienhuis@gmail.com>)
List pgsql-general
Hello Arjen.

Thank you for replying.

I'll try OR query on my environment.
Thanks.



2016年1月4日(月) 23:03 Arjen Nienhuis <a.g.nienhuis@gmail.com>:


On Jan 4, 2016 09:45, "Hiroyuki Sato" <hiroysato@gmail.com> wrote:
>
> Hello Arjen
>
> Thank you for replying.
>
> 2016年1月4日(月) 16:49 Arjen Nienhuis <a.g.nienhuis@gmail.com>:
>>
>>
>> On Dec 28, 2015 00:55, "Hiroyuki Sato" <hiroysato@gmail.com> wrote:
>> >
>> > Hello Andreas and Tom
>> >
>> > Thank you for replying.
>> >
>> > Sorry, I re-created my questions. I was mis-pasted query log on previous question.
>> > (@~ operator is PGroonga extension (http://pgroonga.github.io))
>> > Please ignore it. 
>> >
>> > Best regards.
>> >
>> > 1, Problem.
>> >   (1) Following query is exteme slow. (478sec)
>> >     SELECT
>> >       u.url 
>> >     FROM 
>> >       url_lists4 u,
>> >       keywords4 k
>> >     WHERE
>> >       u.url like k.url
>> >     AND
>> >       k.name = 'esc_url';
>> >
>> >
>> >   (2) grep -f kwd.txt sample.txt (exec time under 1sec)
>> >
>>
>> These are not the same 'query'. Grep will match any of the patterns for each url while postgres finds all matching combinations. You need to use '... OR ... OR ...' or 'ANY()' in postgres to get the same result.
>
> I thought the following query are same meaning.
> Could you point me same examples about two differences?

If one URL matches two patterns, grep will return the match only once. In your SQL query it will match multiple times and will be in the result multiple times.

You can test this by putting both 'http' and 'yahoo' in the wordlist.

>
>   (1) u.url like k.url
>   (2) u.url like 'k.url 1', or u.url like 'k.url2' ...
>
>> > 2, Questions
>> >
>> >   (1) Is it possible to improve this query like the command ``grep -f keyword data``?
>>
>> I get the best results by using OR of all the different patterns:
>>
>> SELECT url FROM url_lists4
>> WHERE
>>     url LIKE 'http://ak.yahoo.co.jp/xwv/%'
>>     OR url LIKE 'http://ao.yahoo.co.jp/wdl/%'
>>     OR ...
>
>
> I'll try it.
>
> BTW Do you know how many OR can I use ?
> I have 5000 URLs.

There is no real limit. I tried with your test file (5000 patterns). I got 6 seconds planning time and 12 seconds execution time.

(I also tried making one big regular expression with url1|url2|... but that did fail)

>
>> In theory you could use:
>>
>> ... WHERE url LIKE ANY(ARRAY(SELECT url FROM keywords4 k WHERE k.name = 'esc_url'));
>>
>> but that's very slow.
>
> It's very interesting. It seems same query.
> I'll check EXPLAIN 
>
>  
>>
>> >   (2) What kind of Index should I create on url_lists table?
>>
>> Both btree text_pattern_ops and gin trigram on the URL.
>>
>> >     
>> > 3, Environment 
>> >   OS: CentOS7
>> >   PostgreSQL 9.4
>> >
>> > 4, sample source
>> >   https://github.com/hiroyuki-sato/postgres_like_test
>> >     
>> >
>> > 5, Create table 
>> >
>> >     drop table if exists url_lists4;
>> >     create table url_lists4 (
>> >       id int not null primary key,
>> >       url text not null
>> >     );
>> >     create index ix_url_url_lists4 on url_lists4(url);
>> >       
>> >     drop table if exists keywords4;
>> >     create table keywords4 (
>> >       id int not null primary key,
>> >       name varchar(40) not null,
>> >       url text not null
>> >     );
>> >
>> >     create index ix_url_keywords4 on keywords4(url);
>> >     create index ix_name_keywords4 on keywords4(name);
>> >       
>> >
>> >     \copy url_lists4(id,url) from 'sample.txt' with delimiter ',';
>> >     \copy keywords4(id,name,url) from 'keyword.txt' with delimiter ',';
>> >
>> >     vacuum url_lists4;
>> >     vacuum keywords4;
>> >     analyze url_lists4;
>> >     analyze keywords4;
>> >
>> > 6, Query
>> >
>> >     EXPLAIN SELECT
>> >       u.url 
>> >     FROM 
>> >       url_lists4 u,
>> >       keywords4 k
>> >     WHERE
>> >       u.url like k.url
>> >     AND
>> >       k.name = 'esc_url';
>> >
>> >     EXPLAIN ANALYZE SELECT
>> >       u.url 
>> >     FROM 
>> >       url_lists4 u,
>> >       keywords4 k
>> >     WHERE
>> >       u.url like k.url
>> >     AND
>> >       k.name = 'esc_url';
>> >
>> >
>> >     SELECT
>> >       u.url 
>> >     FROM 
>> >       url_lists4 u,
>> >       keywords4 k
>> >     WHERE
>> >       u.url like k.url
>> >     AND
>> >       k.name = 'esc_url';
>> >
>> > 7, EXPLAIN
>> >
>> >                                      QUERY PLAN                                  
>> >     -----------------------------------------------------------------------------
>> >      Nested Loop  (cost=0.00..37510799.00 rows=12500000 width=57)
>> >        Join Filter: (u.url ~~ k.url)
>> >        ->  Seq Scan on url_lists4 u  (cost=0.00..10682.00 rows=500000 width=57)
>> >        ->  Materialize  (cost=0.00..129.50 rows=5000 width=28)
>> >              ->  Seq Scan on keywords4 k  (cost=0.00..104.50 rows=5000 width=28)
>> >                    Filter: ((name)::text = 'esc_url'::text)
>> >     (6 rows)
>> >
>> > 8, EXPLAIN ANALYZE
>> >                                                              QUERY PLAN                                                          
>> >     -----------------------------------------------------------------------------------------------------------------------------
>> >      Nested Loop  (cost=0.00..37510799.00 rows=12500000 width=57) (actual time=6011.642..478011.117 rows=4850 loops=1)
>> >        Join Filter: (u.url ~~ k.url)
>> >        Rows Removed by Join Filter: 2499995150
>> >        ->  Seq Scan on url_lists4 u  (cost=0.00..10682.00 rows=500000 width=57) (actual time=0.034..192.646 rows=500000 loops=1)
>> >        ->  Materialize  (cost=0.00..129.50 rows=5000 width=28) (actual time=0.000..0.261 rows=5000 loops=500000)
>> >              ->  Seq Scan on keywords4 k  (cost=0.00..104.50 rows=5000 width=28) (actual time=0.021..1.705 rows=5000 loops=1)
>> >                    Filter: ((name)::text = 'esc_url'::text)
>> >      Planning time: 0.061 ms
>> >      Execution time: 478011.773 ms
>> >     (9 rows)
>> >
>> >
>> > 2015年12月28日(月) 3:39 Tom Lane <tgl@sss.pgh.pa.us>:
>> >>
>> >> Andreas Kretschmer <andreas@a-kretschmer.de> writes:
>> >> >> Tom Lane <tgl@sss.pgh.pa.us> hat am 27. Dezember 2015 um 19:11 geschrieben:
>> >> >> What in the world is this @~ operator?  And what sort of index are
>> >> >> you using now, that can accept it?  Are the rowcount estimates in
>> >> >> the EXPLAIN output accurate?  (If they are, it's hardly surprising
>> >> >> that the query takes a long time.)
>> >>
>> >> > in a privat mail he called an other operator: ~~. I think, the @~ is an error.
>> >>
>> >> Well, ~~ isn't directly indexable by btree indexes either, so there's
>> >> still something wrong with either the EXPLAIN output or the claimed
>> >> index definitions.
>> >>
>> >>                         regards, tom lane
>
>
> Best regards.
>  

pgsql-general by date:

Previous
From: Andreas Kretschmer
Date:
Subject: Re: Streaming replication stacked.
Next
From: Andreas Kretschmer
Date:
Subject: Re: Streaming replication stacked.