Thread: grep -f keyword data query
Hiroyuki Sato <hiroysato@gmail.com> wrote:
> Hello.
>
> I would like to create the query like the following.
> It work well, but extreme slow.
> Is it possible to improve this query like the command ``grep -f keyword data``?
>
> What kind of Index should I create on url_lists table?

Can you show us the CREATE TABLE and CREATE INDEX statements? And the
EXPLAIN ANALYSE of your query? And, finally, the PG version? You got an
index scan, seems okay to me.

Andreas
--
Really, I'm not out to destroy Microsoft. That will just be a completely
unintentional side effect. (Linus Torvalds)
"If I was god, I would recompile penguin with --enable-fly." (unknown)
Kaufbach, Saxony, Germany, Europe. N 51.05082°, E 13.56889°
Hiroyuki Sato <hiroysato@gmail.com> writes:
> I would like to create the query like the following.
> It work well, but extreme slow.
> ...
> Explain output.
> Nested Loop  (cost=0.45..1570856063.28 rows=57122000000 width=57)
>   ->  Index Scan using ix_name_keywords on keywords k  (cost=0.28..221.78 rows=5000 width=30)
>         Index Cond: ((name)::text = 'esc_url'::text)
>   ->  Index Only Scan using ix_url_url_lists on url_lists u  (cost=0.17..199927.17 rows=11424400 width=57)
>         Index Cond: (url @~ k.url)

What in the world is this @~ operator?  And what sort of index are
you using now, that can accept it?  Are the rowcount estimates in
the EXPLAIN output accurate?  (If they are, it's hardly surprising
that the query takes a long time.)

			regards, tom lane
> Tom Lane <tgl@sss.pgh.pa.us> wrote on 27 December 2015 at 19:11:
>
> Hiroyuki Sato <hiroysato@gmail.com> writes:
> > I would like to create the query like the following.
> > It work well, but extreme slow.
> > ...
> > Explain output.
> > Nested Loop  (cost=0.45..1570856063.28 rows=57122000000 width=57)
> >   ->  Index Scan using ix_name_keywords on keywords k  (cost=0.28..221.78 rows=5000 width=30)
> >         Index Cond: ((name)::text = 'esc_url'::text)
> >   ->  Index Only Scan using ix_url_url_lists on url_lists u  (cost=0.17..199927.17 rows=11424400 width=57)
> >         Index Cond: (url @~ k.url)
>
> What in the world is this @~ operator?  And what sort of index are
> you using now, that can accept it?  Are the rowcount estimates in
> the EXPLAIN output accurate?  (If they are, it's hardly surprising
> that the query takes a long time.)

In a private mail he named another operator: ~~. I think the @~ is an error.

Here are the tables and indexes (from his private mail):

create table url_lists (
  id int not null primary key,
  url text not null
);

create table keywords (
  id int not null primary key,
  name varchar(40) not null,
  url text not null
);

create index ix_url_url_lists on url_lists(url);
create index ix_url_keywords on keywords(url);
create index ix_name_keywords on keywords(name);

And he gave me a new EXPLAIN, but not an EXPLAIN ANALYSE (it was the same
plan as above).

Regards, Andreas.
Andreas Kretschmer <andreas@a-kretschmer.de> writes:
>> Tom Lane <tgl@sss.pgh.pa.us> wrote on 27 December 2015 at 19:11:
>> What in the world is this @~ operator? And what sort of index are
>> you using now, that can accept it? Are the rowcount estimates in
>> the EXPLAIN output accurate? (If they are, it's hardly surprising
>> that the query takes a long time.)
> In a private mail he named another operator: ~~. I think the @~ is an error.
Well, ~~ isn't directly indexable by btree indexes either, so there's
still something wrong with either the EXPLAIN output or the claimed
index definitions.
regards, tom lane
Hiroyuki Sato <hiroysato@gmail.com> writes:
> Sorry, I re-created my questions. I was mis-pasted query log on previous
> question.
> (@~ operator is PGroonga extension (http://pgroonga.github.io))
> [ "like" is actually the operator of interest ]
Ah. You might get some good results with trigram indexes (ie,
contrib/pg_trgm and a GIST or GIN index). See
http://www.postgresql.org/docs/9.4/static/pgtrgm.html
regards, tom lane
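For concreteness, here is a minimal sketch of the trigram setup Tom suggests, using the table and column names from the thread (the index name is invented for illustration):

```sql
-- pg_trgm ships in contrib; it provides the gin_trgm_ops operator class,
-- which lets a GIN index accelerate LIKE patterns, including ones that
-- are not left-anchored.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX ix_url_trgm_url_lists ON url_lists USING gin (url gin_trgm_ops);
```

After creating the index, run ANALYZE on the table so the planner has fresh statistics before re-trying the join.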
Hiroyuki Sato <hiroysato@gmail.com> writes:
> I re-created index with pg_trgm.
> Execution time is 210sec.
> Yes It is faster than btree index. But still slow.
> It is possible to improve this query speed?
> Should I use another query or idex?
Did you try a GIN index?
regards, tom lane
On Sun, Dec 27, 2015 at 3:53 PM, Hiroyuki Sato <hiroysato@gmail.com> wrote:
> Hello Andreas and Tom
>
> Thank you for replying.
>
> Sorry, I re-created my questions. I was mis-pasted query log on previous
> question.
> (@~ operator is PGroonga extension (http://pgroonga.github.io))
> Please ignore it.
>
> Best regards.
>
> 1, Problem.
> (1) Following query is exteme slow. (478sec)
> SELECT
> u.url
> FROM
> url_lists4 u,
> keywords4 k
> WHERE
> u.url like k.url
> AND
> k.name = 'esc_url';
>
>
> (2) grep -f kwd.txt sample.txt (exec time under 1sec)
Certainly not in my hands. The best I can get is 9 seconds.
>
> 2, Questions
>
> (1) Is it possible to improve this query like the command ``grep -f
> keyword data``?
You will not get a general tool to match a specialized tool in the
specialized tool's own domain. fgrep is good at what fgrep does.
Since your queries all have constant text strings at the beginning,
they could use the index. But if you are not using the C collation,
then you need to build a special index:
create index on url_lists4 (url text_pattern_ops);
But, the planner refuses to use this index for your query anyway,
because it can't see that the patterns are all left-anchored.
Really, your best bet is to refactor your url data so it is stored with a
url_prefix and url_suffix column. Then you can do exact matching
rather than pattern matching.
Cheers,
Jeff
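A hypothetical sketch of the refactoring Jeff describes: split each URL into a host part (compared with equality) and a path part (compared with a left-anchored LIKE). Column names, index names, and the regular expressions are invented for illustration; verify them against your real data.

```sql
-- Split the stored url into host and path columns.
ALTER TABLE url_lists4
  ADD COLUMN url_host text,
  ADD COLUMN url_path text;

UPDATE url_lists4
SET url_host = substring(url from '^[a-z]+://([^/]+)'),      -- e.g. 'www.yahoo.co.jp'
    url_path = substring(url from '^[a-z]+://[^/]+(/.*)$');  -- e.g. '/a/b/c'

-- text_pattern_ops makes the path column usable for left-anchored LIKE
-- even under a non-C collation.
CREATE INDEX ix_host_path_url_lists4
  ON url_lists4 (url_host, url_path text_pattern_ops);
```

With this layout, a lookup becomes an equality match on url_host plus a left-anchored LIKE on url_path, both of which the index can serve.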
On Tue, Dec 29, 2015 at 4:35, Jeff Janes <jeff.janes@gmail.com> wrote:
> But, the planner refuses to use this index for your query anyway,
> because it can't see that the patterns are all left-anchored.
>
> Really, your best bet is refactor your url data so it is stored with a
> url_prefix and url_suffix column.  Then you can do exact matching
> rather than pattern matching.

I see, exact matching is faster than pattern matching. But I need a
pattern match on the path part (i.e., http://www.yahoo.com/a/b/c/...).
I would like to pattern-match the '/a/b/c' part.
On 30 December 2015 at 04:21, Hiroyuki Sato <hiroysato@gmail.com> wrote:
> On Tue, Dec 29, 2015 at 4:35, Jeff Janes <jeff.janes@gmail.com> wrote:
>> But, the planner refuses to use this index for your query anyway,
>> because it can't see that the patterns are all left-anchored.
>>
>> Really, your best bet is refactor your url data so it is stored with a
>> url_prefix and url_suffix column.  Then you can do exact matching
>> rather than pattern matching.
>
> I see, exact matching is faster than pattern matching. But I need a
> pattern match on the path part (i.e., http://www.yahoo.com/a/b/c/...).
> I would like to pattern-match the '/a/b/c' part.

If your pattern matching is as simple as that, then why not split the
/a/b/c/ part out as mentioned by Jeff? Alternatively you could just write
a function which splits that out for you and returns it, then index that
function, and then just include a call to that function in the join
condition, matching with the equality operator. That'll allow hash and
merge joins to be possible again.
--
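A sketch of David's "alternatively" idea, with invented names: an IMMUTABLE function extracting the path part, an expression index on it, and an equality join on the function's result. The extraction regex is an assumption about the URL shape.

```sql
-- The function must be IMMUTABLE to be usable in an expression index.
CREATE OR REPLACE FUNCTION url_path_part(url text) RETURNS text AS $$
  SELECT substring(url from '^[a-z]+://[^/]+(/.*)$')
$$ LANGUAGE sql IMMUTABLE;

CREATE INDEX ix_url_path_url_lists4 ON url_lists4 (url_path_part(url));

-- Joining on equality of the extracted paths lets the planner consider
-- hash and merge joins instead of a LIKE-filtered nested loop.
SELECT u.url
FROM url_lists4 u
JOIN keywords4 k ON url_path_part(u.url) = url_path_part(k.url)
WHERE k.name = 'esc_url';
```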
On Wed, Dec 30, 2015 at 6:04, David Rowley <david.rowley@2ndquadrant.com> wrote:
> If your pattern matching is as simple as that, then why not split the
> /a/b/c/ part out as mentioned by Jeff? Alternatively you could just write
> a function which splits that out for you and returns it, then index that
> function, and then just include a call to that function in the join
> condition, matching with the equality operator. That'll allow hash and
> merge joins to be possible again.

Could you tell me more detail about the "alternatively" part?
It is a good idea to split host and path. I'll try it.

My matching patterns are the following:
1, http://www.yahoo.com/a/b/%    (host equal, path like)
2, http://%.yahoo.com/a/b/%      (host and path like)
On Dec 28, 2015 00:55, "Hiroyuki Sato" <hiroysato@gmail.com> wrote:
>
> Hello Andreas and Tom
>
> Thank you for replying.
>
> Sorry, I re-created my questions. I was mis-pasted query log on previous question.
> (@~ operator is PGroonga extension (http://pgroonga.github.io))
> Please ignore it.
>
> Best regards.
>
> 1, Problem.
> (1) Following query is exteme slow. (478sec)
> SELECT
> u.url
> FROM
> url_lists4 u,
> keywords4 k
> WHERE
> u.url like k.url
> AND
> k.name = 'esc_url';
>
>
> (2) grep -f kwd.txt sample.txt (exec time under 1sec)
>
These are not the same 'query'. Grep will match any of the patterns for each url while postgres finds all matching combinations. You need to use '... OR ... OR ...' or 'ANY()' in postgres to get the same result.
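One way to get grep's at-most-once semantics in SQL, sketched here as a semi-join using the table names from the thread:

```sql
-- EXISTS returns each url at most once, no matter how many patterns
-- match it, which mirrors grep's per-line behaviour.
SELECT u.url
FROM url_lists4 u
WHERE EXISTS (
  SELECT 1
  FROM keywords4 k
  WHERE k.name = 'esc_url'
    AND u.url LIKE k.url
);
```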
> 2, Questions
>
> (1) Is it possible to improve this query like the command ``grep -f keyword data``?
I get the best results by using OR of all the different patterns:
SELECT url FROM url_lists4
WHERE
url LIKE 'http://ak.yahoo.co.jp/xwv/%'
OR url LIKE 'http://ao.yahoo.co.jp/wdl/%'
OR ...
In theory you could use:
... WHERE url LIKE ANY(ARRAY(SELECT url FROM keywords4 k WHERE k.name = 'esc_url'));
but that's very slow.
> (2) What kind of Index should I create on url_lists table?
Both btree text_pattern_ops and gin trigram on the URL.
>
> 3, Environment
> OS: CentOS7
> PostgreSQL 9.4
>
> 4, sample source
> https://github.com/hiroyuki-sato/postgres_like_test
>
>
> 5, Create table
>
> drop table if exists url_lists4;
> create table url_lists4 (
> id int not null primary key,
> url text not null
> );
> create index ix_url_url_lists4 on url_lists4(url);
>
> drop table if exists keywords4;
> create table keywords4 (
> id int not null primary key,
> name varchar(40) not null,
> url text not null
> );
>
> create index ix_url_keywords4 on keywords4(url);
> create index ix_name_keywords4 on keywords4(name);
>
>
> \copy url_lists4(id,url) from 'sample.txt' with delimiter ',';
> \copy keywords4(id,name,url) from 'keyword.txt' with delimiter ',';
>
> vacuum url_lists4;
> vacuum keywords4;
> analyze url_lists4;
> analyze keywords4;
>
> 6, Query
>
> EXPLAIN SELECT
> u.url
> FROM
> url_lists4 u,
> keywords4 k
> WHERE
> u.url like k.url
> AND
> k.name = 'esc_url';
>
> EXPLAIN ANALYZE SELECT
> u.url
> FROM
> url_lists4 u,
> keywords4 k
> WHERE
> u.url like k.url
> AND
> k.name = 'esc_url';
>
>
> SELECT
> u.url
> FROM
> url_lists4 u,
> keywords4 k
> WHERE
> u.url like k.url
> AND
> k.name = 'esc_url';
>
> 7, EXPLAIN
>
> QUERY PLAN
> -----------------------------------------------------------------------------
> Nested Loop (cost=0.00..37510799.00 rows=12500000 width=57)
> Join Filter: (u.url ~~ k.url)
> -> Seq Scan on url_lists4 u (cost=0.00..10682.00 rows=500000 width=57)
> -> Materialize (cost=0.00..129.50 rows=5000 width=28)
> -> Seq Scan on keywords4 k (cost=0.00..104.50 rows=5000 width=28)
> Filter: ((name)::text = 'esc_url'::text)
> (6 rows)
>
> 8, EXPLAIN ANALYZE
> QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------
> Nested Loop (cost=0.00..37510799.00 rows=12500000 width=57) (actual time=6011.642..478011.117 rows=4850 loops=1)
> Join Filter: (u.url ~~ k.url)
> Rows Removed by Join Filter: 2499995150
> -> Seq Scan on url_lists4 u (cost=0.00..10682.00 rows=500000 width=57) (actual time=0.034..192.646 rows=500000 loops=1)
> -> Materialize (cost=0.00..129.50 rows=5000 width=28) (actual time=0.000..0.261 rows=5000 loops=500000)
> -> Seq Scan on keywords4 k (cost=0.00..104.50 rows=5000 width=28) (actual time=0.021..1.705 rows=5000 loops=1)
> Filter: ((name)::text = 'esc_url'::text)
> Planning time: 0.061 ms
> Execution time: 478011.773 ms
> (9 rows)
>
>
> On Mon, Dec 28, 2015 at 3:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Andreas Kretschmer <andreas@a-kretschmer.de> writes:
>> >> Tom Lane <tgl@sss.pgh.pa.us> wrote on 27 December 2015 at 19:11:
>> >> What in the world is this @~ operator? And what sort of index are
>> >> you using now, that can accept it? Are the rowcount estimates in
>> >> the EXPLAIN output accurate? (If they are, it's hardly surprising
>> >> that the query takes a long time.)
>>
>> > in a privat mail he called an other operator: ~~. I think, the @~ is an error.
>>
>> Well, ~~ isn't directly indexable by btree indexes either, so there's
>> still something wrong with either the EXPLAIN output or the claimed
>> index definitions.
>>
>> regards, tom lane
On 30 December 2015 at 13:56, Hiroyuki Sato <hiroysato@gmail.com> wrote:
> Could you tell me more detail about the "alternatively" part?
> It is a good idea to split host and path. I'll try it.
>
> My matching patterns are the following:
> 1, http://www.yahoo.com/a/b/%    (host equal, path like)
> 2, http://%.yahoo.com/a/b/%      (host and path like)

It seems I misunderstood your pattern matching. The example you supplied
earlier indicated you just needed to match the document part (/a/b/c/) and
ignore the protocol://host part, in which case you could have written a
function which took a text parameter, say "http://www.yahoo.com/a/b/c/",
and returned "/a/b/c", then performed:

create index on yourtable (thatfunction(yourcolumn));

However that method won't help you, as it seems your pattern matching is
more complex than the previous example that you supplied.
--
On Jan 4, 2016 09:45, "Hiroyuki Sato" <hiroysato@gmail.com> wrote:
>
> Hello Arjen
>
> Thank you for replying.
>
> On Mon, Jan 4, 2016 at 16:49, Arjen Nienhuis <a.g.nienhuis@gmail.com> wrote:
>>
>>
>> On Dec 28, 2015 00:55, "Hiroyuki Sato" <hiroysato@gmail.com> wrote:
>> >
>> > Hello Andreas and Tom
>> >
>> > Thank you for replying.
>> >
>> > Sorry, I re-created my questions. I was mis-pasted query log on previous question.
>> > (@~ operator is PGroonga extension (http://pgroonga.github.io))
>> > Please ignore it.
>> >
>> > Best regards.
>> >
>> > 1, Problem.
>> > (1) Following query is exteme slow. (478sec)
>> > SELECT
>> > u.url
>> > FROM
>> > url_lists4 u,
>> > keywords4 k
>> > WHERE
>> > u.url like k.url
>> > AND
>> > k.name = 'esc_url';
>> >
>> >
>> > (2) grep -f kwd.txt sample.txt (exec time under 1sec)
>> >
>>
>> These are not the same 'query'. Grep will match any of the patterns for each url while postgres finds all matching combinations. You need to use '... OR ... OR ...' or 'ANY()' in postgres to get the same result.
>
> I thought the following query are same meaning.
> Could you point me same examples about two differences?
If one URL matches two patterns, grep will return the match only once. In your SQL query it will match multiple times and will be in the result multiple times.
You can test this by putting both 'http' and 'yahoo' in the wordlist.
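Arjen's point can be demonstrated directly with grep; the file names below are illustrative:

```shell
# Create a one-line sample and a word list where BOTH patterns match it.
printf 'http://www.yahoo.co.jp/a/b\n' > sample.txt
printf 'http\nyahoo\n' > kwd.txt

# grep -f prints each matching line once, even though two patterns match:
grep -f kwd.txt sample.txt        # prints the line a single time
grep -c -f kwd.txt sample.txt     # prints 1
```

The equivalent SQL join would return that URL twice, once per matching pattern, which is part of why the two are not the same query.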
>
> (1) u.url like k.url
> (2) u.url like 'k.url 1', or u.url like 'k.url2' ...
>
>> > 2, Questions
>> >
>> > (1) Is it possible to improve this query like the command ``grep -f keyword data``?
>>
>> I get the best results by using OR of all the different patterns:
>>
>> SELECT url FROM url_lists4
>> WHERE
>> url LIKE 'http://ak.yahoo.co.jp/xwv/%'
>> OR url LIKE 'http://ao.yahoo.co.jp/wdl/%'
>> OR ...
>
>
> I'll try it.
>
> BTW Do you know how many OR can I use ?
> I have 5000 URLs.
There is no real limit. I tried with your test file (5000 patterns). I got 6 seconds planning time and 12 seconds execution time.
(I also tried making one big regular expression with url1|url2|... but that did fail)
>
>> In theory you could use:
>>
>> ... WHERE url LIKE ANY(ARRAY(SELECT url FROM keywords4 k WHERE k.name = 'esc_url'));
>>
>> but that's very slow.
>
> It's very interesting. It seems same query.
> I'll check EXPLAIN
>
>
>>
>> > (2) What kind of Index should I create on url_lists table?
>>
>> Both btree text_pattern_ops and gin trigram on the URL.
>>
>> >
>> > 3, Environment
>> > OS: CentOS7
>> > PostgreSQL 9.4
>> >
>> > 4, sample source
>> > https://github.com/hiroyuki-sato/postgres_like_test
>> >
>> >
>> > 5, Create table
>> >
>> > drop table if exists url_lists4;
>> > create table url_lists4 (
>> > id int not null primary key,
>> > url text not null
>> > );
>> > create index ix_url_url_lists4 on url_lists4(url);
>> >
>> > drop table if exists keywords4;
>> > create table keywords4 (
>> > id int not null primary key,
>> > name varchar(40) not null,
>> > url text not null
>> > );
>> >
>> > create index ix_url_keywords4 on keywords4(url);
>> > create index ix_name_keywords4 on keywords4(name);
>> >
>> >
>> > \copy url_lists4(id,url) from 'sample.txt' with delimiter ',';
>> > \copy keywords4(id,name,url) from 'keyword.txt' with delimiter ',';
>> >
>> > vacuum url_lists4;
>> > vacuum keywords4;
>> > analyze url_lists4;
>> > analyze keywords4;
>> >
>> > 6, Query
>> >
>> > EXPLAIN SELECT
>> > u.url
>> > FROM
>> > url_lists4 u,
>> > keywords4 k
>> > WHERE
>> > u.url like k.url
>> > AND
>> > k.name = 'esc_url';
>> >
>> > EXPLAIN ANALYZE SELECT
>> > u.url
>> > FROM
>> > url_lists4 u,
>> > keywords4 k
>> > WHERE
>> > u.url like k.url
>> > AND
>> > k.name = 'esc_url';
>> >
>> >
>> > SELECT
>> > u.url
>> > FROM
>> > url_lists4 u,
>> > keywords4 k
>> > WHERE
>> > u.url like k.url
>> > AND
>> > k.name = 'esc_url';
>> >
>> > 7, EXPLAIN
>> >
>> > QUERY PLAN
>> > -----------------------------------------------------------------------------
>> > Nested Loop (cost=0.00..37510799.00 rows=12500000 width=57)
>> > Join Filter: (u.url ~~ k.url)
>> > -> Seq Scan on url_lists4 u (cost=0.00..10682.00 rows=500000 width=57)
>> > -> Materialize (cost=0.00..129.50 rows=5000 width=28)
>> > -> Seq Scan on keywords4 k (cost=0.00..104.50 rows=5000 width=28)
>> > Filter: ((name)::text = 'esc_url'::text)
>> > (6 rows)
>> >
>> > 8, EXPLAIN ANALYZE
>> > QUERY PLAN
>> > -----------------------------------------------------------------------------------------------------------------------------
>> > Nested Loop (cost=0.00..37510799.00 rows=12500000 width=57) (actual time=6011.642..478011.117 rows=4850 loops=1)
>> > Join Filter: (u.url ~~ k.url)
>> > Rows Removed by Join Filter: 2499995150
>> > -> Seq Scan on url_lists4 u (cost=0.00..10682.00 rows=500000 width=57) (actual time=0.034..192.646 rows=500000 loops=1)
>> > -> Materialize (cost=0.00..129.50 rows=5000 width=28) (actual time=0.000..0.261 rows=5000 loops=500000)
>> > -> Seq Scan on keywords4 k (cost=0.00..104.50 rows=5000 width=28) (actual time=0.021..1.705 rows=5000 loops=1)
>> > Filter: ((name)::text = 'esc_url'::text)
>> > Planning time: 0.061 ms
>> > Execution time: 478011.773 ms
>> > (9 rows)
>> >
>> >
>> > On Mon, Dec 28, 2015 at 3:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> >>
>> >> Andreas Kretschmer <andreas@a-kretschmer.de> writes:
>> >> >> Tom Lane <tgl@sss.pgh.pa.us> wrote on 27 December 2015 at 19:11:
>> >> >> What in the world is this @~ operator? And what sort of index are
>> >> >> you using now, that can accept it? Are the rowcount estimates in
>> >> >> the EXPLAIN output accurate? (If they are, it's hardly surprising
>> >> >> that the query takes a long time.)
>> >>
>> >> > in a privat mail he called an other operator: ~~. I think, the @~ is an error.
>> >>
>> >> Well, ~~ isn't directly indexable by btree indexes either, so there's
>> >> still something wrong with either the EXPLAIN output or the claimed
>> >> index definitions.
>> >>
>> >> regards, tom lane
>
>
> Best regards.
>