Re: How to use full-text search URL parser to filter query results by domain name? - Mailing list pgsql-general

From | Jess Wren
Subject | Re: How to use full-text search URL parser to filter query results by domain name?
Date |
Msg-id | 9c7e4b2e-6b0b-70b4-f96e-784a1803a5eb@interference.cc
In response to | Re: How to use full-text search URL parser to filter query results by domain name? (Arthur Zakirov <a.zakirov@postgrespro.ru>)
Responses | Re: How to use full-text search URL parser to filter query results by domain name?
List | pgsql-general
On 4/8/19 4:50 AM, Arthur Zakirov wrote:
I think it is normal to use ts_parse(). And I suppose you might use window functions.
For example, you have table links:
=# create table links (score int, link text);
=# insert into links values
(1, 'http://www.foo.com/bar'),
(2, 'http://www.foo.com/foo'),
(2, 'http://www.bar.com/foo'),
(1, 'http://www.bar.com/bar');
You can use the following query:
=# with l as (
select score, token, link,
rank() over (partition by token order by score) as rank
from links,
lateral ts_parse('default', link)
where tokid = 6)
select score, token, link from l where rank = 1;
 score |    token    |          link
-------+-------------+------------------------
     1 | www.bar.com | http://www.bar.com/bar
     1 | www.foo.com | http://www.foo.com/bar
Thank you very much Arthur. Your suggestion led me to a query that is at least returning the correct result set. I could not figure out how to get your rank() function to work with my query, but building on your answer (and suggestions from IRC etc.), I ended up with the following solution:
First I created the following views:
CREATE VIEW scored_pages AS (
    SELECT crawl_results.crawl_id, crawl_results.score,
           crawl_results.page_id, pages.url
    FROM crawl_results
    JOIN pages ON crawl_results.page_id = pages.id
);
CREATE VIEW scored_links AS (
    SELECT scored_pages.score, links.source, links.target,
           links.link_text
    FROM links
    JOIN scored_pages ON scored_pages.url = links.source
);
Then, using these views, I did the following query to extract the links from the lowest scored pages in the results:
SELECT score, host, target
FROM (
    SELECT DISTINCT ON (token) token AS host, score, target
    FROM scored_links, LATERAL ts_parse('default', target)
    WHERE tokid = 6
    ORDER BY token, score
) AS x
WHERE NOT EXISTS (SELECT pp.id FROM pages pp WHERE target = pp.url)
ORDER BY score;
Does this seem like a reasonable approach? When running EXPLAIN on this query, I get the following:
                                                     QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Sort  (cost=1252927.46..1252927.47 rows=1 width=100)
   Sort Key: crawl_results.score
   ->  Hash Anti Join  (cost=1248297.18..1252927.45 rows=1 width=100)
         Hash Cond: ((links.target)::text = (pp.url)::text)
         ->  Unique  (cost=1247961.08..1252591.28 rows=5 width=100)
               ->  Sort  (cost=1247961.08..1250276.18 rows=926040 width=100)
                     Sort Key: ts_parse.token, crawl_results.score
                     ->  Gather  (cost=1449.79..1054897.20 rows=926040 width=100)
                           Workers Planned: 2
                           ->  Hash Join  (cost=449.79..961293.20 rows=385850 width=100)
                                 Hash Cond: ((links.source)::text = (pages.url)::text)
                                 ->  Nested Loop  (cost=0.00..955091.41 rows=378702 width=144)
                                       ->  Parallel Seq Scan on links  (cost=0.00..4554.40 rows=75740 width=112)
                                       ->  Function Scan on ts_parse  (cost=0.00..12.50 rows=5 width=32)
                                             Filter: (tokid = 6)
                                 ->  Hash  (cost=404.67..404.67 rows=3609 width=63)
                                       ->  Hash Join  (cost=336.10..404.67 rows=3609 width=63)
                                             Hash Cond: (crawl_results.page_id = pages.id)
                                             ->  Seq Scan on crawl_results  (cost=0.00..59.09 rows=3609 width=12)
                                             ->  Hash  (cost=291.60..291.60 rows=3560 width=59)
                                                   ->  Seq Scan on pages  (cost=0.00..291.60 rows=3560 width=59)
         ->  Hash  (cost=291.60..291.60 rows=3560 width=55)
               ->  Seq Scan on pages pp  (cost=0.00..291.60 rows=3560 width=55)
(23 rows)
I am wondering if there is a more efficient way to do things. Some people on IRC mentioned that it might be better to declare a scalar function to return the host from ts_parse instead of the LATERAL query, but I couldn't figure out how to do that, or whether it would even be preferable to the above from a performance standpoint. Any ideas on how I could improve the above?
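For reference, my rough guess at what such a function might look like is below (the name extract_host is just something I made up, and I have not tested whether it actually performs better than the LATERAL version):

-- Hypothetical helper wrapping ts_parse; untested and unbenchmarked.
CREATE FUNCTION extract_host(url text) RETURNS text AS $$
    -- tokid = 6 is the 'host' token in the default parser
    SELECT token FROM ts_parse('default', url) WHERE tokid = 6 LIMIT 1;
$$ LANGUAGE sql STABLE;

-- which I assume would be used roughly like this instead of the LATERAL call:
-- SELECT DISTINCT ON (extract_host(target)) extract_host(target) AS host, score, target
-- FROM scored_links
-- ORDER BY extract_host(target), score;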