Thread: How to use full-text search URL parser to filter query results by domain name?

I am trying to understand how to use the full-text search parser for URLs and hostnames to filter results from a text field containing URLs based on domain, and also how to index text columns for fast lookup/matching based on domain.

I have a PostgreSQL database containing documents and links downloaded by a web crawler, with the following tables:

        pages
        ----------
        id:          Integer (primary key)
        url:         String  (unique)
        title:       String
        text:        String
        html:        String
        last_visit:  DateTime
        word_pos:    TSVECTOR
        
        links
        ----------
        id         Integer (primary key)
        source:    String
        target:    String  
        link_text: String
        UNIQUE(source,target)
        
        crawls
        ---------
        id:         Integer (primary key)
        query:      String
        
        crawl_results
        -------------
        id:       Integer (primary key)
        score:    Float   (constraint 0<=score<=1)
        crawl_id: Integer (foreign key, crawls.id)
        page_id:  Integer (foreign key, pages.id)


The `source` and `target` fields in the `links` table contain URLs. I am running the following query to extract scored links from the top-ranking search results, for pages that haven't been fetched yet:

        WITH top_results AS 
            (SELECT page_id, score FROM crawl_results 
            WHERE crawl_id=$1 
            ORDER BY score LIMIT 100)
        SELECT top_results.score, l.target
        FROM top_results 
            JOIN pages p ON top_results.page_id=p.id
            JOIN links l ON p.url=l.source
        WHERE NOT EXISTS (SELECT pp.id FROM pages pp WHERE l.target=pp.url)


However, I would like to filter these results so that only one row is returned for a given domain (the one with the lowest score). So for instance, if I get (0.3, 'http://www.foo.com/bar') and (0.8, 'http://www.foo.com/zor'), I only want the first, because it has the same domain `foo.com` and the lower score.

I was able to find documentation for the builtin full text search parsers <https://www.postgresql.org/docs/11/textsearch-parsers.html>, which can parse URLs and extract the hostname. For instance, I can extract the hostname from a URL as follows:

        SELECT token FROM ts_parse('default', 'http://www.foo.com') WHERE tokid = 6;
        
            token
        -------------
         www.foo.com
        (1 row)
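
The tokid values come from ts_token_type(); running it confirms that, for the default parser, tokid 6 is the host token:

        SELECT * FROM ts_token_type('default') WHERE tokid = 6;

         tokid | alias | description
        -------+-------+-------------
             6 | host  | Host
        (1 row)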



However, I can't figure out how I would integrate this into the above query to filter out duplicate domains from the results. And because this comes from the docs for "testing and debugging text search" <https://www.postgresql.org/docs/11/textsearch-debugging.html#TEXTSEARCH-PARSER-TESTING>, I don't know if this use of `ts_parse()` is even related to how the URL parser is intended to be used in practice.

How would I use the "host" parser in my query above to return one row per domain? Also, how would I appropriately index the "links" table for "host" and "url" token lookup?

Thanks!

Re: How to use full-text search URL parser to filter query results by domain name?

From: hamann.w@t-online.de
>> I am trying to understand how to use the full-text search parser for
>> URLs and hostnames to filter results from a text field containing URLs
>> based on domain, and also how to index text columns for fast
>> lookup/matching based on domain.
>>
>> [schema and query snipped]
>>
>> I was able to find documentation for the builtin full text search
>> parsers <https://www.postgresql.org/docs/11/textsearch-parsers.html>,
>> which can parse URLs and extract the hostname. For instance, I can
>> extract the hostname from a URL as follows:
>>
Hi,

I have no real idea about solving the complete problem, and would probably try
something with a temp table first.
For extracting the hostname from a url you could use

select regexp_replace(url, '^https?://([^/]+).*', '\1')

instead of the fulltext parser
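
For example, a quick check of the pattern:

        select regexp_replace('http://www.foo.com/bar', '^https?://([^/]+).*', '\1');
        -- www.foo.com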

Best regards
Wolfgang


On 4/6/19 11:42 PM, hamann.w@t-online.de wrote:
> Hi,
>
> I have no real idea about solving the complete problem, and would probably try
> something with a temp table first.
> For extracting the hostname from a url you could use
>
> select regexp_replace(url, '^https?://([^/]+).*', '\1')
>
> instead of the fulltext parser
>
> Best regards
> Wolfgang

Thanks Wolfgang, I understand that I could implement a function using
regex for this, or just create an extra column/table to store the
hostname data. But there are other parts of the application where I'll
need to extract the URL path, and others where I'll want to extract the
scheme, etc. Since Postgres has builtin capabilities for parsing URLs to
do all of this, I'd rather just use the builtin functions instead of
writing them myself using regex or having to generate a temp table each
time I do a lookup.

So although I'm aware that there are a variety of ways to extract the
hostname (right now I'm just doing it in Python), I'm really most
interested in understanding how to use the builtin Postgres URL parsers
to extract host, URL path, etc., and how to appropriately create indexes
based on them. The documentation for the URL parser is very sparse, and
I can't find much info online either.
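
For example, the kind of usage I have in mind is something like joining
ts_parse() against ts_token_type() to pick out URL components by alias
instead of hardcoding tokids (aliases per the textsearch-parsers docs):

        SELECT alias, token
        FROM ts_parse('default', 'http://www.foo.com/bar?baz=1')
            JOIN ts_token_type('default') USING (tokid)
        WHERE alias IN ('protocol', 'host', 'url_path');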




On 07.04.2019 07:06, Jess Wren wrote:
> However, I can't figure out how I would integrate this into the above 
> query to filter out duplicate domains from the results. And because this 
> is the docs for "testing and debugging text search 
> <https://www.postgresql.org/docs/11/textsearch-debugging.html#TEXTSEARCH-PARSER-TESTING>", 
> I don't know if this use of `ts_parse()` is even related to how the URL 
> parser is intended to be used in practice.
> 
> How would I use the "host" parser in my query above to return one row 
> per domain? Also, how would I appropriately index the "links" table for 
> "host" and "url" token lookup?

I think it is normal to use ts_parse(). And I suppose you might use
window functions.

For example, you have table links:

=# create table links (score int, link text);
=# insert into links values
   (1, 'http://www.foo.com/bar'),
   (2, 'http://www.foo.com/foo'),
   (2, 'http://www.bar.com/foo'),
   (1, 'http://www.bar.com/bar');

You can use the following query:

=# with l as (
   select score, token, link,
     rank() over (partition by token order by score) as rank
   from links,
     lateral ts_parse('default', link)
   where tokid = 6)
select score, token, link from l where rank = 1;
  score |    token    |          link
-------+-------------+------------------------
      1 | www.bar.com | http://www.bar.com/bar
      1 | www.foo.com | http://www.foo.com/bar

This is just the idea; the query could probably be simpler.
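
For instance, DISTINCT ON might do the same job without the window
function (untested sketch on the same sample data):

=# select distinct on (token) score, token, link
   from links,
     lateral ts_parse('default', link)
   where tokid = 6
   order by token, score;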

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



On 4/8/19 4:50 AM, Arthur Zakirov wrote:
> I think it is normal to use ts_parse(). And I suppose you might use
> window functions.
>
> [example snipped]

Thank you very much Arthur. Your suggestion led me to a query that is at least returning a correct result set. I could not figure out how to get your rank() function to work with my query, but building on your answer (and others from IRC etc.), I ended up with the following solution:

First I created the following views:

        CREATE VIEW scored_pages AS (
            SELECT crawl_results.crawl_id, crawl_results.score,
                crawl_results.page_id, pages.url
            FROM crawl_results
            JOIN pages ON crawl_results.page_id = pages.id
        );

        CREATE VIEW scored_links AS (
            SELECT scored_pages.score, links.source,
                links.target, links.link_text
            FROM links
            JOIN scored_pages ON scored_pages.url = links.source
        );

Then, using these views, I did the following query to extract the links from the lowest scored pages in the results:

        SELECT score, host, target
        FROM (
            SELECT DISTINCT ON (token) token AS host, score, target
            FROM scored_links,
                LATERAL ts_parse('default', target)
            WHERE tokid = 6
            ORDER BY token, score
        ) AS x
        WHERE NOT EXISTS (SELECT pp.id FROM pages pp WHERE target=pp.url)
        ORDER BY score;

Does this seem like a reasonable approach? When running EXPLAIN on this query, I get the following:

                                                    QUERY PLAN                                                    
------------------------------------------------------------------------------------------------------------------
 Sort  (cost=1252927.46..1252927.47 rows=1 width=100)
   Sort Key: crawl_results.score
   ->  Hash Anti Join  (cost=1248297.18..1252927.45 rows=1 width=100)
         Hash Cond: ((links.target)::text = (pp.url)::text)
         ->  Unique  (cost=1247961.08..1252591.28 rows=5 width=100)
               ->  Sort  (cost=1247961.08..1250276.18 rows=926040 width=100)
                     Sort Key: ts_parse.token, crawl_results.score
                     ->  Gather  (cost=1449.79..1054897.20 rows=926040 width=100)
                           Workers Planned: 2
                           ->  Hash Join  (cost=449.79..961293.20 rows=385850 width=100)
                                 Hash Cond: ((links.source)::text = (pages.url)::text)
                                 ->  Nested Loop  (cost=0.00..955091.41 rows=378702 width=144)
                                       ->  Parallel Seq Scan on links  (cost=0.00..4554.40 rows=75740 width=112)
                                       ->  Function Scan on ts_parse  (cost=0.00..12.50 rows=5 width=32)
                                             Filter: (tokid = 6)
                                 ->  Hash  (cost=404.67..404.67 rows=3609 width=63)
                                       ->  Hash Join  (cost=336.10..404.67 rows=3609 width=63)
                                             Hash Cond: (crawl_results.page_id = pages.id)
                                             ->  Seq Scan on crawl_results  (cost=0.00..59.09 rows=3609 width=12)
                                             ->  Hash  (cost=291.60..291.60 rows=3560 width=59)
                                                   ->  Seq Scan on pages  (cost=0.00..291.60 rows=3560 width=59)
         ->  Hash  (cost=291.60..291.60 rows=3560 width=55)
               ->  Seq Scan on pages pp  (cost=0.00..291.60 rows=3560 width=55)
(23 rows)


I am wondering if there is a more efficient way to do things. Some people on IRC mentioned that it might be better to declare a scalar function to return the host from ts_parse instead of using the LATERAL query, but I couldn't figure out how to do that, or whether it was even preferable from a performance standpoint. Any ideas on how I could improve the above?


Re: How to use full-text search URL parser to filter query results by domain name?

From: Michel Pelletier
On Wed, Apr 10, 2019 at 1:58 AM Jess Wren <jess.wren@interference.cc> wrote:
>                                        ->  Parallel Seq Scan on links  (cost=0.00..4554.40 rows=75740 width=112)
>                                        ->  Function Scan on ts_parse  (cost=0.00..12.50 rows=5 width=32)
>                                              Filter: (tokid = 6)
>
> I am wondering if there is a more efficient way to do things. Some people on IRC mentioned that it might be better to declare a scalar function to return the host from ts_parse instead of using the LATERAL query, but I couldn't figure out how to do that, or whether it was even preferable from a performance standpoint. Any ideas on how I could improve the above?


You might try indexing the parsed expression to avoid the seq scan on links, something like:

create index on links (ts_parse('default', target));
 
and then run the EXPLAIN (or EXPLAIN ANALYZE) to see if that improves things. Certainly as the links table gets bigger this should help.
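
Note that ts_parse() is set-returning, so it may not work directly in an index expression; you'd probably need to wrap the host extraction in a scalar function first. An untested sketch (url_host is just a name I made up):

create or replace function url_host(url text) returns text as $$
    -- first host token reported by the default parser (tokid 6 = host)
    select token from ts_parse('default', url) where tokid = 6 limit 1;
$$ language sql immutable;  -- immutable so it is allowed in an index expression

create index on links (url_host(target));

Queries that filter or group on url_host(target) could then use that index.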