How to use full-text search URL parser to filter query results by domain name? - Mailing list pgsql-general

From Jess Wren
Subject How to use full-text search URL parser to filter query results by domain name?
Date
Msg-id 2a782f3e-7473-43ce-aad5-1de31cf2b31a@interference.cc
Responses Re: How to use full-text search URL parser to filter query results by domain name?
Re: How to use full-text search URL parser to filter query results by domain name?
List pgsql-general

I am trying to understand how to use the full-text search parser for URLs and hostnames to filter results from a text field containing URLs based on domain, and also how to index text columns for fast lookup/matching based on domain.

I have a PostgreSQL database containing documents and links downloaded by a web crawler, with the following tables:

        pages
        ----------
        id:          Integer (primary key)
        url:         String  (unique)
        title:       String
        text:        String
        html:        String
        last_visit:  DateTime
        word_pos:    TSVECTOR
        
        links
        ----------
        id:        Integer (primary key)
        source:    String
        target:    String  
        link_text: String
        UNIQUE(source,target)
        
        crawls
        ---------
        id:         Integer (primary key)
        query:      String
        
        crawl_results
        -------------
        id:       Integer (primary key)
        score:    Float   (constraint 0<=score<=1)
        crawl_id: Integer (foreign key, crawls.id)
        page_id:  Integer (foreign key, pages.id)
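
For reference, the schema above corresponds roughly to the following DDL (column types simplified from the ORM definitions; `String` columns are `text`, and `score` is shown as `float` since it holds values between 0 and 1):

        CREATE TABLE pages (
            id         serial PRIMARY KEY,
            url        text UNIQUE,
            title      text,
            text       text,
            html       text,
            last_visit timestamptz,
            word_pos   tsvector
        );

        CREATE TABLE links (
            id        serial PRIMARY KEY,
            source    text,
            target    text,
            link_text text,
            UNIQUE (source, target)
        );

        CREATE TABLE crawls (
            id    serial PRIMARY KEY,
            query text
        );

        CREATE TABLE crawl_results (
            id       serial PRIMARY KEY,
            score    float CHECK (score >= 0 AND score <= 1),
            crawl_id integer REFERENCES crawls (id),
            page_id  integer REFERENCES pages (id)
        );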


The `source` and `target` fields in the `links` table contain URLs. I am running the following query to extract scored links from the top-ranking search results, for pages that haven't been fetched yet:

        WITH top_results AS 
            (SELECT page_id, score FROM crawl_results 
            WHERE crawl_id=$1 
            ORDER BY score LIMIT 100)
        SELECT top_results.score, l.target
        FROM top_results 
            JOIN pages p ON top_results.page_id=p.id
            JOIN links l ON p.url=l.source 
        WHERE NOT EXISTS (SELECT pp.id FROM pages pp WHERE l.target=pp.url)


However, I would like to filter these results so that only one row is returned per domain (the one with the lowest score). So for instance, if the results contain (0.3, 'http://www.foo.com/bar') and (0.8, 'http://www.foo.com/zor'), I only want the first row, because both URLs share the domain `foo.com` and the first has the lower score.
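
What I am imagining is something like the following, where `host_of()` is just a placeholder for whatever expression actually extracts the hostname (this is the part I don't know how to write):

        WITH top_results AS 
            (SELECT page_id, score FROM crawl_results 
            WHERE crawl_id=$1 
            ORDER BY score LIMIT 100)
        SELECT DISTINCT ON (host_of(l.target))
            top_results.score, l.target
        FROM top_results 
            JOIN pages p ON top_results.page_id=p.id
            JOIN links l ON p.url=l.source 
        WHERE NOT EXISTS (SELECT pp.id FROM pages pp WHERE l.target=pp.url)
        ORDER BY host_of(l.target), top_results.score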

I was able to find documentation for the built-in full-text search parsers, which can parse URLs and extract the hostname. For instance, I can extract the hostname from a URL as follows:

        SELECT token FROM ts_parse('default', 'http://www.foo.com') WHERE tokid = 6;
        
            token    
        -------------
        www.foo.com
        (1 row)
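
If I'm reading the documentation right, `tokid = 6` corresponds to the "host" token type in the default parser, which I checked with `ts_token_type`:

        SELECT * FROM ts_token_type('default') WHERE tokid = 6;

         tokid | alias | description 
        -------+-------+-------------
             6 | host  | Host
        (1 row)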



However, I can't figure out how to integrate this into the above query to filter out duplicate domains from the results. And because that page documents "testing and debugging text search", I don't know whether this use of `ts_parse()` even reflects how the URL parser is intended to be used in practice.

How would I use the "host" parser in my query above to return one row per domain? Also, how would I appropriately index the "links" table for "host" and "url" token lookup?

Thanks!
