Home > mailing lists

How to use full-text search URL parser to filter query results bydomain name? - Mailing list pgsql-general

From	Jess Wren
Subject	How to use full-text search URL parser to filter query results bydomain name?
Date	April 7, 2019 07:06:26
Msg-id	2a782f3e-7473-43ce-aad5-1de31cf2b31a@interference.cc Whole thread Raw
Responses	Re: How to use full-text search URL parser to filter query results by domain name? Re: How to use full-text search URL parser to filter query results bydomain name?
List	pgsql-general

Tree view

I am trying to understand how to use the full-text search parser for URLS and hostnames to filter results from a text field containing URLS based on domain, and also how to index text columns for fast lookup/matching based on domain.

I have a PostgreSQL database containing documents and links downloaded by a web crawler, with the following tables:

        pages

        ----------

        id:          Integer (primary key)

        url:         String  (unique)

        title:       String

        text:        String

        html:        String

        last_visit:  DateTime

        word_pos:    TSVECTOR

        links

        ----------

        id         Integer (primary key)

        source:    String

        target:    String

        link_text: String

        UNIQUE(source,target)

        crawls

        ---------

        id:         Integer (primary key)

        query:      String

        crawl_results

        -------------

        id:       Integer (primary key)

        score:    Integer (constraint 0<=score<=1)

        crawl_id: Integer (foreign key, crawls.id)

        page_id:  Integer (foreign key, pages.id)

The `source` and `target` fields in the `links` table contain URLs. I am running the following query to extract scored links from the top-ranking search results, for pages that haven't been fetched yet:

        WITH top_results AS

            (SELECT page_id, score FROM crawl_results

            WHERE crawl_id=$1

            ORDER BY score LIMIT 100)

        SELECT top_results.score, l.target

        FROM top_results

            JOIN pages p ON top_results.page_id=p.id

            JOIN links l on p.url=l.source

        WHERE NOT EXISTS (SELECT pp.id FROM pages pp WHERE l.target=pp.url)

However, I would like to filter these results so that only one row is returned for a given domain (the one with the lowest score). So for instance, if I get (0.3, 'http://www.foo.com/bar') and (0.8, 'http://www.foo.com/zor'), I only want the first because it has same domain `foo.com` and has the lower score.

I was able to find documentation for the builtin full text search parsers, which can parse URLS and extract the hostname. For instance, I can extract the hostname from a URL as follows:

        SELECT token FROM ts_parse('default', 'http://www.foo.com') WHERE tokid = 6;

            token

        -------------

        www.foo.com

        (1 row)

However, I can't figure out how I would integrate this into the above query to filter out duplicate domains from the results. And because this is the docs for "testing and debugging text search", I don't know if this use of `ts_parse()` is even related to how the URL parser is intended to be used in practice.

How would I use the "host" parser in my query above to return one row per domain? Also, how would I appropriately index the "links" table for "host" and "url" token lookup?

Thanks!

pgsql-general by date:

From: senor
Date: 07 April 2019, 03:47:59
Subject: Re: pg_upgrade --jobs

From: hamann.w@t-online.de
Date: 07 April 2019, 09:42:06
Subject: Re: How to use full-text search URL parser to filter query results by domain name?

How to use full-text search URL parser to filter query results bydomain name? - Mailing list pgsql-general

Previous

Next