Text search parser's treatment of URLs and emails - Mailing list pgsql-general

From Thom Brown
Subject Text search parser's treatment of URLs and emails
Msg-id AANLkTikf=K=pen6M4bWKkt1QOzh8mbrEXKOYJ=H0qCMh@mail.gmail.com
Responses Re: Text search parser's treatment of URLs and emails  (Thom Brown <thom@linux.com>)
Re: Text search parser's treatment of URLs and emails  (Bruce Momjian <bruce@momjian.us>)
Re: Text search parser's treatment of URLs and emails  (Bruce Momjian <bruce@momjian.us>)
List pgsql-general
Hi,

I noticed that if I run this:

SELECT alias, description, token FROM
ts_debug('http://www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary');

I get:

  alias   |  description  |                                    token
----------+---------------+------------------------------------------------------------------------------
 protocol | Protocol head | http://
 url      | URL           | www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary
 host     | Host          | www.postgresql.org:2345
 url_path | URL path      | /directory/page.html?version=9.1&build=alpha1#summary
(4 rows)


It could be me being picky, but I don't regard parameters or page
fragments as part of the URL path.  Ideally, I'd sort of expect:

    alias     |  description  |                                    token
--------------+---------------+------------------------------------------------------------------------------
 protocol     | Protocol head | http://
 url          | URL           | www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary
 host         | Host          | www.postgresql.org
 port         | Port          | 2345
 url_path     | URL path      | /directory/page.html
 query_string | Query string  | version=9.1&build=alpha1
 fragment     | Page fragment | summary
(7 rows)

... of course that's if there was support for query strings and page
fragments, which there isn't.  But if changes were made to support my
definition of a URL path, they'd have to be considered breaking
changes.
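
As a rough sketch of the breakdown I have in mind, the finer-grained parts can be derived from the tokens ts_debug already returns by using split_part. The output column names (host_name, port, path, query_string, fragment) are only illustrative, and the query assumes a single URL in the input, no IPv6-style host, and no unencoded "#" before the fragment:

```sql
-- Sketch only: post-process the existing 'host' and 'url_path' tokens.
-- Column names are illustrative; they are not token types the parser provides.
SELECT
    split_part(host, ':', 1)                         AS host_name,
    split_part(host, ':', 2)                         AS port,
    -- Cut off the fragment first, then split path from query string.
    split_part(split_part(url_path, '#', 1), '?', 1) AS path,
    split_part(split_part(url_path, '#', 1), '?', 2) AS query_string,
    split_part(url_path, '#', 2)                     AS fragment
FROM (
    -- Collapse the token rows into one row (assumes one URL in the input).
    SELECT max(CASE WHEN alias = 'host'     THEN token END) AS host,
           max(CASE WHEN alias = 'url_path' THEN token END) AS url_path
    FROM ts_debug('http://www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary')
) AS parts;
```

Given the tokens shown in the first result above, this should return the host, port, path, query string and fragment from my wish list as separate columns. It is a workaround in plain SQL, not a change to the parser.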

But my main gripe is with the name "url_path", since the token it
labels also includes the query string and the fragment.

Also:

SELECT alias, description, token FROM ts_debug('myname+priority@gmail.com');

Yields:

   alias   |   description   |       token
-----------+-----------------+--------------------
 asciiword | Word, all ASCII | myname
 blank     | Space symbols   | +
 email     | Email address   | priority@gmail.com
(3 rows)

The entire string I entered is a valid email address, and addresses of
this form aren't totally uncommon.  Shouldn't such email address styles
be taken into account?  The example above misidentifies the email
address, since the real destination address would most likely be
myname@gmail.com.
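
To illustrate what I mean by the "real destination", here is a minimal sketch that drops the "+tag" part of the local part. It assumes the provider treats everything from the first "+" up to the "@" as a tag, which is a convention of some mail providers rather than a general rule, and the column name probable_destination is made up for the example:

```sql
-- Sketch only: strip a "+tag" suffix from the local part of an address.
-- A bracket expression is used for the literal "+" so the pattern does not
-- depend on how backslashes in string literals are handled.
SELECT regexp_replace('myname+priority@gmail.com', '[+][^@]*@', '@')
       AS probable_destination;
```

This only shows which address I would expect to be reachable; it says nothing about how the parser should tokenise the input.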

--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935
