Thread: Text search parser's treatment of URLs and emails

Text search parser's treatment of URLs and emails

From
Thom Brown
Date:
Hi,

I noticed that if I run this:

SELECT alias, description, token FROM
ts_debug('http://www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary');

I get:

  alias   |  description  |                              token
----------+---------------+-----------------------------------------------------------------
 protocol | Protocol head | http://
 url      | URL           |
www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary
 host     | Host          | www.postgresql.org:2345
 url_path | URL path      |
/directory/page.html?version=9.1&build=alpha1#summary
(4 rows)


It could be me being picky, but I don't regard parameters or page
fragments as part of the URL path.  Ideally, I'd sort of expect:

    alias     |  description  |                              token
--------------+---------------+-----------------------------------------------------------------
 protocol     | Protocol head | http://
 url          | URL           |
www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary
 host         | Host          | www.postgresql.org
 port         | Port          | 2345
 url_path     | URL path      | /directory/page.html
 query_string | Query string  | version=9.1&build=alpha1
 fragment     | Page fragment | summary
(7 rows)

... of course that's if there was support for query strings and page
fragments, which there isn't.  But if changes were made to support my
definition of a URL path, they'd have to be considered breaking
changes.

But my main gripe is with the name "url_path".

Also:

SELECT alias, description, token FROM ts_debug('myname+priority@gmail.com');

Yields:

   alias   |   description   |       token
-----------+-----------------+--------------------
 asciiword | Word, all ASCII | myname
 blank     | Space symbols   | +
 email     | Email address   | priority@gmail.com
(3 rows)

The entire string I entered is a valid email address, and isn't
totally uncommon.  Shouldn't that take such email address styles be
taken into account?  The example above incorrectly identifies the
email address since the real destination address would most likely be
myname@gmail.com.

--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

Re: Text search parser's treatment of URLs and emails

From
Thom Brown
Date:
On 8 September 2010 21:48, Thom Brown <thom@linux.com> wrote:
> Hi,
>
> I noticed that if I run this:
>
> SELECT alias, description, token FROM
> ts_debug('http://www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary');
>
> I get:
>
>  alias   |  description  |                              token
> ----------+---------------+-----------------------------------------------------------------
>  protocol | Protocol head | http://
>  url      | URL           |
> www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary
>  host     | Host          | www.postgresql.org:2345
>  url_path | URL path      |
> /directory/page.html?version=9.1&build=alpha1#summary
> (4 rows)
>
>
> It could be me being picky, but I don't regard parameters or page
> fragments as part of the URL path.  Ideally, I'd sort of expect:
>
>    alias     |  description  |                              token
> --------------+---------------+-----------------------------------------------------------------
>  protocol     | Protocol head | http://
>  url          | URL           |
> www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary
>  host         | Host          | www.postgresql.org
>  port         | Port          | 2345
>  url_path     | URL path      | /directory/page.html
>  query_string | Query string  | version=9.1&build=alpha1
>  fragment     | Page fragment | summary
> (7 rows)
>
> ... of course that's if there was support for query strings and page
> fragments, which there isn't.  But if changes were made to support my
> definition of a URL path, they'd have to be considered breaking
> changes.
>
> But my main gripe is with the name "url_path".
>
> Also:
>
> SELECT alias, description, token FROM ts_debug('myname+priority@gmail.com');
>
> Yields:
>
>   alias   |   description   |       token
> -----------+-----------------+--------------------
>  asciiword | Word, all ASCII | myname
>  blank     | Space symbols   | +
>  email     | Email address   | priority@gmail.com
> (3 rows)
>
> The entire string I entered is a valid email address, and isn't
> totally uncommon.  Shouldn't that take such email address styles be
> taken into account?  The example above incorrectly identifies the
> email address since the real destination address would most likely be
> myname@gmail.com.

No opinions?

--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

Re: Text search parser's treatment of URLs and emails

From
Bruce Momjian
Date:
Thom Brown wrote:
> Hi,
>
> I noticed that if I run this:
>
> SELECT alias, description, token FROM
> ts_debug('http://www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary');
>
> I get:
>
>   alias   |  description  |                              token
> ----------+---------------+-----------------------------------------------------------------
>  protocol | Protocol head | http://
>  url      | URL           |
> www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary
>  host     | Host          | www.postgresql.org:2345
>  url_path | URL path      |
> /directory/page.html?version=9.1&build=alpha1#summary
> (4 rows)
>
>
> It could be me being picky, but I don't regard parameters or page
> fragments as part of the URL path.  Ideally, I'd sort of expect:
>
>     alias     |  description  |                              token
> --------------+---------------+-----------------------------------------------------------------
>  protocol     | Protocol head | http://
>  url          | URL           |
> www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary
>  host         | Host          | www.postgresql.org
>  port         | Port          | 2345
>  url_path     | URL path      | /directory/page.html
>  query_string | Query string  | version=9.1&build=alpha1
>  fragment     | Page fragment | summary
> (7 rows)
>
> ... of course that's if there was support for query strings and page
> fragments, which there isn't.  But if changes were made to support my
> definition of a URL path, they'd have to be considered breaking
> changes.
>

Wow, that is a tough one.  One the one hand, it seems nice to be able to
split stuff out more, but on the other hand we would be making url_path
less useful because people would need to piece things together to get
the old behavior.  In fact to piece things together we would need to add
'?' and '#' optionally, which seems kind of hard.  Perhaps we should
keep url_path unchanged and add file_path that has your suggestion.
That would allow more fine-grained control without breaking backward
compatibility.  We already duplicate some data with url and url_path, so
having file_path as another place we duplicate some seems OK.

> But my main gripe is with the name "url_path".
>
> Also:
>
> SELECT alias, description, token FROM ts_debug('myname+priority@gmail.com');
>
> Yields:
>
>    alias   |   description   |       token
> -----------+-----------------+--------------------
>  asciiword | Word, all ASCII | myname
>  blank     | Space symbols   | +
>  email     | Email address   | priority@gmail.com
> (3 rows)
>
> The entire string I entered is a valid email address, and isn't
> totally uncommon.  Shouldn't that take such email address styles be
> taken into account?  The example above incorrectly identifies the
> email address since the real destination address would most likely be
> myname@gmail.com.

I had no idea '+' could be part of an email address, and in fact it is a
modifier that is stripped off when delivering the email:

    http://my.brandeis.edu/bboard/q-and-a-fetch-msg?msg_id=0000Nu

I didn't even know that was possible.  It is used as an email delivery
flag.  I agree that needs to be corrected.  We fixed URLs in 9.0 with:

       Use more standards-compliant rules for parsing URL tokens
       (Tom Lane)

so I think it is reasonable to fix email addresses in 9.1.  Care to
submit a patch?   You can lookup Tom's change as a guide.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: Text search parser's treatment of URLs and emails

From
Bruce Momjian
Date:
I have added this as a TODO:

    * Improve handling of plus signs in email address user
      names, and perhaps improve URL parsing

        * http://archives.postgresql.org/pgsql-hackers/2010-10/msg00772.php

---------------------------------------------------------------------------

Thom Brown wrote:
> Hi,
>
> I noticed that if I run this:
>
> SELECT alias, description, token FROM
> ts_debug('http://www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary');
>
> I get:
>
>   alias   |  description  |                              token
> ----------+---------------+-----------------------------------------------------------------
>  protocol | Protocol head | http://
>  url      | URL           |
> www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary
>  host     | Host          | www.postgresql.org:2345
>  url_path | URL path      |
> /directory/page.html?version=9.1&build=alpha1#summary
> (4 rows)
>
>
> It could be me being picky, but I don't regard parameters or page
> fragments as part of the URL path.  Ideally, I'd sort of expect:
>
>     alias     |  description  |                              token
> --------------+---------------+-----------------------------------------------------------------
>  protocol     | Protocol head | http://
>  url          | URL           |
> www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary
>  host         | Host          | www.postgresql.org
>  port         | Port          | 2345
>  url_path     | URL path      | /directory/page.html
>  query_string | Query string  | version=9.1&build=alpha1
>  fragment     | Page fragment | summary
> (7 rows)
>
> ... of course that's if there was support for query strings and page
> fragments, which there isn't.  But if changes were made to support my
> definition of a URL path, they'd have to be considered breaking
> changes.
>
> But my main gripe is with the name "url_path".
>
> Also:
>
> SELECT alias, description, token FROM ts_debug('myname+priority@gmail.com');
>
> Yields:
>
>    alias   |   description   |       token
> -----------+-----------------+--------------------
>  asciiword | Word, all ASCII | myname
>  blank     | Space symbols   | +
>  email     | Email address   | priority@gmail.com
> (3 rows)
>
> The entire string I entered is a valid email address, and isn't
> totally uncommon.  Shouldn't that take such email address styles be
> taken into account?  The example above incorrectly identifies the
> email address since the real destination address would most likely be
> myname@gmail.com.
>
> --
> Thom Brown
> Twitter: @darkixion
> IRC (freenode): dark_ixion
> Registered Linux user: #516935
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +