Thread: Bug with Tsearch and tsvector

Bug with Tsearch and tsvector

From

"Donald Fraser"

Date:

26 April 2010, 11:32:09

PostgreSQL 8.3.10 (on i686-redhat-linux-gnu, compiled by GCC gcc (GCC) 4.1.=
2 20080704 (Red Hat 4.1.2-46))
OS: Linux Redhat EL 5.4=20
Database encoding: LATIN9

Using the default tsearch configuration, for 'english', text is being wrong=
ly parsed into the tsvector type.=20
The fail condition is shown with the following example, using the ts_headli=
ne function to highlight the issue.

SELECT ts_headline('english', 'The annual financial report will shortly be =
posted on the Company’s web-site at
      <span lang=3D"EN-GB">http://www.harewoodsolutions.co.uk/press.aspx</s=
pan><span lang=3D"EN-GB" style=3D""></span><span style=3D"">
      and a further announcement will be made once the annual financial rep=
ort is available to be downloaded. </span>',
      to_tsquery(''), 'MaxWords=3D101, MinWords=3D100');

Output:
"The annual financial report will shortly be posted on the Company’s =
 web-site at
       http://www.harewoodsolutions.co.uk/press.aspx</span><span lang=3D"EN=
-GB" style=3D"">
      and a further announcement will be made once the annual financial rep=
ort is available to be downloaded.  "=20

Expected output:
"The annual financial report will shortly be posted on the Company’s =
 web-site at
       http://www.harewoodsolutions.co.uk/press.aspx
      and a further announcement will be made once the annual financial rep=
ort is available to be downloaded.  "=20

Regards
Donald Fraser=

Re: Bug with Tsearch and tsvector

From

Tom Lane

Date:

26 April 2010, 11:55:39

"Donald Fraser" <postgres@kiwi-fraser.net> writes:
> Using the default tsearch configuration, for 'english', text is being wrongly parsed into the tsvector type.

ts_debug shows that it's being parsed like this:

      alias      |           description           |                 token                  |  dictionaries  |
dictionary |                 lexemes                   

-----------------+---------------------------------+----------------------------------------+----------------+--------------+------------------------------------------
 tag             | XML tag                         | <span lang="EN-GB">                    | {}             |
   |  
 protocol        | Protocol head                   | http://                                | {}             |
   |  
 url             | URL                             | www.harewoodsolutions.co.uk/press.aspx | {simple}       | simple
   | {www.harewoodsolutions.co.uk/press.aspx} 
 host            | Host                            | www.harewoodsolutions.co.uk            | {simple}       | simple
   | {www.harewoodsolutions.co.uk} 
 url_path        | URL path                        | /press.aspx</span><span                | {simple}       | simple
   | {/press.aspx</span><span} 
 blank           | Space symbols                   |                                        | {}             |
   |  
 asciiword       | Word, all ASCII                 | lang                                   | {english_stem} |
english_stem| {lang} 
 ... etc ...

ie the critical point seems to be that url_path is willing to soak up a
string containing "<" and ">", so the span tags don't get recognized as
separate lexemes.  While that's "obviously" the wrong thing in this
particular example, I'm not sure if it's the wrong thing in general.
Can anyone comment on the frequency of usage of those two symbols in
URLs?

In any case it's weird that the URL lexeme doesn't span the same text
as the url_path one, but I'm not sure which one we should consider
wrong.

            regards, tom lane

Re: Bug with Tsearch and tsvector

From

"Kevin Grittner"

Date:

26 April 2010, 15:55:03

Tom Lane <tgl@sss.pgh.pa.us> wrote:

> ie the critical point seems to be that url_path is willing to soak
> up a string containing "<" and ">", so the span tags don't get
> recognized as separate lexemes.  While that's "obviously" the
> wrong thing in this particular example, I'm not sure if it's the
> wrong thing in general. Can anyone comment on the frequency of
> usage of those two symbols in URLs?

http://www.ietf.org/rfc/rfc2396.txt section 2.4.3 "delims" expressly
forbids their use in URIs.

> In any case it's weird that the URL lexeme doesn't span the same
> text as the url_path one, but I'm not sure which one we should
> consider wrong.

In spite of the above prohibition, I notice that firefox and wget
both seem to *try* to use such characters if they're included.

-Kevin

Re: Bug with Tsearch and tsvector

From

Tom Lane

Date:

26 April 2010, 15:58:51

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> ie the critical point seems to be that url_path is willing to soak
>> up a string containing "<" and ">", so the span tags don't get
>> recognized as separate lexemes.  While that's "obviously" the
>> wrong thing in this particular example, I'm not sure if it's the
>> wrong thing in general. Can anyone comment on the frequency of
>> usage of those two symbols in URLs?

> http://www.ietf.org/rfc/rfc2396.txt section 2.4.3 "delims" expressly
> forbids their use in URIs.
> In spite of the above prohibition, I notice that firefox and wget
> both seem to *try* to use such characters if they're included.

Hmm, thanks for the reference, but I'm not sure this is specifying
quite what we want to get at.  In particular I note that it excludes
'%' on the grounds that that ought to be escaped, so I guess this is
specifying the characters allowed in an underlying URI, *not* the
textual representation of a URI.

Still, it seems like this is a sufficient defense against any complaints
we might get for not treating "<" or ">" as part of a URL.

I wonder whether we ought to reject any of the other characters listed
here too.  Right now, the InURLPath state seems to eat everything until
a space, quote, or double quote mark.  We could easily make it stop at
"<" or ">" too, but what else?

            regards, tom lane

Re: Bug with Tsearch and tsvector

From

Tom Lane

Date:

26 April 2010, 16:24:06

I wrote:
> "Donald Fraser" <postgres@kiwi-fraser.net> writes:
>> Using the default tsearch configuration, for 'english', text is being wrongly parsed into the tsvector type.

> ts_debug shows that it's being parsed like this:

>       alias      |           description           |                 token                  |  dictionaries  |
dictionary |                 lexemes                   
>
-----------------+---------------------------------+----------------------------------------+----------------+--------------+------------------------------------------
>  tag             | XML tag                         | <span lang="EN-GB">                    | {}             |
     |  
>  protocol        | Protocol head                   | http://                                | {}             |
     |  
>  url             | URL                             | www.harewoodsolutions.co.uk/press.aspx | {simple}       | simple
     | {www.harewoodsolutions.co.uk/press.aspx} 
>  host            | Host                            | www.harewoodsolutions.co.uk            | {simple}       | simple
     | {www.harewoodsolutions.co.uk} 
>  url_path        | URL path                        | /press.aspx</span><span                | {simple}       | simple
     | {/press.aspx</span><span} 
>  blank           | Space symbols                   |                                        | {}             |
     |  
>  asciiword       | Word, all ASCII                 | lang                                   | {english_stem} |
english_stem| {lang} 
>  ... etc ...

> ie the critical point seems to be that url_path is willing to soak up a
> string containing "<" and ">", so the span tags don't get recognized as
> separate lexemes.  While that's "obviously" the wrong thing in this
> particular example, I'm not sure if it's the wrong thing in general.
> Can anyone comment on the frequency of usage of those two symbols in
> URLs?

> In any case it's weird that the URL lexeme doesn't span the same text
> as the url_path one, but I'm not sure which one we should consider
> wrong.

I poked at this a bit.  The reason for the inconsistency between the url
and url_path lexemes is that the InURLPathStart state transitions
directly to InURLPath, which is *not* consistent with what happens while
parsing the URL as a whole: p_isURLPath() starts the sub-parser in
InFileFirst state.  The attached proposed patch rectifies that by
transitioning to InFileFirst state instead.  A possible objection to
this fix is that you may get either a "file" or a "url_path" component
lexeme, where before you always got "url_path".  I'm not sure if that's
something to worry about or not; I'd tend to think there's nothing much
wrong with it.

The other change in the attached patch is to make InURLPath parsing
stop at "<" or ">", as per discussion.

With these changes I get

regression=# SELECT * from ts_debug('http://www.harewoodsolutions.co.uk/press.aspx</span>');
  alias   |    description    |                 token                  | dictionaries | dictionary |
lexemes                  

----------+-------------------+----------------------------------------+--------------+------------+------------------------------------------
 protocol | Protocol head     | http://                                | {}           |            |
 url      | URL               | www.harewoodsolutions.co.uk/press.aspx | {simple}     | simple     |
{www.harewoodsolutions.co.uk/press.aspx}
 host     | Host              | www.harewoodsolutions.co.uk            | {simple}     | simple     |
{www.harewoodsolutions.co.uk}
 file     | File or path name | /press.aspx                            | {simple}     | simple     | {/press.aspx}
 tag      | XML tag           | </span>                                | {}           |            |
(5 rows)

as compared to the prior behavior

regression=# SELECT * from ts_debug('http://www.harewoodsolutions.co.uk/press.aspx</span>');
  alias   |  description  |                 token                  | dictionaries | dictionary |
lexemes                  

----------+---------------+----------------------------------------+--------------+------------+------------------------------------------
 protocol | Protocol head | http://                                | {}           |            |
 url      | URL           | www.harewoodsolutions.co.uk/press.aspx | {simple}     | simple     |
{www.harewoodsolutions.co.uk/press.aspx}
 host     | Host          | www.harewoodsolutions.co.uk            | {simple}     | simple     |
{www.harewoodsolutions.co.uk}
 url_path | URL path      | /press.aspx</span>                     | {simple}     | simple     | {/press.aspx</span>}
(4 rows)

Neither change affects the current set of regression tests; but none the
less there's a potential compatibility issue here, so my thought is to
apply this only in HEAD.

Comments?

            regards, tom lane


Index: src/backend/tsearch/wparser_def.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.29
diff -c -r1.29 wparser_def.c
*** src/backend/tsearch/wparser_def.c    26 Apr 2010 17:10:18 -0000    1.29
--- src/backend/tsearch/wparser_def.c    26 Apr 2010 19:17:48 -0000
***************
*** 1504,1521 ****
      {p_isEOF, 0, A_POP, TPS_Null, 0, NULL},
      {p_iseqC, '"', A_POP, TPS_Null, 0, NULL},
      {p_iseqC, '\'', A_POP, TPS_Null, 0, NULL},
      {p_isnotspace, 0, A_CLEAR, TPS_InURLPath, 0, NULL},
      {NULL, 0, A_POP, TPS_Null, 0, NULL},
  };

  static const TParserStateActionItem actionTPS_InURLPathStart[] = {
!     {NULL, 0, A_NEXT, TPS_InURLPath, 0, NULL}
  };

  static const TParserStateActionItem actionTPS_InURLPath[] = {
      {p_isEOF, 0, A_BINGO, TPS_Base, URLPATH, NULL},
      {p_iseqC, '"', A_BINGO, TPS_Base, URLPATH, NULL},
      {p_iseqC, '\'', A_BINGO, TPS_Base, URLPATH, NULL},
      {p_isnotspace, 0, A_NEXT, TPS_InURLPath, 0, NULL},
      {NULL, 0, A_BINGO, TPS_Base, URLPATH, NULL}
  };
--- 1504,1526 ----
      {p_isEOF, 0, A_POP, TPS_Null, 0, NULL},
      {p_iseqC, '"', A_POP, TPS_Null, 0, NULL},
      {p_iseqC, '\'', A_POP, TPS_Null, 0, NULL},
+     {p_iseqC, '<', A_POP, TPS_Null, 0, NULL},
+     {p_iseqC, '>', A_POP, TPS_Null, 0, NULL},
      {p_isnotspace, 0, A_CLEAR, TPS_InURLPath, 0, NULL},
      {NULL, 0, A_POP, TPS_Null, 0, NULL},
  };

  static const TParserStateActionItem actionTPS_InURLPathStart[] = {
!     /* this should transition to same state that p_isURLPath starts in */
!     {NULL, 0, A_NEXT, TPS_InFileFirst, 0, NULL}
  };

  static const TParserStateActionItem actionTPS_InURLPath[] = {
      {p_isEOF, 0, A_BINGO, TPS_Base, URLPATH, NULL},
      {p_iseqC, '"', A_BINGO, TPS_Base, URLPATH, NULL},
      {p_iseqC, '\'', A_BINGO, TPS_Base, URLPATH, NULL},
+     {p_iseqC, '<', A_BINGO, TPS_Base, URLPATH, NULL},
+     {p_iseqC, '>', A_BINGO, TPS_Base, URLPATH, NULL},
      {p_isnotspace, 0, A_NEXT, TPS_InURLPath, 0, NULL},
      {NULL, 0, A_BINGO, TPS_Base, URLPATH, NULL}
  };

Re: Bug with Tsearch and tsvector

From

"Kevin Grittner"

Date:

26 April 2010, 17:55:17

Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Hmm, thanks for the reference, but I'm not sure this is specifying
> quite what we want to get at.  In particular I note that it
> excludes '%' on the grounds that that ought to be escaped, so I
> guess this is specifying the characters allowed in an underlying
> URI, *not* the textual representation of a URI.

I'm not sure I follow you here -- % is disallowed "raw" because it
is itself the escape character to allow hexadecimal specification of
any disallowed character.  So, being the escape character itself, we
would need to allow it.

Section 2.4, taken as a whole, makes sense to me, and argues that we
should always treat any text representation of a URI (including a
URL) as being in escaped form.  If it weren't for backward
compatibility, I would feel strongly that we should take any of the
excluded characters as the end of a URI.

| A URI is always in an "escaped" form, since escaping or unescaping
| a completed URI might change its semantics.  Normally, the only
| time escape encodings can safely be made is when the URI is being
| created from its component parts; each component may have its own
| set of characters that are reserved, so only the mechanism
| responsible for generating or interpreting that component can
| determine whether or not escaping a character will change its
| semantics. Likewise, a URI must be separated into its components
| before the escaped characters within those components can be
| safely decoded.

> Still, it seems like this is a sufficient defense against any
> complaints we might get for not treating "<" or ">" as part of a
> URL.

I would think so.

> I wonder whether we ought to reject any of the other characters
> listed here too.  Right now, the InURLPath state seems to eat
> everything until a space, quote, or double quote mark.  We could
> easily make it stop at "<" or ">" too, but what else?

From the RFC:

| control     = <US-ASCII coded characters 00-1F and 7F hexadecimal>
| space       = <US-ASCII coded character 20 hexadecimal>
| delims      = "<" | ">" | "#" | "%" | <">
| unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

Except, of course, that since % is the escape character, it is OK.

Hmm.  Having typed that, I'm staring at the # character, which is
used to mark off an anchor within an HTML page identified by the
URL.  Should we consider the # and anchor part of a URL?  Any other
questionable characters?

-Kevin

Re: Bug with Tsearch and tsvector

From

"Kevin Grittner"

Date:

26 April 2010, 17:58:31

Tom Lane <tgl@sss.pgh.pa.us> wrote:

> there's a potential compatibility issue here, so my thought is to
> apply this only in HEAD.

Agreed.

-Kevin

Re: Bug with Tsearch and tsvector

From

Tom Lane

Date:

27 April 2010, 00:32:29

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Hmm.  Having typed that, I'm staring at the # character, which is
> used to mark off an anchor within an HTML page identified by the
> URL.  Should we consider the # and anchor part of a URL?

Yeah, I would think so.

This discussion is making me think that my previous patch went in the
wrong direction.  The way that the existing code works is that after
seeing something that looks host-name-ish followed by a '/', it goes to
the FilePath state, which is why "/press.aspx" gets parsed as a file
name in my previous example.  It only goes to the URLPath state if,
while in FilePath state, it sees '?'.  This seems a tad bizarre, and
it means that anything we do to the URLPath rules will only affect the
part of a URL following a '?'.

What I think might be the right thing instead, if we are going to
tighten up what URLPath accepts, is to go directly to URLPath state
after seeing host-name-and-'/'.  This eliminates the problem of
sometimes reporting "file" where we would have said "url_path"
before, and gives us a chance to apply the URLPath rules uniformly
to all text following a hostname.

Attached is a patch that does it that way instead.  We'd probably
not want to apply this as-is, but should first tighten up what
characters URLPath allows, per Kevin's spec research.

I find that this patch does create a couple of changes in the regression
test outputs.  The reason is that it parses this case differently:
    select * from ts_debug('http://5aew.werc.ewr:8100/?');
Existing code says that that is

  alias   |  description  |       token        | dictionaries | dictionary |       lexemes
----------+---------------+--------------------+--------------+------------+----------------------
 protocol | Protocol head | http://            | {}           |            |
 host     | Host          | 5aew.werc.ewr:8100 | {simple}     | simple     | {5aew.werc.ewr:8100}
 blank    | Space symbols | /?                 | {}           |            |
(3 rows)

while with this patch we get

  alias   |  description  |        token         | dictionaries | dictionary |        lexemes
----------+---------------+----------------------+--------------+------------+------------------------
 protocol | Protocol head | http://              | {}           |            |
 url      | URL           | 5aew.werc.ewr:8100/? | {simple}     | simple     | {5aew.werc.ewr:8100/?}
 host     | Host          | 5aew.werc.ewr:8100   | {simple}     | simple     | {5aew.werc.ewr:8100}
 url_path | URL path      | /?                   | {simple}     | simple     | {/?}
(4 rows)

Offhand I see no reason to discriminate against "/?" as a URL path, so
this change seems fine to me, but it is a change.

Thoughts?

            regards, tom lane

Index: src/backend/tsearch/wparser_def.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.29
diff -c -r1.29 wparser_def.c
*** src/backend/tsearch/wparser_def.c    26 Apr 2010 17:10:18 -0000    1.29
--- src/backend/tsearch/wparser_def.c    27 Apr 2010 03:27:08 -0000
***************
*** 707,715 ****
      int            res = 0;

      tmpprs->state = newTParserPosition(tmpprs->state);
!     tmpprs->state->state = TPS_InFileFirst;

!     if (TParserGet(tmpprs) && (tmpprs->type == URLPATH || tmpprs->type == FILEPATH))
      {
          prs->state->posbyte += tmpprs->lenbytetoken;
          prs->state->poschar += tmpprs->lenchartoken;
--- 707,715 ----
      int            res = 0;

      tmpprs->state = newTParserPosition(tmpprs->state);
!     tmpprs->state->state = TPS_InURLPathFirst;

!     if (TParserGet(tmpprs) && tmpprs->type == URLPATH)
      {
          prs->state->posbyte += tmpprs->lenbytetoken;
          prs->state->poschar += tmpprs->lenchartoken;
***************
*** 1441,1447 ****
      {p_isdigit, 0, A_NEXT, TPS_InFile, 0, NULL},
      {p_iseqC, '.', A_NEXT, TPS_InPathFirst, 0, NULL},
      {p_iseqC, '_', A_NEXT, TPS_InFile, 0, NULL},
-     {p_iseqC, '?', A_PUSH, TPS_InURLPathFirst, 0, NULL},
      {p_iseqC, '~', A_PUSH, TPS_InFileTwiddle, 0, NULL},
      {NULL, 0, A_POP, TPS_Null, 0, NULL}
  };
--- 1441,1446 ----
***************
*** 1488,1494 ****
      {p_iseqC, '_', A_NEXT, TPS_InFile, 0, NULL},
      {p_iseqC, '-', A_NEXT, TPS_InFile, 0, NULL},
      {p_iseqC, '/', A_PUSH, TPS_InFileFirst, 0, NULL},
-     {p_iseqC, '?', A_PUSH, TPS_InURLPathFirst, 0, NULL},
      {NULL, 0, A_BINGO, TPS_Base, FILEPATH, NULL}
  };

--- 1487,1492 ----
***************
*** 1504,1510 ****
      {p_isEOF, 0, A_POP, TPS_Null, 0, NULL},
      {p_iseqC, '"', A_POP, TPS_Null, 0, NULL},
      {p_iseqC, '\'', A_POP, TPS_Null, 0, NULL},
!     {p_isnotspace, 0, A_CLEAR, TPS_InURLPath, 0, NULL},
      {NULL, 0, A_POP, TPS_Null, 0, NULL},
  };

--- 1502,1510 ----
      {p_isEOF, 0, A_POP, TPS_Null, 0, NULL},
      {p_iseqC, '"', A_POP, TPS_Null, 0, NULL},
      {p_iseqC, '\'', A_POP, TPS_Null, 0, NULL},
!     {p_iseqC, '<', A_POP, TPS_Null, 0, NULL},
!     {p_iseqC, '>', A_POP, TPS_Null, 0, NULL},
!     {p_isnotspace, 0, A_NEXT, TPS_InURLPath, 0, NULL},
      {NULL, 0, A_POP, TPS_Null, 0, NULL},
  };

***************
*** 1516,1521 ****
--- 1516,1523 ----
      {p_isEOF, 0, A_BINGO, TPS_Base, URLPATH, NULL},
      {p_iseqC, '"', A_BINGO, TPS_Base, URLPATH, NULL},
      {p_iseqC, '\'', A_BINGO, TPS_Base, URLPATH, NULL},
+     {p_iseqC, '<', A_BINGO, TPS_Base, URLPATH, NULL},
+     {p_iseqC, '>', A_BINGO, TPS_Base, URLPATH, NULL},
      {p_isnotspace, 0, A_NEXT, TPS_InURLPath, 0, NULL},
      {NULL, 0, A_BINGO, TPS_Base, URLPATH, NULL}
  };

Re: Bug with Tsearch and tsvector

From

"Kevin Grittner"

Date:

27 April 2010, 11:22:47

Tom Lane <tgl@sss.pgh.pa.us> wrote:

> We'd probably not want to apply this as-is, but should first
> tighten up what characters URLPath allows, per Kevin's spec
> research.

If we're headed that way, I figured I should double-check.  The RFC
I referenced was later obsoleted by:

http://www.ietf.org/rfc/rfc3986.txt

I haven't been able to find anything more recent.  This newer RFC
doesn't seem to list characters which are *not* part of a URI, so
here are the characters which *are* allowed:

  gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

  sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                / "*" / "+" / "," / ";" / "="

  unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

I'll read this RFC closely and follow up later today.

-Kevin

Re: Bug with Tsearch and tsvector

From

Tom Lane

Date:

27 April 2010, 11:32:37

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> We'd probably not want to apply this as-is, but should first
>> tighten up what characters URLPath allows, per Kevin's spec
>> research.

> If we're headed that way, I figured I should double-check.  The RFC
> I referenced was later obsoleted by:
> http://www.ietf.org/rfc/rfc3986.txt

On reflection, since we're changing the behavior anyway, it seems
like the most defensible thing to do is make the TS parser follow
the RFC's allowed character set exactly.  The newer RFC doesn't
restrict '#' so that possible corner case is gone.

            regards, tom lane

Re: Bug with Tsearch and tsvector

From

"Kevin Grittner"

Date:

27 April 2010, 11:42:33

Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> We'd probably not want to apply this as-is, but should first
>>> tighten up what characters URLPath allows, per Kevin's spec
>>> research.
>
>> If we're headed that way, I figured I should double-check.  The
>> RFC I referenced was later obsoleted by:
>> http://www.ietf.org/rfc/rfc3986.txt
>
> On reflection, since we're changing the behavior anyway, it seems
> like the most defensible thing to do is make the TS parser follow
> the RFC's allowed character set exactly.  The newer RFC doesn't
> restrict '#' so that possible corner case is gone.

It seems worth mentioning that there is a BSD licensed URI parser on
sourceforge:

http://uriparser.sourceforge.net/

I'm not advocating for using it, I just ran across it and it seemed
of possible interest.

-Kevin

Re: Bug with Tsearch and tsvector

From

"Kevin Grittner"

Date:

27 April 2010, 12:42:59

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:

> I'll read this RFC closely and follow up later today.

For anyone not clear on what a URI is compared to a URL, every URL
is also a URI (but not the other way around):

  A URI can be further classified as a locator, a name, or both.
  The term "Uniform Resource Locator" (URL) refers to the subset of
  URIs that, in addition to identifying a resource, provide a means
  of locating the resource by describing its primary access
  mechanism (e.g., its network "location").

So rules for URIs apply to URLs.

Regarding allowed characters, the relevant portions seem to be:

  The URI syntax has been designed with global transcription as one
  of its main considerations.  A URI is a sequence of characters
  from a very limited set: the letters of the basic Latin alphabet,
  digits, and a few special characters.


  The generic syntax uses the slash ("/"), question mark ("?"), and
  number sign ("#") characters to delimit components that are
  significant to the generic parser's hierarchical interpretation of
  an identifier.


  A URI is composed from a limited set of characters consisting of
  digits, letters, and a few graphic symbols.  A reserved subset of
  those characters may be used to delimit syntax components within a
  URI while the remaining characters, including both the unreserved
  set and those reserved characters not acting as delimiters, define
  each component's identifying data.


  A percent-encoding mechanism is used to represent a data octet in
  a component when that octet's corresponding character is outside
  the allowed set or is being used as a delimiter of, or within, the
  component.  A percent-encoded octet is encoded as a character
  triplet, consisting of the percent character "%" followed by the
  two hexadecimal digits representing that octet's numeric value.
  For example, "%20" is the percent-encoding for the binary octet
  "00100000" (ABNF: %x20), which in US-ASCII corresponds to the
  space character (SP).  Section 2.4 describes when percent-encoding
  and decoding is applied.

     pct-encoded = "%" HEXDIG HEXDIG

  The uppercase hexadecimal digits 'A' through 'F' are equivalent to
  the lowercase digits 'a' through 'f', respectively.  If two URIs
  differ only in the case of hexadecimal digits used in percent-
  encoded octets, they are equivalent.  For consistency, URI
  producers and normalizers should use uppercase hexadecimal digits
  for all percent-encodings.


     reserved    = gen-delims / sub-delims

     gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

     sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="


     unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"


I think that we should accept all the above characters (reserved and
unreserved) and the percent character (since it is the escape
character) as part of a URL.

Certainly *not* back-patchable.

I don't know whether we should try to extract components of the URL,
but if we do, perhaps we should also adopt the standard names for
the components:

  The generic URI syntax consists of a hierarchical sequence of
  components referred to as the scheme, authority, path, query, and
  fragment.

    URI        = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

    hier-part   = "//" authority path-abempty
                / path-absolute
                / path-rootless
                / path-empty

  The scheme and path components are required, though the path may
  be empty (no characters).  When authority is present, the path
  must either be empty or begin with a slash ("/") character.  When
  authority is not present, the path cannot begin with two slash
  characters ("//").  These restrictions result in five different
  ABNF rules for a path (Section 3.3), only one of which will match
  any given URI reference.

  The following are two example URIs and their component parts:

        foo://example.com:8042/over/there?name=ferret#nose
        \_/   \______________/\_________/ \_________/ \__/
         |           |            |            |        |
      scheme     authority       path        query   fragment
         |   _____________________|__
        / \ /                        \
        urn:example:animal:ferret:nose

I'm not really sure of the source for names we're now using.

Of course, the bigger the changes, the less they sound like material
for a quick, 11th hour 9.0 patch.

-Kevin

Re: Bug with Tsearch and tsvector

From

Tom Lane

Date:

27 April 2010, 12:52:47

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> I think that we should accept all the above characters (reserved and
> unreserved) and the percent character (since it is the escape
> character) as part of a URL.

Check.

> I don't know whether we should try to extract components of the URL,
> but if we do, perhaps we should also adopt the standard names for
> the components:

I don't think we should change the component names; that would break
existing text search configurations, and it doesn't seem to buy much
in return.

            regards, tom lane

Re: Bug with Tsearch and tsvector

From

Tom Lane

Date:

27 April 2010, 23:06:26

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>      reserved    = gen-delims / sub-delims
>      gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
>      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
>                  / "*" / "+" / "," / ";" / "="
>      unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

> I think that we should accept all the above characters (reserved and
> unreserved) and the percent character (since it is the escape
> character) as part of a URL.

I've applied the attached patch to make it work that way.

            regards, tom lane

Index: src/backend/tsearch/wparser_def.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.29
diff -c -r1.29 wparser_def.c
*** src/backend/tsearch/wparser_def.c    26 Apr 2010 17:10:18 -0000    1.29
--- src/backend/tsearch/wparser_def.c    28 Apr 2010 01:57:14 -0000
***************
*** 583,588 ****
--- 583,617 ----
      return (p_isascii(prs) && p_isalpha(prs)) ? 1 : 0;
  }

+ static int
+ p_isurlchar(TParser *prs)
+ {
+     char        ch;
+
+     /* no non-ASCII need apply */
+     if (prs->state->charlen != 1)
+         return 0;
+     ch = *(prs->str + prs->state->posbyte);
+     /* no spaces or control characters */
+     if (ch <= 0x20 || ch >= 0x7F)
+         return 0;
+     /* reject characters disallowed by RFC 3986 */
+     switch (ch)
+     {
+         case '"':
+         case '<':
+         case '>':
+         case '\\':
+         case '^':
+         case '`':
+         case '{':
+         case '|':
+         case '}':
+             return 0;
+     }
+     return 1;
+ }
+

  /* deliberately suppress unused-function complaints for the above */
  void        _make_compiler_happy(void);
***************
*** 707,715 ****
      int            res = 0;

      tmpprs->state = newTParserPosition(tmpprs->state);
!     tmpprs->state->state = TPS_InFileFirst;

!     if (TParserGet(tmpprs) && (tmpprs->type == URLPATH || tmpprs->type == FILEPATH))
      {
          prs->state->posbyte += tmpprs->lenbytetoken;
          prs->state->poschar += tmpprs->lenchartoken;
--- 736,744 ----
      int            res = 0;

      tmpprs->state = newTParserPosition(tmpprs->state);
!     tmpprs->state->state = TPS_InURLPathFirst;

!     if (TParserGet(tmpprs) && tmpprs->type == URLPATH)
      {
          prs->state->posbyte += tmpprs->lenbytetoken;
          prs->state->poschar += tmpprs->lenchartoken;
***************
*** 1441,1447 ****
      {p_isdigit, 0, A_NEXT, TPS_InFile, 0, NULL},
      {p_iseqC, '.', A_NEXT, TPS_InPathFirst, 0, NULL},
      {p_iseqC, '_', A_NEXT, TPS_InFile, 0, NULL},
-     {p_iseqC, '?', A_PUSH, TPS_InURLPathFirst, 0, NULL},
      {p_iseqC, '~', A_PUSH, TPS_InFileTwiddle, 0, NULL},
      {NULL, 0, A_POP, TPS_Null, 0, NULL}
  };
--- 1470,1475 ----
***************
*** 1488,1494 ****
      {p_iseqC, '_', A_NEXT, TPS_InFile, 0, NULL},
      {p_iseqC, '-', A_NEXT, TPS_InFile, 0, NULL},
      {p_iseqC, '/', A_PUSH, TPS_InFileFirst, 0, NULL},
-     {p_iseqC, '?', A_PUSH, TPS_InURLPathFirst, 0, NULL},
      {NULL, 0, A_BINGO, TPS_Base, FILEPATH, NULL}
  };

--- 1516,1521 ----
***************
*** 1502,1510 ****

  static const TParserStateActionItem actionTPS_InURLPathFirst[] = {
      {p_isEOF, 0, A_POP, TPS_Null, 0, NULL},
!     {p_iseqC, '"', A_POP, TPS_Null, 0, NULL},
!     {p_iseqC, '\'', A_POP, TPS_Null, 0, NULL},
!     {p_isnotspace, 0, A_CLEAR, TPS_InURLPath, 0, NULL},
      {NULL, 0, A_POP, TPS_Null, 0, NULL},
  };

--- 1529,1535 ----

  static const TParserStateActionItem actionTPS_InURLPathFirst[] = {
      {p_isEOF, 0, A_POP, TPS_Null, 0, NULL},
!     {p_isurlchar, 0, A_NEXT, TPS_InURLPath, 0, NULL},
      {NULL, 0, A_POP, TPS_Null, 0, NULL},
  };

***************
*** 1514,1522 ****

  static const TParserStateActionItem actionTPS_InURLPath[] = {
      {p_isEOF, 0, A_BINGO, TPS_Base, URLPATH, NULL},
!     {p_iseqC, '"', A_BINGO, TPS_Base, URLPATH, NULL},
!     {p_iseqC, '\'', A_BINGO, TPS_Base, URLPATH, NULL},
!     {p_isnotspace, 0, A_NEXT, TPS_InURLPath, 0, NULL},
      {NULL, 0, A_BINGO, TPS_Base, URLPATH, NULL}
  };

--- 1539,1545 ----

  static const TParserStateActionItem actionTPS_InURLPath[] = {
      {p_isEOF, 0, A_BINGO, TPS_Base, URLPATH, NULL},
!     {p_isurlchar, 0, A_NEXT, TPS_InURLPath, 0, NULL},
      {NULL, 0, A_BINGO, TPS_Base, URLPATH, NULL}
  };

Index: src/test/regress/expected/tsearch.out
===================================================================
RCS file: /cvsroot/pgsql/src/test/regress/expected/tsearch.out,v
retrieving revision 1.17
diff -c -r1.17 tsearch.out
*** src/test/regress/expected/tsearch.out    22 Nov 2009 05:20:41 -0000    1.17
--- src/test/regress/expected/tsearch.out    28 Apr 2010 01:57:14 -0000
***************
*** 287,294 ****
       6 | 4aew.werc.ewr
      12 |
      14 | http://
       6 | 5aew.werc.ewr:8100
!     12 | /?
       1 | ad
      12 | =
       1 | qwe
--- 287,296 ----
       6 | 4aew.werc.ewr
      12 |
      14 | http://
+      5 | 5aew.werc.ewr:8100/?
       6 | 5aew.werc.ewr:8100
!     18 | /?
!     12 |
       1 | ad
      12 | =
       1 | qwe
***************
*** 391,404 ****
      12 |
      12 | <>
       1 | qwerty
! (131 rows)

  SELECT to_tsvector('english', '345 qwe@efd.r '' http://www.com/ http://aew.werc.ewr/?ad=qwe&dw
1aew.werc.ewr/?ad=qwe&dw2aew.werc.ewr http://3aew.werc.ewr/?ad=qwe&dw http://4aew.werc.ewr http://5aew.werc.ewr:8100/?
ad=qwe&dw6aew.werc.ewr:8100/?ad=qwe&dw 7aew.werc.ewr:8100/?ad=qwe&dw=%20%32 +4.0e-10 qwe qwe qwqwe 234.435 455 5.005
teodor@stack.netqwe-wer asdf <fr>qwer jf sdjk<we hjwer <werrwe> ewr1> ewri2 <a href="qwe<qwe>"> 
  /usr/local/fff /awdf/dwqe/4325 rewt/ewr wefjn /wqe-324/ewr gist.h gist.h.c gist.c. readline 4.2 4.2. 4.2,
readline-4.2readline-4.2. 234 
  <i <b> wow  < jqw <> qwerty');
!


                                             to_tsvector


                                                                                                     
!
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
!  '+4.0e-10':26 '-4.2':58,60 '/?ad=qwe&dw':7,10,14,22 '/?ad=qwe&dw=%20%32':25 '/awdf/dwqe/4325':46 '/usr/local/fff':45
'/wqe-324/ewr':49'1aew.werc.ewr':9 '1aew.werc.ewr/?ad=qwe&dw':8 '234':61 '234.435':30 '2aew.werc.ewr':11 '345':1
'3aew.werc.ewr':13'3aew.werc.ewr/?ad=qwe&dw':12 '4.2':54,55,56 '455':31 '4aew.werc.ewr':15 '5.005':32
'5aew.werc.ewr:8100':16'6aew.werc.ewr:8100':21 '6aew.werc.ewr:8100/?ad=qwe&dw':20 '7aew.werc.ewr:8100':24
'7aew.werc.ewr:8100/?ad=qwe&dw=%20%32':23'ad':17 'aew.werc.ewr':6 'aew.werc.ewr/?ad=qwe&dw':5 'asdf':37 'dw':19
'efd.r':3'ewr1':43 'ewri2':44 'gist.c':52 'gist.h':50 'gist.h.c':51 'hjwer':42 'jf':39 'jqw':64 'qwe':2,18,27,28,35
'qwe-wer':34'qwer':38 'qwerti':65 'qwqwe':29 'readlin':53,57,59 'rewt/ewr':47 'sdjk':40 'teodor@stack.net':33
'wefjn':48'wer':36 'wow':63 'www.com':4 
  (1 row)

  SELECT length(to_tsvector('english', '345 qwe@efd.r '' http://www.com/ http://aew.werc.ewr/?ad=qwe&dw
1aew.werc.ewr/?ad=qwe&dw2aew.werc.ewr http://3aew.werc.ewr/?ad=qwe&dw http://4aew.werc.ewr http://5aew.werc.ewr:8100/?
ad=qwe&dw6aew.werc.ewr:8100/?ad=qwe&dw 7aew.werc.ewr:8100/?ad=qwe&dw=%20%32 +4.0e-10 qwe qwe qwqwe 234.435 455 5.005
teodor@stack.netqwe-wer asdf <fr>qwer jf sdjk<we hjwer <werrwe> ewr1> ewri2 <a href="qwe<qwe>"> 
--- 393,406 ----
      12 |
      12 | <>
       1 | qwerty
! (133 rows)

  SELECT to_tsvector('english', '345 qwe@efd.r '' http://www.com/ http://aew.werc.ewr/?ad=qwe&dw
1aew.werc.ewr/?ad=qwe&dw2aew.werc.ewr http://3aew.werc.ewr/?ad=qwe&dw http://4aew.werc.ewr http://5aew.werc.ewr:8100/?
ad=qwe&dw6aew.werc.ewr:8100/?ad=qwe&dw 7aew.werc.ewr:8100/?ad=qwe&dw=%20%32 +4.0e-10 qwe qwe qwqwe 234.435 455 5.005
teodor@stack.netqwe-wer asdf <fr>qwer jf sdjk<we hjwer <werrwe> ewr1> ewri2 <a href="qwe<qwe>"> 
  /usr/local/fff /awdf/dwqe/4325 rewt/ewr wefjn /wqe-324/ewr gist.h gist.h.c gist.c. readline 4.2 4.2. 4.2,
readline-4.2readline-4.2. 234 
  <i <b> wow  < jqw <> qwerty');
!


                                                              to_tsvector



              
!
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
!  '+4.0e-10':28 '-4.2':60,62 '/?':18 '/?ad=qwe&dw':7,10,14,24 '/?ad=qwe&dw=%20%32':27 '/awdf/dwqe/4325':48
'/usr/local/fff':47'/wqe-324/ewr':51 '1aew.werc.ewr':9 '1aew.werc.ewr/?ad=qwe&dw':8 '234':63 '234.435':32
'2aew.werc.ewr':11'345':1 '3aew.werc.ewr':13 '3aew.werc.ewr/?ad=qwe&dw':12 '4.2':56,57,58 '455':33 '4aew.werc.ewr':15
'5.005':34'5aew.werc.ewr:8100':17 '5aew.werc.ewr:8100/?':16 '6aew.werc.ewr:8100':23 '6aew.werc.ewr:8100/?ad=qwe&dw':22
'7aew.werc.ewr:8100':26'7aew.werc.ewr:8100/?ad=qwe&dw=%20%32':25 'ad':19 'aew.werc.ewr':6 'aew.werc.ewr/?ad=qwe&dw':5
'asdf':39'dw':21 'efd.r':3 'ewr1':45 'ewri2':46 'gist.c':54 'gist.h':52 'gist.h.c':53 'hjwer':44 'jf':41 'jqw':66
'qwe':2,20,29,30,37'qwe-wer':36 'qwer':40 'qwerti':67 'qwqwe':31 'readlin':55,59,61 'rewt/ewr':49 'sdjk':42
'teodor@stack.net':35'wefjn':50 'wer':38 'wow':65 'www.com':4 
  (1 row)

  SELECT length(to_tsvector('english', '345 qwe@efd.r '' http://www.com/ http://aew.werc.ewr/?ad=qwe&dw
1aew.werc.ewr/?ad=qwe&dw2aew.werc.ewr http://3aew.werc.ewr/?ad=qwe&dw http://4aew.werc.ewr http://5aew.werc.ewr:8100/?
ad=qwe&dw6aew.werc.ewr:8100/?ad=qwe&dw 7aew.werc.ewr:8100/?ad=qwe&dw=%20%32 +4.0e-10 qwe qwe qwqwe 234.435 455 5.005
teodor@stack.netqwe-wer asdf <fr>qwer jf sdjk<we hjwer <werrwe> ewr1> ewri2 <a href="qwe<qwe>"> 
***************
*** 406,412 ****
  <i <b> wow  < jqw <> qwerty'));
   length
  --------
!      51
  (1 row)

  -- ts_debug
--- 408,414 ----
  <i <b> wow  < jqw <> qwerty'));
   length
  --------
!      53
  (1 row)

  -- ts_debug
***************
*** 424,429 ****
--- 426,469 ----
   tag       | XML tag         | </myns:foo-bar_baz.blurfl> | {}             |              |
  (9 rows)

+ -- check parsing of URLs
+ SELECT * from ts_debug('english', 'http://www.harewoodsolutions.co.uk/press.aspx</span>');
+   alias   |  description  |                 token                  | dictionaries | dictionary |
lexemes                  
+
----------+---------------+----------------------------------------+--------------+------------+------------------------------------------
+  protocol | Protocol head | http://                                | {}           |            |
+  url      | URL           | www.harewoodsolutions.co.uk/press.aspx | {simple}     | simple     |
{www.harewoodsolutions.co.uk/press.aspx}
+  host     | Host          | www.harewoodsolutions.co.uk            | {simple}     | simple     |
{www.harewoodsolutions.co.uk}
+  url_path | URL path      | /press.aspx                            | {simple}     | simple     | {/press.aspx}
+  tag      | XML tag       | </span>                                | {}           |            |
+ (5 rows)
+
+ SELECT * from ts_debug('english', 'http://aew.wer0c.ewr/id?ad=qwe&dw<span>');
+   alias   |  description  |           token            | dictionaries | dictionary |           lexemes
+ ----------+---------------+----------------------------+--------------+------------+------------------------------
+  protocol | Protocol head | http://                    | {}           |            |
+  url      | URL           | aew.wer0c.ewr/id?ad=qwe&dw | {simple}     | simple     | {aew.wer0c.ewr/id?ad=qwe&dw}
+  host     | Host          | aew.wer0c.ewr              | {simple}     | simple     | {aew.wer0c.ewr}
+  url_path | URL path      | /id?ad=qwe&dw              | {simple}     | simple     | {/id?ad=qwe&dw}
+  tag      | XML tag       | <span>                     | {}           |            |
+ (5 rows)
+
+ SELECT * from ts_debug('english', 'http://5aew.werc.ewr:8100/?');
+   alias   |  description  |        token         | dictionaries | dictionary |        lexemes
+ ----------+---------------+----------------------+--------------+------------+------------------------
+  protocol | Protocol head | http://              | {}           |            |
+  url      | URL           | 5aew.werc.ewr:8100/? | {simple}     | simple     | {5aew.werc.ewr:8100/?}
+  host     | Host          | 5aew.werc.ewr:8100   | {simple}     | simple     | {5aew.werc.ewr:8100}
+  url_path | URL path      | /?                   | {simple}     | simple     | {/?}
+ (4 rows)
+
+ SELECT * from ts_debug('english', '5aew.werc.ewr:8100/?xx');
+   alias   | description |         token          | dictionaries | dictionary |         lexemes
+ ----------+-------------+------------------------+--------------+------------+--------------------------
+  url      | URL         | 5aew.werc.ewr:8100/?xx | {simple}     | simple     | {5aew.werc.ewr:8100/?xx}
+  host     | Host        | 5aew.werc.ewr:8100     | {simple}     | simple     | {5aew.werc.ewr:8100}
+  url_path | URL path    | /?xx                   | {simple}     | simple     | {/?xx}
+ (3 rows)
+
  -- to_tsquery
  SELECT to_tsquery('english', 'qwe & sKies ');
    to_tsquery
Index: src/test/regress/sql/tsearch.sql
===================================================================
RCS file: /cvsroot/pgsql/src/test/regress/sql/tsearch.sql,v
retrieving revision 1.11
diff -c -r1.11 tsearch.sql
*** src/test/regress/sql/tsearch.sql    19 May 2009 02:48:26 -0000    1.11
--- src/test/regress/sql/tsearch.sql    28 Apr 2010 01:57:14 -0000
***************
*** 105,110 ****
--- 105,116 ----

  SELECT * from ts_debug('english',
'<myns:foo-bar_baz.blurfl>abc&nm1;def©ghiõjkl</myns:foo-bar_baz.blurfl>');

+ -- check parsing of URLs
+ SELECT * from ts_debug('english', 'http://www.harewoodsolutions.co.uk/press.aspx</span>');
+ SELECT * from ts_debug('english', 'http://aew.wer0c.ewr/id?ad=qwe&dw<span>');
+ SELECT * from ts_debug('english', 'http://5aew.werc.ewr:8100/?');
+ SELECT * from ts_debug('english', '5aew.werc.ewr:8100/?xx');
+
  -- to_tsquery

  SELECT to_tsquery('english', 'qwe & sKies ');

Re: Bug with Tsearch and tsvector

From

Jasen Betts

Date:

29 April 2010, 06:39:46

On 2010-04-26, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> From the RFC:
>
>| control     = <US-ASCII coded characters 00-1F and 7F hexadecimal>
>| space       = <US-ASCII coded character 20 hexadecimal>
>| delims      = "<" | ">" | "#" | "%" | <">
>| unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"
>
> Except, of course, that since % is the escape character, it is OK.
>
> Hmm.  Having typed that, I'm staring at the # character, which is
> used to mark off an anchor within an HTML page identified by the
> URL.  Should we consider the # and anchor part of a URL?  Any other
> questionable characters?

\ is popular in URIs on some platfroms, or is URI a different beast

Re: Bug with Tsearch and tsvector

From

Tom Lane

Date:

29 April 2010, 10:32:09

Jasen Betts <jasen@xnet.co.nz> writes:
> \ is popular in URIs on some platfroms, or is URI a different beast

I hope not, because \ is explicitly disallowed by both the older and
newer versions of that RFC.

I did think of proposing that we allow \ and : in FilePath, which is
currently pretty Unix-centric:

regression=# select * from ts_debug('/foo/bar.baz');
 alias |    description    |    token     | dictionaries | dictionary |    lexemes
-------+-------------------+--------------+--------------+------------+----------------
 file  | File or path name | /foo/bar.baz | {simple}     | simple     | {/foo/bar.baz}
(1 row)

regression=# select * from ts_debug(E'C:\\foo\\bar.baz');
   alias   |   description   |  token  |  dictionaries  |  dictionary  |  lexemes
-----------+-----------------+---------+----------------+--------------+-----------
 asciiword | Word, all ASCII | C       | {english_stem} | english_stem | {c}
 blank     | Space symbols   | :\      | {}             |              |
 asciiword | Word, all ASCII | foo     | {english_stem} | english_stem | {foo}
 blank     | Space symbols   | \       | {}             |              |
 host      | Host            | bar.baz | {simple}       | simple       | {bar.baz}
(5 rows)

But that's more or less orthogonal to what URLPath should allow.

            regards, tom lane

Re: Bug with Tsearch and tsvector

From

Jasen Betts

Date:

01 May 2010, 04:12:48

On 2010-04-29, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jasen Betts <jasen@xnet.co.nz> writes:
>> \ is popular in URIs on some platfroms, or is URI a different beast
>
> I hope not, because \ is explicitly disallowed by both the older and
> newer versions of that RFC.

I should have known better than to assume that Microsoft was using a
term correctly.