Re: Bug with Tsearch and tsvector - Mailing list pgsql-bugs

From Kevin Grittner
Subject Re: Bug with Tsearch and tsvector
Msg-id 4BD5B7500200002500030E22@gw.wicourts.gov
In response to Re: Bug with Tsearch and tsvector  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Bug with Tsearch and tsvector  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Hmm, thanks for the reference, but I'm not sure this is specifying
> quite what we want to get at.  In particular I note that it
> excludes '%' on the grounds that that ought to be escaped, so I
> guess this is specifying the characters allowed in an underlying
> URI, *not* the textual representation of a URI.

I'm not sure I follow you here -- % is disallowed "raw" because it
is itself the escape character, which introduces the hexadecimal
code for any disallowed character.  Since every escape sequence
starts with it, we would need to allow it.

Section 2.4, taken as a whole, makes sense to me, and argues that we
should always treat any text representation of a URI (including a
URL) as being in escaped form.  If it weren't for backward
compatibility, I would feel strongly that we should take any of the
excluded characters as the end of a URI.

| A URI is always in an "escaped" form, since escaping or unescaping
| a completed URI might change its semantics.  Normally, the only
| time escape encodings can safely be made is when the URI is being
| created from its component parts; each component may have its own
| set of characters that are reserved, so only the mechanism
| responsible for generating or interpreting that component can
| determine whether or not escaping a character will change its
| semantics. Likewise, a URI must be separated into its components
| before the escaped characters within those components can be
| safely decoded.
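
A minimal sketch of why the order matters, using Python's standard
urllib.parse.  The example URI and the variable names are made up for
illustration; nothing here reflects how the tsearch parser works.

```python
from urllib.parse import urlsplit, unquote

# Hypothetical URI: the first path segment contains an escaped "/",
# and the query value contains an escaped "&".
uri = "http://example.com/a%2Fb/c?q=x%26y"

# Split into components first, then decode each path segment.
parts = urlsplit(uri)
segments = [unquote(s) for s in parts.path.split("/")[1:]]

# Decode the completed URI first, then split: the escaped "/" turns
# into a real separator and the path gains a segment.
decoded_first = urlsplit(unquote(uri))
wrong_segments = decoded_first.path.split("/")[1:]

print(segments)        # two segments, the first containing "/"
print(wrong_segments)  # three segments
```

Decoding first also turns the query "q=x%26y" into "q=x&y", which a
query parser would read as two parameters rather than one.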

> Still, it seems like this is a sufficient defense against any
> complaints we might get for not treating "<" or ">" as part of a
> URL.

I would think so.

> I wonder whether we ought to reject any of the other characters
> listed here too.  Right now, the InURLPath state seems to eat
> everything until a space, quote, or double quote mark.  We could
> easily make it stop at "<" or ">" too, but what else?

From the RFC:

| control     = <US-ASCII coded characters 00-1F and 7F hexadecimal>
| space       = <US-ASCII coded character 20 hexadecimal>
| delims      = "<" | ">" | "#" | "%" | <">
| unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

Except, of course, that % is OK, since it is the escape character.
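
One way to picture a scanner that ends a URL at any of those excluded
characters, as a Python sketch only.  The names EXCLUDED and url_end
and the hash_ends_url switch are hypothetical; this is not the
InURLPath code, and it leaves out the single quote that the current
parser also stops at.

```python
# Character classes from the RFC excerpt above, with "%" left out of
# delims because it is the escape character.
CONTROL = {chr(c) for c in range(0x00, 0x20)} | {"\x7f"}
SPACE = {" "}
DELIMS = set('<>#"')
UNWISE = set("{}|\\^[]`")
EXCLUDED = CONTROL | SPACE | DELIMS | UNWISE


def url_end(text, start=0, hash_ends_url=True):
    """Return the index just past the last URL character.

    Scanning begins at `start` and stops at the first excluded
    character.  With hash_ends_url=False, "#" is treated as part of
    the URL instead of as a terminator.
    """
    stop = EXCLUDED if hash_ends_url else EXCLUDED - {"#"}
    i = start
    while i < len(text) and text[i] not in stop:
        i += 1
    return i


text = "see <http://example.com/a%20b>, ok"
start = text.index("http")
print(text[start:url_end(text, start)])
```

With this rule the trailing ">" is not swallowed into the URL, while
the escaped space "%20" stays inside it.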

Hmm.  Having typed that, I'm staring at the # character, which is
used to mark off an anchor within an HTML page identified by the
URL.  Should we consider the # and the anchor that follows it to be
part of a URL?  Any other questionable characters?

-Kevin
