Re: Bug with Tsearch and tsvector - Mailing list pgsql-bugs

From	Kevin Grittner
Subject	Re: Bug with Tsearch and tsvector
Date	April 26, 2010 20:55:17
Msg-id	4BD5B7500200002500030E22@gw.wicourts.gov
In response to	Re: Bug with Tsearch and tsvector (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: Bug with Tsearch and tsvector (Tom Lane <tgl@sss.pgh.pa.us>)
List	pgsql-bugs

Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Hmm, thanks for the reference, but I'm not sure this is specifying
> quite what we want to get at.  In particular I note that it
> excludes '%' on the grounds that that ought to be escaped, so I
> guess this is specifying the characters allowed in an underlying
> URI, *not* the textual representation of a URI.

I'm not sure I follow you here -- % is disallowed "raw" because it
is itself the escape character, which allows hexadecimal
specification of any disallowed character.  Since it is the escape
character, we would need to allow it.
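
To make the escaping concrete, here is a minimal sketch using
Python's standard urllib.parse module (my choice for illustration;
nothing in the parser itself uses it).  It shows excluded characters
being written as % followed by two hex digits, and a literal %
having to be escaped in the same way:

```python
from urllib.parse import quote, unquote

# "<" and ">" are excluded characters, so they are written as %3C and %3E.
raw = "a<b>c"
escaped = quote(raw, safe="")
print(escaped)                  # a%3Cb%3Ec
print(unquote(escaped))         # a<b>c

# A literal percent sign must itself be escaped, as %25.
percent = quote("100%", safe="")
print(percent)                  # 100%25
```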

Section 2.4, taken as a whole, makes sense to me, and argues that we
should always treat any text representation of a URI (including a
URL) as being in escaped form.  If it weren't for backward
compatibility, I would feel strongly that we should take any of the
excluded characters as the end of a URI.

| A URI is always in an "escaped" form, since escaping or unescaping
| a completed URI might change its semantics.  Normally, the only
| time escape encodings can safely be made is when the URI is being
| created from its component parts; each component may have its own
| set of characters that are reserved, so only the mechanism
| responsible for generating or interpreting that component can
| determine whether or not escaping a character will change its
| semantics. Likewise, a URI must be separated into its components
| before the escaped characters within those components can be
| safely decoded.
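
The last sentence of that passage can be illustrated with a small
sketch, again assuming Python's urllib.parse and a made-up URL.
Decoding the whole URI before splitting it turns an escaped "?" in
the path into a query delimiter, so the components come out
differently:

```python
from urllib.parse import urlsplit, unquote

# Hypothetical URL: the path contains an escaped "?" (%3F).
url = "http://example.com/what%3Fnow?x=1"

# Split into components first, then decode each component.
parts = urlsplit(url)
path_ok = unquote(parts.path)
query_ok = parts.query
print(path_ok, query_ok)        # /what?now x=1

# Decode the whole URI first, then split: the semantics change.
wrong = urlsplit(unquote(url))
path_bad = wrong.path
query_bad = wrong.query
print(path_bad, query_bad)      # /what now?x=1
```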

> Still, it seems like this is a sufficient defense against any
> complaints we might get for not treating "<" or ">" as part of a
> URL.

I would think so.

> I wonder whether we ought to reject any of the other characters
> listed here too.  Right now, the InURLPath state seems to eat
> everything until a space, quote, or double quote mark.  We could
> easily make it stop at "<" or ">" too, but what else?

From the RFC:

| control     = <US-ASCII coded characters 00-1F and 7F hexadecimal>
| space       = <US-ASCII coded character 20 hexadecimal>
| delims      = "<" | ">" | "#" | "%" | <">
| unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

Except, of course, that % is OK, since it is the escape character.
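
As a rough sketch of what stopping at these characters could look
like (in Python for brevity; this is not the parser's actual code,
and url_end and STOP are names invented here), the scan below ends a
URL at the first character from the list quoted above, with %
allowed through.  It follows the list literally, so it also stops at
"#", which may or may not be what we want:

```python
# Characters from the excluded list quoted above, minus "%", which is
# kept because it introduces an escape sequence.
CONTROL = {chr(c) for c in range(0x00, 0x20)} | {chr(0x7F)}
SPACE = {" "}
DELIMS = set('<>#"')            # "%" deliberately left out
UNWISE = set("{}|\\^[]`")
STOP = CONTROL | SPACE | DELIMS | UNWISE


def url_end(text: str, start: int = 0) -> int:
    """Return the index just past the last character accepted as URL text."""
    i = start
    while i < len(text) and text[i] not in STOP:
        i += 1
    return i


sample = "http://example.com/a%20b>rest"
print(sample[:url_end(sample)])     # http://example.com/a%20b
```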

Hmm.  Having typed that, I'm staring at the # character, which is
used to mark off an anchor within an HTML page identified by the
URL.  Should we consider the # and the anchor that follows it to be
part of a URL?  Are there any other questionable characters?
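
For comparison only, here is how one general-purpose library treats
the anchor.  Python's urllib.parse (an outside example, not a
statement about what the parser should do) reports the part after #
as a separate fragment component and offers a helper to strip it.
The URL is made up:

```python
from urllib.parse import urlsplit, urldefrag

url = "http://example.com/doc.html#section-2"   # hypothetical URL

parts = urlsplit(url)
print(parts.path)               # /doc.html
print(parts.fragment)           # section-2

# urldefrag returns the URL without the fragment, plus the fragment.
base, frag = urldefrag(url)
print(base)                     # http://example.com/doc.html
print(frag)                     # section-2
```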

-Kevin
