Re: Bug with Tsearch and tsvector - Mailing list pgsql-bugs

From Kevin Grittner
Subject Re: Bug with Tsearch and tsvector
Msg-id 4BD6BF9D0200002500030F5E@gw.wicourts.gov
In response to Re: Bug with Tsearch and tsvector  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
Responses Re: Bug with Tsearch and tsvector  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Bug with Tsearch and tsvector  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:

> I'll read this RFC closely and follow up later today.

For anyone unclear on the difference between a URI and a URL, every
URL is also a URI, but not every URI is a URL:

  A URI can be further classified as a locator, a name, or both.
  The term "Uniform Resource Locator" (URL) refers to the subset of
  URIs that, in addition to identifying a resource, provide a means
  of locating the resource by describing its primary access
  mechanism (e.g., its network "location").

So rules for URIs apply to URLs.

Regarding allowed characters, the relevant portions seem to be:

  The URI syntax has been designed with global transcription as one
  of its main considerations.  A URI is a sequence of characters
  from a very limited set: the letters of the basic Latin alphabet,
  digits, and a few special characters.


  The generic syntax uses the slash ("/"), question mark ("?"), and
  number sign ("#") characters to delimit components that are
  significant to the generic parser's hierarchical interpretation of
  an identifier.


  A URI is composed from a limited set of characters consisting of
  digits, letters, and a few graphic symbols.  A reserved subset of
  those characters may be used to delimit syntax components within a
  URI while the remaining characters, including both the unreserved
  set and those reserved characters not acting as delimiters, define
  each component's identifying data.


  A percent-encoding mechanism is used to represent a data octet in
  a component when that octet's corresponding character is outside
  the allowed set or is being used as a delimiter of, or within, the
  component.  A percent-encoded octet is encoded as a character
  triplet, consisting of the percent character "%" followed by the
  two hexadecimal digits representing that octet's numeric value.
  For example, "%20" is the percent-encoding for the binary octet
  "00100000" (ABNF: %x20), which in US-ASCII corresponds to the
  space character (SP).  Section 2.4 describes when percent-encoding
  and decoding is applied.

     pct-encoded = "%" HEXDIG HEXDIG

  The uppercase hexadecimal digits 'A' through 'F' are equivalent to
  the lowercase digits 'a' through 'f', respectively.  If two URIs
  differ only in the case of hexadecimal digits used in percent-
  encoded octets, they are equivalent.  For consistency, URI
  producers and normalizers should use uppercase hexadecimal digits
  for all percent-encodings.
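
To make the equivalence rule above concrete, here is a minimal Python
sketch (an illustration only, not anything taken from the tsearch
parser). The function names normalize_pct_case and pct_equivalent are
hypothetical. It uppercases the hex digits of each pct-encoded triplet
and compares two strings on that basis alone; it performs no other
normalization.

```python
import re

# One percent-encoded octet: "%" HEXDIG HEXDIG, per the ABNF quoted above.
PCT_ENCODED = re.compile(r"%[0-9A-Fa-f]{2}")


def normalize_pct_case(uri: str) -> str:
    """Uppercase the hex digits of every percent-encoded octet.

    Hypothetical helper: text outside "%XX" triplets is left untouched,
    and a "%" not followed by two hex digits is not altered.
    """
    return PCT_ENCODED.sub(lambda m: m.group(0).upper(), uri)


def pct_equivalent(a: str, b: str) -> bool:
    """True if a and b differ at most in the case of pct-encoded hex digits."""
    return normalize_pct_case(a) == normalize_pct_case(b)
```

With this, "a%2fb" and "a%2Fb" compare as equivalent, while "a%2fb"
and "A%2fb" do not, because only the hex digits are case-folded.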


     reserved    = gen-delims / sub-delims

     gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

     sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="


     unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"


I think that we should accept all the above characters (reserved and
unreserved) and the percent character (since it is the escape
character) as part of a URL.
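
As a rough illustration of that proposal, the character set could be
expressed as follows. This is a Python sketch under stated
assumptions, not the parser's actual code: the names URL_CHARS,
is_url_char and url_prefix_length are hypothetical, and the scan only
checks membership in the set, not the full URI grammar.

```python
import string

# Character classes copied from the ABNF quoted above.
GEN_DELIMS = ":/?#[]@"
SUB_DELIMS = "!$&'()*+,;="
UNRESERVED = string.ascii_letters + string.digits + "-._~"

# Proposed set: reserved + unreserved + the escape character "%".
URL_CHARS = frozenset(GEN_DELIMS + SUB_DELIMS + UNRESERVED + "%")


def is_url_char(ch: str) -> bool:
    """True if the single character ch belongs to the proposed set."""
    return ch in URL_CHARS


def url_prefix_length(text: str, start: int = 0) -> int:
    """Count consecutive proposed-set characters in text from start.

    Hypothetical helper: it does not validate URI structure, it only
    finds where a run of permitted characters ends.
    """
    end = start
    while end < len(text) and text[end] in URL_CHARS:
        end += 1
    return end - start
```

A scan like this stops at whitespace, double quotes, angle brackets
and other characters outside the set, so a URL embedded in running
text ends at the first such character.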

Certainly *not* back-patchable.

I don't know whether we should try to extract components of the URL,
but if we do, perhaps we should also adopt the standard names for
the components:

  The generic URI syntax consists of a hierarchical sequence of
  components referred to as the scheme, authority, path, query, and
  fragment.

    URI        = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

    hier-part   = "//" authority path-abempty
                / path-absolute
                / path-rootless
                / path-empty

  The scheme and path components are required, though the path may
  be empty (no characters).  When authority is present, the path
  must either be empty or begin with a slash ("/") character.  When
  authority is not present, the path cannot begin with two slash
  characters ("//").  These restrictions result in five different
  ABNF rules for a path (Section 3.3), only one of which will match
  any given URI reference.

  The following are two example URIs and their component parts:

        foo://example.com:8042/over/there?name=ferret#nose
        \_/   \______________/\_________/ \_________/ \__/
         |           |            |            |        |
      scheme     authority       path        query   fragment
         |   _____________________|__
        / \ /                        \
        urn:example:animal:ferret:nose
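
If we did extract components, the split could look something like the
Python sketch below. It uses the regular expression given in Appendix
B of RFC 3986 for breaking a URI reference into its five parts; the
names UriParts and split_uri are hypothetical, and this is not a
proposal for how the tsearch parser itself should be written.

```python
import re
from typing import NamedTuple, Optional

# Regular expression from RFC 3986, Appendix B.
# Groups 2, 4, 5, 7 and 9 hold scheme, authority, path, query, fragment.
URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


class UriParts(NamedTuple):
    scheme: Optional[str]
    authority: Optional[str]
    path: str
    query: Optional[str]
    fragment: Optional[str]


def split_uri(uri: str) -> UriParts:
    """Split uri into the five generic components.

    Components that are absent come back as None; the path is always
    present but may be empty. No validation is performed.
    """
    m = URI_RE.match(uri)
    return UriParts(m.group(2), m.group(4), m.group(5), m.group(7), m.group(9))
```

Applied to the two example URIs above, the first yields scheme "foo",
authority "example.com:8042", path "/over/there", query "name=ferret"
and fragment "nose"; the second yields scheme "urn" and path
"example:animal:ferret:nose", with no authority, query or fragment.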

I'm not really sure where the names we're now using came from.

Of course, the bigger the changes, the less they sound like material
for a quick, 11th hour 9.0 patch.

-Kevin
