Re: Bug with Tsearch and tsvector - Mailing list pgsql-bugs

From Tom Lane
Subject Re: Bug with Tsearch and tsvector
Date
Msg-id 22254.1272339139@sss.pgh.pa.us
Whole thread Raw
In response to Re: Bug with Tsearch and tsvector  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
Responses Re: Bug with Tsearch and tsvector
List pgsql-bugs
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Hmm.  Having typed that, I'm staring at the # character, which is
> used to mark off an anchor within an HTML page identified by the
> URL.  Should we consider the # and anchor part of a URL?

Yeah, I would think so.

This discussion is making me think that my previous patch went in the
wrong direction.  The way that the existing code works is that after
seeing something that looks host-name-ish followed by a '/', it goes to
the FilePath state, which is why "/press.aspx" gets parsed as a file
name in my previous example.  It only goes to the URLPath state if,
while in FilePath state, it sees '?'.  This seems a tad bizarre, and
it means that anything we do to the URLPath rules will only affect the
part of a URL following a '?'.

What I think might be the right thing instead, if we are going to
tighten up what URLPath accepts, is to go directly to URLPath state
after seeing host-name-and-'/'.  This eliminates the problem of
sometimes reporting "file" where we would have said "url_path"
before, and gives us a chance to apply the URLPath rules uniformly
to all text following a hostname.

Attached is a patch that does it that way instead.  We'd probably
not want to apply this as-is, but should first tighten up what
characters URLPath allows, per Kevin's spec research.

I find that this patch does create a couple of changes in the regression
test outputs.  The reason is that it parses this case differently:
    select * from ts_debug('http://5aew.werc.ewr:8100/?');
Existing code says that that is

  alias   |  description  |       token        | dictionaries | dictionary |       lexemes
----------+---------------+--------------------+--------------+------------+----------------------
 protocol | Protocol head | http://            | {}           |            |
 host     | Host          | 5aew.werc.ewr:8100 | {simple}     | simple     | {5aew.werc.ewr:8100}
 blank    | Space symbols | /?                 | {}           |            |
(3 rows)

while with this patch we get

  alias   |  description  |        token         | dictionaries | dictionary |        lexemes
----------+---------------+----------------------+--------------+------------+------------------------
 protocol | Protocol head | http://              | {}           |            |
 url      | URL           | 5aew.werc.ewr:8100/? | {simple}     | simple     | {5aew.werc.ewr:8100/?}
 host     | Host          | 5aew.werc.ewr:8100   | {simple}     | simple     | {5aew.werc.ewr:8100}
 url_path | URL path      | /?                   | {simple}     | simple     | {/?}
(4 rows)

Offhand I see no reason to discriminate against "/?" as a URL path, so
this change seems fine to me, but it is a change.

Thoughts?

            regards, tom lane

Index: src/backend/tsearch/wparser_def.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.29
diff -c -r1.29 wparser_def.c
*** src/backend/tsearch/wparser_def.c    26 Apr 2010 17:10:18 -0000    1.29
--- src/backend/tsearch/wparser_def.c    27 Apr 2010 03:27:08 -0000
***************
*** 707,715 ****
      int            res = 0;

      tmpprs->state = newTParserPosition(tmpprs->state);
!     tmpprs->state->state = TPS_InFileFirst;

!     if (TParserGet(tmpprs) && (tmpprs->type == URLPATH || tmpprs->type == FILEPATH))
      {
          prs->state->posbyte += tmpprs->lenbytetoken;
          prs->state->poschar += tmpprs->lenchartoken;
--- 707,715 ----
      int            res = 0;

      tmpprs->state = newTParserPosition(tmpprs->state);
!     tmpprs->state->state = TPS_InURLPathFirst;

!     if (TParserGet(tmpprs) && tmpprs->type == URLPATH)
      {
          prs->state->posbyte += tmpprs->lenbytetoken;
          prs->state->poschar += tmpprs->lenchartoken;
***************
*** 1441,1447 ****
      {p_isdigit, 0, A_NEXT, TPS_InFile, 0, NULL},
      {p_iseqC, '.', A_NEXT, TPS_InPathFirst, 0, NULL},
      {p_iseqC, '_', A_NEXT, TPS_InFile, 0, NULL},
-     {p_iseqC, '?', A_PUSH, TPS_InURLPathFirst, 0, NULL},
      {p_iseqC, '~', A_PUSH, TPS_InFileTwiddle, 0, NULL},
      {NULL, 0, A_POP, TPS_Null, 0, NULL}
  };
--- 1441,1446 ----
***************
*** 1488,1494 ****
      {p_iseqC, '_', A_NEXT, TPS_InFile, 0, NULL},
      {p_iseqC, '-', A_NEXT, TPS_InFile, 0, NULL},
      {p_iseqC, '/', A_PUSH, TPS_InFileFirst, 0, NULL},
-     {p_iseqC, '?', A_PUSH, TPS_InURLPathFirst, 0, NULL},
      {NULL, 0, A_BINGO, TPS_Base, FILEPATH, NULL}
  };

--- 1487,1492 ----
***************
*** 1504,1510 ****
      {p_isEOF, 0, A_POP, TPS_Null, 0, NULL},
      {p_iseqC, '"', A_POP, TPS_Null, 0, NULL},
      {p_iseqC, '\'', A_POP, TPS_Null, 0, NULL},
!     {p_isnotspace, 0, A_CLEAR, TPS_InURLPath, 0, NULL},
      {NULL, 0, A_POP, TPS_Null, 0, NULL},
  };

--- 1502,1510 ----
      {p_isEOF, 0, A_POP, TPS_Null, 0, NULL},
      {p_iseqC, '"', A_POP, TPS_Null, 0, NULL},
      {p_iseqC, '\'', A_POP, TPS_Null, 0, NULL},
!     {p_iseqC, '<', A_POP, TPS_Null, 0, NULL},
!     {p_iseqC, '>', A_POP, TPS_Null, 0, NULL},
!     {p_isnotspace, 0, A_NEXT, TPS_InURLPath, 0, NULL},
      {NULL, 0, A_POP, TPS_Null, 0, NULL},
  };

***************
*** 1516,1521 ****
--- 1516,1523 ----
      {p_isEOF, 0, A_BINGO, TPS_Base, URLPATH, NULL},
      {p_iseqC, '"', A_BINGO, TPS_Base, URLPATH, NULL},
      {p_iseqC, '\'', A_BINGO, TPS_Base, URLPATH, NULL},
+     {p_iseqC, '<', A_BINGO, TPS_Base, URLPATH, NULL},
+     {p_iseqC, '>', A_BINGO, TPS_Base, URLPATH, NULL},
      {p_isnotspace, 0, A_NEXT, TPS_InURLPath, 0, NULL},
      {NULL, 0, A_BINGO, TPS_Base, URLPATH, NULL}
  };

pgsql-bugs by date:

Previous
From: "Kevin Grittner"
Date:
Subject: Re: Bug with Tsearch and tsvector
Next
From: "Kevin Grittner"
Date:
Subject: Re: Bug with Tsearch and tsvector