Re: Bug with Tsearch and tsvector - Mailing list pgsql-bugs
From | Tom Lane |
---|---|
Subject | Re: Bug with Tsearch and tsvector |
Date | |
Msg-id | 22254.1272339139@sss.pgh.pa.us Whole thread Raw |
In response to | Re: Bug with Tsearch and tsvector ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>) |
Responses |
Re: Bug with Tsearch and tsvector
|
List | pgsql-bugs |
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Hmm. Having typed that, I'm staring at the # character, which is > used to mark off an anchor within an HTML page identified by the > URL. Should we consider the # and anchor part of a URL? Yeah, I would think so. This discussion is making me think that my previous patch went in the wrong direction. The way that the existing code works is that after seeing something that looks host-name-ish followed by a '/', it goes to the FilePath state, which is why "/press.aspx" gets parsed as a file name in my previous example. It only goes to the URLPath state if, while in FilePath state, it sees '?'. This seems a tad bizarre, and it means that anything we do to the URLPath rules will only affect the part of a URL following a '?'. What I think might be the right thing instead, if we are going to tighten up what URLPath accepts, is to go directly to URLPath state after seeing host-name-and-'/'. This eliminates the problem of sometimes reporting "file" where we would have said "url_path" before, and gives us a chance to apply the URLPath rules uniformly to all text following a hostname. Attached is a patch that does it that way instead. We'd probably not want to apply this as-is, but should first tighten up what characters URLPath allows, per Kevin's spec research. I find that this patch does create a couple of changes in the regression test outputs. The reason is that it parses this case differently: select * from ts_debug('http://5aew.werc.ewr:8100/?'); Existing code says that that is alias | description | token | dictionaries | dictionary | lexemes ----------+---------------+--------------------+--------------+------------+---------------------- protocol | Protocol head | http:// | {} | | host | Host | 5aew.werc.ewr:8100 | {simple} | simple | {5aew.werc.ewr:8100} blank | Space symbols | /? | {} | | (3 rows) while with this patch we get alias | description | token | dictionaries | dictionary | lexemes ----------+---------------+----------------------+--------------+------------+------------------------ protocol | Protocol head | http:// | {} | | url | URL | 5aew.werc.ewr:8100/? | {simple} | simple | {5aew.werc.ewr:8100/?} host | Host | 5aew.werc.ewr:8100 | {simple} | simple | {5aew.werc.ewr:8100} url_path | URL path | /? | {simple} | simple | {/?} (4 rows) Offhand I see no reason to discriminate against "/?" as a URL path, so this change seems fine to me, but it is a change. Thoughts? regards, tom lane Index: src/backend/tsearch/wparser_def.c =================================================================== RCS file: /cvsroot/pgsql/src/backend/tsearch/wparser_def.c,v retrieving revision 1.29 diff -c -r1.29 wparser_def.c *** src/backend/tsearch/wparser_def.c 26 Apr 2010 17:10:18 -0000 1.29 --- src/backend/tsearch/wparser_def.c 27 Apr 2010 03:27:08 -0000 *************** *** 707,715 **** int res = 0; tmpprs->state = newTParserPosition(tmpprs->state); ! tmpprs->state->state = TPS_InFileFirst; ! if (TParserGet(tmpprs) && (tmpprs->type == URLPATH || tmpprs->type == FILEPATH)) { prs->state->posbyte += tmpprs->lenbytetoken; prs->state->poschar += tmpprs->lenchartoken; --- 707,715 ---- int res = 0; tmpprs->state = newTParserPosition(tmpprs->state); ! tmpprs->state->state = TPS_InURLPathFirst; ! if (TParserGet(tmpprs) && tmpprs->type == URLPATH) { prs->state->posbyte += tmpprs->lenbytetoken; prs->state->poschar += tmpprs->lenchartoken; *************** *** 1441,1447 **** {p_isdigit, 0, A_NEXT, TPS_InFile, 0, NULL}, {p_iseqC, '.', A_NEXT, TPS_InPathFirst, 0, NULL}, {p_iseqC, '_', A_NEXT, TPS_InFile, 0, NULL}, - {p_iseqC, '?', A_PUSH, TPS_InURLPathFirst, 0, NULL}, {p_iseqC, '~', A_PUSH, TPS_InFileTwiddle, 0, NULL}, {NULL, 0, A_POP, TPS_Null, 0, NULL} }; --- 1441,1446 ---- *************** *** 1488,1494 **** {p_iseqC, '_', A_NEXT, TPS_InFile, 0, NULL}, {p_iseqC, '-', A_NEXT, TPS_InFile, 0, NULL}, {p_iseqC, '/', A_PUSH, TPS_InFileFirst, 0, NULL}, - {p_iseqC, '?', A_PUSH, TPS_InURLPathFirst, 0, NULL}, {NULL, 0, A_BINGO, TPS_Base, FILEPATH, NULL} }; --- 1487,1492 ---- *************** *** 1504,1510 **** {p_isEOF, 0, A_POP, TPS_Null, 0, NULL}, {p_iseqC, '"', A_POP, TPS_Null, 0, NULL}, {p_iseqC, '\'', A_POP, TPS_Null, 0, NULL}, ! {p_isnotspace, 0, A_CLEAR, TPS_InURLPath, 0, NULL}, {NULL, 0, A_POP, TPS_Null, 0, NULL}, }; --- 1502,1510 ---- {p_isEOF, 0, A_POP, TPS_Null, 0, NULL}, {p_iseqC, '"', A_POP, TPS_Null, 0, NULL}, {p_iseqC, '\'', A_POP, TPS_Null, 0, NULL}, ! {p_iseqC, '<', A_POP, TPS_Null, 0, NULL}, ! {p_iseqC, '>', A_POP, TPS_Null, 0, NULL}, ! {p_isnotspace, 0, A_NEXT, TPS_InURLPath, 0, NULL}, {NULL, 0, A_POP, TPS_Null, 0, NULL}, }; *************** *** 1516,1521 **** --- 1516,1523 ---- {p_isEOF, 0, A_BINGO, TPS_Base, URLPATH, NULL}, {p_iseqC, '"', A_BINGO, TPS_Base, URLPATH, NULL}, {p_iseqC, '\'', A_BINGO, TPS_Base, URLPATH, NULL}, + {p_iseqC, '<', A_BINGO, TPS_Base, URLPATH, NULL}, + {p_iseqC, '>', A_BINGO, TPS_Base, URLPATH, NULL}, {p_isnotspace, 0, A_NEXT, TPS_InURLPath, 0, NULL}, {NULL, 0, A_BINGO, TPS_Base, URLPATH, NULL} };
pgsql-bugs by date: