Re: BUG #5075: Text Search parser does not identify xml tag when attribute name's contains underscore - Mailing list pgsql-bugs

From Euler Taveira de Oliveira
Subject Re: BUG #5075: Text Search parser does not identify xml tag when attribute name's contains underscore
Msg-id 4ABAAFC8.7030108@timbira.com
In response to BUG #5075: Text Search parser does not identify xml tag when attribute name's contains underscore  ("Marek Lewczuk" <marek@lewczuk.com>)
Responses Re: BUG #5075: Text Search parser does not identify xml tag when attribute name's contains underscore  (Robert Haas <robertmhaas@gmail.com>)
Re: BUG #5075: Text Search parser does not identify xml tag when attribute name's contains underscore  (Peter Eisentraut <peter_e@gmx.net>)
List pgsql-bugs
Marek Lewczuk wrote:
> Please execute the following example:
> select * from ts_debug('english', '<img width="182" height="120"
> align="right" style="margin: 0px 0px 5px 5px;" test_aa="26461"/>')
>
> In the result you will see that <img/> is not identified as an XML tag but
> is instead split into words, blank spaces, etc. The reason is that the last
> attribute, "test_aa", contains an underscore in its name; when the
> underscore is removed, the img tag is properly identified as an XML tag.
>
> The XML definition allows underscores in tag and attribute names.
>
The problem is that we already allow underscores in tag names but not in
attribute names. So the proper fix is to allow the underscore when the state
is TPS_InTag; according to the XML spec [1], the underscore is a valid
character in attribute names.
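
To illustrate the effect of accepting '_' while inside a tag, here is a
minimal, self-contained sketch. It is not the code in wparser_def.c: the
functions in_tag_char_ok() and looks_like_tag() and the allow_underscore
flag are hypothetical names made up for this example, and the accepted
punctuation is only an assumed subset of what the real TPS_InTag state
table allows.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical, simplified stand-in for the TPS_InTag character rules.
 * allow_underscore = false models the behavior before the patch,
 * allow_underscore = true models the behavior after it.
 */
static bool
in_tag_char_ok(unsigned char c, bool allow_underscore)
{
    if (isalnum(c) || isspace(c))
        return true;
    if (c == '_')
        return allow_underscore;
    /* Assumed subset of the punctuation accepted inside a tag. */
    return c != '\0' && strchr("=-#/:.", c) != NULL;
}

/*
 * Returns true if s is one complete tag: '<', then only accepted
 * characters (quoted attribute values are skipped verbatim), then '>'
 * as the final character of the string.
 */
static bool
looks_like_tag(const char *s, bool allow_underscore)
{
    size_t      i;

    if (s[0] != '<')
        return false;

    for (i = 1; s[i] != '\0'; i++)
    {
        unsigned char c = (unsigned char) s[i];

        if (c == '>')
            return s[i + 1] == '\0';

        if (c == '"' || c == '\'')
        {
            /* Jump to the matching closing quote, if there is one. */
            const char *end = strchr(s + i + 1, c);

            if (end == NULL)
                return false;
            i = (size_t) (end - s);
            continue;
        }

        if (!in_tag_char_ok(c, allow_underscore))
            return false;
    }

    return false;               /* ran out of input before '>' */
}
```

With this sketch, looks_like_tag("<img test_aa=\"26461\"/>", false) rejects
the input at the underscore, while the same call with true accepts it, which
mirrors the one-line change in the attached patch.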

A possible downside is that HTML attribute names do not use underscores.
Should the parser fail in that case? I don't think so, but...

The problem exists in 8.3, 8.4, and HEAD. The fix is trivial, so I see no
problem with back-patching it.


[1] http://www.w3.org/TR/REC-xml/#sec-common-syn


--
  Euler Taveira de Oliveira
  http://www.timbira.com/
Index: wparser_def.c
===================================================================
RCS file: /a/pgsql/dev/anoncvs/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.24
diff -c -r1.24 wparser_def.c
*** wparser_def.c    16 Jul 2009 06:33:44 -0000    1.24
--- wparser_def.c    23 Sep 2009 23:19:28 -0000
***************
*** 1225,1230 ****
--- 1225,1231 ----
      {p_isdigit, 0, A_NEXT, TPS_Null, 0, NULL},
      {p_iseqC, '=', A_NEXT, TPS_Null, 0, NULL},
      {p_iseqC, '-', A_NEXT, TPS_Null, 0, NULL},
+     {p_iseqC, '_', A_NEXT, TPS_Null, 0, NULL},
      {p_iseqC, '#', A_NEXT, TPS_Null, 0, NULL},
      {p_iseqC, '/', A_NEXT, TPS_Null, 0, NULL},
      {p_iseqC, ':', A_NEXT, TPS_Null, 0, NULL},
