Html parsing and inline elements - Mailing list pgsql-hackers

From Marcelo Zabani
Subject Html parsing and inline elements
Msg-id CACgY3QZ0_TX4LBC8=RRCRGM2Mgos6S8jj8AhxYMP6P5EM2M4yQ@mail.gmail.com
Responses Re: Html parsing and inline elements  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hi everyone,

I was wondering whether HTML parsing should split tokens that are not separated by spaces in the original text but are separated only by an inline element. Here is an example:

SELECT to_tsvector('english', 'Hello<p>neighbor</p>, you are <strong>n</strong>i<em>ce</em>');
Result: "'ce':7 'hello':1 'n':5 'neighbor':2"

"Hello" and "neighbor" should really be separated, because <p> is a block element, but "nice" should be a single word there, since there is no visual separation when rendered (<em> and <strong> are inline elements).
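To make the distinction concrete, here is a minimal client-side sketch in Python of the behavior I have in mind. It is not how the PostgreSQL parser works, only a workaround that preprocesses the text before it is handed to to_tsvector. The flatten_html helper and the INLINE_TAGS list are hypothetical names of my own, the list is deliberately incomplete, and the regular expressions assume reasonably well-formed markup.

```python
import re

# Hypothetical, non-exhaustive list of inline elements; extend as needed.
INLINE_TAGS = ("a", "b", "em", "i", "span", "strong")

# Matches opening and closing inline tags, with or without attributes.
# The \b keeps "<b" from matching "<body>" and "<i" from matching "<img>".
_INLINE_RE = re.compile(r"</?(?:%s)\b[^>]*>" % "|".join(INLINE_TAGS), re.IGNORECASE)

# Matches any remaining tag; these are treated as block-level separators.
_ANY_TAG_RE = re.compile(r"<[^>]+>")


def flatten_html(html: str) -> str:
    """Drop inline tags without adding a space; replace every other tag with a space."""
    without_inline = _INLINE_RE.sub("", html)
    return _ANY_TAG_RE.sub(" ", without_inline)


text = flatten_html("Hello<p>neighbor</p>, you are <strong>n</strong>i<em>ce</em>")
print(text)  # Hello neighbor , you are nice
```

With the tags flattened this way, "Hello" and "neighbor" stay separate while "nice" reaches the parser as one word, so I would expect to_tsvector to index it as a single lexeme.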

Sorry if this has been asked before, but I couldn't find it anywhere.

Thanks in advance,
Marcelo.
