
Re: Html parsing and inline elements - Mailing list pgsql-hackers

From	Marcelo Zabani
Subject	Re: Html parsing and inline elements
Date	April 13, 2016 18:57:51
Msg-id	CACgY3QavK=P8G-KD6ZRR+M6+y25h+LjicQqp9HYfOiu22GdAFg@mail.gmail.com
In response to	Re: Html parsing and inline elements (Tom Lane <tgl@sss.pgh.pa.us>)
List	pgsql-hackers


Hi, Tom,

You're right, I don't think one can argue that the default parser should know HTML.

How about your suggestion of a separate HTML parser: is it feasible? I ask because I think a lot of people store HTML documents these days, and although there probably isn't much HTML with words split across multiple inline elements, it would certainly be nice to have a proper parser for these use cases.

What do you think?

On Wed, Apr 13, 2016 at 11:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Marcelo Zabani <mzabani@gmail.com> writes:
> I was here wondering whether HTML parsing should separate tokens that are
> not separated by spaces in the original text, but are separated by an
> inline element. Let me show you an example:

> *SELECT to_tsvector('english', 'Helloneighbor, you are
> nice')*
> *Results:** "'ce':7 'hello':1 'n':5 'neighbor':2"*

> "Hello" and "neighbor" should really be separated, because ** is a block
> element, but "nice" should be a single word there, since there is no visual
> separation when rendered (** and ** are inline elements).

I can't imagine that we want to_tsvector to know that much about HTML.
It doesn't, really, even have license to assume that its input *is*
HTML. So even if you see things that look like <foo> and </foo> in the
string, it could easily be XML or SGML or some other SGML-like markup
format with different semantics for the markup keywords.

Perhaps it'd be sane to do something like this as long as the
HTML-specific behavior was broken out into a separate function.
(Or maybe it could be done within to_tsvector as a separate parser
or separate dictionary?) But I don't think it should be part of
the default behavior.

regards, tom lane
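
One way to picture the "separate function" idea is a preprocessing step that runs before the text ever reaches to_tsvector. The following is only a rough sketch under stated assumptions, not anything proposed in the thread: it uses Python's standard html.parser, the names BLOCK_TAGS, InlineAwareTextExtractor and html_to_text are hypothetical, the list of block-level elements is a small illustrative subset, and the sample markup in the usage note is made up for illustration. Block-level tags become a space; inline tags are dropped without adding one, so a word split across inline elements stays a single token.

```python
from html.parser import HTMLParser

# Assumption: a small, illustrative subset of block-level elements,
# not a complete list from the HTML specification.
BLOCK_TAGS = {
    "p", "div", "br", "hr", "li", "ul", "ol", "pre", "blockquote",
    "table", "tr", "td", "th", "h1", "h2", "h3", "h4", "h5", "h6",
}


class InlineAwareTextExtractor(HTMLParser):
    """Collect text, inserting a space only at block-level tag boundaries."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased from HTMLParser.
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        # Inline tags contribute nothing, so adjacent text runs are joined.
        self.parts.append(data)


def html_to_text(html):
    """Return plain text suitable for passing to a text search function."""
    parser = InlineAwareTextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse any run of whitespace into a single space.
    return " ".join("".join(parser.parts).split())
```

With hypothetical input such as html_to_text('Hello<p>neighbor</p>, you are <b>n</b>i<i>ce</i>'), "Hello" and "neighbor" end up separated while "nice" stays in one piece, and the resulting string could then be sent to to_tsvector('english', ...) as an ordinary query parameter.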
