Thread: tsvector limitations - why and how

tsvector limitations - why and how

From
Stanislav Kozlovski
Date:
Hey!

I just spent some time extensively documenting a draft e-mail I planned to send over here to report what I thought was a bug.

I was playing around with tsvector and full text search for a pet project of mine, ingesting books' content.

I was aware of the limitations of FTS and tried to ensure I didn't hit any - but what I missed was that the maximum allowed lexeme position was 16383 and everything above silently gets set to 16383. I was searching for a phrase (two words) at the end of the book and couldn't find it. After debugging I realized that my phrase's lexemes were being set to 16383, which was inaccurate.

I acknowledge that the limitation still stores the lexeme itself and keeps it searchable. I commend the authors for the elegant approach of still providing functionality beyond the limit while saving space.

The problem I had is that it breaks FOLLOWED BY queries, essentially stopping you from being able to match on phrases (more than one word) on large text.
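
To illustrate, here's a minimal repro I put together (the filler/needle words are arbitrary; the point is that ~17,000 filler words push the trailing phrase past the position cap):

  -- both lexemes of the trailing phrase get clamped to 16383, so the
  -- FOLLOWED BY query no longer sees them as adjacent
  WITH doc AS (
      SELECT repeat('filler ', 17000) || 'needle haystack' AS body
  )
  SELECT to_tsvector('english', body) @@ to_tsquery('english', 'needle <-> haystack')
    FROM doc;
  -- returns false, even though the phrase is literally at the end of the text

  -- for comparison, the same phrase in a short document matches fine
  SELECT to_tsvector('english', 'needle haystack')
         @@ to_tsquery('english', 'needle <-> haystack');
  -- returns true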

I have a few questions here:

  1. Why is this still the case?

The earliest I could find this limitation in place was PG 8.3, released 16 years ago (2008). Hardware, and in particular memory and storage, have come a long way since.
Do these limits still make sense?

I saw a thread from 6 years ago (2018) which asked a similar question, but it was focused on hitting the 1MB limit of the tsvector.
There, a member suggested that you'd probably need a whole new type (e.g. bigtsvector) to make this work. Somebody else shared a project called tsvector2, which was a beefed-up version of the type.
There was also a patch that attempted to remove the 1MB limit, but it stalled and was never merged. I don't know whether it would have fixed the positional limitation.

  2. Are there any other drawbacks that I'm missing?

My understanding of the 16383 position limit is the following:
  • Lexemes present beyond the 16383 character mark will still match search queries, because they'll be represented as 'lexeme':16383 (quick check below)
  • Phrase searches for lexemes beyond the 16383 character mark won't ever find anything
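
For example (same arbitrary filler document as above), a plain lexeme query still matches even though the position is clamped:

  WITH doc AS (
      SELECT repeat('filler ', 17000) || 'needle' AS body
  )
  SELECT to_tsvector('english', body) @@ to_tsquery('english', 'needle')
    FROM doc;
  -- returns true: the lexeme is stored (at position 16383), so plain matching still works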
Is there anything else I'm missing?

If not - I'd be happy to contribute this piece of text to the docs, particularly to the limitations section, as I believe it would help people reading it.

  3. What is the preferred/conventional way to store large text?

If I want to store a whole book's content - like PostgreSQL: Up and Running (2012) - I'd need to store it across 30 rows. (it's 300 pages long, 300-page books average about 82500 words, English words average about 4-6.5 characters, meaning a tsvector will hold the positions of no more than [2520-3277] words).

30 rows sounds fair, but there are two complications (a rough sketch of the layout I mean follows this list):
  • Splitting the chapters in a logical way is tricky
  • FOLLOWED BY queries between the rows will not work. This is fine when the text is split logically, e.g. at a chapter boundary, but not when it's split within a chapter. The example book has 10 chapters, so it'd average around 3 rows a chapter.
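
For reference, this is roughly what I mean - table and column names are mine, just a sketch:

  -- one row per chapter (or sub-chapter) chunk, with a stored tsvector
  -- and a GIN index over it; phrase search works within a chunk, but a
  -- phrase spanning two chunks will never match
  CREATE TABLE book_chunk (
      book_id  int  NOT NULL,
      chunk_no int  NOT NULL,
      body     text NOT NULL,
      body_tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', body)) STORED,
      PRIMARY KEY (book_id, chunk_no)
  );
  CREATE INDEX book_chunk_tsv_idx ON book_chunk USING gin (body_tsv);

  -- e.g. a phrase search over all chunks of all books
  SELECT book_id, chunk_no
    FROM book_chunk
   WHERE body_tsv @@ phraseto_tsquery('english', 'full text search');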

Are there any unofficial larger tsvector types? Am I doing anything wrong?

Thanks for reading this. I'd appreciate any response.

Best,
Stanislav

Re: tsvector limitations - why and how

From
Tom Lane
Date:
Stanislav Kozlovski <Stanislav_Kozlovski@outlook.com> writes:
> I was aware of the limitations of FTS<https://www.postgresql.org/docs/17/textsearch-limitations.html> and tried to ensure I didn't hit any - but what I missed was that the maximum allowed lexeme position was 16383 and everything above silently gets set to 16383. I was searching for a phrase (two words) at the end of the book and couldn't find it. After debugging I realized that my phrase's lexemes were being set to 16383, which was inaccurate.
> ...
> The problem I had is that it breaks FOLLOWED BY queries, essentially stopping you from being able to match on phrases (more than one word) on large text.

Yeah.  FOLLOWED BY didn't exist when the tsvector storage
representation was designed, so the possible inaccuracy of the
lexeme positions wasn't such a big deal.

> Why is this still the case?

Because nobody's done the significant amount of work needed to make
it better.  I think an acceptable patch would have to support both
the current tsvector representation and a "big" version that's able
to handle anything up to the 1GB varlena limit.  (If you were hoping
for documents bigger than that, you'd be needing a couple more
orders of magnitude worth of work.)  We might also find that there
are performance bottlenecks that'd have to be improved, but even just
making the code cope with two representations would be a big patch.

There has been some cursory talk about this, I think, but I don't
believe anyone's actually worked on it since the 2017 patch you
mentioned.  I'm not sure if that patch is worth using as the basis
for a fresh try: it looks like it had some performance issues, and
AFAICS it didn't really improve the lexeme-position limit.

(Wanders away wondering if the expanded-datum infrastructure could
be exploited here...)

            regards, tom lane



Re: tsvector limitations - why and how

From
Stanislav Kozlovski
Date:
Thanks for the reply, Tom.

Makes sense to me.

Worth saying - one very large misunderstanding that was pointed out to me is that the position limit is not per character - it is per word. This makes sense given that the parser works word by word - but I completely missed it. It completely changes my calculations:

> If I want to store a whole book's content - like PostgreSQL: Up and Running (2012) - I'd need to store it across 30 rows. (it's 300 pages long, 300-page books average about 82500 words, English words average about 4-6.5 characters, meaning a tsvector will hold the positions of no more than [2520-3277] words).

to having to store 8250 words per chapter, meaning just 10 rows, as each chapter would cleanly fit into a tsvector (assuming the 16383 position limit isn't hit).
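
A quick check that drove this home for me (my own example, using the 'simple' configuration so stemming doesn't get in the way):

  SELECT to_tsvector('simple', 'elephant kangaroo hippopotamus');
  -- 'elephant':1 'hippopotamus':3 'kangaroo':2
  -- positions count words, not characters: the third word is at
  -- position 3 even though it starts at character 19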

Meaning the limitations aren't as egregious as I first thought. This further explains why there hasn't been much interest from others in expanding the limits.


