tsvector limitations - why and how - Mailing list pgsql-general
From: Stanislav Kozlovski
Subject: tsvector limitations - why and how
Date:
Msg-id: DU0PR10MB60604534BEBB91C67FD743A78A942@DU0PR10MB6060.EURPRD10.PROD.OUTLOOK.COM
Responses: Re: tsvector limitations - why and how
List: pgsql-general
Hey!
I just spent some time writing up an extensive draft e-mail that I planned to send here to report what I thought was a bug.
I was playing around with tsvector and full text search for a pet project of mine, ingesting books' content.
I was aware of the limitations of FTS and tried to ensure I didn't hit any - but what I missed was that the maximum allowed lexeme position is 16383, and every position above that silently gets set to 16383. I was searching for a phrase (two words) at the end of a book and couldn't find it. After debugging, I realized that my phrase's lexemes had their positions set to 16383, which was inaccurate.
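For reference, the clamping is easy to see with a plain literal cast (a minimal example, assuming a stock PostgreSQL install; 'dragon' is just a made-up lexeme):

    -- any explicit position above 16383 is silently set to 16383
    SELECT 'dragon:20000'::tsvector;
    -- output: 'dragon':16383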
I acknowledge that even with this limitation the lexeme is still stored and remains searchable. I commend the authors for this elegant approach of still providing some functionality beyond a limit that saves space.
The problem I had is that it breaks FOLLOWED BY queries, essentially preventing you from matching phrases (more than one word) in large texts.
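Here is a rough sketch of the failure mode I mean, using a made-up document whose last two words land past position 16383 (assuming the default 'english' configuration):

    -- build a document whose last two words sit beyond position 16383
    WITH doc AS (
        SELECT to_tsvector('english', repeat('filler ', 20000) || 'red dragon') AS tsv
    )
    SELECT
        -- plain AND matching still works, because both lexemes are stored
        tsv @@ to_tsquery('english', 'red & dragon')   AS and_match,
        -- the FOLLOWED BY query does not, because both lexemes were
        -- clamped to position 16383 and no longer look adjacent
        tsv @@ to_tsquery('english', 'red <-> dragon') AS phrase_match
    FROM doc;

I'd expect the first expression to come back true and the second false, since both lexemes end up at position 16383.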
I have a few questions here:
- Why is this still the case?
The earliest I could trace this limitation back to is PG 8.3, released 16 years ago (2008). Hardware, and in particular memory and storage, has come a long way since then.
Do these limits still make sense?
I saw a thread from 6 years ago (2018) which asked a similar question, but it was focused on hitting the 1MB limit of the tsvector.
There, a member suggested that you'd probably need a whole new type (e.g. bigtsvector) to make this work. Somebody else shared a project called tsvector2, which was a beefed-up version of the type.
There was also a patch that attempted to remove the 1MB limit, but it stalled and was never merged. I'm not sure whether it would have fixed the positional limitation.
- Are there any other drawbacks that I'm missing?
My understanding of the 16383 position limit is the following:
- Lexemes present beyond position 16383 will still match search queries, because they'll be represented as 'lexeme':16383
- Phrase searches for lexemes beyond position 16383 won't ever find anything
Is there anything else I'm missing?
If not, I'd be happy to contribute this piece of text to the docs - particularly to the limitations section, as I believe it would help people reading it.
- What is the preferred/conventional way to store large text?
If I want to store a whole book's content - like PostgreSQL: Up and Running (2012) - I'd need to store it over 30 rows (it's 300 pages long, 300-page books average about 82,500 words, and English words average about 4-6.5 characters, meaning a single tsvector will hold the positions of no more than roughly 2,520-3,277 words).
30 rows sounds fair (a rough sketch of such a chunked layout is included further below), but there are two complications:
- Splitting the chapters in a logical way is tricky
- FOLLOWED BY queries across rows will not work. This is fine when the text is split along logical boundaries, e.g. at chapter breaks, but not when a single chapter is split across rows. The example book has 10 chapters, so it'd average around 3 rows per chapter.
Are there any unofficial larger tsvector types? Am I doing anything wrong?
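For context, this is roughly the chunked layout I've been experimenting with (table and column names are just mine; the generated column needs PG 12+, otherwise a trigger or an explicitly maintained column does the same job):

    -- one row per chunk of a book, with a precomputed tsvector per chunk
    CREATE TABLE book_chunks (
        book_id    bigint NOT NULL,
        chunk_no   int    NOT NULL,  -- order of the chunk within the book
        chunk_text text   NOT NULL,  -- kept small enough to stay under position 16383
        tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', chunk_text)) STORED,
        PRIMARY KEY (book_id, chunk_no)
    );

    CREATE INDEX book_chunks_tsv_idx ON book_chunks USING GIN (tsv);

    -- phrase search works within a chunk, but never across chunk boundaries
    SELECT book_id, chunk_no
    FROM book_chunks
    WHERE tsv @@ phraseto_tsquery('english', 'red dragon');

This works, but it is exactly this per-row isolation that breaks phrases spanning two chunks.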
Thanks for reading this. I'd appreciate any response.
Best,
Stanislav