Re: Empty string in lexeme for tsvector - Mailing list pgsql-hackers

From Ranier Vilela
Subject Re: Empty string in lexeme for tsvector
Date
Msg-id CAEudQAo+M-_N6KGHjaOqJcNgfCQPLi_trFb1Qy0ySkbjqzvL3w@mail.gmail.com
Whole thread Raw
In response to Empty string in lexeme for tsvector  (Jean-Christophe Arnu <jcarnu@gmail.com>)
Responses Re: Empty string in lexeme for tsvector
List pgsql-hackers
Em sex., 24 de set. de 2021 às 05:47, Jean-Christophe Arnu <jcarnu@gmail.com> escreveu:
Hello Hackers,

This is my second proposal for a patch, so I hope not to make "rookie" mistakes.

This proposal patch is based on a simple use case :

If one creates a table this way
CREATE TABLE tst_table AS (SELECT array_to_tsvector(ARRAY['','abc','def']));

the table content is :
 array_to_tsvector
-------------------
 '' 'abc' 'def'
(1 row)

First it can be strange to have an empty string for tsvector lexeme but anyway, keep going to the real point.

Once dumped, this table dump contains that empty string that can't be restored.
tsvector_parse (./utils/adt/tsvector_parser.c) raises an error.

Thus it is not possible for data to be restored this way.

There are two ways to consider this : is it alright to have empty strings in lexemes ?
   * If so, empty strings should be correctly parsed by tsvector_parser.
   * If not, one should prevent empty strings from being stored into tsvectors.

Since "empty strings" seems not to be a valid lexeme, I undertook to change some functions dealing with tsvector to check whether string arguments are empty. This might be the wrong path as I'm not familiar with tsvector usage... (OTOH, I can provide a fix patch for tsvector_parser() if I'm wrong).

This involved changing the way functions like array_to_tsvector(), ts_delete() and setweight() behave. As for NULL values, empty string values are checked and an error is raised for such a value. It appears to me that ERRCODE_ZERO_LENGTH_CHARACTER_STRING (2200F) matched this behaviour but I may be wrong.

Since this patch changes the way functions behave, consider it as a simple try to overcome a strange situation we've noticed on a specific use case.

This included patch manages that checks for empty strings on the pointed out functions. It comes with modified regression tests. Patch applies along head/master and 14_RC1.

Comments are more than welcome!
1. Would be better to add this test-and-error before tsvector_bsearch call.

+ if (lex_len == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+ errmsg("lexeme array may not contain empty strings")));
+

If lex_len is equal to zero, better get out soon.

2. The second test-and-error can use lex_len, just like the first test,
I don't see the point in recalculating the size of lex_len if that's already done.

+ if (VARSIZE(dlexemes[i]) - VARHDRSZ == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
+ errmsg("lexeme array may not contain empty strings")));
+

regards,
Ranier Vilela

pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: Skipping logical replication transactions on subscriber side
Next
From: Antonin Houska
Date:
Subject: Re: POC: Cleaning up orphaned files using undo logs