Thread: FTS parser - missing UUID token type

FTS parser - missing UUID token type

From: Przemysław Sztoch
Date: 14 September 2022, 09:26:41

I miss a UUID token type. UUIDs are indexed very strangely, are becoming more and more popular, and people want to search for them.

See: https://www.postgresql.org/docs/current/textsearch-parsers.html

A UUID is fairly easy to parse:
It is written as 32 hexadecimal digits separated by four hyphens: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX.
The groups contain 8-4-4-4-12 digits. The first digit of the last group of four, the N position, indicates the variant (format and encoding) in its one to three most significant bits.
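
As a rough illustration of the layout described above, here is a minimal Python sketch that checks whether a string has the 8-4-4-4-12 shape. The names UUID_RE and looks_like_uuid are made up for this example, and it only checks the shape (it does not inspect the version or variant bits). It is not how the PostgreSQL parser works.

```python
import re

# 8-4-4-4-12 hexadecimal digits separated by hyphens.
# UUID_RE is a hypothetical helper name, not part of any PostgreSQL API.
UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def looks_like_uuid(text: str) -> bool:
    """Return True if the whole string has the canonical UUID shape."""
    return UUID_RE.fullmatch(text) is not None
```

For example, looks_like_uuid("00633f1d-1fff-409e-8294-40a21f565904") is true, while a string with a missing group or a non-hexadecimal character is rejected.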

Currently, UUIDs are parsed differently from one another, depending on whether the individual parts begin with digits or letters:
00633f1d-1fff-409e-8294-40a21f565904    '-40':6 '00633f1d':2 '00633f1d-1fff-409e':1 '1fff':3 '409e':4 '8294':5 'a21f565904':7
00856c28-2251-4aaf-82d3-e4962f5b732d    '-2251':2 '-4':3 '00856c28':1 '82d3':6 'aaf':5 'aaf-82d3-e4962f5b732d':4 'e4962f5b732d':7
00a1cc84-816a-490a-a99c-8a4c637380b0    '00a1cc84':2 '00a1cc84-816a-490a-a99c-8a4c637380b0':1 '490a':4 '816a':3 '8a4c637380b0':6 'a99c':5

As a result, such identifiers cannot be found in the database later.
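
To make the idea of a dedicated token type concrete, here is a toy tokenizer in Python that tries the UUID shape first and only then falls back to plain alphanumeric words. It is only a sketch under my own assumptions: the names UUID_SHAPE, TOKEN_RE and tokenize are hypothetical, and this is not the PostgreSQL parser or a proposed patch.

```python
import re

# Canonical 8-4-4-4-12 shape (hypothetical helper for this sketch).
UUID_SHAPE = (
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)

# Try the UUID alternative first; it must not be glued to other
# alphanumerics or hyphens. Otherwise fall back to a plain word.
TOKEN_RE = re.compile(
    r"(?<![0-9A-Za-z-])(?P<uuid>" + UUID_SHAPE + r")(?![0-9A-Za-z-])"
    r"|(?P<word>[0-9A-Za-z]+)"
)


def tokenize(text):
    """Return (kind, token) pairs, keeping each UUID as one lowercase token."""
    return [(m.lastgroup, m.group().lower()) for m in TOKEN_RE.finditer(text)]
```

With this sketch, a UUID embedded in running text comes out as a single "uuid" token regardless of whether its parts begin with digits or letters, and the surrounding words are returned as "word" tokens.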

What is your opinion on missing tokens for FTS?

--

Przemysław Sztoch | Mobile +48 509 99 00 66

Re: FTS parser - missing UUID token type

From: Tom Lane
Date: 14 September 2022, 14:10:39

Przemysław Sztoch <przemyslaw@sztoch.pl> writes:
> I miss UUID, which indexes very strangely, is more and more popular and 
> people want to search for it.

Really?  UUIDs in running text seem like an extremely uncommon
use-case to me.  URLs in running text are common nowadays, which is
why the text search parser has special code for that, but UUIDs?

Adding such a thing isn't cost-free either.  Aside from the
probably-substantial development effort, we know from experience
with the URL support that it sometimes misfires and identifies
something as a URL or URL fragment when it really isn't one.
That leads to poorer indexing of the affected text.  It seems
likely that adding a UUID token type would be a net negative
for most people, since they'd be subject to that hazard even if
their text contains no true UUIDs.

It's a shame that the text search parser isn't more extensible.
If it were, you could imagine having such a feature while making
it optional.  I'm not volunteering to fix that though :-(

            regards, tom lane