Home > mailing lists

Re: Improve the performance of Unicode Normalization Forms. - Mailing list pgsql-hackers

From	Alexander Borisov
Subject	Re: Improve the performance of Unicode Normalization Forms.
Date	August 14 17:41:42
Msg-id	1260f607-8d90-4cf2-accc-45aa6eab63d8@gmail.com Whole thread Raw
In response to	Re: Improve the performance of Unicode Normalization Forms. (Jeff Davis <pgsql@j-davis.com>)
List	pgsql-hackers

Tree view

14.08.2025 01:12, Jeff Davis wrote:
> On Mon, 2025-08-11 at 17:21 +0300, Alexander Borisov wrote:

[..]

> 
> Comments on the patch itself:
> 
> The 0001 patch generalizes the two-step lookup process: first navigate
> branches to find the index into a partially-compacted sparse array, and
> then use that to get the index into the dense array. The branching
> code, the sparse array, and the dense array are all generated code. The
> reason for the two-step lookup is to keep the sparse array element size
> small (uint16), so that missing elements consume less space (missing
> elements still remain for small gaps). The full entry is kept in the
> dense array.

Yes, that's exactly right.

> GenerateSparseArray.pm would be more descriptive than "Ranges.pm" for
> the new module. And we should probably include "sparse" in the name of
> the sparse arrays.

I don't mind, that's good.

> The new module is responsible for generating the branching code as well
> as the sparse array; while the caller is reponsible for the dense
> arrays. For case mapping, one sparse array is used for four parallel
> arrays for the different case kinds (lower/title/upper/fold).

Yes, that's right.

> The use of zero values for different purposes is getting confusing. It
> means "doesn't exist", but we are also reserving the zeroth element in
> the arrays. Would it be easier to just "#define EMPTY 0xFFFF" and then
> have the caller check for it? That way we don't need to reserve the
> zeroth array element, which should make it easier to avoid off-by-one
> errors.

Initially, that was the case;
However, I subsequently concluded that creating magical values to define
an element that was not found was not an ideal solution.
I felt that 0 was easier to understand.

I'm not entirely sure now, but maybe this will even help get rid of
the comparison to 0. That is, we can get an element from the main table
by index 0. It will be zeroed, and the algorithm will work as it should.
However, it is not certain that this will have any effect on
performance.

In general, I don't have a firm position on this issue.

> I think we can simplify the interface, as well. Why does the caller
> need to separately generate the ranges, then generate the table, then
> generate the branches? It's really all the same action and can be based
> on an input hash with a certain structure, and then return both the
> table and the branches, right?

Yes, it can be done in one go. But the separation was done
intentionally. It makes the code more understandable, and, more
importantly, it makes it easier for developers "from the street"
to understand it and, if necessary, correct or supplement the algorithm.
These are utilities, code for generating an algorithm that, let's say,
will not be used very often, and I would not "compress" it. Here, it is
more important for us to understand how it works.

--
Regards,
Alexander Borisov

pgsql-hackers by date:

From: Greg Burd
Date: 14 August, 17:21:37
Subject: Re: [PATCH] bms_prev_member() can read beyond the end of the array of allocated words

From: David Rowley
Date: 14 August, 17:49:18
Subject: Re: [PATCH] bms_prev_member() can read beyond the end of the array of allocated words

Re: Improve the performance of Unicode Normalization Forms. - Mailing list pgsql-hackers

Previous

Next