Re: Unicode normalization SQL functions - Mailing list pgsql-hackers

From	Daniel Verite
Subject	Re: Unicode normalization SQL functions
Date	January 6, 2020 19:00:11
Msg-id	3348a374-0325-4768-bff6-736eb76a5f9c@manitou-mail.org
In response to	Unicode normalization SQL functions (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
Responses	Re: Unicode normalization SQL functions
List	pgsql-hackers

Peter Eisentraut wrote:

> Also, there is a way to optimize the "is normalized" test for common
> cases, described in UTR #15.  For that we'll need an additional data
> file from Unicode.  In order to simplify that, I would like my patch
> "Add support for automatically updating Unicode derived files"
> integrated first.

Would that explain why the NFC/NFKC normalization and the "is normalized"
check seem abnormally slow with the current patch, or should that
slowness be considered independently of the other patch?

For instance, testing 10000 short ASCII strings:

postgres=# select count(*) from (select md5(i::text) as t from
generate_series(1,10000) as i) s where t is nfc normalized ;
 count
-------
 10000
(1 row)

Time: 2573,859 ms (00:02,574)

By comparison, the NFD/NFKD case is faster by two orders of magnitude:

postgres=# select count(*) from (select md5(i::text) as t from
generate_series(1,10000) as i) s where t is nfd normalized ;
 count
-------
 10000
(1 row)

Time: 29,962 ms

Although NFC/NFKC has a recomposition step that NFD/NFKD
doesn't have, such a difference is surprising.
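
Since md5() output is plain hexadecimal ASCII, and ASCII-only text is
already in all four normal forms, a fast path could answer this test case
without running the normalization algorithm at all. Below is a minimal
Python sketch of that idea. It is an illustration only: it is not the
patch's code, and it is not the full quick check from UAX #15, which
relies on per-code-point quick-check properties. The function name
is_normalized_fast is hypothetical.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def is_normalized_fast(s: str, form: str = "NFC") -> bool:
    """Return True if s is already in the given normal form.

    Hypothetical helper: an ASCII shortcut followed by a plain
    normalize-and-compare fallback.
    """
    if form not in FORMS:
        raise ValueError(f"unknown normalization form: {form}")
    # ASCII characters have no decomposition and no combining class,
    # so ASCII-only text is unchanged by every normalization form.
    if s.isascii():
        return True
    # Slow path: normalize and compare with the input.
    return unicodedata.normalize(form, s) == s
```

With such a shortcut, each of the 10000 md5 strings in the test above
would be accepted by the isascii() test alone, for composed and
decomposed forms alike.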

I've tried an alternative implementation based on ICU's
unorm2_isNormalized() / unorm2_normalize() functions (which I'm
currently adding to the icu_ext extension to be exposed in SQL).
With these, all four normal forms are in the 20 ms ballpark with the
above test case, with no clear difference between composed and
decomposed forms.

Independently of the performance, I've compared the results
of the ICU implementation vs this patch on a large series of strings
with all normal forms and could not find any difference.
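
The kind of cross-check described above can be sketched as a small
differential test. The Python sketch below is only an analogy: it
compares two code paths of Python's unicodedata module (the
is_normalized() test vs. normalize-and-compare), not ICU vs. the patch,
and the sample strings and function names are assumptions chosen for
illustration.

```python
import hashlib
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def sample_strings(n):
    """Yield n md5 hex digests (like md5(i::text)) plus a few non-ASCII samples."""
    for i in range(1, n + 1):
        yield hashlib.md5(str(i).encode("ascii")).hexdigest()
    # Precomposed, decomposed, compatibility, and Hangul examples.
    yield from ("\u00e9", "e\u0301", "\ufb01", "\u1e9b\u0323", "\uac00")


def find_mismatches(strings):
    """Return (form, string) pairs on which the two checks disagree."""
    mismatches = []
    for s in strings:
        for form in FORMS:
            by_test = unicodedata.is_normalized(form, s)       # Python 3.8+
            by_compare = unicodedata.normalize(form, s) == s
            if by_test != by_compare:
                mismatches.append((form, s))
    return mismatches
```

An empty list from find_mismatches(sample_strings(10000)) would
correspond to "could not find any difference" in this analogy.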

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
