
Re: Draft release notes for next week's releases - Mailing list pgsql-hackers

From	Oleg Bartunov
Subject	Re: Draft release notes for next week's releases
Date	March 28, 2016 10:56:14
Msg-id	CAF4Au4w10NmS4wit98yLCTCwjVrkaMmScmYvNm8hO6PUjwRt6A@mail.gmail.com
In response to	Re: Draft release notes for next week's releases (Peter Geoghegan <pg@heroku.com>)
Responses	Re: Draft release notes for next week's releases
List	pgsql-hackers


On Mon, Mar 28, 2016 at 1:21 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Mon, Mar 28, 2016 at 12:08 AM, Oleg Bartunov <obartunov@gmail.com> wrote:
> Should we start thinking about ICU? I compared Postgres with and without
> ICU and found a 27x improvement in btree index creation for Russian strings.
> This includes the effect of abbreviated keys and of ICU itself. Also, we'd
> get system-independent locales.

I think we should. I want to develop a detailed proposal before
talking about it more, though, because the idea is controversial.

Did you use the FreeBSD ports patch? Do you have your own patch that
you could share?

We'll post the patch. As I remember, Teodor made some changes to get
abbreviated keys working. I should say that I got the 27x improvement on my
MacBook. I will check on Linux.

I'm not surprised that ICU is so much faster, especially now that
UTF-8 is not a second class citizen (it's been possible to build ICU
to specialize all its routines to handle UTF-8 for years now). As you
may know, ICU supports partial sort keys, and sort key compression,
which may have also helped:
http://userguide.icu-project.org/collation/architecture

That page also describes how binary sort keys are versioned, which
allows them to be stored on disk. It says "A common example is the use
of keys to build indexes in databases". We'd be crazy to trust Glibc
strxfrm() to be stable *on disk*, but ICU already cares deeply about
the things we need to care about, because it's used by other database
systems like DB2, Firebird, and in some configurations SQLite [1].

Glibc strxfrm() is not great with codepoints from the Cyrillic
alphabet -- it seems to store 2 bytes per code-point in the primary
weight level. So ICU might also do better in your test case for that
reason.

Yes, I see on this page that ICU is ~3 times faster for Russian text.
http://site.icu-project.org/charts/collation-icu4c48-glibc
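
For readers unfamiliar with abbreviated keys, here is a minimal Python sketch of the general idea, not the PostgreSQL implementation: sort on a short fixed-size prefix of each binary sort key and fall back to the full key only when the prefixes tie. The names full_key, abbreviate, and ABBREV_LEN are hypothetical, and plain UTF-8 bytes stand in for a real collation sort key (strxfrm() or an ICU sort key), so the resulting order is byte order, not a linguistic one. It also shows why key density matters: at 2 bytes per character, an 8-byte prefix covers only 4 Cyrillic letters, so ties are more frequent.

```python
from functools import cmp_to_key

ABBREV_LEN = 8  # assumed prefix size: one 64-bit machine word


def full_key(s: str) -> bytes:
    # Stand-in for a real collation sort key (strxfrm() or an ICU sort key).
    # UTF-8 bytes are used only to keep the sketch self-contained.
    return s.encode("utf-8")


def abbreviate(key: bytes) -> bytes:
    # Keep only the leading bytes of the binary sort key.
    return key[:ABBREV_LEN]


def sort_with_abbreviated_keys(strings):
    # Pair every string with its abbreviated key, computed once up front.
    entries = [(abbreviate(full_key(s)), s) for s in strings]
    stats = {"cheap": 0, "full": 0}

    def compare(a, b):
        if a[0] != b[0]:
            # Prefixes differ, so the short comparison is authoritative.
            stats["cheap"] += 1
            return -1 if a[0] < b[0] else 1
        # Prefixes tie: fall back to comparing the full sort keys.
        stats["full"] += 1
        ka, kb = full_key(a[1]), full_key(b[1])
        return (ka > kb) - (ka < kb)

    ordered = sorted(entries, key=cmp_to_key(compare))
    return [s for _, s in ordered], stats


if __name__ == "__main__":
    words = ["привет", "приветствие", "право", "apple", "applesauce", "b"]
    ordered, stats = sort_with_abbreviated_keys(words)
    print(ordered)
    print(stats)
```

In this sketch "привет" and "приветствие" share the same 8-byte prefix, so that pair needs a full comparison, while "apple" and "applesauce" are resolved by the prefix alone.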

[1] https://www.sqlite.org/src/artifact?ci=trunk&filename=ext/icu/README.txt
--
Peter Geoghegan
