Thread: Reorganize collation lookup time and place

Reorganize collation lookup time and place

From
Peter Eisentraut
Date:
Attached is a patch that reorganizes how and where collations are looked up.

Until now, a fmgr C function that makes use of collation information
(for example varstr_cmp(), str_toupper()) gets passed the collation OID,
looks up the collation with pg_newlocale_from_collation(), and then does
something with it.

With this change, the lookup is moved earlier, typically during executor
initialization.  The fmgr functions receive the locale pointer that is
the result of that lookup.

This makes the collation lookup more similar to the function lookup.  In
many cases they now happen right next to each other.
pg_newlocale_from_collation() becomes analogous to fmgr_info() and
pg_locale_t to FmgrInfo *.

We also save repeated lookups in repeated function calls.  I don't
expect massive performance gains, since it's all cached either way, but
it just makes more sense to look up the collation once per executor node
instead of once per function call.

Moreover, this makes the structure and logic of functions like
varstr_cmp() easier, because they don't have to deal with a mix of OIDs
and locale pointers that are sometimes not set depending on the value of
the other.  A locale pointer of zero now has the more natural meaning of
"nothing was provided" instead of a mix of nothing or default.
Conversely, if a collation was assigned for the call, the passed-in
pointer is always non-zero.

I think this would also help other efforts like allowing ICU to be used
for the default collation.  For that, we need to store the ICU collators
somewhere, so having NULL represent the default locale won't do anymore.

Some renaming might be in order.  The original collation implementation
used pg_newlocale_... and pg_locale_t as thin wrappers around the
operating system's newlocale() and locale_t.  Now these have grown into
larger structures with various catalog lookup information.  It might be
worth renaming, for example pg_locale_t to CollInfo *.  (It's also
confusing in this context that pg_locale_t is already a pointer but
FmgrInfo is a struct.)

Also, in order to avoid including pg_locale.h in a lot more places, I
have created an incomplete type fmLocalePtr in fmgr.h.  This isn't
necessarily the final solution, given the discussion in the previous
paragraph.  Maybe a better structure would be to have, say, CollInfo *
and whatever the lookup function is called in fmgr.h and fmgr.c, and
then only have fmgr.c include pg_locale.h and call into pg_locale.c.  Or
something similar.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Reorganize collation lookup time and place

From
Andreas Karlsson
Date:
On 12/10/18 4:12 PM, Peter Eisentraut wrote:
> Attached is a patch that reorganizes how and where collations are looked up.
> 
> Until now, a fmgr C function that makes use of collation information
> (for example varstr_cmp(), str_toupper()) gets passed the collation OID,
> looks up the collation with pg_newlocale_from_collation(), and then does
> something with it.
> 
> With this change, the lookup is moved earlier, typically during executor
> initialization.  The fmgr functions receive the locale pointer that is
> the result of that lookup.

Sounds like a great plan. I too feel that having the look ups there 
makes more logical sense.

But when taking a look at your patch I got a segfault when running 
initdb. See the stack trace below. My compiler is "gcc (Debian 8.2.0-9) 
8.2.0" and the locale when running initdb is glibc's "en_US.utf8".

#0  __GI___strcoll_l (s1=0x55d64555d998 "({CONST :consttype 1009 
:consttypmod -1 :constcollid 100 :constlen -1 :constbyval false 
:constisnull false :location 160 :constvalue 16 [ 64 0 0 0 0 0 0 0 0 0 0 
0 25 0 0 0 ]})", s2=0x55d64555ddb0 "({CONST :consttype 1009 :consttypmod 
-1 :constcollid 100 :constlen -1 :constbyval false :constisnull false 
:location 161 :constvalue 16 [ 64 0 0 0 0 0 0 0 0 0 0 0 25 0 0 0 ]})", 
l=0x0) at strcoll_l.c:259
#1  0x000055d644c4d607 in varstrfastcmp_locale (x=94378777346072, 
y=94378777347120, ssup=0x7ffea6a64990) at varlena.c:2185
#2  0x000055d644958dad in ApplySortComparator (ssup=0x7ffea6a64990, 
isNull2=false, datum2=<optimized out>, isNull1=false, datum1=<optimized 
out>) at ../../../src/include/utils/sortsupport.h:224
#3  compare_scalars (a=<optimized out>, b=<optimized out>, 
arg=0x7ffea6a64980) at analyze.c:2784
#4  0x000055d644cba494 in qsort_arg (a=a@entry=0x55d6456ee5f0, 
n=n@entry=14, es=es@entry=16, cmp=cmp@entry=0x55d644958d82 
<compare_scalars>, arg=arg@entry=0x7ffea6a64980) at qsort_arg.c:140
#5  0x000055d64495b833 in compute_scalar_stats (stats=0x55d6456c6c88, 
fetchfunc=0x55d6449597f1 <std_fetch_func>, samplerows=2904, 
totalrows=2904) at analyze.c:2360
#6  0x000055d64495cca3 in do_analyze_rel 
(onerel=onerel@entry=0x7f3424bdbd50, options=options@entry=2, 
params=params@entry=0x7ffea6a64d90, va_cols=va_cols@entry=0x0, 
acquirefunc=0x55d64495901b <acquire_sample_rows>, relpages=77, 
inh=false, in_outer_xact=false, elevel=13) at analyze.c:527
#7  0x000055d64495d2ec in analyze_rel (relid=<optimized out>, 
relation=0x0, options=options@entry=2, 
params=params@entry=0x7ffea6a64d90, va_cols=0x0, 
in_outer_xact=<optimized out>, bstrategy=0x55d6455413a8) at analyze.c:258
#8  0x000055d6449d61b1 in vacuum (options=2, relations=0x55d645541520, 
params=params@entry=0x7ffea6a64d90, bstrategy=<optimized out>, 
bstrategy@entry=0x0, isTopLevel=<optimized out>) at vacuum.c:357
#9  0x000055d6449d644c in ExecVacuum 
(vacstmt=vacstmt@entry=0x55d645442e50, isTopLevel=isTopLevel@entry=true) 
at vacuum.c:141
#10 0x000055d644b5ec9d in standard_ProcessUtility (pstmt=0x55d6454431a0, 
queryString=0x55d645442468 "ANALYZE;\n", 
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, 
dest=0x55d644f50b00 <debugtupDR>, completionTag=0x7ffea6a650d0 "") at 
utility.c:670
#11 0x000055d644b5f1cc in ProcessUtility 
(pstmt=pstmt@entry=0x55d6454431a0, queryString=<optimized out>, 
context=<optimized out>, params=<optimized out>, queryEnv=<optimized 
out>, dest=dest@entry=0x55d644f50b00 <debugtupDR>, 
completionTag=0x7ffea6a650d0 "") at utility.c:360
#12 0x000055d644b5b582 in PortalRunUtility 
(portal=portal@entry=0x55d645434428, pstmt=pstmt@entry=0x55d6454431a0, 
isTopLevel=isTopLevel@entry=true, 
setHoldSnapshot=setHoldSnapshot@entry=false, 
dest=dest@entry=0x55d644f50b00 <debugtupDR>, 
completionTag=completionTag@entry=0x7ffea6a650d0 "") at pquery.c:1175
#13 0x000055d644b5c1fd in PortalRunMulti 
(portal=portal@entry=0x55d645434428, isTopLevel=isTopLevel@entry=true, 
setHoldSnapshot=setHoldSnapshot@entry=false, 
dest=dest@entry=0x55d644f50b00 <debugtupDR>, 
altdest=altdest@entry=0x55d644f50b00 <debugtupDR>, 
completionTag=completionTag@entry=0x7ffea6a650d0 "") at pquery.c:1321
#14 0x000055d644b5cfbb in PortalRun (portal=portal@entry=0x55d645434428, 
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, 
run_once=run_once@entry=true, dest=dest@entry=0x55d644f50b00 
<debugtupDR>, altdest=altdest@entry=0x55d644f50b00 <debugtupDR>, 
completionTag=0x7ffea6a650d0 "") at pquery.c:796
#15 0x000055d644b591b8 in exec_simple_query 
(query_string=query_string@entry=0x55d645442468 "ANALYZE;\n") at 
postgres.c:1215
#16 0x000055d644b5b0dd in PostgresMain (argc=<optimized out>, 
argv=<optimized out>, dbname=<optimized out>, username=<optimized out>) 
at postgres.c:4256
#17 0x000055d644a3abf9 in main (argc=10, argv=0x55d6453c3490) at main.c:224




Re: Reorganize collation lookup time and place

From
Peter Eisentraut
Date:
On 12/12/2018 15:58, Andreas Karlsson wrote:
> But when taking a look at your patch I got a segfault when running 
> initdb. See the stack trace below. My compiler is "gcc (Debian 8.2.0-9) 
> 8.2.0" and the locale when running initdb is glibc's "en_US.utf8".

Fixed in the attached updated patch.

(Apparently passing a NULL locale to strcoll_l() doesn't crash as
readily on macOS as on glibc.)

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Reorganize collation lookup time and place

From
Andres Freund
Date:
Hi,

On 2018-12-10 16:12:54 +0100, Peter Eisentraut wrote:
> Attached is a patch that reorganizes how and where collations are looked up.
> 
> Until now, a fmgr C function that makes use of collation information
> (for example varstr_cmp(), str_toupper()) gets passed the collation OID,
> looks up the collation with pg_newlocale_from_collation(), and then does
> something with it.
> 
> With this change, the lookup is moved earlier, typically during executor
> initialization.  The fmgr functions receive the locale pointer that is
> the result of that lookup.
> 
> This makes the collation lookup more similar to the function lookup.  In
> many cases they now happen right next to each other.
> pg_newlocale_from_collation() becomes analogous to fmgr_info() and
> pg_locale_t to FmgrInfo *.

Why isn't this integrated into fmgr_info()?

Greetings,

Andres Freund


Re: Reorganize collation lookup time and place

From
Peter Eisentraut
Date:
On 12/12/2018 18:56, Andres Freund wrote:
>> This makes the collation lookup more similar to the function lookup.  In
>> many cases they now happen right next to each other.
>> pg_newlocale_from_collation() becomes analogous to fmgr_info() and
>> pg_locale_t to FmgrInfo *.
> Why isn't this integrated into fmgr_info()?

fmgr_info() looks up stuff in pg_proc.  pg_newlocale_...() looks up
stuff in pg_collation.  There is no overlap between them.

There is also not necessarily a one-to-one correspondence between
function and collation lookup calls.  For example, in some cases you
need to look up a sorting and a hashing functions, but only one
collation for both.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Reorganize collation lookup time and place

From
Andres Freund
Date:
Hi,

On 2018-12-13 15:05:40 +0100, Peter Eisentraut wrote:
> On 12/12/2018 18:56, Andres Freund wrote:
> >> This makes the collation lookup more similar to the function lookup.  In
> >> many cases they now happen right next to each other.
> >> pg_newlocale_from_collation() becomes analogous to fmgr_info() and
> >> pg_locale_t to FmgrInfo *.
> > Why isn't this integrated into fmgr_info()?
> 
> fmgr_info() looks up stuff in pg_proc.  pg_newlocale_...() looks up
> stuff in pg_collation.  There is no overlap between them.

It looks up stuff necessary for calling a function, that doesn't fit
looking up the collation necessary to do so too badly. A lot of the the
changes you made are rote changes to each caller, taking the collation
oid and expanding it with pg_newlocale_from_collation().  It seems not
necessary to force every external user to do rote changes like:

                 fmgr_info(opexpr->opfuncid, finfo);
                 fmgr_info_set_expr((Node *) node, finfo);
                 InitFunctionCallInfoData(*fcinfo, finfo, 2,
-                                         opexpr->inputcollid, NULL, NULL);
+                                         pg_newlocale_from_collation(opexpr->inputcollid), NULL, NULL);
...
-    h = FunctionCall1Coll(array_extra_data->hash, DEFAULT_COLLATION_OID, d);
+    h = FunctionCall1Coll(array_extra_data->hash, pg_newlocale_from_collation(DEFAULT_COLLATION_OID), d);


A lot of the new pg_newlocale_from_collation() calls go a fair bit over
the customary line length limit.

As it stands I have another patch that wants to change the function call
interface, including for extensions:
https://www.postgresql.org/message-id/20180605172952.x34m5uz6ju6enaem%40alap3.anarazel.de
so perhaps we can just bite the bullet and do both.

But I'm somewhat doubtful it's useful to expose having to lookup
pg_newlocale_from_collation() at the callsites to every single caller of
functions in the codebase. You say that's a separate conern of
initializing function calls, for me it's unnecessarily exposing details
that seem likely to change.


> There is also not necessarily a one-to-one correspondence between
> function and collation lookup calls.  For example, in some cases you
> need to look up a sorting and a hashing functions, but only one
> collation for both.

That seems rare enough not to matter, performancewise.

Greetings,

Andres Freund


Re: Reorganize collation lookup time and place

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2018-12-13 15:05:40 +0100, Peter Eisentraut wrote:
>> On 12/12/2018 18:56, Andres Freund wrote:
>>> Why isn't this integrated into fmgr_info()?

>> fmgr_info() looks up stuff in pg_proc.  pg_newlocale_...() looks up
>> stuff in pg_collation.  There is no overlap between them.

> It looks up stuff necessary for calling a function, that doesn't fit
> looking up the collation necessary to do so too badly. A lot of the the
> changes you made are rote changes to each caller, taking the collation
> oid and expanding it with pg_newlocale_from_collation().

I'm not very thrilled with the idea of changing every single caller of
InitFunctionCallInfoData and related functions, especially when exactly
zero functional change ensues.  We should work harder on avoiding that
code churn; extension developers will thank us.

>> There is also not necessarily a one-to-one correspondence between
>> function and collation lookup calls.  For example, in some cases you
>> need to look up a sorting and a hashing functions, but only one
>> collation for both.

> That seems rare enough not to matter, performancewise.

I think it potentially does matter, but I also agree that it's not
the common case.  Could we perhaps keep the API for the existing
functions the same, and introduce new functions alongside them
to be used by the small number of places where it matters?

(I've not looked at Peter's patch yet, so maybe I'm talking through
my hat.  But we should set a pretty high value on avoiding code
churn in stuff as widely used as the fmgr interfaces.)

            regards, tom lane


Re: Reorganize collation lookup time and place

From
Andres Freund
Date:
On 2018-12-13 14:50:53 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2018-12-13 15:05:40 +0100, Peter Eisentraut wrote:
> >> On 12/12/2018 18:56, Andres Freund wrote:
> >>> Why isn't this integrated into fmgr_info()?
> 
> >> fmgr_info() looks up stuff in pg_proc.  pg_newlocale_...() looks up
> >> stuff in pg_collation.  There is no overlap between them.
> 
> > It looks up stuff necessary for calling a function, that doesn't fit
> > looking up the collation necessary to do so too badly. A lot of the the
> > changes you made are rote changes to each caller, taking the collation
> > oid and expanding it with pg_newlocale_from_collation().
> 
> I'm not very thrilled with the idea of changing every single caller of
> InitFunctionCallInfoData and related functions, especially when exactly
> zero functional change ensues.  We should work harder on avoiding that
> code churn; extension developers will thank us.

I am thinking of moving to commit 0001 (but not 0002) of
https://www.postgresql.org/message-id/20181009191802.ppt6lqcvkpjvkm76%40alap3.anarazel.de
which'd break some (but fewer) of the same place that'd be affected by
this.  The amount of memory saved for plpgsql and other places with lots
of expressoins is quite noticable, and it's a prerequisite for better
code generation and caching for JIT.  As that patch doesn't affect
DirectFunctionCall* etc, I'm not sure it's a sufficient argument to go
full breakage here, however.

Independent of that it still seems like putting stuff onto
every function caller that they don't need to deal with.

Greetings,

Andres Freund


Re: Reorganize collation lookup time and place

From
Andres Freund
Date:
Hi,

On 2018-12-13 14:50:53 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2018-12-13 15:05:40 +0100, Peter Eisentraut wrote:
> >> On 12/12/2018 18:56, Andres Freund wrote:
> >>> Why isn't this integrated into fmgr_info()?
> 
> >> fmgr_info() looks up stuff in pg_proc.  pg_newlocale_...() looks up
> >> stuff in pg_collation.  There is no overlap between them.
> 
> > It looks up stuff necessary for calling a function, that doesn't fit
> > looking up the collation necessary to do so too badly. A lot of the the
> > changes you made are rote changes to each caller, taking the collation
> > oid and expanding it with pg_newlocale_from_collation().
> 
> I'm not very thrilled with the idea of changing every single caller of
> InitFunctionCallInfoData and related functions, especially when exactly
> zero functional change ensues.  We should work harder on avoiding that
> code churn; extension developers will thank us.
> 
> >> There is also not necessarily a one-to-one correspondence between
> >> function and collation lookup calls.  For example, in some cases you
> >> need to look up a sorting and a hashing functions, but only one
> >> collation for both.
> 
> > That seems rare enough not to matter, performancewise.
> 
> I think it potentially does matter, but I also agree that it's not
> the common case.  Could we perhaps keep the API for the existing
> functions the same, and introduce new functions alongside them
> to be used by the small number of places where it matters?
> 
> (I've not looked at Peter's patch yet, so maybe I'm talking through
> my hat.  But we should set a pretty high value on avoiding code
> churn in stuff as widely used as the fmgr interfaces.)

So, what's the plan here? Should the CF entry be closed?

Greetings,

Andres Freund


Re: Reorganize collation lookup time and place

From
Peter Eisentraut
Date:
On 31/01/2019 13:45, Andres Freund wrote:
> So, what's the plan here? Should the CF entry be closed?

Set to RwF.

This patch will probably make a reappearance when we consider making ICU
available as the database-wide default collation.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Reorganize collation lookup time and place

From
Peter Geoghegan
Date:
On Fri, Feb 1, 2019 at 7:27 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> This patch will probably make a reappearance when we consider making ICU
> available as the database-wide default collation.

Somebody recently reported that an ICU locale was over 2000x faster
than the equivalent glibc locale with their query due to supporting
the abbreviated keys optimization:

https://postgr.es/m/CACd=f9cO5OcmAeBK1M5a3+t8BO7E91Ki0WaLAxFm=FBbon_HDw@mail.gmail.com

We really should get around to supporting ICU for database-wide
collations sooner rather than later. I knew that there were cases
where glibc performs abysmally, but I didn't think it could ever be a
difference of more than two orders of magnitude until recently.

-- 
Peter Geoghegan