Thread: Bringing PostgreSQL torwards the standard regarding case folding

Bringing PostgreSQL torwards the standard regarding case folding

From
Shachar Shemesh
Date:
I'm opening a new thread, as the previous one was too nested, and 
contained too much emotions.

I'll start by my understanding of a summary of the thread so far. The 
solution we are seeking would have to satisfy the following conditions:
1. Setting should be on a per-database level. A per-server option is not 
good enough, and a per-session option is too difficult to implement, 
with no apparent justifiable return.
2. Old applications already working with PG's lowercase folding should 
have an option to continue working unmodified for the foreseeable future.

Solutions offered so far, and their status:
1. Tri-state. Folder upper, if failes, fold lower, if succeeds, warn.
2. Dual state. Fold lower or upper. Break if client is broken.
3. Create a database conversion tool to change existing case.

Solution 1:
As currently the case folding is performed disjointed from the actual 
use of the identifier, this solution requires quite a big amount of 
work. On the other hand, and on second thought, it's main benefit - 
gradual transition of applications from one to the other, is not really 
necessary once you declare the current behaviour as there to stay. 
Existing applications can simply choose to continue using whatever 
method they currently use. No need for migration.

As such, I think we can simply state that tri-state migration path 
solution can be discarded for the time being.

Solution 2:
Obviously, this is the way to go. We will have a dabase attribute that 
states whether things are lower or upper case there.

Solution 3:
(unrelated to the above)
There seems to be some ambiguity about how to handle the translation. 
Such a tool seems to require guessing which identifiers are accessed 
quoted, unquoted, or both. The last option, of course, will never work.

We may need such a tool, for some projects may wish to transform from 
one way to the other. It seems to me, however, that such a tool can wait 
a little.

Open issues:
1. What do we do with identifiers in "template1" upon database creation?
2. How do we handle queries to tables belonging the catalog that are 
shared between databases?

Observation: on a lowercase folding DB, any identifier that is not 
composed only of lowercase characters MUST can be automatically assumed 
to be accessed only through quoted mode.

I therefor suggest the following path to a solution:
1. CreateDB will be able to create databases from either type.
2. template1 will be defined to be one or the other. For the sake of 
this discussion, let's assume it's lowercase (current situation)
3. CreateDB, upon being asked to create a new DB that has uppercase 
folding, will copy over template1, as it currently does.
4. While copying, it will check each identifier. If the identifier is 
not lowercase only, it is safe to copy it verbatim.
5. If the identifier is lowercase only, convert it to uppercase only. I 
am assuming here that the authors of the client code chose an 
uppercase-folding database, so they should know what they are doing when 
accessing stuff from the standard offering.
6. I'm not sure what are the shared tables from the catalog. I don't 
think it so unreasonable to ask anyone doing catalog work to assume that 
catalog entries are case-sensitive. As such, maybe it's best to just 
leave the data as is.
7. Column headers, however, will have to have a solution. A point still 
open in current design.

I'm hoping this summary helps in furthering the discussion.

-- 
Shachar Shemesh
Lingnu Open Source Consulting
http://www.lingnu.com/



Re: Bringing PostgreSQL torwards the standard regarding

From
Rod Taylor
Date:
> 5. If the identifier is lowercase only, convert it to uppercase only. I 
> am assuming here that the authors of the client code chose an 
> uppercase-folding database, so they should know what they are doing when 
> accessing stuff from the standard offering.

You've just broken one of my databases.

In one project I quote nearly everything and do so in lower case only.
This was done to ensure portability between PostgreSQL, Oracle, etc. --
but with my preference of lower case names.

If someone copied this database with the wrong case folding option, it
would break a (reasonably) spec compliant application that is regularly
installed on environments where we have little to no control over the
database settings.

I think copied attributes need to be left alone. Train the PostgreSQL
utilities to always quote the identifiers instead.

If you want case to be folded, run an external utility to does a bunch
of ALTER ... RENAMEs.




Re: Bringing PostgreSQL torwards the standard regarding

From
Shachar Shemesh
Date:
Rod Taylor wrote:

>>5. If the identifier is lowercase only, convert it to uppercase only. I 
>>am assuming here that the authors of the client code chose an 
>>uppercase-folding database, so they should know what they are doing when 
>>accessing stuff from the standard offering.
>>    
>>
>
>You've just broken one of my databases.
>
>In one project I quote nearly everything and do so in lower case only.
>This was done to ensure portability between PostgreSQL, Oracle, etc. --
>but with my preference of lower case names.
>  
>
I'm not sure you understood me.

First, if we don't convert lower->upper, how can anyone expect the 
following query to work:
select lower(id) from table;

Even if you quote everything, you'd still probably have:
select lower("id") from "table";

Noone can expect you to do:
select "lower"("id") from "table";

The problem is that "lower" is defined in template1. If we don't 
uppercase it when we create the database, the above won't work. Then 
again, I'm fairly sure that the identifiers you placed as lowercase in 
your database are not defined by template1.

In short, I don't think this suggestion broke your database.
         Shachar

-- 
Shachar Shemesh
Lingnu Open Source Consulting
http://www.lingnu.com/



Re: Bringing PostgreSQL torwards the standard regarding

From
Rod Taylor
Date:
> I'm not sure you understood me.

Perhaps not, here is what we have:

\c template_db
CREATE TABLE "example" ( "col" integer);

CREATE DATABASE newdb WITH TEMPLATE template_db UPPERCASE IDENTIFIERS;

\c newdb
SELECT "col" FROM "example";

> In short, I don't think this suggestion broke your database.

I understood that since "example", when copied, would turn into
"EXAMPLE" -- but that if it was "Example" it would be copied as
"Example".



Re: Bringing PostgreSQL torwards the standard regarding

From
Andrew Dunstan
Date:

Shachar Shemesh wrote:

> I'm opening a new thread, as the previous one was too nested, and 
> contained too much emotions.
>
> I'll start by my understanding of a summary of the thread so far. The 
> solution we are seeking would have to satisfy the following conditions:
> 1. Setting should be on a per-database level. A per-server option is 
> not good enough, and a per-session option is too difficult to 
> implement, with no apparent justifiable return. 


I am not convinced on this point. Why is per-server not good enough? The 
obvious place to make these changes seems to me to be during or 
immediatly after the bootstrap phase of initdb. It would avoid a host of 
later troubles.

>
> 2. Old applications already working with PG's lowercase folding should 
> have an option to continue working unmodified for the foreseeable future.
>
> Solutions offered so far, and their status:
> 1. Tri-state. Folder upper, if failes, fold lower, if succeeds, warn.
> 2. Dual state. Fold lower or upper. Break if client is broken.
> 3. Create a database conversion tool to change existing case.


I don't think we should rush at this. All of these solutions are based 
on the existing structures. I have started thinking about a solution 
that would involve keeping two versions of catalog names: a canonical 
name and a "name as supplied at creation".  There would be heaps of 
wrinkles, but it might get us where we want to be. But I have not had 
time to sort it out in my head yet, let alone make any experiments.  
Let's keep getting more ideas.

cheers

andrew





Re: Bringing PostgreSQL torwards the standard regarding

From
Alvaro Herrera
Date:
On Sun, Apr 25, 2004 at 12:23:55PM -0400, Andrew Dunstan wrote:

> I don't think we should rush at this. All of these solutions are based 
> on the existing structures. I have started thinking about a solution 
> that would involve keeping two versions of catalog names: a canonical 
> name and a "name as supplied at creation".

Why do you want two names?  Just keep the original casing, and a boolean
saying if it's quoted or not.

-- 
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"No deja de ser humillante para una persona de ingenio saber
que no hay tonto que no le pueda enseñar algo." (Jean B. Say)


Re: Bringing PostgreSQL torwards the standard regarding

From
Andrew Dunstan
Date:

Alvaro Herrera wrote:

>On Sun, Apr 25, 2004 at 12:23:55PM -0400, Andrew Dunstan wrote:
>
>  
>
>>I don't think we should rush at this. All of these solutions are based 
>>on the existing structures. I have started thinking about a solution 
>>that would involve keeping two versions of catalog names: a canonical 
>>name and a "name as supplied at creation".
>>    
>>
>
>Why do you want two names?  Just keep the original casing, and a boolean
>saying if it's quoted or not.
>  
>

Because I was not just considering the upper/lower issue - refer to the 
thread that relates to unquoted names with case preserved.

cheers

andrew



Re: Bringing PostgreSQL torwards the standard regarding

From
Andrew Dunstan
Date:

I wrote:

>
>
> Alvaro Herrera wrote:
>
>> On Sun, Apr 25, 2004 at 12:23:55PM -0400, Andrew Dunstan wrote:
>>
>>  
>>
>>> I don't think we should rush at this. All of these solutions are 
>>> based on the existing structures. I have started thinking about a 
>>> solution that would involve keeping two versions of catalog names: a 
>>> canonical name and a "name as supplied at creation".
>>>   
>>
>>
>> Why do you want two names?  Just keep the original casing, and a boolean
>> saying if it's quoted or not.
>>  
>>
>
> Because I was not just considering the upper/lower issue - refer to 
> the thread that relates to unquoted names with case preserved. 



Sorry - brain malfunction  - yes, original casing plus boolean would 
work. In effect you could derive the canonical form from those two.

cheers

andrew





Re: Bringing PostgreSQL torwards the standard regarding

From
Dennis Bjorklund
Date:
On Sun, 25 Apr 2004, Andrew Dunstan wrote:

> >> Why do you want two names?  Just keep the original casing, and a boolean
> >> saying if it's quoted or not.
> 
> Sorry - brain malfunction  - yes, original casing plus boolean would 
> work. In effect you could derive the canonical form from those two.

Say that you have this in the table with the identifier
 name      quoted ----      ------ Foo       False

Now you want to add the name "FOO"
 FOO       True

should you be allowed or is it a clash with the above?

What if you also add "foo"
 foo       True

One of these two should be forbidden. And what about a quoted "FOO":
 FOO       False FOO       True

This case says it is not enough with an expressional unique index on
(upper(name), quoted). It would be easier to enforce uniqueness if one
store both the converted name and the original name:
 name      orig_name ----      --------- FOO       NULL                 <-- quoted one FOO       FOO
<--unquoted one
 

and the first case
 FOO       Foo                   <-- unquoted FOO       NULL                  <-- clashes with the first, good foo
NULL                  <-- no clash, works fine
 

With this one can always use upper case translation as per sql spec and
psql can optionally show all unquoted identifiers as upper, lower or mixed
case.

Then we also have the INFORMATION_SCHEMA that should show the names in
UPPER CASE when not quoted, this since applications that are written for
the standard might depend on that (probably no application do today but it
would be a valid case of use of the information schema).

-- 
/Dennis Björklund



Re: Bringing PostgreSQL torwards the standard regarding

From
"Andrew Dunstan"
Date:
Dennis Bjorklund said:
> On Sun, 25 Apr 2004, Andrew Dunstan wrote:
>
>> >> Why do you want two names?  Just keep the original casing, and a
>> >> boolean saying if it's quoted or not.
>>
>> Sorry - brain malfunction  - yes, original casing plus boolean would
>> work. In effect you could derive the canonical form from those two.
>


Dennis,

Ideas still swirling a bit, but I was thinking that there would be a per
database flag (which could indeed be set at db creation time) which would
specify the flavor of canonical names being used - upper, or lower, or we
could also consider exact (i.e. full case sensitivity, which I seem to
recall is a mode that SQLServer allows, possibly even the default, but my
memory could be rusty).

The canonical form of an unquoted name is dictated by this setting, while
the canonical form of a quoted name is the name as supplied. Two names
clash if their canonical forms are identical, quoted or not.

Assuming that we have a database with the flag set to use upper case
canonical names, as per the standard, then ...

> Say that you have this in the table with the identifier
>
>  name      quoted
>  ----      ------
>  Foo       False
>
> Now you want to add the name "FOO"
>
>  FOO       True
>
> should you be allowed or is it a clash with the above?

It's a clash. The canonical for of both is "FOO"


>
> What if you also add "foo"
>
>  foo       True
>

No clash - "FOO" <> "foo"

> One of these two should be forbidden. And what about a quoted "FOO":
>
>  FOO       False
>  FOO       True
>

clash

> This case says it is not enough with an expressional unique index on
> (upper(name), quoted). It would be easier to enforce uniqueness if one
> store both the converted name and the original name:
>

The constraint would in effect be on CASE WHEN quoted THEN name ELSE upper
(name) END.


The advantage of using a boolean is that a lot less work would need to be
done to use whatever flag was being used for the DB. Possibly a reindex
after the files are copied. It might fail on some highly pathological
cases, but should never fail on our standard template databases.


>  name      orig_name
>  ----      ---------
>  FOO       NULL                 <-- quoted one
>  FOO       FOO                  <-- unquoted one
>
> and the first case
>
>  FOO       Foo                   <-- unquoted
>  FOO       NULL                  <-- clashes with the first, good foo
>      NULL                  <-- no clash, works fine
>
> With this one can always use upper case translation as per sql spec and
> psql can optionally show all unquoted identifiers as upper, lower or
> mixed case.
>

My thought was that there would be a user setting that would allow
resultset labels to use either canonical or literal names.


> Then we also have the INFORMATION_SCHEMA that should show the names in
> UPPER CASE when not quoted, this since applications that are written
> for the standard might depend on that (probably no application do today
> but it would be a valid case of use of the information schema).
>


I see 2 possibilities: either use the upper case canonical setting I
envisioned above, or change the information schema setup to force upper
case labels via AS clauses in the views.

cheers

andrew





Re: Bringing PostgreSQL torwards the standard regarding

From
Dennis Bjorklund
Date:
On Mon, 26 Apr 2004, Andrew Dunstan wrote:

> Ideas still swirling a bit

Sure, I'm thinking in public as well. Not something you want to do if you 
are afraid of being wrong and showing it :-) But I'm not.

> The constraint would in effect be on CASE WHEN quoted THEN name ELSE upper
> (name) END.

That's simple enough (and pretty straight forward).

-- 
/Dennis Björklund



Re: Bringing PostgreSQL torwards the standard regarding case folding

From
Josh Berkus
Date:
Shachar,

I've been giving this some more thought.  Here are my contributions:

> 1. Setting should be on a per-database level. A per-server option is not
> good enough, and a per-session option is too difficult to implement,
> with no apparent justifiable return.

I disagree with this.  I think doing case-folding per database would be 
preposterously difficult, and that per-server is adequate.   Per database 
settings bring up a whole raft of logical conflicts, particularly around the 
system catalogs and dblink, that aren't necessarily worth navigating.

I also didn't follow the discussion of why a client-side implementation was 
technically impossible; this seems like the most obvious course to me, and to 
have *considerable* benefit.    It's also consistent with our other statement 
variables, such as datestyle, which are all client-side, per-session 
settings.   

A server-side implementation would possibly reqire touching every single 
source code file in Postgres, something that would justify a lot of effort to 
avoid.

> 2. Old applications already working with PG's lowercase folding should
> have an option to continue working unmodified for the foreseeable future.

Si.

> 1. Tri-state. Folder upper, if failes, fold lower, if succeeds, warn.

Can't see this being possible.

> 2. Dual state. Fold lower or upper. Break if client is broken.

Best, I think.  But it should be client-side.

> 3. Create a database conversion tool to change existing case.

No thanks.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco


Re: Bringing PostgreSQL torwards the standard regarding case folding

From
Shachar Shemesh
Date:
Josh Berkus wrote:

>Shachar,
>
>  
>
>>I think the concensus was that the runtime part was aprox. four lines 
>>where the case folding currently takes place. Obviously, you would have 
>>to get a var, and propogate that var to that place, but not actually 
>>change program flow.
>>    
>>
>
>That's only if you ignore the system catalogs entirely, which maybe you're 
>prepared to do.  If you want to change case folding for the system catalogs, 
>though, you'll need to update code in thousands of places, becuase the 
>back-end code is expecting lower-case identifiers ....
>
>  
>
IF you want per session setting, yes.

If you want per database setting, you only need to worry about the 
shared catalogs

If you want server wide setting, you just create the catalogs with the 
correct name, and get it over with.

That's why I said that per-session setting seems like too much trouble.

-- 
Shachar Shemesh
Lingnu Open Source Consulting
http://www.lingnu.com/



Re: Bringing PostgreSQL torwards the standard regarding case folding

From
Josh Berkus
Date:
Shachar,

> I think the concensus was that the runtime part was aprox. four lines 
> where the case folding currently takes place. Obviously, you would have 
> to get a var, and propogate that var to that place, but not actually 
> change program flow.

That's only if you ignore the system catalogs entirely, which maybe you're 
prepared to do.  If you want to change case folding for the system catalogs, 
though, you'll need to update code in thousands of places, becuase the 
back-end code is expecting lower-case identifiers ....

-- 
-Josh BerkusAglio Database SolutionsSan Francisco



Re: Bringing PostgreSQL torwards the standard regarding case folding

From
Shachar Shemesh
Date:
Josh Berkus wrote:

>I also didn't follow the discussion of why a client-side implementation was 
>technically impossible; this seems like the most obvious course to me, and to 
>have *considerable* benefit.    It's also consistent with our other statement 
>variables, such as datestyle, which are all client-side, per-session 
>settings.   
>  
>
But they are not client side, are they? The date is formatted by the 
server. The client is simply receiving whatever datestyle itselected. 
That is, assuming I understand the behaviour correctly.

I would catagorize it as a server side per-session configuration. If 
that's what you mean, we thought it was too insane because it doesn't 
explain how you are supposed to handle all the catalog and other stuff 
where you are implictly assuming quoting is not necessary.

I'l reiterate the example. Do you really expect to have to write
select "upper"("field") from "table"
and should it be "upper" or "UPPER"?

>A server-side implementation would possibly reqire touching every single 
>source code file in Postgres, something that would justify a lot of effort to 
>avoid.
>  
>
I think the concensus was that the runtime part was aprox. four lines 
where the case folding currently takes place. Obviously, you would have 
to get a var, and propogate that var to that place, but not actually 
change program flow.
         Shachar

-- 
Shachar Shemesh
Lingnu Open Source Consulting
http://www.lingnu.com/



Re: Bringing PostgreSQL torwards the standard regarding

From
"scott.marlowe"
Date:
On Mon, 26 Apr 2004, Josh Berkus wrote:

> Shachar,
> 
> > I think the concensus was that the runtime part was aprox. four lines 
> > where the case folding currently takes place. Obviously, you would have 
> > to get a var, and propogate that var to that place, but not actually 
> > change program flow.
> 
> That's only if you ignore the system catalogs entirely, which maybe you're 
> prepared to do.  If you want to change case folding for the system catalogs, 
> though, you'll need to update code in thousands of places, becuase the 
> back-end code is expecting lower-case identifiers ....

As someone who has discussed this with Tom in the past, I seem to remember 
that there were major issues with handling the system catalogs, because 
internally, the backends treat the identifiers as if they have already 
been quoted.

I think the answer to all of this would require a lot of code being 
touched to either make it case fold, costing performance, or the 
replacement of the default lower cased catalog with upper cased catalog.

i.e. no simple switch setting, but an initdb option that would be set like 
locale currently is, for the life of the cluster.

A more comprehensive solution, one which allowed switching from upper 
folding to lower folding to no folding, to case insensitive, or some 
subset of those possibilities results in 

a:  slower backend performance, due to folding case for system catalogs
b:  touching a helluva lot of backend code to make it possible to fold up 
or down.

I'm not 100% sure on this all, but that seems to be the point Tom and I 
came to in our discussion, and neither of the two solutions seemed very 
good at the time.



Re: Bringing PostgreSQL torwards the standard regarding

From
Shachar Shemesh
Date:
scott.marlowe wrote:

>I think the answer to all of this would require a lot of code being 
>touched to either make it case fold, costing performance, or the 
>replacement of the default lower cased catalog with upper cased catalog.
>  
>
I'm not the one to decide, but it seems to me that this is not a good 
time to enter such a major change, and may never be. How terrible will 
it be if the system is per server (i.e. - one given to initdb?).

>A more comprehensive solution, one which allowed switching from upper 
>folding to lower folding to no folding, to case insensitive, or some 
>subset of those possibilities results in 
>
>a:  slower backend performance, due to folding case for system catalogs
>b:  touching a helluva lot of backend code to make it possible to fold up 
>or down.
>  
>
Well, if it's only the shared catalogs that are affected, this may not 
be such a big problem (very hopeful tone?)

>I'm not 100% sure on this all, but that seems to be the point Tom and I 
>came to in our discussion, and neither of the two solutions seemed very 
>good at the time.
>  
>      Shachar

-- 
Shachar Shemesh
Lingnu Open Source Consulting
http://www.lingnu.com/



Re: Bringing PostgreSQL torwards the standard regarding

From
Hannu Krosing
Date:
scott.marlowe kirjutas T, 27.04.2004 kell 20:43:

> As someone who has discussed this with Tom in the past, I seem to remember 
> that there were major issues with handling the system catalogs, because 
> internally, the backends treat the identifiers as if they have already 
> been quoted.

why not leave it at that and mandate their *external* doublequoted use ?

not only for system catalogs but also for system columns.

I already have had negative experience with a proprietary tool not being
able to use postgres (over ODBC) because it has a system column called
tmin (or was it tid). If things were case-folded to upper automatically,
there would not have been such problem.

--------------
Hannu



Re: Bringing PostgreSQL torwards the standard regarding

From
Andrew Dunstan
Date:

Jan Wieck wrote:

> This is the reason why the setting has to be at least per database and 
> cannot be changed after DB creation. 



I think there's overwhelming consensus that db creation time is the 
latest you can specify the canonical name setting for it. There's 
probably a good case to be made for it to be when you initdb, so that it 
is set as expected for shared tables.

(Is anyone actually doing anything on this?)

cheers

andrew


> What has to change is the behaviour of the name type operators, which 
> will automatically change the uniqueness behaviour of the catalog 
> indexes.
>
> In an UPPERCASE database
>
>     foo/Foo/FOO false = FOO true
>
> In a lowercase database
>
>     foo/Foo/FOO false = foo true
>
> In both of them
>
>     foo/Foo/FOO false <> Foo true
>     foo/Foo/FOO false = foo/Foo/FOO false
>
>
> Jan
>
>
> Dennis Bjorklund wrote:
>
>> On Sun, 25 Apr 2004, Andrew Dunstan wrote:
>>
>>> >> Why do you want two names?  Just keep the original casing, and a 
>>> boolean
>>> >> saying if it's quoted or not.
>>>
>>> Sorry - brain malfunction  - yes, original casing plus boolean would 
>>> work. In effect you could derive the canonical form from those two.
>>
>>
>> Say that you have this in the table with the identifier
>>
>>   name      quoted
>>   ----      ------
>>   Foo       False
>>
>> Now you want to add the name "FOO"
>>
>>   FOO       True
>>
>> should you be allowed or is it a clash with the above?
>>
>> What if you also add "foo"
>>
>>   foo       True
>>
>> One of these two should be forbidden. And what about a quoted "FOO":
>>
>>   FOO       False
>>   FOO       True
>>
>> This case says it is not enough with an expressional unique index on
>> (upper(name), quoted). It would be easier to enforce uniqueness if one
>> store both the converted name and the original name:
>>
>>   name      orig_name
>>   ----      ---------
>>   FOO       NULL                 <-- quoted one
>>   FOO       FOO                  <-- unquoted one
>>
>> and the first case
>>
>>   FOO       Foo                   <-- unquoted
>>   FOO       NULL                  <-- clashes with the first, good
>>   foo       NULL                  <-- no clash, works fine
>>
>> With this one can always use upper case translation as per sql spec and
>> psql can optionally show all unquoted identifiers as upper, lower or 
>> mixed
>> case.
>>
>> Then we also have the INFORMATION_SCHEMA that should show the names in
>> UPPER CASE when not quoted, this since applications that are written for
>> the standard might depend on that (probably no application do today 
>> but it
>> would be a valid case of use of the information schema).
>>
>
>