Re: TODO Item: Consider allowing control of upper/lower case folding of unquoted, identifiers - Mailing list pgsql-hackers

From Russell Smith
Subject Re: TODO Item: Consider allowing control of upper/lower case folding of unquoted, identifiers
Date
Msg-id 47EBCB9D.1070907@pws.com.au
Whole thread Raw
In response to TODO Item: Consider allowing control of upper/lower case folding of unquoted, identifiers  (Russell Smith <mr-russ@pws.com.au>)
Responses Re: TODO Item: Consider allowing control of upper/lower case folding of unquoted, identifiers
Re: TODO Item: Consider allowing control of upper/lower case folding of unquoted, identifiers
List pgsql-hackers
Hi,

It looks like most of the hard yards will be in getting some form of 
consensus about what should be done for this TODO.  I can't see a reason 
not to get started on the design now.  If a decision is not able to be 
made after 4 years since the original discussion, is it worth removing 
the TODO or letting it sit for another 4?  But to the actual issue at hand.

Andrew Dunstan attempted to summarize the original 2004 thread 
http://archives.postgresql.org/pgsql-hackers/2006-10/msg01545.php;

--
There was some discussion a couple of years ago on the -hackers list 
about it, so you might like to review the archives. The consensus seemed 
to be that behaviour would need to be set no later than createdb time.

The options I thought of were:

1. current postgres behaviour (we need to do this for legacy reasons, of 
course, as well as to keep happy the legions who hate using upper case 
for anything)

2. strictly spec compliant (same as current behaviour, but folding to 
upper case for unquoted identifiers rather than lower)

3. fully case sensitive even for unquoted identifiers (not spec 
compliant at all, but nevertheless possibly attractive especially for 
people migrating from MS SQLServer, where it is an option, IIRC).
--


Supporting all 3 of these behaviours at initdb time is not too invasive 
or complicated from my initial investigation.  The steps appear to be;

1. parser has to parse incoming identifiers with the correct casing 
changes. (currently downcase_truncate_identifier)
2. The output quoting needs to quote identifiers using the same rules as 
the parser. (currently quote_identifier)
3. the client needs to know what quote rules are in place. (libpq: 
PQfname, PQfnumber)
4. psql needs to \ commands to be taught about the fact that case can 
mean different things to different servers.
5. bootstrap needs to correctly case the tables and insert values when 
bootstrapping at initdb time. This is only really an issue for upper 
case folding.

Many people appear advocate a 4th option to only want the column names 
to be case preserved or upper cased.  They expect other identifiers will 
behave as they do now.  This doesn't really bring us any closer to the 
spec, it takes us away from it as Tom has suggested in the past.  It 
also appears to increase the complexity and invasiveness of a patch.  
Being able to support case preservation/sensitivity for all identifiers 
at initdb time appears to be no extra work than supporting the upper and 
lower folding versions.

The discussions around having a name as supplied and a quoted version 
allow lots of flexibility, probably even down to the session level.  
However I personally am struggling to get my head around the corner 
cases for that approach.

If this needs to be at createdb time, I think we add at least the 
following complexities;

1. all relations cases must be altered when copied from the template 
database or quoted when copied.  We have no idea what a template database might look like, all views 
and functions would need to be parsed to ensure they point to valid tables.
2. shared relations must be able to be accessed using different names in 
different databases, eg PG_DATABASE, pg_database.
3. The data in shared relations appears different to the same users in 
different databases.  eg my unquoted username is MrRuss, in db1 (upper): MRRUSS, db2 (case 
sensitive): MrRuss, db3 (lower): mrruss  My guts tells me that's going to lead to user confusion.


Dumping and restoring databases to different foldings can/will present 
an interesting challenge and I'm not sure how to support that.  I don't 
even know if we want to support that officially.

I'm leaning towards initdb time, mainly because I think a patch can be 
produced that isn't to invasive and is much easier to review and 
actually get applied.  I also think that adding the createdb time flags 
will push this task beyond my ability to write up a patch.  Either way 
though, consensus on what implementation we actually want going forward 
will enable some more skilled developer to do this without the pain of 
having to flesh out the design.

In light of this email and the other comments Tom and Andrew have made, 
it's very easy to say 'too hard, we can't get agreement'.  I would have 
thought that standards compliance would have been one good reason to 
push forward with at least the upper case folding ability.  Both of the 
previous threads on this issue raised lots of questions about possible 
options but there never seemed to be any knocking the ideas around and 
getting consensus phase.  I would like to at least nail down some of the 
requirement, if not all.  I have put forward my personal opinion, but I 
expect that is not of significant value as there are many others with 
much more experience than I.

Regards

Russell Smith


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Windows shared_buffers limitations
Next
From: Wizard Shah
Date:
Subject: Re: Indexing for Expression type data using GIST