Thread: Bringing PostgreSQL torwards the standard regarding case folding
I'm opening a new thread, as the previous one was too nested, and contained too much emotions. I'll start by my understanding of a summary of the thread so far. The solution we are seeking would have to satisfy the following conditions: 1. Setting should be on a per-database level. A per-server option is not good enough, and a per-session option is too difficult to implement, with no apparent justifiable return. 2. Old applications already working with PG's lowercase folding should have an option to continue working unmodified for the foreseeable future. Solutions offered so far, and their status: 1. Tri-state. Folder upper, if failes, fold lower, if succeeds, warn. 2. Dual state. Fold lower or upper. Break if client is broken. 3. Create a database conversion tool to change existing case. Solution 1: As currently the case folding is performed disjointed from the actual use of the identifier, this solution requires quite a big amount of work. On the other hand, and on second thought, it's main benefit - gradual transition of applications from one to the other, is not really necessary once you declare the current behaviour as there to stay. Existing applications can simply choose to continue using whatever method they currently use. No need for migration. As such, I think we can simply state that tri-state migration path solution can be discarded for the time being. Solution 2: Obviously, this is the way to go. We will have a dabase attribute that states whether things are lower or upper case there. Solution 3: (unrelated to the above) There seems to be some ambiguity about how to handle the translation. Such a tool seems to require guessing which identifiers are accessed quoted, unquoted, or both. The last option, of course, will never work. We may need such a tool, for some projects may wish to transform from one way to the other. It seems to me, however, that such a tool can wait a little. Open issues: 1. What do we do with identifiers in "template1" upon database creation? 2. How do we handle queries to tables belonging the catalog that are shared between databases? Observation: on a lowercase folding DB, any identifier that is not composed only of lowercase characters MUST can be automatically assumed to be accessed only through quoted mode. I therefor suggest the following path to a solution: 1. CreateDB will be able to create databases from either type. 2. template1 will be defined to be one or the other. For the sake of this discussion, let's assume it's lowercase (current situation) 3. CreateDB, upon being asked to create a new DB that has uppercase folding, will copy over template1, as it currently does. 4. While copying, it will check each identifier. If the identifier is not lowercase only, it is safe to copy it verbatim. 5. If the identifier is lowercase only, convert it to uppercase only. I am assuming here that the authors of the client code chose an uppercase-folding database, so they should know what they are doing when accessing stuff from the standard offering. 6. I'm not sure what are the shared tables from the catalog. I don't think it so unreasonable to ask anyone doing catalog work to assume that catalog entries are case-sensitive. As such, maybe it's best to just leave the data as is. 7. Column headers, however, will have to have a solution. A point still open in current design. I'm hoping this summary helps in furthering the discussion. -- Shachar Shemesh Lingnu Open Source Consulting http://www.lingnu.com/
> 5. If the identifier is lowercase only, convert it to uppercase only. I > am assuming here that the authors of the client code chose an > uppercase-folding database, so they should know what they are doing when > accessing stuff from the standard offering. You've just broken one of my databases. In one project I quote nearly everything and do so in lower case only. This was done to ensure portability between PostgreSQL, Oracle, etc. -- but with my preference of lower case names. If someone copied this database with the wrong case folding option, it would break a (reasonably) spec compliant application that is regularly installed on environments where we have little to no control over the database settings. I think copied attributes need to be left alone. Train the PostgreSQL utilities to always quote the identifiers instead. If you want case to be folded, run an external utility to does a bunch of ALTER ... RENAMEs.
Rod Taylor wrote: >>5. If the identifier is lowercase only, convert it to uppercase only. I >>am assuming here that the authors of the client code chose an >>uppercase-folding database, so they should know what they are doing when >>accessing stuff from the standard offering. >> >> > >You've just broken one of my databases. > >In one project I quote nearly everything and do so in lower case only. >This was done to ensure portability between PostgreSQL, Oracle, etc. -- >but with my preference of lower case names. > > I'm not sure you understood me. First, if we don't convert lower->upper, how can anyone expect the following query to work: select lower(id) from table; Even if you quote everything, you'd still probably have: select lower("id") from "table"; Noone can expect you to do: select "lower"("id") from "table"; The problem is that "lower" is defined in template1. If we don't uppercase it when we create the database, the above won't work. Then again, I'm fairly sure that the identifiers you placed as lowercase in your database are not defined by template1. In short, I don't think this suggestion broke your database. Shachar -- Shachar Shemesh Lingnu Open Source Consulting http://www.lingnu.com/
> I'm not sure you understood me. Perhaps not, here is what we have: \c template_db CREATE TABLE "example" ( "col" integer); CREATE DATABASE newdb WITH TEMPLATE template_db UPPERCASE IDENTIFIERS; \c newdb SELECT "col" FROM "example"; > In short, I don't think this suggestion broke your database. I understood that since "example", when copied, would turn into "EXAMPLE" -- but that if it was "Example" it would be copied as "Example".
Shachar Shemesh wrote: > I'm opening a new thread, as the previous one was too nested, and > contained too much emotions. > > I'll start by my understanding of a summary of the thread so far. The > solution we are seeking would have to satisfy the following conditions: > 1. Setting should be on a per-database level. A per-server option is > not good enough, and a per-session option is too difficult to > implement, with no apparent justifiable return. I am not convinced on this point. Why is per-server not good enough? The obvious place to make these changes seems to me to be during or immediatly after the bootstrap phase of initdb. It would avoid a host of later troubles. > > 2. Old applications already working with PG's lowercase folding should > have an option to continue working unmodified for the foreseeable future. > > Solutions offered so far, and their status: > 1. Tri-state. Folder upper, if failes, fold lower, if succeeds, warn. > 2. Dual state. Fold lower or upper. Break if client is broken. > 3. Create a database conversion tool to change existing case. I don't think we should rush at this. All of these solutions are based on the existing structures. I have started thinking about a solution that would involve keeping two versions of catalog names: a canonical name and a "name as supplied at creation". There would be heaps of wrinkles, but it might get us where we want to be. But I have not had time to sort it out in my head yet, let alone make any experiments. Let's keep getting more ideas. cheers andrew
On Sun, Apr 25, 2004 at 12:23:55PM -0400, Andrew Dunstan wrote: > I don't think we should rush at this. All of these solutions are based > on the existing structures. I have started thinking about a solution > that would involve keeping two versions of catalog names: a canonical > name and a "name as supplied at creation". Why do you want two names? Just keep the original casing, and a boolean saying if it's quoted or not. -- Alvaro Herrera (<alvherre[a]dcc.uchile.cl>) "No deja de ser humillante para una persona de ingenio saber que no hay tonto que no le pueda enseñar algo." (Jean B. Say)
Alvaro Herrera wrote: >On Sun, Apr 25, 2004 at 12:23:55PM -0400, Andrew Dunstan wrote: > > > >>I don't think we should rush at this. All of these solutions are based >>on the existing structures. I have started thinking about a solution >>that would involve keeping two versions of catalog names: a canonical >>name and a "name as supplied at creation". >> >> > >Why do you want two names? Just keep the original casing, and a boolean >saying if it's quoted or not. > > Because I was not just considering the upper/lower issue - refer to the thread that relates to unquoted names with case preserved. cheers andrew
I wrote: > > > Alvaro Herrera wrote: > >> On Sun, Apr 25, 2004 at 12:23:55PM -0400, Andrew Dunstan wrote: >> >> >> >>> I don't think we should rush at this. All of these solutions are >>> based on the existing structures. I have started thinking about a >>> solution that would involve keeping two versions of catalog names: a >>> canonical name and a "name as supplied at creation". >>> >> >> >> Why do you want two names? Just keep the original casing, and a boolean >> saying if it's quoted or not. >> >> > > Because I was not just considering the upper/lower issue - refer to > the thread that relates to unquoted names with case preserved. Sorry - brain malfunction - yes, original casing plus boolean would work. In effect you could derive the canonical form from those two. cheers andrew
On Sun, 25 Apr 2004, Andrew Dunstan wrote: > >> Why do you want two names? Just keep the original casing, and a boolean > >> saying if it's quoted or not. > > Sorry - brain malfunction - yes, original casing plus boolean would > work. In effect you could derive the canonical form from those two. Say that you have this in the table with the identifier name quoted ---- ------ Foo False Now you want to add the name "FOO" FOO True should you be allowed or is it a clash with the above? What if you also add "foo" foo True One of these two should be forbidden. And what about a quoted "FOO": FOO False FOO True This case says it is not enough with an expressional unique index on (upper(name), quoted). It would be easier to enforce uniqueness if one store both the converted name and the original name: name orig_name ---- --------- FOO NULL <-- quoted one FOO FOO <--unquoted one and the first case FOO Foo <-- unquoted FOO NULL <-- clashes with the first, good foo NULL <-- no clash, works fine With this one can always use upper case translation as per sql spec and psql can optionally show all unquoted identifiers as upper, lower or mixed case. Then we also have the INFORMATION_SCHEMA that should show the names in UPPER CASE when not quoted, this since applications that are written for the standard might depend on that (probably no application do today but it would be a valid case of use of the information schema). -- /Dennis Björklund
Dennis Bjorklund said: > On Sun, 25 Apr 2004, Andrew Dunstan wrote: > >> >> Why do you want two names? Just keep the original casing, and a >> >> boolean saying if it's quoted or not. >> >> Sorry - brain malfunction - yes, original casing plus boolean would >> work. In effect you could derive the canonical form from those two. > Dennis, Ideas still swirling a bit, but I was thinking that there would be a per database flag (which could indeed be set at db creation time) which would specify the flavor of canonical names being used - upper, or lower, or we could also consider exact (i.e. full case sensitivity, which I seem to recall is a mode that SQLServer allows, possibly even the default, but my memory could be rusty). The canonical form of an unquoted name is dictated by this setting, while the canonical form of a quoted name is the name as supplied. Two names clash if their canonical forms are identical, quoted or not. Assuming that we have a database with the flag set to use upper case canonical names, as per the standard, then ... > Say that you have this in the table with the identifier > > name quoted > ---- ------ > Foo False > > Now you want to add the name "FOO" > > FOO True > > should you be allowed or is it a clash with the above? It's a clash. The canonical for of both is "FOO" > > What if you also add "foo" > > foo True > No clash - "FOO" <> "foo" > One of these two should be forbidden. And what about a quoted "FOO": > > FOO False > FOO True > clash > This case says it is not enough with an expressional unique index on > (upper(name), quoted). It would be easier to enforce uniqueness if one > store both the converted name and the original name: > The constraint would in effect be on CASE WHEN quoted THEN name ELSE upper (name) END. The advantage of using a boolean is that a lot less work would need to be done to use whatever flag was being used for the DB. Possibly a reindex after the files are copied. It might fail on some highly pathological cases, but should never fail on our standard template databases. > name orig_name > ---- --------- > FOO NULL <-- quoted one > FOO FOO <-- unquoted one > > and the first case > > FOO Foo <-- unquoted > FOO NULL <-- clashes with the first, good foo > NULL <-- no clash, works fine > > With this one can always use upper case translation as per sql spec and > psql can optionally show all unquoted identifiers as upper, lower or > mixed case. > My thought was that there would be a user setting that would allow resultset labels to use either canonical or literal names. > Then we also have the INFORMATION_SCHEMA that should show the names in > UPPER CASE when not quoted, this since applications that are written > for the standard might depend on that (probably no application do today > but it would be a valid case of use of the information schema). > I see 2 possibilities: either use the upper case canonical setting I envisioned above, or change the information schema setup to force upper case labels via AS clauses in the views. cheers andrew
On Mon, 26 Apr 2004, Andrew Dunstan wrote: > Ideas still swirling a bit Sure, I'm thinking in public as well. Not something you want to do if you are afraid of being wrong and showing it :-) But I'm not. > The constraint would in effect be on CASE WHEN quoted THEN name ELSE upper > (name) END. That's simple enough (and pretty straight forward). -- /Dennis Björklund
Shachar, I've been giving this some more thought. Here are my contributions: > 1. Setting should be on a per-database level. A per-server option is not > good enough, and a per-session option is too difficult to implement, > with no apparent justifiable return. I disagree with this. I think doing case-folding per database would be preposterously difficult, and that per-server is adequate. Per database settings bring up a whole raft of logical conflicts, particularly around the system catalogs and dblink, that aren't necessarily worth navigating. I also didn't follow the discussion of why a client-side implementation was technically impossible; this seems like the most obvious course to me, and to have *considerable* benefit. It's also consistent with our other statement variables, such as datestyle, which are all client-side, per-session settings. A server-side implementation would possibly reqire touching every single source code file in Postgres, something that would justify a lot of effort to avoid. > 2. Old applications already working with PG's lowercase folding should > have an option to continue working unmodified for the foreseeable future. Si. > 1. Tri-state. Folder upper, if failes, fold lower, if succeeds, warn. Can't see this being possible. > 2. Dual state. Fold lower or upper. Break if client is broken. Best, I think. But it should be client-side. > 3. Create a database conversion tool to change existing case. No thanks. -- Josh Berkus Aglio Database Solutions San Francisco
Josh Berkus wrote: >Shachar, > > > >>I think the concensus was that the runtime part was aprox. four lines >>where the case folding currently takes place. Obviously, you would have >>to get a var, and propogate that var to that place, but not actually >>change program flow. >> >> > >That's only if you ignore the system catalogs entirely, which maybe you're >prepared to do. If you want to change case folding for the system catalogs, >though, you'll need to update code in thousands of places, becuase the >back-end code is expecting lower-case identifiers .... > > > IF you want per session setting, yes. If you want per database setting, you only need to worry about the shared catalogs If you want server wide setting, you just create the catalogs with the correct name, and get it over with. That's why I said that per-session setting seems like too much trouble. -- Shachar Shemesh Lingnu Open Source Consulting http://www.lingnu.com/
Shachar, > I think the concensus was that the runtime part was aprox. four lines > where the case folding currently takes place. Obviously, you would have > to get a var, and propogate that var to that place, but not actually > change program flow. That's only if you ignore the system catalogs entirely, which maybe you're prepared to do. If you want to change case folding for the system catalogs, though, you'll need to update code in thousands of places, becuase the back-end code is expecting lower-case identifiers .... -- -Josh BerkusAglio Database SolutionsSan Francisco
Josh Berkus wrote: >I also didn't follow the discussion of why a client-side implementation was >technically impossible; this seems like the most obvious course to me, and to >have *considerable* benefit. It's also consistent with our other statement >variables, such as datestyle, which are all client-side, per-session >settings. > > But they are not client side, are they? The date is formatted by the server. The client is simply receiving whatever datestyle itselected. That is, assuming I understand the behaviour correctly. I would catagorize it as a server side per-session configuration. If that's what you mean, we thought it was too insane because it doesn't explain how you are supposed to handle all the catalog and other stuff where you are implictly assuming quoting is not necessary. I'l reiterate the example. Do you really expect to have to write select "upper"("field") from "table" and should it be "upper" or "UPPER"? >A server-side implementation would possibly reqire touching every single >source code file in Postgres, something that would justify a lot of effort to >avoid. > > I think the concensus was that the runtime part was aprox. four lines where the case folding currently takes place. Obviously, you would have to get a var, and propogate that var to that place, but not actually change program flow. Shachar -- Shachar Shemesh Lingnu Open Source Consulting http://www.lingnu.com/
On Mon, 26 Apr 2004, Josh Berkus wrote: > Shachar, > > > I think the concensus was that the runtime part was aprox. four lines > > where the case folding currently takes place. Obviously, you would have > > to get a var, and propogate that var to that place, but not actually > > change program flow. > > That's only if you ignore the system catalogs entirely, which maybe you're > prepared to do. If you want to change case folding for the system catalogs, > though, you'll need to update code in thousands of places, becuase the > back-end code is expecting lower-case identifiers .... As someone who has discussed this with Tom in the past, I seem to remember that there were major issues with handling the system catalogs, because internally, the backends treat the identifiers as if they have already been quoted. I think the answer to all of this would require a lot of code being touched to either make it case fold, costing performance, or the replacement of the default lower cased catalog with upper cased catalog. i.e. no simple switch setting, but an initdb option that would be set like locale currently is, for the life of the cluster. A more comprehensive solution, one which allowed switching from upper folding to lower folding to no folding, to case insensitive, or some subset of those possibilities results in a: slower backend performance, due to folding case for system catalogs b: touching a helluva lot of backend code to make it possible to fold up or down. I'm not 100% sure on this all, but that seems to be the point Tom and I came to in our discussion, and neither of the two solutions seemed very good at the time.
scott.marlowe wrote: >I think the answer to all of this would require a lot of code being >touched to either make it case fold, costing performance, or the >replacement of the default lower cased catalog with upper cased catalog. > > I'm not the one to decide, but it seems to me that this is not a good time to enter such a major change, and may never be. How terrible will it be if the system is per server (i.e. - one given to initdb?). >A more comprehensive solution, one which allowed switching from upper >folding to lower folding to no folding, to case insensitive, or some >subset of those possibilities results in > >a: slower backend performance, due to folding case for system catalogs >b: touching a helluva lot of backend code to make it possible to fold up >or down. > > Well, if it's only the shared catalogs that are affected, this may not be such a big problem (very hopeful tone?) >I'm not 100% sure on this all, but that seems to be the point Tom and I >came to in our discussion, and neither of the two solutions seemed very >good at the time. > > Shachar -- Shachar Shemesh Lingnu Open Source Consulting http://www.lingnu.com/
scott.marlowe kirjutas T, 27.04.2004 kell 20:43: > As someone who has discussed this with Tom in the past, I seem to remember > that there were major issues with handling the system catalogs, because > internally, the backends treat the identifiers as if they have already > been quoted. why not leave it at that and mandate their *external* doublequoted use ? not only for system catalogs but also for system columns. I already have had negative experience with a proprietary tool not being able to use postgres (over ODBC) because it has a system column called tmin (or was it tid). If things were case-folded to upper automatically, there would not have been such problem. -------------- Hannu
Jan Wieck wrote: > This is the reason why the setting has to be at least per database and > cannot be changed after DB creation. I think there's overwhelming consensus that db creation time is the latest you can specify the canonical name setting for it. There's probably a good case to be made for it to be when you initdb, so that it is set as expected for shared tables. (Is anyone actually doing anything on this?) cheers andrew > What has to change is the behaviour of the name type operators, which > will automatically change the uniqueness behaviour of the catalog > indexes. > > In an UPPERCASE database > > foo/Foo/FOO false = FOO true > > In a lowercase database > > foo/Foo/FOO false = foo true > > In both of them > > foo/Foo/FOO false <> Foo true > foo/Foo/FOO false = foo/Foo/FOO false > > > Jan > > > Dennis Bjorklund wrote: > >> On Sun, 25 Apr 2004, Andrew Dunstan wrote: >> >>> >> Why do you want two names? Just keep the original casing, and a >>> boolean >>> >> saying if it's quoted or not. >>> >>> Sorry - brain malfunction - yes, original casing plus boolean would >>> work. In effect you could derive the canonical form from those two. >> >> >> Say that you have this in the table with the identifier >> >> name quoted >> ---- ------ >> Foo False >> >> Now you want to add the name "FOO" >> >> FOO True >> >> should you be allowed or is it a clash with the above? >> >> What if you also add "foo" >> >> foo True >> >> One of these two should be forbidden. And what about a quoted "FOO": >> >> FOO False >> FOO True >> >> This case says it is not enough with an expressional unique index on >> (upper(name), quoted). It would be easier to enforce uniqueness if one >> store both the converted name and the original name: >> >> name orig_name >> ---- --------- >> FOO NULL <-- quoted one >> FOO FOO <-- unquoted one >> >> and the first case >> >> FOO Foo <-- unquoted >> FOO NULL <-- clashes with the first, good >> foo NULL <-- no clash, works fine >> >> With this one can always use upper case translation as per sql spec and >> psql can optionally show all unquoted identifiers as upper, lower or >> mixed >> case. >> >> Then we also have the INFORMATION_SCHEMA that should show the names in >> UPPER CASE when not quoted, this since applications that are written for >> the standard might depend on that (probably no application do today >> but it >> would be a valid case of use of the information schema). >> > >