Thread: Ideas needed: How to create and store collation tables
I am trying to figure out the best way to store custom collation tables on a
PostgreSQL server system, and what kind of interface to provide so that
users can create their own.

A collation table essentially consists of a mapping 'character code ->
weight' for every character in the set, some additional considerations for
one-to-many and many-to-one mappings, plus a few feature flags.

How would a user go about creating such a table?

    CREATE COLLATION foo (
        ...
        <10000 lines of data>
        ...
    );

or would it be preferable to store the table in some external file and then
have the call simply be, say,

    CREATE COLLATION foo SOURCE 'some file';

The latter has the disadvantage that we'd need some smarts so that pg_dump
does not repeat the mistakes that were made with dynamically loadable
modules (such as absolute file paths).  The former has the disadvantage
that it is too unwieldy to be useful.

We also need to consider the following two problems:

Firstly, if the collation data -- no matter how it is created -- is stored
within the database (that is, in some table(s)), then it is duplicated in
every database.  Depending on the storage format, a collation table takes
roughly between 100 kB and 800 kB.  Multiply that by a few dozen languages,
for each database.  That makes an external file seem more attractive.  (The
external file would need to be a binary file precomputed for efficient
processing, unless we want to reparse and reprocess it every so often, such
as once per session.)

Secondly, because each collation table depends on a particular character
encoding (since it is indexed by character code), some sort of magic needs
to happen when someone creates a database with a different encoding than
the template database.  One option is to do some mangling on the registered
external file name (such as appending the encoding name to the file name).
Another option is to have the notional pg_collate system catalog contain a
column for the encoding, and then simply ignore all entries pertaining to
encodings other than the database encoding.

Comments or better ideas?

-- 
Peter Eisentraut   peter_e@gmx.net
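The 'character code -> weight' mapping described above can be sketched as
follows.  This is a minimal single-byte illustration: the weight values are
made up, and a real table would also have to handle the one-to-many and
many-to-one cases and the feature flags mentioned in the message.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical single-byte weight table: each character code maps to a
 * collation weight.  A real table would be generated from locale data.
 * For illustration, upper- and lowercase letters get the same weight,
 * so the collation is case-insensitive.
 */
static const unsigned char weight[256] = {
    ['a'] = 10, ['A'] = 10,
    ['b'] = 20, ['B'] = 20,
    ['c'] = 30, ['C'] = 30,
};

/* Compare two strings by their collation weights rather than raw codes. */
static int
weight_strcmp(const unsigned char *s, const unsigned char *t)
{
    for (; *s && *t; s++, t++)
    {
        int d = (int) weight[*s] - (int) weight[*t];

        if (d != 0)
            return d;
    }
    /* weight['\0'] is 0, so this also settles unequal lengths */
    return (int) weight[*s] - (int) weight[*t];
}
```

The size figures quoted above come from storing such a table (or its
multibyte equivalent) for every character in the encoding.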
On Mon, 18 Nov 2002, Peter Eisentraut wrote:

> A collation table essentially consists of a mapping 'character code ->
> weight' for every character in the set and some additional considerations
> for one-to-many and many-to-one mappings, plus a few feature flags.
>
> How would a user go about creating such a table?
>
>     CREATE COLLATION foo (
>         ...
>         <10000 lines of data>
>         ...
>     );
>
> or would it be preferable to store the table in some external file and
> then have the call simply be, say,
>
>     CREATE COLLATION foo SOURCE 'some file';

I'd say the latter makes more sense, but would it be better to use

    CREATE COLLATION foo FROM EXTERNAL 'some file';

where we say that valid implementation-defined collation names are
references to files of the appropriate type?

> Secondly, because each collation table depends on a particular character
> encoding (since it is indexed by character code), some sort of magic needs
> to happen when someone creates a database with a different encoding than
> the template database.  One option is to do some mangling on the
> registered external file name (such as appending the encoding name to the
> file name).  Another option is to have the notional pg_collate system
> catalog contain a column for the encoding, and then simply ignore all
> entries pertaining to encodings other than the database encoding.

The SQL92 CREATE COLLATION seems to create a collation for a particular
character set, so the latter seems more appropriate to me, especially if
we plan to support the full range of SQL's character set/collation/padding
attributes at some point.
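The "encoding column" idea under discussion could look roughly like this.
The catalog contents, collation names, and encoding names below are
invented for the sketch; the point is only that a lookup skips rows whose
encoding differs from the database encoding.

```c
#include <string.h>

/*
 * Sketch of a notional pg_collate catalog where each row records the
 * encoding it is valid for.  The same collation name may appear once
 * per encoding it has been built for.
 */
typedef struct
{
    const char *collname;
    const char *encoding;
} CollationEntry;

static const CollationEntry catalog[] = {
    {"de_phonebook", "LATIN1"},
    {"de_phonebook", "UTF8"},
    {"fr_standard", "LATIN1"},
};

/* Return the matching entry, ignoring rows for other encodings. */
static const CollationEntry *
find_collation(const char *name, const char *db_encoding)
{
    for (size_t i = 0; i < sizeof(catalog) / sizeof(catalog[0]); i++)
    {
        if (strcmp(catalog[i].collname, name) == 0 &&
            strcmp(catalog[i].encoding, db_encoding) == 0)
            return &catalog[i];
    }
    return NULL;                /* no entry built for this encoding */
}
```

With this scheme, creating a database with a different encoding needs no
file-name mangling: the same catalog simply yields a different (possibly
empty) set of usable collations.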
Peter Eisentraut <peter_e@gmx.net> writes:
> I am trying to figure out which is the best way to store custom collation
> tables on a PostgreSQL server system, and what kind of interface to
> provide to users to allow them to create their own.
>
> A collation table essentially consists of a mapping 'character code ->
> weight' for every character in the set and some additional considerations
> for one-to-many and many-to-one mappings, plus a few feature flags.

I'd be inclined to handle it similarly to the way that Tatsuo did with
conversion_procs: let collations be represented by comparison functions
that meet some suitable API.  I think that trying to represent such a
table compactly as an SQL table will be a nightmare, and trying to access
it quickly enough for reasonable performance will be worse.  Keep the
problem out of the API and let each comparison function do what it needs
to do internally.

> Secondly, because each collation table depends on a particular character
> encoding (since it is indexed by character code), some sort of magic needs
> to happen when someone creates a database with a different encoding than
> the template database.  One option is to do some mangling on the
> registered external file name (such as appending the encoding name to the
> file name).  Another option is to have the notional pg_collate system
> catalog contain a column for the encoding, and then simply ignore all
> entries pertaining to encodings other than the database encoding.

SQL92 says that any particular collation is applicable to only one
character set (their term for what we call an "encoding").  So I think
we'd definitely want to associate a character set with each pg_collation
entry, and then ignore any entries that don't match the DB encoding.
(Further down the road, "the" DB encoding might become just a "default
for tables in this DB" encoding, meaning that we'd need access to
collations for multiple encodings anyway.)

			regards, tom lane
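A minimal sketch of the comparison-function approach suggested here.  The
signature is hypothetical (loosely modelled on strcoll()), not an actual
PostgreSQL API; the idea is just that the collation's tables stay hidden
behind a fixed function type.

```c
#include <string.h>

/*
 * Hypothetical API: a collation is a comparison function of this type.
 * Whatever weight tables or state it needs are internal to it.
 */
typedef int (*collation_cmp) (const char *a, size_t alen,
                              const char *b, size_t blen);

/* One trivial implementation: byte-wise comparison ("C" locale). */
static int
c_locale_cmp(const char *a, size_t alen, const char *b, size_t blen)
{
    size_t      n = alen < blen ? alen : blen;
    int         d = memcmp(a, b, n);

    if (d != 0)
        return d;
    return (alen > blen) - (alen < blen);
}
```

A locale-aware collation would be another function of the same type,
loading or embedding its weight table privately, so neither the catalog
nor callers ever see the table format.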
> I'd be inclined to handle it similarly to the way that Tatsuo did with
> conversion_procs: let collations be represented by comparison functions
> that meet some suitable API.  I think that trying to represent such a
> table as an SQL table compactly will be a nightmare, and trying to
> access it quickly enough for reasonable performance will be worse.  Keep
> the problem out of the API and let each comparison function do what it
> needs to do internally.

Agreed.  That is the way I have been thinking about the collation and
character-set creation stuff too.
--
Tatsuo Ishii