database schema quality survey & postgresql pitfalls - Mailing list pgsql-hackers
From | Fabien COELHO |
---|---|
Subject | database schema quality survey & postgresql pitfalls |
Date | |
Msg-id | alpine.DEB.2.02.1201231152010.3160@localhost6.localdomain6 Whole thread Raw |
List | pgsql-hackers |
Dear pgdevs, I've just completed the final version of a survey of database schema quality in open source software. The survey covers 512 projects which use MySQL and/or PostgreSQL for storing their data. Automatic analyses are performed by querying the information schema. Statistical validations are computed on the results. Here are some findings of specific interest to the list. About MySQL vs PostgreSQL usage: - MySQL is much more often used than PostgreSQL... not a surprise. - Even for projects which seems to support PostgreSQL, this support is often an afterthough and not necessarily functional.For instance, you may find MySQL-specific syntax in the PostgreSQL-specific script, which were clearly nevertested. - Projects which use PostgreSQL statistically have a better quality compared to projects which use MySQL. However, thedifference is mostly due to issues with MySQL. - Projects which use PostgreSQL are more often maintained than projects with MySQL. Some features or default behavior of PostgreSQL seem especially error-prone: - SERIAL attributes seem to be considered automatically as a primary key, so that the primary key declaration is oftenforgotten. This suggests that: * the documentation should insist on the potential issue. * a WARNING should be displayed when SERIAL is used without an associated PRIMARY KEY, or possibly UNIQUE. It shouldbe very rare to desire a SERIAL which is not a PK, so this is worth a warning. - when loading a schema definition from an SQL script, the default behavior of "psql" is to ignore errors and go on. This lead to projects with failing declarations to be ignored because the ERROR is lost in the flow of WARNING and NOTICE.If a table is missing, the error will be detected because the project is not functional, but if a constraint is missing, it will just be lost. This occur in about 10% of pg projects! In order to avoid this behavior, one must do a "\set ON_ERROR_STOP 1" at the beginning of the script. However, thisis never done. Moreover, there is no simple way to trigger the safer behavior from the command line but quite a long"-v ON_ERROR_STOP=1". It seems to me that: * the documentation should suggest to use an explicit stop on error setting in every script. * a "psql -C foo.sql" (check?) or equivalent short option would help? * The current client default verbosity is counter productive for quality. It should be reduced to WARNING and above,but should *not* include NOTICE which add a lot of noise ignored by the user and which hides more importantmessages. For instance, a PK implies an INDEX, a SERIAL implies a SEQUENCE, fine, but what is the point of tellingit over and over? So I suggest to choose a default "client_min_messages = warning". Finally, some issues where found and already reported some time ago about the implementation of the information schema by PostgreSQL. For instance, auto-generated constraint names are not unique as they should be, which makes having a "standard" information schema a little bit pointless, as querying it returns wrong results:-( The summary of the answer was "do not use the information schema", or "give unique names", which does not make much sense for me who is analysing existing projects, and as most constraint names are generated automatically by PostgreSQL. For those interested in more details about the survey, a preprint of the paper is available here : http://www.cri.ensmp.fr/classement/doc/A-488.pdf And the tool used for analysing the projects is available at : http://coelho.net/salix/ Comments are welcome. -- Fabien.
pgsql-hackers by date: