Re: PostgreSQL Advocacy, Thoughts and Comments - Mailing list pgsql-general

From Chris Travers
Subject Re: PostgreSQL Advocacy, Thoughts and Comments
Date
Msg-id 00fa01c3b9b0$d7cffa60$b100053d@SAMUEL
Whole thread Raw
In response to Re: PostgreSQL Advocacy, Thoughts and Comments  ("Jason Tesser" <JTesser@nbbc.edu>)
Responses Re: PostgreSQL Advocacy, Thoughts and Comments
List pgsql-general

Here is a paper I have written for the purposes of providing some additional educational material for the MySQL crowd.  Permission is granted for it to be redistributed via the mailing list and archives, but not to be posted elsewhere until I have a final version completed.

I am interested in peer review regarding the concepts in this paper, etc.  Any feedback?

 

Choosing a Database Manager

A Study of Open Source Relational Database Management Systems

By Chris Travers

 

Introduction

There sometimes comes a point of epiphany, dreaded by programmers, when it becomes clear that at least one of the components of an application is completely inadequate for the task at hand, and thus limits progress or usefulness of the program in some fundamental way.  All too often this is the relational database management system (RDBMS).

 

Choosing the proper relational database management system can make or break an application.  The choice of a primary RDBMS is thus very important for the long-term viability of an application.  While this paper focuses on open source RDBMS’s, the same methodology can be applied towards commercial systems as well, such as Oracle, MS SQL, and DB2. Also, the concept of application specific database managers will be discussed briefly in order to provide some degree of contrast to the relational paradigm.

Types of Database Managers

Application Specific Database Managers

Application specific database managers include the GDBM (GNU Database Manager), and the Berkeley Database Manager (available, open source or commercially licensed from Sleepycat software).  Basically, such a database provides a simple persistence layer for database objects.  They do not allow for complex querying, and are often used in single-purpose applications where the data does not need to be accessed by any outside application.

 

If two applications share the same database using these database managers, they must be intimately familiar with the internal data structures that are stored in the database manager.  No presentation is given to the data aside from how it is represented in the database.

Relational Database Managers

In contrast to the application specific database managers, relational database managers present an abstraction layer between the information in the database and the application using that information.  The information can then be subject to much more complex queries, and the information can be represented to the application in whatever form makes sense to the application.  Usually, SQL is the language used for such queries.

 

In a relational database manager, data is stored in “tables” and tables can be joined together based on common criteria.  In this way, several different applications can access the same information in different ways.

What does “Object Relational” mean?

PostgreSQL is “Object Relational” which means that although it is a relational database management system, the database “objects” have at least some of the traits of objects in object-oriented programming.  This tables can be inherited, though this does not apply to user defined data types, etc.  With an Object Relational system, purely relational databases can be designed and used, but also inheritance can be used in various ways.  Use of such features is beyond the scope of this paper, however.

Features Important to Enterprise Applications

Subselects

Subselects are important in large applications for two reasons.  First they allow performance tuning for certain types of queries above and beyond what a planner can normally do.  Very often subselects will outperform inner joins because the planner can focus on a simpler pattern with a maximum of one table scan per subselect and a smaller memory footprint than other forms of single-scan joins.

 

Secondly, when compared to inner joins, subselects often make for more readable queries, especially where several tables are involved.  Greater readability makes for easier maintenance of the application.

Views

A view is a stored query that appears to be a table from the application’s perspective.  Views can be used to create alternative representations of the data for other applications or for reporting purposes.  In many database managers, views can be updateable, and sometimes data can even be inserted or deleted from the view.  Thus views can be used to create logical tables that present data in a way that is less normalized than the actual structure of the information in the tables themselves.

 

Views can also be used for partitioning tables for permissions purposes-- vertically (groups of rows), horizontally (groups of columns), or both.  They can also be used to create complex virtual tables for reporting or online analytic processing, or OLAP.  In this way, views can help manage information and also help make sense of complex patterns as well.

Stored Procedures

Stored Procedures, also known as User Defined Functions, allow for a greater quantity of business logic to be stored in the database.  These functions can allow for the development of more complex views than would otherwise be possible, and can also allow for the development of a database-level API that can then be accessed by any application that has access to the database.  In this way, they can reduce development time by allowing the development of “modules” which are common to all clients of the database.

 

Stored procedures also reduce network latency issues by reducing the number of communications between the client and server.  Whole sets of queries can be run before any result set needs to be sent to the client.  In this way, performance can be improved, given sufficient hardware on the side of the database server.

Triggers

Triggers are a mechanism for automatic maintenance and management of information in the database.  Triggers can be activated when information is modified in any table. Triggers are often used to enforce referential integrity (ensuring that records in one table reference records in another table that actually exist).

 

One example of another use of a trigger is in maintaining some other record of the change made to the information of a database.  This information can later be used for reporting purposes to ensure that a consistent picture of the database can be constructed for any arbitrary point in time.  For example, I could store historical information about prices for items from my online store, and then do historical processing to see how sales were affected by price changes at a later date since all historical information would be maintained in the database.

Database Tasks: How These Features Help

Data Integrity Enforcement

For data to be useful in a large business program, it must be meaningful.  In many cases, the largest threat to the ability to make the data meaningful is the so-called “orphaned record” problem.  Orphaned records refer to records in other tables that, from the point of view of the application, no longer exist.  This can happen when records are deleted or their primary keys are updated.  Triggers can be used to ensure that this problem never exists by doing one of several things:  cascading updates/deletes, restricting updates/deletes when records reference the column being updated or row being deleted, or setting the foreign key in the referencing table to NULL (which means, in database terms, “unknown”).

 

Triggers can also restrict the types of information being inserted or updated in the database, ensuring that only meaningful input is stored.  They can also be used to maintain historical information, allowing a consistent picture of all tables to be built for any point in time.

Data Presentation

Assume for a moment that your database has several front-ends that access the information in different ways.  Perhaps this is not the case now, but it could be in the future.  It is extremely helpful if the database manager can present the information to the programs in ways that would be natural for them.  This effectively negates the requirement that every application accessing the database must have in-depth knowledge of the database schema.  Instead, the application’s expected database schema can be presented to it.

 

A very useful side effect of this capability is in integrating applications that may or may not have been designed to be integrated.  In this way, larger, more unified databases can be built for use of internal applications. 

 

Views and, depending on complexity, stored procedures would be needed to implement this sort of system, which could also provide advanced reporting capabilities, also based on views and stored procedures.

Common Modules As Stored Procedures

When several programs access the same database, there may be a reasonable amount of common functionality in the functions used by the programs.  These can be simplified using views to some extent, but often this is simply not enough.  Using stored procedures, the functions that two or more front-ends have in common can be moved to the database backend, so that there is one single point of maintenance and reference.  This technique can reduce development and maintenance time and is essential for any database being used by multiple applications.

Questions to Ask

Where will my application be deployed?

The choice of a database manager must, no doubt, take into account availability of the solutions at the point of deployment.  Of course, applications run at the business’s site can use whatever solutions are found to be affordable and/or best suited to the task at hand.  Hosted web applications are more at the mercy of what is available, however.

How necessary is integration with other programs?

Integration capability is always a selling point, as it allows for more powerful solutions to be built in pieces and lowers overall system maintenance.  However, the ability to integrate with other solutions may come at a cost, especially regarding availability of the chosen database manager in a shared hosting environment.

Major Open Source Relational Database Managers

MySQL

MySQL is the most common open source RDBMS among low-cost web hosting providers.  As of version 4.0, it does not support subselects, views, stored procedures, or triggers.  Additionally, its enforcement of data integrity is questionable at best, and it is sometimes known to truncate numeric or text values when they are too large to fit.  It is also sometimes possible to insert invalid timestamps into the database.  This can cause extreme problems for any application that relies on the information in the database being exactly what was entered (for example, accounting applications).

 

Even when views are supported in MySQL, it may be some time before these views can be used for insert, update, or delete queries.  A similar period of maturation should be expected for other features such as stored procedures and triggers.

 

Despite its limitations, MySQL’s ubiquity and the availability of a Windows port makes it an attractive database platform.  For internal business tools, however, there are better choices.

Firebird

Firebird is based on a source-code release of Borland’s Interbase RDBMS.  It is mature, and supports views, subselects, triggers, and stored procedures. As of version 1.0, it has a native Windows port available, making it ideal for in-house applications that must support both Windows and UNIX operating systems.  However, I found the learning curve to be steeper than other RDBMS’s particularly regarding the development of stored procedures.

PostgreSQL

PostgreSQL is an object-relational database management system that features every feature listed here, and extensible type system, and the ability to write stored procedures in a variety of different languages.  Indeed, there are four stored procedure languages (Perl, Tcl, SQL, and PL/PGSQL) distributed with PostgreSQL and many more available from third-party sites.

 

Although the learning curve is steeper than that of MySQL, I found it easier than Firebird.

 

As of version 7.4, PostgreSQL does not yet have a native Windows port available, making it unsuitable at the moment for applications that must be able to run on Windows database servers.

 

Although PostgreSQL has not achieved the ubiquity among shared hosting providers that MySQL has, there are many providers offering hosting at a variety of costs.  You can find a list of them at http://techdocs.postgresql.org/hosting.php.

Conclusion

Selecting the right database manager is crucial to the long-term viability of any application.  Care should be taken to ensure that the database manager chosen will continue to meet the needs of the application in the future as well as in the present.

pgsql-general by date:

Previous
From: Carmen Gloria Sepulveda Dedes
Date:
Subject: Re: autocommit off -- Sorry
Next
From: Jan Wieck
Date:
Subject: Re: Feature Request for 7.5