Re: [GENERAL] Re: Is PostgreSQL ready for mission critical applications? - Mailing list pgsql-general

From: Mike Mascari
Subject: Re: [GENERAL] Re: Is PostgreSQL ready for mission critical applications?
Msg-id: 386A5BBD.71375361@mascari.com
In response to: RE: [GENERAL] Re: Is PostgreSQL ready for mission critical applications?  ("Barnes" <aardvark@ibm.net>)
Responses: Re: [GENERAL] Re: Is PostgreSQL ready for mission critical applications?  (Bruce Momjian <pgman@candle.pha.pa.us>)
List: pgsql-general

Barnes wrote:
>
> It would be helpful to me to hear about successful and stable
> implementations as well.  If some of you who are using PostgreSQL
> successfully could comment on your experiences, I think it would shed some
> worthwhile light on its capabilities.  I'm considering using it for a
> mission critical project, and I would like to know what I am getting into.
> Thank you.
>
> David Barnes
>

We've used it successfully in a production environment (24 x
7) for over a year now. Simply reading the mailing list will
greatly improve your chances of success. The problems with
PostgreSQL can be avoided if you know, in advance, what to
avoid. But most people don't. Here's my list of things which
can get you into trouble:

0. Running with fsync on - With fsync() off, modified
records that have been written into kernel buffers but not
yet to disk can be lost if the operating system crashes at
the wrong moment. That is why PostgreSQL's default mode is
to run with fsync() on. But it slows down the database by
(quite literally) several orders of magnitude. We've run
with fsync off (-o -F) without a problem. Dump/Reload of
large databases with fsync() on really tests one's pain
threshold. If you trust your OS, run with it off.
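
For what it's worth, that's just a flag passed through at
postmaster startup; something along these lines (the data
directory path is whatever yours is):

postmaster -o -F -D /usr/local/pgsql/data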

1. Making use of oids - Problems with dumping/restoring oids
make this development path dangerous. Most people use the
SERIAL data type to generate primary keys. However, SERIAL
itself has some peculiarities, since it just auto-creates a
sequence. Dropping the associated table doesn't drop the
sequence (IIRC), so scripted or automated schema creation
and teardown has to handle those sequences explicitly. I
prefer to manually use a sequence and
an int4 type for primary keys. In fact, you could use the
same sequence for all of your primary keys, if you aren't
exposing the value to the user in any meaningful way, and
don't plan to hit the 4.2 billion limit of int4 soon, and
don't care about gaps... (although a purist would argue, and
I agree, that IF you are going to use generated keys THEN
the key should have no more meaning than that it refers to
a record).
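
To make that concrete, here's roughly what we end up with
(the names here are just made up for illustration):

CREATE SEQUENCE employees_id_seq;
CREATE TABLE employees (
    id   int4 DEFAULT nextval('employees_id_seq') PRIMARY KEY,
    name varchar(40)
);
-- and the teardown script has to drop the sequence itself;
-- DROP TABLE employees does not do it for you:
-- DROP SEQUENCE employees_id_seq;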

2. Using views created with large queries - Views use the
rewrite system and rules to rewrite queries against the view
so that they properly fetch data from the underlying
tables. Because
there is currently a limit on the size of a single database
record (8192 bytes), the queries associated with views can
only be so big. In addition, you can get into trouble if
views are built on top of user-defined functions, which is a
common thing to do. If you drop/recreate the underlying
function, then the view needs to be dropped and recreated as
well. In addition, I've had trouble with dump/reload of
views in the past, and have always kept my view definitions
in a separate views.sql script, just in case...
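
The views.sql habit is nothing fancy; it's just a script of
statements like these (names invented here) that can be
re-run whenever a view has to be dropped and recreated:

DROP VIEW active_employees;
CREATE VIEW active_employees AS
    SELECT id, name
    FROM employees
    WHERE terminated = 'f';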

3. Using non-standard types - Because of problems with data
comparisons, type coercion and spaces, we avoided types such
as bpchar, char, and even text. We avoided text since ODBC
clients could not determine maximum field width. We also
avoided all of the non-4 byte integer types, such as int2.
This is because the default type coercion code in the
past has had trouble being smart enough to use indexes when
given a SELECT such as:

CREATE TABLE y (x text, z int2);

SELECT x FROM y WHERE z = 3;

because the 3 would be coerced to an int4 and, if z was an
int2, would result in a sequential scan, whereas:

SELECT x FROM y WHERE z = '3';

would use the index since it is initially parsed as a string
and coerced properly at a later point. I think much of this
has been fixed, but nevertheless... In addition, our
varchar() types are pretty much under the 255 limit since
some ODBC clients have problems with varchar() types greater
than 255. We only use: int4, varchar, datetime, and float8.
On rare occasion, we'll use text for free-form information,
but we NEVER index it. Although it's VERY tempting (and
PostgreSQL itself uses them), we avoid arrays.
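
So a typical table of ours ends up looking something like
this (purely illustrative):

CREATE TABLE invoices (
    invoice_id  int4 NOT NULL,
    customer    varchar(40),
    invoiced_on datetime,
    amount      float8
);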

4. Be careful about user-defined functions/triggers -
PostgreSQL keeps track of everything by its oid, not by name
(which would obviously be too slow). But, unfortunately, it
does not yet support the modification of functions, allowing
the function to retain its original oid (or perform a
cascading update - it will be nice when RI is integrated
into the system catalogue!). As a result, odd things can
happen if you drop and recreate a function. For example, you
could have a trigger call a procedural-language function
which, in turn, selects from a view in which one of the
attributes is the result of a function. If you
dropped/recreated that function, things go a bit weird and
usually result in an error such as "function not in cache".
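
The only reliable recipe I know of is to drop and recreate
everything that sits on top of the function, in order;
something like this (hypothetical names, sketch only):

DROP VIEW current_staff;             -- dependents first
DROP FUNCTION is_active(int4);
CREATE FUNCTION is_active(int4) RETURNS bool
    AS 'SELECT NOT terminated FROM employees WHERE id = $1'
    LANGUAGE 'sql';
CREATE VIEW current_staff AS         -- then rebuild the view
    SELECT id, name FROM employees WHERE is_active(id);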

5. Using DDL statements in transactions - PostgreSQL has
trouble rolling back aborted transactions that contain DDL
statements. As a result, you might find yourself having to
delete a filesystem file by hand, because, even though a
CREATE TABLE might have been rolled back as far as the
system catalogue is concerned, the underlying file might
still manage to exist. Or worse, rollback of index
DROP/CREATE in a transaction yields erroneous results.
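
In other words, a sequence like this is the sort of thing
that leaves debris behind (a sketch, not an exact recipe):

BEGIN;
CREATE TABLE scratch (id int4);
-- some later statement fails and the transaction aborts...
ROLLBACK;
-- pg_class no longer shows "scratch", but a file for it can
-- be left sitting in the database directory.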

6. Using indexes on large fields - Apparently the code
requires 3 tuples per page (or something like that) for the
index to function properly. This can include plpgsql source,
so be careful. We never index on anything larger than 255,
but I believe around 2.8K is the limit before tickling
bugs...

7. Using INSERTS instead of COPY - Even when you have
fsync() off and are running INSERT statements in a
transaction, the processing of individual INSERT statements
by the thousands is also several orders of magnitude slower
than COPY. We have large mainframe datasets which we import
nightly - we first convert them to a format appropriate for
COPY and then COPY them in, instead of INSERTing record by
record.
The problem with COPY is it runs as user postgres, so you
need to have the data files readable by user postgres.
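
So the nightly load boils down to something like this (the
path is just an example; the default COPY format is
tab-delimited text):

COPY employees FROM '/home/postgres/import/employees.tab';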

8. Not running VACUUM - PostgreSQL won't use indexes, or
won't optimize correctly unless the record count and
dispersion estimates are up-to-date. People have reported
problems with running vacuum while under heavy load. We
haven't seen it, but we run vacuum each night at 4:05 a.m.
However, if you perform a LARGE number of INSERTS/UPDATES,
it is better for you to do the following:

DROP INDEX index_on_heavily_used_table;
VACUUM ANALYZE;
CREATE INDEX index_on_heavily_used_table ON
    heavily_used_table (heavily_used_column);

This is because, with the indexes in place, VACUUM will sit
there and, row by row, essentially "defragment" them, which
can take damn near forever once the number of updates or
deletes gets beyond, say, 30,000 rows.

9. ALTER TABLE ADD COLUMN - It's better to rebuild the table
by hand than to use this DDL statement. First off, any
column constraints (such as NOT NULL) will be silently
ignored, and secondly, inherited relations have problems
with dump/restore.
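
By "rebuild by hand" I mean the usual create/copy/drop/rename
dance, roughly like this (names and the new column are
invented; indexes, triggers, views and grants have to be
recreated afterwards):

CREATE TABLE employees_new (
    id   int4 NOT NULL,
    name varchar(40),
    dept varchar(20) DEFAULT 'UNKNOWN' NOT NULL  -- new column
);
INSERT INTO employees_new (id, name)
    SELECT id, name FROM employees;
DROP TABLE employees;
ALTER TABLE employees_new RENAME TO employees;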

10. IN, INTERSECT, EXCEPT - When writing your application,
these SQL constructs seem nice, particularly since the data
in your design database may be small initially. But all
three of them force a nested sequential scan on the inner
relation. For example:

emptoris=> explain SELECT employee FROM employees WHERE
employee NOT IN (SELECT webuser FROM webusers);
NOTICE:  QUERY PLAN:

Seq Scan on employees  (cost=3.95 rows=59 width=12)
  SubPlan
    ->  Seq Scan on webusers  (cost=7.78 rows=145 width=12)

EXPLAIN

Since INTERSECT/EXCEPT rewrite the query to use IN, the
problem exists with them as well. And since PostgreSQL does
not yet have outer joins, you should instead write the query
using a correlated subquery (EXISTS):

emptoris=> explain SELECT employee FROM employees WHERE NOT
EXISTS (SELECT webuser FROM webusers WHERE webusers.webuser
= employees.employee);
NOTICE:  QUERY PLAN:

Seq Scan on employees  (cost=3.95 rows=59 width=12)
  SubPlan
    ->  Index Scan using k_webusers1 on webusers  (cost=2.05
rows=1 width=12)

EXPLAIN

There are many more such things which, if avoided, allow
PostgreSQL to work great. But with each release, a lot of
these things become obsolete.

Mike Mascari
