Re: Postgres for a "data warehouse", 5-10 TB - Mailing list pgsql-performance

From Marti Raudsepp
Subject Re: Postgres for a "data warehouse", 5-10 TB
On Sun, Sep 11, 2011 at 17:23, Andy Colson <andy@squeakycode.net> wrote:
> On 09/11/2011 08:59 AM, Igor Chudov wrote:
>> By the way, does that INSERT UPDATE functionality or something like this exist in Postgres?
> You have two options:
> 1) write a function like:
> create function doinsert(_id integer, _value text) returns void as
> 2) use two sql statements:

Unfortunately both of these options have caveats. Depending on your
I/O speed, you might need to use multiple loader threads to saturate
the write bandwidth.

However, neither option is safe from race conditions. If you need to
load data from multiple threads at the same time, they won't see each
other's inserts (until commit) and can therefore hit unique
violations. If you can somehow partition the work by some key, so that
the threads are guaranteed not to conflict with each other, that would
be ideal. The 2nd option given by Andy is probably faster.
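For reference, that second option typically looks something like this
(the table and column names here are just placeholders):

update mytable set value = 'foo' where id = 1;

-- if the UPDATE matched nothing, insert the row instead
insert into mytable (id, value)
select 1, 'foo'
where not exists (select 1 from mytable where id = 1);

A concurrent transaction can insert the same key between (or even
during) those two statements without either one seeing it, which is
exactly the race described above.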

You *could* code a race-condition-safe function, but that would be a
no-go on a data warehouse, since each call needs a separate
subtransaction which involves allocating a transaction ID.
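Such a function is roughly the merge_db example from the PL/pgSQL
documentation; the EXCEPTION clause is what forces a subtransaction on
every call (again, the table name is made up):

create function merge_row(_id integer, _value text) returns void as $$
begin
    loop
        -- first try to update an existing row
        update mytable set value = _value where id = _id;
        if found then
            return;
        end if;
        -- no row yet: try to insert, and retry the update if someone
        -- else inserted the same key concurrently
        begin
            insert into mytable (id, value) values (_id, _value);
            return;
        exception when unique_violation then
            -- do nothing, loop back and try the update again
        end;
    end loop;
end;
$$ language plpgsql;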

----

Which brings me to another important point: don't do lots of small
write transactions, SAVEPOINTs or PL/pgSQL subtransactions. Besides
being inefficient, they introduce a big maintenance burden. In
PostgreSQL's MVCC, each tuple contains a reference to the 32-bit
transaction ID that inserted it (xmin). Once the counter reaches the
maximum 32-bit value, transaction IDs "wrap around". To prevent old
rows from appearing as new, a "vacuum freeze" process will run after
passing autovacuum_freeze_max_age transactions (200 million by
default) to update all old rows in your database. Using fewer
transaction IDs means it runs less often.
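
For example, loading rows in batches means one transaction ID covers
the whole batch instead of one per row (again with a made-up table):

begin;
insert into mytable (id, value) values
    (1, 'a'),
    (2, 'b'),
    (3, 'c');
    -- ... thousands more rows per transaction ...
commit;

-- or, cheaper still for bulk loads:
-- copy mytable (id, value) from '/path/to/batch.csv' with csv;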

On small databases, this is usually not important. But on a 10TB data
warehouse, rewriting a large part of your database totally kills
performance for any other processes.
This is detailed in the documentation:
http://www.postgresql.org/docs/9.1/static/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND
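
If you want to see how close you are to a forced freeze, comparing the
age of pg_database.datfrozenxid against that setting gives a rough
idea:

select datname, age(datfrozenxid) as xid_age
from pg_database
order by xid_age desc;

show autovacuum_freeze_max_age;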

Regards,
Marti
