Re: Postgres for a "data warehouse", 5-10 TB - Mailing list pgsql-performance

From Marti Raudsepp
Subject Re: Postgres for a "data warehouse", 5-10 TB
On Sun, Sep 11, 2011 at 17:23, Andy Colson <andy@squeakycode.net> wrote:
> On 09/11/2011 08:59 AM, Igor Chudov wrote:
>> By the way, does that INSERT UPDATE functionality or something like this exist in Postgres?
> You have two options:
> 1) write a function like:
> create function doinsert(_id integer, _value text) returns void as
> 2) use two sql statements:

Unfortunately both of these options have caveats. Depending on your
I/O speed, you might need to use multiple loader threads to saturate
the write bandwidth.

However, neither option is safe from race conditions. If you need to
load data from multiple threads at the same time, they won't see each
other's inserts (until commit) and can therefore hit unique
violations. If you can somehow partition the work by some key, so that
the threads are guaranteed not to conflict with each other, that would
be ideal. The 2nd option given by Andy is probably faster.
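For reference, that second option typically looks something like this
(the table and column names here are just placeholders):

update mytable set value = 'foo' where id = 1;

-- if the UPDATE matched nothing, insert the row instead
insert into mytable (id, value)
select 1, 'foo'
where not exists (select 1 from mytable where id = 1);

A concurrent transaction can insert the same key between (or even
during) those two statements without either one seeing it, which is
exactly the race described above.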

You *could* code a race-condition-safe function, but that would be a
no-go on a data warehouse, since each call needs a separate
subtransaction which involves allocating a transaction ID.
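Such a function is roughly the merge_db example from the PL/pgSQL
documentation; the EXCEPTION clause is what forces a subtransaction on
every call (again, the table name is made up):

create function merge_row(_id integer, _value text) returns void as $$
begin
    loop
        -- first try to update an existing row
        update mytable set value = _value where id = _id;
        if found then
            return;
        end if;
        -- no row yet: try to insert, and retry the update if someone
        -- else inserted the same key concurrently
        begin
            insert into mytable (id, value) values (_id, _value);
            return;
        exception when unique_violation then
            -- do nothing, loop back and try the update again
        end;
    end loop;
end;
$$ language plpgsql;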

----

Which brings me to another important point: don't do lots of small
write transactions, SAVEPOINTs or PL/pgSQL subtransactions. Besides
being inefficient, they introduce a big maintenance burden. In
PostgreSQL's MVCC, each tuple contains a reference to the 32-bit
transaction ID that inserted it (xmin). Once the counter reaches the
maximum 32-bit value, transaction IDs "wrap around". To prevent old
rows from appearing as new, a "vacuum freeze" process will run after
passing autovacuum_freeze_max_age transactions (200 million by
default) to update all old rows in your database. Using fewer
transaction IDs means it runs less often.
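
For example, loading rows in batches means one transaction ID covers
the whole batch instead of one per row (again with a made-up table):

begin;
insert into mytable (id, value) values
    (1, 'a'),
    (2, 'b'),
    (3, 'c');
    -- ... thousands more rows per transaction ...
commit;

-- or, cheaper still for bulk loads:
-- copy mytable (id, value) from '/path/to/batch.csv' with csv;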

On small databases, this is usually not important. But on a 10TB data
warehouse, rewriting a large part of your database totally kills
performance for any other processes.
This is detailed in the documentation:
http://www.postgresql.org/docs/9.1/static/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND
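
If you want to see how close you are to a forced freeze, comparing the
age of pg_database.datfrozenxid against that setting gives a rough
idea:

select datname, age(datfrozenxid) as xid_age
from pg_database
order by xid_age desc;

show autovacuum_freeze_max_age;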

Regards,
Marti
