Re: Benchmark Data requested - Mailing list pgsql-performance

From Dimitri Fontaine
Subject Re: Benchmark Data requested
Msg-id 200802051815.25800.dfontaine@hi-media.com
In response to Re: Benchmark Data requested  (Simon Riggs <simon@2ndquadrant.com>)
List pgsql-performance
On Tuesday 05 February 2008, Simon Riggs wrote:
> I'll look at COPY FROM internals to make this faster. I'm looking at
> this now to refresh my memory; I already had some plans on the shelf.

Maybe stealing some ideas from pg_bulkload could somewhat help here?
  http://pgfoundry.org/docman/view.php/1000261/456/20060709_pg_bulkload.pdf

IIRC it's mainly about optimizing index updates while loading data, and
I've heard complaints along the lines of "this external tool has to know too
much about PostgreSQL internals to be trustworthy as non-core code"... so...

> > The basic idea is for pgloader to ask PostgreSQL about
> > constraint_exclusion, pg_inherits and pg_constraint. If pgloader
> > recognizes both the CHECK expression and the datatypes involved, and if
> > we can implement the CHECK in Python without having to resort to querying
> > PostgreSQL, then we can run a thread per partition, with as many COPY
> > FROM running in parallel as there are partitions involved (when
> > threads = -1).
> >
> > I'm not sure this will be quicker than relying on PostgreSQL triggers or
> > rules as currently used for partitioning, but ISTM the paragraph Jignesh
> > quoted is about just that.
>
> Much better than triggers and rules, but it will be hard to get it to
> work.

Well, I'm thinking about a somewhat modular approach where the pgloader code
recognizes CHECK constraints, loads the module registered for the operator
and data types involved, then uses it.
Modules and their registration would be handled at the configuration level:
I'll provide some defaults and users will be able to add their own code, the
same way on-the-fly reformat modules are handled now.

This means I'll (hopefully) be able to provide the basic cases quickly
(CHECK on dates >= x and < y, numeric ranges, etc.), and users will be able
to take care of more complex setups.
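
To make the idea a bit more concrete, here is a rough sketch of what such a
registry could look like. Nothing here is existing pgloader code: the names
EXCLUSION_MODULES, register and build_check are made up for illustration,
and the sketch assumes the CHECK expression has already been parsed into a
list of (operator, bound) terms and that dates arrive as ISO strings.

```python
from datetime import date

# Hypothetical registry: (operator, datatype) -> factory returning a predicate.
EXCLUSION_MODULES = {}


def register(operator, datatype):
    """Decorator registering a predicate factory for an operator/datatype pair."""
    def decorator(factory):
        EXCLUSION_MODULES[(operator, datatype)] = factory
        return factory
    return decorator


@register(">=", "date")
def date_ge(bound):
    limit = date.fromisoformat(bound)
    return lambda value: date.fromisoformat(value) >= limit


@register("<", "date")
def date_lt(bound):
    limit = date.fromisoformat(bound)
    return lambda value: date.fromisoformat(value) < limit


def build_check(terms, datatype):
    """Build a Python predicate from ANDed (operator, bound) terms.

    Returns None when any term has no registered module, so the caller
    knows it cannot evaluate this CHECK on its own.
    """
    predicates = []
    for operator, bound in terms:
        factory = EXCLUSION_MODULES.get((operator, datatype))
        if factory is None:
            return None
        predicates.append(factory(bound))
    return lambda value: all(p(value) for p in predicates)
```

A user-supplied module would then just be another decorated function picked
up from the configuration, e.g. a (">=", "numeric") factory.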

When the constraint doesn't match any configured pgloader exclusion module,
the trigger/rule code will be used (COPY will go to the main table). When
the Python CHECK implementation is wrong (worst case), PostgreSQL will
reject the data and pgloader will fill your reject data and log files. And
you're back to debugging your Python CHECK implementation...
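
The fallback behaviour described above could be sketched roughly as follows.
This is only an illustration under assumptions: route_rows and its arguments
are hypothetical names, rows are plain dicts, and a partition whose CHECK has
no Python implementation is represented by None. Each resulting batch would
then be handed to its own COPY FROM (not shown here).

```python
from collections import defaultdict


def route_rows(rows, column, partition_checks, parent_table):
    """Group rows by target table.

    partition_checks maps a partition name to a Python predicate, or to
    None when its CHECK is not recognized. Rows matching no usable
    predicate go to the parent table, where the existing trigger/rule
    code takes over.
    """
    usable = {t: c for t, c in partition_checks.items() if c is not None}
    batches = defaultdict(list)
    for row in rows:
        target = parent_table
        for table, check in usable.items():
            if check(row[column]):
                target = table
                break
        batches[target].append(row)
    return dict(batches)
```

If a predicate is wrong, the row lands in the wrong batch and the
partition's own CHECK constraint makes PostgreSQL reject it, which is the
worst case mentioned above.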

All of this is only a braindump as of now, and maybe quite an optimistic
one... but barring any 'I know this can't work' objection, that's what I'm
going to try to implement for the next pgloader version.

Thanks for the comments, input is really appreciated!
--
dim
