Re: Performace Optimization for Dummies - Mailing list pgsql-performance

From Carlo Stonebanks
Subject Re: Performace Optimization for Dummies
Date
Msg-id efsjpk$1cb7$1@news.hub.org
Whole thread Raw
In response to Performace Optimization for Dummies  ("Carlo Stonebanks" <stonec.register@sympatico.ca>)
Responses Re: Performace Optimization for Dummies
List pgsql-performance
> My experience with that type of load process is that doing this
> row-by-row is a very expensive approach and your results bear that out.

I expected this, and had warned the client before the project started that
this is exactly where SQL underperforms.

> It is often better to write each step as an SQL statement that operates
> on a set of rows at one time.

The problem with this approach is that every row of data is dependent on the
previous row's data being validated and imported. e.g.

Import Row 1:
John Q Smith
Foobar Corp
123 Main St,
Bigtown, MD 12345-6789

Import Row 2:
John Quincy Smith
FuzzyLoginc Inc
123 Main St, Suite 301
Bigtown, MD 12345-6789

Import Row 3:
Bobby Jones
Foobar Corp
123 Main Strett Suite 300,
Bigtown, MD 12345

Every row must be imported into the table so that the next row may see the
data and consider it when assigning ID's to the name, company and address.
(all data must be normalised) How can this be done using set logic?

> You can also improve performance by ordering your checks so that the
> ones most likely to fail happen first.

Already done - I believe the problem is definitely in the navigational
access model. What I am doing now makes perfect sense as far as the logic of
the process goes - any other developer will read it and understand what is
going on. At 3000 lines of code, this will be tedious, but understandable.
But SQL hates it.

> Trying to achieve a high level of data quality in one large project is
> not often possible. Focus on the most critical areas of checking and get
> that working first with acceptable performance, then layer on additional
> checks while tuning. The complexity of the load programs you have also
> means they are susceptible to introducing data quality problems rather
> than removing them, so an incremental approach will also aid debugging
> of the load suite.

I couldn't agree more.

Carlo



pgsql-performance by date:

Previous
From: "Carlo Stonebanks"
Date:
Subject: Re: Performace Optimization for Dummies
Next
From: "Carlo Stonebanks"
Date:
Subject: Re: Performace Optimization for Dummies