Thread: Performace Optimization for Dummies

Performace Optimization for Dummies

From

"Carlo Stonebanks"

Date:

28 September 2006, 13:44:23

I am a software developer who is acting in a (temporary) dba role for a
project. I had recommended PostgreSQL be brought in to replace the proposed
MySQL DB - I chose PostgreSQL because of its reputation as a more stable
solution than MySQL.

At this early stage in the project, we are initializing our portal's
database with millions of rows of imported data in over 50 different
flattened tables; each table's structure is unique to the data provider.
This requires a pretty complex import program, because the data must be
matched semantically, not literally. Even with all of the expression
matching and fuzzy logic in the code,our performance statistics show that
the program spends over 75% of its time in SQL queries looking for matching
and/or duplicate data.

The import is slow - and degrades as the tables grow. With even more
millions of rows in dozens of import tables to come, the imports will take
forever. My ability to analyse the queries is limited; because of the nature
of the import process, the SQL queries are mutable, every imported row can
change the structure of a SQL query as the program adds and subtracts search
conditions to the SQL command text before execution. The import program is
scripted in Tcl. An attempt to convert our queries to prepared queries
(curiousy) did not bring any performance improvements, and we converted back
to simplify the code.

We urgently need a major performance improvement. We are running the
PostgreSQL 8.1.4 on a Windows 2003 x64 Server on a dual processor, dual core
3.2Ghz Xeon box with 4gb RAM and a RAID (sorry, I don't know what type) disc
subsystem. Sorry about the long intro, but here are my questions:

1) Are we paying any big penalties by running Windows vs LINUX (or any other
OS)?

2) Has the debate over PostgreSQL and Xeon processors been settled? Is this
a factor?

3) Are there any easy-to-use performance analysis/optimisation tools that we
can use? I am dreaming of one that could point out problems and suggest
and.or effect solutions.

4) Can anyone recommend any commercial PostgreSQL service providers that may
be able to swiftly come in and assist us with our performance issues?

Below, please find what I believe are the configuration settings of interest
in our system

Any help and advice will be much appreciated. TIA,

Carlo

max_connections = 100
shared_buffers = 50000
work_mem = 32768
maintenance_work_mem = 32768
checkpoint_segments = 128
effective_cache_size = 10000
random_page_cost = 3
stats_start_collector = on
stats_command_string = on
stats_row_level = on
autovacuum = on

Re: Performace Optimization for Dummies

From

"Joshua D. Drake"

Date:

28 September 2006, 14:11:39

> The import is slow - and degrades as the tables grow. With even more
> millions of rows in dozens of import tables to come, the imports will take
> forever. My ability to analyse the queries is limited; because of the nature
> of the import process, the SQL queries are mutable, every imported row can
> change the structure of a SQL query as the program adds and subtracts search
> conditions to the SQL command text before execution. The import program is
> scripted in Tcl. An attempt to convert our queries to prepared queries
> (curiousy) did not bring any performance improvements, and we converted back
> to simplify the code.

How are you loading the tables? Copy? Insert?

>
> We urgently need a major performance improvement. We are running the
> PostgreSQL 8.1.4 on a Windows 2003 x64 Server on a dual processor, dual core
> 3.2Ghz Xeon box with 4gb RAM and a RAID (sorry, I don't know what type) disc
> subsystem. Sorry about the long intro, but here are my questions:
>
> 1) Are we paying any big penalties by running Windows vs LINUX (or any other
> OS)?

Yes. Linux or FreeBSD is going to stomp Win32 for PostgreSQL performance.

>
> 2) Has the debate over PostgreSQL and Xeon processors been settled? Is this
> a factor?

Depends. PostgreSQL is much better with the Xeon in general, but are you
running woodcrest based CPUs or the older models?

>
> 3) Are there any easy-to-use performance analysis/optimisation tools that we
> can use? I am dreaming of one that could point out problems and suggest
> and.or effect solutions.

I don't know about Windows, but *nix has a number of tools available
directly at the operating system level to help you determine various
bottlenecks.

>
> 4) Can anyone recommend any commercial PostgreSQL service providers that may
> be able to swiftly come in and assist us with our performance issues?

http://www.commandprompt.com/ (disclaimer, I am an employee)

>
> Below, please find what I believe are the configuration settings of interest
> in our system
>
> Any help and advice will be much appreciated. TIA,
>
> Carlo
>
> max_connections = 100
> shared_buffers = 50000

This could probably be higher.

> work_mem = 32768

Depending on what you are doing, this is could be to low or to high.

> maintenance_work_mem = 32768
> checkpoint_segments = 128
> effective_cache_size = 10000

This coudl probably be higher.

> random_page_cost = 3
> stats_start_collector = on
> stats_command_string = on
> stats_row_level = on
> autovacuum = on

Stats are a hit... you need to determine if you actually need them.

Joshua D. Drake



>
>
>
>
>
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>        subscribe-nomail command to majordomo@postgresql.org so that your
>        message can get through to the mailing list cleanly
>


--

   === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
   Providing the most comprehensive  PostgreSQL solutions since 1997
             http://www.commandprompt.com/

Re: Performace Optimization for Dummies

From

"Merlin Moncure"

Date:

28 September 2006, 14:18:11

On 9/28/06, Carlo Stonebanks <stonec.register@sympatico.ca> wrote:
> We urgently need a major performance improvement. We are running the
> PostgreSQL 8.1.4 on a Windows 2003 x64 Server on a dual processor, dual core
> 3.2Ghz Xeon box with 4gb RAM and a RAID (sorry, I don't know what type) disc
> subsystem. Sorry about the long intro, but here are my questions:

are you using the 'copy' interface?

> 1) Are we paying any big penalties by running Windows vs LINUX (or any other
> OS)?

thats a tough question.  my gut says that windows will not scale as
well as recent linux kernels in high load environments.

> 2) Has the debate over PostgreSQL and Xeon processors been settled? Is this
> a factor?

hearing good things about the woodcrest. pre-woodcrest xeon (dempsey
down) is outclassed by the opteron.

> Below, please find what I believe are the configuration settings of interest
> in our system

1. can probably run fsync=off during the import
2. if import is single proecess, consider temporary bump to memory for
index creation. or, since you have four cores consider having four
processes import the data somehow.
3. turn off stats collector, stats_command_string, stats_row_level,
and autovacuum during import.

merlin

> Any help and advice will be much appreciated. TIA,
>
> Carlo
>
> max_connections = 100
> shared_buffers = 50000
> work_mem = 32768
> maintenance_work_mem = 32768
> checkpoint_segments = 128
> effective_cache_size = 10000
> random_page_cost = 3
> stats_start_collector = on
> stats_command_string = on
> stats_row_level = on
> autovacuum = on
>
>
>
>
>
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>        subscribe-nomail command to majordomo@postgresql.org so that your
>        message can get through to the mailing list cleanly
>

Re: Performace Optimization for Dummies

From

"Carlo Stonebanks"

Date:

28 September 2006, 14:48:00

> How are you loading the tables? Copy? Insert?

Once the data is transformed, it is inserted. I don't have stats, but the
programs visual feedback does not spend a lot of time on the "inserting
data" message. Then again, if there is an asynchronous component to an
insert, perhaps I am not seeing how slow an insert really is until I query
the table.

> Yes. Linux or FreeBSD is going to stomp Win32 for PostgreSQL performance.

Don't suppose you'd care to hazard a guess on what sort of scale we're
talking about? Are we talking about 10%? 100% I know this is a hard one to
judge,  My impression was that the *NIX improvements were with concurrent
use and right now, I am obsessing on this single-threaded issue.

> Depends. PostgreSQL is much better with the Xeon in general, but are you
> running woodcrest based CPUs or the older models?

Weren't those released in July? This server is a few months older, so I
guess not. But maybe? Does Dell have the ability to install CPUs from the
future like Cyberdyne does? ;-)

> I don't know about Windows, but *nix has a number of tools available
> directly at the operating system level to help you determine various
> bottlenecks.

Are we talking about I/O operations? I was thinking of SQL query analysis.
The stuff I read here about query analysis is pretty intruiging, but to
someone unfamiliar with this type of query analysis it all looks quite
uncertain to me. I mean, I read the threads and it all looks like a lot of
trying ot figure out how to cajole PostgreSQL to do what you want, rather
than telling it: HEY I CREATED THAT INDEX FOR A REASON, USE IT!

I know this may be non-dba sophistication on my part, but I would like a
tool that would make this whole process a little less arcane. I'm not the
Gandalf type.

>> 4) Can anyone recommend any commercial PostgreSQL service providers that
>> may
>> be able to swiftly come in and assist us with our performance issues?
>
> http://www.commandprompt.com/ (disclaimer, I am an employee)

Very much appreciated.

>> max_connections = 100
>> shared_buffers = 50000
>
 This could probably be higher.

Ok, good start...

>
>> work_mem = 32768
>
> Depending on what you are doing, this is could be to low or to high.

Is this like "You could be too fat or too thin"? Aren't you impressed with
the fact that I managed to pick the one number that was not right for
anything?

>
>> maintenance_work_mem = 32768
>> checkpoint_segments = 128
>> effective_cache_size = 10000
>
> This coudl probably be higher.

... noted...

>
>> random_page_cost = 3
>> stats_start_collector = on
>> stats_command_string = on
>> stats_row_level = on
>> autovacuum = on
>
> Stats are a hit... you need to determine if you actually need them.

Unfortunately, this is the only way I know of of getting the query string to
appear in the PostgreSQL server status display. While trying to figure out
what is slowing things down, having that is really helpful. I also imagined
that this sort of thing would be a performance hit when you are getting lots
of small, concurrent queries. In my case, we have queries which are taking
around a second to perform outer joins. They aren't competing with any other
requests as the site is not running, we are just running one app to seed the
data.

Re: Performace Optimization for Dummies

From

"Carlo Stonebanks"

Date:

28 September 2006, 14:53:36

> are you using the 'copy' interface?

Straightforward inserts - the import data has to transformed, normalised and
de-duped by the import program. I imagine the copy interface is for more
straightforward data importing. These are - buy necessity - single row
inserts.

> thats a tough question.  my gut says that windows will not scale as
> well as recent linux kernels in high load environments.

But not in the case of a single import program trying to seed a massive
database?

> hearing good things about the woodcrest. pre-woodcrest xeon (dempsey
> down) is outclassed by the opteron.

Need to find a way to deterimine the Xeon type. The server was bought in
early 2006, and it looks like woodcrest was form July.

> 1. can probably run fsync=off during the import
> 2. if import is single proecess, consider temporary bump to memory for
> index creation. or, since you have four cores consider having four
> processes import the data somehow.
> 3. turn off stats collector, stats_command_string, stats_row_level,
> and autovacuum during import.

Very helpful, thanks.

Carlo

Re: Performace Optimization for Dummies

From

Steve Atkins

Date:

28 September 2006, 15:26:33

On Sep 28, 2006, at 10:53 AM, Carlo Stonebanks wrote:

>> are you using the 'copy' interface?
>
> Straightforward inserts - the import data has to transformed,
> normalised and
> de-duped by the import program. I imagine the copy interface is for
> more
> straightforward data importing. These are - buy necessity - single row
> inserts.

Are you wrapping all this in a transaction?

You're doing some dynamically generated selects as part of the
"de-duping" process? They're probably the expensive bit. What
do those queries tend to look like?

Are you analysing the table periodically? If not, then you might
have statistics based on an empty table, or default statistics, which
might cause the planner to choose bad plans for those selects.

Talking of which, are there indexes on the table? Normally you
wouldn't have indexes in place during a bulk import, but if you're
doing selects as part of the data load process then you'd be forcing
sequential scans for every query, which would explain why it gets
slower as the table gets bigger.

Cheers,
   Steve

Re: Performace Optimization for Dummies

From

"Dave Dutcher"

Date:

28 September 2006, 15:27:26

> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Carlo
Stonebanks
> Subject: [PERFORM] Performace Optimization for Dummies
>
> At this early stage in the project, we are initializing our portal's
> database with millions of rows of imported data in over 50 different
> flattened tables; each table's structure is unique to the
> data provider.
> This requires a pretty complex import program, because the
> data must be
> matched semantically, not literally. Even with all of the expression
> matching and fuzzy logic in the code,our performance
> statistics show that
> the program spends over 75% of its time in SQL queries
> looking for matching
> and/or duplicate data.
>
> The import is slow - and degrades as the tables grow.

So your program first transforms the data and then inserts it?  And it is
the transforming process which is running select statements that is slow?
If that is the case you could use duration logging to find the slow select
statement, and then you could post an EXPLAIN ANALYZE of the select.

One question off the top of my head is are you using regular expressions for
your fuzzy logic if so do your indexes have the right operator classes?
(see http://www.postgresql.org/docs/8.1/static/indexes-opclass.html)

Dave

Re: Performace Optimization for Dummies

From

"Jim C. Nasby"

Date:

28 September 2006, 15:34:29

On Thu, Sep 28, 2006 at 10:11:31AM -0700, Joshua D. Drake wrote:
> > 4) Can anyone recommend any commercial PostgreSQL service providers that may
> > be able to swiftly come in and assist us with our performance issues?
>
> http://www.commandprompt.com/ (disclaimer, I am an employee)

You forgot us. :)
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: Performace Optimization for Dummies

From

"Jim C. Nasby"

Date:

28 September 2006, 15:44:29

On Thu, Sep 28, 2006 at 01:47:44PM -0400, Carlo Stonebanks wrote:
> > How are you loading the tables? Copy? Insert?
>
> Once the data is transformed, it is inserted. I don't have stats, but the
> programs visual feedback does not spend a lot of time on the "inserting
> data" message. Then again, if there is an asynchronous component to an
> insert, perhaps I am not seeing how slow an insert really is until I query
> the table.

Well, individual inserts are slow, especially if they're not wrapped up
in a transaction. And you also mentioned checking for dupes. I suspect
that you're not going to find any huge gains in tuning the database...
it sounds like the application (as in: how it's using the database) is
what needs help.

> >> work_mem = 32768
> >
> > Depending on what you are doing, this is could be to low or to high.
>
> Is this like "You could be too fat or too thin"? Aren't you impressed with
> the fact that I managed to pick the one number that was not right for
> anything?

For what you're doing, it's probably fine where it is... but while
you're in the single-thread case, you can safely make that pretty big
(like 1000000).

> >
> >> maintenance_work_mem = 32768
> >> checkpoint_segments = 128
> >> effective_cache_size = 10000
> >
> > This coudl probably be higher.

I'd suggest setting it to about 3G, or 375000.
> >
> >> random_page_cost = 3
> >> stats_start_collector = on
> >> stats_command_string = on
> >> stats_row_level = on
> >> autovacuum = on
> >
> > Stats are a hit... you need to determine if you actually need them.
>
> Unfortunately, this is the only way I know of of getting the query string to
> appear in the PostgreSQL server status display. While trying to figure out
> what is slowing things down, having that is really helpful. I also imagined
> that this sort of thing would be a performance hit when you are getting lots
> of small, concurrent queries. In my case, we have queries which are taking
> around a second to perform outer joins. They aren't competing with any other
> requests as the site is not running, we are just running one app to seed the
> data.

stats_command_string can extract a huge penalty pre-8.2, on the order of
30%. I'd turn it off unless you *really* need it. Command logging (ie:
log_min_duration_statement) is much less of a burden.

The fact that you're doing outer joins while loading data really makes
me suspect that the application needs to be changed for any real
benefits to be had. But you should still look at what EXPLAIN ANALYZE is
showing you on those queries; you might be able to find some gains
there.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: Performace Optimization for Dummies

From

"Jim C. Nasby"

Date:

28 September 2006, 15:46:07

On Thu, Sep 28, 2006 at 01:53:22PM -0400, Carlo Stonebanks wrote:
> > are you using the 'copy' interface?
>
> Straightforward inserts - the import data has to transformed, normalised and
> de-duped by the import program. I imagine the copy interface is for more
> straightforward data importing. These are - buy necessity - single row
> inserts.

BTW, stuff like de-duping is something you really want the database -
not an external program - to be doing. Think about loading the data into
a temporary table and then working on it from there.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: Performace Optimization for Dummies

From

Matthew Nuzum

Date:

28 September 2006, 16:07:37

Carlo Stonebanks wrote:
>> are you using the 'copy' interface?
>
> Straightforward inserts - the import data has to transformed, normalised and
> de-duped by the import program. I imagine the copy interface is for more
> straightforward data importing. These are - buy necessity - single row
> inserts.
>

I know this is an answer to a question you didn't ask, but here it is. I
was once doing stuff where I processed log files and had to do many
lookups to normalize the data before insertion.

I started out doing everything in SQL and using postgresql tables and it
took a little over 24 hours to process 24 hours worth of data. Like you,
it was single process, many lookups.

I found a better way. I rewrote it (once in c#, again in python) and
used hashtables/dictionaries instead of tables for the lookup data. For
example, I'd start by loading the data into hash tables (yes, this took
a *lot* of ram) then for each row I did something like:
 1. is it in the hash table?
   1. If not, insert it into the db
   1. Insert it into the hashtable
 2. Get the lookup field out of the hash table
 3. Output normalized data

This allow me to create text files containing the data in COPY format
which can then be inserted into the database at dramatically increased
speeds.

My first version in C# (mono) cut the time down to 6 hours for 24 hours
worth of data. I tweaked the algorithms and rewrote it in Python and got
it down to 45 min. (Python can't take all the credit for the performance
boost, I used an improved technique that could have been done in C# as
well) This time included the time needed to do the copy and update the
indexes.

I created a version that also used gdb databases instead of hash tables.
It increased the time from 45 min to a little over an hour but decreased
the memory usage to something like 45MB (vs dozens or hundreds of MB per
hashtable)
--
Matthew Nuzum
newz2000 on freenode

Re: Performace Optimization for Dummies

From

"Carlo Stonebanks"

Date:

28 September 2006, 16:11:03

> Are you wrapping all this in a transaction?

Yes, the transactions can typically wrap 1 to 10 single-table, single-row
inserts and updates.


> You're doing some dynamically generated selects as part of the
> "de-duping" process? They're probably the expensive bit. What
> do those queries tend to look like?

Without a doubt, this is the expensive bit.

> Are you analysing the table periodically? If not, then you might
> have statistics based on an empty table, or default statistics, which
> might cause the planner to choose bad plans for those selects.

Now there's something I didn't know - I thought that analysis and planning
was done with every select, and the performance benefit of prepared
statements was to plan-once, execute many. I can easily put in a periodic
analyse statement. I obviously missed how to use analyze properluy, I
thought it was just human-readable output - do I understand correctly, that
it can be used to get the SQL server to revaluate its plan based on newer
statistics - even on non-prepared queries?

> Talking of which, are there indexes on the table? Normally you
> wouldn't have indexes in place during a bulk import, but if you're
> doing selects as part of the data load process then you'd be forcing
> sequential scans for every query, which would explain why it gets
> slower as the table gets bigger.

There are indexes for every obvious "where this = that" clauses. I don't
believe that they will work for ilike expressions.

>
> Cheers,
>   Steve
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
>               http://www.postgresql.org/docs/faq
>

Re: Performace Optimization for Dummies

From

"Carlo Stonebanks"

Date:

28 September 2006, 16:47:38

> So your program first transforms the data and then inserts it?  And it is
> the transforming process which is running select statements that is slow?

There are cross-referencing and deduplication processes. Does this person
have an office at this exact address? In a similarily named building in the
same zip code? City? What is the similarity of the building or enterprise
names? Is there a person with a similar name with the same type of
professional license nearby? We basically look for the statistical
likelyhood that they already exist to decide whether to update their data,
or insert a new data element.

These are all extremely soft queries and require left outer joins with all
of the related tables that would contain this data (the left outer join
tells us whether the related element satisfied the search condition). As I
mentioned, as the data comes in, we examine what we have to work with and
modify the tables and columns we can check - which is what I meant by " the
SQL queries are mutable, every imported row can change the structure of a
SQL query as the program adds and subtracts search conditions to the SQL
command text before execution."

> If that is the case you could use duration logging to find the slow select
> statement, and then you could post an EXPLAIN ANALYZE of the select.

I'm pretty sure I know who the culprit is, and - like I said, it comes from
a section of code that creates a mutable statement. However, everyone is
being so helpful and I should post this data as soon as I can.

> One question off the top of my head is are you using regular expressions
> for
> your fuzzy logic if so do your indexes have the right operator classes?
> (see http://www.postgresql.org/docs/8.1/static/indexes-opclass.html)

I am using regular expressions and fuzzy logic, but mostly on the client
side (I have a Tcl implementation of levenshtein, for example). I don't
think you can use indexes on functions such as levenshtein, because it
requires a parameter only available at execution time. The link you sent me
was very interesting - I will definitely reconsider my query logic if I can
optimise regular expression searches on the server. Thanks!

Carlo

Re: Performace Optimization for Dummies

From

"Merlin Moncure"

Date:

28 September 2006, 17:07:02

On 9/28/06, Carlo Stonebanks <stonec.register@sympatico.ca> wrote:
> > are you using the 'copy' interface?
>
> Straightforward inserts - the import data has to transformed, normalised and
> de-duped by the import program. I imagine the copy interface is for more
> straightforward data importing. These are - buy necessity - single row
> inserts.

right. see comments below.

> > thats a tough question.  my gut says that windows will not scale as
> > well as recent linux kernels in high load environments.
>
> But not in the case of a single import program trying to seed a massive
> database?

probably not.

> > hearing good things about the woodcrest. pre-woodcrest xeon (dempsey
> > down) is outclassed by the opteron.
>
> Need to find a way to deterimine the Xeon type. The server was bought in
> early 2006, and it looks like woodcrest was form July.

ok, there are better chips out there but again this is not something
you would really notice outside of high load environements.

> > 1. can probably run fsync=off during the import
> > 2. if import is single proecess, consider temporary bump to memory for
> > index creation. or, since you have four cores consider having four
> > processes import the data somehow.
> > 3. turn off stats collector, stats_command_string, stats_row_level,
> > and autovacuum during import.

by the way, stats_command_string is a known performance killer that
iirc was improved in 8.2. just fyi.

I would suggest at least consideration of retooling your import as
follows...it might be a fun project to learn some postgresql
internals.  I'm assuming you are doing some script preprocessing in a
language like perl:

bulk load denomalized tables into scratch tables into the postgresql
database. create indexes appropriate to the nomalization process
remembering you can index on virtually any expression in postgresql
(including regex substitution).

use sql to process the data. if tables are too large to handle with
monolithic queries, use cursors and/or functions to handle the
conversion.  now you can keep track of progress using pl/pgsql raise
notice command for example.

merlin

Re: Performace Optimization for Dummies

From

"Carlo Stonebanks"

Date:

28 September 2006, 17:15:16

The deduplication process requires so many programmed procedures that it
runs on the client. Most of the de-dupe lookups are not "straight" lookups,
but calculated ones emplying fuzzy logic. This is because we cannot dictate
the format of our input data and must deduplicate with what we get.

This was one of the reasons why I went with PostgreSQL in the first place,
because of the server-side programming options. However, I saw incredible
performance hits when running processes on the server and I partially
abandoned the idea (some custom-buiilt name-comparison functions still run
on the server).

I am using Tcl on both the server and the client. I'm not a fan of Tcl, but
it appears to be quite well implemented and feature-rich in PostgreSQL. I
find PL/pgsql awkward - even compared to Tcl. (After all, I'm just a
programmer...  we do tend to be a little limited.)

The import program actually runs on the server box as a db client and
involves about 3000 lines of code (and it will certainly grow steadily as we
add compatability with more import formats). Could a process involving that
much logic run on the db server, and would there really be a benefit?

Carlo

""Jim C. Nasby"" <jim@nasby.net> wrote in message
news:20060928184538.GV34238@nasby.net...
> On Thu, Sep 28, 2006 at 01:53:22PM -0400, Carlo Stonebanks wrote:
>> > are you using the 'copy' interface?
>>
>> Straightforward inserts - the import data has to transformed, normalised
>> and
>> de-duped by the import program. I imagine the copy interface is for more
>> straightforward data importing. These are - buy necessity - single row
>> inserts.
>
> BTW, stuff like de-duping is something you really want the database -
> not an external program - to be doing. Think about loading the data into
> a temporary table and then working on it from there.
> --
> Jim Nasby                                            jim@nasby.net
> EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>

Re: Performace Optimization for Dummies

From

"Merlin Moncure"

Date:

28 September 2006, 17:56:04

On 9/28/06, Carlo Stonebanks <stonec.register@sympatico.ca> wrote:
> The deduplication process requires so many programmed procedures that it
> runs on the client. Most of the de-dupe lookups are not "straight" lookups,
> but calculated ones emplying fuzzy logic. This is because we cannot dictate
> the format of our input data and must deduplicate with what we get.
>
> This was one of the reasons why I went with PostgreSQL in the first place,
> because of the server-side programming options. However, I saw incredible
> performance hits when running processes on the server and I partially
> abandoned the idea (some custom-buiilt name-comparison functions still run
> on the server).

imo, the key to high performance big data movements in postgresql is
mastering sql and pl/pgsql, especially the latter.  once you get good
at it, your net time of copy+plpgsql is going to be less than
insert+tcl.

merlin

Re: Performace Optimization for Dummies

From

"Carlo Stonebanks"

Date:

28 September 2006, 18:04:42

Lots of great info here, I will see what applies to my situation. However, I
don't see bulk inserts of the tables working, because all of the tables need
to be refreshed as values to deduplicate and match will change with every
row added. In order for this to work, i would have to write queries against
the hash tables. This is where something like MySQL's in-memory tables would
have come in handy...

What is GDB?

Carlo

"Matthew Nuzum" <matthew.nuzum@canonical.com> wrote in message
news:451C18BE.8050908@canonical.com...
> Carlo Stonebanks wrote:
>>> are you using the 'copy' interface?
>>
>> Straightforward inserts - the import data has to transformed, normalised
>> and
>> de-duped by the import program. I imagine the copy interface is for more
>> straightforward data importing. These are - buy necessity - single row
>> inserts.
>>
>
> I know this is an answer to a question you didn't ask, but here it is. I
> was once doing stuff where I processed log files and had to do many
> lookups to normalize the data before insertion.
>
> I started out doing everything in SQL and using postgresql tables and it
> took a little over 24 hours to process 24 hours worth of data. Like you,
> it was single process, many lookups.
>
> I found a better way. I rewrote it (once in c#, again in python) and
> used hashtables/dictionaries instead of tables for the lookup data. For
> example, I'd start by loading the data into hash tables (yes, this took
> a *lot* of ram) then for each row I did something like:
> 1. is it in the hash table?
>   1. If not, insert it into the db
>   1. Insert it into the hashtable
> 2. Get the lookup field out of the hash table
> 3. Output normalized data
>
> This allow me to create text files containing the data in COPY format
> which can then be inserted into the database at dramatically increased
> speeds.
>
> My first version in C# (mono) cut the time down to 6 hours for 24 hours
> worth of data. I tweaked the algorithms and rewrote it in Python and got
> it down to 45 min. (Python can't take all the credit for the performance
> boost, I used an improved technique that could have been done in C# as
> well) This time included the time needed to do the copy and update the
> indexes.
>
> I created a version that also used gdb databases instead of hash tables.
> It increased the time from 45 min to a little over an hour but decreased
> the memory usage to something like 45MB (vs dozens or hundreds of MB per
> hashtable)
> --
> Matthew Nuzum
> newz2000 on freenode
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
>               http://www.postgresql.org/docs/faq
>

Re: Performace Optimization for Dummies

From

Steve Atkins

Date:

28 September 2006, 18:05:57

On Sep 28, 2006, at 12:10 PM, Carlo Stonebanks wrote:

>
>> Are you wrapping all this in a transaction?
>
> Yes, the transactions can typically wrap 1 to 10 single-table,
> single-row
> inserts and updates.
>
>
>> You're doing some dynamically generated selects as part of the
>> "de-duping" process? They're probably the expensive bit. What
>> do those queries tend to look like?
>
> Without a doubt, this is the expensive bit.

If you could give some samples of those queries here I suspect
people could be a lot more helpful with some optimisations, or
at least pinpoint where the performance issues are likely to be.

>
>> Are you analysing the table periodically? If not, then you might
>> have statistics based on an empty table, or default statistics, which
>> might cause the planner to choose bad plans for those selects.
>
> Now there's something I didn't know - I thought that analysis and
> planning
> was done with every select, and the performance benefit of prepared
> statements was to plan-once, execute many. I can easily put in a
> periodic
> analyse statement. I obviously missed how to use analyze properluy, I
> thought it was just human-readable output - do I understand
> correctly, that
> it can be used to get the SQL server to revaluate its plan based on
> newer
> statistics - even on non-prepared queries?

I think you're confusing "explain" and "analyze". "Explain" gives you
human readable output as to what the planner decided to do with the
query you give it.

"Analyze" samples the data in tables and stores the statistical
distribution
of the data, and estimates of table size and that sort of thing for the
planner to use to decide on a good query plan. You need to run
analyze when the statistics or size of a table has changed
significantly,
so as to give the planner the best chance of choosing an appropriate
plan.

If you're not running analyze occasionally then the planner will be
working on default stats or empty table stats and will tend to avoid
indexes. I don't know whether autovacuum will also analyze tables
for you automagically, but it would be a good idea to analyze the table
every so often, especially early on in the load - as the stats
gathered for
a small table will likely give painful performance once the table has
grown a lot.

>
>> Talking of which, are there indexes on the table? Normally you
>> wouldn't have indexes in place during a bulk import, but if you're
>> doing selects as part of the data load process then you'd be forcing
>> sequential scans for every query, which would explain why it gets
>> slower as the table gets bigger.
>
> There are indexes for every obvious "where this = that" clauses. I
> don't
> believe that they will work for ilike expressions.

If you're doing a lot of "where foo ilike 'bar%'" queries, with the
pattern
anchored to the left you might want to look at using a functional index
on lower(foo) and rewriting the query to look like "where lower(foo)
like
lower('bar%')".

Similarly if you have many queries where the pattern is anchored
at the right of the string then a functional index on the reverse of the
string can be useful.

Cheers,
   Steve

Re: Performace Optimization for Dummies

From

"Jim C. Nasby"

Date:

28 September 2006, 20:52:55

On Thu, Sep 28, 2006 at 02:04:21PM -0700, Steve Atkins wrote:
> I think you're confusing "explain" and "analyze". "Explain" gives you
> human readable output as to what the planner decided to do with the
> query you give it.

Don't forget about EXPLAIN ANALYZE, which is related to EXPLAIN but has
nothing to do with the ANALYZE command.

> indexes. I don't know whether autovacuum will also analyze tables
> for you automagically, but it would be a good idea to analyze the table

It does.

> >>Talking of which, are there indexes on the table? Normally you
> >>wouldn't have indexes in place during a bulk import, but if you're
> >>doing selects as part of the data load process then you'd be forcing
> >>sequential scans for every query, which would explain why it gets
> >>slower as the table gets bigger.
> >
> >There are indexes for every obvious "where this = that" clauses. I
> >don't
> >believe that they will work for ilike expressions.
>
> If you're doing a lot of "where foo ilike 'bar%'" queries, with the
> pattern
> anchored to the left you might want to look at using a functional index
> on lower(foo) and rewriting the query to look like "where lower(foo)
> like
> lower('bar%')".
>
> Similarly if you have many queries where the pattern is anchored
> at the right of the string then a functional index on the reverse of the
> string can be useful.

tsearch might prove helpful... I'm not sure how it handles substrings.

Something else to consider... databases love doing bulk operations. It
might be useful to load prospective data into a temporary table, and
then do as many operations as you can locally (ie: within the database)
on that table, hopefully eleminating as many candidate rows as possible
along the way.

I also suspect that running multiple merge processes at once would help.
Right now, your workload looks something like this:

client sends query      database is idle
client is idle          database runs query
client gets query back  database is idle

Oversimplification, but you get the point. There's a lot of time spent
waiting on each side. If the import code is running on the server, you
should probably run one import process per CPU. If it's on an external
server, 2 per CPU would probably be better (and that might be faster
than running local on the server at that point).
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: Performace Optimization for Dummies

From

Matthew Nuzum

Date:

29 September 2006, 00:08:49

Carlo Stonebanks wrote:
> Lots of great info here, I will see what applies to my situation. However, I
> don't see bulk inserts of the tables working, because all of the tables need
> to be refreshed as values to deduplicate and match will change with every
> row added. In order for this to work, i would have to write queries against
> the hash tables. This is where something like MySQL's in-memory tables would
> have come in handy...
>
> What is GDB?
>
> Carlo

Sorry, meant GDBM (disk based hash/lookup table).

With Postgres if your tables fit into RAM then they are in-memory as
long as they're actively being used.

Hashtables and GDBM, as far as I know, are only useful for key->value
lookups. However, for this they are *fast*. If you can figure out a way
to make them work I'll bet things speed up.
--
Matthew Nuzum
newz2000 on freenode

Re: Performace Optimization for Dummies

From

"Carlo Stonebanks"

Date:

29 September 2006, 01:31:05

> Something else to consider... databases love doing bulk operations. It
> might be useful to load prospective data into a temporary table, and
> then do as many operations as you can locally (ie: within the database)
> on that table, hopefully eleminating as many candidate rows as possible
> along the way.

I wish this would work... it was definitely something I considered early on
in the project. Even thinking of explaining why it won't work is giving me a
headache...

But I sure wish it would.

Re: Performace Optimization for Dummies

From

"Carlo Stonebanks"

Date:

29 September 2006, 01:37:43

> imo, the key to high performance big data movements in postgresql is
> mastering sql and pl/pgsql, especially the latter.  once you get good
> at it, your net time of copy+plpgsql is going to be less than
> insert+tcl.

If this implies bulk inserts, I'm afraid I have to consider something else.
Any data that has been imported and dedpulicated has to be placed back into
the database so that it can be available for the next imported row (there
are currently 16 tables affected, and more to come). If I was to cache all
inserts into a seperate resource, then I would have to search 32 tables -
the local pending resources, as well as the data still in the system. I am
not even mentioning that imports do not just insert rows, they could just
rows, adding their own complexity.

Re: Performace Optimization for Dummies

From

"Carlo Stonebanks"

Date:

29 September 2006, 01:47:02

> by the way, stats_command_string is a known performance killer that
> iirc was improved in 8.2. just fyi.

This is a handy fact, I will get on this right away.

> bulk load denomalized tables into scratch tables into the postgresql
> database. create indexes appropriate to the nomalization process
> remembering you can index on virtually any expression in postgresql
> (including regex substitution).

> use sql to process the data. if tables are too large to handle with
> monolithic queries, use cursors and/or functions to handle the
> conversion.  now you can keep track of progress using pl/pgsql raise
> notice command for example.

For reasons I've exlained elsewhere, the import process is not well suited
to breaking up the data into smaller segments. However, I'm interested in
what can be indexed. I am used to the idea that indexing only applies to
expressions that allows the data to be sorted, and then binary searches can
be performed on the sorted list. For example, I can see how you can create
an index to support:

where foo like 'bar%'

But is there any way to create an index expression that will help with:

where foo like '%bar%'?

I don't see it - but then again, I'm ready to be surprised!

Carlo

Re: Performace Optimization for Dummies

From

"Carlo Stonebanks"

Date:

29 September 2006, 01:49:37

> Don't forget about EXPLAIN ANALYZE, which is related to EXPLAIN but has
> nothing to do with the ANALYZE command.

Ah, hence my confusion. Thanks for the clarification... I never knew about
ANALYZE as a seperate command.

Re: Performace Optimization for Dummies

From

Tom Lane

Date:

29 September 2006, 01:49:48

"Carlo Stonebanks" <stonec.register@sympatico.ca> writes:
> But is there any way to create an index expression that will help with:
> where foo like '%bar%'?

If you are concerned about that, what you are probably really looking
for is full-text indexing.  See contrib/tsearch2 for our current best
answer to that.

            regards, tom lane

Re: Performace Optimization for Dummies

From

"Carlo Stonebanks"

Date:

29 September 2006, 02:02:03

>> indexes. I don't know whether autovacuum will also analyze tables
>> for you automagically, but it would be a good idea to analyze the table
>
> It does.

So, I have checked my log and I see an autovacuum running once every minute
on our various databases being hosted on the server - once every minute!

From what I can see, autovacuum is hitting the db's in question about once
every five minutes. Does this imply an ANALYZE is being done automatically
that would meet the requirements we are talking about here? Is there any
benefit ot explicitly performing an ANALYZE?

(Or does this go hand-in-and with turning off autovacuum...?)

Re: Performace Optimization for Dummies

From

Heikki Linnakangas

Date:

29 September 2006, 06:12:45

Carlo Stonebanks wrote:
> We urgently need a major performance improvement. We are running the
> PostgreSQL 8.1.4 on a Windows 2003 x64 Server on a dual processor,
> dual core
> 3.2Ghz Xeon box with 4gb RAM and a RAID (sorry, I don't know what
> type) disc
> subsystem. Sorry about the long intro, but here are my questions:

Others have already drilled down to the way you do the inserts and
statistics etc., so I'll just point out:

Are you fully utilizing all the 4 cores you have? Could you parallelize
the loading process, if you're currently running just one client? Are
you I/O bound or CPU bound?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Re: Performace Optimization for Dummies

From

Simon Riggs

Date:

29 September 2006, 07:58:44

On Thu, 2006-09-28 at 12:44 -0400, Carlo Stonebanks wrote:

> At this early stage in the project, we are initializing our portal's
> database with millions of rows of imported data in over 50 different
> flattened tables; each table's structure is unique to the data provider.
> This requires a pretty complex import program, because the data must be
> matched semantically, not literally. Even with all of the expression
> matching and fuzzy logic in the code,our performance statistics show that
> the program spends over 75% of its time in SQL queries looking for matching
> and/or duplicate data.

My experience with that type of load process is that doing this
row-by-row is a very expensive approach and your results bear that out.

It is often better to write each step as an SQL statement that operates
on a set of rows at one time. The lookup operations become merge joins
rather than individual SQL Selects via an index, so increase the
efficiency of the lookup process by using bulk optimisations and
completely avoiding any program/server call traffic. Data can move from
step to step by using Insert Selects into temporary tables, as Jim has
already suggested.

The SQL set approach is different to the idea of simply moving the code
server-side by dropping it in a function. That helps with the net
traffic but has other issues also. You don't need to use esoteric
in-memory thingies if you use the more efficient join types already
available when you do set based operations (i.e. join all rows at once
in one big SQL statement).

You can also improve performance by ordering your checks so that the
ones most likely to fail happen first.

Trying to achieve a high level of data quality in one large project is
not often possible. Focus on the most critical areas of checking and get
that working first with acceptable performance, then layer on additional
checks while tuning. The complexity of the load programs you have also
means they are susceptible to introducing data quality problems rather
than removing them, so an incremental approach will also aid debugging
of the load suite. Dynamic SQL programs are particularly susceptible to
this kind of bug because you can't eyeball the code.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com

Re: Performace Optimization for Dummies

From

Markus Schaber

Date:

29 September 2006, 09:20:22

Hi, Carlo,

Carlo Stonebanks wrote:

> We urgently need a major performance improvement.

Did you think about putting the whole data into PostgreSQL using COPY in
a nearly unprocessed manner, index it properly, and then use SQL and
stored functions to transform the data inside the database to the
desired result?

We're using this way for some 3rd-party databases we have to process
in-house.

HTH,
Markus

--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org

Re: Performace Optimization for Dummies

From

Markus Schaber

Date:

29 September 2006, 09:25:44

Hi, Carlo,

Carlo Stonebanks wrote:

> From what I can see, autovacuum is hitting the db's in question about once
> every five minutes. Does this imply an ANALYZE is being done automatically
> that would meet the requirements we are talking about here? Is there any
> benefit ot explicitly performing an ANALYZE?

Autovacuum looks at the modification statistics (they count how much
modifications happened on the table), and decides whether it's time to
VACUUM (reclaim empty space) and ANALYZE (update column value
distributions) the table.

The exact thresholds for Autovacuum to kick in are configurable, see the
docs.

HTH,
Markus

--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org

Re: Performace Optimization for Dummies

From

Bill Moran

Date:

29 September 2006, 09:58:10

In response to "Carlo Stonebanks" <stonec.register@sympatico.ca>:

> >> indexes. I don't know whether autovacuum will also analyze tables
> >> for you automagically, but it would be a good idea to analyze the table
> >
> > It does.
>
> So, I have checked my log and I see an autovacuum running once every minute
> on our various databases being hosted on the server - once every minute!
>
> From what I can see, autovacuum is hitting the db's in question about once
> every five minutes. Does this imply an ANALYZE is being done automatically
> that would meet the requirements we are talking about here? Is there any
> benefit ot explicitly performing an ANALYZE?
>
> (Or does this go hand-in-and with turning off autovacuum...?)

It's only checking to see if vacuum/analyze needs done every 5 minutes.
It may or may not do any actual work at that time, based on how much
the tables have changed.  See:
http://www.postgresql.org/docs/8.1/interactive/maintenance.html#AUTOVACUUM

This is a case, during your bulk loads, where autovacuum might actually
hurt you.  How many records are you inserting/updating in 5 minutes?
You may be exceeding autovacuum's ability to keep things clean.

I can't say for sure, but I would suspect that you'd be better off not
using autovacuum until after the initial data loads are done.  My
guess is that you'll get better performance if you disable autovac and
write manual vacuum/analyze into your load scripts.  Exactly how often
to have your script do it is something that will require testing to
figure out, but probably starting with every 100 or so, then adjust
it up and down and see what works best.

Explicitly performing a vacuum or analyze can be very beneficial,
especially if you know what kind of changes your creating in the data.
(Now that I think of it, there's no reason to disable autovac, as it
will notice if you've just manually vacuumed a table and not do it
again.)  If you know that you're radically changing the kind of data
in a table, manually running analyze is a good idea.  If you know that
you're creating a lot of dead tuples, manually vacuuming is a good
idea.  Especially during a big data load where these changes might be
taking place faster than autovac notices.

--
Bill Moran
Collaborative Fusion Inc.

Re: Performace Optimization for Dummies

From

"Merlin Moncure"

Date:

29 September 2006, 11:39:30

On 9/29/06, Carlo Stonebanks <stonec.register@sympatico.ca> wrote:
> For reasons I've exlained elsewhere, the import process is not well suited
> to breaking up the data into smaller segments. However, I'm interested in
> what can be indexed. I am used to the idea that indexing only applies to
> expressions that allows the data to be sorted, and then binary searches can
> be performed on the sorted list. For example, I can see how you can create
> an index to support:
>
> where foo like 'bar%'
>
> But is there any way to create an index expression that will help with:
>
> where foo like '%bar%'?
>
> I don't see it - but then again, I'm ready to be surprised!

using standard (btree) index, you can create an index on any constant
expression.  so, you  can create in index that matches '%bar%, but if
you also want to match '%bat%', you need another index.  there are
other exotic methods like t_search and gist approach which may or may
not be suitable.

regarding your import process, you came to this list and asked for
advice on how to fix your particular problem.  tweaking
postgresql.conf, etc will get you incremental gains but are unlikely
to have a huge impact.  as i understand it, your best shot at
improvement using current process is to:
1. fork your import somhow to get all 4 cores running
2. write the code that actually does the insert in C and use the
parameterized prepared statement.

however, your general approach has been 'please give me advice, but
only the advice that i want'.  if you really want to fix your problem,
give more specific details about your import and open the door to
improvements in your methodology which i suspect is not optimal.  you
concluded that client side coding was the way to go, but here you are
asking how to make it work.  if you want help (and there are some
extremely smart people here who may give you world class advice for
free), you need to lay your cards on the table and be willing to
consider alternative solutions.   you may find that a few properly
consructed queries will knock out 75% of your code and running time.

remember often the key to optimization is choosing the right algorithm

merlin

Re: Performace Optimization for Dummies

From

"Carlo Stonebanks"

Date:

02 October 2006, 23:44:06

I have loaded three of the four cores by running three different versions of
the import program to import three different segments of the table to
import. The server jumps to 75% usage, with three postgresql processes
eating up 25% each., the actual client itself taking up just a few ticks.

"Heikki Linnakangas" <heikki@enterprisedb.com> wrote in message
news:451CE38B.7070107@enterprisedb.com...
> Carlo Stonebanks wrote:
>> We urgently need a major performance improvement. We are running the
>> PostgreSQL 8.1.4 on a Windows 2003 x64 Server on a dual processor, dual
>> core
>> 3.2Ghz Xeon box with 4gb RAM and a RAID (sorry, I don't know what type)
>> disc
>> subsystem. Sorry about the long intro, but here are my questions:
>
> Others have already drilled down to the way you do the inserts and
> statistics etc., so I'll just point out:
>
> Are you fully utilizing all the 4 cores you have? Could you parallelize
> the loading process, if you're currently running just one client? Are you
> I/O bound or CPU bound?
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster
>

Re: Performace Optimization for Dummies

From

"Carlo Stonebanks"

Date:

03 October 2006, 00:03:53

> Did you think about putting the whole data into PostgreSQL using COPY in
> a nearly unprocessed manner, index it properly, and then use SQL and
> stored functions to transform the data inside the database to the
> desired result?

This is actually what we are doing. The slowness is on the row-by-row
transformation. Every row reqauires that all the inserts and updates of the
pvious row be committed - that's why we have problems figuring out how to
use this using SQL set logic.

Carlo

Re: Performace Optimization for Dummies

From

"Carlo Stonebanks"

Date:

03 October 2006, 00:04:17

> My experience with that type of load process is that doing this
> row-by-row is a very expensive approach and your results bear that out.

I expected this, and had warned the client before the project started that
this is exactly where SQL underperforms.

> It is often better to write each step as an SQL statement that operates
> on a set of rows at one time.

The problem with this approach is that every row of data is dependent on the
previous row's data being validated and imported. e.g.

Import Row 1:
John Q Smith
Foobar Corp
123 Main St,
Bigtown, MD 12345-6789

Import Row 2:
John Quincy Smith
FuzzyLoginc Inc
123 Main St, Suite 301
Bigtown, MD 12345-6789

Import Row 3:
Bobby Jones
Foobar Corp
123 Main Strett Suite 300,
Bigtown, MD 12345

Every row must be imported into the table so that the next row may see the
data and consider it when assigning ID's to the name, company and address.
(all data must be normalised) How can this be done using set logic?

> You can also improve performance by ordering your checks so that the
> ones most likely to fail happen first.

Already done - I believe the problem is definitely in the navigational
access model. What I am doing now makes perfect sense as far as the logic of
the process goes - any other developer will read it and understand what is
going on. At 3000 lines of code, this will be tedious, but understandable.
But SQL hates it.

> Trying to achieve a high level of data quality in one large project is
> not often possible. Focus on the most critical areas of checking and get
> that working first with acceptable performance, then layer on additional
> checks while tuning. The complexity of the load programs you have also
> means they are susceptible to introducing data quality problems rather
> than removing them, so an incremental approach will also aid debugging
> of the load suite.

I couldn't agree more.

Carlo

Re: Performace Optimization for Dummies

From

"Carlo Stonebanks"

Date:

03 October 2006, 00:05:30

> 1. fork your import somhow to get all 4 cores running

This is already happening, albeit only 3. No improvement - it appears we
have taken the same problem, and divided it by 3. Same projected completion
time. this is really curious, to say the least.

> 2. write the code that actually does the insert in C and use the
> parameterized prepared statement.

I had already tried the paremetrised prepare statement; I had mentioned that
I was surprised that it had no effect. No one here seemed surprised, or at
least didn't think of commenting on it.

> however, your general approach has been 'please give me advice, but
> only the advice that i want'.

I'm sorry I don't understand - I had actually originally come asking four
questions asking for recommendations and opinions on hardware, O/S and
commercial support. I did also ask for comments on my config setup.

Re: Performace Optimization for Dummies

From

Markus Schaber

Date:

03 October 2006, 05:15:41

Hi, Carlo,

Carlo Stonebanks wrote:
>> Did you think about putting the whole data into PostgreSQL using COPY in
>> a nearly unprocessed manner, index it properly, and then use SQL and
>> stored functions to transform the data inside the database to the
>> desired result?
>
> This is actually what we are doing. The slowness is on the row-by-row
> transformation. Every row reqauires that all the inserts and updates of the
> pvious row be committed - that's why we have problems figuring out how to
> use this using SQL set logic.

Maybe "group by", "order by", "distinct on" and hand-written functions
and aggregates (like first() or best()) may help.

You could combine all relevant columns into an user-defined compund
type, then group by entity, and have a self-defined aggregate generate
the accumulated tuple for each entity.

Markus
--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org

Re: Performace Optimization for Dummies

From

Markus Schaber

Date:

03 October 2006, 05:20:58

Hi, Carlo,

Carlo Stonebanks wrote:

>> Trying to achieve a high level of data quality in one large project is
>> not often possible. Focus on the most critical areas of checking and get
>> that working first with acceptable performance, then layer on additional
>> checks while tuning. The complexity of the load programs you have also
>> means they are susceptible to introducing data quality problems rather
>> than removing them, so an incremental approach will also aid debugging
>> of the load suite.
>
> I couldn't agree more.

I still think that using a PL in the backend might be more performant
than having an external client, alone being the SPI interface more
efficient compared to the network serialization for external applications.

HTH,
Markus

--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org

Re: Performace Optimization for Dummies

From

"Carlo Stonebanks"

Date:

03 October 2006, 06:32:52

> I still think that using a PL in the backend might be more performant
> than having an external client, alone being the SPI interface more
> efficient compared to the network serialization for external applications.

I would actually love for this to work better, as this is technology that I
would like to develop in general - I see db servers with strong server-side
programming languages as being able to operate as application servers, with
the enterprises business logic centralised on the server.

The import routine that I wrote will actually work on the server as well -
it will detect the presence of the spi_ calls, and replace the pg_* calls
with spi_* calls. So, you see this WAS my intention.

However, the last time I tried to run something that complex from the db
server, it ran quite slowly compared to from a client. This may have had
something to do with the client that I used to call the stored procedure - I
thought that perhaps the client created an implicit transaction around my
SQL statement to allow a rollback, and all of the updates and inserts got
backed up in a massive transaction queue that took forever to commit.

Carlo

Re: Performace Optimization for Dummies

From

"Carlo Stonebanks"

Date:

03 October 2006, 06:36:49

> Maybe "group by", "order by", "distinct on" and hand-written functions
> and aggregates (like first() or best()) may help.

We use these - we have lexical analysis functions which assign a rating to
each row in a set, and the likelyhood that the data is a match, and then we
sort our results.

I thought this would be the cause of the slowdowns - and it is, but a very
small part of it. I have identified the problem code, and the problems are
within some very simple joins. I have posted the code under a related topic
header. I obviously have a few things to learn about optimising SQL joins.

Carlo

>
> You could combine all relevant columns into an user-defined compund
> type, then group by entity, and have a self-defined aggregate generate
> the accumulated tuple for each entity.
>
> Markus
> --
> Markus Schaber | Logical Tracking&Tracing International AG
> Dipl. Inf.     | Software Development GIS
>
> Fight against software patents in Europe! www.ffii.org
> www.nosoftwarepatents.org
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>