Thread: general questions postgresql performance config

general questions postgresql performance config

From
Dino Vliet
Date:
Dear postgresql people,
 
Introduction
Today I've been given the task to proceed with my plan to use postgresql and other open source techniques to demonstrate to the management of my department the usefulness and the "cost savings" potential that lies ahead. You can guess how excited I am right now. However, I should plan and execute at the highest level because I really want to show results. I'm employed in the financial services industry.
 
Context of the problem
Given 25 million input records, transform and load 10 million of them into a single-table DB2 database that already contains 120 million records (the whole history).
 
The current process runs on the MVS mainframe, with the SAS system doing the record processing (ETL-like operations). The records of the last two months (so 20 million records) are also stored in a single SAS dataset, which users access through SAS running on their Windows PCs. From SAS on their PCs they can also analyse the historical records in the DB2 table on the mainframe.
 
These users are not tech savvy, so this access method is not very productive for them, but because the data is highly valued they use it without complaining too much.
 
Currently it takes 5 to 6 hours before everything is finished.
 
Proof of concept
I want to showcase that a solution process like:
 
input-->Talend/Pentaho Kettle for ETL-->postgresql-->pentaho report designer, is feasible while staying within the 5 to 6 hour processing and loading window.
 
Input: flat files, position based
ETL: Pentaho Kettle or Talend to process these files
DBMS: postgresql 8 (on debian, opensuse, or freebsd)
Reporting: Pentaho report wizard
 
Hardware
  • AMD AM2 singlecore CPU with 4GB RAM
  • Two mirrored SATA II disks (raid-0)
 
Now that I have introduced my situation, I hope this list can give me some tips, advice, examples, and pitfalls to watch for regarding these requirements.
 
Questions
1) Although this is not exactly rocket science, the sheer volume of the data makes it a hard task. Do you think my "solution" is viable/achievable?
 
2) What kind of OS would you choose for the setup I have proposed? I prefer FreeBSD with UFS2 as the filesystem, but I guess Debian, openSUSE, or Ubuntu server with ext3 would all be very good candidates too?
 
3) Would you opt for the ETL tools I mentioned (Pentaho and Talend) or just rely on unix/linux tools like gawk, sed, and perl? I'm familiar with gawk. The ETL tools require Java, so I would have to configure postgresql not to use all the available RAM, otherwise I risk Java out-of-memory errors. With that said, it would be best if I first configure my server for the ETL processing and afterwards configure it for database usage.
 
4) What values would you advise for the various postgresql.conf settings that can impact performance, like shared_buffers, temp_buffers, sort_mem, etc.? Or is this more of an "art" where I change a setting, restart the db server, analyze the queries, and iterate until I find optimal values?
 
5) Other considerations?
 
Thanks in advance,
 
Dino
 

Re: general questions postgresql performance config

From
Craig Ringer
Date:
On 26/01/2010 12:15 AM, Dino Vliet wrote:

> 5) Other considerations?


To get optimal performance for bulk loading you'll want to do concurrent
data loading over several connections - up to as many as you have disk
spindles. Each connection will individually be slower, but the overall
throughput will be much greater.

Just how many connections you'll want to use depends on your I/O
subsystem and to a lesser extent your CPU capacity.

Inserting data via multiple connections isn't something every data
loading tool supports, so make sure to consider this carefully.


Another consideration is how you insert the data. It's vital to insert
your data in large transaction-batched chunks (or all in one
transaction); even with synchronous_commit = off you'll still see
rather poor performance if you do each INSERT in its own transaction.
Doing your inserts as prepared statements where each INSERT is multi-row
valued will help too.
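
To illustrate the multi-row VALUES part of that in plain SQL (table and column names here are made up for illustration,
a driver would normally handle the prepared-statement side, and multi-row VALUES needs 8.2 or later):

BEGIN;
INSERT INTO history (account_id, txn_date, amount) VALUES
    (1001, '2010-01-02', 120.50),
    (1002, '2010-01-02',  75.00),
    (1003, '2010-01-03', 310.25);
-- ...more multi-row INSERTs, a few thousand rows per batch...
COMMIT;

Anything from a few thousand rows per transaction up to one transaction for the whole load avoids the per-row commit
cost described above.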

Even better is to use COPY to load large chunks of data. libpq provides
access to the COPY interface if you feel like some C coding. The JDBC
driver (dev version only so far) now provides access to the COPY API, so
you can also bulk-load via Java very efficiently now.
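
As a rough sketch (the file paths, table name and two-way split are assumptions, not something from the original post),
plain SQL COPY from server-side files, run over separate connections as suggested above, could look like:

-- Session / connection 1:
COPY history FROM '/data/chunk_01.txt';
-- Session / connection 2, started in parallel:
COPY history FROM '/data/chunk_02.txt';

Note that COPY ... FROM 'file' reads the file on the server and requires superuser rights; psql's \copy variant streams
the file from the client instead.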

If your data needs little/no transformation and is externally validated
you can look into pg_bulkload as an alternative to all this.

--
Craig Ringer

Re: general questions postgresql performance config

From
Scott Marlowe
Date:
On Mon, Jan 25, 2010 at 9:15 AM, Dino Vliet <dino_vliet@yahoo.com> wrote:
>
> Introduction
> Today I've been given the task to proceed with my plan to use postgresql and other open source techniques to
> demonstrate to the management of my department the usefulness and the "cost savings" potential that lies ahead. You
> can guess how excited I am right now. However, I should plan and execute at the highest level because I really want
> to show results. I'm employed in the financial services industry.
>
> Context of the problem
> Given 25 million input records, transform and load 10 million of them into a single-table DB2 database that already
> contains 120 million records (the whole history).

Are these rows pretty wide, or are they narrow?  It matters a lot.
120 million records of ~100 or so bytes each are gonna load a lot
quicker than 120 million records of 2,000 bytes, which will be faster
than rows of 20,000 bytes, and so on.

> The current process runs on the MVS mainframe, with the SAS system doing the record processing (ETL-like
> operations). The records of the last two months (so 20 million records) are also stored in a single SAS dataset,
> which users access through SAS running on their Windows PCs. From SAS on their PCs they can also analyse the
> historical records in the DB2 table on the mainframe.

This sounds like you're gonna want to look into partitioning your
postgresql database.  Follow the manual's advice to use triggers, not
rules, to implement it.
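
A hedged sketch of that trigger-based scheme, with made-up table and column names (this is the pattern from the
manual's partitioning chapter, one child table per month):

CREATE TABLE history (
    account_id  integer,
    txn_date    date,
    amount      numeric
);

CREATE TABLE history_2010_01 (
    CHECK (txn_date >= DATE '2010-01-01' AND txn_date < DATE '2010-02-01')
) INHERITS (history);

CREATE OR REPLACE FUNCTION history_insert_trigger() RETURNS trigger AS $$
BEGIN
    IF NEW.txn_date >= DATE '2010-01-01' AND NEW.txn_date < DATE '2010-02-01' THEN
        INSERT INTO history_2010_01 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'no partition for date %', NEW.txn_date;
    END IF;
    RETURN NULL;  -- the row has been redirected to the child table, keep it out of the parent
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER history_insert
    BEFORE INSERT ON history
    FOR EACH ROW EXECUTE PROCEDURE history_insert_trigger();

With constraint_exclusion = on the planner can skip child tables whose CHECK constraints rule them out, and for bulk
loads it's faster still to COPY straight into the correct child table and bypass the trigger entirely.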

> These users are not tech savvy, so this access method is not very productive for them, but because the data is
> highly valued they use it without complaining too much.
>
> Currently it takes 5 to 6 hours before everything is finished.

The import or the user reports?  I assume the import process.

> Proof of concept
> I want to showcase that a solution process like:
>
> input-->Talend/Pentaho Kettle for ETL-->postgresql-->pentaho report designer, is feasible while staying within the
> 5 to 6 hour processing and loading window.

Keep in mind that if a simple desktop PC can run this in 24 hours or
something like that, you can expect a server class machine with a
decent RAID array to run it in some fraction of that.

> Input: flat files, position based
> ETL: Pentaho Kettle or Talend to process these files
> DBMS: postgresql 8 (on debian, opensuse, or freebsd)
> Reporting: Pentaho report wizard

Make sure to step up to at LEAST postgresql 8.3.latest.  8.4 doesn't
have tons of performance improvements, but it does have tons of
functional improvements that may make it worth your while to go to it
as well.

> Hardware
>
> AMD AM2 singlecore CPU with 4GB RAM
> Two mirrored SATA II disks (raid-0)

So, definitely a proof of concept on a workstation-type machine.  Be
careful: if the workstation runs the import or reports in some
fraction of the time that the big machines do, it may become a server
on the spot (it's happened to me before), so consider making that
RAID-1 up there in case it does.

> Questions
> 1) Although this is not exactly rocket science, the sheer volume of the data makes it a hard task. Do you think my
> "solution" is viable/achievable?

Yes.  I've done similar on small workstation machines before and
gotten acceptable performance for reports that can run overnight.

> 2) What kind of OS would you choose for the setup I have proposed? I prefer FreeBSD with UFS2 as the filesystem, but
> I guess Debian, openSUSE, or Ubuntu server with ext3 would all be very good candidates too?

You should use the flavor of linux whose pitfalls you're most familiar
with.  They've all got warts; no need to learn new ones because some
other flavor is more popular.  OTOH, if you've got a row of cabinets
running RHEL and a few RHEL sysadmins around, you can appeal to their
vanity to get them to help tune the machine you're running.

> 3) Would you opt for the ETL tools I mentioned (Pentaho and Talend) or just rely on unix/linux tools like gawk, sed,
> and perl? I'm familiar with gawk. The ETL tools require Java, so I would have to configure postgresql not to use all
> the available RAM, otherwise I risk Java out-of-memory errors. With that said, it would be best if I first configure
> my server for the ETL processing and afterwards configure it for database usage.

I'd use unix tools myself.  gawk, sed, p(erl/ython/hp) are all great
for tossing together something that works quickly.  If you need more
flexibility then look at ETL tools later, unless you're already
familiar enough with one to spend the time getting it set up and
running.
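
If you want to keep even that step inside the database, one hedged alternative for position-based files is to COPY
each raw line into a one-column staging table and slice it with substr(); the offsets, types and table names below are
purely illustrative:

CREATE TABLE staging_raw (line text);

-- Works as long as the raw lines contain no tab or backslash characters.
COPY staging_raw FROM '/data/input_positional.txt';

-- Slice the fixed positions into typed columns (offsets are hypothetical).
INSERT INTO history (account_id, txn_date, amount)
SELECT substr(line, 1, 10)::integer,
       to_date(substr(line, 11, 8), 'YYYYMMDD'),
       substr(line, 19, 12)::numeric
FROM staging_raw;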

> 4) What values would you advise for the various postgresql.conf settings that can impact performance, like
> shared_buffers, temp_buffers, sort_mem, etc.? Or is this more of an "art" where I change a setting, restart the db
> server, analyze the queries, and iterate until I find optimal values?

Go here: http://www.westnet.com/~gsmith/content/postgresql/
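
To give a rough, hedged starting point rather than a recipe: on a 4GB machine that also has to leave room for a Java
ETL process, values along these lines are a common place to begin (note sort_mem was renamed work_mem back in 8.0),
and Greg's documents at that link explain the reasoning behind each one:

shared_buffers = 512MB          # roughly 1/8 to 1/4 of RAM; keep headroom for the JVM
work_mem = 32MB                 # per sort/hash per backend, so stay modest
maintenance_work_mem = 256MB    # speeds up index builds after the bulk load
checkpoint_segments = 32        # fewer, bigger checkpoints while loading
wal_buffers = 8MB
effective_cache_size = 2GB      # planner hint only, nothing is allocated
synchronous_commit = off        # fine for a load you can simply re-run (8.3+)

Whatever you start with, verify it against the actual load using the monitoring tools mentioned below.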

> 5) Other considerations?

Get some ideas of what hardware you can afford to throw at this if it
goes live.  An 8-core, 16-disk SAS server with 8 or 16 GB of RAM is
now in the $7000 range or below.  For that kind of money you can get a
serious performer.  Don't forget to get the best hardware RAID
controller you can afford, one that gets good performance on your OS.

Use monitoring tools like iotop, iostat, vmstat, top, and so on to get
an idea what is working your system the hardest so you'll know how to
tune it.

When you get it working, and need it to go faster, post a new message
to pgsql-performance.

Re: general questions postgresql performance config

From
Andy Colson
Date:
On 1/25/2010 8:12 PM, Craig Ringer wrote:
> On 26/01/2010 12:15 AM, Dino Vliet wrote:
>
>> 5) Other considerations?
>
>
> Even better is to use COPY to load large chunks of data. libpq provides
> access to the COPY interface if you feel like some C coding. The JDBC
> driver (dev version only so far) now provides access to the COPY API, so
> you can also bulk-load via Java very efficiently now.
>
> --
> Craig Ringer
>

I recall seeing someplace that you can avoid WAL if you start a
transaction, then truncate the table, then start a COPY.

Is that correct?  Still hold true?  Would it make a lot of difference?

(Also, small plug, perl supports the COPY API too.)

-Andy

Re: general questions postgresql performance config

From
Greg Smith
Date:
Andy Colson wrote:
> I recall seeing someplace that you can avoid WAL if you start a
> transaction, then truncate the table, then start a COPY.
>
> Is that correct?  Still hold true?  Would it make a lot of difference?

That is correct, still true, and can make a moderate amount of
difference if the WAL is really your bottleneck.  More of a tweak for
loading small to medium size things as I see it.  Once the database and
possibly its indexes get large enough, the loading time starts being
dominated by handling all that work, with its random I/O, rather than
being limited by the sequential writes to the WAL.  It's certainly a
useful optimization to take advantage of when you can, given that it's
as easy as:

BEGIN;
TRUNCATE TABLE x;
COPY x FROM ... ;
COMMIT;

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com


Re: general questions postgresql performance config

From
Jayadevan M
Date:
Hi,
Regarding Pentaho - please keep in mind that Pentaho needs a significant amount of memory. We had a lot of issues with Pentaho crashing with Java out-of-memory errors. If you are using a 64-bit machine, you may be able to give it sufficient RAM and keep it happy. If all you have is one 4 GB machine to run PostgreSQL and the ETL tool, I have my doubts. It depends on the volume of data - how many GB, rather than the number of records.

Pentaho recently added a PostgreSQL bulk loader as an experimental component; you can try that out. Talend can generate Java or Perl components and was faster than Pentaho in our case. Since the Talend community edition did not provide a shared development environment, we opted for Pentaho.

If there aren't a lot of complex 'transformations', you should be able to manage fine with shell scripts.
Jayadevan



