general questions postgresql performance config - Mailing list pgsql-general

From Dino Vliet
Subject general questions postgresql performance config
Msg-id 345416.55275.qm@web51102.mail.re2.yahoo.com
List pgsql-general
Dear postgresql people,
 
Introduction
Today I've been given the go-ahead to proceed with my plan to use PostgreSQL and other open source techniques to demonstrate to the management of my department the usefulness and the cost-saving potential that lies ahead. You can guess how excited I am right now. However, I need to plan and execute at the highest level because I really want to show results. I work in the financial services industry.
 
Context of the problem
Given 25 million input records, transform them and load 10 million records into a single-table DB2 database that already contains 120 million records (the whole history).
 
The current process runs on the MVS mainframe, with the SAS system doing the record processing (ETL-like operations). The records of the last two months (20 million records) are also stored in a single SAS dataset, which users can access through SAS running on their Windows PCs. From their PCs they can also analyse the historical records in the DB2 table on the mainframe.
 
These users are not tech-savvy, so this access method is not very productive for them, but because the data is highly valued they use it without complaining too much.
 
Currently it takes 5 to 6 hours before everything is finished.
 
Proof of concept
I want to showcase that a solution process like:
 
input --> Talend/Pentaho Kettle for ETL --> PostgreSQL --> Pentaho Report Designer, is feasible while staying within the current 5-6 hour processing and loading window.
 
Input: flat files, position-based
ETL: Pentaho Kettle or Talend to process these files
DBMS: PostgreSQL 8 (on Debian, openSUSE, or FreeBSD)
Reporting: Pentaho Report Wizard
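 
To make the load step concrete, here is a minimal sketch of what I picture on the PostgreSQL side; the file path and table name are placeholders I made up:
 
    -- staging table that takes each fixed-width line verbatim
    CREATE TABLE staging_raw (line text);
 
    -- COPY is the bulk-load fast path; the file must be readable by the server
    COPY staging_raw FROM '/data/input/monthly_extract.txt';
 
From what I've read, loading into a table without indexes and creating the indexes afterwards should be considerably faster than loading into an already indexed table.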
 
Hardware
  • AMD AM2 single-core CPU with 4GB RAM
  • Two mirrored SATA II disks (RAID 1)
 
Now that I have introduced my situation, I hope this list can give me some tips, advice, examples, and pitfalls regarding these requirements.
 
Questions
1) Although this is not exactly rocket science, the sheer volume of the data makes it a hard task. Do you think my "solution" is viable/achievable?
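 
As a rough sanity check on the volume: 10 million rows loaded in 5 hours works out to 10,000,000 / 18,000 s ≈ 560 rows per second on average, and reading 25 million input records in the same window is roughly 1,400 records per second. Both figures seem well within reach of COPY-based bulk loading, even on modest hardware.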
 
2) What kind of OS would you choose for the setup I have proposed? I prefer FreeBSD with UFS2 as the filesystem, but I guess Debian, openSUSE, or Ubuntu Server with ext3 would all be very good candidates too?
 
3) Would you opt for the ETL tools I mentioned (Pentaho and Talend) or just rely on unix/linux tools like gawk, sed, and perl? I'm familiar with gawk. The ETL tools require Java, so I would have to configure PostgreSQL not to use all the available RAM, otherwise I risk Java out-of-memory errors. Given that, it might be best to first configure the server for the ETL processing and afterwards reconfigure it for database usage.
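 
If the Java tools turn out to be too heavy, one alternative I'm considering is doing the transform inside PostgreSQL itself, cutting the position-based fields with substr(). The target columns and byte positions below are invented purely to illustrate the idea:
 
    -- transform the staged raw lines into the target table (illustrative layout)
    INSERT INTO records_history (account_id, booking_date, amount)
    SELECT substr(line, 1, 10),
           to_date(substr(line, 11, 8), 'YYYYMMDD'),
           trim(substr(line, 19, 12))::numeric
    FROM staging_raw;
 
That would keep the heavy lifting in one process and sidestep the Java memory question entirely.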
 
4) What values would you advise for the various postgresql.conf settings that impact performance, like shared_buffers, temp_buffers, work_mem (called sort_mem before 8.0), etc.? Or is this more of an "art" where I change a value, restart the db server, analyse the queries, and iterate until I find optimal values?
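 
For what it's worth, these are the kinds of starting values I have seen suggested for a dedicated 4GB machine on a recent 8.x release (the unit suffixes need 8.2 or later); I treat them as assumptions to measure against, not as tuned numbers:
 
    # postgresql.conf sketch for 4GB RAM and a bulk-load-heavy workload
    shared_buffers = 1GB           # ~25% of RAM is a common rule of thumb
    work_mem = 32MB                # applies per sort/hash, so keep it modest
    maintenance_work_mem = 512MB   # helps CREATE INDEX after the bulk load
    checkpoint_segments = 32       # fewer, larger checkpoints during the load
    effective_cache_size = 3GB     # rough estimate of what the OS will cache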
 
5) Other considerations?
 
Thanks in advance,
 
Dino
 
