Re: Optimizing data layout for reporting in postgres - Mailing list pgsql-general

From: Scott Marlowe
Subject: Re: Optimizing data layout for reporting in postgres
Msg-id: dcc563d10912241453g55040de3sc370c8b2c50bc3a1@mail.gmail.com
In response to: Optimizing data layout for reporting in postgres (Doug El <doug_el@yahoo.com>)
List: pgsql-general
On Thu, Dec 24, 2009 at 1:01 PM, Doug El <doug_el@yahoo.com> wrote:
> Hi,
>
> I have to summarize some data to be queried, and there are about 5 million
> raw records a day I need to summarize. In a nutshell, I don't think I'm
> laying it out in an optimal fashion, or perhaps I'm not taking advantage of
> some Postgres features; I'm looking for feedback.
>
> The raw incoming data is in the form of
>
> ip string uint uint uint uint
>
> So for any given record say:
>
> 8.8.8.8 helloworld 1 2 3 4
>
> First, I need to be able to query how many total and how many unique
> requests there were (unique by ip), over a given timeframe.
>
> So for the below data on the same day that's total two, but one unique
>
> 8.8.8.8 helloworld 1 2 3 4
> 8.8.8.8 helloworld 1 2 3 4
>
> Further, for all fields (except ip, which is not stored) I need to be able
> to query and get total/unique counts based on any combination of criteria.
>
> So if I refer to them as columns A-E
>
> A       B     C     D     E
> string  uint  uint  uint  uint
>
> I need to be able to ask how many records have col A = 'helloworld' and
> col C = 4, or perhaps col E = 4 and col C < 3, etc.; any combination.
>
> The only way I could see to do this was to take the 5 million daily raw
> records, sort them, then summarize that list with total and unique counts,
> like so:
>
> A       B         C     D     E     F     G      H
> date    stringid  uint  uint  uint  uint  total  unique
>
> Primary key is A-F (date, stringid, uint, uint, uint, uint)
>
> This gives me a summary of about 900k records a day from the 4 million raw.
>
> I have things organized with monthly tables and yearly schemas. The string
> column also has its own monthly lookup table, so only a string id is stored.
>
> The database, however, is still quite large and grows very fast; even simple
> daily queries are fairly slow, even on a fast server. I have a few indexes
> on the columns I know are commonly queried, but again, any combination of
> data can be queried, and indexes do increase the database size, of course.
>
> I feel like there's got to be some better way to organize this data and
> make it searchable. Overall speed is more important than disk space usage
> for this application.
>
> Perhaps there are some native features in Postgres I'm not taking advantage
> of here that would tip the scales in my favor. I've done a fair amount of
> research on the configuration file settings and feel I have a fairly
> optimized config as far as that goes, and I have done the things mentioned
> here: http://wiki.postgresql.org/wiki/SlowQueryQuestions
>
> Very much appreciate any suggestions, thank you in advance.

We run a nightly cron job that builds all the summary tables at
midnight.  On a fast machine it takes about one to two hours to run,
but it takes the queries run during the day from 10 or 20 seconds down
to a few hundred milliseconds.
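Untested, but that kind of nightly rollup can be a single aggregate
INSERT ... SELECT.  A minimal sketch with made-up table and column
names (string_lookup() here is a stand-in for however you map strings
to ids, not a built-in):

-- Hypothetical raw table: one row per incoming request.
CREATE TABLE raw_events (
    ts  timestamptz NOT NULL,
    ip  inet        NOT NULL,
    str text        NOT NULL,
    b   integer     NOT NULL,
    c   integer     NOT NULL,
    d   integer     NOT NULL,
    e   integer     NOT NULL
);

-- Hypothetical summary table, roughly the layout described above.
CREATE TABLE daily_summary (
    day      date    NOT NULL,
    stringid integer NOT NULL,
    b        integer NOT NULL,
    c        integer NOT NULL,
    d        integer NOT NULL,
    e        integer NOT NULL,
    total    bigint  NOT NULL,
    uniques  bigint  NOT NULL,
    PRIMARY KEY (day, stringid, b, c, d, e)
);

-- Nightly rollup: row count and distinct-ip count per combination.
INSERT INTO daily_summary
SELECT ts::date           AS day,
       string_lookup(str) AS stringid,
       b, c, d, e,
       count(*)           AS total,
       count(DISTINCT ip) AS uniques
FROM   raw_events
WHERE  ts >= date 'yesterday' AND ts < date 'today'
GROUP  BY ts::date, string_lookup(str), b, c, d, e;

Daytime queries then scan the ~900k summary rows per day instead of the
raw data.  One caveat: the uniques column only answers "unique" for
that exact combination, since distinct counts don't add up across rows.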

You might want to look into table partitioning and also materialized
views.  There's a great tutorial on how to roll your own at:

http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized_Views
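On the partitioning side, a rough sketch of the inheritance-based
approach (names made up, untested):

-- Parent holds no rows itself; one child per month inherits from it.
CREATE TABLE summary (
    day      date    NOT NULL,
    stringid integer NOT NULL,
    total    bigint  NOT NULL,
    uniques  bigint  NOT NULL
);

-- The CHECK constraint lets the planner skip other months entirely
-- (with constraint_exclusion = partition in postgresql.conf).
CREATE TABLE summary_2009_12 (
    CHECK (day >= date '2009-12-01' AND day < date '2010-01-01')
) INHERITS (summary);

CREATE INDEX summary_2009_12_day_idx ON summary_2009_12 (day);

-- Query the parent; only the matching children get scanned.
SELECT sum(total) FROM summary WHERE day = date '2009-12-24';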
