Optimizing data layout for reporting in postgres - Mailing list pgsql-general

From Doug El
Subject Optimizing data layout for reporting in postgres
Date
Msg-id 245275.51880.qm@web114001.mail.gq1.yahoo.com
Responses Re: Optimizing data layout for reporting in postgres  (Scott Marlowe <scott.marlowe@gmail.com>)
Re: Optimizing data layout for reporting in postgres  (Alban Hertroys <dalroi@solfertje.student.utwente.nl>)
List pgsql-general
Hi,

I have to summarize some data to be queried, and there are about 5 million raw records a day I need to summarize. In a
nutshell, I don't think I'm laying it out in an optimal fashion, or perhaps I'm not taking advantage of some Postgres
features; I'm looking for feedback.

The raw incoming data is in the form of

ip string uint uint uint uint

So for any given record say:

8.8.8.8 helloworld 1 2 3 4

First, I need to be able to query how many total and how many unique requests there were (unique by ip), over a given
timeframe.

So for the data below, all on the same day, that's a total of two, but one unique:

8.8.8.8 helloworld 1 2 3 4
8.8.8.8 helloworld 1 2 3 4
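
For concreteness, a minimal sketch of that query, assuming the day's raw rows land in a hypothetical staging table
(none of these names are from my actual schema):

-- Hypothetical staging table:
--   raw_events(ip inet, a text, b integer, c integer, d integer, e integer)
-- where a is the string field and b..e are the four uints.
SELECT count(*)           AS total_requests,
       count(DISTINCT ip) AS unique_requests
FROM raw_events;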

Further, for all fields (except ip, which is not stored) I need to be able to query and get total/unique counts based
on any combination of criteria.

So if I refer to them as columns A-E

A        B        C        D        E
string     uint     uint     uint     uint

I need to be able to ask how many there were where col A = 'helloworld' and, say, col C = 4, or perhaps where
col E = 4 and col C < 3, etc.; any combination.
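
Against the same hypothetical staging table, one such combination would look like:

SELECT count(*)           AS total_requests,
       count(DISTINCT ip) AS unique_requests
FROM raw_events
WHERE e = 4
  AND c < 3;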

The only way I could see to do this was to take the 5 million daily raw records, sort them, then summarize that list
with total and unique counts like so:

A        B            C        D        E        F        G        H
date    stringid     uint     uint     uint     uint    total    unique

Primary key is A-F (date stringid uint uint uint uint)

This gives me a summary of about 900k records a day from the 5 million raw.
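
As a sketch, that summary layout might be declared as follows (names hypothetical; the uints are shown as plain
integer since PostgreSQL has no unsigned integer type, and column H is named uniq because UNIQUE is a reserved word):

CREATE TABLE summary (
    day      date    NOT NULL,  -- A
    stringid integer NOT NULL,  -- B, id from the lookup table described below
    c        integer NOT NULL,
    d        integer NOT NULL,
    e        integer NOT NULL,
    f        integer NOT NULL,
    total    bigint  NOT NULL,  -- G
    uniq     bigint  NOT NULL,  -- H
    PRIMARY KEY (day, stringid, c, d, e, f)
);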

I have things organized with monthly tables and yearly schemas. The string column also has its own monthly lookup
table, so the summary stores just a string id that's looked up.
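
A sketch of that lookup table plus the daily rollup (again, all names are hypothetical); note that a single GROUP BY
pass produces the sorted, summarized output, so no separate sort step is needed:

CREATE TABLE string_lookup (
    id  serial PRIMARY KEY,
    str text NOT NULL UNIQUE
);

INSERT INTO summary (day, stringid, c, d, e, f, total, uniq)
SELECT current_date,
       s.id, r.b, r.c, r.d, r.e,
       count(*),
       count(DISTINCT r.ip)
FROM raw_events r
JOIN string_lookup s ON s.str = r.a
GROUP BY s.id, r.b, r.c, r.d, r.e;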

The database, however, is still quite huge and grows very fast; even simple daily queries are fairly slow, even on a
fast server. I have a few indexes on columns I know are commonly queried against, but again, any combination of data
can be queried, and indexes do increase db size of course.
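
One property worth noting here: the planner can combine several single-column btree indexes via bitmap index scans
(BitmapAnd), so per-column indexes can serve arbitrary filter combinations without needing an index per combination.
A sketch against the hypothetical summary table above (whether this beats multicolumn indexes depends on the data):

CREATE INDEX summary_c_idx ON summary (c);
CREATE INDEX summary_d_idx ON summary (d);
CREATE INDEX summary_e_idx ON summary (e);
CREATE INDEX summary_f_idx ON summary (f);
-- e.g. WHERE c < 3 AND f = 4 can use a BitmapAnd of summary_c_idx and summary_f_idx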

I feel like there's got to be some better way to organize this data and make it searchable. Overall speed is more
important than disk space usage for this application.

Perhaps there are some native features in Postgres I'm not taking advantage of here that would tip the scales in my
favor. I've done a fair amount of research on the configuration file settings and feel like I have a fairly optimized
config as far as that goes, and have done the things mentioned here:
http://wiki.postgresql.org/wiki/SlowQueryQuestions

Very much appreciate any suggestions, thank you in advance.

Doug




