Thread: Optimizing data layout for reporting in postgres

Optimizing data layout for reporting in postgres

From: Doug El

Hi,

I have to summarize some data to be queried and there are about 5 million raw records a day I need to summarize. In a nutshell, I don't think I'm laying it out in an optimal fashion, or perhaps I'm not taking advantage of some postgres features; I'm looking for feedback.

The raw incoming data is in the form of

ip string uint uint uint uint

So for any given record say:

8.8.8.8 helloworld 1 2 3 4

First, I need to be able to query how many total and how many unique requests there were (unique by ip) over a given timeframe.

So for the data below, on the same day, that's two total but one unique:

8.8.8.8 helloworld 1 2 3 4
8.8.8.8 helloworld 1 2 3 4
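
(In SQL terms, the kind of question I need to answer there is roughly the following; raw_requests, ts and the column names are just stand-ins for illustration.)

-- Sketch only: assumes the raw feed loaded into something like
--   raw_requests(ts timestamp, ip inet, a text, b int, c int, d int, e int)
SELECT count(*)           AS total,
       count(DISTINCT ip) AS unique_ips
  FROM raw_requests
 WHERE ts >= '2009-12-24'
   AND ts <  '2009-12-25';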

Further, for all fields (except ip, which is not stored) I need to be able to query and get total/unique counts based on any combination of criteria.

So if I refer to them as columns A-E

A        B        C        D        E
string     uint     uint     uint     uint

I need to be able to say how many there are where col A = 'helloworld' and col C = 4,
or perhaps col E = 4 and col C < 3, etc., any combination.

The only way I could see to do this was to take the 5 million daily raw records, sort them, then summarize that list with total and unique counts like so:

A        B            C        D        E        F        G        H
date    stringid     uint     uint     uint     uint    total    unique

Primary key is A-F (date stringid uint uint uint uint)

This gives me a summary of about 900k records a day from the 4 million raw.
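
In SQL terms the daily build is roughly the following (a sketch; the table and column names are illustrative stand-ins, and I use total_count/unique_count since "unique" is a reserved word):

-- Sketch of the summary table:
CREATE TABLE summary (
    day          date    NOT NULL,
    string_id    integer NOT NULL,
    b            integer NOT NULL,
    c            integer NOT NULL,
    d            integer NOT NULL,
    e            integer NOT NULL,
    total_count  bigint  NOT NULL,
    unique_count bigint  NOT NULL,
    PRIMARY KEY (day, string_id, b, c, d, e)
);

-- Rebuilt once a day from the previous day's raw records;
-- assumes the string -> id lookup table is already populated.
INSERT INTO summary
SELECT '2009-12-23'::date, s.id, r.b, r.c, r.d, r.e,
       count(*), count(DISTINCT r.ip)
  FROM raw_requests r
  JOIN strings s ON s.string = r.a
 WHERE r.ts >= '2009-12-23' AND r.ts < '2009-12-24'
 GROUP BY s.id, r.b, r.c, r.d, r.e;

-- Combination queries then run against the much smaller summary, e.g.:
SELECT sum(total_count) AS total
  FROM summary
 WHERE day = '2009-12-23' AND string_id = 42 AND c = 4;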

I have things organized with monthly tables and yearly schemas. The string column also has its own monthly lookup table, so there's just a string id that's looked up.
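
(Again as a sketch, with made-up names, the lookup piece looks something like this:)

-- Monthly string lookup table:
CREATE TABLE strings_2009_12 (
    id     serial PRIMARY KEY,
    string text   NOT NULL UNIQUE
);

-- Summary rows carry only the id; the text is joined back in when needed:
SELECT sum(m.total_count) AS total
  FROM summary_2009_12 m
  JOIN strings_2009_12 s ON s.id = m.string_id
 WHERE s.string = 'helloworld';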

The database, however, is still quite huge and grows very fast, and even simple daily queries are fairly slow, even on a fast server. I have a few indexes on what I know are common columns queried against, but again, any combination of data can be queried, and indexes do increase db size of course.

I feel like there's got to be some better way to organize this data and make it searchable. Overall speed is more important than disk space usage for this application.

Perhaps there are some native features in postgres I'm not taking advantage of here that would tip the scales in my favor. I've done a fair amount of research on the configuration file settings and feel like I have a fairly optimized config as far as that goes, and have done the things mentioned here:
http://wiki.postgresql.org/wiki/SlowQueryQuestions

Very much appreciate any suggestions, thank you in advance.

Doug





Re: Optimizing data layout for reporting in postgres

From: Scott Marlowe

On Thu, Dec 24, 2009 at 1:01 PM, Doug El <doug_el@yahoo.com> wrote:
> Hi,
>
> I have to summarize some data to be queried and there are about 5 million raw records a day I need to summarize. In a nutshell, I don't think I'm laying it out in an optimal fashion, or perhaps I'm not taking advantage of some postgres features; I'm looking for feedback.
>
> The raw incoming data is in the form of
>
> ip string uint uint uint uint
>
> So for any given record say:
>
> 8.8.8.8 helloworld 1 2 3 4
>
> First, I need to be able to query how many total and how many unique requests there were (unique by ip) over a given timeframe.
>
> So for the data below, on the same day, that's two total but one unique:
>
> 8.8.8.8 helloworld 1 2 3 4
> 8.8.8.8 helloworld 1 2 3 4
>
> Further, for all fields (except ip, which is not stored) I need to be able to query and get total/unique counts based on any combination of criteria.
>
> So if I refer to them as columns A-E
>
> A               B               C               D               E
> string  uint    uint    uint    uint
>
> I need to be able to say how many there are where col A = 'helloworld' and col C = 4,
> or perhaps col E = 4 and col C < 3, etc., any combination.
>
> The only way I could see to do this was to take the 5 million daily raw records, sort them, then summarize that list with total and unique counts like so:
>
> A        B            C        D        E        F        G        H
> date    stringid     uint     uint     uint     uint    total    unique
>
> Primary key is A-F (date stringid uint uint uint uint)
>
> This gives me a summary of about 900k records a day from the 4 million raw.
>
> I have things organized with monthly tables and yearly schemas. The string column also has its own monthly lookup table, so there's just a string id that's looked up.
>
> The database, however, is still quite huge and grows very fast, and even simple daily queries are fairly slow, even on a fast server. I have a few indexes on what I know are common columns queried against, but again, any combination of data can be queried, and indexes do increase db size of course.
>
> I feel like there's got to be some better way to organize this data and make it searchable. Overall speed is more important than disk space usage for this application.
>
> Perhaps there are some native features in postgres I'm not taking advantage of here that would tip the scales in my favor. I've done a fair amount of research on the configuration file settings and feel like I have a fairly optimized config as far as that goes, and have done the things mentioned here:
> http://wiki.postgresql.org/wiki/SlowQueryQuestions
>
> Very much appreciate any suggestions, thank you in advance.

We run a nightly cron job that creates all the summary tables etc at
midnight.  On a fast machine it takes about 1 to 2 hours to run, but
makes the queries run during the day go from 10 or 20 seconds to a few
hundred milliseconds.

You might want to look into table partitioning and also materialized
views.  There's a great tutorial on how to roll your own at:

http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized_Views
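
One way to roll your own boils down to a plain table plus a refresh function you call from cron; very roughly (a sketch, with names you'd adapt to your schema):

-- A "materialized view" here is just an ordinary table you refresh yourself.
CREATE TABLE mv_daily_totals AS
  SELECT day, string_id, sum(total_count) AS total
    FROM summary
   GROUP BY day, string_id;

CREATE OR REPLACE FUNCTION refresh_mv_daily_totals() RETURNS void AS $$
BEGIN
    -- Throw away the old snapshot and rebuild it inside one transaction.
    TRUNCATE mv_daily_totals;
    INSERT INTO mv_daily_totals
        SELECT day, string_id, sum(total_count)
          FROM summary
         GROUP BY day, string_id;
END;
$$ LANGUAGE plpgsql;

-- Run from cron just after midnight, e.g.:
--   psql -c "SELECT refresh_mv_daily_totals();" yourdb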

Re: Optimizing data layout for reporting in postgres

From: Alban Hertroys

On 24 Dec 2009, at 21:01, Doug El wrote:

> Hi,
>
> I have to summarize some data to be queried and there are about 5 million raw records a day I need to summarize. In a nutshell, I don't think I'm laying it out in an optimal fashion, or perhaps I'm not taking advantage of some postgres features; I'm looking for feedback.
>
> The raw incoming data is in the form of
>
> ip string uint uint uint uint
>
> So for any given record say:
>
> 8.8.8.8 helloworld 1 2 3 4
>
> First, I need to be able to query how many total and how many unique requests there were (unique by ip) over a given timeframe.
>
> So for the data below, on the same day, that's two total but one unique:
>
> 8.8.8.8 helloworld 1 2 3 4
> 8.8.8.8 helloworld 1 2 3 4
>
> Further, for all fields (except ip, which is not stored) I need to be able to query and get total/unique counts based on any combination of criteria.
>
> So if I refer to them as columns A-E
>
> A        B        C        D        E
> string     uint     uint     uint     uint

That's going to be a sequential scan no matter how you write it - even if you follow Scott's advice, albeit the scan happens at a more convenient time and only once. To optimise this you first minimise the number of scans you need to do to get the desired results and next minimise the amount of work that the database needs to perform per row.

I think a good start would be:
SELECT COUNT(*) AS total, COUNT(DISTINCT A||B::text||C::text||D::text||E::text)
  FROM table;

Be wary of NULL values as those will make the result of the concatenation NULL as well. Coalesce() is your friend if that happens.
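
For instance (assuming any of the columns could be NULL and that '-' never occurs in the real data):

-- Coalesce each possibly-NULL column to a placeholder before concatenating:
SELECT count(*) AS total,
       count(DISTINCT coalesce(A, '-') || coalesce(B::text, '-')
                   || coalesce(C::text, '-') || coalesce(D::text, '-')
                   || coalesce(E::text, '-'))
  FROM table;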

To further speed up the query you could pre-generate the concatenation of those columns, either when you insert the data or with a nightly cron-job. In the latter case make sure it runs after all the data has arrived and before anyone queries that column, or there will be NULLs in it.
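
Something along these lines, for example (a sketch, using "tbl" as a stand-in for your table):

-- A pre-built concatenation column, filled in by the nightly job:
ALTER TABLE tbl ADD COLUMN combined text;

UPDATE tbl
   SET combined = coalesce(A, '-') || coalesce(B::text, '-')
               || coalesce(C::text, '-') || coalesce(D::text, '-')
               || coalesce(E::text, '-')
 WHERE combined IS NULL;

-- The unique count then avoids redoing the concatenation on every query:
SELECT count(*) AS total, count(DISTINCT combined) FROM tbl;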

> I need to be able to say how many there are where col A = 'helloworld' and col C = 4,
> or perhaps col E = 4 and col C < 3, etc., any combination.

I don't see what the problem is here? Aren't those just queries like:
SELECT COUNT(*) FROM table WHERE A = 'helloworld' AND C = 4;
SELECT COUNT(*) FROM table WHERE E = 4 AND C < 3;

If you know beforehand which conditions you want to query you can do them all in one go - it will result in a sequential scan though:

SELECT SUM((A = 'helloworld' AND C = 4)::int), SUM((E = 4 AND C < 3)::int), ...
  FROM table;

This makes use of the fact that a boolean result cast to int results in 0 for False and 1 for True respectively.

If the conditions you want to summarise are fixed (never change) then you could store their respective values in a separate column using a bit-field or something of the kind, or in multiple columns with descriptive names.
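
For example, with made-up names for two fixed conditions (a sketch; "tbl" again stands in for your table):

-- Store each fixed condition's outcome once, when the data is loaded:
ALTER TABLE tbl ADD COLUMN is_helloworld_c4 boolean;
ALTER TABLE tbl ADD COLUMN is_e4_c_lt3      boolean;

UPDATE tbl
   SET is_helloworld_c4 = (A = 'helloworld' AND C = 4),
       is_e4_c_lt3      = (E = 4 AND C < 3);

-- Summarising is then just a sum over booleans cast to int:
SELECT sum(is_helloworld_c4::int) AS helloworld_c4,
       sum(is_e4_c_lt3::int)      AS e4_c_lt3
  FROM tbl;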

> The database, however, is still quite huge and grows very fast, and even simple daily queries are fairly slow, even on a fast server. I have a few indexes on what I know are common columns queried against, but again, any combination of data can be queried, and indexes do increase db size of course.

Indexes don't help for queries where a sequential scan is used, so you don't need to worry about intricate index designs for such queries. They certainly do help (a lot!) for queries that query a minority of the total rows.

> I feel like there's got to be some better way to organize this data and make it searchable. Overall speed is more important than disk space usage for this application.

Anything that you can calculate at a convenient time before you need to query it is a win to store, as long as querying the results isn't more work than querying the original data.

I've used cron jobs and triggers for such cases both successfully and unsuccessfully - the risk here is that the derived data has to match the data it was derived from, or you end up with wrong results. At one point (at a previous job) we were creating a materialised view from data from about ten tables using triggers, but we noticed the results didn't always match what we expected. Figuring out where it went wrong took more time than we could afford, so we ended up using a nightly cron-job instead.
I suppose I mean to say: use triggers to pre-calculate data for simple cases, but prefer cron jobs for the complicated ones. Debugging complicated triggers can be time-consuming.
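
As a trivial sketch of the simple-trigger case (names made up; this just keeps a per-day row count up to date as rows are inserted):

CREATE TABLE daily_counts (
    day   date   PRIMARY KEY,
    total bigint NOT NULL DEFAULT 0
);

CREATE OR REPLACE FUNCTION bump_daily_count() RETURNS trigger AS $$
BEGIN
    UPDATE daily_counts SET total = total + 1 WHERE day = current_date;
    IF NOT FOUND THEN
        -- Ignoring the usual concurrent-insert race for the sake of brevity.
        INSERT INTO daily_counts (day, total) VALUES (current_date, 1);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER raw_insert_count
    AFTER INSERT ON tbl
    FOR EACH ROW EXECUTE PROCEDURE bump_daily_count();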

Regards,
Alban Hertroys

--
Screwing up is the best way to attach something to the ceiling.

