Re: Optimizing data layout for reporting in postgres - Mailing list pgsql-general

From Alban Hertroys
Subject Re: Optimizing data layout for reporting in postgres
Date
Msg-id 8F451F6D-1FAE-40E1-9068-E49C7FA93C77@solfertje.student.utwente.nl
In response to Optimizing data layout for reporting in postgres  (Doug El <doug_el@yahoo.com>)
List pgsql-general
On 24 Dec 2009, at 21:01, Doug El wrote:

> Hi,
>
> I have to summarize some data to be queried and there are about 5 million raw records a day I need to summarize. In a
> nutshell I don't think I'm laying it out in an optimal fashion, or not taking advantage of some postgres features
> perhaps, I'm looking for feedback.
>
> The raw incoming data is in the form of
>
> ip string uint uint uint uint
>
> So for any given record say:
>
> 8.8.8.8 helloworld 1 2 3 4
>
> First, I need to be able to query how many total and how many unique requests there were (unique by ip), over given
> timeframe.
>
> So for the below data on the same day that's total two, but one unique
>
> 8.8.8.8 helloworld 1 2 3 4
> 8.8.8.8 helloworld 1 2 3 4
>
> Further for all fields (but ip which is not stored) I need to be able to query and get total/unique counts based off
> any combination of criteria.
>
> So if I refer to them as columns A-E
>
> A       B     C     D     E
> string  uint  uint  uint  uint

That's going to be a sequential scan no matter how you write it - even if you follow Scott's advice, albeit the scan
happens at a more convenient time and only once. To optimise this you first minimise the number of scans you need to do
to get the desired results and next minimise the amount of work that the database needs to perform per row.

I think a good start would be:
SELECT COUNT(*) AS total,
       COUNT(DISTINCT A || '|' || B::text || '|' || C::text || '|' || D::text || '|' || E::text)
  FROM table;

The '|' separators keep distinct rows such as ('x', 12, 3, ...) and ('x1', 2, 3, ...) from concatenating to the same
string and being counted as one.

Be wary of NULL values as those will make the result of the concatenation NULL as well. Coalesce() is your friend if
that happens.
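For instance, a minimal sketch with coalesce() (note this maps NULL and the empty string to the same value, which is
fine as long as you don't need to tell them apart):

SELECT COUNT(*) AS total,
       COUNT(DISTINCT coalesce(A, '')
             || '|' || coalesce(B::text, '')
             || '|' || coalesce(C::text, '')
             || '|' || coalesce(D::text, '')
             || '|' || coalesce(E::text, ''))
  FROM table;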

To further speed up the query you could pre-generate the concatenation of those columns, either when you insert the
data or with a nightly cron-job. In the latter case make sure it runs after all the data has arrived and before anyone
queries that column or there will be NULLs in it.
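A sketch of the insert-time variant with a trigger; the table name (requests) and the column, function and trigger
names are mine, your schema will differ:

ALTER TABLE requests ADD COLUMN combined text;

CREATE FUNCTION requests_fill_combined() RETURNS trigger AS $$
BEGIN
    -- Pre-compute the concatenation once, when the row arrives
    NEW.combined := coalesce(NEW.A, '') || '|' || coalesce(NEW.B::text, '')
        || '|' || coalesce(NEW.C::text, '') || '|' || coalesce(NEW.D::text, '')
        || '|' || coalesce(NEW.E::text, '');
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER requests_combined BEFORE INSERT ON requests
    FOR EACH ROW EXECUTE PROCEDURE requests_fill_combined();

The unique count is then a plain COUNT(DISTINCT combined).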

> I need to be able to say how many where col A = 'helloworld' and say col C = 4.
> Or perhaps col E = 4 and col c < 3 etc, any combination.

I don't see what the problem is here. Aren't those just queries like:
SELECT COUNT(*) FROM table WHERE A = 'helloworld' AND C = 4;
SELECT COUNT(*) FROM table WHERE E = 4 AND C < 3;

If you know beforehand which conditions you want to query you can do them all in one go - it will result in a
sequential scan though:

SELECT SUM((A = 'helloworld' AND C = 4)::int), SUM((E = 4 AND C < 3)::int), ...
  FROM table;

This makes use of the fact that a boolean cast to int yields 0 for False and 1 for True respectively.

If the conditions you want to summarise are fixed (never change) then you could store their respective values in a
separate column using a bit-field or something of the kind, or in multiple columns with descriptive names.
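For example, with a descriptive boolean column (hypothetical names, maintained at insert time or by the nightly job):

ALTER TABLE requests ADD COLUMN helloworld_c4 boolean;
UPDATE requests SET helloworld_c4 = (A = 'helloworld' AND C = 4);

-- Summarising is then a cheap aggregate over a single column
SELECT SUM(helloworld_c4::int) FROM requests;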

> The database however is still quite huge and grows very fast, even simple daily queries are fairly slow even on a
> fast server. I have a few indexes on what I know are common columns queried against but again, any combination of data
> can be queried, and indexes do increase db size of course.

Indexes don't help for queries where a sequential scan is used, so you don't need to worry about intricate index
designs for such queries. They certainly do help (a lot!) for queries that match only a minority of the total rows.
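If some of the selective conditions are known beforehand, a partial index is one way to get that benefit without
indexing everything (hypothetical names again):

CREATE INDEX requests_c_helloworld_idx ON requests (C)
    WHERE A = 'helloworld';

Such an index only contains the rows matching its WHERE clause, so it stays small and cheap to maintain.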

> I feel like there's got to be some better way to organize this data and make it searchable. Overall speed is more
> important than disk space usage for this application.

Anything that you can calculate at a convenient time before you need to query it is a win to store, as long as querying
the results isn't more work than querying the original data.
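With 5 million rows a day, the usual approach is a nightly roll-up into a summary table. A sketch, assuming a timestamp
column ts and table names of my own invention:

CREATE TABLE requests_summary (day date, A text, B int, C int, D int, E int, total bigint);

INSERT INTO requests_summary (day, A, B, C, D, E, total)
SELECT ts::date, A, B, C, D, E, COUNT(*)
  FROM requests
 WHERE ts >= current_date - 1 AND ts < current_date  -- yesterday's rows
 GROUP BY ts::date, A, B, C, D, E;

Any combination of criteria on A-E can then be answered from the much smaller summary table: SUM(total) gives the total
number of requests and COUNT(*) the number of unique combinations.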

I've used cron jobs and triggers for such cases both successfully and unsuccessfully - the risk here is that the
derived data has to match the data it was derived from or you end up with wrong results. At one point (at a previous
job) we were creating a materialised view from data from about ten tables using triggers, but we noticed the results
didn't always match what we expected. Figuring out where it went wrong took more time than we could afford so we ended
up using a nightly cron-job instead.
I suppose I mean to say: use triggers to pre-calculate data for simple cases, but prefer cron jobs for the complicated
ones. Debugging complicated triggers can be time-consuming.

Regards,
Alban Hertroys

--
Screwing up is the best way to attach something to the ceiling.

