Re: optimized counting of web statistics - Mailing list pgsql-performance

From Matthew Nuzum
Subject Re: optimized counting of web statistics
Date 2005-06-28
Msg-id f3c0b408050628145545699c4e@mail.gmail.com
In response to optimized counting of web statistics  (Billy extyeightysix <exty86@gmail.com>)
List pgsql-performance
On 6/28/05, Billy extyeightysix <exty86@gmail.com> wrote:
> Hola folks,
>
> I have a web statistics Pg database (user agent, urls, referrer, etc)
> that is part of an online web survey system. All of the data derived
> from analyzing web server logs is stored in one large table with each
> record representing an analyzed webserver log entry.
>
> Currently all reports are generated when the logs are being analyzed
> and before the data ever goes into the large table I mention above.
> Well, the time has come to build an interface that will allow a user
> to make ad-hoc queries against the stats and that is why I am emailing
> the performance list.

Load your data into a big table, then pre-process it into additional
tables whose data is organized for the reports you plan to run.

For example, you may want a table that holds the sum of all hits for
each site, for each hour of the day. You may want an additional table
with the sum of all page views, or perhaps sessions, for each site
for each hour of the day.

So, if you manage a single site, each day you will add 24 new records
to the sum table.

You may want the following fields:
site (string)
atime (timestamptz)
hour_of_day (int)
day_of_week (int)
total_hits (int8)

A record may look like this:
site               | atime                       | hour_of_day | day_of_week | total_hits
'www.yoursite.com' | '2005-06-28 16:00:00 -0400' | 16          | 2           | 350
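
In SQL, a minimal sketch of that table (the name summary_table is my
assumption, chosen to match the query further down):

CREATE TABLE summary_table (
    site        text,         -- e.g. 'www.yoursite.com'
    atime       timestamptz,  -- start of the hour being summarized
    hour_of_day int,          -- 0-23
    day_of_week int,          -- 0-6, Sunday = 0 (PostgreSQL's dow)
    total_hits  int8
);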

Index all of the fields except total_hits (unless you want a report
that shows all hours where hits were greater than x or less than x).
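
In PostgreSQL that would be something like this (the index names are
just placeholders):

CREATE INDEX summary_site_idx  ON summary_table (site);
CREATE INDEX summary_atime_idx ON summary_table (atime);
CREATE INDEX summary_hour_idx  ON summary_table (hour_of_day);
CREATE INDEX summary_dow_idx   ON summary_table (day_of_week);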

Doing:
select sum(total_hits) as total_hits from summary_table where atime
between (now() - '7 days'::interval) and now();
should be pretty fast.
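
The daily pre-processing itself can be a single INSERT ... SELECT.
This is only a sketch; it assumes the raw analyzed entries live in a
hypothetical table named log_entries with site and atime columns:

-- Roll up yesterday's raw entries into one row per site per hour.
INSERT INTO summary_table
  (site, atime, hour_of_day, day_of_week, total_hits)
SELECT site,
       date_trunc('hour', atime),
       EXTRACT(hour FROM atime)::int,
       EXTRACT(dow FROM atime)::int,
       count(*)
FROM log_entries
WHERE atime >= date_trunc('day', now() - '1 day'::interval)
  AND atime <  date_trunc('day', now())
GROUP BY site, date_trunc('hour', atime),
         EXTRACT(hour FROM atime), EXTRACT(dow FROM atime);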

You can also normalize your data, such as referrers and user agents,
and create summary tables similar to the above.
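
For instance, a referrer lookup table plus a matching hourly summary
(all names here are hypothetical):

-- Each distinct referrer string is stored once.
CREATE TABLE referrers (
    referrer_id serial PRIMARY KEY,
    referrer    text UNIQUE
);

-- Hourly hit counts per referrer, analogous to summary_table.
CREATE TABLE referrer_summary (
    referrer_id int REFERENCES referrers,
    atime       timestamptz,
    total_hits  int8
);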

In case you haven't guessed, I've already done this very thing.

I do my batch processing daily using a Python script I've written. I
found that trying to do it with PL/pgSQL took more than 24 hours to
process 24 hours' worth of logs. I then used C# and in-memory hash
tables to drop the time to 2 hours, but I couldn't get Mono installed
on some of my older servers. Python proved the fastest, and I can
process 24 hours' worth of logs in about 15 minutes. Common reports
run in < 1 sec and custom reports run in < 15 seconds (usually).
--
Matthew Nuzum
www.bearfruit.org
