Re: optimized counting of web statistics - Mailing list pgsql-performance

From Matthew Nuzum
Subject Re: optimized counting of web statistics
Date
Msg-id f3c0b40805062819546605c0cc@mail.gmail.com
In response to Re: optimized counting of web statistics  (Rudi Starcevic <tech@wildcash.com>)
List pgsql-performance
On 6/29/05, Rudi Starcevic <tech@wildcash.com> wrote:
> Hi,
>
> >I do my batch processing daily using a python script I've written. I
> >found that trying to do it with pl/pgsql took more than 24 hours to
> >process 24 hours worth of logs. I then used C# and in memory hash
> >tables to drop the time to 2 hours, but I couldn't get mono installed
> >on some of my older servers. Python proved the fastest and I can
> >process 24 hours worth of logs in about 15 minutes. Common reports run
> >in < 1 sec and custom reports run in < 15 seconds (usually).
> >
> >
>
> When you say you do your batch processing in a Python script, do you
> mean you are using 'plpython' inside PostgreSQL, or using Python to
> execute select statements and crunch the data 'outside' PostgreSQL?
>
> Your reply is very interesting.

Sorry for not making that clear... I don't use plpython. I'm using an
external Python program that makes database connections, creates
dictionaries and does the normalization/batch processing in memory. It
then saves the changes to a text file, which is copied into the
database using psql.
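
A minimal sketch of that workflow, for illustration only. This is not
the actual script: the log format (day and URL as the first two
whitespace-separated fields), the function names, the file name
hits.tsv and the table daily_hits are all assumptions made for the
example.

```python
from collections import Counter


def normalize_and_count(log_lines, url_ids):
    """Count hits per (day, url_id) entirely in memory.

    url_ids is a plain dict mapping URL -> integer id. In a real run it
    would be preloaded from the database (assumption); URLs not seen
    before get the next free id.
    """
    next_id = max(url_ids.values(), default=0) + 1
    hits = Counter()
    for line in log_lines:
        # Assumed log format: "<day> <url> ..." separated by whitespace.
        day, url = line.split()[:2]
        if url not in url_ids:
            url_ids[url] = next_id
            next_id += 1
        hits[(day, url_ids[url])] += 1
    return hits


def write_copy_file(hits, out):
    """Write tab-separated rows, the default text format for COPY."""
    for (day, url_id), count in sorted(hits.items()):
        out.write(f"{day}\t{url_id}\t{count}\n")


if __name__ == "__main__":
    sample = ["2005-06-28 /index.html", "2005-06-28 /about.html",
              "2005-06-28 /index.html"]
    with open("hits.tsv", "w") as fh:
        write_copy_file(normalize_and_count(sample, {}), fh)
    # The file could then be loaded from psql with something like:
    #   \copy daily_hits (day, url_id, hits) from 'hits.tsv'
    # (daily_hits is a hypothetical table name.)
```

The point of the design is that every lookup and increment happens in a
Python dict, and the database only sees one bulk load at the end.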

I've tried many things, and while this is RAM intensive, it is by far
the fastest approach I've found. I've also modified the Python program
to optionally use disk-based dictionaries based on (I think) gdbm. This
significantly increases the run time, to closer to 25 min. ;-) but
drops the memory usage by an order of magnitude.
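
As a rough illustration of the disk-based variant, here is a sketch
using the standard library's dbm module, which picks whichever dbm
backend is available on the machine. It is an assumption that this
resembles the real program; the function names and the key format are
made up for the example.

```python
import dbm


def count_to_disk(keys, path):
    """Increment a per-key counter stored in an on-disk dbm file.

    dbm databases store bytes, so keys and counts are encoded on the
    way in. Each increment touches the disk-backed store instead of a
    Python dict, trading speed for a much smaller memory footprint.
    """
    with dbm.open(path, "c") as db:  # "c": create the file if missing
        for key in keys:
            k = key.encode("utf-8")
            current = int(db.get(k, b"0"))
            db[k] = str(current + 1).encode("ascii")


def read_counts(path):
    """Load all counters back into a regular dict of str -> int."""
    with dbm.open(path, "r") as db:
        return {k.decode("utf-8"): int(db[k]) for k in db.keys()}
```

Because the dbm object behaves like a dict, the counting loop can stay
almost identical to the in-memory version, which makes it easy to
switch between the two with an option.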

To be fair to C# and .NET, I think that Python and C# can do it
equally fast, but between the time of creating the C# version and the
Python version I learned some new optimization techniques. I feel that
both are powerful languages. (To be fair to Python, I can write the
dictionary lookup code in approx. 25% fewer lines than similar hash
table code in C#. I could go on but I think I'm starting to get off
topic.)
--
Matthew Nuzum
www.bearfruit.org
