Re: Need for speed 3 - Mailing list pgsql-performance

From Nicholas E. Wakefield
Subject Re: Need for speed 3
Date
Msg-id 14C909669B6CE74381F7BC95A85A0C7F4AD1B1@knimail01.kineticnetworks.com
In response to Need for speed 3  (Ulrich Wisser <ulrich.wisser@relevanttraffic.se>)
List pgsql-performance
Ulrich,

Luke cc'd me on his reply, and you should definitely have a look at
Bizgres Clickstream. Even if the whole stack doesn't match your needs
(though it sounds like it would), the clickstream-focused ETL and
Bizgres enhancements could make your life a little easier.

Basically the stack components that you might want to look at first are:

Bizgres flavor of PostgreSQL - Enhanced for business intelligence and
data warehousing; the www.bizgres.com website covers this in more
detail.
Clickstream Data Model - Pageview fact table surrounded by various
dimensions and 2 core staging tables for the cleansed weblog data.
ETL Platform - Contains a weblog sessionizer, cleanser and ETL
transformations, which can handle 2-3 million hits without any trouble,
with native support for the COPY command for even greater performance.
JasperReports - For pixel perfect reporting.

Sorry for sounding like I'm in marketing or sales; I'm not.

A couple of key features might interest you, considering your email.
The weblog parsing component allows for relatively complex cleansing,
so less data is written to the DB, which increases throughput. In
addition, if you run the ETL every 5 minutes there is no need to
truncate the day's data and reload; the ETL knows how to connect the
new data with what was loaded before. The COPY enhancement to
PostgreSQL found in Bizgres makes a noticeable improvement when
loading data.
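
To make that concrete: loading a cleansed, tab-delimited extract with
COPY rather than row-by-row INSERTs looks roughly like the following.
The table and file names are made up for illustration and are not the
actual Bizgres Clickstream objects.

    -- Hypothetical staging table and file path, shown only to illustrate the idea.
    -- COPY's text format is tab-delimited by default, which suits cleansed log extracts.
    COPY stage_hits (hit_time, ip_address, url, referrer, status)
        FROM '/tmp/hits_cleansed.txt';

Much of the bulk-load gain comes from issuing one COPY per batch
instead of paying per-row statement overhead.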
The schema is basically:

Dimension tables: Session, Known Party (if cookies are logged), Page, IP
Address, Date, Time, Referrer, Referrer Page.
Fact tables: Pageview, Hit Subset (not everyone wants all hits).

Staging tables: Hits (cleansed hits or just pageviews without surrogate
keys), Session (session data gathered while parsing the log).
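
To give a feel for the shape of that model, here is a minimal
star-schema sketch. Table and column names are my own illustrative
guesses, not the actual Bizgres Clickstream DDL.

    -- Illustrative only; the real model has more dimensions and columns.
    CREATE TABLE dim_page (
        page_key serial PRIMARY KEY,
        url      text NOT NULL
    );

    CREATE TABLE dim_session (
        session_key serial PRIMARY KEY,
        started_at  timestamp NOT NULL
    );

    CREATE TABLE fact_pageview (
        pageview_key bigserial PRIMARY KEY,
        page_key     integer NOT NULL REFERENCES dim_page,
        session_key  integer NOT NULL REFERENCES dim_session,
        view_time    timestamp NOT NULL
    );

Reports then join the narrow fact table to whichever dimensions they
need, rather than scanning raw log rows.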

Regards

Nick


-----Original Message-----
From: Luke Lonergan [mailto:llonergan@greenplum.com]
Sent: Thursday, September 01, 2005 9:38 AM
To: Ulrich Wisser; pgsql-performance@postgresql.org
Cc: Nicholas E. Wakefield; Barry Klawans; Daria Hutchinson
Subject: Re: [PERFORM] Need for speed 3

Ulrich,

On 9/1/05 6:25 AM, "Ulrich Wisser" <ulrich.wisser@relevanttraffic.se>
wrote:

> My application basically imports Apache log files into a Postgres
> database. Every row in the log file gets imported in one of three (raw
> data) tables. My columns are exactly as in the log file. The import is
> run approx. every five minutes. We import about two million rows a
> month.

Bizgres Clickstream does this job using an ETL (extract transform and
load) process to transform the weblogs into an optimized schema for
reporting.

> After every import the data from the current day is deleted from the
> reporting table and recalculated from the raw data table.

This is something the optimized ETL in Bizgres Clickstream also does
well.
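
For reference, the delete-and-recalculate step described above boils
down to something like the following pattern; the table and column
names here are hypothetical, just to show the shape of it.

    -- Hypothetical tables: raw_log holds the imported rows, report_daily the aggregates.
    BEGIN;
    DELETE FROM report_daily WHERE report_date = current_date;
    INSERT INTO report_daily (report_date, url, pageviews)
        SELECT current_date, url, count(*)
        FROM raw_log
        WHERE log_time >= current_date
        GROUP BY url;
    COMMIT;

An incremental ETL avoids redoing that whole-day aggregation on every
five-minute run.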

> What do you think of this approach? Are there better ways to do it? Is
> there some literature you recommend reading?

I recommend the Bizgres Clickstream docs; you can get them from Bizgres
CVS, and there will shortly be a live HTML link on the website.

Bizgres is free - it also improves COPY performance by almost 2x, among
other enhancements.

- Luke



