Re: Need for speed 3 - Mailing list pgsql-performance
From | Nicholas E. Wakefield
---|---
Subject | Re: Need for speed 3
Date |
Msg-id | 14C909669B6CE74381F7BC95A85A0C7F4AD1B1@knimail01.kineticnetworks.com
In response to | Need for speed 3 (Ulrich Wisser <ulrich.wisser@relevanttraffic.se>)
List | pgsql-performance
Ulrich,

Luke cc'd me on his reply, and you should definitely have a look at Bizgres Clickstream. Even if the whole stack doesn't match your needs (though it sounds like it would), the clickstream-focused ETL and Bizgres enhancements could make your life a little easier.

Basically, the stack components you might want to look at first are:

- Bizgres flavor of PostgreSQL - enhanced for business intelligence and data warehousing; the www.bizgres.com website can speak to this in more detail.
- Clickstream Data Model - a pageview fact table surrounded by various dimensions, plus 2 core staging tables for the cleansed weblog data.
- ETL Platform - contains a weblog sessionizer, cleanser, and ETL transformations, which can handle 2-3 million hits without any trouble, with native support for the COPY command for even greater performance.
- JasperReports - for pixel-perfect reporting.

Sorry for sounding like I'm in marketing or sales; I'm not.

A couple of key features might interest you, considering your email. The weblog parsing component allows for relatively complex cleansing, so less data is written to the DB, which increases throughput. In addition, if you run every 5 minutes there is no need to truncate the day's data and reload; the ETL knows how to connect the new data to what was loaded before. The COPY enhancement to PostgreSQL found in Bizgres makes a noticeable improvement when loading data.

The schema is basically (see the SQL sketch at the end of this message):

- Dimension tables: Session, Known Party (if cookies are logged), Page, IP Address, Date, Time, Referrer, Referrer Page.
- Fact tables: Pageview, Hit Subset (not everyone wants all hits).
- Staging tables: Hits (cleansed hits, or just pageviews, without surrogate keys), Session (session data gathered while parsing the log).

Regards,

Nick

-----Original Message-----
From: Luke Lonergan [mailto:llonergan@greenplum.com]
Sent: Thursday, September 01, 2005 9:38 AM
To: Ulrich Wisser; pgsql-performance@postgresql.org
Cc: Nicholas E. Wakefield; Barry Klawans; Daria Hutchinson
Subject: Re: [PERFORM] Need for speed 3

Ulrich,

On 9/1/05 6:25 AM, "Ulrich Wisser" <ulrich.wisser@relevanttraffic.se> wrote:

> My application basically imports Apache log files into a Postgres
> database. Every row in the log file gets imported in one of three (raw
> data) tables. My columns are exactly as in the log file. The import is
> run approx. every five minutes. We import about two million rows a month.

Bizgres Clickstream does this job using an ETL (extract, transform, and load) process to transform the weblogs into an optimized schema for reporting.

> After every import the data from the current day is deleted from the
> reporting table and recalculated from the raw data table.

This is something the optimized ETL in Bizgres Clickstream also does well.

> What do you think of this approach? Are there better ways to do it? Is
> there some literature you recommend reading?

I recommend the Bizgres Clickstream docs; you can get them from Bizgres CVS, and there will shortly be a live HTML link on the website. Bizgres is free - it also improves COPY performance by almost 2x, among other enhancements.

- Luke
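To make the schema Nick describes concrete, here is a minimal SQL sketch of that kind of star layout, ending with a COPY bulk load of cleansed staging data. All table and column names are illustrative assumptions, not the actual Bizgres Clickstream DDL, and only a couple of the dimensions are shown.

```sql
-- Illustrative sketch only: names and columns are assumptions,
-- not the actual Bizgres Clickstream data model.

-- Two of the dimensions Nick lists
CREATE TABLE dim_session (
    session_key  serial PRIMARY KEY,
    session_id   text,
    started_at   timestamp
);

CREATE TABLE dim_page (
    page_key     serial PRIMARY KEY,
    url          text
);

-- Pageview fact table, tied to the dimensions by surrogate keys
CREATE TABLE fact_pageview (
    session_key  integer REFERENCES dim_session,
    page_key     integer REFERENCES dim_page,
    view_date    date,
    view_time    time,
    referrer     text
);

-- Staging table for cleansed hits, without surrogate keys
CREATE TABLE stage_hits (
    session_id   text,
    url          text,
    hit_ts       timestamp,
    referrer     text
);

-- Bulk-load the cleansed weblog output; this COPY path is what the
-- Bizgres enhancement speeds up (almost 2x, per Luke's note).
COPY stage_hits FROM '/tmp/cleansed_hits.csv' WITH CSV;
```

Loading the cleansed staging rows with COPY rather than row-by-row INSERTs is what keeps a five-minute import cycle cheap; the ETL then resolves surrogate keys and appends to the fact table instead of truncating and recalculating the day's data.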