Thread: Need for speed 3

Need for speed 3

From: Ulrich Wisser

Hi again,

First I want to say ***THANK YOU*** to everyone who kindly shared their
thoughts on my hardware problems. I really appreciate it. I have started
to look for a new server and I am quite sure we'll get a serious hardware
"update". As suggested by several people, I would now like to look more
closely at possible algorithmic improvements.

My application basically imports Apache log files into a Postgres
database. Every row in the log file gets imported in one of three (raw
data) tables. My columns are exactly as in the log file. The import is
run approx. every five minutes. We import about two million rows a month.

Between 30 and 50 users are using the reporting at the same time.

Because reporting became so slow, I created a reporting table. In
that table the data is aggregated by dropping the time (the date is
preserved), IP address, referer and user-agent. Although it breaks
normalization, some data from a master table is copied into it as well,
so no joins are needed anymore.

After every import the data from the current day is deleted from the
reporting table and recalculated from the raw data table.
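
In rough outline, that delete-and-recompute step could be driven by a
small libpq program like the sketch below; the table and column names
(raw_hits, report_day, log_date, ts, url, hits) are simplified
placeholders, not the actual schema.

/*
 * Sketch of the per-import recompute: wipe today's rows from the
 * reporting table and rebuild them from the raw data, inside one
 * transaction so report readers never see a half-updated day.
 * Table and column names are placeholders for illustration only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

static void run(PGconn *conn, const char *sql)
{
    PGresult *res = PQexec(conn, sql);
    if (PQresultStatus(res) != PGRES_COMMAND_OK) {
        fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
        PQclear(res);
        PQfinish(conn);
        exit(1);
    }
    PQclear(res);
}

int main(void)
{
    PGconn *conn = PQconnectdb("dbname=weblog");   /* placeholder connection string */
    if (PQstatus(conn) != CONNECTION_OK) {
        fprintf(stderr, "connect failed: %s", PQerrorMessage(conn));
        return 1;
    }

    run(conn, "BEGIN");
    run(conn, "DELETE FROM report_day WHERE log_date = current_date");
    run(conn,
        "INSERT INTO report_day (log_date, url, hits) "
        "SELECT ts::date, url, count(*) "
        "FROM raw_hits "
        "WHERE ts >= current_date "
        "GROUP BY ts::date, url");
    run(conn, "COMMIT");

    PQfinish(conn);
    return 0;
}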


Is this description understandable? If so:

What do you think of this approach? Are there better ways to do it? Is
there some literature you recommend reading?

TIA

Ulrich


Re: Need for speed 3

From: "Merlin Moncure"

Ulrich wrote:
> My application basically imports Apache log files into a Postgres
> database. Every row in the log file gets imported in one of three (raw
> data) tables. My columns are exactly as in the log file. The import is
> run approx. every five minutes. We import about two million rows a month.
>
> Between 30 and 50 users are using the reporting at the same time.
>
> Because reporting became so slow, I created a reporting table. In
> that table the data is aggregated by dropping the time (the date is
> preserved), IP address, referer and user-agent. Although it breaks
> normalization, some data from a master table is copied into it as well,
> so no joins are needed anymore.
>
> After every import the data from the current day is deleted from the
> reporting table and recalculated from the raw data table.
>

schemas would be helpful.  You may be able to tweak the import table a
bit and how it moves over to the data tables.

Just a thought: have you considered having Apache write its logs to a
process that immediately makes insert queries to PostgreSQL?

You could write a small C program which executes advanced query
interface calls to the server.

Merlin

Re: Need for speed 3

From: Ulrich Wisser

Hi Merlin,

> schemas would be helpful.

Right now I would like to know if my approach to the problem makes
sense, or if I should rework the whole import-and-aggregate procedure.

> Just a thought: have you considered having Apache write its logs to a
> process that immediately makes insert queries to PostgreSQL?

Yes, we considered that, but dismissed the idea very quickly. We need
Apache to be as responsive as possible. It's a two-server setup with a
load balancer and failover, serving about one thousand domains and
counting. It needs to be as failsafe as possible, and under no
circumstances can any request be lost. (The click counting is core
business and relates directly to our income.)
Given that, it seemed quite safe to let Apache write logfiles and import
them later; that way a database downtime wouldn't be mission critical.


> You could write a small C program which executes advanced query
> interface calls to the server.

How would that improve performance?

Ulrich

Re: Need for speed 3

From: "Merlin Moncure"

> Hi Merlin,
> > Just a thought: have you considered having Apache write its logs to a
> > process that immediately makes insert queries to PostgreSQL?
>
> Yes, we considered that, but dismissed the idea very quickly. We need
> Apache to be as responsive as possible. It's a two-server setup with a
> load balancer and failover, serving about one thousand domains and
> counting. It needs to be as failsafe as possible, and under no
> circumstances can any request be lost. (The click counting is core
> business and relates directly to our income.)
> Given that, it seemed quite safe to let Apache write logfiles and import
> them later; that way a database downtime wouldn't be mission critical.

Hm. Well, it may be possible to do this in a fast and safe way, and I
understand your reservations here, but I'm going to spout off my opinion
anyway :).

If you are not doing this, the following point is moot. But consider
that you could set a very low transaction timeout (like 0.25 seconds)
and siphon log entries off to a text file if your database server gets
into trouble. Two million hits a month is not very high even if your
traffic is bursty (there are roughly 2.6 million seconds in a month).
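
For illustration, the spill-to-file idea could be as simple as the
fragment below. statement_timeout is used here as the closest standard
server-side knob to the low transaction timeout described above, and the
table name (log_raw) and spill path are made-up placeholders.

/*
 * Fragment illustrating the "low timeout plus spill file" idea.
 * statement_timeout stands in for the low transaction timeout mentioned
 * above; log_raw and the spill path are illustrative names only.
 */
#include <stdio.h>
#include <libpq-fe.h>

/* Call once after connecting: abort any statement that takes > 250 ms. */
static void set_low_timeout(PGconn *conn)
{
    PGresult *res = PQexec(conn, "SET statement_timeout = 250");
    PQclear(res);
}

/* Try to insert one raw log line; on any failure (including a timeout)
 * append it to a local text file so it can be imported later. */
static void insert_or_spill(PGconn *conn, const char *line)
{
    const char *params[1] = { line };
    PGresult *res = PQexecParams(conn,
        "INSERT INTO log_raw (line) VALUES ($1)",
        1, NULL, params, NULL, NULL, 0);

    if (res == NULL || PQresultStatus(res) != PGRES_COMMAND_OK) {
        FILE *spill = fopen("/var/log/click-spill.log", "a");
        if (spill) {
            fprintf(spill, "%s\n", line);
            fclose(spill);
        }
    }
    if (res)
        PQclear(res);
}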

With a directly linked log you always get up-to-date stats and spare
yourself the dump/load song and dance, which is always a headache :(.
Also, however you are doing your billing, it will be easier to manage if
everything is extracted from pg rather than from some conglomeration of
log files, *if* you can put 100% faith in your database. When it comes
to pg these days, I'm a believer.

> > You could write a small C program which executes advanced query
> > interface calls to the server.
>
> How would that improve performance?

The functions I'm talking about are PQexecParams and PQexecPrepared.
With them the query parameters do not need to be encoded into and
decoded out of the query string, which is very light on server resources
and very low latency. Using them you could probably get 5,000 inserts/sec
on a cheap server with low CPU load, if you have some type of write
caching in place.
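
As a concrete (if simplified) sketch of that idea: the program below
reads raw log lines from stdin, for example piped from Apache's
CustomLog directive, and inserts each one through a prepared statement,
so the INSERT is parsed and planned once and the parameter value never
has to be escaped into the SQL text. The connection string, table and
column names are assumptions for illustration only.

/*
 * Minimal sketch of the PQexecPrepared approach: one prepared INSERT,
 * one execution per log line read from stdin.
 */
#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

int main(void)
{
    char line[8192];
    PGconn *conn = PQconnectdb("dbname=weblog");   /* placeholder connection string */
    if (PQstatus(conn) != CONNECTION_OK) {
        fprintf(stderr, "connect failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* Parse and plan the INSERT once on the server. */
    PGresult *res = PQprepare(conn, "ins_hit",
                              "INSERT INTO log_raw (line) VALUES ($1)",
                              1, NULL);
    if (PQresultStatus(res) != PGRES_COMMAND_OK) {
        fprintf(stderr, "prepare failed: %s", PQerrorMessage(conn));
        PQclear(res);
        PQfinish(conn);
        return 1;
    }
    PQclear(res);

    /* One insert per log line; the value travels as an out-of-line
     * parameter, so it is never escaped into the SQL text. */
    while (fgets(line, sizeof(line), stdin) != NULL) {
        const char *params[1];
        line[strcspn(line, "\n")] = '\0';          /* strip trailing newline */
        params[0] = line;

        res = PQexecPrepared(conn, "ins_hit", 1, params, NULL, NULL, 0);
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            fprintf(stderr, "insert failed: %s", PQerrorMessage(conn));
        PQclear(res);
    }

    PQfinish(conn);
    return 0;
}

Whether the real numbers reach that 5,000 inserts/sec estimate would of
course depend on the hardware and on fsync/write-cache settings.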

Merlin



Re: Need for speed 3

From: "Luke Lonergan"

Ulrich,

On 9/1/05 6:25 AM, "Ulrich Wisser" <ulrich.wisser@relevanttraffic.se> wrote:

> My application basically imports Apache log files into a Postgres
> database. Every row in the log file gets imported in one of three (raw
> data) tables. My columns are exactly as in the log file. The import is
> run approx. every five minutes. We import about two million rows a month.

Bizgres Clickstream does this job using an ETL (extract, transform and load)
process to transform the weblogs into an optimized schema for reporting.

> After every import the data from the current day is deleted from the
> reporting table and recalculated from the raw data table.

This is something the optimized ETL in Bizgres Clickstream also does well.

> What do you think of this approach? Are there better ways to do it? Is
> there some literature you recommend reading?

I recommend the Bizgres Clickstream docs; you can get them from Bizgres CVS,
and there will shortly be a live HTML link on the website.

Bizgres is free - it also improves COPY performance by almost 2x, among
other enhancements.
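
(For reference, and independent of the Bizgres-specific speedup: the
five-minute batch import itself can use COPY from libpq rather than
row-by-row INSERTs, roughly as in the sketch below. The table and column
names and the tab-separated input format are placeholder assumptions.)

/*
 * Bulk-loading already-parsed log lines with plain PostgreSQL COPY via
 * libpq. Input is assumed to be COPY text format (tab-separated, one
 * row per line); raw_hits and its columns are placeholders.
 */
#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

static int copy_lines(PGconn *conn, FILE *in)
{
    char line[8192];
    PGresult *res = PQexec(conn, "COPY raw_hits (ts, url, status) FROM STDIN");
    if (PQresultStatus(res) != PGRES_COPY_IN) {
        fprintf(stderr, "COPY failed to start: %s", PQerrorMessage(conn));
        PQclear(res);
        return -1;
    }
    PQclear(res);

    /* Stream each line into the COPY; fgets keeps the newline, which
     * COPY text format expects as the row terminator. */
    while (fgets(line, sizeof(line), in) != NULL)
        PQputCopyData(conn, line, (int) strlen(line));

    PQputCopyEnd(conn, NULL);                    /* finish the COPY stream */

    int ok = 0;
    while ((res = PQgetResult(conn)) != NULL) {  /* collect the final status */
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
            ok = -1;
        PQclear(res);
    }
    return ok;
}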

- Luke



Re: Need for speed 3

From: "Nicholas E. Wakefield"

Ulrich,

Luke cc'd me on his reply, and you should definitely have a look at
Bizgres Clickstream. Even if the whole stack doesn't match your needs
(though it sounds like it would), the clickstream-focused ETL and BizGres
enhancements could make your life a little easier.

Basically, the stack components that you might want to look at first are:

- BizGres flavor of PostgreSQL: enhanced for business intelligence and
  data warehousing; the www.bizgres.com website can speak to this in
  more detail.
- Clickstream Data Model: a pageview fact table surrounded by various
  dimensions, plus 2 core staging tables for the cleansed weblog data.
- ETL Platform: contains a weblog sessionizer, cleanser and ETL
  transformations, which can handle 2-3 million hits without any
  trouble, with native support for the COPY command for even greater
  performance.
- JasperReports: for pixel-perfect reporting.

Sorry for sounding like I'm in marketing or sales; I'm not.

A couple of key features might interest you, considering your email.
The weblog parsing component allows for relatively complex cleansing,
so less data is written to the DB, which increases throughput. In
addition, if you run the import every 5 minutes there is no need to
truncate the day's data and reload; the ETL knows how to connect it
with the data loaded before. The COPY enhancement to PostgreSQL found
in BizGres makes a noticeable improvement when loading data.

The schema is basically:

Dimension tables: Session, Known Party (if cookies are logged), Page,
IP Address, Date, Time, Referrer, Referrer Page.
Fact tables: Pageview, Hit Subset (not everyone wants all hits).
Staging tables: Hits (cleansed hits or just pageviews, without surrogate
keys), Session (session data gathered while parsing the log).

Regards

Nick

