Thread: Need for speed 3
Hi again,

first I want to say ***THANK YOU*** to everyone who kindly shared their
thoughts on my hardware problems. I really appreciate it. I have started
looking for a new server and I am quite sure we'll get a serious hardware
"update". As some people suggested, I would now like to take a closer look
at possible algorithmic improvements.

My application basically imports Apache log files into a Postgres
database. Every row in the log file gets imported into one of three (raw
data) tables. My columns are exactly as in the log file. The import runs
approximately every five minutes. We import about two million rows a
month.

Between 30 and 50 users use the reporting at the same time.

Because reporting became so slow, I created a reporting table. In that
table the data is aggregated by dropping time (the date is preserved), ip,
referer and user-agent. And although it breaks normalization, some data
from a master table is copied in, so no joins are needed anymore.

After every import the data for the current day is deleted from the
reporting table and recalculated from the raw data table (see the sketch
in the P.S. below).

Is this description understandable? If so, what do you think of this
approach? Are there better ways to do it? Is there some literature you
would recommend reading?

TIA

Ulrich
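P.S. To make the recalculation step concrete, here is a minimal sketch of
the kind of statements the import job runs. Table and column names
(raw_hits, reporting_daily, domains) are placeholders for illustration,
not my real schema:

    -- Hypothetical sketch; table and column names are placeholders.
    BEGIN;

    -- Throw away today's aggregated rows ...
    DELETE FROM reporting_daily
     WHERE log_date = CURRENT_DATE;

    -- ... and recalculate them from the raw data: keep the date but drop
    -- time, ip, referer and user-agent, and copy the master-table columns
    -- so the reports need no joins.
    INSERT INTO reporting_daily (log_date, domain_id, domain_name, url, hits)
    SELECT r.hit_time::date,
           r.domain_id,
           d.domain_name,        -- denormalized copy from the master table
           r.url,
           count(*)
      FROM raw_hits r
      JOIN domains d ON d.domain_id = r.domain_id
     WHERE r.hit_time::date = CURRENT_DATE
     GROUP BY r.hit_time::date, r.domain_id, d.domain_name, r.url;

    COMMIT;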
Ulrich wrote:
> My application basically imports Apache log files into a Postgres
> database. Every row in the log file gets imported into one of three (raw
> data) tables. My columns are exactly as in the log file. The import runs
> approximately every five minutes. We import about two million rows a
> month.
[snip]
> After every import the data for the current day is deleted from the
> reporting table and recalculated from the raw data table.

Schemas would be helpful. You may be able to tweak the import table a bit,
and how it moves over to the data tables.

Just a thought: have you considered having the Apache logs written to a
process that immediately issues insert queries to PostgreSQL? You could
write a small C program that uses the advanced query interface calls to
the server.

Merlin
Hi Merlin,

> Schemas would be helpful.

Right now I would like to know if my approach to the problem makes sense,
or if I should rework the whole procedure of import and aggregation.

> Just a thought: have you considered having the Apache logs written to a
> process that immediately issues insert queries to PostgreSQL?

Yes, we considered that, but dismissed the idea very soon. We need Apache
to be as responsive as possible. It's a two-server setup with load
balancer and failover, serving about one thousand domains and counting.
It needs to be as failsafe as possible, and under no circumstances can any
request be lost. (The click counting is core business and relates directly
to our income.)

That said, it seemed quite safe to let Apache write logfiles and import
them later. That way a database downtime wouldn't be mission critical.

> You could write a small C program that uses the advanced query interface
> calls to the server.

How would that improve performance?

Ulrich
> Hi Merlin,
>
> > Just a thought: have you considered having the Apache logs written to a
> > process that immediately issues insert queries to PostgreSQL?
>
> Yes, we considered that, but dismissed the idea very soon. We need Apache
> to be as responsive as possible. It's a two-server setup with load
> balancer and failover, serving about one thousand domains and counting.
> It needs to be as failsafe as possible, and under no circumstances can
> any request be lost. (The click counting is core business and relates
> directly to our income.)
>
> That said, it seemed quite safe to let Apache write logfiles and import
> them later. That way a database downtime wouldn't be mission critical.

Hm. Well, it may be possible to do this in a fast and safe way, and I
understand your reservations here, but I'm going to spout off my opinion
anyways :). If you are not doing this, the following point is moot. But
consider that you could set a very low transaction timeout (like 0.25
seconds) and siphon log entries off to a text file if your database server
gets into trouble. Two million hits a month is not very high even if your
traffic is bursty (there are about 2.6 million seconds in a month).

With a directly linked log you get up-to-date stats at all times and spare
yourself the dump/load song and dance, which is always a headache :(.
Also, however you are doing your billing, it will be easier to manage if
everything is extracted from pg and not from some conglomeration of log
files, *if* you can put 100% faith in your database. When it comes to pg
now, I'm a believer.

> > You could write a small C program that uses the advanced query
> > interface calls to the server.
>
> How would that improve performance?

The functions I'm talking about are PQexecParams and PQexecPrepared. The
parameters do not need to be encoded into the query string or decoded by
the server, which is very light on server resources and gives very low
latency. Using them you could probably get 5000 inserts/sec on a cheap
server with low cpu load, if you have some type of write caching in place.

Merlin
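P.S. To make that concrete, here is the SQL-level equivalent of what I am
describing, with made-up table and column names. PQexecPrepared drives a
prepared statement like this directly from C, shipping only the parameter
values for each hit:

    -- Illustration only; raw_hits and its columns are placeholder names.
    -- Parse and plan the insert once ...
    PREPARE log_insert (timestamptz, inet, text, text, text) AS
      INSERT INTO raw_hits (hit_time, client_ip, url, referer, user_agent)
      VALUES ($1, $2, $3, $4, $5);

    -- ... then each log line becomes a cheap execute with bound values.
    EXECUTE log_insert ('2005-09-01 14:05:00+02', '192.168.0.1',
                        '/index.html', 'http://example.com/', 'Mozilla/5.0');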
Ulrich,

On 9/1/05 6:25 AM, "Ulrich Wisser" <ulrich.wisser@relevanttraffic.se>
wrote:

> My application basically imports Apache log files into a Postgres
> database. Every row in the log file gets imported into one of three (raw
> data) tables. My columns are exactly as in the log file. The import runs
> approximately every five minutes. We import about two million rows a
> month.

Bizgres Clickstream does this job using an ETL (extract, transform and
load) process to transform the weblogs into an optimized schema for
reporting.

> After every import the data for the current day is deleted from the
> reporting table and recalculated from the raw data table.

This is something the optimized ETL in Bizgres Clickstream also does well.

> What do you think of this approach? Are there better ways to do it? Is
> there some literature you would recommend reading?

I recommend the Bizgres Clickstream docs; you can get them from Bizgres
CVS, and there will shortly be a live HTML link on the website.

Bizgres is free - it also improves COPY performance by almost 2x, among
other enhancements.

- Luke
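P.S. For anyone unfamiliar with it, COPY is the standard PostgreSQL bulk
load command; a minimal sketch with a placeholder table and file path:

    -- Placeholder names; loads one delimited log extract in a single
    -- statement instead of row-by-row inserts.
    COPY raw_hits (hit_time, client_ip, url, referer, user_agent)
      FROM '/var/log/apache/extract_20050901.txt'
      WITH DELIMITER '|';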
Ulrich,

Luke cc'd me on his reply, and you should definitely have a look at
Bizgres Clickstream. Even if the whole stack doesn't match your needs
(though it sounds like it would), the clickstream-focused ETL and Bizgres
enhancements could make your life a little easier.

Basically, the stack components you might want to look at first are:

BizGres flavor of PostgreSQL - Enhanced for business intelligence and data
warehousing; the www.bizgres.com website speaks to this in more detail.

Clickstream Data Model - A pageview fact table surrounded by various
dimensions, plus 2 core staging tables for the cleansed weblog data.

ETL Platform - Contains a weblog sessionizer, cleanser and ETL
transformations, which can handle 2-3 million hits without any trouble,
with native support for the COPY command for even greater performance.

JasperReports - For pixel-perfect reporting.

Sorry for sounding like I'm in marketing or sales; I'm not.

A couple of key features that might interest you, considering your email:
the weblog parsing component allows for relatively complex cleansing, so
less data is written to the DB, which increases throughput. In addition,
if you run the load every 5 minutes there is no need to truncate the day's
data and reload it; the ETL knows how to connect to the data loaded
before. The COPY enhancement to PostgreSQL found in Bizgres makes a
noticeable improvement when loading data.

The schema is basically:

Dimension tables: Session, Known Party (if cookies are logged), Page, IP
Address, Date, Time, Referrer, Referrer Page.

Fact tables: Pageview, Hit Subset (not everyone wants all hits).

Staging tables: Hits (cleansed hits, or just pageviews, without surrogate
keys), Session (session data gathered while parsing the log).

Regards

Nick
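P.S. To give a feel for the shape of the model, here is a simplified,
purely illustrative sketch of a pageview star schema. It is not the actual
Bizgres DDL; table and column names are made up:

    -- Illustrative star schema sketch; not the real Bizgres DDL.
    CREATE TABLE dim_date     (date_key serial PRIMARY KEY, cal_date date NOT NULL);
    CREATE TABLE dim_page     (page_key serial PRIMARY KEY, url text NOT NULL);
    CREATE TABLE dim_referrer (referrer_key serial PRIMARY KEY, referrer_url text);

    -- Fact table: one row per pageview, keyed to the dimensions above.
    CREATE TABLE fact_pageview (
        date_key     integer NOT NULL REFERENCES dim_date,
        page_key     integer NOT NULL REFERENCES dim_page,
        referrer_key integer REFERENCES dim_referrer,
        session_id   integer,
        view_count   integer NOT NULL DEFAULT 1
    );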