Thread: Fw: [GENERAL] Including unique users in huge data warehouse in Postgresql...

Fw: [GENERAL] Including unique users in huge data warehouse in Postgresql...

From: Mark Jensen
Date:
posting this here instead of the GENERAL list... Richard is right, this is more of a performance question than a general
question.

thanks,

____________________________________
Mark Jensen

----- Forwarded Message ----
From: Mark Jensen <musicnyman1974@yahoo.com>
To: Richard Huxton <dev@archonet.com>
Cc: pgsql-general@postgresql.org
Sent: Wednesday, November 29, 2006 2:40:58 PM
Subject: Re: [GENERAL] Including unique users in huge data warehouse in Postgresql...

thanks Richard.  I've been talking to Ron Mayer about this offline as well.  I think the main problem is dedupping users,
and not being able to aggregate visits in the fact table.  that's where most of the query time goes.  but the business
guys just won't accept visits instead of actual dedupped uniques.  if visits were in the fact table as an integer i could
sum up, i'd be fine.  Ron also said he's put the unique user ids into arrays so it's faster to count them, by placing
them into aggregate tables.  only problem is i'd still have to know what things to aggregate by to create these, which is
impossible since we have so many dimensions and facts that are going to be ad-hoc.  i have about 20 summary tables i
create per day, but most of the time, they have something new they want to query that's not in summary, and it will never
come up again.
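
for what it's worth, a rough sketch of the array idea (table and column names here are made up; array_agg comes from newer
postgres releases, so on 8.2 a custom aggregate or contrib/intagg's int_array_aggregate would have to stand in for it):

create table daily_area_users (
    day     date,
    area_id integer,
    uu_ids  integer[]   -- distinct user ids seen for this area on this day
);

insert into daily_area_users (day, area_id, uu_ids)
select date '2006-11-29', area_id, array_agg(distinct uu_id)   -- see note above about 8.2
from daily_area_fact
group by area_id;

-- counting uniques per area is then just the array length
select area_id, array_upper(uu_ids, 1) as users
from daily_area_users;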

I tried installing Bizgres using their Bizgres loader and custom postgresql package with bitmap indexes, but it doesn't
seem to increase performance "that" much, or as much as i would like compared to the normal postgresql install.  loads
are pretty slow when using their bitmap indexes compared to just using btree indexes in the standard postgresql install.
Query time is pretty good, but i also have to make sure load times are acceptable as well.  and i had some problems with
the bizgres loader losing connection to the database for no reason at all, but when using the normal copy command in
8.2RC1 it works fine.  love the new query inclusion in the copy command by the way, makes it so easy to aggregate hourly
fact tables into daily/weekly/monthly in one shot :)
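
for example, rolling an hourly fact table up to daily and reloading it in one shot looks something like this
(hourly_area_fact is just a stand-in name for the hourly table):

copy (
    select uu_id, area_id, sum(imps) as imps, sum(clicks) as clicks
    from hourly_area_fact
    group by uu_id, area_id
) to '/tmp/daily_area_fact.copy';

copy daily_area_fact (uu_id, area_id, imps, clicks) from '/tmp/daily_area_fact.copy';

a plain insert into ... select would do the same job in-place, but the copy form makes it easy to move the rollup
between machines.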

and yes, work_mem is optimized as much as possible.  postgresql is using about 1.5 gigs of working memory when it runs
these queries.  looking into getting 64 bit hardware with 16-32 gigs of RAM so i can throw most of this into memory to
speed it up.  we're also using 3par storage which is pretty fast.  we're going to try and put postgresql on a local disk
array using RAID 5 as well to see if it makes a difference.
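
for the big rollup queries it can also help to bump work_mem per session rather than globally, e.g. (the 1GB figure is
purely illustrative):

show work_mem;            -- check the current setting
set work_mem = '1GB';     -- per-session override for this backend only
-- ... run the big aggregate/rollup queries here ...
reset work_mem;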

and yes, right now, these are daily aggregate tables summed up from the hourly.  so about 17 million rows per day.
hourly fact tables are impossible to query right now, so i have to at least put these into daily fact tables.  so when
you have 30 days in this database, then yes, table scans are going to be huge, thus why it's taking so long, plus
dedupping on unique user id :)
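
splitting the daily fact into per-day child tables with constraint exclusion would at least keep the scans down to the
days a query actually asks for.  a rough sketch, assuming the daily fact table carries a day column (which the current
schema may not have):

-- needs constraint_exclusion = on in postgresql.conf
create table daily_area_fact_20061129 (
    check (day = date '2006-11-29')
) inherits (daily_area_fact);

create index daily_area_fact_20061129_uu_id
    on daily_area_fact_20061129 (uu_id);

-- a query against the parent then only scans the children whose check matches
select count(distinct uu_id)
from daily_area_fact
where day between date '2006-11-01' and date '2006-11-30';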

and you're right, i should put this on the performance mailing list... see you there :)

thanks guys.

____________________________________
Mark Jensen

----- Original Message ----
From: Richard Huxton <dev@archonet.com>
To: Mark Jensen <musicnyman1974@yahoo.com>
Cc: pgsql-general@postgresql.org
Sent: Wednesday, November 29, 2006 2:29:35 PM
Subject: Re: [GENERAL] Including unique users in huge data warehouse in Postgresql...

Mark Jensen wrote:
> So i've been given the task of designing a data warehouse in
> either Postgresql or Mysql for our clickstream data for our sites.  I
> started with Mysql but the joins in Mysql are just way too slow
> compared to Postgresql when playing with star schemas.

Mark - it's not my usual area, but no-one else has picked up your
posting, so I'll poke my nose in. The other thing you might want to do
is post this on the performance list - that's probably the best place.
Might be worth talking to those at www.bizgres.org too (although I think
they all hang out on the performance list).

> I can't say
> which sites i'm working on, but we get close to 3-5 million uniques
> users per day, so over time, that's a lot of unique users to keep
> around and de-dup your fact tables by.  Need to be able to query normal
> analytics like:
<snip>

> i've
> made a lot of optimizations in postgresql.conf by playing with work_mem
> and shared_buffers and such and i think the database is using as much
> as it can disk/memory/cpu wise.

Big work_mem, I'm guessing. Limiting factor is presumably disk I/O.
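
It might be worth running explain analyze on a cut-down version of the query to see where the time actually goes, e.g.:

explain analyze
select count(distinct uu_id) from daily_area_fact;

If most of the time is in the seq-scan rather than the sort/aggregate step, more work_mem alone won't buy much.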

<snip>
> here's a sample query that takes a while to run... just a simple report that shows gender by area of the site.
>
> select A.gender as gender, B.area as area, sum(C.imps) as imps, sum(C.clicks) as clicks, count(distinct(C.uu_id)) as users
> from uus as A, areas as B, daily_area_fact as C
> where A.uu_id = C.uu_id
> and B.area_id = C.area_id
> group by gender,area;
>
> so
> by just having one day of data, with 3,168,049 rows in the user
> dimension table (uus), 17,213,420 in the daily_area_fact table that
> joins all the dimension tables, takes about 15 minutes.  if i had 30-90
> days in this fact table, who knows how long this would take... i know
> doing a distinct on uu_id is very expensive, so that's the main problem
> here i guess and would want to know if anyone else is doing it this way
> or better.

In the end, I'd suspect the seq-scan over the fact table will be your
biggest problem. Can you pre-aggregate your fact-table into daily summaries?
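
Another thing that might be worth trying (just a sketch against the table names in your query, untested): de-dup uu_id
in a subquery first, so the outer count(*) replaces the count(distinct ...) and the planner is free to consider hash
aggregation:

select gender, area,
       sum(imps)   as imps,
       sum(clicks) as clicks,
       count(*)    as users
from (
    select A.gender, B.area, C.uu_id,
           sum(C.imps)   as imps,
           sum(C.clicks) as clicks
    from uus as A
    join daily_area_fact as C on A.uu_id = C.uu_id
    join areas as B on B.area_id = C.area_id
    group by A.gender, B.area, C.uu_id
) as per_user
group by gender, area;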

See you over on the performance list, where there are more experienced
people than myself to help you.
--
   Richard Huxton
   Archonet Ltd







Re: Fw: [GENERAL] Including unique users in huge data warehouse in Postgresql...

From: "Luke Lonergan"
Date:
Mark,

This fits the typical pattern of the "Big Honking Datamart" for clickstream
analysis, a usage pattern that stresses the capabilities of any DBMS.  Large
companies spend $1M+ on combinations of software and hardware to solve this problem,
and only large-scale parallel DBMSs can handle the load.  Players in the
market include Oracle, IBM, Teradata, Netezza and of course Greenplum.

Unfortunately, techniques like bitmap indexes only improve things by factors
of O(10).  Parallelism is the only proven way to get O(10,000) improvements
in response time.  Furthermore, simply speeding up the I/O underneath one CPU
per query is insufficient; the query and loading engines need to scale CPU
and storage access together.

=========== start description of commercial Postgres solution ===========
=========== If commercial solutions offend, skip this section ===========

The parallel I/O and CPU of Greenplum DB (formerly Bizgres MPP) is designed
for exactly this workload, where a combination of scalable I/O and CPU is
required to speed these kinds of queries (sessionizing weblogs, creating
aggregates, direct ad-hoc analysis).

One of our customers doing clickstream analysis uses a combination of
sessionizing ELT processing with Greenplum DB + Bizgres KETL and
Microstrategy for the reporting frontend.  The complete system is 1/100 the
price of systems that are slower.

We routinely see speedups of over 100x compared to large-scale, multi-million-dollar
commercial solutions, and we have reference customers who regularly
work with terabytes of data.

============ end commercial solution description ========================

- Luke

On 11/29/06 11:43 AM, "Mark Jensen" <musicnyman1974@yahoo.com> wrote:

> posting this here instead of the GENERAL list...richard is right, this is more
> of a performance question than a general question.
>
> thanks,
>
> ____________________________________
> Mark Jensen
>
> ----- Forwarded Message ----
> From: Mark Jensen <musicnyman1974@yahoo.com>
> To: Richard Huxton <dev@archonet.com>
> Cc: pgsql-general@postgresql.org
> Sent: Wednesday, November 29, 2006 2:40:58 PM
> Subject: Re: [GENERAL] Including unique users in huge data warehouse in
> Postgresql...
>
> thanks Richard.  I've been talking to Ron Mayer about this as well offline.  I
> think the main problem is dedupping users, and not being able to aggregate
> visits in the fact table.  that's where most of the query time takes place.
> but the business guys just won't accept using visits, but not actual uniques
> dedupped.  if visits was in the fact table as an integer i could sum up, i'd
> be fine.  Ron also said he's put the unique user ids into arrays so it's
> faster to count them, but placing them into aggregate tables.  only problem is
> i'd still have to know what things to aggregate by to create these, which is
> impossible since we have so many dimensions and facts that are going to be
> ad-hoc.  i have about 20 summary tables i create per day, but most of the
> time, they have something new they want to query that's not in summary.  and
> will never come up again.
>
> I tried installing Bizgres using their Bizgres loader and custom postgresql
> package with bitmap indexes, but doesn't seem to increase performance "that"
> much.  or as much as i would like compared to the normal postgresql install.
> loads are pretty slow when using their bitmap indexes compared to just using
> btree indexes in the standard postgresql install.  Query time is pretty good,
> but i also have to make sure load times are acceptable as well.  and had some
> problems with the bizgres loader losing connection to the database for no
> reason at all, but when using the normal copy command in 8.2RC1, works fine.
> love the new query inclusion in the copy command by the way, makes it so easy
> to aggregate hourly fact tables into daily/weekly/monthly in one shot :)
>
> and yes, work_mem is optimized as much as possible.  postgresql is using about
> 1.5 gigs of working memory when it runs these queries.  looking into getting
> 64 bit hardware with 16-32 gigs of RAM so i can throw most of this into memory
> to speed it up.  we're also using 3par storage which is pretty fast.  we're
> going to try and put postgresql on a local disk array using RAID 5 as well to
> see if it makes a difference.
>
> and yes, right now, these are daily aggregate tables summed up from the
> hourly.  so about 17 million rows per day.  hourly fact tables are impossible
> to query right now, so i have to at least put these into daily fact tables.
> so when you have 30 days in this database, then yes, table scans are going to
> be huge, thus why it's taking so long, plus dedupping on unique user id :)
>
> and you're right, i should put this on the performance mailing list... see you
> there :)
>
> thanks guys.
>
> ____________________________________
> Mark Jensen
>
> ----- Original Message ----
> From: Richard Huxton <dev@archonet.com>
> To: Mark Jensen <musicnyman1974@yahoo.com>
> Cc: pgsql-general@postgresql.org
> Sent: Wednesday, November 29, 2006 2:29:35 PM
> Subject: Re: [GENERAL] Including unique users in huge data warehouse in
> Postgresql...
>
> Mark Jensen wrote:
>> So i've been given the task of designing a data warehouse in
>> either Postgresql or Mysql for our clickstream data for our sites.  I
>> started with Mysql but the joins in Mysql are just way too slow
>> compared to Postgresql when playing with star schemas.
>
> Mark - it's not my usual area, but no-one else has picked up your
> posting, so I'll poke my nose in. The other thing you might want to do
> is post this on the performance list - that's probably the best place.
> Might be worth talking to those at www.bizgres.org too (although I think
> they all hang out on the performance list).
>
>> I can't say
>> which sites i'm working on, but we get close to 3-5 million uniques
>> users per day, so over time, that's a lot of unique users to keep
>> around and de-dup your fact tables by.  Need to be able to query normal
>> analytics like:
> <snip>
>
>> i've
>> made a lot of optimizations in postgresql.conf by playing with work_mem
>> and shared_buffers and such and i think the database is using as much
>> as it can disk/memory/cpu wise.
>
> Big work_mem, I'm guessing. Limiting factor is presumably disk I/O.
>
> <snip>
>> here's a sample query that takes a while to run... just a simple report that
>> shows gender by area of the site.
>>
>> select A.gender as gender, B.area as area, sum(C.imps) as imps, sum(C.clicks)
>> as clicks, count(distinct(C.uu_id)) as users
>> from uus as A, areas as B, daily_area_fact as C
>> where A.uu_id = C.uu_id
>> and B.area_id = C.area_id
>> group by gender,area;
>>
>> so
>> by just having one day of data, with 3,168,049 rows in the user
>> dimension table (uus), 17,213,420 in the daily_area_fact table that
>> joins all the dimension tables, takes about 15 minutes.  if i had 30-90
>> days in this fact table, who knows how long this would take... i know
>> doing a distinct on uu_id is very expensive, so that's the main problem
>> here i guess and would want to know if anyone else is doing it this way
>> or better.
>
> In the end, I'd suspect the seq-scan over the fact table will be your
> biggest problem. Can you pre-aggregate your fact-table into daily summaries?
>
> See you over on the performance list, where there are more experienced
> people than myself to help you.