Thread: Optimizing a VIEW
Hi all,

I've got a simple table with a lot of data in it:

CREATE TABLE customer_data (
    cd_id          int        primary key default(nextval('cd_seq')),
    cd_cust_id     int        not null,
    cd_variable    text       not null,
    cd_value       text,
    cd_tag         text,
    added_user     int        not null,
    added_date     timestamp  not null default now(),
    modified_user  int        not null,
    modified_date  timestamp  not null default now(),
    FOREIGN KEY(cd_cust_id) REFERENCES customer(cust_id)
);

The 'cust_id' references the customer that the given data belongs to. The reason for this "data bucket" (does this structure have a proper name?) is that the data I need to store for a given customer is quite variable and outside of my control. As it is, there are about 400 different variable/value pairs I need to store per customer.

This table has a copy in a second, historical schema that matches the one in 'public' except for an additional 'history_id' sequence. I use a simple function to copy every INSERT or UPDATE into the historical schema (a sketch of one way to do this follows the query plan below).

Now I want to graph a certain subset of these variable/value pairs, so I created a (conceptually) simple view to pull out the historical data set for a given customer. It selects a set of records based on the 'cd_variable' and 'cd_tag' names and connects the records together using a matching timestamp.

The problem is that this view has very quickly become terribly slow. I've got indexes on the 'cd_variable', 'cd_tag' and the parent 'cust_id' columns, and the plan seems to show that the indexes are indeed being used, but a query against this view can take up to 10 minutes to respond. I am hoping to avoid making a dedicated table, as what I use to build this dataset may change over time.

Below is the VIEW and a sample query's EXPLAIN ANALYZE. Thanks for any tips/help/clue-stick-beating you may be able to share!
Madi

-=] VIEW

CREATE VIEW view_sync_rate_history AS
SELECT
    a.cust_id        AS vsrh_cust_id,
    a.cust_name      AS vsrh_cust_name,
    a.cust_business  AS vsrh_cust_business,
    a.cust_nexxia_id||'-'||a.cust_nexxia_seq AS vsrh_cust_nexxia,
    a.cust_phone     AS vsrh_cust_phone,
    b.cd_value       AS vsrh_up_speed,
    b.history_id     AS vsrh_up_speed_history_id,
    c.cd_value       AS vsrh_up_rco,
    c.history_id     AS vsrh_up_rco_history_id,
    d.cd_value       AS vsrh_up_nm,
    d.history_id     AS vsrh_up_nm_history_id,
    e.cd_value       AS vsrh_up_sp,
    e.history_id     AS vsrh_up_sp_history_id,
    f.cd_value       AS vsrh_up_atten,
    f.history_id     AS vsrh_up_atten_history_id,
    g.cd_value       AS vsrh_down_speed,
    g.history_id     AS vsrh_down_speed_history_id,
    h.cd_value       AS vsrh_down_rco,
    h.history_id     AS vsrh_down_rco_history_id,
    i.cd_value       AS vsrh_down_nm,
    i.history_id     AS vsrh_down_nm_history_id,
    j.cd_value       AS vsrh_down_sp,
    j.history_id     AS vsrh_down_sp_history_id,
    k.cd_value       AS vsrh_down_atten,
    k.history_id     AS vsrh_down_atten_history_id,
    l.cd_value       AS vsrh_updated,
    l.history_id     AS vsrh_updated_history_id
FROM
    customer a,
    history.customer_data b,
    history.customer_data c,
    history.customer_data d,
    history.customer_data e,
    history.customer_data f,
    history.customer_data g,
    history.customer_data h,
    history.customer_data i,
    history.customer_data j,
    history.customer_data k,
    history.customer_data l
WHERE
    a.cust_id=b.cd_cust_id AND
    a.cust_id=c.cd_cust_id AND
    a.cust_id=d.cd_cust_id AND
    a.cust_id=e.cd_cust_id AND
    a.cust_id=f.cd_cust_id AND
    a.cust_id=g.cd_cust_id AND
    a.cust_id=h.cd_cust_id AND
    a.cust_id=i.cd_cust_id AND
    a.cust_id=j.cd_cust_id AND
    a.cust_id=k.cd_cust_id AND
    a.cust_id=l.cd_cust_id AND
    b.cd_tag='sync_rate' AND
    c.cd_tag='sync_rate' AND
    d.cd_tag='sync_rate' AND
    e.cd_tag='sync_rate' AND
    f.cd_tag='sync_rate' AND
    g.cd_tag='sync_rate' AND
    h.cd_tag='sync_rate' AND
    i.cd_tag='sync_rate' AND
    j.cd_tag='sync_rate' AND
    k.cd_tag='sync_rate' AND
    l.cd_tag='sync_rate' AND
    b.cd_variable='upstream_speed' AND
    c.cd_variable='upstream_relative_capacity_occupation' AND
    d.cd_variable='upstream_noise_margin' AND
    e.cd_variable='upstream_signal_power' AND
    f.cd_variable='upstream_attenuation' AND
    g.cd_variable='downstream_speed' AND
    h.cd_variable='downstream_relative_capacity_occupation' AND
    i.cd_variable='downstream_noise_margin' AND
    j.cd_variable='downstream_signal_power' AND
    k.cd_variable='downstream_attenuation' AND
    l.cd_variable='sync_rate_updated' AND
    b.modified_date=c.modified_date AND
    b.modified_date=d.modified_date AND
    b.modified_date=e.modified_date AND
    b.modified_date=f.modified_date AND
    b.modified_date=g.modified_date AND
    b.modified_date=h.modified_date AND
    b.modified_date=i.modified_date AND
    b.modified_date=j.modified_date AND
    b.modified_date=k.modified_date AND
    b.modified_date=l.modified_date;

-=] EXPLAIN ANALYZE of a sample query

In case this is hard to read in the mail program, here is a link:
http://mizu-bu.org/misc/long_explain_analyze.txt

 QUERY PLAN
--------------------------------------------------------------------------------
Nested Loop  (cost=1263.93..3417.98 rows=1 width=262) (actual time=88.005..248996.948 rows=131 loops=1)
    Join Filter: ("inner".modified_date = "outer".modified_date)
  ->  Nested Loop  (cost=577.65..1482.46 rows=1 width=154) (actual time=58.664..5253.873 rows=131 loops=1)
        Join Filter: ("outer".modified_date = "inner".modified_date)
    ->  Nested Loop  (cost=369.98..870.28 rows=1 width=128) (actual time=51.858..4328.108 rows=131 loops=1)
          Join Filter: ("inner".modified_date = "outer".modified_date)
      ->  Nested Loop  (cost=343.35..823.93 rows=1 width=116) (actual time=42.851..3185.995 rows=131 loops=1)
            Join Filter: ("inner".modified_date = "outer".modified_date)
        ->  Nested Loop  (cost=126.20..185.37 rows=1 width=90) (actual time=36.181..2280.245 rows=131 loops=1)
              Join Filter: ("inner".modified_date = "outer".modified_date)
          ->  Nested Loop  (cost=99.57..139.02 rows=1 width=64) (actual time=27.918..1168.061 rows=131 loops=1)
                Join Filter: ("outer".modified_date = "inner".modified_date)
            ->  Hash Join  (cost=72.94..92.67 rows=1 width=38) (actual time=17.769..18.572 rows=131 loops=1)
                  Hash Cond: ("outer".modified_date = "inner".modified_date)
              ->  Bitmap Heap Scan on customer_data i  (cost=26.63..46.30 rows=4 width=26) (actual time=8.226..8.563 rows=131 loops=1)
                    Recheck Cond: ((cd_variable = 'downstream_noise_margin'::text) AND (103 = cd_cust_id))
                    Filter: (cd_tag = 'sync_rate'::text)
                ->  BitmapAnd  (cost=26.63..26.63 rows=5 width=0) (actual time=8.172..8.172 rows=0 loops=1)
                  ->  Bitmap Index Scan on cd_variable_index  (cost=0.00..10.21 rows=918 width=0) (actual time=6.409..6.409 rows=20981 loops=1)
                        Index Cond: (cd_variable = 'downstream_noise_margin'::text)
                  ->  Bitmap Index Scan on cd_id_index  (cost=0.00..16.17 rows=2333 width=0) (actual time=0.502..0.502 rows=2619 loops=1)
                        Index Cond: (103 = cd_cust_id)
              ->  Hash  (cost=46.30..46.30 rows=4 width=12) (actual time=9.526..9.526 rows=131 loops=1)
                ->  Bitmap Heap Scan on customer_data e  (cost=26.63..46.30 rows=4 width=12) (actual time=9.140..9.381 rows=131 loops=1)
                      Recheck Cond: ((cd_variable = 'upstream_signal_power'::text) AND (103 = cd_cust_id))
                      Filter: (cd_tag = 'sync_rate'::text)
                  ->  BitmapAnd  (cost=26.63..26.63 rows=5 width=0) (actual time=9.082..9.082 rows=0 loops=1)
                    ->  Bitmap Index Scan on cd_variable_index  (cost=0.00..10.21 rows=918 width=0) (actual time=7.298..7.298 rows=20981 loops=1)
                          Index Cond: (cd_variable = 'upstream_signal_power'::text)
                    ->  Bitmap Index Scan on cd_id_index  (cost=0.00..16.17 rows=2333 width=0) (actual time=0.502..0.502 rows=2619 loops=1)
                          Index Cond: (103 = cd_cust_id)
            ->  Bitmap Heap Scan on customer_data c  (cost=26.63..46.30 rows=4 width=26) (actual time=8.492..8.693 rows=131 loops=131)
                  Recheck Cond: ((cd_variable = 'upstream_relative_capacity_occupation'::text) AND (103 = cd_cust_id))
                  Filter: (cd_tag = 'sync_rate'::text)
              ->  BitmapAnd  (cost=26.63..26.63 rows=5 width=0) (actual time=8.446..8.446 rows=0 loops=131)
                ->  Bitmap Index Scan on cd_variable_index  (cost=0.00..10.21 rows=918 width=0) (actual time=6.693..6.693 rows=20986 loops=131)
                      Index Cond: (cd_variable = 'upstream_relative_capacity_occupation'::text)
                ->  Bitmap Index Scan on cd_id_index  (cost=0.00..16.17 rows=2333 width=0) (actual time=0.494..0.494 rows=2619 loops=131)
                      Index Cond: (103 = cd_cust_id)
          ->  Bitmap Heap Scan on customer_data b  (cost=26.63..46.30 rows=4 width=26) (actual time=8.216..8.405 rows=131 loops=131)
                Recheck Cond: ((cd_variable = 'upstream_speed'::text) AND (103 = cd_cust_id))
                Filter: (cd_tag = 'sync_rate'::text)
            ->  BitmapAnd  (cost=26.63..26.63 rows=5 width=0) (actual time=8.172..8.172 rows=0 loops=131)
              ->  Bitmap Index Scan on cd_variable_index  (cost=0.00..10.21 rows=918 width=0) (actual time=6.417..6.417 rows=20986 loops=131)
                    Index Cond: (cd_variable = 'upstream_speed'::text)
              ->  Bitmap Index Scan on cd_id_index  (cost=0.00..16.17 rows=2333 width=0) (actual time=0.495..0.495 rows=2619 loops=131)
                    Index Cond: (103 = cd_cust_id)
        ->  Bitmap Heap Scan on customer_data l  (cost=217.14..637.28 rows=102 width=26) (actual time=6.653..6.843 rows=131 loops=131)
              Recheck Cond: ((103 = cd_cust_id) AND (cd_variable = 'sync_rate_updated'::text))
              Filter: (cd_tag = 'sync_rate'::text)
          ->  BitmapAnd  (cost=217.14..217.14 rows=117 width=0) (actual time=6.618..6.618 rows=0 loops=131)
            ->  Bitmap Index Scan on cd_id_index  (cost=0.00..16.17 rows=2333 width=0) (actual time=0.485..0.485 rows=2619 loops=131)
                  Index Cond: (103 = cd_cust_id)
            ->  Bitmap Index Scan on cd_variable_index  (cost=0.00..200.72 rows=21350 width=0) (actual time=6.079..6.079 rows=20986 loops=131)
                  Index Cond: (cd_variable = 'sync_rate_updated'::text)
      ->  Bitmap Heap Scan on customer_data k  (cost=26.63..46.30 rows=4 width=12) (actual time=8.442..8.638 rows=131 loops=131)
            Recheck Cond: ((cd_variable = 'downstream_attenuation'::text) AND (103 = cd_cust_id))
            Filter: (cd_tag = 'sync_rate'::text)
        ->  BitmapAnd  (cost=26.63..26.63 rows=5 width=0) (actual time=8.397..8.397 rows=0 loops=131)
          ->  Bitmap Index Scan on cd_variable_index  (cost=0.00..10.21 rows=918 width=0) (actual time=6.624..6.624 rows=20986 loops=131)
                Index Cond: (cd_variable = 'downstream_attenuation'::text)
          ->  Bitmap Index Scan on cd_id_index  (cost=0.00..16.17 rows=2333 width=0) (actual time=0.487..0.487 rows=2619 loops=131)
                Index Cond: (103 = cd_cust_id)
    ->  Bitmap Heap Scan on customer_data d  (cost=207.68..610.95 rows=98 width=26) (actual time=6.805..6.994 rows=131 loops=131)
          Recheck Cond: ((103 = cd_cust_id) AND (cd_variable = 'upstream_noise_margin'::text))
          Filter: (cd_tag = 'sync_rate'::text)
      ->  BitmapAnd  (cost=207.68..207.68 rows=112 width=0) (actual time=6.769..6.769 rows=0 loops=131)
        ->  Bitmap Index Scan on cd_id_index  (cost=0.00..16.17 rows=2333 width=0) (actual time=0.487..0.487 rows=2619 loops=131)
              Index Cond: (103 = cd_cust_id)
        ->  Bitmap Index Scan on cd_variable_index  (cost=0.00..191.26 rows=20360 width=0) (actual time=6.230..6.230 rows=20986 loops=131)
              Index Cond: (cd_variable = 'upstream_noise_margin'::text)
  ->  Nested Loop  (cost=686.28..1935.49 rows=1 width=224) (actual time=21.077..1860.475 rows=131 loops=131)
    ->  Seq Scan on customer a  (cost=0.00..5.22 rows=1 width=164) (actual time=0.053..0.090 rows=1 loops=131)
          Filter: (cust_id = 103)
    ->  Nested Loop  (cost=686.28..1930.26 rows=1 width=76) (actual time=21.014..1860.177 rows=131 loops=131)
          Join Filter: ("inner".modified_date = "outer".modified_date)
      ->  Nested Loop  (cost=472.13..1298.07 rows=1 width=50) (actual time=14.460..971.017 rows=131 loops=131)
            Join Filter: ("inner".modified_date = "outer".modified_date)
        ->  Hash Join  (cost=259.97..674.63 rows=1 width=38) (actual time=7.459..8.272 rows=131 loops=131)
              Hash Cond: ("outer".modified_date = "inner".modified_date)
          ->  Bitmap Heap Scan on customer_data h  (cost=213.66..627.06 rows=100 width=26) (actual time=7.391..7.707 rows=131 loops=131)
                Recheck Cond: ((103 = cd_cust_id) AND (cd_variable = 'downstream_relative_capacity_occupation'::text))
                Filter: (cd_tag = 'sync_rate'::text)
            ->  BitmapAnd  (cost=213.66..213.66 rows=115 width=0) (actual time=7.355..7.355 rows=0 loops=131)
              ->  Bitmap Index Scan on cd_id_index  (cost=0.00..16.17 rows=2333 width=0) (actual time=0.493..0.493 rows=2619 loops=131)
                    Index Cond: (103 = cd_cust_id)
              ->  Bitmap Index Scan on cd_variable_index  (cost=0.00..197.24 rows=20926 width=0) (actual time=6.809..6.809 rows=20986 loops=131)
                    Index Cond: (cd_variable = 'downstream_relative_capacity_occupation'::text)
          ->  Hash  (cost=46.30..46.30 rows=4 width=12) (actual time=8.253..8.253 rows=131 loops=1)
            ->  Bitmap Heap Scan on customer_data f  (cost=26.63..46.30 rows=4 width=12) (actual time=7.882..8.113 rows=131 loops=1)
                  Recheck Cond: ((cd_variable = 'upstream_attenuation'::text) AND (103 = cd_cust_id))
                  Filter: (cd_tag = 'sync_rate'::text)
              ->  BitmapAnd  (cost=26.63..26.63 rows=5 width=0) (actual time=7.832..7.832 rows=0 loops=1)
                ->  Bitmap Index Scan on cd_variable_index  (cost=0.00..10.21 rows=918 width=0) (actual time=6.065..6.065 rows=20981 loops=1)
                      Index Cond: (cd_variable = 'upstream_attenuation'::text)
                ->  Bitmap Index Scan on cd_id_index  (cost=0.00..16.17 rows=2333 width=0) (actual time=0.489..0.489 rows=2619 loops=1)
                      Index Cond: (103 = cd_cust_id)
        ->  Bitmap Heap Scan on customer_data j  (cost=212.16..622.19 rows=100 width=12) (actual time=7.092..7.280 rows=131 loops=17161)
              Recheck Cond: ((103 = cd_cust_id) AND (cd_variable = 'downstream_signal_power'::text))
              Filter: (cd_tag = 'sync_rate'::text)
          ->  BitmapAnd  (cost=212.16..212.16 rows=114 width=0) (actual time=7.057..7.057 rows=0 loops=17161)
            ->  Bitmap Index Scan on cd_id_index  (cost=0.00..16.17 rows=2333 width=0) (actual time=0.493..0.493 rows=2619 loops=17161)
                  Index Cond: (103 = cd_cust_id)
            ->  Bitmap Index Scan on cd_variable_index  (cost=0.00..195.74 rows=20784 width=0) (actual time=6.512..6.512 rows=20986 loops=17161)
                  Index Cond: (cd_variable = 'downstream_signal_power'::text)
      ->  Bitmap Heap Scan on customer_data g  (cost=214.15..630.92 rows=101 width=26) (actual time=6.526..6.718 rows=131 loops=17161)
            Recheck Cond: ((103 = cd_cust_id) AND (cd_variable = 'downstream_speed'::text))
            Filter: (cd_tag = 'sync_rate'::text)
        ->  BitmapAnd  (cost=214.15..214.15 rows=116 width=0) (actual time=6.492..6.492 rows=0 loops=17161)
          ->  Bitmap Index Scan on cd_id_index  (cost=0.00..16.17 rows=2333 width=0) (actual time=0.486..0.486 rows=2619 loops=17161)
                Index Cond: (103 = cd_cust_id)
          ->  Bitmap Index Scan on cd_variable_index  (cost=0.00..197.73 rows=21067 width=0) (actual time=5.956..5.956 rows=20986 loops=17161)
                Index Cond: (cd_variable = 'downstream_speed'::text)
Total runtime: 248997.571 ms
(114 rows)
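The copy-to-history function mentioned above isn't shown in the post. One plausible way to implement it is a row-level trigger along these lines (a sketch only; the function and trigger names are invented, and history.customer_data's history_id is assumed to default to its own sequence):

CREATE OR REPLACE FUNCTION copy_customer_data_to_history() RETURNS trigger AS $$
BEGIN
    -- Copy the inserted/updated row into the historical schema.
    INSERT INTO history.customer_data
        (cd_id, cd_cust_id, cd_variable, cd_value, cd_tag,
         added_user, added_date, modified_user, modified_date)
    VALUES
        (NEW.cd_id, NEW.cd_cust_id, NEW.cd_variable, NEW.cd_value, NEW.cd_tag,
         NEW.added_user, NEW.added_date, NEW.modified_user, NEW.modified_date);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_customer_data_history
    AFTER INSERT OR UPDATE ON customer_data
    FOR EACH ROW EXECUTE PROCEDURE copy_customer_data_to_history();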
On Aug 15, 2008, at 1:36 PM, Madison Kelly wrote:

> The 'cust_id' references the customer that the given data belongs to. The reason for this "data bucket" (does this structure have a proper name?) is that the data I need to store for a given customer is quite variable and outside of my control. As it is, there are about 400 different variable/value pairs I need to store per customer.

It's called Entity-Attribute-Value, and its performance is pretty much guaranteed to suck for any kind of large dataset. The problem is that you're storing a MASSIVE amount of extra information for every single value. Consider:

If each data point was just a field in a table, then even if we left cd_value as text, each data point would consume 4 bytes* + 1 byte per character (I'm assuming you don't need extra UTF8 chars or anything). Of course, if you know you're only storing numbers or the like, you can make that even more efficient.

* In 8.3, the text field overhead could be as low as 1 byte if the field is small enough.

OTOH, your EAV table is going to cost 32+24 bytes per row just for the per-row overhead, ints and timestamps. Each text field adds 1 or 4 bytes of overhead, and then you still have to store the actual data. Realistically, you're looking at 60+ bytes per data point, as opposed to maybe 15, or even down to 4 if you know you're storing an int.

Now figure out what that turns into if you have 100 data points per minute. It doesn't take very long until you have a huge pile of data you're trying to deal with. (As an aside, I once consulted with a company that wanted to do this... they wanted to store about 400 data points from about 1000 devices on a 5 minute interval. That worked out to something like 5GB per day, just for the EAV table. It just wasn't going to scale...)

So, back to your situation... there are several things you can do that will greatly improve matters.

Identify data points that are very common and don't use EAV to store them. Instead, store them as regular fields in a table (and don't use text if at all possible).

You need to trim down your EAV table. Throw out the added/modified info; there's almost certainly no reason to store that *per data point*. Get rid of cd_id; there should be a natural PK you can use, and you certainly don't want anything else referring to this table (which is a big reason to use a surrogate key).

cd_variable and cd_tag need to be ints that point at other tables. For that matter, do you really need to tag each *data point*? Probably not...

Finally, if you have a defined set of points that you need to report on, create a materialized view that has that information.

BTW, it would probably be better to store data either in the main table or the history table, but not both places.

--
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
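To make the "trim down your EAV table" suggestion concrete, a slimmed-down bucket might look roughly like this (a sketch only; every name here is invented for illustration and not from the thread):

-- Variable and tag names move into small lookup tables, so the bucket
-- stores two ints instead of two repeated text strings per data point.
CREATE TABLE data_variable (
    var_id    serial PRIMARY KEY,
    var_name  text NOT NULL UNIQUE
);

CREATE TABLE data_tag (
    tag_id    serial PRIMARY KEY,
    tag_name  text NOT NULL UNIQUE
);

-- The trimmed bucket: no surrogate key, no per-point audit columns.
-- The natural key (customer, variable, timestamp) serves as the PK.
CREATE TABLE customer_data_slim (
    cd_cust_id     int       NOT NULL REFERENCES customer(cust_id),
    cd_var_id      int       NOT NULL REFERENCES data_variable(var_id),
    cd_tag_id      int       REFERENCES data_tag(tag_id),
    cd_value       text,
    modified_date  timestamp NOT NULL DEFAULT now(),
    PRIMARY KEY (cd_cust_id, cd_var_id, modified_date)
);

The common, frequently-polled points would then live in an ordinary table of typed columns, as discussed later in the thread.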
On Sat, Aug 16, 2008 at 2:19 PM, Decibel! <decibel@decibel.org> wrote:

> You need to trim down your EAV table.

Egads! I'd say completely get rid of this beast and redesign it according to valid relational concepts.

This post pretty much explains the whole issue with EAV:
http://groups.google.com/group/microsoft.public.sqlserver.programming/browse_thread/thread/df5bb99b3eaadfa9/6a160e5027ce3a80?lnk=st&q=eav#6a160e5027ce3a80

EAV is evil. Period.
On Fri, Aug 15, 2008 at 1:36 PM, Madison Kelly <linux@alteeve.com> wrote:

> The 'cust_id' references the customer that the given data belongs to. The reason for this "data bucket" (does this structure have a proper name?) is that the data I need to store for a given customer is quite variable and outside of my control. As it is, there are about 400 different variable/value pairs I need to store per customer.

For your very specific case, I recommend you check out contrib/hstore:
http://www.postgresql.org/docs/current/static/hstore.html
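For readers who haven't used it, hstore stores a set of key/value pairs in a single column. A minimal sketch of how the "data bucket" could be expressed with it (table, column names and values are invented; assumes the contrib module has been installed into the database):

CREATE TABLE customer_attrs (
    cust_id  int PRIMARY KEY REFERENCES customer(cust_id),
    attrs    hstore   -- all variable/value pairs for this customer in one column
);

-- Store a couple of pairs:
INSERT INTO customer_attrs (cust_id, attrs)
VALUES (103, 'upstream_speed=>800, downstream_speed=>6016');

-- Pull a single value back out (the -> operator returns text):
SELECT attrs -> 'downstream_speed' AS downstream_speed
FROM customer_attrs
WHERE cust_id = 103;

-- A GiST index supports the @> and ? containment/key operators:
CREATE INDEX customer_attrs_attrs_idx ON customer_attrs USING gist (attrs);

Note that keys and values are both text and only the latest value per key is kept, so the history/graphing requirement would still need separate handling.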
On Sun, Aug 17, 2008 at 7:06 AM, Rodrigo E. De León Plicet <rdeleonp@gmail.com> wrote:
Awesome!!!! Any comments on the performance of hstore?
Best regards,
--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com
EnterpriseDB http://www.enterprisedb.com
Mail sent from my BlackLaptop device
Decibel! wrote:
> On Aug 15, 2008, at 1:36 PM, Madison Kelly wrote:
>> The 'cust_id' references the customer that the given data belongs to. The reason for this "data bucket" (does this structure have a proper name?) is that the data I need to store for a given customer is quite variable and outside of my control. As it is, there are about 400 different variable/value pairs I need to store per customer.
>
> It's called Entity-Attribute-Value, and its performance is pretty much guaranteed to suck for any kind of large dataset. The problem is that you're storing a MASSIVE amount of extra information for every single value. Consider:
>
> If each data point was just a field in a table, then even if we left cd_value as text, each data point would consume 4 bytes* + 1 byte per character (I'm assuming you don't need extra UTF8 chars or anything). Of course, if you know you're only storing numbers or the like, you can make that even more efficient.
>
> * In 8.3, the text field overhead could be as low as 1 byte if the field is small enough.
>
> OTOH, your EAV table is going to cost 32+24 bytes per row just for the per-row overhead, ints and timestamps. Each text field adds 1 or 4 bytes of overhead, and then you still have to store the actual data. Realistically, you're looking at 60+ bytes per data point, as opposed to maybe 15, or even down to 4 if you know you're storing an int.
>
> Now figure out what that turns into if you have 100 data points per minute. It doesn't take very long until you have a huge pile of data you're trying to deal with. (As an aside, I once consulted with a company that wanted to do this... they wanted to store about 400 data points from about 1000 devices on a 5 minute interval. That worked out to something like 5GB per day, just for the EAV table. It just wasn't going to scale...)
>
> So, back to your situation... there are several things you can do that will greatly improve matters.
>
> Identify data points that are very common and don't use EAV to store them. Instead, store them as regular fields in a table (and don't use text if at all possible).
>
> You need to trim down your EAV table. Throw out the added/modified info; there's almost certainly no reason to store that *per data point*. Get rid of cd_id; there should be a natural PK you can use, and you certainly don't want anything else referring to this table (which is a big reason to use a surrogate key).
>
> cd_variable and cd_tag need to be ints that point at other tables. For that matter, do you really need to tag each *data point*? Probably not...
>
> Finally, if you have a defined set of points that you need to report on, create a materialized view that has that information.
>
> BTW, it would probably be better to store data either in the main table or the history table, but not both places.

This is a very long and thoughtful reply, thank you very kindly. Truth be told, I sort of expected this would be what I had to do. I think I asked this more in the hope that there might be some "magic" I didn't know about, but I see now that's not the case. :)

As my data points grew to 500,000+, the time it took to return these results grew to well over 10 minutes on a decent server, and the DB size was growing rapidly, as you spoke of.

So I did just as you suggested: I took the variable names I knew about specifically and created a dedicated table for them. These are the ones that are updated most often (hourly per customer). I made each column an 'int' or 'real' where possible and ditched the tracking of the adding/modifying user and timestamp; I had added those out of habit more than anything, and this data will always come from a system app anyway.

Given that my DB is still in development, and how long and intensive it would have been to pull out the existing data, I have started over and am now gathering new data. In a week or so I should have the same amount of data as I had before and will be able to do a closer comparison test. However, I already suspect the growth of the database will be substantially slower and the queries will return substantially faster.

Thank you again!

Madi
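The dedicated table Madison describes isn't shown in the thread; based on the variables in the original view, it might look roughly like this (a guess for illustration only; table and column names are invented):

CREATE TABLE customer_sync_rate (
    csr_cust_id            int       NOT NULL REFERENCES customer(cust_id),
    csr_recorded           timestamp NOT NULL DEFAULT now(),
    csr_up_speed           int,
    csr_up_noise_margin    real,
    csr_up_signal_power    real,
    csr_up_attenuation     real,
    csr_down_speed         int,
    csr_down_noise_margin  real,
    csr_down_signal_power  real,
    csr_down_attenuation   real,
    PRIMARY KEY (csr_cust_id, csr_recorded)
);

-- The graphing query then becomes a single range scan per customer, e.g.:
-- SELECT csr_recorded, csr_up_speed, csr_down_speed
-- FROM customer_sync_rate WHERE csr_cust_id = 103 ORDER BY csr_recorded;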
On Fri, 15 Aug 2008, Madison Kelly wrote:

> Below I will post the VIEW and a sample of the query's EXPLAIN ANALYZE. Thanks for any tips/help/clue-stick-beating you may be able to share!

This query looks incredibly expensive:

> SELECT ...
> FROM
>     customer a,
>     history.customer_data b,
>     history.customer_data c,
>     history.customer_data d,
>     history.customer_data e,
>     history.customer_data f,
>     history.customer_data g,
>     history.customer_data h,
>     history.customer_data i,
>     history.customer_data j,
>     history.customer_data k,
>     history.customer_data l
> WHERE
>     a.cust_id=b.cd_cust_id AND
>     a.cust_id=c.cd_cust_id AND
>     a.cust_id=d.cd_cust_id AND
>     a.cust_id=e.cd_cust_id AND
>     a.cust_id=f.cd_cust_id AND
>     a.cust_id=g.cd_cust_id AND
>     a.cust_id=h.cd_cust_id AND
>     a.cust_id=i.cd_cust_id AND
>     a.cust_id=j.cd_cust_id AND
>     a.cust_id=k.cd_cust_id AND
>     a.cust_id=l.cd_cust_id AND
...

I would refactor this significantly, so that instead of returning a wide result it returns more than one row per customer. Just do a single join between customer and history.customer_data - it will run much faster.

Matthew

--
Here we go - the Fairy Godmother redundancy proof.
                                        -- Computer Science Lecturer
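A sketch of the single-join, "more than one row per customer" shape Matthew is describing (the view name is invented; the pivot into one column per variable is left to the application or a later step):

CREATE VIEW view_sync_rate_history_narrow AS
SELECT
    a.cust_id,
    a.cust_name,
    b.cd_variable,
    b.cd_value,
    b.modified_date,
    b.history_id
FROM customer a
JOIN history.customer_data b ON b.cd_cust_id = a.cust_id
WHERE b.cd_tag = 'sync_rate';

-- Example use by the graphing code:
-- SELECT cd_variable, cd_value, modified_date
-- FROM view_sync_rate_history_narrow
-- WHERE cust_id = 103
--   AND cd_variable IN ('upstream_speed', 'downstream_speed')
-- ORDER BY modified_date;

Each variable then comes back as its own row keyed by modified_date, instead of history.customer_data being joined to itself eleven times to line the columns up.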
On Aug 16, 2008, at 9:19 PM, Gurjeet Singh wrote:

> For your very specific case, I recommend you check out contrib/hstore:
> http://www.postgresql.org/docs/current/static/hstore.html
>
> Awesome!!!! Any comments on the performance of hstore?

I've looked at it but haven't actually used it. One thing I wish it did was to keep a catalog somewhere of the "names" that it's seen, so that it wasn't storing them as in-line text. If you have even moderate-length names and are storing small values, you quickly end up wasting a ton of space.

BTW, now that you can build arrays of composite types, that might be an easy way to deal with this stuff. Create a composite type of (name_id, value) and store that in an array.

--
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
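A sketch of that composite-array idea (type, table and values are purely illustrative):

-- A (name_id, value) pair as a composite type; name_id points at a
-- lookup table of variable names so the name text is stored only once.
CREATE TYPE data_point AS (
    name_id  int,
    value    numeric
);

CREATE TABLE customer_points (
    cust_id        int       NOT NULL REFERENCES customer(cust_id),
    modified_date  timestamp NOT NULL DEFAULT now(),
    points         data_point[]   -- all data points for this snapshot
);

-- Example insert using an array of composites:
INSERT INTO customer_points (cust_id, points)
VALUES (103, ARRAY[ROW(1, 800)::data_point, ROW(2, 6016)::data_point]);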
On Aug 17, 2008, at 10:21 AM, Madison Kelly wrote:

> Truth be told, I sort of expected this would be what I had to do. I think I asked this more in the hope that there might be some "magic" I didn't know about, but I see now that's not the case. :)
>
> As my data points grew to 500,000+, the time it took to return these results grew to well over 10 minutes on a decent server, and the DB size was growing rapidly, as you spoke of.
>
> So I did just as you suggested: I took the variable names I knew about specifically and created a dedicated table for them. These are the ones that are updated most often (hourly per customer). I made each column an 'int' or 'real' where possible and ditched the tracking of the adding/modifying user and timestamp; I had added those out of habit more than anything, and this data will always come from a system app anyway.
>
> Given that my DB is still in development, and how long and intensive it would have been to pull out the existing data, I have started over and am now gathering new data. In a week or so I should have the same amount of data as I had before and will be able to do a closer comparison test.
>
> However, I already suspect the growth of the database will be substantially slower and the queries will return substantially faster.

I strongly recommend you also re-think using EAV at all for this. It plainly and simply does not scale well. I won't go so far as to say it can never be used (we're actually working on one right now, but it will only be used to occasionally pull up single entities), but you have to be really careful with it. I don't see it working very well for what it sounds like you're trying to do.

--
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
Decibel! <decibel@decibel.org> writes:
> On Aug 16, 2008, at 9:19 PM, Gurjeet Singh wrote:
>> Awesome!!!! Any comments on the performance of hstore?

> I've looked at it but haven't actually used it. One thing I wish it did was to keep a catalog somewhere of the "names" that it's seen, so that it wasn't storing them as in-line text. If you have even moderate-length names and are storing small values, you quickly end up wasting a ton of space.

> BTW, now that you can build arrays of composite types, that might be an easy way to deal with this stuff. Create a composite type of (name_id, value) and store that in an array.

If you're worried about storage space, I wouldn't go for arrays of composite :-(. The tuple header overhead is horrendous, almost certainly a lot worse than hstore.

			regards, tom lane
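One way to see the overhead Tom is referring to is pg_column_size(); for example, reusing the hypothetical data_point type from the earlier sketch (assuming it and hstore exist in the database; exact byte counts vary by version and platform):

-- Bytes used by a single composite value vs. a comparable hstore value.
SELECT pg_column_size(ROW(1, 800)::data_point)         AS composite_bytes,
       pg_column_size('downstream_speed=>800'::hstore) AS hstore_bytes;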
On Aug 20, 2008, at 1:18 PM, Tom Lane wrote:

> If you're worried about storage space, I wouldn't go for arrays of composite :-(. The tuple header overhead is horrendous, almost certainly a lot worse than hstore.

Oh holy cow, I didn't realize we had a big header in there. Is that to allow for changing the definition of the composite type?

--
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828