Materialized view performance problems

From: Tom McLoughlin
Date: 6 January 2016 09:08
Hello,

I'm having trouble with the performance of a query used to create a materialized view.

I need to be able to build the keyword_performance_flat_matview view overnight, in around 2-4 hours. However, it currently takes more than 24 hours. Is there anything I can do to improve the performance?

As you can see below it's a big query, and I didn't want to overwhelm everyone with the schema, so let me know what bits you might need to help!

Any help improving the performance will be greatly appreciated.

Thank you.


EngineYard M3 Extra Large instance:
  Intel Xeon CPU E5-2670 v2 @ 2.50GHz (quad core)
  15 GB RAM
  PostgreSQL 9.3.6


Queries:

create materialized view
  campaign_category_lookup
as
select
  scenario_campaigns.id as campaign_id, 
  join_website_budget_labels.label_id as category_id
from scenario_campaigns
inner join scenario_campaign_vendor_instances
  on scenario_campaign_vendor_instances.campaign_id = scenario_campaigns.id
inner join join_website_budget_labels_campaign_vendor_instances_1f8636
  on join_website_budget_labels_campaign_vendor_instances_1f8636.vendor_instance_id = scenario_campaign_vendor_instances.id
inner join join_website_budget_labels
  on join_website_budget_labels_campaign_vendor_instances_1f8636.website_budget_label_id = join_website_budget_labels.id
inner join account_website_budgets
  on join_website_budget_labels.budget_id = account_website_budgets.id
  and account_website_budgets.state = 'approved'
where scenario_campaign_vendor_instances.campaign_id = scenario_campaigns.id
  and account_website_budgets.start_date <= current_date
  and account_website_budgets.end_date >= current_date
order by approved_at desc;

create materialized view temp_keyword_perf_unaggr
as
select
  account_websites.id as website_id,
  account_websites.namespace as website_namespace,
  scenario_keyword_vendor_instances.inventory_disabled as inventory_disabled,
  scenario_keyword_vendor_instances.condition_disabled as condition_disabled,
  scenario_ad_groups.campaign_id,
  scenario_ad_groups.id as ad_group_id,
  scenario_keywords.id as keyword_id,
  scenario_keyword_texts.value as keyword_name,
  scenario_keyword_performances.*,
  (select category_id from campaign_category_lookup where campaign_category_lookup.campaign_id = scenario_ad_groups.campaign_id limit 1) as category_id
from
  scenarios
  inner join account_websites
    on scenarios.website_id = account_websites.id
  inner join scenario_campaigns
    on scenario_campaigns.scenario_id = scenarios.id
  inner join scenario_ad_groups
    on scenario_ad_groups.campaign_id = scenario_campaigns.id
  inner join scenario_keywords
    on scenario_keywords.ad_group_id = scenario_ad_groups.id
  inner join scenario_keyword_texts
    on scenario_keyword_texts.id = scenario_keywords.text_id
  left outer join scenario_keyword_vendor_instances
    on scenario_keyword_vendor_instances.keyword_id = scenario_keywords.id
  left outer join scenario_keyword_performances
    on scenario_keyword_performances.api_id = scenario_keyword_vendor_instances.api_id
    and scenario_keyword_performances.date >= (date_trunc('month', now()) - '1 month'::interval)::date -- start of previous month
where
  scenarios.deleted_at is null
  and scenario_keyword_texts.value is not null
  and account_websites.active = 't';

create materialized view
  keyword_performance_flat_matview
as
select
  website_id,
  website_namespace,
  campaign_id,
  ad_group_id,
  keyword_name,
  keyword_id,
  network,
  device,
  inventory_disabled,
  condition_disabled,
  category_id,
  date,
  sum(impressions) as impressions,
  sum(clicks) as clicks,
  sum(conv_one) as conv_one,
  sum(conv_many) as conv_many,
  sum(cost) as cost,
  sum(conv_value) as conv_value,
  avg(avg_position) as avg_position
from temp_keyword_perf_unaggr
group by 
  website_id, 
  website_namespace,
  campaign_id,
  ad_group_id, 
  keyword_id, 
  keyword_name, 
  device, 
  network, 
  inventory_disabled, 
  condition_disabled,
  category_id, 
  date;
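
For context: the three views depend on each other, so the overnight build just runs them in that order. Once they exist, the nightly rebuild is roughly the following (a sketch; assume a plain REFRESH MATERIALIZED VIEW here, the exact mechanism isn't what matters for the question):

refresh materialized view campaign_category_lookup;
refresh materialized view temp_keyword_perf_unaggr;
refresh materialized view keyword_performance_flat_matview;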


Explain output for temp_keyword_perf_unaggr:

 Merge Right Join  (cost=8796955.87..1685073792.18 rows=296873848 width=213)
   Merge Cond: (scenario_keyword_performances.api_id = scenario_keyword_vendor_instances.api_id)
   ->  Index Scan using index_keyword_performances_on_vendor_instance_id_and_date on scenario_keyword_performances  (cost=0.44..203167.46 rows=392586 width=144)
         Index Cond: (date >= ((date_trunc('month'::text, now()) - '1 mon'::interval))::date)
   ->  Materialize  (cost=8796955.43..8883724.51 rows=17353816 width=77)
         ->  Sort  (cost=8796955.43..8840339.97 rows=17353816 width=77)
               Sort Key: scenario_keyword_vendor_instances.api_id
               ->  Hash Join  (cost=2755544.36..5939172.05 rows=17353816 width=77)
                     Hash Cond: (scenario_keywords.text_id = scenario_keyword_texts.id)
                     ->  Hash Right Join  (cost=2171209.00..4417042.21 rows=17353816 width=48)
                           Hash Cond: (scenario_keyword_vendor_instances.keyword_id = scenario_keywords.id)
                           ->  Seq Scan on scenario_keyword_vendor_instances  (cost=0.00..821853.20 rows=33362520 width=14)
                           ->  Hash  (cost=1827291.60..1827291.60 rows=16931312 width=38)
                                 ->  Hash Join  (cost=219154.58..1827291.60 rows=16931312 width=38)
                                       Hash Cond: (scenario_keywords.ad_group_id = scenario_ad_groups.id)
                                       ->  Seq Scan on scenario_keywords  (cost=0.00..946491.60 rows=32550260 width=12)
                                       ->  Hash  (cost=186041.43..186041.43 rows=1712492 width=30)
                                             ->  Hash Join  (cost=6569.88..186041.43 rows=1712492 width=30)
                                                   Hash Cond: (scenario_ad_groups.campaign_id = scenario_campaigns.id)
                                                   ->  Seq Scan on scenario_ad_groups  (cost=0.00..133539.47 rows=3292247 width=8)
                                                   ->  Hash  (cost=5596.79..5596.79 rows=77847 width=26)
                                                         ->  Hash Join  (cost=100.50..5596.79 rows=77847 width=26)
                                                               Hash Cond: (scenario_campaigns.scenario_id = scenarios.id)
                                                               ->  Seq Scan on scenario_campaigns  (cost=0.00..4156.60 rows=149660 width=8)
                                                               ->  Hash  (cost=85.98..85.98 rows=1161 width=26)
                                                                     ->  Hash Join  (cost=16.43..85.98 rows=1161 width=26)
                                                                           Hash Cond: (scenarios.website_id = account_websites.id)
                                                                           ->  Seq Scan on scenarios  (cost=0.00..50.32 rows=2032 width=8)
                                                                                 Filter: (deleted_at IS NULL)
                                                                           ->  Hash  (cost=12.92..12.92 rows=281 width=22)
                                                                                 ->  Seq Scan on account_websites  (cost=0.00..12.92 rows=281 width=22)
                                                                                       Filter: active
                     ->  Hash  (cost=292793.16..292793.16 rows=14352816 width=37)
                           ->  Seq Scan on scenario_keyword_texts  (cost=0.00..292793.16 rows=14352816 width=37)
                                 Filter: (value IS NOT NULL)
   SubPlan 1
     ->  Limit  (cost=0.28..5.63 rows=1 width=4)
           ->  Index Scan using campaign_category_lookup_campaign_id_idx on campaign_category_lookup  (cost=0.28..10.99 rows=2 width=4)
                 Index Cond: (campaign_id = scenario_ad_groups.campaign_id)


--

Re: Materialized view performance problems

From: Andreas Kretschmer
Date: 6 January 2016 22:10

> Tom McLoughlin <tom@dynamiccreative.com> wrote on 6 January 2016 at 09:08:
>
> As you can see below it's a big query, and I didn't want to overwhelm
> everyone with the schema, so let me know what bits you might need to help!
>
> Any help improving the performance will be greatly appreciated.

Can you show us the EXPLAIN ANALYSE output? I see a lot of seq scans; maybe
you should create some indexes.
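
For example, something along these lines (the index names are made up here, and the columns are just the join keys from your plan; whether the planner actually uses them depends on how many of each table's rows end up feeding into the result):

-- illustrative index names; columns taken from the Hash Cond lines in the plan
create index concurrently scenario_keyword_vendor_instances_keyword_id_idx
  on scenario_keyword_vendor_instances (keyword_id);

create index concurrently scenario_keywords_ad_group_id_idx
  on scenario_keywords (ad_group_id);

create index concurrently scenario_ad_groups_campaign_id_idx
  on scenario_ad_groups (campaign_id);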


Re: Materialized view performance problems

From: Tom McLoughlin
Thank you very much for your help.

It's difficult for me to run EXPLAIN ANALYSE on the query in question because it takes so long. However, the query below has a similar structure but less data to process.

create materialized view temp_camp_perf_unaggr
as
select
  account_websites.id as website_id,
  account_websites.namespace as website_namespace,
  scenario_campaign_vendor_instances.inventory_disabled as inventory_disabled,
  scenario_campaign_vendor_instances.condition_disabled as condition_disabled,
  scenario_campaign_vendor_instances.manually_disabled as paused,
  scenario_campaigns.id as campaign_id,
  scenario_campaign_performances.*,
  (select campaign_category_lookup.category_id from campaign_category_lookup where campaign_category_lookup.campaign_id = scenario_campaigns.id limit 1) as category_id
from
  scenarios
  inner join account_websites
    on scenarios.website_id = account_websites.id
  inner join scenario_campaigns
    on scenario_campaigns.scenario_id = scenarios.id
  left outer join scenario_campaign_vendor_instances
    on scenario_campaigns.id = scenario_campaign_vendor_instances.campaign_id
  left outer join scenario_campaign_performances
    on scenario_campaign_performances.api_id = scenario_campaign_vendor_instances.api_id
    and scenario_campaign_performances.date >= (date_trunc('month', now()) - '1 month'::interval)::date -- start of previous month
where
  scenarios.deleted_at is null
  and scenario_campaign_performances.campaign_name is not null
  and account_websites.active = 't'; 


Here's its EXPLAIN ANALYSE output:

 Hash Join  (cost=13094.58..3450145.63 rows=373025 width=220) (actual time=87677.770..226340.511 rows=232357 loops=1)
   Hash Cond: (scenario_campaign_performances.api_id = scenario_campaign_vendor_instances.api_id)
   ->  Seq Scan on scenario_campaign_performances  (cost=0.00..325848.93 rows=351341 width=191) (actual time=86942.746..221871.357 rows=230889 loops=1)
         Filter: ((campaign_name IS NOT NULL) AND (date >= ((date_trunc('month'::text, now()) - '1 mon'::interval))::date))
         Rows Removed by Filter: 77185
   ->  Hash  (cost=12250.80..12250.80 rows=67502 width=37) (actual time=709.034..709.034 rows=28545 loops=1)
         Buckets: 8192  Batches: 1  Memory Usage: 1997kB
         ->  Hash Join  (cost=6621.17..12250.80 rows=67502 width=37) (actual time=164.772..690.399 rows=48805 loops=1)
               Hash Cond: (scenario_campaign_vendor_instances.campaign_id = scenario_campaigns.id)
               ->  Seq Scan on scenario_campaign_vendor_instances  (cost=0.00..3817.06 rows=130006 width=15) (actual time=0.049..405.396 rows=149939 loops=1)
               ->  Hash  (cost=5641.32..5641.32 rows=78388 width=26) (actual time=164.647..164.647 rows=49081 loops=1)
                     Buckets: 8192  Batches: 1  Memory Usage: 2839kB
                     ->  Hash Join  (cost=105.59..5641.32 rows=78388 width=26) (actual time=55.543..145.975 rows=49081 loops=1)
                           Hash Cond: (scenario_campaigns.scenario_id = scenarios.id)
                           ->  Seq Scan on scenario_campaigns  (cost=0.00..4185.71 rows=150971 width=8) (actual time=0.024..47.185 rows=150591 loops=1)
                           ->  Hash  (cost=90.56..90.56 rows=1202 width=26) (actual time=55.499..55.499 rows=1428 loops=1)
                                 Buckets: 1024  Batches: 1  Memory Usage: 79kB
                                 ->  Hash Join  (cost=18.49..90.56 rows=1202 width=26) (actual time=48.435..54.931 rows=1428 loops=1)
                                       Hash Cond: (scenarios.website_id = account_websites.id)
                                       ->  Seq Scan on scenarios  (cost=0.00..52.15 rows=2108 width=8) (actual time=0.015..5.723 rows=2052 loops=1)
                                             Filter: (deleted_at IS NULL)
                                             Rows Removed by Filter: 201
                                       ->  Hash  (cost=14.54..14.54 rows=316 width=22) (actual time=48.402..48.402 rows=289 loops=1)
                                             Buckets: 1024  Batches: 1  Memory Usage: 16kB
                                             ->  Seq Scan on account_websites  (cost=0.00..14.54 rows=316 width=22) (actual time=26.373..48.259 rows=289 loops=1)
                                                   Filter: active
                                                   Rows Removed by Filter: 211
   SubPlan 1
     ->  Limit  (cost=0.28..8.30 rows=1 width=4) (actual time=0.014..0.014 rows=0 loops=232357)
           ->  Index Scan using campaign_category_lookup_campaign_id_idx on campaign_category_lookup  (cost=0.28..8.30 rows=1 width=4) (actual time=0.014..0.014 rows=0 loops=232357)
                 Index Cond: (campaign_id = scenario_campaigns.id)
 Total runtime: 228236.708 ms

On 6 January 2016 at 22:10, Andreas Kretschmer <andreas@a-kretschmer.de> wrote:


> Can you show us the EXPLAIN ANALYSE output? I see a lot of seq scans; maybe
> you should create some indexes.

Re: Materialized view performance problems

From: Andreas Kretschmer
Tom McLoughlin <tom@dynamiccreative.com> wrote:

> Thank you very much for your help.
>
> It's difficult for me to run EXPLAIN ANALYSE on the query in question because
> it takes so long. However, the query below has a similar structure but less
> data to process.

Seems okay, but it's better to analyse the real query. You can install the
auto_explain module, enable auto_explain.log_analyze, and set
auto_explain.log_min_duration to a suitable value.

You will then find the EXPLAIN ANALYSE output in the log.
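
For example (a minimal sketch; the threshold is only illustrative, pick one that catches just the slow statements):

load 'auto_explain';                        -- or add auto_explain to shared_preload_libraries
set auto_explain.log_analyze = on;
set auto_explain.log_min_duration = '60s';  -- illustrative threshold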

Consider all steps with seq scans, but as far as I can see the seq scans are
on small tables and the majority of their rows end up in the result, so an
index can't help much.


Andreas
--
Really, I'm not out to destroy Microsoft. That will just be a completely
unintentional side effect.                              (Linus Torvalds)
"If I was god, I would recompile penguin with --enable-fly."   (unknown)
Kaufbach, Saxony, Germany, Europe.              N 51.05082°, E 13.56889°