Thread: Performance issue with NestedLoop query

Performance issue with NestedLoop query

From
Ram N
Date:

Hi,

I am trying to see if I can do anything to optimize the following plan. 

I have two tables and am doing a join between them. After the join, the query calculates aggregates (sum and count).
Table 1 : one timestamp per day for 2 years (730 records)
Table 2 : Window-based validity records. A window here means a start and end timestamp indicating the period of validity for a record. 
Has some 10-odd columns including start_date and end_date.  (1 million records)
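Roughly, the schema looks like this (a sketch only; column names are inferred from the plan output below, and the real types and remaining columns may differ):

```sql
-- Hypothetical DDL reconstructed from the plan; not the actual schema.
create table table1 (
    ts timestamptz              -- one row per day, ~730 rows
);

create table table2 (
    id         bigint,
    serial_no  bigint,
    name       text,
    st         text,            -- grouping column in the query
    a          numeric,         -- summed column in the query
    start_date timestamptz,     -- validity window start
    end_date   timestamptz      -- validity window end
);                              -- ~1 million rows
```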

The machine has 244 GB RAM. Queries are taking more than a minute, and in some cases 2-3 minutes.

Below is the plan I am getting. The nested loop blows up the number of records, and we expect that. I have tried playing around with work_mem and cache configs, which hasn't helped. 

Query
select sum(b.a), count(b.id), a.ts, b.st
from table1 a, table2 b
where a.ts > b.start_date and a.ts < b.end_date
  and a.ts > '2015-01-01 20:50:44.000000 +00:00:00'
  and a.ts < '2015-07-01 19:50:44.000000 +00:00:00'
group by a.ts, b.st
order by a.ts

Plan (EXPLAIN ANALYZE)
"Sort  (cost=10005447874.54..10005447879.07 rows=1810 width=44) (actual time=178883.936..178884.159 rows=1355 loops=1)"
"  Output: (sum(b.a)), (count(b.id)), a.ts, b.st"
"  Sort Key: a.ts"
"  Sort Method: quicksort  Memory: 154kB"
"  Buffers: shared hit=47068722 read=102781"
"  I/O Timings: read=579.946"
"  ->  HashAggregate  (cost=10005447758.51..10005447776.61 rows=1810 width=44) (actual time=178882.874..178883.320 rows=1355 loops=1)"
"        Output: sum(b.a), count(b.id), a.ts, b.st"
"        Group Key: a.ts, b.st"
"        Buffers: shared hit=47068719 read=102781"
"        I/O Timings: read=579.946"
"        ->  Nested Loop  (cost=10000000000.43..10004821800.38 rows=62595813 width=44) (actual time=0.167..139484.854 rows=73112419 loops=1)"
"              Output: a.ts, b.st, b.a, b.id"
"              Buffers: shared hit=47068719 read=102781"
"              I/O Timings: read=579.946"
"              ->  Seq Scan on public.table1 a  (cost=0.00..14.81 rows=181 width=8) (actual time=0.058..0.563 rows=181 loops=1)"
"                    Output: a.ts"
"                    Filter: ((a.ts > '2015-01-01 20:50:44+00'::timestamp with time zone) AND (a.ts < '2015-07-01 19:50:44+00'::timestamp with time zone))"
"                    Rows Removed by Filter: 540"
"                    Buffers: shared read=4"
"                    I/O Timings: read=0.061"
"              ->  Index Scan using end_date_idx on public.table2 b  (cost=0.43..23181.37 rows=345833 width=52) (actual time=0.063..622.274 rows=403936 loops=181)"
"                    Output: b.serial_no, b.name, b.st, b.end_date, b.a, b.start_date"
"                    Index Cond: (a.ts < b.end_date)"
"                    Filter: (a.ts > b.start_date)"
"                    Rows Removed by Filter: 392642"
"                    Buffers: shared hit=47068719 read=102777"
"                    I/O Timings: read=579.885"
"Planning time: 0.198 ms"
"Execution time: 178884.467 ms"

Any pointers on how to go about optimizing this? 

--yr

Re: Performance issue with NestedLoop query

From
Qingqing Zhou
Date:
On Thu, Jul 30, 2015 at 12:51 AM, Ram N <yramiyer@gmail.com> wrote:
> "              ->  Index Scan using end_date_idx on public.table2 b
> (cost=0.43..23181.37 rows=345833 width=52) (actual time=0.063..622.274
> rows=403936 loops=181)"
> "                    Output: b.serial_no, b.name, b.st, b.end_date, b.a,
> b.start_date"
> "                    Index Cond: (a.ts < b.end_date)"
> "                    Filter: (a.ts > b.start_date)"
> "                    Rows Removed by Filter: 392642"

In your case, do you have index built for both b.end_date and
b.start_date? If so, can you try

set enable_indexscan = off

to see if bitmap heap scan helps?
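A minimal session sketch of that experiment (the GUC that disables plain index scans is enable_indexscan; enable_bitmapscan stays on, so the planner is free to pick a bitmap index/heap scan instead; the query is copied from the original message):

```sql
-- Disable plain index scans for this session only:
set enable_indexscan = off;

explain (analyze, buffers)
select sum(b.a), count(b.id), a.ts, b.st
from table1 a, table2 b
where a.ts > b.start_date and a.ts < b.end_date
  and a.ts > '2015-01-01 20:50:44+00'
  and a.ts < '2015-07-01 19:50:44+00'
group by a.ts, b.st
order by a.ts;

-- Restore the default afterwards:
reset enable_indexscan;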

Regards,
Qingqing


Re: Performance issue with NestedLoop query

From
Ram N
Date:

Thanks Qingqing for responding. That didn't help; in fact it increased the scan time. It looks like a lot of time is being spent on the nested-loop join rather than on index lookups, though I am not sure how to optimize the join. I am assuming it's an in-memory join, so I am not sure why it should take so long. Increasing work_mem has helped reduce the processing time, but it's still > 1 min.

--yr


Re: Performance issue with NestedLoop query

From
Matheus de Oliveira
Date:

On Thu, Jul 30, 2015 at 4:51 AM, Ram N <yramiyer@gmail.com> wrote:
select sum(a), count(id), a.ts, st from table1 a, table2 b where a.ts > b.start_date and a.ts < b.end_date and a.ts > '2015-01-01 20:50:44.000000 +00:00:00' and a.ts < '2015-07-01 19:50:44.000000 +00:00:00' group by a.ts, st order by a.ts

You could try to use a range type:

    CREATE INDEX ON table2 USING gin (tstzrange(start_date, end_date, '()'));

Then:

    select sum(a), count(id), a.ts, st
    from table1 a, table2 b
    where tstzrange(b.start_date, b.end_date, '()') @> a.ts
        and a.ts > '2015-01-01 20:50:44.000000 +00:00:00'
        and a.ts < '2015-07-01 19:50:44.000000 +00:00:00'
    group by a.ts, st
    order by a.ts


Regards,
--
Matheus de Oliveira


Re: Performance issue with NestedLoop query

From
Matheus de Oliveira
Date:

On Fri, Jul 31, 2015 at 3:06 PM, Matheus de Oliveira <matioli.matheus@gmail.com> wrote:
    CREATE INDEX ON table2 USING gin (tstzrange(start_date, end_date, '()'));

The index should be USING GIST, not GIN. Sorry.
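That is, the corrected statement (range types ship with built-in GiST support via range_ops, so no extension is needed):

```sql
create index on table2 using gist (tstzrange(start_date, end_date, '()'));
```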


--
Matheus de Oliveira


Re: Performance issue with NestedLoop query

From
Qingqing Zhou
Date:
On Fri, Jul 31, 2015 at 10:55 AM, Ram N <yramiyer@gmail.com> wrote:
>
> Thanks Qingqing for responding. That didn't help. It in fact increased the
> scan time. Looks like a lot of time is being spent on the NestedLoop Join
> than index lookups though I am not sure how to optimize the join.
>

Good news is that the optimizer is right this time :-). The NLJ here does
almost nothing but schedule each outer row to probe the inner index,
so the index seek is the major cost.

Have you tried building a two-column index on (b.start_date, b.end_date)?
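A sketch of that index (assuming both columns are timestamptz; the index name is made up):

```sql
create index table2_start_end_idx on table2 (start_date, end_date);
```

Note that a btree on two range endpoints can only seek efficiently on the leading column; the second column is filtered within matching leading-column entries, which is why such indexes often help only partially for interval-containment joins.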

Regards,
Qingqing


Re: Performance issue with NestedLoop query

From
Ram N
Date:

Thanks much for responding, guys. I have tried both building multi-column indexes and GiST, with no improvement. I have reduced the window from 180 days to 30 days, and below are the numbers:

Composite index - takes 30 secs

With Btree indexing - takes 9 secs

With GIST - takes >30 secs, with a kind of Materialize plan in explain

Any other ideas for window-based joins? 

--yr



Re: Performance issue with NestedLoop query

From
Qingqing Zhou
Date:

From this query:

select sum(a), count(id), a.ts, st
from table1 a, table2 b
where a.ts > b.start_date and a.ts < b.end_date
  and a.ts > '2015-01-01 20:50:44.000000 +00:00:00'
  and a.ts < '2015-07-01 19:50:44.000000 +00:00:00'
group by a.ts, st
order by a.ts

We can actually derive that b.start_date < '2015-07-01 19:50:44.000000
+00:00:00' and b.end_date > '2015-01-01 20:50:44.000000 +00:00:00'. If
we add these two predicates to the original query, does it help?
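Concretely, the rewritten query would look something like this (the derived bounds follow by transitivity: b.start_date < a.ts < '2015-07-01 ...' and b.end_date > a.ts > '2015-01-01 ...'):

```sql
select sum(b.a), count(b.id), a.ts, b.st
from table1 a, table2 b
where a.ts > b.start_date and a.ts < b.end_date
  and a.ts > '2015-01-01 20:50:44+00'
  and a.ts < '2015-07-01 19:50:44+00'
  -- Derived predicates: logically redundant, but they give the planner
  -- absolute bounds on table2's columns, which may enable an index range scan:
  and b.start_date < '2015-07-01 19:50:44+00'
  and b.end_date   > '2015-01-01 20:50:44+00'
group by a.ts, b.st
order by a.ts;
```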

Thanks,
Qingqing