Thread: query performance, thought it was timestamps, maybe just table size?

query performance, thought it was timestamps, maybe just table size?

From
Henry Drexler
Date:
Hello, and thank you in advance.


Beyond the date vs. timestamp troubleshooting I did, I am not sure what else to look for. I know the increase in rows will have some effect, but I just don't think the query should go from 4 minutes to over 50.

System:

laptop - Ubuntu 12.04 LTS

"PostgreSQL 9.2.1 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit"



Summary:
    I have two tables.  I run a function called massive_expansion that takes a customer_id and the_range; it uses these to query the table massive for all rows for that customer_id whose dateof falls in the_range, and inserts them into another table.

Problem:
    The table massive is now about 65 million rows (about double what it was). The query now takes 50 minutes; it used to take 4 minutes when the data was half the size.

History:
    When I first started out I had everything as date, and the query went quite quickly: 4 minutes when the table massive was about 30 million rows.
    I then re-did everything, but this time instead of date I used the actual timestamps (as a result there were a few more rows, but only 2 million more). The query still went quickly: 5 minutes when the table massive was about 32 million rows.
    So I did not see an appreciable difference between the data having date vs. timestamp (all other things kept the same), and I suppose I can rule out date vs. timestamp as the issue.




The tsrange is always an interval of 7 days.

I have two tables:

First table:
    CREATE TABLE massive
    (
      source character varying,
      dateof timestamp without time zone,
      customer_id bigint,
      count_raw bigint
    );

    CREATE INDEX customer_id_dateof
      ON massive
      USING btree
      (customer_id, dateof);

Second table:
    CREATE TABLE critical_visitors
    (
      customer_id bigint,
      dateof timestamp without time zone,
      the_range tsrange
    );

    CREATE INDEX customer_id_range
      ON critical_visitors
      USING btree
      (customer_id, the_range);

The function:
    CREATE OR REPLACE FUNCTION massive_expansion(customer_id bigint, the_range tsrange) RETURNS void AS $$
        BEGIN
       
        INSERT INTO massive_expansion_from_critical_visitors
        (
        select
        massive.source,
        massive.dateof,
        massive.customer_id,
        massive.count_raw
        from
        massive
        where
        massive.customer_id = $1
        and
        massive.dateof <@ the_range) ;

        END;
    $$ LANGUAGE plpgsql;

The query:
    select
    massive_expansion(customer_id,the_range)
    from
    critical_visitors;



Additional Detail:
I ran the following query against the 30 million row table:
    select
    massive.source,
    massive.dateof,
    massive.customer_id,
    massive.count_raw
    from
    massive
    where
    massive.customer_id = '<a customer_id goes here>'::bigint--$1
    and
    massive.dateof <@ '(2012-07-22 17:00:00,2012-07-29 17:00:00]'::tsrange;

With a query plan of:
    "Index Scan using customer_id_sourcee on massive_m  (cost=0.00..113.98 rows=1 width=28)"
    "  Index Cond: (customer_id = 9167174563::bigint)"
    "  Filter: (dateof <@ '("2012-07-22 17:00:00","2012-07-29 17:00:00"]'::tsrange)"


Then I ran the same query against the 65 million row table and got a plan of:
    "Index Scan using customer_id_source on massive  (cost=0.00..189.38 rows=1 width=28)"
    "  Index Cond: (customer_id = 9167174563::bigint)"
    "  Filter: (dateof <@ '("2012-07-22 17:00:00","2012-07-29 17:00:00"]'::tsrange)"

Re: query performance, thought it was timestamps, maybe just table size?

From
Henry Drexler
Date:



On Fri, Nov 30, 2012 at 8:22 AM, Henry Drexler <alonup8tb@gmail.com> wrote:
Hello, and thank you in advance.


Beyond the date vs timestamp troubleshooting I did,

I realize this could be confusing. Since I ruled out that difference, the real question is: given this setup, why would the query time go from 4 minutes to over 50 for an increase in table rows from 30 million to 65 million?

Re: query performance, thought it was timestamps, maybe just table size?

From
Jeff Janes
Date:
On Fri, Nov 30, 2012 at 5:22 AM, Henry Drexler <alonup8tb@gmail.com> wrote:
> Hello, and thank you in advance.
>
>
> Beyond the date vs timestamp troubleshooting I did, I am not sure what else
> to look for, I know the increase of rows will have some affect but I just
> don't think the query should go from 4 minutes to over 50.

If the doubling of the size causes it to exceed the cache, when before
it did not, that could easily explain it.
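As a rough check of that hypothesis (a sketch, not part of the original exchange; it assumes the table and index names shown earlier), the on-disk size of the table and its index can be compared against shared_buffers and the machine's RAM:

    -- Size of the table including its indexes, and of the (customer_id, dateof) index alone.
    SELECT pg_size_pretty(pg_total_relation_size('massive'))      AS massive_with_indexes,
           pg_size_pretty(pg_relation_size('customer_id_dateof')) AS customer_id_dateof_index;

    -- How much PostgreSQL itself caches; the OS page cache sits on top of this.
    SHOW shared_buffers;

If those sizes are well beyond what fits in memory, each index probe is likely to hit disk rather than cache.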

...
>     and
>     massive.dateof <@ '(2012-07-22 17:00:00,2012-07-29 17:00:00]'::tsrange;

I don't think the <@ can use the btree index, but if you wrote it as a
"BETWEEN" it could.


> With a query plan of:
>     "Index Scan using customer_id_sourcee on massive_m  (cost=0.00..113.98
> rows=1 width=28)"

Can you report the EXPLAIN (ANALYZE, BUFFERS) instead?

Cheers,

Jeff


Re: query performance, thought it was timestamps, maybe just table size?

From
Henry Drexler
Date:
On Fri, Nov 30, 2012 at 1:42 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
Can you report the EXPLAIN (ANALYZE, BUFFERS) instead?

Thanks, here they are:

for the approx 65 million row approx 50 min version:

EXPLAIN (ANALYZE, BUFFERS)
select
massive_expansion(ctn,the_range)
from
critical_visitors;

"Seq Scan on critical_visitors  (cost=0.00..168722.28 rows=628778 width=40) (actual time=0.655..3003921.066 rows=628778 loops=1)"
"  Buffers: shared hit=4513040 read=1591722 dirtied=5234 written=10"
"Total runtime: 3004478.053 ms"


for the approx 30 million row approx 4 min version:

EXPLAIN (ANALYZE, BUFFERS)
select
massive_expansion(ctn,the_range)
from
critical_visitors;

"Seq Scan on critical_visitors  (cost=0.00..746587.90 rows=2782315 width=40) (actual time=393.001..277108.379 rows=2782315 loops=1)"
"  Buffers: shared hit=26370078 read=400301 dirtied=33772 written=1030"
"Total runtime: 278988.544 ms"

Re: query performance, thought it was timestamps, maybe just table size?

From
Jeff Janes
Date:
On Fri, Nov 30, 2012 at 12:22 PM, Henry Drexler <alonup8tb@gmail.com> wrote:
> On Fri, Nov 30, 2012 at 1:42 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>
>> Can you report the EXPLAIN (ANALYZE, BUFFERS) instead?
>
>
> Thanks, here they are:
>
> for the approx 65 million row approx 50 min version:
>
> EXPLAIN (ANALYZE, BUFFERS)
> select
> massive_expansion(ctn,the_range)
> from
> critical_visitors;
>
> "Seq Scan on critical_visitors  (cost=0.00..168722.28 rows=628778 width=40)
> (actual time=0.655..3003921.066 rows=628778 loops=1)"
> "  Buffers: shared hit=4513040 read=1591722 dirtied=5234 written=10"
> "Total runtime: 3004478.053 ms"
>
>
> for the approx 30 million row approx 4 min version:
>
> EXPLAIN (ANALYZE, BUFFERS)
> select
> massive_expansion(ctn,the_range)
> from
> critical_visitors;
>
> "Seq Scan on critical_visitors  (cost=0.00..746587.90 rows=2782315 width=40)
> (actual time=393.001..277108.379 rows=2782315 loops=1)"
> "  Buffers: shared hit=26370078 read=400301 dirtied=33772 written=1030"
> "Total runtime: 278988.544 ms"
>

I can't make much sense out of those.  Could you do it for the recursive
SQL (the one inside the function) like you had previously done for the
regular explain?

Cheers,

Jeff


Re: query performance, thought it was timestamps, maybe just table size?

From
Henry Drexler
Date:
On Sun, Dec 2, 2012 at 12:44 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
Could you do it for the recursive
SQL (the one inside the function) like you had previously done for the
regular explain?

Cheers,

Jeff

Here they are:

for the 65 million row table:
"Index Scan using ctn_source on massive  (cost=0.00..189.38 rows=1 width=28) (actual time=85.802..85.806 rows=1 loops=1)"
"  Index Cond: (customer_id = <some customer id>::bigint)"
"  Filter: (dateof <@ '["2012-07-03 14:00:00","2012-07-10 14:00:00"]'::tsrange)"
"  Buffers: shared read=6"
"Total runtime: 85.891 ms"



for the 30 million row table:
"Index Scan using ctn_dateof on massive  (cost=0.00..80.24 rows=1 width=24) (actual time=0.018..0.020 rows=1 loops=1)"
"  Index Cond: (customer_id = <some customer id>::bigint)"
"  Filter: (dateof <@ '[2012-07-03,2012-07-11)'::
daterange)"
"  Buffers: shared hit=5"
"Total runtime: 0.046 ms"


Thank you.



Re: query performance, thought it was timestamps, maybe just table size?

From
Jeff Janes
Date:
On Mon, Dec 3, 2012 at 5:56 AM, Henry Drexler <alonup8tb@gmail.com> wrote:
> On Sun, Dec 2, 2012 at 12:44 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>
>> Could you do it for the recursive
>> SQL (the one inside the function) like you had previously done for the
>> regular explain?
>>
>> Cheers,
>>
>> Jeff
>
>
> Here they are:
>
> for the 65 million row table:
> "Index Scan using ctn_source on massive  (cost=0.00..189.38 rows=1 width=28)
> (actual time=85.802..85.806 rows=1 loops=1)"
> "  Index Cond: (ctn = 1302050134::bigint)"
> "  Filter: (dateof <@ '["2012-07-03 14:00:00","2012-07-10
> 14:00:00"]'::tsrange)"
> "  Buffers: shared read=6"
> "Total runtime: 85.891 ms"

If you execute it repeatedly (so that the data is in buffers the next
time), does it then get faster?
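For instance (a sketch, not from the original thread; the customer id is a placeholder), running the statement twice in the same session and comparing the Buffers lines would show whether caching is the difference:

    -- On the second run the plan's "Buffers:" line should show "shared hit"
    -- rather than "shared read" if the pages are still cached.
    EXPLAIN (ANALYZE, BUFFERS)
    select massive.source, massive.dateof, massive.customer_id, massive.count_raw
    from massive
    where massive.customer_id = '<some customer id>'::bigint
    and massive.dateof <@ '["2012-07-03 14:00:00","2012-07-10 14:00:00"]'::tsrange;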

> for the 30 million row table:
> "Index Scan using ctn_dateof on massive  (cost=0.00..80.24 rows=1 width=24)
> (actual time=0.018..0.020 rows=1 loops=1)"
> "  Index Cond: (ctn = 1302050134::bigint)"
> "  Filter: (dateof <@ '[2012-07-03,2012-07-11)'::daterange)"
> "  Buffers: shared hit=5"
> "Total runtime: 0.046 ms"

The obvious difference is that this one finds all 5 buffers it needs
in buffers already, while the first one had to read them in.  So this
supports the idea that your data has simply grown too large for your
RAM.

Cheers,

Jeff


Re: query performance, thought it was timestamps, maybe just table size?

From
Henry Drexler
Date:


On Sun, Dec 9, 2012 at 7:16 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
The obvious difference is that this one finds all 5 buffers it needs
in buffers already, while the first one had to read them in.  So this
supports the idea that your data has simply grown too large for your
RAM.

Cheers,

Jeff


Jeff, thanks for that explanation and for taking the time to introduce me to EXPLAIN ANALYZE.  I am currently reading through the docs so I can use it and understand the output.

Thank you again for all of your help.